* [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
@ 2015-12-30  2:37 Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 01/25] docs: add colo readme Wen Congyang
                   ` (24 more replies)
  0 siblings, 25 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

This patchset implements the COLO feature for Xen.
For details, installation, and usage of the COLO feature, refer to:
  http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping

This patchset is based on:
1. http://lists.xenproject.org/archives/html/xen-devel/2015-12/msg02881.html
2. http://lists.xenproject.org/archives/html/xen-devel/2015-12/msg02884.html

Changelog from v8 to v9:
1. Rebased to the upstream xen
2. Fixed some bugs found in testing

Changelog from v7 to v8:
1. Rebased to the latest libxl migration v2.

Changelog from v6 to v7:
1. Ported to Libxl migration v2
2. Send dirty bitmap from secondary to primary on libxc side
3. Address review comments

Changelog from v5 to v6:
1. Based on migration v2 (libxc)
2. Split the patchset into a prerequisite patchset and this main patchset.

Changelog from v4 to v5:
1. Rebased to the latest xen upstream
2. Disk replication: blktap2->qdisk
3. NIC replication: colo-agent->colo-proxy

Changelog from v3 to v4:
1. Rebased to the newest xen
2. Bug fix

Changelog from v2 to v3:
1. Rebased to the newest remus
2. Added nic replication support

Changelog from v1 to v2:
1. Rebased to the newest remus
2. Added disk replication support

Wen Congyang (25):
  docs: add colo readme
  docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo
    streams
  libxc/migration: Specification update for DIRTY_PFN_LIST records
  libxc/migration: export read_record for common use
  tools/libxl: add back channel support to write stream
  tools/libxl: write checkpoint_state records into the stream
  tools/libxl: add back channel support to read stream
  tools/libxl: handle checkpoint_state records in a libxl migration v2
    read stream
  tools/libx{l,c}: introduce should_checkpoint callback
  tools/libx{l,c}: add postcopy/suspend callback to restore side
  secondary vm suspend/resume/checkpoint code
  primary vm suspend/resume/checkpoint code
  libxc/restore: support COLO restore
  libxc/restore: send dirty pfn list to primary when checkpoint under
    colo
  send store gfn and console gfn to xl before resuming secondary vm
  libxc/save: support COLO save
  implement the cmdline for COLO
  Support colo mode for qemu disk
  COLO: use qemu block replication
  COLO proxy: implement setup/teardown of COLO proxy module
  COLO proxy: preresume, postresume and checkpoint
  COLO nic: implement COLO nic subkind
  setup and control colo proxy on primary side
  setup and control colo proxy on secondary side
  cmdline switches and config vars to control colo-proxy

 docs/README.colo                         |    9 +
 docs/man/xl.conf.pod.5                   |    6 +
 docs/man/xl.pod.1                        |   11 +-
 docs/misc/xl-disk-configuration.txt      |   50 ++
 docs/specs/libxc-migration-stream.pandoc |   24 +-
 docs/specs/libxl-migration-stream.pandoc |   25 +-
 tools/hotplug/Linux/Makefile             |    1 +
 tools/hotplug/Linux/colo-proxy-setup     |  135 ++++
 tools/libxc/include/xenguest.h           |   36 +
 tools/libxc/xc_sr_common.c               |   50 ++
 tools/libxc/xc_sr_common.h               |   26 +-
 tools/libxc/xc_sr_restore.c              |  244 +++++--
 tools/libxc/xc_sr_save.c                 |  102 ++-
 tools/libxc/xc_sr_stream_format.h        |    1 +
 tools/libxl/Makefile                     |    4 +
 tools/libxl/libxl.c                      |   97 ++-
 tools/libxl/libxl_colo.h                 |   40 ++
 tools/libxl/libxl_colo_nic.c             |  321 +++++++++
 tools/libxl/libxl_colo_proxy.c           |  292 ++++++++
 tools/libxl/libxl_colo_qdisk.c           |  262 ++++++++
 tools/libxl/libxl_colo_restore.c         | 1085 ++++++++++++++++++++++++++++++
 tools/libxl/libxl_colo_save.c            |  701 +++++++++++++++++++
 tools/libxl/libxl_create.c               |   79 ++-
 tools/libxl/libxl_device.c               |   54 ++
 tools/libxl/libxl_dm.c                   |  184 ++++-
 tools/libxl/libxl_dom_save.c             |   14 +-
 tools/libxl/libxl_internal.h             |  234 +++++--
 tools/libxl/libxl_qmp.c                  |   93 +++
 tools/libxl/libxl_save_callout.c         |    7 +-
 tools/libxl/libxl_save_msgs_gen.pl       |   13 +-
 tools/libxl/libxl_sr_stream_format.h     |   11 +
 tools/libxl/libxl_stream_read.c          |   96 ++-
 tools/libxl/libxl_stream_write.c         |   86 ++-
 tools/libxl/libxl_types.idl              |   10 +
 tools/libxl/libxlu_disk_l.l              |    7 +
 tools/libxl/xl.c                         |    3 +
 tools/libxl/xl.h                         |    1 +
 tools/libxl/xl_cmdimpl.c                 |   99 ++-
 tools/libxl/xl_cmdtable.c                |    4 +-
 tools/python/xen/migration/libxc.py      |    8 +
 tools/python/xen/migration/libxl.py      |    9 +
 41 files changed, 4340 insertions(+), 194 deletions(-)
 create mode 100644 docs/README.colo
 create mode 100755 tools/hotplug/Linux/colo-proxy-setup
 create mode 100644 tools/libxl/libxl_colo.h
 create mode 100644 tools/libxl/libxl_colo_nic.c
 create mode 100644 tools/libxl/libxl_colo_proxy.c
 create mode 100644 tools/libxl/libxl_colo_qdisk.c
 create mode 100644 tools/libxl/libxl_colo_restore.c
 create mode 100644 tools/libxl/libxl_colo_save.c

-- 
2.5.0


* [PATCH v9 01/25] docs: add colo readme
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams Wen Congyang
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Add the COLO readme. For details, refer to
http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 docs/README.colo | 9 +++++++++
 1 file changed, 9 insertions(+)
 create mode 100644 docs/README.colo

diff --git a/docs/README.colo b/docs/README.colo
new file mode 100644
index 0000000..466eb72
--- /dev/null
+++ b/docs/README.colo
@@ -0,0 +1,9 @@
+COLO FT/HA (COarse-grain LOck-stepping Virtual Machines for Non-stop Service)
+project is a high availability solution. Both primary VM (PVM) and secondary VM
+(SVM) run in parallel. They receive the same request from client, and generate
+response in parallel too. If the response packets from PVM and SVM are
+identical, they are released immediately. Otherwise, a VM checkpoint (on demand)
+is conducted.
+
+See the website at http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping
+for details.
-- 
2.5.0


* [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 01/25] docs: add colo readme Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2016-01-26 20:40   ` Konrad Rzeszutek Wilk
  2015-12-30  2:37 ` [PATCH v9 03/25] libxc/migration: Specification update for DIRTY_PFN_LIST records Wen Congyang
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Add the CHECKPOINT_STATE record, which is the negotiation record for COLO.
Its control_id field takes the following values:

Primary->Secondary:
                0x00000000: Secondary VM is out of sync, start a new checkpoint
Secondary->Primary:
                0x00000001: Secondary VM is suspended
                0x00000002: Secondary VM is ready
                0x00000003: Secondary VM is resumed

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
---
 docs/specs/libxl-migration-stream.pandoc | 25 +++++++++++++++++++++++--
 tools/libxl/libxl_sr_stream_format.h     | 11 +++++++++++
 tools/python/xen/migration/libxl.py      |  9 +++++++++
 3 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
index 2c97d86..5166d66 100644
--- a/docs/specs/libxl-migration-stream.pandoc
+++ b/docs/specs/libxl-migration-stream.pandoc
@@ -1,6 +1,6 @@
 % LibXenLight Domain Image Format
 % Andrew Cooper <<andrew.cooper3@citrix.com>>
-% Revision 1
+% Revision 2
 
 Introduction
 ============
@@ -119,7 +119,9 @@ type         0x00000000: END
 
              0x00000004: CHECKPOINT_END
 
-             0x00000005 - 0x7FFFFFFF: Reserved for future _mandatory_
+             0x00000005: CHECKPOINT_STATE
+
+             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_
              records.
 
              0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
@@ -249,6 +251,25 @@ A checkpoint end record marks the end of a checkpoint in the image.
 The end record contains no fields; its body_length is 0.
 
 
+CHECKPOINT\_STATE
+--------------
+
+A checkpoint state record contains the control information for checkpoint.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | control_id             | padding                |
+    +------------------------+------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+------------     ---------------------------------------------------
+control_id       0x00000000: Secondary VM is out of sync, start a new checkpoint
+                 0x00000001: Secondary VM is suspended
+                 0x00000002: Secondary VM is ready
+                 0x00000003: Secondary VM is resumed
+--------------------------------------------------------------------
+
 Future Extensions
 =================
 
diff --git a/tools/libxl/libxl_sr_stream_format.h b/tools/libxl/libxl_sr_stream_format.h
index 54da360..75f5190 100644
--- a/tools/libxl/libxl_sr_stream_format.h
+++ b/tools/libxl/libxl_sr_stream_format.h
@@ -36,6 +36,7 @@ typedef struct libxl__sr_rec_hdr
 #define REC_TYPE_EMULATOR_XENSTORE_DATA 0x00000002U
 #define REC_TYPE_EMULATOR_CONTEXT       0x00000003U
 #define REC_TYPE_CHECKPOINT_END         0x00000004U
+#define REC_TYPE_CHECKPOINT_STATE       0x00000005U
 
 typedef struct libxl__sr_emulator_hdr
 {
@@ -47,6 +48,16 @@ typedef struct libxl__sr_emulator_hdr
 #define EMULATOR_QEMU_TRADITIONAL    0x00000001U
 #define EMULATOR_QEMU_UPSTREAM       0x00000002U
 
+typedef struct libxl_sr_checkpoint_state
+{
+    uint32_t id;
+} libxl_sr_checkpoint_state;
+
+#define CHECKPOINT_NEW               0x00000000U
+#define CHECKPOINT_SVM_SUSPENDED     0x00000001U
+#define CHECKPOINT_SVM_READY         0x00000002U
+#define CHECKPOINT_SVM_RESUMED       0x00000003U
+
 #endif /* LIBXL__SR_STREAM_FORMAT_H */
 
 /*
diff --git a/tools/python/xen/migration/libxl.py b/tools/python/xen/migration/libxl.py
index fc0acf6..d5f54dc 100644
--- a/tools/python/xen/migration/libxl.py
+++ b/tools/python/xen/migration/libxl.py
@@ -37,6 +37,7 @@ REC_TYPE_libxc_context          = 0x00000001
 REC_TYPE_emulator_xenstore_data = 0x00000002
 REC_TYPE_emulator_context       = 0x00000003
 REC_TYPE_checkpoint_end         = 0x00000004
+REC_TYPE_checkpoint_state       = 0x00000005
 
 rec_type_to_str = {
     REC_TYPE_end                    : "End",
@@ -44,6 +45,7 @@ rec_type_to_str = {
     REC_TYPE_emulator_xenstore_data : "Emulator xenstore data",
     REC_TYPE_emulator_context       : "Emulator context",
     REC_TYPE_checkpoint_end         : "Checkpoint end",
+    REC_TYPE_checkpoint_state       : "Checkpoint state"
 }
 
 # emulator_* header
@@ -212,6 +214,11 @@ class VerifyLibxl(VerifyBase):
         if len(content) != 0:
             raise RecordError("Checkpoint end record with non-zero length")
 
+    def verify_record_checkpoint_state(self, content):
+        """ Checkpoint state """
+        if len(content) == 0:
+            raise RecordError("Checkpoint state record with zero length")
+
 
 record_verifiers = {
     REC_TYPE_end:
@@ -224,4 +231,6 @@ record_verifiers = {
         VerifyLibxl.verify_record_emulator_context,
     REC_TYPE_checkpoint_end:
         VerifyLibxl.verify_record_checkpoint_end,
+    REC_TYPE_checkpoint_state:
+        VerifyLibxl.verify_record_checkpoint_state,
 }
-- 
2.5.0
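
The CHECKPOINT_STATE body described in the spec hunk above can be sketched in Python, in the style of the tools/python/xen/migration verifiers. This is only an illustration, not part of the patch: the helper names pack_checkpoint_state() and unpack_checkpoint_state() are hypothetical, and little-endian byte order is an assumption. The control_id values and the 8-octet layout (control_id followed by padding) are taken from the table above.

```python
import struct

# control_id values, as listed in the CHECKPOINT_STATE table of this patch.
CHECKPOINT_NEW = 0x00000000
CHECKPOINT_SVM_SUSPENDED = 0x00000001
CHECKPOINT_SVM_READY = 0x00000002
CHECKPOINT_SVM_RESUMED = 0x00000003

CONTROL_ID_TO_STR = {
    CHECKPOINT_NEW: "Secondary VM is out of sync, start a new checkpoint",
    CHECKPOINT_SVM_SUSPENDED: "Secondary VM is suspended",
    CHECKPOINT_SVM_READY: "Secondary VM is ready",
    CHECKPOINT_SVM_RESUMED: "Secondary VM is resumed",
}

# Assumption: little-endian; 32-bit control_id followed by 32 bits of padding.
BODY_FORMAT = "<II"


def pack_checkpoint_state(control_id):
    """Build the 8-octet record body for a known control_id (hypothetical helper)."""
    if control_id not in CONTROL_ID_TO_STR:
        raise ValueError("Unknown control_id 0x%08x" % control_id)
    return struct.pack(BODY_FORMAT, control_id, 0)


def unpack_checkpoint_state(body):
    """Return the control_id from a record body, rejecting bad sizes and ids."""
    if len(body) != struct.calcsize(BODY_FORMAT):
        raise ValueError("Checkpoint state body has length %d" % len(body))
    control_id, _padding = struct.unpack(BODY_FORMAT, body)
    if control_id not in CONTROL_ID_TO_STR:
        raise ValueError("Unknown control_id 0x%08x" % control_id)
    return control_id
```

A verifier built this way would reject a wrong-sized body as well as an empty one, which is stricter than the zero-length check in verify_record_checkpoint_state().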


* [PATCH v9 03/25] libxc/migration: Specification update for DIRTY_PFN_LIST records
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 01/25] docs: add colo readme Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2016-01-26 20:44   ` Konrad Rzeszutek Wilk
  2015-12-30  2:37 ` [PATCH v9 04/25] libxc/migration: export read_record for common use Wen Congyang
                   ` (21 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Used by the secondary to send its dirty bitmap to the primary under COLO.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 docs/specs/libxc-migration-stream.pandoc | 24 +++++++++++++++++++++++-
 tools/libxc/xc_sr_common.c               |  1 +
 tools/libxc/xc_sr_stream_format.h        |  1 +
 tools/python/xen/migration/libxc.py      |  8 ++++++++
 4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
index 8cd678f..ae1f1d0 100644
--- a/docs/specs/libxc-migration-stream.pandoc
+++ b/docs/specs/libxc-migration-stream.pandoc
@@ -227,7 +227,9 @@ type         0x00000000: END
 
              0x0000000E: CHECKPOINT
 
-             0x0000000F - 0x7FFFFFFF: Reserved for future _mandatory_
+             0x0000000F: DIRTY_PFN_LIST
+
+             0x00000010 - 0x7FFFFFFF: Reserved for future _mandatory_
              records.
 
              0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
@@ -599,6 +601,26 @@ CHECKPOINT record or an END record.
 
 \clearpage
 
+DIRTY_PFN_LIST
+------------
+
+A dirty pfn list record is used to convey information about dirty memory
+in the VM. It is an unordered list of PFNs. Currently only applicable in
+the backchannel of a checkpointed stream.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | pfn[0]                                          |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | pfn[C-1]                                        |
+    +-------------------------------------------------+
+
+The count of pfns is: record->length/sizeof(uint64_t).
+
+\clearpage
+
 Layout
 ======
 
diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index 945cfa6..8150140 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -35,6 +35,7 @@ static const char *mandatory_rec_types[] =
     [REC_TYPE_X86_PV_VCPU_MSRS]     = "x86 PV vcpu msrs",
     [REC_TYPE_VERIFY]               = "Verify",
     [REC_TYPE_CHECKPOINT]           = "Checkpoint",
+    [REC_TYPE_DIRTY_PFN_LIST]       = "Dirty pfn list",
 };
 
 const char *rec_type_to_str(uint32_t type)
diff --git a/tools/libxc/xc_sr_stream_format.h b/tools/libxc/xc_sr_stream_format.h
index 6d0f8fd..8b8533f 100644
--- a/tools/libxc/xc_sr_stream_format.h
+++ b/tools/libxc/xc_sr_stream_format.h
@@ -75,6 +75,7 @@ struct xc_sr_rhdr
 #define REC_TYPE_X86_PV_VCPU_MSRS     0x0000000cU
 #define REC_TYPE_VERIFY               0x0000000dU
 #define REC_TYPE_CHECKPOINT           0x0000000eU
+#define REC_TYPE_DIRTY_PFN_LIST       0x0000000fU
 
 #define REC_TYPE_OPTIONAL             0x80000000U
 
diff --git a/tools/python/xen/migration/libxc.py b/tools/python/xen/migration/libxc.py
index b0255ac..47da5e3 100644
--- a/tools/python/xen/migration/libxc.py
+++ b/tools/python/xen/migration/libxc.py
@@ -60,6 +60,7 @@ REC_TYPE_toolstack            = 0x0000000b
 REC_TYPE_x86_pv_vcpu_msrs     = 0x0000000c
 REC_TYPE_verify               = 0x0000000d
 REC_TYPE_checkpoint           = 0x0000000e
+REC_TYPE_dirty_pfn_list       = 0x0000000f
 
 rec_type_to_str = {
     REC_TYPE_end                  : "End",
@@ -77,6 +78,7 @@ rec_type_to_str = {
     REC_TYPE_x86_pv_vcpu_msrs     : "x86 PV vcpu msrs",
     REC_TYPE_verify               : "Verify",
     REC_TYPE_checkpoint           : "Checkpoint",
+    REC_TYPE_dirty_pfn_list       : "Dirty pfn list"
 }
 
 # page_data
@@ -403,6 +405,10 @@ class VerifyLibxc(VerifyBase):
         if len(content) != 0:
             raise RecordError("Checkpoint record with non-zero length")
 
+    def verify_record_dirty_pfn_list(self, content):
+        """ dirty pfn list """
+        raise RecordError("Found dirty pfn list record in stream")
+
 
 record_verifiers = {
     REC_TYPE_end:
@@ -443,4 +449,6 @@ record_verifiers = {
         VerifyLibxc.verify_record_verify,
     REC_TYPE_checkpoint:
         VerifyLibxc.verify_record_checkpoint,
+    REC_TYPE_dirty_pfn_list:
+        VerifyLibxc.verify_record_dirty_pfn_list,
     }
-- 
2.5.0
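
The DIRTY_PFN_LIST layout above (an unordered array of 64-bit PFNs, with the count derived from the record length) can be decoded roughly as follows. This is a sketch for illustration only: unpack_dirty_pfn_list() is a hypothetical helper, not something this patch adds, and little-endian byte order is an assumption.

```python
import struct

PFN_SIZE = 8  # sizeof(uint64_t), per the spec text above


def unpack_dirty_pfn_list(body):
    """Decode a DIRTY_PFN_LIST record body into a list of PFNs.

    Hypothetical helper. The count is len(body) / sizeof(uint64_t), so a body
    that is not a multiple of 8 octets is treated as malformed.
    """
    if len(body) % PFN_SIZE != 0:
        raise ValueError("Dirty pfn list length %d is not a multiple of %d"
                         % (len(body), PFN_SIZE))
    count = len(body) // PFN_SIZE
    # Assumption: little-endian 64-bit values.
    return list(struct.unpack("<%dQ" % count, body))
```

Note that verify_record_dirty_pfn_list() in this patch rejects the record outright, since it is only expected on the back channel and not in a saved stream.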


* [PATCH v9 04/25] libxc/migration: export read_record for common use
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (2 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 03/25] libxc/migration: Specification update for DIRTY_PFN_LIST records Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2016-01-26 20:45   ` Konrad Rzeszutek Wilk
  2015-12-30  2:37 ` [PATCH v9 05/25] tools/libxl: add back channel support to write stream Wen Congyang
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

read_record() can also be used by the primary to read the dirty bitmap
record sent by the secondary under COLO.
On the save side, the record must be read from the back channel fd
rather than from ctx->fd, so read_record() gains an fd parameter.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_sr_common.c  | 49 +++++++++++++++++++++++++++++++++++
 tools/libxc/xc_sr_common.h  | 14 ++++++++++
 tools/libxc/xc_sr_restore.c | 63 +--------------------------------------------
 3 files changed, 64 insertions(+), 62 deletions(-)

diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index 8150140..42ee074 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -89,6 +89,55 @@ int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
     return -1;
 }
 
+int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rhdr rhdr;
+    size_t datasz;
+
+    if ( read_exact(fd, &rhdr, sizeof(rhdr)) )
+    {
+        PERROR("Failed to read Record Header from stream");
+        return -1;
+    }
+    else if ( rhdr.length > REC_LENGTH_MAX )
+    {
+        ERROR("Record (0x%08x, %s) length %#x exceeds max (%#x)", rhdr.type,
+              rec_type_to_str(rhdr.type), rhdr.length, REC_LENGTH_MAX);
+        return -1;
+    }
+
+    datasz = ROUNDUP(rhdr.length, REC_ALIGN_ORDER);
+
+    if ( datasz )
+    {
+        rec->data = malloc(datasz);
+
+        if ( !rec->data )
+        {
+            ERROR("Unable to allocate %zu bytes for record data (0x%08x, %s)",
+                  datasz, rhdr.type, rec_type_to_str(rhdr.type));
+            return -1;
+        }
+
+        if ( read_exact(fd, rec->data, datasz) )
+        {
+            free(rec->data);
+            rec->data = NULL;
+            PERROR("Failed to read %zu bytes of data for record (0x%08x, %s)",
+                   datasz, rhdr.type, rec_type_to_str(rhdr.type));
+            return -1;
+        }
+    }
+    else
+        rec->data = NULL;
+
+    rec->type   = rhdr.type;
+    rec->length = rhdr.length;
+
+    return 0;
+}
+
 static void __attribute__((unused)) build_assertions(void)
 {
     XC_BUILD_BUG_ON(sizeof(struct xc_sr_ihdr) != 24);
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index bc99e9a..53d6129 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -370,6 +370,20 @@ static inline int write_record(struct xc_sr_context *ctx,
 }
 
 /*
+ * Reads a record from the stream, and fills in the record structure.
+ *
+ * Returns 0 on success and non-0 on failure.
+ *
+ * On success, the records type and size shall be valid.
+ * - If size is 0, data shall be NULL.
+ * - If size is non-0, data shall be a buffer allocated by malloc() which must
+ *   be passed to free() by the caller.
+ *
+ * On failure, the contents of the record structure are undefined.
+ */
+int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec);
+
+/*
  * This would ideally be private in restore.c, but is needed by
  * x86_pv_localise_page() if we receive pagetables frames ahead of the
  * contents of the frames they point at.
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index d4dc501..e543be3 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -69,67 +69,6 @@ static int read_headers(struct xc_sr_context *ctx)
 }
 
 /*
- * Reads a record from the stream, and fills in the record structure.
- *
- * Returns 0 on success and non-0 on failure.
- *
- * On success, the records type and size shall be valid.
- * - If size is 0, data shall be NULL.
- * - If size is non-0, data shall be a buffer allocated by malloc() which must
- *   be passed to free() by the caller.
- *
- * On failure, the contents of the record structure are undefined.
- */
-static int read_record(struct xc_sr_context *ctx, struct xc_sr_record *rec)
-{
-    xc_interface *xch = ctx->xch;
-    struct xc_sr_rhdr rhdr;
-    size_t datasz;
-
-    if ( read_exact(ctx->fd, &rhdr, sizeof(rhdr)) )
-    {
-        PERROR("Failed to read Record Header from stream");
-        return -1;
-    }
-    else if ( rhdr.length > REC_LENGTH_MAX )
-    {
-        ERROR("Record (0x%08x, %s) length %#x exceeds max (%#x)", rhdr.type,
-              rec_type_to_str(rhdr.type), rhdr.length, REC_LENGTH_MAX);
-        return -1;
-    }
-
-    datasz = ROUNDUP(rhdr.length, REC_ALIGN_ORDER);
-
-    if ( datasz )
-    {
-        rec->data = malloc(datasz);
-
-        if ( !rec->data )
-        {
-            ERROR("Unable to allocate %zu bytes for record data (0x%08x, %s)",
-                  datasz, rhdr.type, rec_type_to_str(rhdr.type));
-            return -1;
-        }
-
-        if ( read_exact(ctx->fd, rec->data, datasz) )
-        {
-            free(rec->data);
-            rec->data = NULL;
-            PERROR("Failed to read %zu bytes of data for record (0x%08x, %s)",
-                   datasz, rhdr.type, rec_type_to_str(rhdr.type));
-            return -1;
-        }
-    }
-    else
-        rec->data = NULL;
-
-    rec->type   = rhdr.type;
-    rec->length = rhdr.length;
-
-    return 0;
-};
-
-/*
  * Is a pfn populated?
  */
 static bool pfn_is_populated(const struct xc_sr_context *ctx, xen_pfn_t pfn)
@@ -646,7 +585,7 @@ static int restore(struct xc_sr_context *ctx)
 
     do
     {
-        rc = read_record(ctx, &rec);
+        rc = read_record(ctx, ctx->fd, &rec);
         if ( rc )
         {
             if ( ctx->restore.buffer_all_records )
-- 
2.5.0
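
The logic of read_record() (read a fixed header, bound-check the length, round it up to the record alignment, then read the body) can be mirrored in Python for readers more familiar with the verifier scripts. This is a sketch under stated assumptions, not the libxc implementation: the header is taken to be two little-endian 32-bit fields (type, length), and the alignment of 8 octets and the 128 MiB length limit are assumed values standing in for REC_ALIGN_ORDER and REC_LENGTH_MAX.

```python
import io
import struct

REC_ALIGN = 8               # assumption: bodies are padded to 8 octets
REC_LENGTH_MAX = 128 << 20  # assumption: stands in for REC_LENGTH_MAX


def read_exact(stream, size):
    """Read exactly size bytes or raise; assumes a buffered, blocking stream."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("Wanted %d bytes, got %d" % (size, len(data)))
    return data


def read_record(stream):
    """Return (type, length, data); data is None for a zero-length record."""
    # Assumption: header is two little-endian uint32 fields, type then length.
    rtype, length = struct.unpack("<II", read_exact(stream, 8))
    if length > REC_LENGTH_MAX:
        raise ValueError("Record 0x%08x length %#x exceeds max" % (rtype, length))

    # Round the length up to the alignment, as ROUNDUP() does in the C code.
    padded = (length + REC_ALIGN - 1) & ~(REC_ALIGN - 1)
    if padded == 0:
        return rtype, length, None

    # Unlike the C version, trim the padding before returning the body.
    return rtype, length, read_exact(stream, padded)[:length]
```

Passing the stream explicitly plays the same role as the new fd parameter: the caller decides whether records come from the main stream or from the back channel.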


* [PATCH v9 05/25] tools/libxl: add back channel support to write stream
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (3 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 04/25] libxc/migration: export read_record for common use Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 06/25] tools/libxl: write checkpoint_state records into the stream Wen Congyang
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Add back channel support to the write stream. A back channel write
stream is the one the Secondary uses to send records back to the
Primary.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_dom_save.c     |  1 +
 tools/libxl/libxl_internal.h     |  1 +
 tools/libxl/libxl_stream_write.c | 26 ++++++++++++++++++++------
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index 86026ac..5fd9db2 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -409,6 +409,7 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
     dss->sws.ao  = dss->ao;
     dss->sws.dss = dss;
     dss->sws.fd  = dss->fd;
+    dss->sws.back_channel = false;
     dss->sws.completion_callback = stream_done;
 
     libxl__stream_write_start(egc, &dss->sws);
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index b6929a9..b473748 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3040,6 +3040,7 @@ struct libxl__stream_write_state {
     libxl__ao *ao;
     libxl__domain_save_state *dss;
     int fd;
+    bool back_channel;
     void (*completion_callback)(libxl__egc *egc,
                                 libxl__stream_write_state *sws,
                                 int rc);
diff --git a/tools/libxl/libxl_stream_write.c b/tools/libxl/libxl_stream_write.c
index 852546f..538a1c6 100644
--- a/tools/libxl/libxl_stream_write.c
+++ b/tools/libxl/libxl_stream_write.c
@@ -49,6 +49,13 @@
  *  - if (hvm)
  *      - Emulator context record
  *  - Checkpoint end record
+ *
+ * For back channel stream:
+ * - libxl__stream_write_start()
+ *    - Set up the stream to running state
+ *
+ * - Add a new API to write the record. When the record is written
+ *   out, call stream->checkpoint_callback() to return.
  */
 
 /* Success/error/cleanup handling. */
@@ -225,6 +232,15 @@ void libxl__stream_write_start(libxl__egc *egc,
 
     stream->running = true;
 
+    dc->ao        = ao;
+    dc->readfd    = -1;
+    dc->copywhat  = "save v2 stream";
+    dc->writefd   = stream->fd;
+    dc->maxsz     = -1;
+
+    if (stream->back_channel)
+        return;
+
     if (dss->type == LIBXL_DOMAIN_TYPE_HVM) {
         stream->device_model_version =
             libxl__device_model_version_running(gc, dss->domid);
@@ -249,12 +265,7 @@ void libxl__stream_write_start(libxl__egc *egc,
         stream->emu_sub_hdr.index = 0;
     }
 
-    dc->ao        = ao;
-    dc->readfd    = -1;
     dc->writewhat = "stream header";
-    dc->copywhat  = "save v2 stream";
-    dc->writefd   = stream->fd;
-    dc->maxsz     = -1;
     dc->callback  = stream_header_done;
 
     rc = libxl__datacopier_start(dc);
@@ -279,6 +290,7 @@ void libxl__stream_write_start_checkpoint(libxl__egc *egc,
 {
     assert(stream->running);
     assert(!stream->in_checkpoint);
+    assert(!stream->back_channel);
     stream->in_checkpoint = true;
 
     write_emulator_xenstore_record(egc, stream);
@@ -584,7 +596,9 @@ static void stream_done(libxl__egc *egc,
         libxl__carefd_close(stream->emu_carefd);
     free(stream->emu_body);
 
-    check_all_finished(egc, stream, rc);
+    if (!stream->back_channel)
+        /* back channel stream doesn't have save helper */
+        check_all_finished(egc, stream, rc);
 }
 
 static void checkpoint_done(libxl__egc *egc,
-- 
2.5.0


* [PATCH v9 06/25] tools/libxl: write checkpoint_state records into the stream
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (4 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 05/25] tools/libxl: add back channel support to write stream Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 07/25] tools/libxl: add back channel support to read stream Wen Congyang
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Write checkpoint_state records into the stream. Both the primary and
the secondary use them to send COLO context.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_internal.h     |  5 ++++
 tools/libxl/libxl_stream_write.c | 60 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index b473748..3cf297f 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3052,6 +3052,7 @@ struct libxl__stream_write_state {
     bool running;
     bool in_checkpoint;
     bool sync_teardown;  /* Only used to coordinate shutdown on error path. */
+    bool in_checkpoint_state;
     libxl__save_helper_state shs;
 
     /* Main stream-writing data. */
@@ -3075,6 +3076,10 @@ _hidden void libxl__stream_write_start(libxl__egc *egc,
 _hidden void
 libxl__stream_write_start_checkpoint(libxl__egc *egc,
                                      libxl__stream_write_state *stream);
+_hidden void
+libxl__stream_write_checkpoint_state(libxl__egc *egc,
+                                     libxl__stream_write_state *stream,
+                                     libxl_sr_checkpoint_state *srcs);
 _hidden void libxl__stream_write_abort(libxl__egc *egc,
                                        libxl__stream_write_state *stream,
                                        int rc);
diff --git a/tools/libxl/libxl_stream_write.c b/tools/libxl/libxl_stream_write.c
index 538a1c6..6541399 100644
--- a/tools/libxl/libxl_stream_write.c
+++ b/tools/libxl/libxl_stream_write.c
@@ -98,6 +98,12 @@ static void write_checkpoint_end_record(libxl__egc *egc,
 static void checkpoint_end_record_done(libxl__egc *egc,
                                        libxl__stream_write_state *stream);
 
+/* checkpoint state */
+static void write_checkpoint_state_done(libxl__egc *egc,
+                                        libxl__stream_write_state *stream);
+static void checkpoint_state_done(libxl__egc *egc,
+                                  libxl__stream_write_state *stream, int rc);
+
 /*----- Helpers -----*/
 
 static void write_done(libxl__egc *egc,
@@ -583,6 +589,21 @@ static void stream_complete(libxl__egc *egc,
         return;
     }
 
+    if (stream->in_checkpoint_state) {
+        assert(rc);
+
+        /*
+         * If an error is encountered while in a checkpoint, pass it
+         * back to libxc.  The failure will come back around to us via
+         * 1. normal stream
+         *    libxl__xc_domain_save_done()
+         * 2. back_channel stream
+         *    libxl__stream_write_abort()
+         */
+        checkpoint_state_done(egc, stream, rc);
+        return;
+    }
+
     stream_done(egc, stream, rc);
 }
 
@@ -590,6 +611,7 @@ static void stream_done(libxl__egc *egc,
                         libxl__stream_write_state *stream, int rc)
 {
     assert(stream->running);
+    assert(!stream->in_checkpoint_state);
     stream->running = false;
 
     if (stream->emu_carefd)
@@ -650,7 +672,43 @@ static void check_all_finished(libxl__egc *egc,
         libxl__save_helper_inuse(&stream->shs))
         return;
 
-    stream->completion_callback(egc, stream, stream->rc);
+    if (stream->completion_callback)
+        /* back channel stream doesn't have completion_callback() */
+        stream->completion_callback(egc, stream, stream->rc);
+}
+
+/*----- checkpoint state -----*/
+void libxl__stream_write_checkpoint_state(libxl__egc *egc,
+                                          libxl__stream_write_state *stream,
+                                          libxl_sr_checkpoint_state *srcs)
+{
+    struct libxl__sr_rec_hdr rec;
+
+    assert(stream->running);
+    assert(!stream->in_checkpoint);
+    assert(!stream->in_checkpoint_state);
+    stream->in_checkpoint_state = true;
+
+    FILLZERO(rec);
+    rec.type = REC_TYPE_CHECKPOINT_STATE;
+    rec.length = sizeof(*srcs);
+
+    setup_write(egc, stream, "checkpoint state", &rec,
+                srcs, write_checkpoint_state_done);
+}
+
+static void write_checkpoint_state_done(libxl__egc *egc,
+                                        libxl__stream_write_state *stream)
+{
+    checkpoint_state_done(egc, stream, 0);
+}
+
+static void checkpoint_state_done(libxl__egc *egc,
+                                  libxl__stream_write_state *stream, int rc)
+{
+    assert(stream->in_checkpoint_state);
+    stream->in_checkpoint_state = false;
+    stream->checkpoint_callback(egc, stream, rc);
 }
 
 /*
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 07/25] tools/libxl: add back channel support to read stream
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (5 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 06/25] tools/libxl: write checkpoint_state records into the stream Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 08/25] tools/libxl: handle checkpoint_state records in a libxl migration v2 " Wen Congyang
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

This is used by the primary to read records sent by the secondary.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_create.c      |  1 +
 tools/libxl/libxl_internal.h    |  1 +
 tools/libxl/libxl_stream_read.c | 27 +++++++++++++++++++++++----
 3 files changed, 25 insertions(+), 4 deletions(-)
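
In this patch a back channel stream returns from libxl__stream_read_start()
as soon as the read fd is set, without reading a stream header or running
the legacy converter, so the primary reads records one at a time. The
sketch below shows that per-record read on a plain file descriptor. It is
illustrative only: the ex_ names, the 64-byte body limit and the 8-byte
padding rule are assumptions, and libxl itself reads through its
datacopier, not blocking read() calls.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical record header: type and unpadded body length. */
struct ex_rec_hdr {
    uint32_t type;
    uint32_t length;
};

#define EX_REC_ALIGN 8u   /* assumption: bodies are padded to 8 bytes */
#define EX_MAX_BODY  64u  /* arbitrary limit for this sketch */

/* Read exactly len bytes; 0 on success, -1 on error or early EOF. */
static int ex_read_exactly(int fd, void *data, size_t len)
{
    uint8_t *p = data;

    while (len) {
        ssize_t n = read(fd, p, len);

        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read one record: the header first, then the body including padding. */
static int ex_read_record(int fd, struct ex_rec_hdr *hdr,
                          uint8_t body[EX_MAX_BODY])
{
    size_t padded;

    if (ex_read_exactly(fd, hdr, sizeof(*hdr)))
        return -1;
    if (hdr->length > EX_MAX_BODY)
        return -1;

    padded = (hdr->length + EX_REC_ALIGN - 1) & ~(size_t)(EX_REC_ALIGN - 1);
    return padded ? ex_read_exactly(fd, body, padded) : 0;
}
```

A closed or truncated back channel shows up here as -1 from
ex_read_record(), which the caller would treat as a stream failure.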

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 8087bcc..c2eeb9d 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1012,6 +1012,7 @@ static void domcreate_bootloader_done(libxl__egc *egc,
     dcs->srs.dcs = dcs;
     dcs->srs.fd = restore_fd;
     dcs->srs.legacy = (dcs->restore_params.stream_version == 1);
+    dcs->srs.back_channel = false;
     dcs->srs.completion_callback = domcreate_stream_done;
 
     if (restore_fd >= 0) {
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 3cf297f..ea2cfb8 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3418,6 +3418,7 @@ struct libxl__stream_read_state {
     libxl__domain_create_state *dcs;
     int fd;
     bool legacy;
+    bool back_channel;
     void (*completion_callback)(libxl__egc *egc,
                                 libxl__stream_read_state *srs,
                                 int rc);
diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c
index 6ad2a27..6437639 100644
--- a/tools/libxl/libxl_stream_read.c
+++ b/tools/libxl/libxl_stream_read.c
@@ -104,6 +104,15 @@
  * Depending on the contents of the stream, there are likely to be several
  * parallel tasks being managed.  check_all_finished() is used to join all
  * tasks in both success and error cases.
+ *
+ * For back channel stream:
+ * - libxl__stream_read_start()
+ *    - Set up the stream to running state
+ *
+ * - libxl__stream_read_continue()
+ *     - Set up reading the next record from a started stream.
+ *       Add code to process_record() to handle the record.
+ *       Then call stream->checkpoint_callback() to return.
  */
 
 /* Success/error/cleanup handling. */
@@ -207,6 +216,17 @@ void libxl__stream_read_start(libxl__egc *egc,
     stream->running = true;
     stream->phase   = SRS_PHASE_NORMAL;
 
+    dc->ao       = stream->ao;
+    dc->copywhat = "restore v2 stream";
+    dc->writefd  = -1;
+
+    if (stream->back_channel) {
+        assert(!stream->legacy);
+
+        dc->readfd = stream->fd;
+        return;
+    }
+
     if (stream->legacy) {
         /* Convert the legacy stream. */
         libxl__conversion_helper_state *chs = &stream->chs;
@@ -229,10 +249,7 @@ void libxl__stream_read_start(libxl__egc *egc,
     }
     /* stream->fd is now a v2 stream. */
 
-    dc->ao       = stream->ao;
-    dc->copywhat = "restore v2 stream";
     dc->readfd   = stream->fd;
-    dc->writefd  = -1;
 
     /* Start reading the stream header. */
     rc = setup_read(stream, "stream header",
@@ -748,7 +765,9 @@ static void stream_done(libxl__egc *egc,
     LIBXL_STAILQ_FOREACH_SAFE(rec, &stream->record_queue, entry, trec)
         free_record(rec);
 
-    check_all_finished(egc, stream, rc);
+    if (!stream->back_channel)
+        /* back channel stream doesn't have restore helper */
+        check_all_finished(egc, stream, rc);
 }
 
 void libxl__xc_domain_restore_done(libxl__egc *egc, void *dcs_void,
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 08/25] tools/libxl: handle checkpoint_state records in a libxl migration v2 read stream
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (6 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 07/25] tools/libxl: add back channel support to read stream Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 09/25] tools/libx{l, c}: introduce should_checkpoint callback Wen Congyang
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Read a checkpoint_state and call stream->checkpoint_callback to handle it.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_internal.h    |  3 +++
 tools/libxl/libxl_stream_read.c | 57 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 59 insertions(+), 1 deletion(-)
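
The handling described above hinges on the in_checkpoint_state flag: a
CHECKPOINT_STATE record is only accepted while a read was requested, and
its id is handed to the checkpoint callback through the rc argument. A
reduced model of that guard follows. All ex_ names and constants are
hypothetical and stand in for the libxl types; this is a sketch of the
control flow, not the patch's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_REC_CHECKPOINT_STATE 5u   /* placeholder record type */
#define EX_ERROR_FAIL           (-1) /* placeholder error code */

struct ex_stream {
    bool in_checkpoint_state;
    void (*checkpoint_callback)(struct ex_stream *s, int rc);
    int last_rc;                     /* filled in by ex_record_rc() */
};

/* Example callback: remember what the stream reported. */
static void ex_record_rc(struct ex_stream *s, int rc)
{
    s->last_rc = rc;
}

/* Caller asks for one checkpoint_state record. */
static void ex_read_checkpoint_state(struct ex_stream *s)
{
    s->in_checkpoint_state = true;
}

/* Handle one decoded record; returns 0 or EX_ERROR_FAIL. */
static int ex_process_record(struct ex_stream *s, uint32_t type, uint32_t id)
{
    if (type != EX_REC_CHECKPOINT_STATE)
        return EX_ERROR_FAIL;        /* unrecognised record */
    if (!s->in_checkpoint_state)
        return EX_ERROR_FAIL;        /* record arrived unrequested */

    s->in_checkpoint_state = false;
    s->checkpoint_callback(s, (int)id); /* id travels in the rc slot */
    return 0;
}
```

Because the id and an error code share the same argument, a caller of the
real interface has to distinguish valid ids from error values itself.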

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index ea2cfb8..5f9722b 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3430,6 +3430,7 @@ struct libxl__stream_read_state {
     bool running;
     bool in_checkpoint;
     bool sync_teardown; /* Only used to coordinate shutdown on error path. */
+    bool in_checkpoint_state;
     libxl__save_helper_state shs;
     libxl__conversion_helper_state chs;
 
@@ -3457,6 +3458,8 @@ _hidden void libxl__stream_read_start(libxl__egc *egc,
                                       libxl__stream_read_state *stream);
 _hidden void libxl__stream_read_start_checkpoint(libxl__egc *egc,
                                                  libxl__stream_read_state *stream);
+_hidden void libxl__stream_read_checkpoint_state(libxl__egc *egc,
+                                                 libxl__stream_read_state *stream);
 _hidden void libxl__stream_read_abort(libxl__egc *egc,
                                       libxl__stream_read_state *stream, int rc);
 static inline bool
diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c
index 6437639..e6421bc 100644
--- a/tools/libxl/libxl_stream_read.c
+++ b/tools/libxl/libxl_stream_read.c
@@ -152,6 +152,10 @@ static void write_emulator_done(libxl__egc *egc,
                                 libxl__datacopier_state *dc,
                                 int rc, int onwrite, int errnoval);
 
+/* Handlers for checkpoint state mini-loop */
+static void checkpoint_state_done(libxl__egc *egc,
+                                  libxl__stream_read_state *stream, int rc);
+
 /*----- Helpers -----*/
 
 /* Helper to set up reading some data from the stream. */
@@ -535,6 +539,7 @@ static bool process_record(libxl__egc *egc,
     STATE_AO_GC(stream->ao);
     libxl__domain_create_state *dcs = stream->dcs;
     libxl__sr_record_buf *rec;
+    libxl_sr_checkpoint_state *srcs;
     bool further_action_needed = false;
     int rc = 0;
 
@@ -605,6 +610,17 @@ static bool process_record(libxl__egc *egc,
         checkpoint_done(egc, stream, 0);
         break;
 
+    case REC_TYPE_CHECKPOINT_STATE:
+        if (!stream->in_checkpoint_state) {
+            LOG(ERROR, "Unexpected CHECKPOINT_STATE record in stream");
+            rc = ERROR_FAIL;
+            goto err;
+        }
+
+        srcs = rec->body;
+        checkpoint_state_done(egc, stream, srcs->id);
+        break;
+
     default:
         LOG(ERROR, "Unrecognised record 0x%08x", rec->hdr.type);
         rc = ERROR_FAIL;
@@ -716,6 +732,21 @@ static void stream_complete(libxl__egc *egc,
         return;
     }
 
+    if (stream->in_checkpoint_state) {
+        assert(rc);
+
+        /*
+         * If an error is encountered while in a checkpoint, pass it
+         * back to libxc.  The failure will come back around to us via
+         * 1. normal stream
+         *    libxl__xc_domain_restore_done()
+         * 2. back_channel stream
+         *    libxl__stream_read_abort()
+         */
+        checkpoint_state_done(egc, stream, rc);
+        return;
+    }
+
     stream_done(egc, stream, rc);
 }
 
@@ -746,6 +777,7 @@ static void stream_done(libxl__egc *egc,
 
     assert(stream->running);
     assert(!stream->in_checkpoint);
+    assert(!stream->in_checkpoint_state);
     stream->running = false;
 
     if (stream->incoming_record)
@@ -863,7 +895,30 @@ static void check_all_finished(libxl__egc *egc,
         libxl__conversion_helper_inuse(&stream->chs))
         return;
 
-    stream->completion_callback(egc, stream, stream->rc);
+    if (stream->completion_callback)
+        /* back channel stream doesn't have completion_callback() */
+        stream->completion_callback(egc, stream, stream->rc);
+}
+
+/*----- Checkpoint state handlers -----*/
+
+void libxl__stream_read_checkpoint_state(libxl__egc *egc,
+                                         libxl__stream_read_state *stream)
+{
+    assert(stream->running);
+    assert(!stream->in_checkpoint);
+    assert(!stream->in_checkpoint_state);
+    stream->in_checkpoint_state = true;
+
+    setup_read_record(egc, stream);
+}
+
+static void checkpoint_state_done(libxl__egc *egc,
+                                  libxl__stream_read_state *stream, int rc)
+{
+    assert(stream->in_checkpoint_state);
+    stream->in_checkpoint_state = false;
+    stream->checkpoint_callback(egc, stream, rc);
 }
 
 /*
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 09/25] tools/libx{l, c}: introduce should_checkpoint callback
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (7 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 08/25] tools/libxl: handle checkpoint_state records in a libxl migration v2 " Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2016-01-26 20:50   ` Konrad Rzeszutek Wilk
  2015-12-30  2:37 ` [PATCH v9 10/25] tools/libx{l, c}: add postcopy/suspend callback to restore side Wen Congyang
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Under COLO, checkpoints are taken on demand. If this callback
returns 1, another checkpoint is taken; 0 indicates an
unexpected error.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/include/xenguest.h     | 18 ++++++++++++++++++
 tools/libxl/libxl_save_msgs_gen.pl |  7 ++++---
 2 files changed, 22 insertions(+), 3 deletions(-)
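
As a rough illustration of the on-demand loop the commit message describes,
the sketch below calls a checkpoint callback and then asks
should_checkpoint whether to go round again. The ex_ struct and helper
callbacks are hypothetical, and treating any return value other than 1 as
"stop" is an assumption of this sketch, not something the patch specifies.

```c
/* Hypothetical subset of the save-side callbacks. */
struct ex_save_callbacks {
    int (*checkpoint)(void *data);        /* 1: continue, otherwise stop */
    int (*should_checkpoint)(void *data); /* 1: take another checkpoint */
    void *data;
};

/* Run checkpoints on demand; returns how many were taken. */
static int ex_checkpoint_loop(const struct ex_save_callbacks *cb)
{
    int taken = 0;

    for (;;) {
        if (cb->checkpoint(cb->data) != 1)
            break;
        taken++;

        /* Called after the checkpoint callback, as in xenguest.h. */
        if (cb->should_checkpoint(cb->data) != 1)
            break;
    }
    return taken;
}

/* Example callbacks: always succeed, and allow a fixed number of repeats. */
static int ex_checkpoint_ok(void *data)
{
    (void)data;
    return 1;
}

static int ex_repeat_budget(void *data)
{
    int *remaining = data;

    return (*remaining)-- > 0 ? 1 : 0;
}
```

With a budget of 2 repeats, ex_checkpoint_loop() takes the initial
checkpoint plus two more before should_checkpoint stops the loop.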

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index bd133af..88d6e13 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -62,6 +62,15 @@ struct save_callbacks {
      * 1: take another checkpoint */
     int (*checkpoint)(void* data);
 
+    /*
+     * Called after the checkpoint callback.
+     *
+     * returns:
+     * 0: terminate checkpointing gracefully
+     * 1: take another checkpoint
+     */
+    int (*should_checkpoint)(void* data);
+
     /* Enable qemu-dm logging dirty pages to xen */
     int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
 
@@ -93,6 +102,15 @@ struct restore_callbacks {
 #define XGR_CHECKPOINT_FAILOVER 2 /* Failover and resume VM */
     int (*checkpoint)(void* data);
 
+    /*
+     * Called after the checkpoint callback.
+     *
+     * returns:
+     * 0: terminate checkpointing gracefully
+     * 1: take another checkpoint
+     */
+    int (*should_checkpoint)(void* data);
+
     /* to be provided as the last argument to each callback function */
     void* data;
 };
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index d6d2967..9107a86 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -26,11 +26,12 @@ our @msgs = (
     [  3, 'scxA',   "suspend", [] ],
     [  4, 'scxA',   "postcopy", [] ],
     [  5, 'srcxA',  "checkpoint", [] ],
-    [  6, 'scxA',   "switch_qemu_logdirty",  [qw(int domid
+    [  6, 'srcxA',  "should_checkpoint", [] ],
+    [  7, 'scxA',   "switch_qemu_logdirty",  [qw(int domid
                                               unsigned enable)] ],
-    [  7, 'r',      "restore_results",       ['unsigned long', 'store_mfn',
+    [  8, 'r',      "restore_results",       ['unsigned long', 'store_mfn',
                                               'unsigned long', 'console_mfn'] ],
-    [  8, 'srW',    "complete",              [qw(int retval
+    [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
 );
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 10/25] tools/libx{l, c}: add postcopy/suspend callback to restore side
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (8 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 09/25] tools/libx{l, c}: introduce should_checkpoint callback Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 11/25] secondary vm suspend/resume/checkpoint code Wen Congyang
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

The secondary (restore side) runs the guest under COLO, so it also
needs the postcopy/suspend callbacks.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/include/xenguest.h     | 10 ++++++++++
 tools/libxl/libxl_save_msgs_gen.pl |  4 ++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 88d6e13..7c74977 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -95,6 +95,16 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
 
 /* callbacks provided by xc_domain_restore */
 struct restore_callbacks {
+    /* Called after a new checkpoint to suspend the guest.
+     */
+    int (*suspend)(void* data);
+
+    /* Called after the secondary vm is ready to resume.
+     * Callback function resumes the guest & the device model,
+     * returns to xc_domain_restore.
+     */
+    int (*postcopy)(void* data);
+
     /* A checkpoint record has been found in the stream.
      * returns: */
 #define XGR_CHECKPOINT_ERROR    0 /* Terminate processing */
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 9107a86..7c9859b 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -23,8 +23,8 @@ our @msgs = (
                                                  STRING doing_what),
                                                 'unsigned long', 'done',
                                                 'unsigned long', 'total'] ],
-    [  3, 'scxA',   "suspend", [] ],
-    [  4, 'scxA',   "postcopy", [] ],
+    [  3, 'srcxA',  "suspend", [] ],
+    [  4, 'srcxA',  "postcopy", [] ],
     [  5, 'srcxA',  "checkpoint", [] ],
     [  6, 'srcxA',  "should_checkpoint", [] ],
     [  7, 'scxA',   "switch_qemu_logdirty",  [qw(int domid
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 11/25] secondary vm suspend/resume/checkpoint code
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (9 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 10/25] tools/libx{l, c}: add postcopy/suspend callback to restore side Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 12/25] primary " Wen Congyang
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

The secondary vm runs in colo mode, so the following steps are
repeated:
1. Resume secondary vm
   a. Send CHECKPOINT_SVM_READY to master.
   b. If it is not the first resume, call libxl__checkpoint_devices_preresume().
   c. If it is the first resume(resume right after live migration),
      - call libxl__xc_domain_restore_done() to build the secondary vm.
      - enable secondary vm's logdirty.
      - call libxl__domain_resume() to resume secondary vm.
      - call libxl__checkpoint_devices_setup() to set up checkpoint devices.
   d. Send CHECKPOINT_SVM_RESUMED to master.
2. Wait for a new checkpoint
   a. Call libxl__checkpoint_devices_commit().
   b. Read CHECKPOINT_NEW from master.
3. Suspend secondary vm
   a. Suspend secondary vm.
   b. Call libxl__checkpoint_devices_postsuspend().
   c. Send CHECKPOINT_SVM_SUSPENDED to master.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
---
 tools/libxl/Makefile             |    1 +
 tools/libxl/libxl_colo.h         |   24 +
 tools/libxl/libxl_colo_restore.c | 1041 ++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_create.c       |   37 ++
 tools/libxl/libxl_internal.h     |   19 +
 tools/libxl/libxl_save_callout.c |    7 +-
 tools/libxl/libxl_stream_read.c  |   12 +
 7 files changed, 1140 insertions(+), 1 deletion(-)
 create mode 100644 tools/libxl/libxl_colo.h
 create mode 100644 tools/libxl/libxl_colo_restore.c
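
The three-step cycle listed in the commit message can be modelled as a
small state machine. The sketch below only tracks which step the secondary
is in and which messages it would send to the master; the ex_ types are
hypothetical, and the device and domain calls named in the commit message
appear only as comments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Messages named in the commit message. */
enum ex_msg {
    EX_CHECKPOINT_SVM_READY,
    EX_CHECKPOINT_SVM_RESUMED,
    EX_CHECKPOINT_SVM_SUSPENDED,
};

enum ex_step { EX_RESUME, EX_WAIT_CHECKPOINT, EX_SUSPEND };

struct ex_secondary {
    enum ex_step step;
    bool first_resume;
    enum ex_msg sent[16];   /* log of messages sent to the master */
    size_t nsent;
};

static void ex_send(struct ex_secondary *s, enum ex_msg m)
{
    if (s->nsent < sizeof(s->sent) / sizeof(s->sent[0]))
        s->sent[s->nsent++] = m;
}

/*
 * Advance by one step. got_checkpoint_new is only consulted while
 * waiting. Returns true if the state machine moved on.
 */
static bool ex_secondary_step(struct ex_secondary *s, bool got_checkpoint_new)
{
    switch (s->step) {
    case EX_RESUME:
        ex_send(s, EX_CHECKPOINT_SVM_READY);
        /* First resume: build the vm, enable logdirty, resume, set up
         * devices. Later resumes: preresume the checkpoint devices. */
        s->first_resume = false;
        ex_send(s, EX_CHECKPOINT_SVM_RESUMED);
        s->step = EX_WAIT_CHECKPOINT;
        return true;

    case EX_WAIT_CHECKPOINT:
        /* Devices are committed, then CHECKPOINT_NEW is awaited. */
        if (!got_checkpoint_new)
            return false;
        s->step = EX_SUSPEND;
        return true;

    case EX_SUSPEND:
        /* Suspend the vm, then run the devices' postsuspend hooks. */
        ex_send(s, EX_CHECKPOINT_SVM_SUSPENDED);
        s->step = EX_RESUME;
        return true;
    }
    return false;
}
```

One full cycle sends READY, RESUMED and SUSPENDED in that order and ends
back in the resume step, ready for the next checkpoint.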

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index d075a30..19a95a9 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -63,6 +63,7 @@ LIBXL_OBJS-y += libxl_no_convert_callout.o
 endif
 
 LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
+LIBXL_OBJS-y += libxl_colo_restore.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o
diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
new file mode 100644
index 0000000..8bea1a2
--- /dev/null
+++ b/tools/libxl/libxl_colo.h
@@ -0,0 +1,24 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#ifndef LIBXL_COLO_H
+#define LIBXL_COLO_H
+
+extern void libxl__colo_restore_setup(libxl__egc *egc,
+                                      libxl__colo_restore_state *crs);
+extern void libxl__colo_restore_teardown(libxl__egc *egc, void *dcs_void,
+                                         int ret, int retval, int errnoval);
+
+#endif
diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
new file mode 100644
index 0000000..4f25f27
--- /dev/null
+++ b/tools/libxl/libxl_colo_restore.c
@@ -0,0 +1,1041 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *         Yang Hongyang <hongyang.yang@easystack.cn>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+#include "libxl_colo.h"
+#include "libxl_sr_stream_format.h"
+
+enum {
+    LIBXL_COLO_SETUPED,
+    LIBXL_COLO_SUSPENDED,
+    LIBXL_COLO_RESUMED,
+};
+
+typedef struct libxl__colo_restore_checkpoint_state libxl__colo_restore_checkpoint_state;
+struct libxl__colo_restore_checkpoint_state {
+    libxl__domain_suspend_state dsps;
+    libxl__logdirty_switch lds;
+    libxl__colo_restore_state *crs;
+    libxl__stream_write_state sws;
+    int status;
+    bool preresume;
+    /* used for teardown */
+    int teardown_devices;
+    int saved_rc;
+    char *state_file;
+
+    void (*callback)(libxl__egc *,
+                     libxl__colo_restore_checkpoint_state *,
+                     int);
+};
+
+
+static void libxl__colo_restore_domain_resume_callback(void *data);
+static void libxl__colo_restore_domain_checkpoint_callback(void *data);
+static void libxl__colo_restore_domain_should_checkpoint_callback(void *data);
+static void libxl__colo_restore_domain_suspend_callback(void *data);
+
+static const libxl__checkpoint_device_instance_ops *colo_restore_ops[] = {
+    NULL,
+};
+
+/* ===================== colo: common functions ===================== */
+static void colo_enable_logdirty(libxl__colo_restore_state *crs, libxl__egc *egc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    const uint32_t domid = crs->domid;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+
+    EGC_GC;
+
+    /* we need to know which pages are dirty to restore the guest */
+    if (xc_shadow_control(CTX->xch, domid,
+                          XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
+                          NULL, 0, NULL, 0, NULL) < 0) {
+        LOG(ERROR, "cannot enable secondary vm's logdirty");
+        lds->callback(egc, lds, ERROR_FAIL);
+        return;
+    }
+
+    if (crs->hvm) {
+        libxl__domain_common_switch_qemu_logdirty(egc, domid, 1, lds);
+        return;
+    }
+
+    lds->callback(egc, lds, 0);
+}
+
+static void colo_disable_logdirty(libxl__colo_restore_state *crs,
+                                  libxl__egc *egc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    const uint32_t domid = crs->domid;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+
+    EGC_GC;
+
+    /* turn off logdirty; dirty pages no longer need to be tracked */
+    if (xc_shadow_control(CTX->xch, domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                          NULL, 0, NULL, 0, NULL) < 0)
+        LOG(WARN, "cannot disable secondary vm's logdirty");
+
+    if (crs->hvm) {
+        libxl__domain_common_switch_qemu_logdirty(egc, domid, 0, lds);
+        return;
+    }
+
+    lds->callback(egc, lds, 0);
+}
+
+static void colo_resume_vm(libxl__egc *egc,
+                           libxl__colo_restore_checkpoint_state *crcs,
+                           int restore_device_model)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int rc;
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+
+    EGC_GC;
+
+    if (!crs->saved_cb) {
+        /* TODO: sync mmu for hvm? */
+        if (restore_device_model) {
+            rc = libxl__domain_restore_device_model(gc, crs->domid,
+                                                    crcs->state_file);
+            if (rc) {
+                LOG(ERROR, "cannot restore device model for secondary vm");
+                crcs->callback(egc, crcs, rc);
+                return;
+            }
+        }
+        rc = libxl__domain_resume(gc, crs->domid, 0);
+        if (rc)
+            LOG(ERROR, "cannot resume secondary vm");
+
+        crcs->callback(egc, crcs, rc);
+        return;
+    }
+
+    /*
+     * TODO: get store gfn and console gfn
+     *  We should call the callback restore_results in
+     *  xc_domain_restore() before resuming the guest.
+     */
+    libxl__xc_domain_restore_done(egc, dcs, 0, 0, 0);
+
+    return;
+}
+
+static int init_device_subkind(libxl__checkpoint_devices_state *cds)
+{
+    /* init device subkind-specific state in the libxl ctx */
+    int rc;
+    STATE_AO_GC(cds->ao);
+
+    rc = 0;
+    return rc;
+}
+
+static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
+{
+    /* cleanup device subkind-specific state in the libxl ctx */
+    STATE_AO_GC(cds->ao);
+}
+
+
+/* ================ colo: setup restore environment ================ */
+static void libxl__colo_domain_create_cb(libxl__egc *egc,
+                                         libxl__domain_create_state *dcs,
+                                         int rc, uint32_t domid);
+
+static int init_dsps(libxl__domain_suspend_state *dsps)
+{
+    int rc = ERROR_FAIL;
+    libxl_domain_type type;
+
+    STATE_AO_GC(dsps->ao);
+
+    type = libxl__domain_type(gc, dsps->domid);
+    if (type == LIBXL_DOMAIN_TYPE_INVALID)
+        goto out;
+
+    libxl__xswait_init(&dsps->pvcontrol);
+    libxl__ev_evtchn_init(&dsps->guest_evtchn);
+    libxl__ev_xswatch_init(&dsps->guest_watch);
+    libxl__ev_time_init(&dsps->guest_timeout);
+
+    if (type == LIBXL_DOMAIN_TYPE_HVM)
+        dsps->hvm = 1;
+    else
+        dsps->hvm = 0;
+
+    dsps->guest_evtchn.port = -1;
+    dsps->guest_evtchn_lockfd = -1;
+    dsps->guest_responded = 0;
+    dsps->dm_savefile = libxl__device_model_savefile(gc, dsps->domid);
+
+    /* Secondary vm is not created, so we cannot get evtchn port */
+
+    rc = 0;
+
+out:
+    return rc;
+}
+
+void libxl__colo_restore_setup(libxl__egc *egc,
+                               libxl__colo_restore_state *crs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs;
+    int rc = ERROR_FAIL;
+
+    /* Convenience aliases */
+    libxl__srm_restore_autogen_callbacks *const callbacks =
+        &dcs->srs.shs.callbacks.restore.a;
+    const int domid = crs->domid;
+
+    STATE_AO_GC(crs->ao);
+
+    GCNEW(crcs);
+    crs->crcs = crcs;
+    crcs->crs = crs;
+
+    /* setup dsps */
+    crcs->dsps.ao = ao;
+    crcs->dsps.domid = domid;
+    if (init_dsps(&crcs->dsps))
+        goto err;
+
+    callbacks->suspend = libxl__colo_restore_domain_suspend_callback;
+    callbacks->postcopy = libxl__colo_restore_domain_resume_callback;
+    callbacks->checkpoint = libxl__colo_restore_domain_checkpoint_callback;
+    callbacks->should_checkpoint = libxl__colo_restore_domain_should_checkpoint_callback;
+
+    /*
+     * Secondary vm is running in colo mode, so we need to call
+     * libxl__xc_domain_restore_done() to create secondary vm.
+     * But we will exit in domain_create_cb(). So replace the
+     * callback here.
+     */
+    crs->saved_cb = dcs->callback;
+    dcs->callback = libxl__colo_domain_create_cb;
+    crcs->state_file = GCSPRINTF(LIBXL_DEVICE_MODEL_RESTORE_FILE".%d", domid);
+    crcs->status = LIBXL_COLO_SETUPED;
+
+    libxl__logdirty_init(&crcs->lds);
+    crcs->lds.ao = ao;
+
+    crcs->sws.fd = crs->send_fd;
+    crcs->sws.ao = ao;
+    crcs->sws.back_channel = true;
+
+    dcs->cds.concrete_data = crs;
+
+    libxl__stream_write_start(egc, &crcs->sws);
+
+    rc = 0;
+
+out:
+    crs->callback(egc, crs, rc);
+    return;
+
+err:
+    goto out;
+}
+
+static void libxl__colo_domain_create_cb(libxl__egc *egc,
+                                         libxl__domain_create_state *dcs,
+                                         int rc, uint32_t domid)
+{
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+
+    crcs->callback(egc, crcs, rc);
+}
+
+
+/* ================ colo: teardown restore environment ================ */
+static void colo_restore_teardown_devices_done(libxl__egc *egc,
+    libxl__checkpoint_devices_state *cds, int rc);
+static void do_failover(libxl__egc *egc, libxl__colo_restore_state *crs);
+static void do_failover_done(libxl__egc *egc,
+                             libxl__colo_restore_checkpoint_state *crcs,
+                             int rc);
+static void colo_disable_logdirty_done(libxl__egc *egc,
+                                       libxl__logdirty_switch *lds,
+                                       int rc);
+static void libxl__colo_restore_teardown_done(libxl__egc *egc,
+                                              libxl__colo_restore_state *crs,
+                                              int rc);
+
+void libxl__colo_restore_teardown(libxl__egc *egc, void *dcs_void,
+                                  int ret, int retval, int errnoval)
+{
+    libxl__domain_create_state *dcs = dcs_void;
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+    int rc = 1;
+
+    /* convenience aliases */
+    libxl__colo_restore_state *const crs = &dcs->crs;
+    EGC_GC;
+
+    if (ret == 0 && retval == 0)
+        rc = 0;
+
+    LOG(INFO, "%s", rc ? "colo fails" : "failover");
+
+    libxl__stream_write_abort(egc, &crcs->sws, 1);
+    if (crs->saved_cb) {
+        /* crcs->status is LIBXL_COLO_SETUPED */
+        dcs->srs.completion_callback = NULL;
+    }
+    libxl__xc_domain_restore_done(egc, dcs, ret, retval, errnoval);
+
+    crcs->saved_rc = rc;
+    if (!crcs->teardown_devices) {
+        colo_restore_teardown_devices_done(egc, &dcs->cds, 0);
+        return;
+    }
+
+    dcs->cds.callback = colo_restore_teardown_devices_done;
+    libxl__checkpoint_devices_teardown(egc, &dcs->cds);
+}
+
+static void colo_restore_teardown_devices_done(libxl__egc *egc,
+    libxl__checkpoint_devices_state *cds, int rc)
+{
+    libxl__colo_restore_state *crs = cds->concrete_data;
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+
+    EGC_GC;
+
+    if (rc)
+        LOG(ERROR, "COLO: failed to teardown device for guest with domid %u,"
+            " rc %d", cds->domid, rc);
+
+    if (crcs->teardown_devices)
+        cleanup_device_subkind(cds);
+
+    rc = crcs->saved_rc;
+    if (!rc) {
+        crcs->callback = do_failover_done;
+        do_failover(egc, crs);
+        return;
+    }
+
+    libxl__colo_restore_teardown_done(egc, crs, rc);
+}
+
+static void do_failover(libxl__egc *egc, libxl__colo_restore_state *crs)
+{
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    const int status = crcs->status;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+
+    EGC_GC;
+
+    switch (status) {
+    case LIBXL_COLO_SETUPED:
+        /*
+         * We will come here only when reading emulator xenstore data or
+         * emulator context fails, and libxl__xc_domain_restore_done()
+         * is not called. In this case, the migration is not finished,
+         * so we cannot do failover.
+         */
+        LOG(ERROR, "migration fails");
+        crcs->callback(egc, crcs, ERROR_FAIL);
+        return;
+    case LIBXL_COLO_SUSPENDED:
+    case LIBXL_COLO_RESUMED:
+        /* disable logdirty first */
+        lds->callback = colo_disable_logdirty_done;
+        colo_disable_logdirty(crs, egc);
+        return;
+    default:
+        LOG(ERROR, "invalid status: %d", status);
+        crcs->callback(egc, crcs, ERROR_FAIL);
+    }
+}
+
+static void do_failover_done(libxl__egc *egc,
+                             libxl__colo_restore_checkpoint_state *crcs,
+                             int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+
+    EGC_GC;
+
+    if (rc)
+        LOG(ERROR, "cannot do failover");
+
+    libxl__colo_restore_teardown_done(egc, crs, rc);
+}
+
+static void colo_disable_logdirty_done(libxl__egc *egc,
+                                       libxl__logdirty_switch *lds,
+                                       int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+
+    EGC_GC;
+
+    if (rc)
+        LOG(WARN, "cannot disable logdirty");
+
+    if (crcs->status == LIBXL_COLO_SUSPENDED) {
+        /*
+         * failover when reading state from master, so no need to
+         * call libxl__domain_restore().
+         */
+        colo_resume_vm(egc, crcs, 0);
+        return;
+    }
+
+    /* If we cannot disable logdirty, we still can do failover */
+    crcs->callback(egc, crcs, 0);
+}
+
+static void libxl__colo_restore_teardown_done(libxl__egc *egc,
+                                              libxl__colo_restore_state *crs,
+                                              int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    EGC_GC;
+
+    /* convenience aliases */
+    const int domid = crs->domid;
+    const libxl_ctx *const ctx = libxl__gc_owner(gc);
+    xc_interface *const xch = ctx->xch;
+
+    if (!rc)
+        /* failover, no need to destroy the secondary vm */
+        goto out;
+
+    xc_domain_destroy(xch, domid);
+
+out:
+    if (crs->saved_cb) {
+        dcs->callback = crs->saved_cb;
+        crs->saved_cb = NULL;
+    }
+
+    dcs->callback(egc, dcs, rc, crs->domid);
+}
+
+/*
+ * checkpoint callbacks are called in the following order:
+ * 1. checkpoint
+ * 2. resume
+ * 3. should_checkpoint
+ * 4. suspend
+ */
+static void colo_common_write_stream_done(libxl__egc *egc,
+                                          libxl__stream_write_state *stream,
+                                          int rc);
+static void colo_common_read_stream_done(libxl__egc *egc,
+                                         libxl__stream_read_state *stream,
+                                         int rc);
+/* ======================== colo: checkpoint ======================= */
+/*
+ * Do the following things at each checkpoint:
+ *  1. read emulator xenstore data
+ *  2. read emulator context
+ *  3. REC_TYPE_CHECKPOINT_END
+ */
+static void libxl__colo_restore_domain_checkpoint_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__stream_read_state *srs = CONTAINER_OF(shs, *srs, shs);
+    libxl__domain_create_state *dcs = CONTAINER_OF(srs, *dcs, srs);
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+
+    crcs->callback = NULL;
+    dcs->srs.checkpoint_callback = colo_common_read_stream_done;
+    libxl__stream_read_start_checkpoint(shs->egc, &dcs->srs);
+}
+
+
+/* ===================== colo: resume secondary vm ===================== */
+/*
+ * Do the following things when resuming secondary vm the first time:
+ *  1. resume secondary vm
+ *  2. enable log dirty
+ *  3. setup checkpoint devices
+ *  4. write CHECKPOINT_SVM_READY
+ *  5. unpause secondary vm
+ *  6. write CHECKPOINT_SVM_RESUMED
+ *
+ * Do the following things when resuming secondary vm:
+ *  1. write CHECKPOINT_SVM_READY
+ *  2. resume secondary vm
+ *  3. write CHECKPOINT_SVM_RESUMED
+ */
+static void colo_send_svm_ready(libxl__egc *egc,
+                                libxl__colo_restore_checkpoint_state *crcs);
+static void colo_send_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_restore_checkpoint_state *crcs,
+                                     int rc);
+static void colo_restore_preresume_cb(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc);
+static void colo_restore_resume_vm(libxl__egc *egc,
+                                   libxl__colo_restore_checkpoint_state *crcs);
+static void colo_resume_vm_done(libxl__egc *egc,
+                                libxl__colo_restore_checkpoint_state *crcs,
+                                int rc);
+static void colo_write_svm_resumed(libxl__egc *egc,
+                                   libxl__colo_restore_checkpoint_state *crcs);
+static void colo_enable_logdirty_done(libxl__egc *egc,
+                                      libxl__logdirty_switch *lds,
+                                      int rc);
+static void colo_reenable_logdirty(libxl__egc *egc,
+                                   libxl__logdirty_switch *lds,
+                                   int rc);
+static void colo_reenable_logdirty_done(libxl__egc *egc,
+                                        libxl__logdirty_switch *lds,
+                                        int rc);
+static void colo_setup_checkpoint_devices(libxl__egc *egc,
+                                          libxl__colo_restore_state *crs);
+static void colo_restore_setup_cds_done(libxl__egc *egc,
+                                        libxl__checkpoint_devices_state *cds,
+                                        int rc);
+static void colo_unpause_svm(libxl__egc *egc,
+                             libxl__colo_restore_checkpoint_state *crcs);
+
+static void libxl__colo_restore_domain_resume_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__stream_read_state *srs = CONTAINER_OF(shs, *srs, shs);
+    libxl__domain_create_state *dcs = CONTAINER_OF(srs, *dcs, srs);
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+
+    if (crcs->teardown_devices)
+        colo_send_svm_ready(shs->egc, crcs);
+    else
+        colo_restore_resume_vm(shs->egc, crcs);
+}
+
+static void colo_send_svm_ready(libxl__egc *egc,
+                                libxl__colo_restore_checkpoint_state *crcs)
+{
+    libxl_sr_checkpoint_state srcs = { .id = CHECKPOINT_SVM_READY };
+
+    crcs->callback = colo_send_svm_ready_done;
+    crcs->sws.checkpoint_callback = colo_common_write_stream_done;
+    libxl__stream_write_checkpoint_state(egc, &crcs->sws, &srcs);
+}
+
+static void colo_send_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_restore_checkpoint_state *crcs,
+                                     int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *cds = &dcs->cds;
+
+    if (!crcs->preresume) {
+        crcs->preresume = true;
+        colo_unpause_svm(egc, crcs);
+        return;
+    }
+
+    cds->callback = colo_restore_preresume_cb;
+    libxl__checkpoint_devices_preresume(egc, cds);
+}
+
+static void colo_restore_preresume_cb(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc)
+{
+    libxl__colo_restore_state *crs = cds->concrete_data;
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->srs.shs;
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "preresume fails");
+        goto out;
+    }
+
+    colo_restore_resume_vm(egc, crcs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_restore_resume_vm(libxl__egc *egc,
+                                   libxl__colo_restore_checkpoint_state *crcs)
+{
+
+    crcs->callback = colo_resume_vm_done;
+    colo_resume_vm(egc, crcs, 1);
+}
+
+static void colo_resume_vm_done(libxl__egc *egc,
+                                libxl__colo_restore_checkpoint_state *crcs,
+                                int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+    libxl__save_helper_state *const shs = &dcs->srs.shs;
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "cannot resume secondary vm");
+        goto out;
+    }
+
+    crcs->status = LIBXL_COLO_RESUMED;
+
+    /* avoid calling stream->completion_callback() more than once */
+    if (crs->saved_cb) {
+        dcs->callback = crs->saved_cb;
+        crs->saved_cb = NULL;
+
+        dcs->srs.completion_callback = NULL;
+
+        lds->callback = colo_enable_logdirty_done;
+        colo_enable_logdirty(crs, egc);
+        return;
+    }
+
+    colo_write_svm_resumed(egc, crcs);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_write_svm_resumed(libxl__egc *egc,
+                                   libxl__colo_restore_checkpoint_state *crcs)
+{
+    libxl_sr_checkpoint_state srcs = { .id = CHECKPOINT_SVM_RESUMED };
+
+    crcs->callback = NULL;
+    crcs->sws.checkpoint_callback = colo_common_write_stream_done;
+    libxl__stream_write_checkpoint_state(egc, &crcs->sws, &srcs);
+}
+
+static void colo_enable_logdirty_done(libxl__egc *egc,
+                                      libxl__logdirty_switch *lds,
+                                      int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+
+    EGC_GC;
+
+    if (rc) {
+        /*
+         * log-dirty already enabled? There's no test op,
+         * so attempt to disable then reenable it
+         */
+        lds->callback = colo_reenable_logdirty;
+        colo_disable_logdirty(crs, egc);
+        return;
+    }
+
+    colo_setup_checkpoint_devices(egc, crs);
+}
+
+static void colo_reenable_logdirty(libxl__egc *egc,
+                                   libxl__logdirty_switch *lds,
+                                   int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+    libxl__save_helper_state *const shs = &dcs->srs.shs;
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "cannot enable logdirty");
+        goto out;
+    }
+
+    lds->callback = colo_reenable_logdirty_done;
+    colo_enable_logdirty(crs, egc);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_reenable_logdirty_done(libxl__egc *egc,
+                                        libxl__logdirty_switch *lds,
+                                        int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->srs.shs;
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "cannot enable logdirty");
+        goto out;
+    }
+
+    colo_setup_checkpoint_devices(egc, crcs->crs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+/*
+ * We cannot setup checkpoint devices in libxl__colo_restore_setup(),
+ * because the guest is not ready.
+ */
+static void colo_setup_checkpoint_devices(libxl__egc *egc,
+                                          libxl__colo_restore_state *crs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *cds = &dcs->cds;
+    libxl__save_helper_state *const shs = &dcs->srs.shs;
+
+    STATE_AO_GC(crs->ao);
+
+    /* TODO: disk/nic support */
+    cds->device_kind_flags = 0;
+    cds->callback = colo_restore_setup_cds_done;
+    cds->ao = ao;
+    cds->domid = crs->domid;
+    cds->ops = colo_restore_ops;
+
+    if (init_device_subkind(cds))
+        goto out;
+
+    crcs->teardown_devices = 1;
+
+    libxl__checkpoint_devices_setup(egc, cds);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_restore_setup_cds_done(libxl__egc *egc,
+                                        libxl__checkpoint_devices_state *cds,
+                                        int rc)
+{
+    libxl__colo_restore_state *crs = cds->concrete_data;
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->srs.shs;
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "COLO: failed to setup device for guest with domid %u",
+            cds->domid);
+        goto out;
+    }
+
+    colo_send_svm_ready(egc, crcs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+static void colo_unpause_svm(libxl__egc *egc,
+                             libxl__colo_restore_checkpoint_state *crcs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int rc;
+
+    /* Convenience aliases */
+    const uint32_t domid = crcs->crs->domid;
+    libxl__save_helper_state *const shs = &dcs->srs.shs;
+
+    EGC_GC;
+
+    /* We have enabled secondary vm's logdirty, so we can unpause it now */
+    rc = libxl_domain_unpause(CTX, domid);
+    if (rc) {
+        LOG(ERROR, "cannot unpause secondary vm");
+        goto out;
+    }
+
+    colo_write_svm_resumed(egc, crcs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+}
+
+
+/* ===================== colo: wait new checkpoint ===================== */
+static void colo_restore_commit_cb(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds,
+                                   int rc);
+static void colo_stream_read_done(libxl__egc *egc,
+                                  libxl__colo_restore_checkpoint_state *crcs,
+                                  int id);
+
+static void libxl__colo_restore_domain_should_checkpoint_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__stream_read_state *srs = CONTAINER_OF(shs, *srs, shs);
+    libxl__domain_create_state *dcs = CONTAINER_OF(srs, *dcs, srs);
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *cds = &dcs->cds;
+
+    cds->callback = colo_restore_commit_cb;
+    libxl__checkpoint_devices_commit(shs->egc, cds);
+}
+
+static void colo_restore_commit_cb(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds,
+                                   int rc)
+{
+    libxl__colo_restore_state *crs = cds->concrete_data;
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "commit fails");
+        goto out;
+    }
+
+    crcs->callback = colo_stream_read_done;
+    dcs->srs.checkpoint_callback = colo_common_read_stream_done;
+    libxl__stream_read_checkpoint_state(egc, &dcs->srs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->srs.shs, 0);
+}
+
+static void colo_stream_read_done(libxl__egc *egc,
+                                  libxl__colo_restore_checkpoint_state *crcs,
+                                  int id)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int ok = 0;
+
+    EGC_GC;
+
+    if (id != CHECKPOINT_NEW) {
+        LOG(ERROR, "invalid section: %d", id);
+        goto out;
+    }
+
+    ok = 1;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->srs.shs, ok);
+}
+
+
+/* ===================== colo: suspend secondary vm ===================== */
+/*
+ * Do the following things when suspending secondary vm:
+ *  1. suspend secondary vm
+ *  2. send CHECKPOINT_SVM_SUSPENDED
+ */
+static void colo_suspend_vm_done(libxl__egc *egc,
+                                 libxl__domain_suspend_state *dsps,
+                                 int rc);
+static void colo_restore_postsuspend_cb(libxl__egc *egc,
+                                        libxl__checkpoint_devices_state *cds,
+                                        int rc);
+
+static void libxl__colo_restore_domain_suspend_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__stream_read_state *srs = CONTAINER_OF(shs, *srs, shs);
+    libxl__domain_create_state *dcs = CONTAINER_OF(srs, *dcs, srs);
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+
+    STATE_AO_GC(dcs->ao);
+
+    /* Convenience aliases */
+    libxl__domain_suspend_state *const dsps = &crcs->dsps;
+
+    /* suspend secondary vm */
+    dsps->callback_common_done = colo_suspend_vm_done;
+
+    libxl__domain_suspend(shs->egc, dsps);
+}
+
+static void colo_suspend_vm_done(libxl__egc *egc,
+                                 libxl__domain_suspend_state *dsps,
+                                 int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(dsps, *crcs, dsps);
+    libxl__colo_restore_state *crs = crcs->crs;
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *cds = &dcs->cds;
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "cannot suspend secondary vm");
+        goto out;
+    }
+
+    crcs->status = LIBXL_COLO_SUSPENDED;
+
+    cds->callback = colo_restore_postsuspend_cb;
+    libxl__checkpoint_devices_postsuspend(egc, cds);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->srs.shs, !rc);
+}
+
+static void colo_restore_postsuspend_cb(libxl__egc *egc,
+                                        libxl__checkpoint_devices_state *cds,
+                                        int rc)
+{
+    libxl__colo_restore_state *crs = cds->concrete_data;
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+    libxl_sr_checkpoint_state srcs = { .id = CHECKPOINT_SVM_SUSPENDED };
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "postsuspend fails");
+        goto out;
+    }
+
+    crcs->callback = NULL;
+    crcs->sws.checkpoint_callback = colo_common_write_stream_done;
+    libxl__stream_write_checkpoint_state(egc, &crcs->sws, &srcs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->srs.shs, !rc);
+}
+
+
+/* ===================== colo: common callback ===================== */
+static void colo_common_write_stream_done(libxl__egc *egc,
+                                          libxl__stream_write_state *stream,
+                                          int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs =
+        CONTAINER_OF(stream, *crcs, sws);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int ok;
+
+    EGC_GC;
+
+    if (rc < 0) {
+        /* TODO: it may be an internal error, but we don't know */
+        LOG(ERROR, "sending data fails");
+        ok = 2;
+        goto out;
+    }
+
+    if (!crcs->callback) {
+        /* Everything is OK */
+        ok = 1;
+        goto out;
+    }
+
+    crcs->callback(egc, crcs, 0);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->srs.shs, ok);
+}
+
+static void colo_common_read_stream_done(libxl__egc *egc,
+                                         libxl__stream_read_state *stream,
+                                         int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(stream, *dcs, srs);
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+    int ok;
+
+    EGC_GC;
+
+    if (rc < 0) {
+        /* TODO: it may be an internal error, but we don't know */
+        LOG(ERROR, "reading data fails");
+        ok = 2;
+        goto out;
+    }
+
+    if (!crcs->callback) {
+        /* Everything is OK */
+        ok = 1;
+        goto out;
+    }
+
+    /* rc contains the id */
+    crcs->callback(egc, crcs, rc);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->srs.shs, ok);
+}
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index c2eeb9d..62aa7c9 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -19,6 +19,7 @@
 
 #include "libxl_internal.h"
 #include "libxl_arch.h"
+#include "libxl_colo.h"
 
 #include <xc_dom.h>
 #include <xenguest.h>
@@ -960,6 +961,23 @@ static void domcreate_console_available(libxl__egc *egc,
                                         dcs->aop_console_how.for_event));
 }
 
+static void libxl__colo_restore_setup_done(libxl__egc *egc,
+                                           libxl__colo_restore_state *crs,
+                                           int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "colo restore setup fails: %d", rc);
+        domcreate_stream_done(egc, &dcs->srs, rc);
+        return;
+    }
+
+    libxl__stream_read_start(egc, &dcs->srs);
+}
+
 static void domcreate_bootloader_done(libxl__egc *egc,
                                       libxl__bootloader_state *bl,
                                       int rc)
@@ -975,6 +993,8 @@ static void domcreate_bootloader_done(libxl__egc *egc,
     libxl__srm_restore_autogen_callbacks *const callbacks =
         &dcs->srs.shs.callbacks.restore.a;
     const int checkpointed_stream = dcs->restore_params.checkpointed_stream;
+    libxl__colo_restore_state *const crs = &dcs->crs;
+    libxl_domain_build_info *const info = &d_config->b_info;
 
     if (rc) {
         domcreate_rebuild_done(egc, dcs, rc);
@@ -1004,6 +1024,13 @@ static void domcreate_bootloader_done(libxl__egc *egc,
     /* Restore */
     callbacks->checkpoint = libxl__remus_domain_restore_checkpoint_callback;
 
+    /* COLO only supports HVM now */
+    if (info->type != LIBXL_DOMAIN_TYPE_HVM &&
+        checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
     rc = libxl__build_pre(gc, domid, d_config, state);
     if (rc)
         goto out;
@@ -1017,6 +1044,16 @@ static void domcreate_bootloader_done(libxl__egc *egc,
 
     if (restore_fd >= 0) {
         switch (checkpointed_stream) {
+        case LIBXL_CHECKPOINTED_STREAM_COLO:
+            /* colo restore setup */
+            crs->ao = ao;
+            crs->domid = domid;
+            crs->send_fd = dcs->send_fd;
+            crs->recv_fd = restore_fd;
+            crs->hvm = (info->type == LIBXL_DOMAIN_TYPE_HVM);
+            crs->callback = libxl__colo_restore_setup_done;
+            libxl__colo_restore_setup(egc, crs);
+            break;
         case LIBXL_CHECKPOINTED_STREAM_REMUS:
             libxl__remus_restore_setup(egc, dcs);
             /* fall through */
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 5f9722b..79b0c6d 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3468,6 +3468,23 @@ libxl__stream_read_inuse(const libxl__stream_read_state *stream)
     return stream->running;
 }
 
+/* colo related structure */
+typedef struct libxl__colo_restore_state libxl__colo_restore_state;
+typedef void libxl__colo_callback(libxl__egc *,
+                                  libxl__colo_restore_state *, int rc);
+struct libxl__colo_restore_state {
+    /* must be set by caller of libxl__colo_(setup|teardown) */
+    libxl__ao *ao;
+    uint32_t domid;
+    int send_fd;
+    int recv_fd;
+    int hvm;
+    libxl__colo_callback *callback;
+
+    /* private, colo restore checkpoint state */
+    libxl__domain_create_cb *saved_cb;
+    void *crcs;
+};
 
 struct libxl__domain_create_state {
     /* filled in by user */
@@ -3484,6 +3501,8 @@ struct libxl__domain_create_state {
     /* private to domain_create */
     int guest_domid;
     libxl__domain_build_state build_state;
+    libxl__colo_restore_state crs;
+    libxl__checkpoint_devices_state cds;
     libxl__bootloader_state bl;
     libxl__stub_dm_spawn_state dmss;
         /* If we're not doing stubdom, we use only dmss.dm,
diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c
index 631e3e2..5a250c7 100644
--- a/tools/libxl/libxl_save_callout.c
+++ b/tools/libxl/libxl_save_callout.c
@@ -15,6 +15,7 @@
 #include "libxl_osdeps.h"
 
 #include "libxl_internal.h"
+#include "libxl_colo.h"
 
 /* stream_fd is as from the caller (eventually, the application).
  * It may be 0, 1 or 2, in which case we need to dup it elsewhere.
@@ -68,7 +69,11 @@ void libxl__xc_domain_restore(libxl__egc *egc, libxl__domain_create_state *dcs,
     shs->ao = ao;
     shs->domid = domid;
     shs->recv_callback = libxl__srm_callout_received_restore;
-    shs->completion_callback = libxl__xc_domain_restore_done;
+    if (dcs->restore_params.checkpointed_stream ==
+                                                LIBXL_CHECKPOINTED_STREAM_COLO)
+        shs->completion_callback = libxl__colo_restore_teardown;
+    else
+        shs->completion_callback = libxl__xc_domain_restore_done;
     shs->caller_state = dcs;
     shs->need_results = 1;
 
diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c
index e6421bc..eaae72b 100644
--- a/tools/libxl/libxl_stream_read.c
+++ b/tools/libxl/libxl_stream_read.c
@@ -832,6 +832,18 @@ void libxl__xc_domain_restore_done(libxl__egc *egc, void *dcs_void,
      */
     if (libxl__stream_read_inuse(stream)) {
         switch (checkpointed_stream) {
+        case LIBXL_CHECKPOINTED_STREAM_COLO:
+            if (stream->completion_callback) {
+                /*
+                 * restore, just build the secondary vm, don't close
+                 * the stream
+                 */
+                stream->completion_callback(egc, stream, 0);
+            } else {
+                /* failover, just close the stream */
+                stream_complete(egc, stream, 0);
+            }
+            break;
         case LIBXL_CHECKPOINTED_STREAM_REMUS:
             /* failover */
             stream_complete(egc, stream, 0);
-- 
2.5.0

* [PATCH v9 12/25] primary vm suspend/resume/checkpoint code
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (10 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 11/25] secondary vm suspend/resume/checkpoint code Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 13/25] libxc/restore: support COLO restore Wen Congyang
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

The primary repeats the following steps for every checkpoint:
1. Suspend primary vm
   a. Suspend primary vm
   b. Do postsuspend
   c. Read CHECKPOINT_SVM_SUSPENDED sent by secondary
2. Resume primary vm
   a. Read CHECKPOINT_SVM_READY from secondary
   b. Do preresume
   c. Resume primary vm
   d. Read CHECKPOINT_SVM_RESUMED from secondary
3. Wait for a new checkpoint
   a. Wait for a new checkpoint (not implemented)
   b. Send CHECKPOINT_NEW to secondary
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
---
 tools/libxl/Makefile          |   2 +-
 tools/libxl/libxl.c           |   6 +-
 tools/libxl/libxl_colo.h      |  10 +
 tools/libxl/libxl_colo_save.c | 552 ++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_dom_save.c  |  13 +-
 tools/libxl/libxl_internal.h  | 168 +++++++------
 tools/libxl/libxl_types.idl   |   1 +
 7 files changed, 672 insertions(+), 80 deletions(-)
 create mode 100644 tools/libxl/libxl_colo_save.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 19a95a9..b11cf34 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -63,7 +63,7 @@ LIBXL_OBJS-y += libxl_no_convert_callout.o
 endif
 
 LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
-LIBXL_OBJS-y += libxl_colo_restore.o
+LIBXL_OBJS-y += libxl_colo_restore.o libxl_colo_save.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 481824d..7d227d7 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -17,6 +17,7 @@
 #include "libxl_osdeps.h"
 
 #include "libxl_internal.h"
+#include "libxl_colo.h"
 
 #define PAGE_TO_MEMKB(pages) ((pages) * 4)
 #define BACKEND_STRING_SIZE 5
@@ -882,7 +883,10 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
     assert(info);
 
     /* Point of no return */
-    libxl__remus_setup(egc, &dss->rs);
+    if (libxl_defbool_val(info->colo))
+        libxl__colo_save_setup(egc, &dss->css);
+    else
+        libxl__remus_setup(egc, &dss->rs);
     return AO_INPROGRESS;
 
  out:
diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
index 8bea1a2..39515c4 100644
--- a/tools/libxl/libxl_colo.h
+++ b/tools/libxl/libxl_colo.h
@@ -21,4 +21,14 @@ extern void libxl__colo_restore_setup(libxl__egc *egc,
 extern void libxl__colo_restore_teardown(libxl__egc *egc, void *dcs_void,
                                          int ret, int retval, int errnoval);
 
+extern void libxl__colo_save_domain_suspend_callback(void *data);
+extern void libxl__colo_save_domain_checkpoint_callback(void *data);
+extern void libxl__colo_save_domain_resume_callback(void *data);
+extern void libxl__colo_save_domain_should_checkpoint_callback(void *data);
+extern void libxl__colo_save_setup(libxl__egc *egc,
+                                   libxl__colo_save_state *css);
+extern void libxl__colo_save_teardown(libxl__egc *egc,
+                                      libxl__colo_save_state *css,
+                                      int rc);
+
 #endif
diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
new file mode 100644
index 0000000..d6b4e7b
--- /dev/null
+++ b/tools/libxl/libxl_colo_save.c
@@ -0,0 +1,552 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *         Yang Hongyang <hongyang.yang@easystack.cn>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+#include "libxl_colo.h"
+
+static const libxl__checkpoint_device_instance_ops *colo_ops[] = {
+    NULL,
+};
+
+/* ================= helper functions ================= */
+static int init_device_subkind(libxl__checkpoint_devices_state *cds)
+{
+    /* init device subkind-specific state in the libxl ctx */
+    int rc;
+    STATE_AO_GC(cds->ao);
+
+    rc = 0;
+    return rc;
+}
+
+static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
+{
+    /* cleanup device subkind-specific state in the libxl ctx */
+    STATE_AO_GC(cds->ao);
+}
+
+/* ================= colo: setup save environment ================= */
+static void colo_save_setup_done(libxl__egc *egc,
+                                 libxl__checkpoint_devices_state *cds,
+                                 int rc);
+static void colo_save_setup_failed(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds,
+                                   int rc);
+
+void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
+{
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = &dss->cds;
+
+    STATE_AO_GC(dss->ao);
+
+    if (dss->type != LIBXL_DOMAIN_TYPE_HVM) {
+        LOG(ERROR, "COLO only supports hvm now");
+        goto out;
+    }
+
+    css->send_fd = dss->fd;
+    css->recv_fd = dss->recv_fd;
+    css->svm_running = false;
+
+    /* TODO: disk/nic support */
+    cds->device_kind_flags = 0;
+    cds->ops = colo_ops;
+    cds->callback = colo_save_setup_done;
+    cds->ao = ao;
+    cds->domid = dss->domid;
+    cds->concrete_data = css;
+
+    css->srs.ao = ao;
+    css->srs.fd = css->recv_fd;
+    css->srs.back_channel = true;
+    libxl__stream_read_start(egc, &css->srs);
+
+    if (init_device_subkind(cds))
+        goto out;
+
+    libxl__checkpoint_devices_setup(egc, &dss->cds);
+
+    return;
+
+out:
+    libxl__ao_complete(egc, ao, ERROR_FAIL);
+}
+
+static void colo_save_setup_done(libxl__egc *egc,
+                                 libxl__checkpoint_devices_state *cds,
+                                 int rc)
+{
+    libxl__colo_save_state *css = cds->concrete_data;
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+    EGC_GC;
+
+    if (!rc) {
+        libxl__domain_save(egc, dss);
+        return;
+    }
+
+    LOG(ERROR, "COLO: failed to setup device for guest with domid %u",
+        dss->domid);
+    cds->callback = colo_save_setup_failed;
+    libxl__checkpoint_devices_teardown(egc, cds);
+}
+
+static void colo_save_setup_failed(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds,
+                                   int rc)
+{
+    STATE_AO_GC(cds->ao);
+
+    if (rc)
+        LOG(ERROR, "COLO: failed to teardown device after setup failed"
+            " for guest with domid %u, rc %d", cds->domid, rc);
+
+    cleanup_device_subkind(cds);
+    libxl__ao_complete(egc, ao, rc);
+}
+
+
+/* ================= colo: teardown save environment ================= */
+static void colo_teardown_done(libxl__egc *egc,
+                               libxl__checkpoint_devices_state *cds,
+                               int rc);
+
+void libxl__colo_save_teardown(libxl__egc *egc,
+                               libxl__colo_save_state *css,
+                               int rc)
+{
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    EGC_GC;
+
+    LOG(WARN, "COLO: Domain suspend terminated with rc %d,"
+        " teardown COLO devices...", rc);
+
+    libxl__stream_read_abort(egc, &css->srs, 1);
+
+    dss->cds.callback = colo_teardown_done;
+    libxl__checkpoint_devices_teardown(egc, &dss->cds);
+    return;
+}
+
+static void colo_teardown_done(libxl__egc *egc,
+                               libxl__checkpoint_devices_state *cds,
+                               int rc)
+{
+    libxl__colo_save_state *css = cds->concrete_data;
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    cleanup_device_subkind(cds);
+    dss->callback(egc, dss, rc);
+}
+
+/*
+ * checkpoint callbacks are called in the following order:
+ * 1. suspend
+ * 2. checkpoint
+ * 3. resume
+ * 4. should_checkpoint
+ */
+static void colo_common_write_stream_done(libxl__egc *egc,
+                                          libxl__stream_write_state *stream,
+                                          int rc);
+static void colo_common_read_stream_done(libxl__egc *egc,
+                                         libxl__stream_read_state *stream,
+                                         int rc);
+/* ===================== colo: suspend primary vm ===================== */
+
+static void colo_read_svm_suspended_done(libxl__egc *egc,
+                                         libxl__colo_save_state *css,
+                                         int id);
+/*
+ * Do the following things when suspending primary vm:
+ * 1. suspend primary vm
+ * 2. do postsuspend
+ * 3. read CHECKPOINT_SVM_SUSPENDED
+ * 4. read secondary vm's dirty pages
+ */
+static void colo_suspend_primary_vm_done(libxl__egc *egc,
+                                         libxl__domain_suspend_state *dsps,
+                                         int rc);
+static void colo_postsuspend_cb(libxl__egc *egc,
+                                libxl__checkpoint_devices_state *cds,
+                                int rc);
+
+void libxl__colo_save_domain_suspend_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__egc *egc = shs->egc;
+    libxl__stream_write_state *sws = CONTAINER_OF(shs, *sws, shs);
+    libxl__domain_save_state *dss = sws->dss;
+
+    /* Convenience aliases */
+    libxl__domain_suspend_state *dsps = &dss->dsps;
+
+    dsps->callback_common_done = colo_suspend_primary_vm_done;
+    libxl__domain_suspend(egc, dsps);
+}
+
+static void colo_suspend_primary_vm_done(libxl__egc *egc,
+                                         libxl__domain_suspend_state *dsps,
+                                         int rc)
+{
+    libxl__domain_save_state *dss = CONTAINER_OF(dsps, *dss, dsps);
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "cannot suspend primary vm");
+        goto out;
+    }
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = &dss->cds;
+
+    cds->callback = colo_postsuspend_cb;
+    libxl__checkpoint_devices_postsuspend(egc, cds);
+    return;
+
+out:
+    dss->rc = rc;
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->sws.shs, !rc);
+}
+
+static void colo_postsuspend_cb(libxl__egc *egc,
+                                libxl__checkpoint_devices_state *cds,
+                                int rc)
+{
+    libxl__colo_save_state *css = cds->concrete_data;
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "postsuspend fails");
+        goto out;
+    }
+
+    if (!css->svm_running) {
+        rc = 0;
+        goto out;
+    }
+
+    /*
+     * read CHECKPOINT_SVM_SUSPENDED
+     */
+    css->callback = colo_read_svm_suspended_done;
+    css->srs.checkpoint_callback = colo_common_read_stream_done;
+    libxl__stream_read_checkpoint_state(egc, &css->srs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->sws.shs, !rc);
+}
+
+static void colo_read_svm_suspended_done(libxl__egc *egc,
+                                         libxl__colo_save_state *css,
+                                         int id)
+{
+    int ok = 0;
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    EGC_GC;
+
+    if (id != CHECKPOINT_SVM_SUSPENDED) {
+        LOG(ERROR, "invalid section: %d, expected: %d", id,
+            CHECKPOINT_SVM_SUSPENDED);
+        goto out;
+    }
+
+    ok = 1;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->sws.shs, ok);
+}
+
+
+/* ===================== colo: send tailbuf ========================== */
+void libxl__colo_save_domain_checkpoint_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__stream_write_state *sws = CONTAINER_OF(shs, *sws, shs);
+    libxl__domain_save_state *dss = sws->dss;
+
+    /* Convenience aliases */
+    libxl__colo_save_state *const css = &dss->css;
+
+    /* write emulator xenstore data, emulator context, and checkpoint end */
+    css->callback = NULL;
+    dss->sws.checkpoint_callback = colo_common_write_stream_done;
+    libxl__stream_write_start_checkpoint(shs->egc, &dss->sws);
+}
+
+/* ===================== colo: resume primary vm ===================== */
+/*
+ * Do the following things when resuming primary vm:
+ *  1. read CHECKPOINT_SVM_READY
+ *  2. do preresume
+ *  3. resume primary vm
+ *  4. read CHECKPOINT_SVM_RESUMED
+ */
+static void colo_read_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_save_state *css,
+                                     int id);
+static void colo_preresume_cb(libxl__egc *egc,
+                              libxl__checkpoint_devices_state *cds,
+                              int rc);
+static void colo_read_svm_resumed_done(libxl__egc *egc,
+                                       libxl__colo_save_state *css,
+                                       int id);
+
+void libxl__colo_save_domain_resume_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__egc *egc = shs->egc;
+    libxl__stream_write_state *sws = CONTAINER_OF(shs, *sws, shs);
+    libxl__domain_save_state *dss = sws->dss;
+
+    /* Convenience aliases */
+    libxl__colo_save_state *const css = &dss->css;
+
+    EGC_GC;
+
+    /* read CHECKPOINT_SVM_READY */
+    css->callback = colo_read_svm_ready_done;
+    css->srs.checkpoint_callback = colo_common_read_stream_done;
+    libxl__stream_read_checkpoint_state(egc, &css->srs);
+}
+
+static void colo_read_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_save_state *css,
+                                     int id)
+{
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    EGC_GC;
+
+    if (id != CHECKPOINT_SVM_READY) {
+        LOG(ERROR, "invalid section: %d, expected: %d", id,
+            CHECKPOINT_SVM_READY);
+        goto out;
+    }
+
+    css->svm_running = true;
+    dss->cds.callback = colo_preresume_cb;
+    libxl__checkpoint_devices_preresume(egc, &dss->cds);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->sws.shs, 0);
+}
+
+static void colo_preresume_cb(libxl__egc *egc,
+                              libxl__checkpoint_devices_state *cds,
+                              int rc)
+{
+    libxl__colo_save_state *css = cds->concrete_data;
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "preresume fails");
+        goto out;
+    }
+
+    /* Resumes the domain and the device model */
+    if (libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1)) {
+        LOG(ERROR, "cannot resume primary vm");
+        goto out;
+    }
+
+    /* read CHECKPOINT_SVM_RESUMED */
+    css->callback = colo_read_svm_resumed_done;
+    css->srs.checkpoint_callback = colo_common_read_stream_done;
+    libxl__stream_read_checkpoint_state(egc, &css->srs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->sws.shs, 0);
+}
+
+static void colo_read_svm_resumed_done(libxl__egc *egc,
+                                       libxl__colo_save_state *css,
+                                       int id)
+{
+    int ok = 0;
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    EGC_GC;
+
+    if (id != CHECKPOINT_SVM_RESUMED) {
+        LOG(ERROR, "invalid section: %d, expected: %d", id,
+            CHECKPOINT_SVM_RESUMED);
+        goto out;
+    }
+
+    ok = 1;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->sws.shs, ok);
+}
+
+
+/* ===================== colo: wait new checkpoint ===================== */
+/*
+ * Do the following things:
+ * 1. do commit
+ * 2. wait for a new checkpoint
+ * 3. write CHECKPOINT_NEW
+ */
+static void colo_device_commit_cb(libxl__egc *egc,
+                                  libxl__checkpoint_devices_state *cds,
+                                  int rc);
+static void colo_start_new_checkpoint(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc);
+
+void libxl__colo_save_domain_should_checkpoint_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__stream_write_state *sws = CONTAINER_OF(shs, *sws, shs);
+    libxl__domain_save_state *dss = sws->dss;
+    libxl__egc *egc = dss->sws.shs.egc;
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = &dss->cds;
+
+    cds->callback = colo_device_commit_cb;
+    libxl__checkpoint_devices_commit(egc, cds);
+}
+
+static void colo_device_commit_cb(libxl__egc *egc,
+                                  libxl__checkpoint_devices_state *cds,
+                                  int rc)
+{
+    libxl__colo_save_state *css = cds->concrete_data;
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    EGC_GC;
+
+    if (rc) {
+        LOG(ERROR, "commit fails");
+        goto out;
+    }
+
+    /* TODO: wait a new checkpoint */
+    colo_start_new_checkpoint(egc, cds, 0);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->sws.shs, 0);
+}
+
+static void colo_start_new_checkpoint(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc)
+{
+    libxl__colo_save_state *css = cds->concrete_data;
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+    libxl_sr_checkpoint_state srcs = { .id = CHECKPOINT_NEW };
+
+    if (rc)
+        goto out;
+
+    /* write CHECKPOINT_NEW */
+    css->callback = NULL;
+    dss->sws.checkpoint_callback = colo_common_write_stream_done;
+    libxl__stream_write_checkpoint_state(egc, &dss->sws, &srcs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->sws.shs, 0);
+}
+
+
+/* ===================== colo: common callback ===================== */
+static void colo_common_write_stream_done(libxl__egc *egc,
+                                          libxl__stream_write_state *stream,
+                                          int rc)
+{
+    libxl__domain_save_state *dss = CONTAINER_OF(stream, *dss, sws);
+    int ok;
+
+    /* Convenience aliases */
+    libxl__colo_save_state *const css = &dss->css;
+
+    EGC_GC;
+
+    if (rc < 0) {
+        /* TODO: it may be an internal error, but we don't know */
+        LOG(ERROR, "sending data fails");
+        ok = 0;
+        goto out;
+    }
+
+    if (!css->callback) {
+        /* Everything is OK */
+        ok = 1;
+        goto out;
+    }
+
+    css->callback(egc, css, 0);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->sws.shs, ok);
+}
+
+static void colo_common_read_stream_done(libxl__egc *egc,
+                                         libxl__stream_read_state *stream,
+                                         int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(stream, *css, srs);
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+    int ok;
+
+    EGC_GC;
+
+    if (rc < 0) {
+        /* TODO: it may be an internal error, but we don't know */
+        LOG(ERROR, "reading data fails");
+        ok = 0;
+        goto out;
+    }
+
+    if (!css->callback) {
+        /* Everything is OK */
+        ok = 1;
+        goto out;
+    }
+
+    /* rc contains the id */
+    css->callback(egc, css, rc);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->sws.shs, ok);
+}
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index 5fd9db2..776306f 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -16,6 +16,7 @@
 #include "libxl_osdeps.h" /* must come before any other headers */
 
 #include "libxl_internal.h"
+#include "libxl_colo.h"
 
 #include <xen/errno.h>
 
@@ -401,6 +402,11 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
         callbacks->suspend = libxl__remus_domain_suspend_callback;
         callbacks->postcopy = libxl__remus_domain_resume_callback;
         callbacks->checkpoint = libxl__remus_domain_save_checkpoint_callback;
+    } else if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
+        callbacks->suspend = libxl__colo_save_domain_suspend_callback;
+        callbacks->postcopy = libxl__colo_save_domain_resume_callback;
+        callbacks->checkpoint = libxl__colo_save_domain_checkpoint_callback;
+        callbacks->should_checkpoint = libxl__colo_save_domain_should_checkpoint_callback;
     } else
         callbacks->suspend = libxl__domain_suspend_callback;
 
@@ -442,12 +448,15 @@ static void domain_save_done(libxl__egc *egc,
 
     if (dss->remus) {
         /*
-         * With Remus, if we reach this point, it means either
+         * With Remus/COLO, if we reach this point, it means either
          * backup died or some network error occurred preventing us
          * from sending checkpoints. Teardown the network buffers and
          * release netlink resources.  This is an async op.
          */
-        libxl__remus_teardown(egc, &dss->rs, rc);
+        if (libxl_defbool_val(dss->remus->colo))
+            libxl__colo_save_teardown(egc, &dss->css, rc);
+        else
+            libxl__remus_teardown(egc, &dss->rs, rc);
         return;
     }
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 79b0c6d..54903af 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2798,7 +2798,7 @@ typedef struct libxl__save_helper_state {
 /*
  * The abstract checkpoint device layer exposes a common
  * set of API to [external] libxl for manipulating devices attached to
- * a guest protected by Remus. The device layer also exposes a set of
+ * a guest protected by Remus/COLO. The device layer also exposes a set of
  * [internal] interfaces that every device type must implement.
  *
  * The following API are exposed to libxl:
@@ -2816,7 +2816,7 @@ typedef struct libxl__save_helper_state {
  *  +libxl__checkpoint_devices_commit
  *
  * Each device type needs to implement the interfaces specified in
- * the libxl__checkpoint_device_instance_ops if it wishes to support Remus.
+ * the libxl__checkpoint_device_instance_ops if it wishes to support Remus/COLO.
  *
  * The high-level control flow through the checkpoint device layer is shown
  * below:
@@ -2836,7 +2836,7 @@ typedef struct libxl__checkpoint_device_instance_ops libxl__checkpoint_device_in
 
 /*
  * Interfaces to be implemented by every device subkind that wishes to
- * support Remus. Functions must be implemented unless otherwise
+ * support Remus/COLO. Functions must be implemented unless otherwise
  * stated. Many of these functions are asynchronous. They call
  * dev->aodev.callback when done.  The actual implementations may be
  * synchronous and call dev->aodev.callback directly (as the last
@@ -3013,6 +3013,89 @@ static inline bool libxl__conversion_helper_inuse
                     (const libxl__conversion_helper_state *chs)
 { return libxl__ev_child_inuse(&chs->child); }
 
+/* State for manipulating a libxl migration v2 stream */
+typedef struct libxl__domain_create_state libxl__domain_create_state;
+
+typedef void libxl__domain_create_cb(libxl__egc *egc,
+                                     libxl__domain_create_state*,
+                                     int rc, uint32_t domid);
+
+typedef struct libxl__stream_read_state libxl__stream_read_state;
+
+typedef struct libxl__sr_record_buf {
+    /* private to stream read helper */
+    LIBXL_STAILQ_ENTRY(struct libxl__sr_record_buf) entry;
+    libxl__sr_rec_hdr hdr;
+    void *body; /* iff hdr.length != 0 */
+} libxl__sr_record_buf;
+
+struct libxl__stream_read_state {
+    /* filled by the user */
+    libxl__ao *ao;
+    libxl__domain_create_state *dcs;
+    int fd;
+    bool legacy;
+    bool back_channel;
+    void (*completion_callback)(libxl__egc *egc,
+                                libxl__stream_read_state *srs,
+                                int rc);
+    void (*checkpoint_callback)(libxl__egc *egc,
+                                libxl__stream_read_state *srs,
+                                int rc);
+    /* Private */
+    int rc;
+    bool running;
+    bool in_checkpoint;
+    bool sync_teardown; /* Only used to coordinate shutdown on error path. */
+    bool in_checkpoint_state;
+    libxl__save_helper_state shs;
+    libxl__conversion_helper_state chs;
+
+    /* Main stream-reading data. */
+    libxl__datacopier_state dc; /* Only used when reading a record */
+    libxl__sr_hdr hdr;
+    LIBXL_STAILQ_HEAD(, libxl__sr_record_buf) record_queue; /* NOGC */
+    enum {
+        SRS_PHASE_NORMAL,
+        SRS_PHASE_BUFFERING,
+        SRS_PHASE_UNBUFFERING,
+    } phase;
+    bool recursion_guard;
+
+    /* Only used while actively reading a record from the stream. */
+    libxl__sr_record_buf *incoming_record; /* NOGC */
+
+    /* Both only used when processing an EMULATOR record. */
+    libxl__datacopier_state emu_dc;
+    libxl__carefd *emu_carefd;
+};
+
+_hidden void libxl__stream_read_init(libxl__stream_read_state *stream);
+_hidden void libxl__stream_read_start(libxl__egc *egc,
+                                      libxl__stream_read_state *stream);
+_hidden void libxl__stream_read_start_checkpoint(libxl__egc *egc,
+                                                 libxl__stream_read_state *stream);
+_hidden void libxl__stream_read_checkpoint_state(libxl__egc *egc,
+                                                 libxl__stream_read_state *stream);
+_hidden void libxl__stream_read_abort(libxl__egc *egc,
+                                      libxl__stream_read_state *stream, int rc);
+static inline bool
+libxl__stream_read_inuse(const libxl__stream_read_state *stream)
+{
+    return stream->running;
+}
+
+/*----- colo related state structure -----*/
+typedef struct libxl__colo_save_state libxl__colo_save_state;
+struct libxl__colo_save_state {
+    int send_fd;
+    int recv_fd;
+
+    /* private */
+    libxl__stream_read_state srs;
+    void (*callback)(libxl__egc *, libxl__colo_save_state *, int);
+    bool svm_running;
+};
 
 /*----- Domain suspend (save) state structure -----*/
 /*
@@ -3146,7 +3229,12 @@ struct libxl__domain_save_state {
     int hvm;
     int xcflags;
     libxl__domain_suspend_state dsps;
-    libxl__remus_state rs;
+    union {
+        /* for Remus */
+        libxl__remus_state rs;
+        /* for COLO */
+        libxl__colo_save_state css;
+    };
     libxl__checkpoint_devices_state cds;
     libxl__stream_write_state sws;
     libxl__logdirty_switch logdirty;
@@ -3396,78 +3484,6 @@ _hidden int libxl__destroy_qdisk_backend(libxl__gc *gc, uint32_t domid);
 
 /*----- Domain creation -----*/
 
-typedef struct libxl__domain_create_state libxl__domain_create_state;
-
-typedef void libxl__domain_create_cb(libxl__egc *egc,
-                                     libxl__domain_create_state*,
-                                     int rc, uint32_t domid);
-
-/* State for manipulating a libxl migration v2 stream */
-typedef struct libxl__stream_read_state libxl__stream_read_state;
-
-typedef struct libxl__sr_record_buf {
-    /* private to stream read helper */
-    LIBXL_STAILQ_ENTRY(struct libxl__sr_record_buf) entry;
-    libxl__sr_rec_hdr hdr;
-    void *body; /* iff hdr.length != 0 */
-} libxl__sr_record_buf;
-
-struct libxl__stream_read_state {
-    /* filled by the user */
-    libxl__ao *ao;
-    libxl__domain_create_state *dcs;
-    int fd;
-    bool legacy;
-    bool back_channel;
-    void (*completion_callback)(libxl__egc *egc,
-                                libxl__stream_read_state *srs,
-                                int rc);
-    void (*checkpoint_callback)(libxl__egc *egc,
-                                libxl__stream_read_state *srs,
-                                int rc);
-    /* Private */
-    int rc;
-    bool running;
-    bool in_checkpoint;
-    bool sync_teardown; /* Only used to coordinate shutdown on error path. */
-    bool in_checkpoint_state;
-    libxl__save_helper_state shs;
-    libxl__conversion_helper_state chs;
-
-    /* Main stream-reading data. */
-    libxl__datacopier_state dc; /* Only used when reading a record */
-    libxl__sr_hdr hdr;
-    LIBXL_STAILQ_HEAD(, libxl__sr_record_buf) record_queue; /* NOGC */
-    enum {
-        SRS_PHASE_NORMAL,
-        SRS_PHASE_BUFFERING,
-        SRS_PHASE_UNBUFFERING,
-    } phase;
-    bool recursion_guard;
-
-    /* Only used while actively reading a record from the stream. */
-    libxl__sr_record_buf *incoming_record; /* NOGC */
-
-    /* Both only used when processing an EMULATOR record. */
-    libxl__datacopier_state emu_dc;
-    libxl__carefd *emu_carefd;
-};
-
-_hidden void libxl__stream_read_init(libxl__stream_read_state *stream);
-_hidden void libxl__stream_read_start(libxl__egc *egc,
-                                      libxl__stream_read_state *stream);
-_hidden void libxl__stream_read_start_checkpoint(libxl__egc *egc,
-                                                 libxl__stream_read_state *stream);
-_hidden void libxl__stream_read_checkpoint_state(libxl__egc *egc,
-                                                 libxl__stream_read_state *stream);
-_hidden void libxl__stream_read_abort(libxl__egc *egc,
-                                      libxl__stream_read_state *stream, int rc);
-static inline bool
-libxl__stream_read_inuse(const libxl__stream_read_state *stream)
-{
-    return stream->running;
-}
-
 /* colo related structure */
 typedef struct libxl__colo_restore_state libxl__colo_restore_state;
 typedef void libxl__colo_callback(libxl__egc *,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index db001ad..7c46bc2 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -762,6 +762,7 @@ libxl_domain_remus_info = Struct("domain_remus_info",[
     ("netbuf",       libxl_defbool),
     ("netbufscript", string),
     ("diskbuf",      libxl_defbool),
+    ("colo",         libxl_defbool)
     ])
 
 libxl_event_type = Enumeration("event_type", [
-- 
2.5.0

* [PATCH v9 13/25] libxc/restore: support COLO restore
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (11 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 12/25] primary " Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 14/25] libxc/restore: send dirty pfn list to primary when checkpoint under colo Wen Congyang
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Call the resume/checkpoint/suspend callbacks once the secondary vm's
state is consistent with the primary's.
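
The way the callback return values are handled can be sketched on its own.
This is a simplified model, not the libxc code: the sketch_* names and the
stub callbacks are hypothetical, and SKETCH_BROKEN_CHANNEL is a stand-in for
BROKEN_CHANNEL, whose actual value is assumed here. A callback returning 1
means success, 2 means the channel to the primary is broken, and anything
else is treated as an unspecified error; the first failure ends the sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for BROKEN_CHANNEL; the real value is an assumption here. */
#define SKETCH_BROKEN_CHANNEL 2

/* Map a callback return value to an rc, as the patch's macro does. */
int sketch_map_callback_ret(int ret)
{
    if (ret == 1)
        return 0;                      /* success */
    if (ret == 2)
        return SKETCH_BROKEN_CHANNEL;  /* primary is unreachable */
    return -1;                         /* some unspecified error */
}

typedef int (*sketch_cb)(void *data);

/*
 * Run the three per-checkpoint callbacks in order: resume the secondary
 * (postcopy), wait for a new checkpoint, suspend the secondary.
 * Returns 0 if all succeed, otherwise the rc of the first failure.
 */
int sketch_colo_checkpoint_steps(sketch_cb postcopy,
                                 sketch_cb should_checkpoint,
                                 sketch_cb suspend,
                                 void *data)
{
    sketch_cb steps[3];
    size_t i;

    steps[0] = postcopy;
    steps[1] = should_checkpoint;
    steps[2] = suspend;

    for (i = 0; i < 3; i++) {
        int rc;

        assert(steps[i] != NULL);
        rc = sketch_map_callback_ret(steps[i](data));
        if (rc)
            return rc;
    }
    return 0;
}

/* Stub callbacks, used only to exercise the sketch. */
int sketch_cb_ok(void *data)     { (void)data; return 1; }
int sketch_cb_broken(void *data) { (void)data; return 2; }
int sketch_cb_fail(void *data)   { (void)data; return 0; }
```

Keeping the broken-channel case distinct from other errors lets the caller
tell a dead primary, where failover may be appropriate, from a local failure.
The stream_complete() call that precedes the callbacks in the patch is left
out of this sketch.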

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/xc_sr_common.h  |  6 +++--
 tools/libxc/xc_sr_restore.c | 60 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index 53d6129..e768a6d 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -175,10 +175,12 @@ struct xc_sr_context
      * migration stream
      * 0: Plain VM
      * 1: Remus
+     * 2: COLO
      */
     enum {
         MIG_STREAM_NONE, /* plain stream */
         MIG_STREAM_REMUS,
+        MIG_STREAM_COLO,
     } migration_stream;
 
     union /* Common save or restore data. */
@@ -223,13 +225,13 @@ struct xc_sr_context
             uint32_t guest_page_size;
 
             /* Plain VM, or checkpoints over time. */
-            bool checkpointed;
+            int checkpointed;
 
             /* Currently buffering records between a checkpoint */
             bool buffer_all_records;
 
 /*
- * With Remus, we buffer the records sent by the primary at checkpoint,
+ * With Remus/COLO, we buffer the records sent by the primary at checkpoint,
  * in case the primary will fail, we can recover from the last
  * checkpoint state.
  * This should be enough for most of the cases because primary only send
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index e543be3..f01a081 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -456,6 +456,49 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
     else
         ctx->restore.buffer_all_records = true;
 
+    if ( ctx->restore.checkpointed == MIG_STREAM_COLO )
+    {
+#define HANDLE_CALLBACK_RETURN_VALUE(ret)                   \
+    do {                                                    \
+        if ( ret == 1 )                                     \
+            rc = 0; /* Success */                           \
+        else                                                \
+        {                                                   \
+            if ( ret == 2 )                                 \
+                rc = BROKEN_CHANNEL;                        \
+            else                                            \
+                rc = -1; /* Some unspecified error */       \
+            goto err;                                       \
+        }                                                   \
+    } while (0)
+
+        /* COLO */
+
+        /* We need to resume guest */
+        rc = ctx->restore.ops.stream_complete(ctx);
+        if ( rc )
+            goto err;
+
+        /* TODO: call restore_results */
+
+        /* Resume secondary vm */
+        ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
+        HANDLE_CALLBACK_RETURN_VALUE(ret);
+
+        /* Wait for a new checkpoint */
+        ret = ctx->restore.callbacks->should_checkpoint(
+                                                ctx->restore.callbacks->data);
+        HANDLE_CALLBACK_RETURN_VALUE(ret);
+
+        /* suspend secondary vm */
+        ret = ctx->restore.callbacks->suspend(ctx->restore.callbacks->data);
+        HANDLE_CALLBACK_RETURN_VALUE(ret);
+
+#undef HANDLE_CALLBACK_RETURN_VALUE
+
+        /* TODO: send dirty pfn list to primary */
+    }
+
  err:
     return rc;
 }
@@ -627,6 +670,15 @@ static int restore(struct xc_sr_context *ctx)
     } while ( rec.type != REC_TYPE_END );
 
  remus_failover:
+
+    if ( ctx->restore.checkpointed == MIG_STREAM_COLO )
+    {
+        /* With COLO, we have already called stream_complete */
+        rc = 0;
+        IPRINTF("COLO Failover");
+        goto done;
+    }
+
     /*
      * With Remus, if we reach here, there must be some error on primary,
      * failover from the last checkpoint state.
@@ -681,6 +733,14 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     if ( checkpointed_stream )
         assert(callbacks->checkpoint);
 
+    if ( ctx.restore.checkpointed == MIG_STREAM_COLO )
+    {
+        /* this is COLO restore */
+        assert(callbacks->suspend &&
+               callbacks->postcopy &&
+               callbacks->should_checkpoint);
+    }
+
     DPRINTF("fd %d, dom %u, hvm %u, pae %u, superpages %d"
             ", checkpointed_stream %d", io_fd, dom, hvm, pae,
             superpages, checkpointed_stream);
-- 
2.5.0

* [PATCH v9 14/25] libxc/restore: send dirty pfn list to primary when checkpoint under colo
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (12 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 13/25] libxc/restore: support COLO restore Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 15/25] send store gfn and console gfn to xl before resuming secondary vm Wen Congyang
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Send the dirty pfn list to the primary at each checkpoint under COLO.
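
The bitmap-to-list conversion can be sketched on its own as below. It is
a simplified illustration under stated assumptions, not the code in this
patch: the sketch_* names are hypothetical, the bitmap is a plain array
of unsigned long rather than a hypercall buffer, and the result is handed
back to the caller instead of being written to the back channel.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Return non-zero if bit nr is set in bitmap. */
static int sketch_test_bit(uint64_t nr, const unsigned long *bitmap)
{
    return (bitmap[nr / SKETCH_BITS_PER_LONG] >>
            (nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

/*
 * Collect the index of every set bit below nr_pfns into a newly
 * allocated array. Returns 0 on success, -1 on allocation failure.
 * The caller frees *out.
 */
static int sketch_bitmap_to_pfn_list(const unsigned long *bitmap,
                                     uint64_t nr_pfns,
                                     uint64_t **out, size_t *out_count)
{
    size_t count = 0, written = 0;
    uint64_t i, *pfns;

    /* First pass: count the dirty pfns so the list can be sized. */
    for ( i = 0; i < nr_pfns; i++ )
        if ( sketch_test_bit(i, bitmap) )
            count++;

    /* malloc(0) may return NULL, so always ask for at least one slot. */
    pfns = malloc((count ? count : 1) * sizeof(*pfns));
    if ( !pfns )
        return -1;

    /* Second pass: record each dirty pfn. */
    for ( i = 0; i < nr_pfns; i++ )
    {
        if ( !sketch_test_bit(i, bitmap) )
            continue;

        assert(written < count);
        pfns[written++] = i;
    }

    *out = pfns;
    *out_count = count;
    return 0;
}
```

The record payload length is then count * sizeof(uint64_t), which is the
value the receiver divides by to recover the number of pfns.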

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_sr_common.h  |   4 ++
 tools/libxc/xc_sr_restore.c | 120 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index e768a6d..1c661cb 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -217,6 +217,10 @@ struct xc_sr_context
             struct xc_sr_restore_ops ops;
             struct restore_callbacks *callbacks;
 
+            int send_fd;
+            unsigned long p2m_size;
+            xc_hypercall_buffer_t dirty_bitmap_hbuf;
+
             /* From Image Header. */
             uint32_t format_version;
 
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index f01a081..83df539 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -411,6 +411,92 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
     return rc;
 }
 
+/*
+ * Send dirty_bitmap to primary.
+ */
+static int send_dirty_pfn_list(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    unsigned count, written;
+    uint64_t i, *pfns = NULL;
+    struct iovec *iov = NULL;
+    xc_shadow_op_stats_t stats = { 0, ctx->restore.p2m_size };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_DIRTY_PFN_LIST,
+    };
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->restore.dirty_bitmap_hbuf);
+
+    if ( xc_shadow_control(
+             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+             HYPERCALL_BUFFER(dirty_bitmap), ctx->restore.p2m_size,
+             NULL, 0, &stats) != ctx->restore.p2m_size )
+    {
+        PERROR("Failed to retrieve logdirty bitmap");
+        goto err;
+    }
+
+    for ( i = 0, count = 0; i < ctx->restore.p2m_size; i++ )
+    {
+        if ( test_bit(i, dirty_bitmap) )
+            count++;
+    }
+
+
+    pfns = malloc(count * sizeof(*pfns));
+    if ( !pfns )
+    {
+        ERROR("Unable to allocate %zu bytes of memory for dirty pfn list",
+              count * sizeof(*pfns));
+        goto err;
+    }
+
+    for ( i = 0, written = 0; i < ctx->restore.p2m_size; ++i )
+    {
+        if ( !test_bit(i, dirty_bitmap) )
+            continue;
+
+        if ( written >= count )
+        {
+            ERROR("Dirty pfn list exceed");
+            goto err;
+        }
+
+        pfns[written++] = i;
+    }
+
+    /* iovec[] for writev(). */
+    iov = malloc(3 * sizeof(*iov));
+    if ( !iov )
+    {
+        ERROR("Unable to allocate memory for sending dirty bitmap");
+        goto err;
+    }
+
+    rec.length = count * sizeof(*pfns);
+
+    iov[0].iov_base = &rec.type;
+    iov[0].iov_len = sizeof(rec.type);
+
+    iov[1].iov_base = &rec.length;
+    iov[1].iov_len = sizeof(rec.length);
+
+    iov[2].iov_base = pfns;
+    iov[2].iov_len = count * sizeof(*pfns);
+
+    if ( writev_exact(ctx->restore.send_fd, iov, 3) )
+    {
+        PERROR("Failed to write dirty bitmap to stream");
+        goto err;
+    }
+
+    rc = 0;
+ err:
+    return rc;
+}
+
 static int process_record(struct xc_sr_context *ctx, struct xc_sr_record *rec);
 static int handle_checkpoint(struct xc_sr_context *ctx)
 {
@@ -496,7 +582,9 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
 
 #undef HANDLE_CALLBACK_RETURN_VALUE
 
-        /* TODO: send dirty pfn list to primary */
+        rc = send_dirty_pfn_list(ctx);
+        if ( rc )
+            goto err;
     }
 
  err:
@@ -568,6 +656,21 @@ static int setup(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
     int rc;
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->restore.dirty_bitmap_hbuf);
+
+    if ( ctx->restore.checkpointed == MIG_STREAM_COLO )
+    {
+        dirty_bitmap = xc_hypercall_buffer_alloc_pages(xch, dirty_bitmap,
+                                NRPAGES(bitmap_size(ctx->restore.p2m_size)));
+
+        if ( !dirty_bitmap )
+        {
+            ERROR("Unable to allocate memory for dirty bitmap");
+            rc = -1;
+            goto err;
+        }
+    }
 
     rc = ctx->restore.ops.setup(ctx);
     if ( rc )
@@ -601,10 +704,15 @@ static void cleanup(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
     unsigned i;
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->restore.dirty_bitmap_hbuf);
 
     for ( i = 0; i < ctx->restore.buffered_rec_num; i++ )
         free(ctx->restore.buffered_records[i].data);
 
+    if ( ctx->restore.checkpointed == MIG_STREAM_COLO )
+        xc_hypercall_buffer_free_pages(xch, dirty_bitmap,
+                                   NRPAGES(bitmap_size(ctx->restore.p2m_size)));
     free(ctx->restore.buffered_records);
     free(ctx->restore.populated_pfns);
     if ( ctx->restore.ops.cleanup(ctx) )
@@ -715,6 +823,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       int checkpointed_stream,
                       struct restore_callbacks *callbacks, int back_fd)
 {
+    xen_pfn_t nr_pfns;
     struct xc_sr_context ctx =
         {
             .xch = xch,
@@ -728,6 +837,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     ctx.restore.xenstore_domid = store_domid;
     ctx.restore.checkpointed = checkpointed_stream;
     ctx.restore.callbacks = callbacks;
+    ctx.restore.send_fd = back_fd;
 
     /* Sanity checks for callbacks. */
     if ( checkpointed_stream )
@@ -762,6 +872,14 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     if ( read_headers(&ctx) )
         return -1;
 
+    if ( xc_domain_nr_gpfns(xch, dom, &nr_pfns) < 0 )
+    {
+        PERROR("Unable to obtain the guest p2m size");
+        return -1;
+    }
+
+    ctx.restore.p2m_size = nr_pfns;
+
     if ( ctx.dominfo.hvm )
     {
         ctx.restore.ops = restore_ops_x86_hvm;
-- 
2.5.0

* [PATCH v9 15/25] send store gfn and console gfn to xl before resuming secondary vm
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (13 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 14/25] libxc/restore: send dirty pfn list to primary when checkpoint under colo Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 16/25] libxc/save: support COLO save Wen Congyang
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

We call libxl__xc_domain_restore_done() to rebuild the secondary VM, and
the rebuild needs the store gfn and console gfn. So make restore_results
a function pointer in the callback struct and in struct
{save,restore}_callbacks, and use this callback to send the store gfn
and console gfn to xl before the secondary VM is resumed.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/include/xenguest.h     | 8 ++++++++
 tools/libxc/xc_sr_restore.c        | 7 +++++--
 tools/libxl/libxl_colo_restore.c   | 5 -----
 tools/libxl/libxl_create.c         | 2 ++
 tools/libxl/libxl_save_msgs_gen.pl | 4 ++--
 5 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 7c74977..df36147 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -121,6 +121,14 @@ struct restore_callbacks {
      */
     int (*should_checkpoint)(void* data);
 
+    /*
+     * callback to send store gfn and console gfn to xl
+     * if we want to resume vm before xc_domain_restore()
+     * exits.
+     */
+    void (*restore_results)(xen_pfn_t store_gfn, xen_pfn_t console_gfn,
+                            void *data);
+
     /* to be provided as the last argument to each callback function */
     void* data;
 };
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 83df539..6c480d3 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -565,7 +565,9 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
         if ( rc )
             goto err;
 
-        /* TODO: call restore_results */
+        ctx->restore.callbacks->restore_results(ctx->restore.xenstore_gfn,
+                                                ctx->restore.console_gfn,
+                                                ctx->restore.callbacks->data);
 
         /* Resume secondary vm */
         ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
@@ -848,7 +850,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
         /* this is COLO restore */
         assert(callbacks->suspend &&
                callbacks->postcopy &&
-               callbacks->should_checkpoint);
+               callbacks->should_checkpoint &&
+               callbacks->restore_results);
     }
 
     DPRINTF("fd %d, dom %u, hvm %u, pae %u, superpages %d"
diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
index 4f25f27..f23ef8f 100644
--- a/tools/libxl/libxl_colo_restore.c
+++ b/tools/libxl/libxl_colo_restore.c
@@ -139,11 +139,6 @@ static void colo_resume_vm(libxl__egc *egc,
         return;
     }
 
-    /*
-     * TODO: get store gfn and console gfn
-     *  We should call the callback restore_results in
-     *  xc_domain_restore() before resuming the guest.
-     */
     libxl__xc_domain_restore_done(egc, dcs, 0, 0, 0);
 
     return;
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 62aa7c9..b495741 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1042,6 +1042,8 @@ static void domcreate_bootloader_done(libxl__egc *egc,
     dcs->srs.back_channel = false;
     dcs->srs.completion_callback = domcreate_stream_done;
 
+    callbacks->restore_results = libxl__srm_callout_callback_restore_results;
+
     if (restore_fd >= 0) {
         switch (checkpointed_stream) {
         case LIBXL_CHECKPOINTED_STREAM_COLO:
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 7c9859b..aea07b4 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -29,8 +29,8 @@ our @msgs = (
     [  6, 'srcxA',  "should_checkpoint", [] ],
     [  7, 'scxA',   "switch_qemu_logdirty",  [qw(int domid
                                               unsigned enable)] ],
-    [  8, 'r',      "restore_results",       ['unsigned long', 'store_mfn',
-                                              'unsigned long', 'console_mfn'] ],
+    [  8, 'rcx',    "restore_results",       ['unsigned long', 'store_gfn',
+                                              'unsigned long', 'console_gfn'] ],
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
 );
-- 
2.5.0

* [PATCH v9 16/25] libxc/save: support COLO save
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (14 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 15/25] send store gfn and console gfn to xl before resuming secondary vm Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 17/25] implement the cmdline for COLO Wen Congyang
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

After suspending the primary VM, get the dirty pfn list from the
secondary VM, merge it into the primary's dirty bitmap, and send every
page that is dirty on either side to the secondary.
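
The merge step can be sketched in isolation as below. This is an
illustration under assumptions, not the code in this patch: the sketch_*
names are hypothetical and the bitmap is a plain array of unsigned long.
A record length that is not a multiple of the pfn size, or a pfn at or
beyond the p2m size, is rejected.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * OR a received pfn list into the local dirty bitmap.
 * length is the record payload size in bytes.
 * Returns 0 on success, -1 on a malformed record.
 */
static int sketch_merge_pfn_list(unsigned long *bitmap, uint64_t nr_pfns,
                                 const uint64_t *pfns, size_t length)
{
    size_t i, count;

    assert(bitmap != NULL);

    /* The payload must be a whole number of 64-bit pfns. */
    if ( length % sizeof(*pfns) )
        return -1;

    count = length / sizeof(*pfns);

    for ( i = 0; i < count; i++ )
    {
        /* Valid bits are 0 .. nr_pfns - 1. */
        if ( pfns[i] >= nr_pfns )
            return -1;

        bitmap[pfns[i] / SKETCH_BITS_PER_LONG] |=
            1UL << (pfns[i] % SKETCH_BITS_PER_LONG);
    }

    return 0;
}
```

Because bits are only ever set, pages already marked dirty on the
primary stay marked, so the result is the union of both dirty sets.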

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/xc_sr_common.h |   2 +
 tools/libxc/xc_sr_save.c   | 102 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 101 insertions(+), 3 deletions(-)

diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index 1c661cb..76e2cdc 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -187,6 +187,8 @@ struct xc_sr_context
     {
         struct /* Save data. */
         {
+            int recv_fd;
+
             struct xc_sr_save_ops ops;
             struct save_callbacks *callbacks;
 
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index a49d083..aaabf1f 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -516,6 +516,58 @@ static int send_memory_live(struct xc_sr_context *ctx)
     return rc;
 }
 
+static int merge_secondary_dirty_bitmap(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_record rec;
+    uint64_t *pfns = NULL;
+    uint64_t pfn;
+    unsigned count, i;
+    int rc;
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    rc = read_record(ctx, ctx->save.recv_fd, &rec);
+    if ( rc )
+        goto err;
+
+    if ( rec.type != REC_TYPE_DIRTY_PFN_LIST )
+    {
+        PERROR("Expect dirty bitmap record, but received %u", rec.type );
+        rc = -1;
+        goto err;
+    }
+
+    if ( rec.length % sizeof(*pfns) )
+    {
+        PERROR("Invalid dirty pfn list record length %u", rec.length );
+        rc = -1;
+        goto err;
+    }
+
+    count = rec.length / sizeof(*pfns);
+    pfns = rec.data;
+
+    for ( i = 0; i < count; i++ )
+    {
+        pfn = pfns[i];
+        if ( pfn >= ctx->save.p2m_size )
+        {
+            PERROR("Invalid pfn %#lx", pfn );
+            rc = -1;
+            goto err;
+        }
+
+        set_bit(pfn, dirty_bitmap);
+    }
+
+    rc = 0;
+
+ err:
+    free(rec.data);
+    return rc;
+}
+
 /*
  * Suspend the domain and send dirty memory.
  * This is the last iteration of the live migration and the
@@ -557,6 +609,16 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
 
     bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
 
+    if ( !ctx->save.live && ctx->save.checkpointed == MIG_STREAM_COLO )
+    {
+        rc = merge_secondary_dirty_bitmap(ctx);
+        if ( rc )
+        {
+            PERROR("Failed to get secondary vm's dirty pages");
+            goto out;
+        }
+    }
+
     rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
     if ( rc )
         goto out;
@@ -787,11 +849,42 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
             if ( rc )
                 goto err;
 
-            ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
+            if ( ctx->save.checkpointed == MIG_STREAM_COLO )
+            {
+                rc = ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
+                if ( !rc )
+                {
+                    rc = -1;
+                    goto err;
+                }
+            }
 
-            rc = ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
-            if ( rc <= 0 )
+            rc = ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
+            if ( !rc )
+            {
+                rc = -1;
                 goto err;
+            }
+
+            if ( ctx->save.checkpointed == MIG_STREAM_COLO )
+            {
+                rc = ctx->save.callbacks->should_checkpoint(
+                                                    ctx->save.callbacks->data);
+                if ( rc <= 0 )
+                    goto err;
+            }
+            else if ( ctx->save.checkpointed == MIG_STREAM_REMUS )
+            {
+                rc = ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
+                if ( rc <= 0 )
+                    goto err;
+            }
+            else
+            {
+                ERROR("Unknown checkpointed stream");
+                rc = -1;
+                goto err;
+            }
         }
     } while ( ctx->save.checkpointed != MIG_STREAM_NONE );
 
@@ -837,6 +930,7 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
     ctx.save.live  = !!(flags & XCFLAGS_LIVE);
     ctx.save.debug = !!(flags & XCFLAGS_DEBUG);
     ctx.save.checkpointed = checkpointed_stream;
+    ctx.save.recv_fd = back_fd;
 
     /*
      * TODO: Find some time to better tweak the live migration algorithm.
@@ -852,6 +946,8 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
         assert(callbacks->switch_qemu_logdirty);
     if ( ctx.save.checkpointed )
         assert(callbacks->checkpoint && callbacks->postcopy);
+    if ( ctx.save.checkpointed == MIG_STREAM_COLO )
+        assert(callbacks->should_checkpoint);
 
     DPRINTF("fd %d, dom %u, max_iters %u, max_factor %u, flags %u, hvm %d",
             io_fd, dom, max_iters, max_factor, flags, hvm);
-- 
2.5.0

* [PATCH v9 17/25] implement the cmdline for COLO
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (15 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 16/25] libxc/save: support COLO save Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 18/25] Support colo mode for qemu disk Wen Congyang
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Add a new option -c to the command 'xl remus'. Use -c to enable
COLO HA instead of Remus HA.

Update man pages to reflect the addition of a new option to
'xl remus' command.

Also add a new option -c to the internal command 'xl migrate-receive'.
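
The option rules can be summarized by the sketch below. It is a
simplified model, not the xl code: the sketch_* names are hypothetical,
and a three-state enum stands in for libxl_defbool. With -c, the options
-i and -b are rejected and compression must end up disabled; without -c,
the Remus defaults (200 ms interval, compression on) apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for libxl_defbool. */
enum sketch_tristate { SKETCH_DEFAULT, SKETCH_FALSE, SKETCH_TRUE };

struct sketch_remus_opts {
    bool colo;                        /* -c given */
    int interval;                     /* -i value, 0 if not given */
    bool blackhole;                   /* -b given */
    enum sketch_tristate compression; /* memory checkpoint compression */
};

/* Fill in defaults; return 0 if the combination is valid, -1 if not. */
static int sketch_resolve_opts(struct sketch_remus_opts *o)
{
    assert(o != NULL);

    if ( o->colo )
    {
        /* -c conflicts with -i and -b. */
        if ( o->interval || o->blackhole )
            return -1;

        /* Compression cannot be used in COLO mode. */
        if ( o->compression == SKETCH_TRUE )
            return -1;

        o->compression = SKETCH_FALSE;
        return 0;
    }

    /* Remus defaults. */
    if ( !o->interval )
        o->interval = 200;
    if ( o->compression == SKETCH_DEFAULT )
        o->compression = SKETCH_TRUE;

    return 0;
}
```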

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
---
 docs/man/xl.pod.1         | 12 +++++++--
 tools/libxl/libxl.c       | 23 ++++++++++++++++--
 tools/libxl/xl_cmdimpl.c  | 62 ++++++++++++++++++++++++++++++++++++-----------
 tools/libxl/xl_cmdtable.c |  4 ++-
 4 files changed, 82 insertions(+), 19 deletions(-)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index 4279c7c..1c6dd87 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -447,12 +447,15 @@ Print huge (!) amount of debug during the migration process.
 
 =item B<remus> [I<OPTIONS>] I<domain-id> I<host>
 
-Enable Remus HA for domain. By default B<xl> relies on ssh as a transport
-mechanism between the two hosts.
+Enable Remus HA or COLO HA for domain. By default B<xl> relies on ssh as a
+transport mechanism between the two hosts.
 
 N.B: Remus support in xl is still in experimental (proof-of-concept) phase.
      Disk replication support is limited to DRBD disks.
 
+     COLO support in xl is still in experimental (proof-of-concept) phase.
+     There is no support for network or disk at the moment.
+
 B<OPTIONS>
 
 =over 4
@@ -498,6 +501,11 @@ Disable network output buffering. Requires enabling unsafe mode.
 
 Disable disk replication. Requires enabling unsafe mode.
 
+=item B<-c>
+
+Enable COLO HA. This conflicts with B<-i> and B<-b>, and memory
+checkpoint compression must be disabled.
+
 =back
 
 =item B<pause> I<domain-id>
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 7d227d7..ca6bca1 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -851,12 +851,28 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
         goto out;
     }
 
+    /* The caller must set this defbool */
+    if (libxl_defbool_is_default(info->colo)) {
+        LOG(ERROR, "colo mode must be enabled/disabled");
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
     libxl_defbool_setdefault(&info->allow_unsafe, false);
     libxl_defbool_setdefault(&info->blackhole, false);
-    libxl_defbool_setdefault(&info->compression, true);
+    libxl_defbool_setdefault(&info->compression,
+                             !libxl_defbool_val(info->colo));
     libxl_defbool_setdefault(&info->netbuf, true);
     libxl_defbool_setdefault(&info->diskbuf, true);
 
+    if (libxl_defbool_val(info->colo)) {
+        if (libxl_defbool_val(info->compression)) {
+            LOG(ERROR, "cannot use memory checkpoint compression in COLO mode");
+            rc = ERROR_FAIL;
+            goto out;
+        }
+    }
+
     if (!libxl_defbool_val(info->allow_unsafe) &&
         (libxl_defbool_val(info->blackhole) ||
          !libxl_defbool_val(info->netbuf) ||
@@ -878,7 +894,10 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
     dss->live = 1;
     dss->debug = 0;
     dss->remus = info;
-    dss->checkpointed_stream = LIBXL_CHECKPOINTED_STREAM_REMUS;
+    if (libxl_defbool_val(info->colo))
+        dss->checkpointed_stream = LIBXL_CHECKPOINTED_STREAM_COLO;
+    else
+        dss->checkpointed_stream = LIBXL_CHECKPOINTED_STREAM_REMUS;
 
     assert(info);
 
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 6580b59..70b8b82 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -4435,6 +4435,8 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     char rc_buf;
     char *migration_domname;
     struct domain_create dom_info;
+    const char *ha = checkpointed == LIBXL_CHECKPOINTED_STREAM_COLO ?
+                     "COLO" : "Remus";
 
     signal(SIGPIPE, SIG_IGN);
     /* if we get SIGPIPE we'd rather just have it as an error */
@@ -4455,6 +4457,9 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     dom_info.send_fd = send_fd;
     dom_info.migration_domname_r = &migration_domname;
     dom_info.checkpointed_stream = checkpointed;
+    if (checkpointed == LIBXL_CHECKPOINTED_STREAM_COLO)
+        /* COLO uses stdout to send control message to master */
+        dom_info.quiet = 1;
 
     rc = create_domain(&dom_info);
     if (rc < 0) {
@@ -4467,11 +4472,12 @@ static void migrate_receive(int debug, int daemonize, int monitor,
 
     switch (checkpointed) {
     case LIBXL_CHECKPOINTED_STREAM_REMUS:
+    case LIBXL_CHECKPOINTED_STREAM_COLO:
         /* If we are here, it means that the sender (primary) has crashed.
          * TODO: Split-Brain Check.
          */
-        fprintf(stderr, "migration target: Remus Failover for domain %u\n",
-                domid);
+        fprintf(stderr, "migration target: %s Failover for domain %u\n",
+                ha, domid);
 
         /*
          * If domain renaming fails, lets just continue (as we need the domain
@@ -4487,16 +4493,20 @@ static void migrate_receive(int debug, int daemonize, int monitor,
             rc = libxl_domain_rename(ctx, domid, migration_domname,
                                      common_domname);
             if (rc)
-                fprintf(stderr, "migration target (Remus): "
+                fprintf(stderr, "migration target (%s): "
                         "Failed to rename domain from %s to %s:%d\n",
-                        migration_domname, common_domname, rc);
+                        ha, migration_domname, common_domname, rc);
         }
 
+        if (checkpointed == LIBXL_CHECKPOINTED_STREAM_COLO)
+            /* The guest is running after failover in COLO mode */
+            exit(rc ? -ERROR_FAIL: 0);
+
         rc = libxl_domain_unpause(ctx, domid);
         if (rc)
-            fprintf(stderr, "migration target (Remus): "
+            fprintf(stderr, "migration target (%s): "
                     "Failed to unpause domain %s (id: %u):%d\n",
-                    common_domname, domid, rc);
+                    ha, common_domname, domid, rc);
 
         exit(rc ? -ERROR_FAIL: 0);
     default:
@@ -4644,7 +4654,7 @@ int main_migrate_receive(int argc, char **argv)
     libxl_checkpointed_stream checkpointed = LIBXL_CHECKPOINTED_STREAM_NONE;
     int opt;
 
-    SWITCH_FOREACH_OPT(opt, "Fedr", NULL, "migrate-receive", 0) {
+    SWITCH_FOREACH_OPT(opt, "Fedrc", NULL, "migrate-receive", 0) {
     case 'F':
         daemonize = 0;
         break;
@@ -4658,6 +4668,9 @@ int main_migrate_receive(int argc, char **argv)
     case 'r':
         checkpointed = LIBXL_CHECKPOINTED_STREAM_REMUS;
         break;
+    case 'c':
+        checkpointed = LIBXL_CHECKPOINTED_STREAM_COLO;
+        break;
     }
 
     if (argc-optind != 0) {
@@ -8023,11 +8036,8 @@ int main_remus(int argc, char **argv)
     int config_len;
 
     memset(&r_info, 0, sizeof(libxl_domain_remus_info));
-    /* Defaults */
-    r_info.interval = 200;
-    libxl_defbool_setdefault(&r_info.blackhole, false);
 
-    SWITCH_FOREACH_OPT(opt, "Fbundi:s:N:e", NULL, "remus", 2) {
+    SWITCH_FOREACH_OPT(opt, "Fbundi:s:N:ec", NULL, "remus", 2) {
     case 'i':
         r_info.interval = atoi(optarg);
         break;
@@ -8055,11 +8065,32 @@ int main_remus(int argc, char **argv)
     case 'e':
         daemonize = 0;
         break;
+    case 'c':
+        libxl_defbool_set(&r_info.colo, true);
     }
 
     domid = find_domain(argv[optind]);
     host = argv[optind + 1];
 
+    /* Defaults */
+    libxl_defbool_setdefault(&r_info.blackhole, false);
+    libxl_defbool_setdefault(&r_info.colo, false);
+    if (!libxl_defbool_val(r_info.colo) && !r_info.interval)
+        r_info.interval = 200;
+
+    if (libxl_defbool_val(r_info.colo)) {
+        if (r_info.interval || libxl_defbool_val(r_info.blackhole)) {
+            fprintf(stderr, "Option -c conflicts with -i or -b\n");
+            exit(-1);
+        }
+
+        if (libxl_defbool_is_default(r_info.compression)) {
+            fprintf(stderr, "COLO can't be used with memory compression. "
+                    "Disabling memory checkpoint compression.\n");
+            libxl_defbool_set(&r_info.compression, false);
+        }
+    }
+
     if (!r_info.netbufscript)
         r_info.netbufscript = default_remus_netbufscript;
 
@@ -8074,8 +8105,9 @@ int main_remus(int argc, char **argv)
         if (!ssh_command[0]) {
             rune = host;
         } else {
-            xasprintf(&rune, "exec %s %s xl migrate-receive -r %s",
+            xasprintf(&rune, "exec %s %s xl migrate-receive %s %s",
                       ssh_command, host,
+                      libxl_defbool_val(r_info.colo) ? "-c" : "-r",
                       daemonize ? "" : " -e");
         }
 
@@ -8103,7 +8135,8 @@ int main_remus(int argc, char **argv)
      * domain to force failover
      */
     if (libxl_domain_info(ctx, 0, domid)) {
-        fprintf(stderr, "Remus: Primary domain has been destroyed.\n");
+        fprintf(stderr, "%s: Primary domain has been destroyed.\n",
+                libxl_defbool_val(r_info.colo) ? "COLO" : "Remus");
         close(send_fd);
         return 0;
     }
@@ -8115,7 +8148,8 @@ int main_remus(int argc, char **argv)
     if (rc == ERROR_GUEST_TIMEDOUT)
         fprintf(stderr, "Failed to suspend domain at primary.\n");
     else {
-        fprintf(stderr, "Remus: Backup failed? resuming domain at primary.\n");
+        fprintf(stderr, "%s: Backup failed? resuming domain at primary.\n",
+                libxl_defbool_val(r_info.colo) ? "COLO" : "Remus");
         libxl_domain_resume(ctx, domid, 1, 0);
     }
 
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index fdc1ac6..b6b630c 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -499,7 +499,9 @@ struct cmd_spec cmd_table[] = {
       "-b                      Replicate memory checkpoints to /dev/null (blackhole).\n"
       "                        Works only in unsafe mode.\n"
       "-n                      Disable network output buffering. Works only in unsafe mode.\n"
-      "-d                      Disable disk replication. Works only in unsafe mode."
+      "-d                      Disable disk replication. Works only in unsafe mode.\n"
+      "-c                      Enable COLO HA. Conflicts with -i and -b; memory\n"
+      "                        checkpoint compression must be disabled."
     },
 #endif
     { "devd",
-- 
2.5.0


* [PATCH v9 18/25] Support colo mode for qemu disk
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (16 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 17/25] implement the cmdline for COLO Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 19/25] COLO: use qemu block replication Wen Congyang
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Usage: disk = ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
For QEMU block replication details:
http://wiki.qemu.org/Features/BlockReplication
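
As a rough illustration of the usage line above: once "colo" is set, the
other five options are all mandatory. The sketch below is not the libxl
code; struct colo_disk_params and colo_first_missing() are hypothetical
names used only to show the order in which a missing option could be
reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the COLO fields of a disk specification. */
struct colo_disk_params {
    bool colo_enable;
    const char *colo_host;
    const char *colo_port;
    const char *colo_export;
    const char *active_disk;
    const char *hidden_disk;
};

/*
 * Return the name of the first mandatory option that is missing, or NULL
 * if the disk is not COLO-enabled or every option is present.
 */
const char *colo_first_missing(const struct colo_disk_params *d)
{
    assert(d != NULL);

    if (!d->colo_enable)
        return NULL;            /* nothing is required without "colo" */
    if (!d->colo_host)
        return "colo-host";
    if (!d->colo_port)
        return "colo-port";
    if (!d->colo_export)
        return "colo-export";
    if (!d->active_disk)
        return "active-disk";
    if (!d->hidden_disk)
        return "hidden-disk";
    return NULL;
}
```

A caller would typically log the returned name and reject the backend.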

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
---
 docs/man/xl.pod.1                   |   2 +-
 docs/misc/xl-disk-configuration.txt |  50 ++++++++++
 tools/libxl/libxl.c                 |  62 +++++++++++-
 tools/libxl/libxl_create.c          |  25 ++++-
 tools/libxl/libxl_device.c          |  54 +++++++++++
 tools/libxl/libxl_dm.c              | 184 ++++++++++++++++++++++++++++++++++--
 tools/libxl/libxl_types.idl         |   7 ++
 tools/libxl/libxlu_disk_l.l         |   7 ++
 8 files changed, 382 insertions(+), 9 deletions(-)
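
For reviewers, a minimal standalone sketch of how the COLO mode is chosen
and how the secondary IDE drive string is assembled. It uses plain
snprintf() instead of GCSPRINTF, and the names select_colo_mode() and
secondary_ide_drive() are hypothetical; only the format string is taken
from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* The three modes used by the patch; the enum name is illustrative. */
enum colo_mode { COLO_NONE = 0, COLO_PRIMARY, COLO_SECONDARY };

/* colo_restore_enable only matters when colo_enable is set. */
enum colo_mode select_colo_mode(bool colo_enable, bool colo_restore_enable)
{
    if (!colo_enable)
        return COLO_NONE;
    return colo_restore_enable ? COLO_SECONDARY : COLO_PRIMARY;
}

/*
 * Build the secondary-side IDE -drive argument into buf.
 * Returns true only if the whole string fitted into len bytes.
 */
bool secondary_ide_drive(char *buf, size_t len, int index,
                         const char *active_disk, const char *hidden_disk,
                         const char *exportname)
{
    int n;

    assert(buf != NULL && len > 0);

    n = snprintf(buf, len,
                 "if=ide,index=%d,media=disk,cache=writeback,"
                 "driver=replication,mode=secondary,"
                 "file.driver=qcow2,"
                 "file.file.filename=%s,"
                 "file.backing.driver=qcow2,"
                 "file.backing.file.filename=%s,"
                 "file.backing.backing=%s",
                 index, active_disk, hidden_disk, exportname);

    return n >= 0 && (size_t)n < len;
}
```

The backing chain reads active disk, then hidden disk, then the drive
whose id is the export name.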

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index 1c6dd87..4f1901d 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -454,7 +454,7 @@ N.B: Remus support in xl is still in experimental (proof-of-concept) phase.
      Disk replication support is limited to DRBD disks.
 
      COLO support in xl is still in experimental (proof-of-concept) phase.
-     There is no support for network or disk at the moment.
+     There is no support for network at the moment.
 
 B<OPTIONS>
 
diff --git a/docs/misc/xl-disk-configuration.txt b/docs/misc/xl-disk-configuration.txt
index 6a2118d..5be4e76 100644
--- a/docs/misc/xl-disk-configuration.txt
+++ b/docs/misc/xl-disk-configuration.txt
@@ -234,6 +234,56 @@ were intentionally created non-sparse to avoid fragmentation of the
 file.
 
 
+===============
+COLO PARAMETERS
+===============
+
+
+colo
+----
+
+Enable COLO HA for disk. For details on QEMU block replication,
+refer to:
+http://wiki.qemu.org/Features/BlockReplication
+
+
+colo-host
+---------
+Description:           Secondary host's address
+Mandatory:             Yes when COLO enabled
+
+
+colo-port
+---------
+Description:           Secondary port
+                       An NBD server runs on the secondary host and
+                       listens on this port.
+Mandatory:             Yes when COLO enabled
+
+
+colo-export
+-----------
+Description:           Export name of the disk served by the NBD server
+                       that runs on the secondary host.
+Mandatory:             Yes when COLO enabled
+
+
+active-disk
+-----------
+
+Description:           Used by the secondary. The secondary guest's writes
+                       are buffered in this disk.
+Mandatory:             Yes when COLO enabled
+
+
+hidden-disk
+-----------
+
+Description:           Used by the secondary. It buffers the original
+                       content that is modified by the primary VM.
+Mandatory:             Yes when COLO enabled
+
+
 ============================================
 DEPRECATED PARAMETERS, PREFIXES AND SYNTAXES
 ============================================
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index ca6bca1..e770723 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -2315,6 +2315,8 @@ int libxl__device_disk_setdefault(libxl__gc *gc, libxl_device_disk *disk)
     int rc;
 
     libxl_defbool_setdefault(&disk->discard_enable, !!disk->readwrite);
+    libxl_defbool_setdefault(&disk->colo_enable, false);
+    libxl_defbool_setdefault(&disk->colo_restore_enable, false);
 
     rc = libxl__resolve_domid(gc, disk->backend_domname, &disk->backend_domid);
     if (rc < 0) return rc;
@@ -2513,6 +2515,18 @@ static void device_disk_add(libxl__egc *egc, uint32_t domid,
                 flexarray_append(back, "params");
                 flexarray_append(back, GCSPRINTF("%s:%s",
                               libxl__device_disk_string_of_format(disk->format), disk->pdev_path));
+                if (libxl_defbool_val(disk->colo_enable)) {
+                    flexarray_append(back, "colo-host");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->colo_host));
+                    flexarray_append(back, "colo-port");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->colo_port));
+                    flexarray_append(back, "colo-export");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->colo_export));
+                    flexarray_append(back, "active-disk");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->active_disk));
+                    flexarray_append(back, "hidden-disk");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->hidden_disk));
+                }
                 assert(device->backend_kind == LIBXL__DEVICE_KIND_QDISK);
                 break;
             default:
@@ -2628,7 +2642,10 @@ static int libxl__device_disk_from_xs_be(libxl__gc *gc,
         goto cleanup;
     }
 
-    /* "params" may not be present; but everything else must be. */
+    /*
+     * "params" and "colo-host" may not be present; but everything
+     * else must be.
+     */
     tmp = xs_read(ctx->xsh, XBT_NULL,
                   GCSPRINTF("%s/params", be_path), &len);
     if (tmp && strchr(tmp, ':')) {
@@ -2638,6 +2655,49 @@ static int libxl__device_disk_from_xs_be(libxl__gc *gc,
         disk->pdev_path = tmp;
     }
 
+    tmp = xs_read(ctx->xsh, XBT_NULL,
+                  GCSPRINTF("%s/colo-host", be_path), &len);
+    if (tmp) {
+        libxl_defbool_set(&disk->colo_enable, true);
+        disk->colo_host = tmp;
+    } else {
+        libxl_defbool_set(&disk->colo_enable, false);
+    }
+
+    if (libxl_defbool_val(disk->colo_enable)) {
+        tmp = xs_read(ctx->xsh, XBT_NULL,
+                      GCSPRINTF("%s/colo-port", be_path), &len);
+        if (!tmp) {
+            LOG(ERROR, "Missing xenstore node %s/colo-port", be_path);
+            goto cleanup;
+        }
+        disk->colo_port = tmp;
+
+        tmp = xs_read(ctx->xsh, XBT_NULL,
+                      GCSPRINTF("%s/colo-export", be_path), &len);
+        if (!tmp) {
+            LOG(ERROR, "Missing xenstore node %s/colo-export", be_path);
+            goto cleanup;
+        }
+        disk->colo_export = tmp;
+
+        tmp = xs_read(ctx->xsh, XBT_NULL,
+                      GCSPRINTF("%s/active-disk", be_path), &len);
+        if (!tmp) {
+            LOG(ERROR, "Missing xenstore node %s/active-disk", be_path);
+            goto cleanup;
+        }
+        disk->active_disk = tmp;
+
+        tmp = xs_read(ctx->xsh, XBT_NULL,
+                      GCSPRINTF("%s/hidden-disk", be_path), &len);
+        if (!tmp) {
+            LOG(ERROR, "Missing xenstore node %s/hidden-disk", be_path);
+            goto cleanup;
+        }
+        disk->hidden_disk = tmp;
+    }
+
 
     tmp = libxl__xs_read(gc, XBT_NULL,
                          GCSPRINTF("%s/type", be_path));
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index b495741..7cb3c6a 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1782,12 +1782,29 @@ static void domain_create_cb(libxl__egc *egc,
 
     libxl__ao_complete(egc, ao, rc);
 }
-    
+
+static void set_disk_colo_restore(libxl_domain_config *d_config)
+{
+    int i;
+
+    for (i = 0; i < d_config->num_disks; i++)
+        libxl_defbool_set(&d_config->disks[i].colo_restore_enable, true);
+}
+
+static void unset_disk_colo_restore(libxl_domain_config *d_config)
+{
+    int i;
+
+    for (i = 0; i < d_config->num_disks; i++)
+        libxl_defbool_set(&d_config->disks[i].colo_restore_enable, false);
+}
+
 int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
                             uint32_t *domid,
                             const libxl_asyncop_how *ao_how,
                             const libxl_asyncprogress_how *aop_console_how)
 {
+    unset_disk_colo_restore(d_config);
     return do_domain_create(ctx, d_config, domid, -1, -1, NULL,
                             ao_how, aop_console_how);
 }
@@ -1798,6 +1815,12 @@ int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
                                 const libxl_asyncop_how *ao_how,
                                 const libxl_asyncprogress_how *aop_console_how)
 {
+    if (params->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
+        set_disk_colo_restore(d_config);
+    } else {
+        unset_disk_colo_restore(d_config);
+    }
+
     return do_domain_create(ctx, d_config, domid, restore_fd, send_fd, params,
                             ao_how, aop_console_how);
 }
diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index 8bb5e93..039afc6 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -196,6 +196,10 @@ static int disk_try_backend(disk_try_backend_args *a,
             goto bad_format;
         }
 
+        if (libxl_defbool_val(a->disk->colo_enable) ||
+            a->disk->active_disk || a->disk->hidden_disk)
+            goto bad_colo;
+
         if (a->disk->backend_domid != LIBXL_TOOLSTACK_DOMID) {
             LOG(DEBUG, "Disk vdev=%s, is using a storage driver domain, "
                        "skipping physical device check", a->disk->vdev);
@@ -218,6 +222,10 @@ static int disk_try_backend(disk_try_backend_args *a,
     case LIBXL_DISK_BACKEND_TAP:
         if (a->disk->script) goto bad_script;
 
+        if (libxl_defbool_val(a->disk->colo_enable) ||
+            a->disk->active_disk || a->disk->hidden_disk)
+            goto bad_colo;
+
         if (a->disk->is_cdrom) {
             LOG(DEBUG, "Disk vdev=%s, backend tap unsuitable for cdroms",
                        a->disk->vdev);
@@ -236,6 +244,22 @@ static int disk_try_backend(disk_try_backend_args *a,
 
     case LIBXL_DISK_BACKEND_QDISK:
         if (a->disk->script) goto bad_script;
+        if (libxl_defbool_val(a->disk->colo_enable)) {
+            if (!a->disk->colo_host)
+                goto bad_colo_host;
+
+            if (!a->disk->colo_port)
+                goto bad_colo_port;
+
+            if (!a->disk->colo_export)
+                goto bad_colo_export;
+
+            if (!a->disk->active_disk)
+                goto bad_active_disk;
+
+            if (!a->disk->hidden_disk)
+                goto bad_hidden_disk;
+        }
         return backend;
 
     default:
@@ -256,6 +280,36 @@ static int disk_try_backend(disk_try_backend_args *a,
     LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
         a->disk->vdev, libxl_disk_backend_to_string(backend));
     return 0;
+
+ bad_colo:
+    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_colo_host:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-host=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_colo_port:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-port=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_colo_export:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-export=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_active_disk:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs active-disk=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_hidden_disk:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs hidden-disk=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
 }
 
 int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) {
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 0aaefd9..f34b3ac 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -750,6 +750,139 @@ static int libxl__dm_runas_helper(libxl__gc *gc, const char *username)
     }
 }
 
+/* colo mode */
+enum {
+    LIBXL__COLO_NONE = 0,
+    LIBXL__COLO_PRIMARY,
+    LIBXL__COLO_SECONDARY,
+};
+
+static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char *pdev_path,
+                                         int unit, const char *format,
+                                         const libxl_device_disk *disk,
+                                         int colo_mode)
+{
+    char *drive = NULL;
+    const char *exportname = disk->colo_export;
+    const char *active_disk = disk->active_disk;
+    const char *hidden_disk = disk->hidden_disk;
+
+    switch (colo_mode) {
+    case LIBXL__COLO_NONE:
+        drive = libxl__sprintf
+            (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
+             pdev_path, unit, format);
+        break;
+    case LIBXL__COLO_PRIMARY:
+        /*
+         * primary:
+         *  -drive if=scsi,bus=0,unit=x,cache=writeback,driver=quorum,\
+         *  id=exportname,\
+         *  children.0.file.filename=pdev_path,\
+         *  children.0.driver=format,\
+         *  read-pattern=fifo,\
+         *  vote-threshold=1
+         */
+        drive = GCSPRINTF(
+            "if=scsi,bus=0,unit=%d,cache=writeback,driver=quorum,"
+            "id=%s,"
+            "children.0.file.filename=%s,"
+            "children.0.driver=%s,"
+            "read-pattern=fifo,"
+            "vote-threshold=1",
+            unit, exportname, pdev_path, format);
+        break;
+    case LIBXL__COLO_SECONDARY:
+        /*
+         * secondary:
+         *  -drive if=scsi,bus=0,unit=x,cache=writeback,driver=replication,\
+         *  mode=secondary,\
+         *  file.driver=qcow2,\
+         *  file.file.filename=active_disk,\
+         *  file.backing.driver=qcow2,\
+         *  file.backing.file.filename=hidden_disk,\
+         *  file.backing.backing=exportname,
+         */
+        drive = GCSPRINTF(
+            "if=scsi,bus=0,unit=%d,cache=writeback,driver=replication,"
+            "mode=secondary,"
+            "file.driver=qcow2,"
+            "file.file.filename=%s,"
+            "file.backing.driver=qcow2,"
+            "file.backing.file.filename=%s,"
+            "file.backing.backing=%s",
+            unit, active_disk, hidden_disk, exportname);
+        break;
+    default:
+        abort();
+    }
+
+    return drive;
+}
+
+static char *qemu_disk_ide_drive_string(libxl__gc *gc, const char *pdev_path,
+                                        int unit, const char *format,
+                                        const libxl_device_disk *disk,
+                                        int colo_mode)
+{
+    char *drive = NULL;
+    const char *exportname = disk->colo_export;
+    const char *active_disk = disk->active_disk;
+    const char *hidden_disk = disk->hidden_disk;
+
+    switch (colo_mode) {
+    case LIBXL__COLO_NONE:
+        drive = GCSPRINTF
+            ("file=%s,if=ide,index=%d,media=disk,format=%s,cache=writeback",
+             pdev_path, unit, format);
+        break;
+    case LIBXL__COLO_PRIMARY:
+        /*
+         * primary:
+         *  -drive if=ide,index=x,media=disk,cache=writeback,driver=quorum,\
+         *  id=exportname,\
+         *  children.0.file.filename=pdev_path,\
+         *  children.0.driver=format,\
+         *  read-pattern=fifo,\
+         *  vote-threshold=1
+         */
+        drive = GCSPRINTF(
+            "if=ide,index=%d,media=disk,cache=writeback,driver=quorum,"
+            "id=%s,"
+            "children.0.file.filename=%s,"
+            "children.0.driver=%s,"
+            "read-pattern=fifo,"
+            "vote-threshold=1",
+             unit, exportname, pdev_path, format);
+        break;
+    case LIBXL__COLO_SECONDARY:
+        /*
+         * secondary:
+         *  -drive if=ide,index=x,media=disk,cache=writeback,driver=replication,\
+         *  mode=secondary,\
+         *  file.driver=qcow2,\
+         *  file.file.filename=active_disk,\
+         *  file.backing.driver=qcow2,\
+         *  file.backing.file.filename=hidden_disk,\
+         *  file.backing.backing=exportname,
+         */
+        drive = GCSPRINTF(
+            "if=ide,index=%d,media=disk,cache=writeback,driver=replication,"
+            "mode=secondary,"
+            "file.driver=qcow2,"
+            "file.file.filename=%s,"
+            "file.backing.driver=qcow2,"
+            "file.backing.file.filename=%s,"
+            "file.backing.backing=%s",
+            unit, active_disk, hidden_disk, exportname);
+        break;
+    default:
+        abort();
+    }
+
+    return drive;
+}
+
 static int libxl__build_device_model_args_new(libxl__gc *gc,
                                         const char *dm, int guest_domid,
                                         const libxl_domain_config *guest_config,
@@ -1163,6 +1296,7 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
             const char *format = qemu_disk_format_string(disks[i].format);
             char *drive;
             const char *pdev_path;
+            int colo_mode;
 
             if (dev_number == -1) {
                 LOG(WARN, "unable to determine"" disk number for %s",
@@ -1207,10 +1341,32 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
                  * For other disks we translate devices 0..3 into
                  * hd[a-d] and ignore the rest.
                  */
+                if (libxl_defbool_val(disks[i].colo_enable)) {
+                    if (libxl_defbool_val(disks[i].colo_restore_enable))
+                        colo_mode = LIBXL__COLO_SECONDARY;
+                    else
+                        colo_mode = LIBXL__COLO_PRIMARY;
+                } else {
+                    colo_mode = LIBXL__COLO_NONE;
+                }
+
                 if (strncmp(disks[i].vdev, "sd", 2) == 0) {
-                    drive = libxl__sprintf
-                        (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,readonly=%s,cache=writeback",
-                         pdev_path, disk, format, disks[i].readwrite ? "off" : "on");
+                    if (colo_mode == LIBXL__COLO_SECONDARY) {
+                        /*
+                         * -drive if=none,driver=format,file=pdev_path,\
+                         * id=exportname
+                         */
+                        drive = libxl__sprintf
+                            (gc, "if=none,driver=%s,file=%s,id=%s",
+                             format, pdev_path, disks[i].colo_export);
+
+                        flexarray_append(dm_args, "-drive");
+                        flexarray_append(dm_args, drive);
+                    }
+                    drive = qemu_disk_scsi_drive_string(gc, pdev_path, disk,
+                                                        format,
+                                                        &disks[i],
+                                                        colo_mode);
                 } else if (strncmp(disks[i].vdev, "xvd", 3) == 0) {
                     /*
                      * Do not add any emulated disk when PV disk are
@@ -1233,12 +1389,28 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
                         LOG(ERROR, "qemu-xen doesn't support read-only IDE disk drivers");
                         return ERROR_INVAL;
                     }
-                    drive = libxl__sprintf
-                        (gc, "file=%s,if=ide,index=%d,media=disk,format=%s,cache=writeback",
-                         pdev_path, disk, format);
+                    if (colo_mode == LIBXL__COLO_SECONDARY) {
+                        /*
+                         * -drive if=none,driver=format,file=pdev_path,\
+                         * id=exportname
+                         */
+                        drive = libxl__sprintf
+                            (gc, "if=none,driver=%s,file=%s,id=%s",
+                             format, pdev_path, disks[i].colo_export);
+
+                        flexarray_append(dm_args, "-drive");
+                        flexarray_append(dm_args, drive);
+                    }
+                    drive = qemu_disk_ide_drive_string(gc, pdev_path, disk,
+                                                       format,
+                                                       &disks[i],
+                                                       colo_mode);
                 } else {
                     continue; /* Do not emulate this disk */
                 }
+
+                if (!drive)
+                    continue;
             }
 
             flexarray_append(dm_args, "-drive");
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 7c46bc2..38f37f2 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -574,6 +574,13 @@ libxl_device_disk = Struct("device_disk", [
     ("is_cdrom", integer),
     ("direct_io_safe", bool),
     ("discard_enable", libxl_defbool),
+    ("colo_enable", libxl_defbool),
+    ("colo_restore_enable", libxl_defbool),
+    ("colo_host", string),
+    ("colo_port", string),
+    ("colo_export", string),
+    ("active_disk", string),
+    ("hidden_disk", string)
     ])
 
 libxl_device_nic = Struct("device_nic", [
diff --git a/tools/libxl/libxlu_disk_l.l b/tools/libxl/libxlu_disk_l.l
index 1a5deb5..58da943 100644
--- a/tools/libxl/libxlu_disk_l.l
+++ b/tools/libxl/libxlu_disk_l.l
@@ -176,6 +176,13 @@ script=[^,]*,?	{ STRIP(','); SAVESTRING("script", script, FROMEQUALS); }
 direct-io-safe,? { DPC->disk->direct_io_safe = 1; }
 discard,?	{ libxl_defbool_set(&DPC->disk->discard_enable, true); }
 no-discard,?	{ libxl_defbool_set(&DPC->disk->discard_enable, false); }
+colo,?		{ libxl_defbool_set(&DPC->disk->colo_enable, true); }
+no-colo,?	{ libxl_defbool_set(&DPC->disk->colo_enable, false); }
+colo-host=[^,]*,?	{ STRIP(','); SAVESTRING("colo-host", colo_host, FROMEQUALS); }
+colo-port=[^,]*,?	{ STRIP(','); SAVESTRING("colo-port", colo_port, FROMEQUALS); }
+colo-export=[^,]*,?	{ STRIP(','); SAVESTRING("colo-export", colo_export, FROMEQUALS); }
+active-disk=[^,]*,?	{ STRIP(','); SAVESTRING("active-disk", active_disk, FROMEQUALS); }
+hidden-disk=[^,]*,?	{ STRIP(','); SAVESTRING("hidden-disk", hidden_disk, FROMEQUALS); }
 
  /* the target magic parameter, eats the rest of the string */
 
-- 
2.5.0


* [PATCH v9 19/25] COLO: use qemu block replication
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (17 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 18/25] Support colo mode for qemu disk Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 20/25] COLO proxy: implement setup/teardown of COLO proxy module Wen Congyang
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Use qemu block replication as our block replication solution.
Note that the guest must be paused before starting COLO; otherwise
the disk will not be consistent between primary and secondary.
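
On the secondary, the NBD server is started once and every COLO disk is
then added to it as an export, so all disks must name the same host and
port. The sketch below models only that bookkeeping; struct nbd_state and
secondary_add_disk() are hypothetical, and the QMP calls are replaced by
comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-domain state for the secondary side. */
struct nbd_state {
    bool server_started;
    const char *host;
    const char *port;
    int exports;            /* number of disks exported so far */
};

/* Returns 0 on success, -1 if this disk names a different host or port. */
int secondary_add_disk(struct nbd_state *st, const char *host,
                       const char *port)
{
    assert(st != NULL && host != NULL && port != NULL);

    if (!st->server_started) {
        /* First COLO disk: the NBD server would be started here. */
        st->host = host;
        st->port = port;
        st->server_started = true;
    } else if (strcmp(st->host, host) != 0 || strcmp(st->port, port) != 0) {
        /* Only one NBD server exists, so the address must match. */
        return -1;
    }

    /* The disk's export would be added to the server here. */
    st->exports++;
    return 0;
}
```

The primary side has no equivalent step at setup time, because the NBD
server is not yet ready; it connects later, at the first preresume.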

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
for commit message,
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
---
 tools/libxl/Makefile             |   1 +
 tools/libxl/libxl_colo_qdisk.c   | 262 +++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_colo_restore.c |  31 ++++-
 tools/libxl/libxl_colo_save.c    |  45 ++++++-
 tools/libxl/libxl_internal.h     |  34 +++++
 tools/libxl/libxl_qmp.c          |  93 ++++++++++++++
 6 files changed, 462 insertions(+), 4 deletions(-)
 create mode 100644 tools/libxl/libxl_colo_qdisk.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index b11cf34..a4156c1 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -64,6 +64,7 @@ endif
 
 LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
 LIBXL_OBJS-y += libxl_colo_restore.o libxl_colo_save.o
+LIBXL_OBJS-y += libxl_colo_qdisk.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o
diff --git a/tools/libxl/libxl_colo_qdisk.c b/tools/libxl/libxl_colo_qdisk.c
new file mode 100644
index 0000000..d5de278
--- /dev/null
+++ b/tools/libxl/libxl_colo_qdisk.c
@@ -0,0 +1,262 @@
+/*
+ * Copyright (C) 2015 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+typedef struct libxl__colo_qdisk {
+    bool setuped;
+} libxl__colo_qdisk;
+
+/* ========== init() and cleanup() ========== */
+int init_subkind_qdisk(libxl__checkpoint_devices_state *cds)
+{
+    /*
+     * We don't know if we use qemu block replication, so
+     * we cannot start block replication here.
+     */
+    return 0;
+}
+
+void cleanup_subkind_qdisk(libxl__checkpoint_devices_state *cds)
+{
+}
+
+/* ========== setup() and teardown() ========== */
+static void colo_qdisk_setup(libxl__egc *egc, libxl__checkpoint_device *dev,
+                             bool primary)
+{
+    const libxl_device_disk *disk = dev->backend_dev;
+    int ret, rc = 0;
+    libxl__colo_qdisk *colo_qdisk = NULL;
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = dev->cds;
+    const char *host = disk->colo_host;
+    const char *port = disk->colo_port;
+    const char *export_name = disk->colo_export;
+    const int domid = cds->domid;
+
+    STATE_AO_GC(dev->cds->ao);
+
+    if (disk->backend != LIBXL_DISK_BACKEND_QDISK ||
+        !libxl_defbool_val(disk->colo_enable)) {
+        rc = ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH;
+        goto out;
+    }
+
+    dev->matched = true;
+
+    GCNEW(colo_qdisk);
+    dev->concrete_data = colo_qdisk;
+
+    if (primary) {
+        libxl__colo_save_state *css = cds->concrete_data;
+
+        css->qdisk_used = true;
+        /* NBD server is not ready, so we cannot start block replication now */
+        goto out;
+    } else {
+        libxl__colo_restore_state *crs = cds->concrete_data;
+
+        if (!crs->qdisk_used) {
+            /* start nbd server */
+            ret = libxl__qmp_nbd_server_start(gc, domid, host, port);
+            if (ret) {
+                rc = ERROR_FAIL;
+                goto out;
+            }
+            crs->host = host;
+            crs->port = port;
+        } else {
+            if (strcmp(crs->host, host) || strcmp(crs->port, port)) {
+                LOG(ERROR, "The host and port of all disks must be the same");
+                rc = ERROR_FAIL;
+                goto out;
+            }
+        }
+
+        crs->qdisk_used = true;
+
+        ret = libxl__qmp_nbd_server_add(gc, domid, export_name);
+        if (ret)
+            rc = ERROR_FAIL;
+
+        colo_qdisk->setuped = true;
+    }
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+static void colo_qdisk_teardown(libxl__egc *egc, libxl__checkpoint_device *dev,
+                                bool primary)
+{
+    int ret, rc = 0;
+    const libxl__colo_qdisk *colo_qdisk = dev->concrete_data;
+    const libxl_device_disk *disk = dev->backend_dev;
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = dev->cds;
+    const int domid = cds->domid;
+    const char *export_name = disk->colo_export;
+
+    EGC_GC;
+
+    if (primary) {
+        libxl__colo_save_state *css = cds->concrete_data;
+
+        if (css->qdisk_setuped) {
+            css->qdisk_setuped = false;
+            ret = libxl__qmp_block_stop_replication(gc, domid, false);
+            if (ret)
+                rc = ERROR_FAIL;
+        }
+
+        if (!colo_qdisk->setuped)
+            goto out;
+
+        /*
+         * There is no way to get the child name, but we know it is children.1
+         */
+        ret = libxl__qmp_x_blockdev_change(gc, domid, export_name,
+                                           "children.1", NULL);
+        if (ret)
+            rc = ERROR_FAIL;
+    } else {
+        libxl__colo_restore_state *crs = cds->concrete_data;
+
+        if (crs->qdisk_setuped) {
+            crs->qdisk_setuped = false;
+
+            ret = libxl__qmp_block_stop_replication(gc, domid, false);
+            if (ret)
+                rc = ERROR_FAIL;
+        }
+
+        if (crs->qdisk_used) {
+            ret = libxl__qmp_nbd_server_stop(gc, domid);
+            if (ret)
+                rc = ERROR_FAIL;
+        }
+    }
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+/* ========== checkpointing APIs ========== */
+/* should be called after libxl__checkpoint_device_instance_ops.preresume */
+int colo_qdisk_preresume(libxl_ctx *ctx, domid_t domid)
+{
+    GC_INIT(ctx);
+    int ret;
+
+    ret = libxl__qmp_block_do_checkpoint(gc, domid);
+
+    GC_FREE;
+    return ret;
+}
+
+int colo_qdisk_start(libxl__egc *egc, domid_t domid, bool primary)
+{
+    EGC_GC;
+
+    return libxl__qmp_block_start_replication(gc, domid, primary);
+}
+
+static void colo_qdisk_save_preresume(libxl__egc *egc,
+                                      libxl__checkpoint_device *dev)
+{
+    libxl__colo_qdisk *colo_qdisk = dev->concrete_data;
+    const libxl_device_disk *disk = dev->backend_dev;
+    int ret, rc = 0;
+    char *node = NULL;
+    char *cmd = NULL;
+
+    /* Convenience aliases */
+    const int domid = dev->cds->domid;
+    const char *host = disk->colo_host;
+    const char *port = disk->colo_port;
+    const char *export_name = disk->colo_export;
+
+    EGC_GC;
+
+    if (colo_qdisk->setuped)
+        goto out;
+
+    /* qmp command doesn't support the driver "nbd" */
+    node = GCSPRINTF("colo_node%d",
+                     libxl__device_disk_dev_number(disk->vdev, NULL, NULL));
+    cmd = GCSPRINTF("drive_add buddy driver=replication,mode=primary,"
+                    "file.driver=nbd,file.host=%s,file.port=%s,"
+                    "file.export=%s,node-name=%s,if=none",
+                    host, port, export_name, node);
+    ret = libxl__qmp_hmp(gc, domid, cmd);
+    if (ret)
+        rc = ERROR_FAIL;
+
+    ret = libxl__qmp_x_blockdev_change(gc, domid, export_name, NULL, node);
+    if (ret)
+        rc = ERROR_FAIL;
+
+    colo_qdisk->setuped = true;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+/* ======== primary ======== */
+static void colo_qdisk_save_setup(libxl__egc *egc,
+                                  libxl__checkpoint_device *dev)
+{
+    colo_qdisk_setup(egc, dev, true);
+}
+
+static void colo_qdisk_save_teardown(libxl__egc *egc,
+                                     libxl__checkpoint_device *dev)
+{
+    colo_qdisk_teardown(egc, dev, true);
+}
+
+const libxl__checkpoint_device_instance_ops colo_save_device_qdisk = {
+    .kind = LIBXL__DEVICE_KIND_VBD,
+    .setup = colo_qdisk_save_setup,
+    .teardown = colo_qdisk_save_teardown,
+    .preresume = colo_qdisk_save_preresume,
+};
+
+/* ======== secondary ======== */
+static void colo_qdisk_restore_setup(libxl__egc *egc,
+                                     libxl__checkpoint_device *dev)
+{
+    colo_qdisk_setup(egc, dev, false);
+}
+
+static void colo_qdisk_restore_teardown(libxl__egc *egc,
+                                        libxl__checkpoint_device *dev)
+{
+    colo_qdisk_teardown(egc, dev, false);
+}
+
+const libxl__checkpoint_device_instance_ops colo_restore_device_qdisk = {
+    .kind = LIBXL__DEVICE_KIND_VBD,
+    .setup = colo_qdisk_restore_setup,
+    .teardown = colo_qdisk_restore_teardown,
+};
diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
index f23ef8f..e5cfbe5 100644
--- a/tools/libxl/libxl_colo_restore.c
+++ b/tools/libxl/libxl_colo_restore.c
@@ -50,7 +50,10 @@ static void libxl__colo_restore_domain_checkpoint_callback(void *data);
 static void libxl__colo_restore_domain_should_checkpoint_callback(void *data);
 static void libxl__colo_restore_domain_suspend_callback(void *data);
 
+extern const libxl__checkpoint_device_instance_ops colo_restore_device_qdisk;
+
 static const libxl__checkpoint_device_instance_ops *colo_restore_ops[] = {
+    &colo_restore_device_qdisk,
     NULL,
 };
 
@@ -150,7 +153,11 @@ static int init_device_subkind(libxl__checkpoint_devices_state *cds)
     int rc;
     STATE_AO_GC(cds->ao);
 
+    rc = init_subkind_qdisk(cds);
+    if (rc) goto out;
+
     rc = 0;
+out:
     return rc;
 }
 
@@ -158,6 +165,8 @@ static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
 {
     /* cleanup device subkind-specific state in the libxl ctx */
     STATE_AO_GC(cds->ao);
+
+    cleanup_subkind_qdisk(cds);
 }
 
 
@@ -217,6 +226,8 @@ void libxl__colo_restore_setup(libxl__egc *egc,
     GCNEW(crcs);
     crs->crcs = crcs;
     crcs->crs = crs;
+    crs->qdisk_setuped = false;
+    crs->qdisk_used = false;
 
     /* setup dsps */
     crcs->dsps.ao = ao;
@@ -585,6 +596,22 @@ static void colo_restore_preresume_cb(libxl__egc *egc,
         goto out;
     }
 
+    if (crs->qdisk_used && !crs->qdisk_setuped) {
+        if (colo_qdisk_start(egc, crs->domid, false)) {
+            LOG(ERROR, "failed to start block replication");
+            goto out;
+        }
+        crs->qdisk_setuped = true;
+    }
+
+    if (crs->qdisk_setuped) {
+        rc = colo_qdisk_preresume(CTX, crs->domid);
+        if (rc) {
+            LOG(ERROR, "colo_qdisk_preresume() fails");
+            goto out;
+        }
+    }
+
     colo_restore_resume_vm(egc, crcs);
 
     return;
@@ -742,8 +769,8 @@ static void colo_setup_checkpoint_devices(libxl__egc *egc,
 
     STATE_AO_GC(crs->ao);
 
-    /* TODO: disk/nic support */
-    cds->device_kind_flags = 0;
+    /* TODO: nic support */
+    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VBD);
     cds->callback = colo_restore_setup_cds_done;
     cds->ao = ao;
     cds->domid = crs->domid;
diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
index d6b4e7b..78fcc60 100644
--- a/tools/libxl/libxl_colo_save.c
+++ b/tools/libxl/libxl_colo_save.c
@@ -19,7 +19,10 @@
 #include "libxl_internal.h"
 #include "libxl_colo.h"
 
+extern const libxl__checkpoint_device_instance_ops colo_save_device_qdisk;
+
 static const libxl__checkpoint_device_instance_ops *colo_ops[] = {
+    &colo_save_device_qdisk,
     NULL,
 };
 
@@ -30,7 +33,11 @@ static int init_device_subkind(libxl__checkpoint_devices_state *cds)
     int rc;
     STATE_AO_GC(cds->ao);
 
+    rc = init_subkind_qdisk(cds);
+    if (rc) goto out;
+
     rc = 0;
+out:
     return rc;
 }
 
@@ -38,6 +45,8 @@ static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
 {
     /* cleanup device subkind-specific state in the libxl ctx */
     STATE_AO_GC(cds->ao);
+
+    cleanup_subkind_qdisk(cds);
 }
 
 /* ================= colo: setup save environment ================= */
@@ -65,9 +74,12 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
     css->send_fd = dss->fd;
     css->recv_fd = dss->recv_fd;
     css->svm_running = false;
+    css->paused = true;
+    css->qdisk_setuped = false;
+    css->qdisk_used = false;
 
-    /* TODO: disk/nic support */
-    cds->device_kind_flags = 0;
+    /* TODO: nic support */
+    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VBD);
     cds->ops = colo_ops;
     cds->callback = colo_save_setup_done;
     cds->ao = ao;
@@ -373,12 +385,41 @@ static void colo_preresume_cb(libxl__egc *egc,
         goto out;
     }
 
+    if (css->qdisk_used && !css->qdisk_setuped) {
+        if (colo_qdisk_start(egc, dss->domid, true)) {
+            LOG(ERROR, "failed to start block replication");
+            goto out;
+        }
+        css->qdisk_setuped = true;
+    }
+
+    if (!css->paused) {
+        rc = colo_qdisk_preresume(CTX, dss->domid);
+        if (rc) {
+            LOG(ERROR, "colo_qdisk_preresume() fails");
+            goto out;
+        }
+    }
+
     /* Resumes the domain and the device model */
     if (libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1)) {
         LOG(ERROR, "cannot resume primary vm");
         goto out;
     }
 
+    /*
+     * The guest should be paused before doing colo because there is
+     * no disk migration.
+     */
+    if (css->paused) {
+        rc = libxl_domain_unpause(CTX, dss->domid);
+        if (rc) {
+            LOG(ERROR, "cannot unpause primary vm");
+            goto out;
+        }
+        css->paused = false;
+    }
+
     /* read CHECKPOINT_SVM_RESUMED */
     css->callback = colo_read_svm_resumed_done;
     css->srs.checkpoint_callback = colo_common_read_stream_done;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 54903af..ddf4980 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1767,6 +1767,25 @@ _hidden int libxl__qmp_set_global_dirty_log(libxl__gc *gc, int domid, bool enabl
 _hidden int libxl__qmp_insert_cdrom(libxl__gc *gc, int domid, const libxl_device_disk *disk);
 /* Add a virtual CPU */
 _hidden int libxl__qmp_cpu_add(libxl__gc *gc, int domid, int index);
+/* Start NBD server */
+_hidden int libxl__qmp_nbd_server_start(libxl__gc *gc, int domid,
+                                        const char *host, const char *port);
+/* Add a disk to NBD server */
+_hidden int libxl__qmp_nbd_server_add(libxl__gc *gc, int domid, const char *disk);
+/* Start block replication */
+_hidden int libxl__qmp_block_start_replication(libxl__gc *gc, int domid, bool primary);
+/* Do block checkpoint */
+_hidden int libxl__qmp_block_do_checkpoint(libxl__gc *gc, int domid);
+/* Stop block replication */
+_hidden int libxl__qmp_block_stop_replication(libxl__gc *gc, int domid,
+                                              bool primary);
+/* Stop NBD server */
+_hidden int libxl__qmp_nbd_server_stop(libxl__gc *gc, int domid);
+/* Add or remove a child to/from quorum */
+_hidden int libxl__qmp_x_blockdev_change(libxl__gc *gc, int domid, const char *parent,
+                                         const char *child, const char *node);
+/* Run an HMP command through QMP */
+_hidden int libxl__qmp_hmp(libxl__gc *gc, int domid, const char *command_line);
 /* close and free the QMP handler */
 _hidden void libxl__qmp_close(libxl__qmp_handler *qmp);
 /* remove the socket file, if the file has already been removed,
@@ -2878,6 +2897,10 @@ int init_subkind_nic(libxl__checkpoint_devices_state *cds);
 void cleanup_subkind_nic(libxl__checkpoint_devices_state *cds);
 int init_subkind_drbd_disk(libxl__checkpoint_devices_state *cds);
 void cleanup_subkind_drbd_disk(libxl__checkpoint_devices_state *cds);
+int init_subkind_qdisk(libxl__checkpoint_devices_state *cds);
+void cleanup_subkind_qdisk(libxl__checkpoint_devices_state *cds);
+int colo_qdisk_preresume(libxl_ctx *ctx, domid_t domid);
+int colo_qdisk_start(libxl__egc *egc, domid_t domid, bool primary);
 
 typedef void libxl__checkpoint_callback(libxl__egc *,
                                         libxl__checkpoint_devices_state *,
@@ -3095,6 +3118,11 @@ struct libxl__colo_save_state {
     libxl__stream_read_state srs;
     void (*callback)(libxl__egc *, libxl__colo_save_state *, int);
     bool svm_running;
+    bool paused;
+
+    /* private, used by qdisk block replication */
+    bool qdisk_used;
+    bool qdisk_setuped;
 };
 
 /*----- Domain suspend (save) state structure -----*/
@@ -3500,6 +3528,12 @@ struct libxl__colo_restore_state {
     /* private, colo restore checkpoint state */
     libxl__domain_create_cb *saved_cb;
     void *crcs;
+
+    /* private, used by qdisk block replication */
+    bool qdisk_used;
+    bool qdisk_setuped;
+    const char *host;
+    const char *port;
 };
 
 struct libxl__domain_create_state {
diff --git a/tools/libxl/libxl_qmp.c b/tools/libxl/libxl_qmp.c
index eec8a44..d5a8d7f 100644
--- a/tools/libxl/libxl_qmp.c
+++ b/tools/libxl/libxl_qmp.c
@@ -978,6 +978,99 @@ int libxl__qmp_cpu_add(libxl__gc *gc, int domid, int idx)
     return qmp_run_command(gc, domid, "cpu-add", args, NULL, NULL);
 }
 
+int libxl__qmp_nbd_server_start(libxl__gc *gc, int domid,
+                                const char *host, const char *port)
+{
+    libxl__json_object *args = NULL;
+    libxl__json_object *addr = NULL;
+    libxl__json_object *data = NULL;
+
+    /* 'addr': {
+     *   'type': 'inet',
+     *   'data': {
+     *     'host': '$nbd_host',
+     *     'port': '$nbd_port'
+     *   }
+     * }
+     */
+    qmp_parameters_add_string(gc, &data, "host", host);
+    qmp_parameters_add_string(gc, &data, "port", port);
+
+    qmp_parameters_add_string(gc, &addr, "type", "inet");
+    qmp_parameters_common_add(gc, &addr, "data", data);
+
+    qmp_parameters_common_add(gc, &args, "addr", addr);
+
+    return qmp_run_command(gc, domid, "nbd-server-start", args, NULL, NULL);
+}
+
+int libxl__qmp_nbd_server_add(libxl__gc *gc, int domid, const char *disk)
+{
+    libxl__json_object *args = NULL;
+
+    qmp_parameters_add_string(gc, &args, "device", disk);
+    qmp_parameters_add_bool(gc, &args, "writable", true);
+
+    return qmp_run_command(gc, domid, "nbd-server-add", args, NULL, NULL);
+}
+
+int libxl__qmp_block_start_replication(libxl__gc *gc, int domid, bool primary)
+{
+    libxl__json_object *args = NULL;
+
+    qmp_parameters_add_bool(gc, &args, "enable", true);
+    qmp_parameters_add_bool(gc, &args, "primary", primary);
+
+    return qmp_run_command(gc, domid, "xen-set-block-replication", args,
+                           NULL, NULL);
+}
+
+int libxl__qmp_block_do_checkpoint(libxl__gc *gc, int domid)
+{
+    return qmp_run_command(gc, domid, "xen-do-block-checkpoint", NULL,
+                           NULL, NULL);
+}
+
+int libxl__qmp_block_stop_replication(libxl__gc *gc, int domid, bool primary)
+{
+    libxl__json_object *args = NULL;
+
+    qmp_parameters_add_bool(gc, &args, "enable", false);
+    qmp_parameters_add_bool(gc, &args, "primary", primary);
+
+    return qmp_run_command(gc, domid, "xen-set-block-replication", args,
+                           NULL, NULL);
+}
+
+int libxl__qmp_nbd_server_stop(libxl__gc *gc, int domid)
+{
+    return qmp_run_command(gc, domid, "nbd-server-stop", NULL, NULL, NULL);
+}
+
+int libxl__qmp_x_blockdev_change(libxl__gc *gc, int domid, const char *parent,
+                                 const char *child, const char *node)
+{
+    libxl__json_object *args = NULL;
+
+    qmp_parameters_add_string(gc, &args, "parent", parent);
+    if (child)
+        qmp_parameters_add_string(gc, &args, "child", child);
+    if (node)
+        qmp_parameters_add_string(gc, &args, "node", node);
+
+    return qmp_run_command(gc, domid, "x-blockdev-change", args, NULL, NULL);
+}
+
+int libxl__qmp_hmp(libxl__gc *gc, int domid, const char *command_line)
+{
+    libxl__json_object *args = NULL;
+
+    qmp_parameters_add_string(gc, &args, "command-line", command_line);
+
+    return qmp_run_command(gc, domid, "human-monitor-command", args,
+                           NULL, NULL);
+}
+
 int libxl__qmp_initializations(libxl__gc *gc, uint32_t domid,
                                const libxl_domain_config *guest_config)
 {
-- 
2.5.0
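
Note on the QMP arguments: the comment in libxl__qmp_nbd_server_start()
describes the nested 'addr' object. Below is a minimal sketch of the request
as it would appear on the wire, assuming host and port strings that need no
JSON escaping. format_nbd_server_start() is a hypothetical helper used only
for illustration; libxl itself builds the object with the qmp_parameters_*
helpers shown in the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libxl: format the QMP request that
 * corresponds to the 'addr' layout documented in the patch.
 * Assumption: host and port contain no characters that need JSON escaping.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int format_nbd_server_start(char *out, size_t outlen,
                                   const char *host, const char *port)
{
    int n;

    assert(out != NULL && host != NULL && port != NULL);

    n = snprintf(out, outlen,
                 "{\"execute\": \"nbd-server-start\", \"arguments\": "
                 "{\"addr\": {\"type\": \"inet\", \"data\": "
                 "{\"host\": \"%s\", \"port\": \"%s\"}}}}",
                 host, port);

    /* snprintf reports the length it wanted; treat truncation as failure */
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The port is passed as a string rather than a number, matching the
qmp_parameters_add_string() calls in the patch.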


* [PATCH v9 20/25] COLO proxy: implement setup/teardown of COLO proxy module
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Implement setup and teardown of the COLO proxy module.
libxl uses a netlink socket to communicate with the proxy module.

About the colo-proxy module:
https://lkml.org/lkml/2015/6/18/32
How to use it:
http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/Makefile           |   1 +
 tools/libxl/libxl_colo.h       |   2 +
 tools/libxl/libxl_colo_proxy.c | 230 +++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_internal.h   |  15 +++
 4 files changed, 248 insertions(+)
 create mode 100644 tools/libxl/libxl_colo_proxy.c
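
Note on the netlink exchange: setup sends a payload-less COLO_PROXY_INIT
request with NLM_F_ACK set and then expects an NLMSG_ERROR message whose
error field is 0. Below is a minimal, Linux-only sketch of those two steps
without any socket I/O. The sketch_* names are hypothetical and exist only
for illustration. Unlike colo_proxy_setup(), the check here also rejects
replies that are not NLMSG_ERROR, which is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Same value as COLO_PROXY_INIT in the patch's enum colo_netlink_op. */
#define SKETCH_COLO_PROXY_INIT (NLMSG_MIN_TYPE + 4)

/* Fill a payload-less request header, as colo_proxy_send() does. */
static void sketch_build_request(struct nlmsghdr *h, int type,
                                 unsigned int index)
{
    assert(h != NULL);
    memset(h, 0, sizeof(*h));
    h->nlmsg_len = NLMSG_LENGTH(0);      /* header only, no payload */
    h->nlmsg_type = type;
    h->nlmsg_flags = NLM_F_REQUEST;
    if (type == SKETCH_COLO_PROXY_INIT)
        h->nlmsg_flags |= NLM_F_ACK;     /* only INIT asks for an ack */
    h->nlmsg_pid = index;
}

/* Return 0 if buf holds a positive ack, -1 otherwise. */
static int sketch_check_ack(void *buf, size_t len)
{
    struct nlmsghdr *h = buf;
    struct nlmsgerr *err;

    if (len < (size_t)NLMSG_LENGTH(sizeof(*err)))
        return -1;                       /* too short to hold an ack */
    if (h->nlmsg_type != NLMSG_ERROR)
        return -1;
    if (h->nlmsg_len < (size_t)NLMSG_LENGTH(sizeof(*err)) ||
        h->nlmsg_len > len)
        return -1;                       /* inconsistent length field */
    err = NLMSG_DATA(h);
    return err->error ? -1 : 0;          /* error == 0 means success */
}
```

For an empty payload NLMSG_LENGTH(0) and the NLMSG_SPACE(0) used in the
patch give the same value, the aligned header size.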

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index a4156c1..8c7e5c0 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -65,6 +65,7 @@ endif
 LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
 LIBXL_OBJS-y += libxl_colo_restore.o libxl_colo_save.o
 LIBXL_OBJS-y += libxl_colo_qdisk.o
+LIBXL_OBJS-y += libxl_colo_proxy.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o
diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
index 39515c4..2604b0f 100644
--- a/tools/libxl/libxl_colo.h
+++ b/tools/libxl/libxl_colo.h
@@ -31,4 +31,6 @@ extern void libxl__colo_save_teardown(libxl__egc *egc,
                                       libxl__colo_save_state *css,
                                       int rc);
 
+extern int colo_proxy_setup(libxl__colo_proxy_state *cps);
+extern void colo_proxy_teardown(libxl__colo_proxy_state *cps);
 #endif
diff --git a/tools/libxl/libxl_colo_proxy.c b/tools/libxl/libxl_colo_proxy.c
new file mode 100644
index 0000000..e07e640
--- /dev/null
+++ b/tools/libxl/libxl_colo_proxy.c
@@ -0,0 +1,230 @@
+/*
+ * Copyright (C) 2015 FUJITSU LIMITED
+ * Author: Yang Hongyang <hongyang.yang@easystack.cn>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+#include "libxl_colo.h"
+#include <linux/netlink.h>
+
+#define NETLINK_COLO 28
+
+enum colo_netlink_op {
+    COLO_QUERY_CHECKPOINT = (NLMSG_MIN_TYPE + 1),
+    COLO_CHECKPOINT,
+    COLO_FAILOVER,
+    COLO_PROXY_INIT,
+    COLO_PROXY_RESET, /* UNUSED, will be used for continuous FT */
+};
+
+/* ========= colo-proxy: helper functions ========== */
+
+static int colo_proxy_send(libxl__colo_proxy_state *cps, uint8_t *buff,
+                           uint64_t size, int type)
+{
+    struct sockaddr_nl sa;
+    struct nlmsghdr msg;
+    struct iovec iov;
+    struct msghdr mh;
+    int ret;
+
+    STATE_AO_GC(cps->ao);
+
+    memset(&sa, 0, sizeof(sa));
+    sa.nl_family = AF_NETLINK;
+    sa.nl_pid = 0;
+    sa.nl_groups = 0;
+
+    msg.nlmsg_len = NLMSG_SPACE(0);
+    msg.nlmsg_flags = NLM_F_REQUEST;
+    if (type == COLO_PROXY_INIT) {
+        msg.nlmsg_flags |= NLM_F_ACK;
+    }
+    msg.nlmsg_seq = 0;
+    /* This value is not trusted */
+    msg.nlmsg_pid = cps->index;
+    msg.nlmsg_type = type;
+
+    iov.iov_base = &msg;
+    iov.iov_len = msg.nlmsg_len;
+
+    mh.msg_name = &sa;
+    mh.msg_namelen = sizeof(sa);
+    mh.msg_iov = &iov;
+    mh.msg_iovlen = 1;
+    mh.msg_control = NULL;
+    mh.msg_controllen = 0;
+    mh.msg_flags = 0;
+
+    ret = sendmsg(cps->sock_fd, &mh, 0);
+    if (ret <= 0) {
+        LOG(ERROR, "can't send msg to kernel by netlink: %s",
+            strerror(errno));
+    }
+
+    return ret;
+}
+
+/* Returns the number of bytes received, or <= 0 on error or timeout */
+static int64_t colo_proxy_recv(libxl__colo_proxy_state *cps, uint8_t **buff,
+                               unsigned int timeout_us)
+{
+    struct sockaddr_nl sa;
+    struct iovec iov;
+    struct msghdr mh = {
+        .msg_name = &sa,
+        .msg_namelen = sizeof(sa),
+        .msg_iov = &iov,
+        .msg_iovlen = 1,
+    };
+    struct timeval tv;
+    uint32_t size = 16384;
+    int64_t len = 0;
+    int ret;
+
+    STATE_AO_GC(cps->ao);
+    uint8_t *tmp = libxl__malloc(NOGC, size);
+
+    if (timeout_us) {
+        tv.tv_sec = timeout_us / 1000000;
+        tv.tv_usec = timeout_us % 1000000;
+        setsockopt(cps->sock_fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
+    }
+
+    iov.iov_base = tmp;
+    iov.iov_len = size;
+next:
+    ret = recvmsg(cps->sock_fd, &mh, 0);
+    if (ret <= 0) {
+        if (errno != EAGAIN && errno != EWOULDBLOCK)
+            LOGE(ERROR, "can't recv msg from kernel by netlink");
+        goto err;
+    }
+
+    len += ret;
+    if (mh.msg_flags & MSG_TRUNC) {
+        size += 16384;
+        tmp = libxl__realloc(NOGC, tmp, size);
+        iov.iov_base = tmp + len;
+        iov.iov_len = size - len;
+        goto next;
+    }
+
+    *buff = tmp;
+    ret = len;
+    goto out;
+
+err:
+    free(tmp);
+    *buff = NULL;
+
+out:
+    if (timeout_us) {
+        tv.tv_sec = 0;
+        tv.tv_usec = 0;
+        setsockopt(cps->sock_fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
+    }
+    return ret;
+}
+
+/* ========= colo-proxy: setup and teardown ========== */
+
+int colo_proxy_setup(libxl__colo_proxy_state *cps)
+{
+    int skfd = 0;
+    struct sockaddr_nl sa;
+    struct nlmsghdr *h;
+    int i = 1;
+    int ret = ERROR_FAIL;
+    uint8_t *buff = NULL;
+    int64_t size;
+
+    STATE_AO_GC(cps->ao);
+
+    skfd = socket(PF_NETLINK, SOCK_RAW, NETLINK_COLO);
+    if (skfd < 0) {
+        LOG(ERROR, "can not create a netlink socket: %s", strerror(errno));
+        goto out;
+    }
+    cps->sock_fd = skfd;
+    memset(&sa, 0, sizeof(sa));
+    sa.nl_family = AF_NETLINK;
+    sa.nl_groups = 0;
+retry:
+    sa.nl_pid = i++;
+
+    if (i > 10) {
+        LOG(ERROR, "netlink bind error");
+        goto out;
+    }
+
+    ret = bind(skfd, (struct sockaddr *)&sa, sizeof(sa));
+    if (ret < 0 && errno == EADDRINUSE) {
+        LOG(ERROR, "colo index %d is already in use", sa.nl_pid);
+        goto retry;
+    } else if (ret < 0) {
+        LOG(ERROR, "netlink bind error");
+        goto out;
+    }
+
+    cps->index = sa.nl_pid;
+    ret = colo_proxy_send(cps, NULL, 0, COLO_PROXY_INIT);
+    if (ret < 0) {
+        goto out;
+    }
+    /* receive ack */
+    size = colo_proxy_recv(cps, &buff, 500000);
+    if (size < 0) {
+        LOG(ERROR, "Can't recv msg from kernel by netlink: %s",
+            strerror(errno));
+        goto out;
+    }
+
+    if (size) {
+        h = (struct nlmsghdr *)buff;
+        if (h->nlmsg_type == NLMSG_ERROR) {
+            /* ack's type is NLMSG_ERROR */
+            struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(h);
+
+            if (size - sizeof(*h) < sizeof(*err)) {
+                LOG(ERROR, "NLMSG_LENGTH is too short");
+                goto out;
+            }
+
+            if (err->error) {
+                LOG(ERROR, "NLMSG_ERROR contains error %d", err->error);
+                goto out;
+            }
+        }
+    }
+
+    ret = 0;
+
+out:
+    free(buff);
+    if (ret) {
+        close(cps->sock_fd);
+        cps->sock_fd = -1;
+    }
+    return ret;
+}
+
+void colo_proxy_teardown(libxl__colo_proxy_state *cps)
+{
+    if (cps->sock_fd >= 0) {
+        close(cps->sock_fd);
+        cps->sock_fd = -1;
+    }
+}
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index ddf4980..abaa98c 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3109,6 +3109,15 @@ libxl__stream_read_inuse(const libxl__stream_read_state *stream)
 }
 
 /*----- colo related state structure -----*/
+typedef struct libxl__colo_proxy_state libxl__colo_proxy_state;
+struct libxl__colo_proxy_state {
+    /* set by caller of colo_proxy_setup */
+    libxl__ao *ao;
+
+    int sock_fd;
+    int index;
+};
+
 typedef struct libxl__colo_save_state libxl__colo_save_state;
 struct libxl__colo_save_state {
     int send_fd;
@@ -3123,6 +3132,9 @@ struct libxl__colo_save_state {
     /* private, used by qdisk block replication */
     bool qdisk_used;
     bool qdisk_setuped;
+
+    /* private, used by colo-proxy */
+    libxl__colo_proxy_state cps;
 };
 
 /*----- Domain suspend (save) state structure -----*/
@@ -3534,6 +3546,9 @@ struct libxl__colo_restore_state {
     bool qdisk_setuped;
     const char *host;
     const char *port;
+
+    /* private, used by colo-proxy */
+    libxl__colo_proxy_state cps;
 };
 
 struct libxl__domain_create_state {
-- 
2.5.0


* [PATCH v9 21/25] COLO proxy: preresume, postresume and checkpoint
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Implement the preresume, postresume and checkpoint operations of the
COLO proxy.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_colo.h       |  4 +++
 tools/libxl/libxl_colo_proxy.c | 62 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)
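
Note on the checkpoint message: colo_proxy_checkpoint() documents three
return values (-1 error, 0 no checkpoint event, 1 do checkpoint). Below is a
minimal, Linux-only sketch of classifying an already received buffer along
those lines. sketch_classify() and struct sketch_colo_msg are hypothetical
names for illustration; the payload layout (a single bool) is taken from the
colo_msg definition in this patch, and the extra bounds checks are
assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Mirrors the colo_msg payload defined in the patch. */
struct sketch_colo_msg {
    bool is_checkpoint;
};

/*
 * Classify a received netlink buffer:
 *  -1: malformed message or NLMSG_ERROR
 *   0: nothing received, or the message is not a checkpoint request
 *   1: the proxy asks for a checkpoint
 */
static int sketch_classify(void *buf, size_t len)
{
    struct nlmsghdr *h = buf;
    struct sketch_colo_msg *m;

    if (len == 0)
        return 0;                        /* timeout: no message arrived */
    assert(buf != NULL);
    if (len < (size_t)NLMSG_HDRLEN)
        return -1;                       /* not even a full header */
    if (h->nlmsg_type == NLMSG_ERROR)
        return -1;
    if (h->nlmsg_len < (size_t)NLMSG_LENGTH(sizeof(*m)) ||
        h->nlmsg_len > len)
        return -1;                       /* payload missing or truncated */

    m = NLMSG_DATA(h);
    return m->is_checkpoint ? 1 : 0;
}
```

A caller would pass the buffer and byte count returned by a receive helper
such as colo_proxy_recv() and free the buffer afterwards.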

diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
index 2604b0f..ea17faa 100644
--- a/tools/libxl/libxl_colo.h
+++ b/tools/libxl/libxl_colo.h
@@ -33,4 +33,8 @@ extern void libxl__colo_save_teardown(libxl__egc *egc,
 
 extern int colo_proxy_setup(libxl__colo_proxy_state *cps);
 extern void colo_proxy_teardown(libxl__colo_proxy_state *cps);
+extern void colo_proxy_preresume(libxl__colo_proxy_state *cps);
+extern void colo_proxy_postresume(libxl__colo_proxy_state *cps);
+extern int colo_proxy_checkpoint(libxl__colo_proxy_state *cps,
+                                 unsigned int timeout_us);
 #endif
diff --git a/tools/libxl/libxl_colo_proxy.c b/tools/libxl/libxl_colo_proxy.c
index e07e640..c714b52 100644
--- a/tools/libxl/libxl_colo_proxy.c
+++ b/tools/libxl/libxl_colo_proxy.c
@@ -228,3 +228,65 @@ void colo_proxy_teardown(libxl__colo_proxy_state *cps)
         cps->sock_fd = -1;
     }
 }
+
+/* ========= colo-proxy: preresume, postresume and checkpoint ========== */
+
+void colo_proxy_preresume(libxl__colo_proxy_state *cps)
+{
+    colo_proxy_send(cps, NULL, 0, COLO_CHECKPOINT);
+    /* TODO: need to handle if the call fails... */
+}
+
+void colo_proxy_postresume(libxl__colo_proxy_state *cps)
+{
+    /* nothing to do... */
+}
+
+typedef struct colo_msg {
+    bool is_checkpoint;
+} colo_msg;
+
+/*
+ * Return value:
+ * -1: error
+ *  0: no checkpoint event is received before timeout
+ *  1: do checkpoint
+ */
+int colo_proxy_checkpoint(libxl__colo_proxy_state *cps,
+                          unsigned int timeout_us)
+{
+    uint8_t *buff;
+    int64_t size;
+    struct nlmsghdr *h;
+    struct colo_msg *m;
+    int ret = -1;
+
+    STATE_AO_GC(cps->ao);
+
+    size = colo_proxy_recv(cps, &buff, timeout_us);
+
+    /* timeout, return no checkpoint message. */
+    if (size <= 0) {
+        return 0;
+    }
+
+    h = (struct nlmsghdr *) buff;
+
+    if (h->nlmsg_type == NLMSG_ERROR) {
+        LOG(ERROR, "receive NLMSG_ERROR");
+        goto out;
+    }
+
+    if (h->nlmsg_len < NLMSG_LENGTH(sizeof(*m))) {
+        LOG(ERROR, "NLMSG_LENGTH is too short");
+        goto out;
+    }
+
+    m = NLMSG_DATA(h);
+
+    ret = m->is_checkpoint ? 1 : 0;
+
+out:
+    free(buff);
+    return ret;
+}
-- 
2.5.0


* [PATCH v9 22/25] COLO nic: implement COLO nic subkind
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Implement the COLO nic subkind.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/hotplug/Linux/Makefile         |   1 +
 tools/hotplug/Linux/colo-proxy-setup | 135 +++++++++++++++
 tools/libxl/Makefile                 |   1 +
 tools/libxl/libxl_colo_nic.c         | 321 +++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_internal.h         |   5 +
 tools/libxl/libxl_types.idl          |   1 +
 6 files changed, 464 insertions(+)
 create mode 100755 tools/hotplug/Linux/colo-proxy-setup
 create mode 100644 tools/libxl/libxl_colo_nic.c

diff --git a/tools/hotplug/Linux/Makefile b/tools/hotplug/Linux/Makefile
index 6e10118..9bb852b 100644
--- a/tools/hotplug/Linux/Makefile
+++ b/tools/hotplug/Linux/Makefile
@@ -28,6 +28,7 @@ XEN_SCRIPTS += block-iscsi
 XEN_SCRIPTS += block-tap
 XEN_SCRIPTS += block-drbd-probe
 XEN_SCRIPTS += $(XEN_SCRIPTS-y)
+XEN_SCRIPTS += colo-proxy-setup
 
 SUBDIRS-$(CONFIG_SYSTEMD) += systemd
 
diff --git a/tools/hotplug/Linux/colo-proxy-setup b/tools/hotplug/Linux/colo-proxy-setup
new file mode 100755
index 0000000..94e2034
--- /dev/null
+++ b/tools/hotplug/Linux/colo-proxy-setup
@@ -0,0 +1,135 @@
+#! /bin/bash
+
+dir=$(dirname "$0")
+. "$dir/xen-hotplug-common.sh"
+. "$dir/hotplugpath.sh"
+
+findCommand "$@"
+
+if [ "$command" != "setup" -a  "$command" != "teardown" ]
+then
+    echo "Invalid command: $command"
+    log err "Invalid command: $command"
+    exit 1
+fi
+
+evalVariables "$@"
+
+: ${vifname:?}
+: ${forwarddev:?}
+: ${mode:?}
+: ${index:?}
+: ${bridge:?}
+
+forwardbr="colobr0"
+
+if [ "$mode" != "primary" -a "$mode" != "secondary" ]
+then
+    echo "Invalid mode: $mode"
+    log err "Invalid mode: $mode"
+    exit 1
+fi
+
+if [ $index -lt 0 ] || [ $index -gt 100 ]; then
+    echo "Invalid index: $index (must be between 0 and 100)"
+    exit 1
+fi
+
+function setup_primary()
+{
+    do_without_error tc qdisc add dev $vifname root handle 1: prio
+    do_without_error tc filter add dev $vifname parent 1: protocol ip prio 10 \
+        u32 match u32 0 0 flowid 1:2 action mirred egress mirror dev $forwarddev
+    do_without_error tc filter add dev $vifname parent 1: protocol arp prio 11 \
+        u32 match u32 0 0 flowid 1:2 action mirred egress mirror dev $forwarddev
+    do_without_error tc filter add dev $vifname parent 1: protocol ipv6 prio \
+        12 u32 match u32 0 0 flowid 1:2 action mirred egress mirror \
+        dev $forwarddev
+
+    do_without_error modprobe nf_conntrack_ipv4
+    do_without_error modprobe xt_PMYCOLO sec_dev=$forwarddev
+
+    iptables -t mangle -I PREROUTING -m physdev --physdev-in \
+        $vifname -j PMYCOLO --index $index
+    ip6tables -t mangle -I PREROUTING -m physdev --physdev-in \
+        $vifname -j PMYCOLO --index $index
+    do_without_error arptables -I INPUT -i $forwarddev -j MARK --set-mark $index
+}
+
+function teardown_primary()
+{
+    do_without_error tc filter del dev $vifname parent 1: protocol ip prio 10 u32 match u32 \
+        0 0 flowid 1:2 action mirred egress mirror dev $forwarddev
+    do_without_error tc filter del dev $vifname parent 1: protocol arp prio 11 u32 match u32 \
+        0 0 flowid 1:2 action mirred egress mirror dev $forwarddev
+    do_without_error tc filter del dev $vifname parent 1: protocol ipv6 prio 12 u32 match u32 \
+        0 0 flowid 1:2 action mirred egress mirror dev $forwarddev
+    do_without_error tc qdisc del dev $vifname root handle 1: prio
+
+    do_without_error iptables -t mangle -D PREROUTING -m physdev --physdev-in \
+        $vifname -j PMYCOLO --index $index
+    do_without_error ip6tables -t mangle -D PREROUTING -m physdev --physdev-in \
+        $vifname -j PMYCOLO --index $index
+    do_without_error arptables -F
+    do_without_error rmmod xt_PMYCOLO
+}
+
+function setup_secondary()
+{
+    do_without_error brctl delif $bridge $vifname
+    do_without_error brctl addbr $forwardbr
+    do_without_error brctl addif $forwardbr $vifname
+    do_without_error brctl addif $forwardbr $forwarddev
+    do_without_error ip link set dev $forwardbr up
+    do_without_error modprobe xt_SECCOLO
+
+    iptables -t mangle -I PREROUTING -m physdev --physdev-in \
+        $vifname -j SECCOLO --index $index
+    ip6tables -t mangle -I PREROUTING -m physdev --physdev-in \
+        $vifname -j SECCOLO --index $index
+}
+
+function teardown_secondary()
+{
+    do_without_error brctl delif $forwardbr $forwarddev
+    do_without_error brctl delif $forwardbr $vifname
+    do_without_error brctl delbr $forwardbr
+    do_without_error brctl addif $bridge $vifname
+
+    do_without_error iptables -t mangle -D PREROUTING -m physdev --physdev-in \
+        $vifname -j SECCOLO --index $index
+    do_without_error ip6tables -t mangle -D PREROUTING -m physdev --physdev-in \
+        $vifname -j SECCOLO --index $index
+    do_without_error rmmod xt_SECCOLO
+}
+
+case "$command" in
+    setup)
+        if [ "$mode" = "primary" ]
+        then
+            setup_primary
+        else
+            setup_secondary
+        fi
+
+        success
+        ;;
+    teardown)
+        if [ "$mode" = "primary" ]
+        then
+            teardown_primary
+        else
+            teardown_secondary
+        fi
+        ;;
+esac
+
+if [ "$mode" = "primary" ]
+then
+    log debug "Successful colo-proxy-setup $command for $vifname." \
+              " vifname: $vifname, index: $index, forwarddev: $forwarddev."
+else
+    log debug "Successful colo-proxy-setup $command for $vifname." \
+              " vifname: $vifname, index: $index, forwarddev: $forwarddev,"\
+              " forwardbr: $forwardbr."
+fi
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 8c7e5c0..407794e 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -66,6 +66,7 @@ LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
 LIBXL_OBJS-y += libxl_colo_restore.o libxl_colo_save.o
 LIBXL_OBJS-y += libxl_colo_qdisk.o
 LIBXL_OBJS-y += libxl_colo_proxy.o
+LIBXL_OBJS-y += libxl_colo_nic.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o
diff --git a/tools/libxl/libxl_colo_nic.c b/tools/libxl/libxl_colo_nic.c
new file mode 100644
index 0000000..998e09c
--- /dev/null
+++ b/tools/libxl/libxl_colo_nic.c
@@ -0,0 +1,321 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+typedef struct libxl__colo_device_nic {
+    int devid;
+    const char *vif;
+} libxl__colo_device_nic;
+
+enum {
+    primary,
+    secondary,
+};
+
+
+/* ========== init() and cleanup() ========== */
+int init_subkind_colo_nic(libxl__checkpoint_devices_state *cds)
+{
+    return 0;
+}
+
+void cleanup_subkind_colo_nic(libxl__checkpoint_devices_state *cds)
+{
+}
+
+/* ========== helper functions ========== */
+static void colo_save_setup_script_cb(libxl__egc *egc,
+                                      libxl__async_exec_state *aes,
+                                      int rc, int status);
+static void colo_save_teardown_script_cb(libxl__egc *egc,
+                                         libxl__async_exec_state *aes,
+                                         int rc, int status);
+
+/*
+ * If the device has a vifname, then use that instead of
+ * the vifX.Y format.
+ * it must ONLY be used for remus because if driver domains
+ * were in use it would constitute a security vulnerability.
+ */
+static const char *get_vifname(libxl__checkpoint_device *dev,
+                               const libxl_device_nic *nic)
+{
+    const char *vifname = NULL;
+    const char *path;
+    int rc;
+
+    STATE_AO_GC(dev->cds->ao);
+
+    /* Convenience aliases */
+    const uint32_t domid = dev->cds->domid;
+
+    path = GCSPRINTF("%s/backend/vif/%d/%d/vifname",
+                     libxl__xs_get_dompath(gc, 0), domid, nic->devid);
+    rc = libxl__xs_read_checked(gc, XBT_NULL, path, &vifname);
+    if (!rc && !vifname) {
+        vifname = libxl__device_nic_devname(gc, domid,
+                                            nic->devid,
+                                            nic->nictype);
+    }
+
+    return vifname;
+}
+
+/*
+ * the script needs the following env & args
+ * $vifname
+ * $forwarddev
+ * $mode(primary/secondary)
+ * $index
+ * $bridge
+ * setup/teardown as command line arg.
+ */
+static void setup_async_exec(libxl__checkpoint_device *dev, char *op,
+                             libxl__colo_proxy_state *cps, int side,
+                             char *colo_proxy_script)
+{
+    int arraysize, nr = 0;
+    char **env = NULL, **args = NULL;
+    libxl__colo_device_nic *colo_nic = dev->concrete_data;
+    libxl__checkpoint_devices_state *cds = dev->cds;
+    libxl__async_exec_state *aes = &dev->aodev.aes;
+    const libxl_device_nic *nic = dev->backend_dev;
+
+    STATE_AO_GC(cds->ao);
+
+    /* Convenience aliases */
+    const char *const vif = colo_nic->vif;
+
+    arraysize = 11;
+    GCNEW_ARRAY(env, arraysize);
+    env[nr++] = "vifname";
+    env[nr++] = libxl__strdup(gc, vif);
+    env[nr++] = "forwarddev";
+    env[nr++] = libxl__strdup(gc, nic->forwarddev);
+    env[nr++] = "mode";
+    if (side == primary)
+        env[nr++] = "primary";
+    else
+        env[nr++] = "secondary";
+    env[nr++] = "index";
+    env[nr++] = GCSPRINTF("%d", cps->index);
+    env[nr++] = "bridge";
+    env[nr++] = libxl__strdup(gc, nic->bridge);
+    env[nr++] = NULL;
+    assert(nr == arraysize);
+
+    arraysize = 3; nr = 0;
+    GCNEW_ARRAY(args, arraysize);
+    args[nr++] = colo_proxy_script;
+    args[nr++] = op;
+    args[nr++] = NULL;
+    assert(nr == arraysize);
+
+    aes->ao = dev->cds->ao;
+    aes->what = GCSPRINTF("%s %s", args[0], args[1]);
+    aes->env = env;
+    aes->args = args;
+    aes->timeout_ms = LIBXL_HOTPLUG_TIMEOUT * 1000;
+    aes->stdfds[0] = -1;
+    aes->stdfds[1] = -1;
+    aes->stdfds[2] = -1;
+
+    if (!strcmp(op, "teardown"))
+        aes->callback = colo_save_teardown_script_cb;
+    else
+        aes->callback = colo_save_setup_script_cb;
+}
+
+/* ========== setup() and teardown() ========== */
+static void colo_nic_setup(libxl__egc *egc, libxl__checkpoint_device *dev,
+                           libxl__colo_proxy_state *cps, int side,
+                           char *colo_proxy_script)
+{
+    int rc;
+    libxl__colo_device_nic *colo_nic;
+    const libxl_device_nic *nic = dev->backend_dev;
+
+    STATE_AO_GC(dev->cds->ao);
+
+    /*
+     * There is no subkind of nic devices, so the nic ops always match
+     * nic devices; begin setting up the nic device.
+     */
+    dev->matched = 1;
+
+    if (!nic->forwarddev) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    GCNEW(colo_nic);
+    dev->concrete_data = colo_nic;
+    colo_nic->devid = nic->devid;
+    colo_nic->vif = get_vifname(dev, nic);
+    if (!colo_nic->vif) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    setup_async_exec(dev, "setup", cps, side, colo_proxy_script);
+    rc = libxl__async_exec_start(&dev->aodev.aes);
+    if (rc)
+        goto out;
+
+    return;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+static void colo_save_setup_script_cb(libxl__egc *egc,
+                                      libxl__async_exec_state *aes,
+                                      int rc, int status)
+{
+    libxl__ao_device *aodev = CONTAINER_OF(aes, *aodev, aes);
+    libxl__checkpoint_device *dev = CONTAINER_OF(aodev, *dev, aodev);
+    libxl__colo_device_nic *colo_nic = dev->concrete_data;
+    libxl__checkpoint_devices_state *cds = dev->cds;
+    const char *out_path_base, *hotplug_error = NULL;
+
+    EGC_GC;
+
+    /* Convenience aliases */
+    const uint32_t domid = cds->domid;
+    const int devid = colo_nic->devid;
+    const char *const vif = colo_nic->vif;
+
+    if (status && !rc)
+        rc = ERROR_FAIL;
+    if (rc)
+        goto out;
+
+    out_path_base = GCSPRINTF("%s/colo_proxy/%d",
+                              libxl__xs_libxl_path(gc, domid), devid);
+
+    rc = libxl__xs_read_checked(gc, XBT_NULL,
+                                GCSPRINTF("%s/hotplug-error", out_path_base),
+                                &hotplug_error);
+    if (rc)
+        goto out;
+
+    if (hotplug_error) {
+        LOG(ERROR, "colo_proxy script %s setup failed for vif %s: %s",
+            aes->args[0], vif, hotplug_error);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    if (status) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = 0;
+
+out:
+    aodev->rc = rc;
+    aodev->callback(egc, aodev);
+}
+
+static void colo_nic_teardown(libxl__egc *egc, libxl__checkpoint_device *dev,
+                              libxl__colo_proxy_state *cps, int side,
+                              char *colo_proxy_script)
+{
+    int rc;
+    libxl__colo_device_nic *colo_nic = dev->concrete_data;
+
+    if (!colo_nic || !colo_nic->vif) {
+        /* colo nic has not yet been set up, just return */
+        rc = 0;
+        goto out;
+    }
+
+    setup_async_exec(dev, "teardown", cps, side, colo_proxy_script);
+
+    rc = libxl__async_exec_start(&dev->aodev.aes);
+    if (rc)
+        goto out;
+
+    return;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+static void colo_save_teardown_script_cb(libxl__egc *egc,
+                                         libxl__async_exec_state *aes,
+                                         int rc, int status)
+{
+    libxl__ao_device *aodev = CONTAINER_OF(aes, *aodev, aes);
+
+    if (status && !rc)
+        rc = ERROR_FAIL;
+    else
+        rc = 0;
+
+    aodev->rc = rc;
+    aodev->callback(egc, aodev);
+}
+
+/* ======== primary ======== */
+static void colo_nic_save_setup(libxl__egc *egc, libxl__checkpoint_device *dev)
+{
+    libxl__colo_save_state *css = dev->cds->concrete_data;
+
+    colo_nic_setup(egc, dev, &css->cps, primary, css->colo_proxy_script);
+}
+
+static void colo_nic_save_teardown(libxl__egc *egc,
+                                   libxl__checkpoint_device *dev)
+{
+    libxl__colo_save_state *css = dev->cds->concrete_data;
+
+    colo_nic_teardown(egc, dev, &css->cps, primary, css->colo_proxy_script);
+}
+
+const libxl__checkpoint_device_instance_ops colo_save_device_nic = {
+    .kind = LIBXL__DEVICE_KIND_VIF,
+    .setup = colo_nic_save_setup,
+    .teardown = colo_nic_save_teardown,
+};
+
+/* ======== secondary ======== */
+static void colo_nic_restore_setup(libxl__egc *egc,
+                                   libxl__checkpoint_device *dev)
+{
+    libxl__colo_restore_state *crs = dev->cds->concrete_data;
+
+    colo_nic_setup(egc, dev, &crs->cps, secondary, crs->colo_proxy_script);
+}
+
+static void colo_nic_restore_teardown(libxl__egc *egc,
+                                      libxl__checkpoint_device *dev)
+{
+    libxl__colo_restore_state *crs = dev->cds->concrete_data;
+
+    colo_nic_teardown(egc, dev, &crs->cps, secondary, crs->colo_proxy_script);
+}
+
+const libxl__checkpoint_device_instance_ops colo_restore_device_nic = {
+    .kind = LIBXL__DEVICE_KIND_VIF,
+    .setup = colo_nic_restore_setup,
+    .teardown = colo_nic_restore_teardown,
+};
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index abaa98c..06e1ccc 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2901,6 +2901,8 @@ int init_subkind_qdisk(libxl__checkpoint_devices_state *cds);
 void cleanup_subkind_qdisk(libxl__checkpoint_devices_state *cds);
 int colo_qdisk_preresume(libxl_ctx *ctx, domid_t domid);
 int colo_qdisk_start(libxl__egc *egc, domid_t domid, bool primary);
+int init_subkind_colo_nic(libxl__checkpoint_devices_state *cds);
+void cleanup_subkind_colo_nic(libxl__checkpoint_devices_state *cds);
 
 typedef void libxl__checkpoint_callback(libxl__egc *,
                                         libxl__checkpoint_devices_state *,
@@ -3122,6 +3124,7 @@ typedef struct libxl__colo_save_state libxl__colo_save_state;
 struct libxl__colo_save_state {
     int send_fd;
     int recv_fd;
+    char *colo_proxy_script;
 
     /* private */
     libxl__stream_read_state srs;
@@ -3536,6 +3539,7 @@ struct libxl__colo_restore_state {
     int recv_fd;
     int hvm;
     libxl__colo_callback *callback;
+    char *colo_proxy_script;
 
     /* private, colo restore checkpoint state */
     libxl__domain_create_cb *saved_cb;
@@ -3565,6 +3569,7 @@ struct libxl__domain_create_state {
     libxl_asyncprogress_how aop_console_how;
     /* private to domain_create */
     int guest_domid;
+    const char *colo_proxy_script;
     libxl__domain_build_state build_state;
     libxl__colo_restore_state crs;
     libxl__checkpoint_devices_state cds;
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 38f37f2..1981f8b 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -598,6 +598,7 @@ libxl_device_nic = Struct("device_nic", [
     ("rate_bytes_per_interval", uint64),
     ("rate_interval_usecs", uint32),
     ("gatewaydev", string),
+    ("forwarddev", string)
     ])
 
 libxl_device_pci = Struct("device_pci", [
-- 
2.5.0

* [PATCH v9 23/25] setup and control colo proxy on primary side
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (21 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 22/25] COLO nic: implement COLO nic subkind Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 24/25] setup and control colo proxy on secondary side Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 25/25] cmdline switches and config vars to control colo-proxy Wen Congyang
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Set up and control the COLO proxy on the primary side.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_colo_save.c | 124 +++++++++++++++++++++++++++++++++++++++---
 tools/libxl/libxl_internal.h  |   1 +
 2 files changed, 117 insertions(+), 8 deletions(-)

diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
index 78fcc60..e3d4f91 100644
--- a/tools/libxl/libxl_colo_save.c
+++ b/tools/libxl/libxl_colo_save.c
@@ -19,9 +19,11 @@
 #include "libxl_internal.h"
 #include "libxl_colo.h"
 
+extern const libxl__checkpoint_device_instance_ops colo_save_device_nic;
 extern const libxl__checkpoint_device_instance_ops colo_save_device_qdisk;
 
 static const libxl__checkpoint_device_instance_ops *colo_ops[] = {
+    &colo_save_device_nic,
     &colo_save_device_qdisk,
     NULL,
 };
@@ -33,9 +35,15 @@ static int init_device_subkind(libxl__checkpoint_devices_state *cds)
     int rc;
     STATE_AO_GC(cds->ao);
 
-    rc = init_subkind_qdisk(cds);
+    rc = init_subkind_colo_nic(cds);
     if (rc) goto out;
 
+    rc = init_subkind_qdisk(cds);
+    if (rc) {
+        cleanup_subkind_colo_nic(cds);
+        goto out;
+    }
+
     rc = 0;
 out:
     return rc;
@@ -46,6 +54,7 @@ static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
     /* cleanup device subkind-specific state in the libxl ctx */
     STATE_AO_GC(cds->ao);
 
+    cleanup_subkind_colo_nic(cds);
     cleanup_subkind_qdisk(cds);
 }
 
@@ -77,9 +86,16 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
     css->paused = true;
     css->qdisk_setuped = false;
     css->qdisk_used = false;
+    libxl__ev_child_init(&css->child);
 
-    /* TODO: nic support */
-    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VBD);
+    if (dss->remus->netbufscript)
+        css->colo_proxy_script = libxl__strdup(gc, dss->remus->netbufscript);
+    else
+        css->colo_proxy_script = GCSPRINTF("%s/colo-proxy-setup",
+                                           libxl__xen_script_dir_path());
+
+    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VIF) |
+                             (1 << LIBXL__DEVICE_KIND_VBD);
     cds->ops = colo_ops;
     cds->callback = colo_save_setup_done;
     cds->ao = ao;
@@ -90,6 +106,12 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
     css->srs.fd = css->recv_fd;
     css->srs.back_channel = true;
     libxl__stream_read_start(egc, &css->srs);
+    css->cps.ao = ao;
+    if (colo_proxy_setup(&css->cps)) {
+        LOG(ERROR, "COLO: failed to setup colo proxy for guest with domid %u",
+            cds->domid);
+        goto out;
+    }
 
     if (init_device_subkind(cds))
         goto out;
@@ -167,6 +189,7 @@ static void colo_teardown_done(libxl__egc *egc,
     libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
 
     cleanup_device_subkind(cds);
+    colo_proxy_teardown(&css->cps);
     dss->callback(egc, dss, rc);
 }
 
@@ -361,6 +384,8 @@ static void colo_read_svm_ready_done(libxl__egc *egc,
         goto out;
     }
 
+    colo_proxy_preresume(&css->cps);
+
     css->svm_running = true;
     dss->cds.callback = colo_preresume_cb;
     libxl__checkpoint_devices_preresume(egc, &dss->cds);
@@ -446,6 +471,8 @@ static void colo_read_svm_resumed_done(libxl__egc *egc,
         goto out;
     }
 
+    colo_proxy_postresume(&css->cps);
+
     ok = 1;
 
 out:
@@ -454,6 +481,91 @@ out:
 
 
 /* ===================== colo: wait new checkpoint ===================== */
+
+static void colo_start_new_checkpoint(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc);
+static void colo_proxy_async_wait_for_checkpoint(libxl__colo_save_state *css);
+static void colo_proxy_async_call_done(libxl__egc *egc,
+                                       libxl__ev_child *child,
+                                       int pid,
+                                       int status);
+
+static void colo_proxy_async_call(libxl__egc *egc,
+                                  libxl__colo_save_state *css,
+                                  void func(libxl__colo_save_state *),
+                                  libxl__ev_child_callback callback)
+{
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+    int pid = -1, rc;
+
+    STATE_AO_GC(dss->cds.ao);
+
+    /* Fork and call */
+    pid = libxl__ev_child_fork(gc, &css->child, callback);
+    if (pid == -1) {
+        LOG(ERROR, "unable to fork");
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    if (!pid) {
+        /* child */
+        func(css);
+        /* notreached */
+        abort();
+    }
+
+    return;
+
+out:
+    callback(egc, &css->child, -1, 1);
+}
+
+static void colo_proxy_wait_for_checkpoint(libxl__egc *egc,
+                                           libxl__colo_save_state *css)
+{
+    colo_proxy_async_call(egc, css,
+                          colo_proxy_async_wait_for_checkpoint,
+                          colo_proxy_async_call_done);
+}
+
+static void colo_proxy_async_wait_for_checkpoint(libxl__colo_save_state *css)
+{
+    int req;
+
+    req = colo_proxy_checkpoint(&css->cps, 5000000);
+    if (req < 0) {
+        /* some error happens */
+        _exit(1);
+    } else if (!req) {
+        /* no checkpoint is needed, do a checkpoint every 5s */
+        _exit(0);
+    } else {
+        /* net packets are not consistent, we need to start a checkpoint */
+        _exit(0);
+    }
+}
+
+static void colo_proxy_async_call_done(libxl__egc *egc,
+                                       libxl__ev_child *child,
+                                       int pid,
+                                       int status)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(child, *css, child);
+    libxl__domain_save_state *dss = CONTAINER_OF(css, *dss, css);
+
+    EGC_GC;
+
+    if (status) {
+        LOG(ERROR, "failed to wait for new checkpoint");
+        colo_start_new_checkpoint(egc, &dss->cds, ERROR_FAIL);
+        return;
+    }
+
+    colo_start_new_checkpoint(egc, &dss->cds, 0);
+}
+
 /*
  * Do the following things:
  * 1. do commit
@@ -463,9 +575,6 @@ out:
 static void colo_device_commit_cb(libxl__egc *egc,
                                   libxl__checkpoint_devices_state *cds,
                                   int rc);
-static void colo_start_new_checkpoint(libxl__egc *egc,
-                                      libxl__checkpoint_devices_state *cds,
-                                      int rc);
 
 void libxl__colo_save_domain_should_checkpoint_callback(void *data)
 {
@@ -495,8 +604,7 @@ static void colo_device_commit_cb(libxl__egc *egc,
         goto out;
     }
 
-    /* TODO: wait a new checkpoint */
-    colo_start_new_checkpoint(egc, cds, 0);
+    colo_proxy_wait_for_checkpoint(egc, css);
     return;
 
 out:
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 06e1ccc..c94e2e9 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3138,6 +3138,7 @@ struct libxl__colo_save_state {
 
     /* private, used by colo-proxy */
     libxl__colo_proxy_state cps;
+    libxl__ev_child child;
 };
 
 /*----- Domain suspend (save) state structure -----*/
-- 
2.5.0

* [PATCH v9 24/25] setup and control colo proxy on secondary side
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (22 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 23/25] setup and control colo proxy on primary side Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  2015-12-30  2:37 ` [PATCH v9 25/25] cmdline switches and config vars to control colo-proxy Wen Congyang
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Set up and control the COLO proxy on the secondary side.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_colo_restore.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
index e5cfbe5..e9126ef 100644
--- a/tools/libxl/libxl_colo_restore.c
+++ b/tools/libxl/libxl_colo_restore.c
@@ -50,9 +50,11 @@ static void libxl__colo_restore_domain_checkpoint_callback(void *data);
 static void libxl__colo_restore_domain_should_checkpoint_callback(void *data);
 static void libxl__colo_restore_domain_suspend_callback(void *data);
 
+extern const libxl__checkpoint_device_instance_ops colo_restore_device_nic;
 extern const libxl__checkpoint_device_instance_ops colo_restore_device_qdisk;
 
 static const libxl__checkpoint_device_instance_ops *colo_restore_ops[] = {
+    &colo_restore_device_nic,
     &colo_restore_device_qdisk,
     NULL,
 };
@@ -153,8 +155,14 @@ static int init_device_subkind(libxl__checkpoint_devices_state *cds)
     int rc;
     STATE_AO_GC(cds->ao);
 
+    rc = init_subkind_colo_nic(cds);
+    if (rc) goto out;
+
     rc = init_subkind_qdisk(cds);
-    if (rc)  goto out;
+    if (rc) {
+        cleanup_subkind_colo_nic(cds);
+        goto out;
+    }
 
     rc = 0;
 out:
@@ -166,6 +174,7 @@ static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
     /* cleanup device subkind-specific state in the libxl ctx */
     STATE_AO_GC(cds->ao);
 
+    cleanup_subkind_colo_nic(cds);
     cleanup_subkind_qdisk(cds);
 }
 
@@ -345,6 +354,8 @@ static void colo_restore_teardown_devices_done(libxl__egc *egc,
     if (crcs->teardown_devices)
         cleanup_device_subkind(cds);
 
+    colo_proxy_teardown(&crs->cps);
+
     rc = crcs->saved_rc;
     if (!rc) {
         crcs->callback = do_failover_done;
@@ -612,6 +623,8 @@ static void colo_restore_preresume_cb(libxl__egc *egc,
         }
     }
 
+    colo_proxy_preresume(&crs->cps);
+
     colo_restore_resume_vm(egc, crcs);
 
     return;
@@ -648,6 +661,8 @@ static void colo_resume_vm_done(libxl__egc *egc,
 
     crcs->status = LIBXL_COLO_RESUMED;
 
+    colo_proxy_postresume(&crs->cps);
+
     /* avoid calling stream->completion_callback() more than once */
     if (crs->saved_cb) {
         dcs->callback = crs->saved_cb;
@@ -769,13 +784,20 @@ static void colo_setup_checkpoint_devices(libxl__egc *egc,
 
     STATE_AO_GC(crs->ao);
 
-    /* TODO: nic support */
-    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VBD);
+    cds->device_kind_flags = (1 << LIBXL__DEVICE_KIND_VIF) |
+                             (1 << LIBXL__DEVICE_KIND_VBD);
     cds->callback = colo_restore_setup_cds_done;
     cds->ao = ao;
     cds->domid = crs->domid;
     cds->ops = colo_restore_ops;
 
+    crs->cps.ao = ao;
+    if (colo_proxy_setup(&crs->cps)) {
+        LOG(ERROR, "COLO: failed to setup colo proxy for guest with domid %u",
+            cds->domid);
+        goto out;
+    }
+
     if (init_device_subkind(cds))
         goto out;
 
-- 
2.5.0

* [PATCH v9 25/25] cmdline switches and config vars to control colo-proxy
  2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (23 preceding siblings ...)
  2015-12-30  2:37 ` [PATCH v9 24/25] setup and control colo proxy on secondary side Wen Congyang
@ 2015-12-30  2:37 ` Wen Congyang
  24 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2015-12-30  2:37 UTC (permalink / raw)
  To: xen devel, Andrew Cooper, Ian Campbell, Ian Jackson, Wei Liu
  Cc: Lars Kurth, Changlong Xie, Wen Congyang, Gui Jianfeng,
	Jiang Yunhong, Dong Eddie, Shriram Rajagopalan, Yang Hongyang

Add cmdline switches to the 'xl migrate-receive' command to specify
a domain-specific hotplug script to set up the COLO proxy.

Add a new config var 'colo.default.proxyscript' to xl.conf that
allows the user to override the default global script used to
set up the COLO proxy.

Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 docs/man/xl.conf.pod.5      |  6 ++++++
 docs/man/xl.pod.1           |  1 -
 tools/libxl/libxl.c         |  6 ++++++
 tools/libxl/libxl_create.c  | 14 ++++++++++++--
 tools/libxl/libxl_types.idl |  1 +
 tools/libxl/xl.c            |  3 +++
 tools/libxl/xl.h            |  1 +
 tools/libxl/xl_cmdimpl.c    | 47 ++++++++++++++++++++++++++++++++++-----------
 8 files changed, 65 insertions(+), 14 deletions(-)

diff --git a/docs/man/xl.conf.pod.5 b/docs/man/xl.conf.pod.5
index 8ae19bb..8f7fd28 100644
--- a/docs/man/xl.conf.pod.5
+++ b/docs/man/xl.conf.pod.5
@@ -111,6 +111,12 @@ Configures the default script used by Remus to setup network buffering.
 
 Default: C</etc/xen/scripts/remus-netbuf-setup>
 
+=item B<colo.default.proxyscript="PATH">
+
+Configures the default script used by COLO to setup colo-proxy.
+
+Default: C</etc/xen/scripts/colo-proxy-setup>
+
 =item B<output_format="json|sxp">
 
 Configures the default output format used by xl when printing "machine
diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index 4f1901d..edeafcf 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -454,7 +454,6 @@ N.B: Remus support in xl is still in experimental (proof-of-concept) phase.
      Disk replication support is limited to DRBD disks.
 
      COLO support in xl is still in experimental (proof-of-concept) phase.
-     There is no support for network at the moment.
 
 B<OPTIONS>
 
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index e770723..56e86d8 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -3398,6 +3398,11 @@ void libxl__device_nic_add(libxl__egc *egc, uint32_t domid,
         flexarray_append(back, nic->ifname);
     }
 
+    if (nic->forwarddev) {
+        flexarray_append(back, "forwarddev");
+        flexarray_append(back, nic->forwarddev);
+    }
+
     flexarray_append(back, "mac");
     flexarray_append(back,GCSPRINTF(LIBXL_MAC_FMT, LIBXL_MAC_BYTES(nic->mac)));
     if (nic->ip) {
@@ -3520,6 +3525,7 @@ static int libxl__device_nic_from_xs_be(libxl__gc *gc,
     nic->ip = READ_BACKEND(NOGC, "ip");
     nic->bridge = READ_BACKEND(NOGC, "bridge");
     nic->script = READ_BACKEND(NOGC, "script");
+    nic->forwarddev = READ_BACKEND(NOGC, "forwarddev");
 
     /* vif_ioemu nics use the same xenstore entries as vif interfaces */
     tmp = READ_BACKEND(gc, "type");
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 7cb3c6a..4ba87df 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1054,6 +1054,11 @@ static void domcreate_bootloader_done(libxl__egc *egc,
             crs->recv_fd = restore_fd;
             crs->hvm = (info->type == LIBXL_DOMAIN_TYPE_HVM);
             crs->callback = libxl__colo_restore_setup_done;
+            if (dcs->colo_proxy_script)
+                crs->colo_proxy_script = libxl__strdup(gc, dcs->colo_proxy_script);
+            else
+                crs->colo_proxy_script = GCSPRINTF("%s/colo-proxy-setup",
+                                                   libxl__xen_script_dir_path());
             libxl__colo_restore_setup(egc, crs);
             break;
         case LIBXL_CHECKPOINTED_STREAM_REMUS:
@@ -1596,6 +1601,7 @@ static void domain_create_cb(libxl__egc *egc,
 static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
                             uint32_t *domid, int restore_fd, int send_fd,
                             const libxl_domain_restore_params *params,
+                            const char *colo_proxy_script,
                             const libxl_asyncop_how *ao_how,
                             const libxl_asyncprogress_how *aop_console_how)
 {
@@ -1619,6 +1625,7 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
     }
     cdcs->dcs.callback = domain_create_cb;
     cdcs->dcs.domid_soft_reset = INVALID_DOMID;
+    cdcs->dcs.colo_proxy_script = colo_proxy_script;
     libxl__ao_progress_gethow(&cdcs->dcs.aop_console_how, aop_console_how);
     cdcs->domid_out = domid;
 
@@ -1805,7 +1812,7 @@ int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
                             const libxl_asyncprogress_how *aop_console_how)
 {
     unset_disk_colo_restore(d_config);
-    return do_domain_create(ctx, d_config, domid, -1, -1, NULL,
+    return do_domain_create(ctx, d_config, domid, -1, -1, NULL, NULL,
                             ao_how, aop_console_how);
 }
 
@@ -1815,14 +1822,17 @@ int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
                                 const libxl_asyncop_how *ao_how,
                                 const libxl_asyncprogress_how *aop_console_how)
 {
+    char *colo_proxy_script = NULL;
+
     if (params->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
+        colo_proxy_script = params->colo_proxy_script;
         set_disk_colo_restore(d_config);
     } else {
         unset_disk_colo_restore(d_config);
     }
 
     return do_domain_create(ctx, d_config, domid, restore_fd, send_fd, params,
-                            ao_how, aop_console_how);
+                            colo_proxy_script, ao_how, aop_console_how);
 }
 
 int libxl_domain_soft_reset(libxl_ctx *ctx,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 1981f8b..835a946 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -384,6 +384,7 @@ libxl_domain_create_info = Struct("domain_create_info",[
 libxl_domain_restore_params = Struct("domain_restore_params", [
     ("checkpointed_stream", integer),
     ("stream_version", uint32, {'init_val': '1'}),
+    ("colo_proxy_script", string),
     ])
 
 libxl_domain_sched_params = Struct("domain_sched_params",[
diff --git a/tools/libxl/xl.c b/tools/libxl/xl.c
index dfae84a..a272258 100644
--- a/tools/libxl/xl.c
+++ b/tools/libxl/xl.c
@@ -45,6 +45,7 @@ char *default_bridge = NULL;
 char *default_gatewaydev = NULL;
 char *default_vifbackend = NULL;
 char *default_remus_netbufscript = NULL;
+char *default_colo_proxy_script = NULL;
 enum output_format default_output_format = OUTPUT_FORMAT_JSON;
 int claim_mode = 1;
 bool progress_use_cr = 0;
@@ -179,6 +180,8 @@ static void parse_global_config(const char *configfile,
 
     xlu_cfg_replace_string (config, "remus.default.netbufscript",
         &default_remus_netbufscript, 0);
+    xlu_cfg_replace_string (config, "colo.default.proxyscript",
+        &default_colo_proxy_script, 0);
 
     xlu_cfg_destroy(config);
 }
diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
index bdab125..709c3fd 100644
--- a/tools/libxl/xl.h
+++ b/tools/libxl/xl.h
@@ -189,6 +189,7 @@ extern char *default_bridge;
 extern char *default_gatewaydev;
 extern char *default_vifbackend;
 extern char *default_remus_netbufscript;
+extern char *default_colo_proxy_script;
 extern char *blkdev_start;
 
 enum output_format {
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 70b8b82..c363a26 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -158,6 +158,7 @@ struct domain_create {
     const char *config_file;
     char *extra_config; /* extra config string */
     const char *restore_file;
+    char *colo_proxy_script;
     int migrate_fd; /* -1 means none */
     int send_fd; /* -1 means none */
     char **migration_domname_r; /* from malloc */
@@ -1048,6 +1049,8 @@ static int parse_nic_config(libxl_device_nic *nic, XLU_Config **config, char *to
         replace_string(&nic->model, oparg);
     } else if (MATCH_OPTION("rate", token, oparg)) {
         parse_vif_rate(config, oparg, nic);
+    } else if (MATCH_OPTION("forwarddev", token, oparg)) {
+        replace_string(&nic->forwarddev, oparg);
     } else if (MATCH_OPTION("accel", token, oparg)) {
         fprintf(stderr, "the accel parameter for vifs is currently not supported\n");
     } else {
@@ -2893,6 +2896,7 @@ start:
         params.checkpointed_stream = dom_info->checkpointed_stream;
         params.stream_version =
             (hdr.mandatory_flags & XL_MANDATORY_FLAG_STREAMv2) ? 2 : 1;
+        params.colo_proxy_script = dom_info->colo_proxy_script;
 
         ret = libxl_domain_create_restore(ctx, &d_config,
                                           &domid, restore_fd,
@@ -4428,7 +4432,8 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
 
 static void migrate_receive(int debug, int daemonize, int monitor,
                             int send_fd, int recv_fd,
-                            libxl_checkpointed_stream checkpointed)
+                            libxl_checkpointed_stream checkpointed,
+                            char *colo_proxy_script)
 {
     uint32_t domid;
     int rc, rc2;
@@ -4457,6 +4462,7 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     dom_info.send_fd = send_fd;
     dom_info.migration_domname_r = &migration_domname;
     dom_info.checkpointed_stream = checkpointed;
+    dom_info.colo_proxy_script = colo_proxy_script;
     if (checkpointed == LIBXL_CHECKPOINTED_STREAM_COLO)
         /* COLO uses stdout to send control message to master */
         dom_info.quiet = 1;
@@ -4653,8 +4659,9 @@ int main_migrate_receive(int argc, char **argv)
     int debug = 0, daemonize = 1, monitor = 1;
     libxl_checkpointed_stream checkpointed = LIBXL_CHECKPOINTED_STREAM_NONE;
     int opt;
+    char *script = NULL;
 
-    SWITCH_FOREACH_OPT(opt, "Fedrc", NULL, "migrate-receive", 0) {
+    SWITCH_FOREACH_OPT(opt, "Fedrcn:", NULL, "migrate-receive", 0) {
     case 'F':
         daemonize = 0;
         break;
@@ -4671,6 +4678,9 @@ int main_migrate_receive(int argc, char **argv)
     case 'c':
         checkpointed = LIBXL_CHECKPOINTED_STREAM_COLO;
         break;
+    case 'n':
+        script = optarg;
+        break;
     }
 
     if (argc-optind != 0) {
@@ -4679,7 +4689,7 @@ int main_migrate_receive(int argc, char **argv)
     }
     migrate_receive(debug, daemonize, monitor,
                     STDOUT_FILENO, STDIN_FILENO,
-                    checkpointed);
+                    checkpointed, script);
 
     return 0;
 }
@@ -8079,8 +8089,10 @@ int main_remus(int argc, char **argv)
         r_info.interval = 200;
 
     if (libxl_defbool_val(r_info.colo)) {
-        if (r_info.interval || libxl_defbool_val(r_info.blackhole)) {
-            perror("Option -c conflicts with -i or -b");
+        if (r_info.interval || libxl_defbool_val(r_info.blackhole) ||
+            !libxl_defbool_is_default(r_info.netbuf) ||
+            !libxl_defbool_is_default(r_info.diskbuf)) {
+            perror("Option -c conflicts with -i, -d, -n or -b");
             exit(-1);
         }
 
@@ -8091,8 +8103,12 @@ int main_remus(int argc, char **argv)
         }
     }
 
-    if (!r_info.netbufscript)
-        r_info.netbufscript = default_remus_netbufscript;
+    if (!r_info.netbufscript) {
+        if (libxl_defbool_val(r_info.colo))
+            r_info.netbufscript = default_colo_proxy_script;
+        else
+            r_info.netbufscript = default_remus_netbufscript;
+    }
 
     if (libxl_defbool_val(r_info.blackhole)) {
         send_fd = open("/dev/null", O_RDWR, 0644);
@@ -8105,10 +8121,19 @@ int main_remus(int argc, char **argv)
         if (!ssh_command[0]) {
             rune = host;
         } else {
-            xasprintf(&rune, "exec %s %s xl migrate-receive %s %s",
-                      ssh_command, host,
-                      libxl_defbool_val(r_info.colo) ? "-c" : "-r",
-                      daemonize ? "" : " -e");
+            if (!libxl_defbool_val(r_info.colo)) {
+                xasprintf(&rune, "exec %s %s xl migrate-receive %s %s",
+                          ssh_command, host,
+                          "-r",
+                          daemonize ? "" : " -e");
+            } else {
+                xasprintf(&rune, "exec %s %s xl migrate-receive %s %s %s %s",
+                          ssh_command, host,
+                          "-c",
+                          r_info.netbufscript ? "-n" : "",
+                          r_info.netbufscript ? r_info.netbufscript : "",
+                          daemonize ? "" : " -e");
+            }
         }
 
         save_domain_core_begin(domid, NULL, &config_data, &config_len);
-- 
2.5.0
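
For readers following the main_remus() hunk above, here is a minimal sketch of how the receive-side command string ends up looking in the Remus and COLO cases. It assumes plain snprintf() in place of xl's xasprintf(), and build_receive_rune() is a hypothetical helper name, not something in xl.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of xl): formats the receive-side command
 * with the same format strings as the two xasprintf() calls in the hunk.
 * Returns the snprintf() result, i.e. the length that would be written.
 */
int build_receive_rune(char *buf, size_t len,
                       const char *ssh_command, const char *host,
                       int colo, const char *script, int daemonize)
{
    if (!colo)
        /* Remus: "-r", plus " -e" when not daemonizing. */
        return snprintf(buf, len, "exec %s %s xl migrate-receive %s %s",
                        ssh_command, host, "-r", daemonize ? "" : " -e");

    /* COLO: "-c", then "-n <script>" only when a proxy script is set. */
    return snprintf(buf, len, "exec %s %s xl migrate-receive %s %s %s %s",
                    ssh_command, host, "-c",
                    script ? "-n" : "",
                    script ? script : "",
                    daemonize ? "" : " -e");
}
```

When no script is set, the empty arguments leave runs of spaces in the string. That should be harmless as long as the string is split into words by a shell, but it is worth knowing when comparing strings in a test.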

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams
  2015-12-30  2:37 ` [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams Wen Congyang
@ 2016-01-26 20:40   ` Konrad Rzeszutek Wilk
  2016-01-27  6:47     ` Wen Congyang
  0 siblings, 1 reply; 45+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-26 20:40 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On Wed, Dec 30, 2015 at 10:37:32AM +0800, Wen Congyang wrote:
> It is the negotiation record for COLO.
> Primary->Secondary:
> control_id      0x00000000: Secondary VM is out of sync, start a new checkpoint
> Secondary->Primary:
>                 0x00000001: Secondary VM is suspended
>                 0x00000002: Secondary VM is ready
>                 0x00000003: Secondary VM is resumed
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
> ---
>  docs/specs/libxl-migration-stream.pandoc | 25 +++++++++++++++++++++++--
>  tools/libxl/libxl_sr_stream_format.h     | 11 +++++++++++
>  tools/python/xen/migration/libxl.py      |  9 +++++++++
>  3 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
> index 2c97d86..5166d66 100644
> --- a/docs/specs/libxl-migration-stream.pandoc
> +++ b/docs/specs/libxl-migration-stream.pandoc
> @@ -1,6 +1,6 @@
>  % LibXenLight Domain Image Format
>  % Andrew Cooper <<andrew.cooper3@citrix.com>>
> -% Revision 1
> +% Revision 2
>  
>  Introduction
>  ============
> @@ -119,7 +119,9 @@ type         0x00000000: END
>  
>               0x00000004: CHECKPOINT_END
>  
> -             0x00000005 - 0x7FFFFFFF: Reserved for future _mandatory_
> +             0x00000005: CHECKPOINT_STATE
> +
> +             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_

This is in the 'mandatory' records. Should it be part of optional records?

Would this checkpoint state always be present in a non-COLO guest migration?
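
For reference, the control_id values listed in the commit message could be written down as below. Only the numeric values and their descriptions come from the commit message; the enum and function names are made up for illustration and are not what the patch defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative names; only the values are taken from the commit message. */
enum colo_control_id {
    COLO_NEW_CHECKPOINT = 0x00000000, /* primary -> secondary */
    COLO_SVM_SUSPENDED  = 0x00000001, /* secondary -> primary */
    COLO_SVM_READY      = 0x00000002, /* secondary -> primary */
    COLO_SVM_RESUMED    = 0x00000003, /* secondary -> primary */
};

/* Returns the description, or NULL for an id the commit message omits. */
const char *colo_control_id_to_str(uint32_t id)
{
    switch (id) {
    case COLO_NEW_CHECKPOINT:
        return "Secondary VM is out of sync, start a new checkpoint";
    case COLO_SVM_SUSPENDED:
        return "Secondary VM is suspended";
    case COLO_SVM_READY:
        return "Secondary VM is ready";
    case COLO_SVM_RESUMED:
        return "Secondary VM is resumed";
    default:
        return NULL;
    }
}

/* Returns 1 if the id is one the primary sends, 0 otherwise. */
int colo_control_id_from_primary(uint32_t id)
{
    return id == COLO_NEW_CHECKPOINT;
}
```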

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 03/25] libxc/migration: Specification update for DIRTY_PFN_LIST records
  2015-12-30  2:37 ` [PATCH v9 03/25] libxc/migration: Specification update for DIRTY_PFN_LIST records Wen Congyang
@ 2016-01-26 20:44   ` Konrad Rzeszutek Wilk
  2016-01-27  6:47     ` Wen Congyang
  2016-01-27  7:12     ` Wen Congyang
  0 siblings, 2 replies; 45+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-26 20:44 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

> +             0x0000000F: DIRTY_PFN_LIST
> +

Perhaps make it part of the optional and prefix it with CHECKPOINT?

> +             0x00000010 - 0x7FFFFFFF: Reserved for future _mandatory_
>               records.
>  
>               0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
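
To make the mandatory/optional question concrete: going by the ranges quoted above, the top bit of the record type separates the two classes, so a check could be sketched as follows. The macro and function names are illustrative assumptions, not the identifiers used in libxc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative name: types 0x80000000 - 0xFFFFFFFF are in the optional range. */
#define SKETCH_REC_TYPE_OPTIONAL_BIT 0x80000000U

/* Value proposed for DIRTY_PFN_LIST in the quoted hunk. */
#define SKETCH_REC_TYPE_DIRTY_PFN_LIST 0x0000000FU

/* True if the type falls in the range reserved for optional records. */
bool rec_type_is_optional(uint32_t type)
{
    return (type & SKETCH_REC_TYPE_OPTIONAL_BIT) != 0;
}
```

With the value proposed in the patch, DIRTY_PFN_LIST sits in the mandatory range; moving it to the optional range would mean giving it a type with the top bit set.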

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 04/25] libxc/migration: export read_record for common use
  2015-12-30  2:37 ` [PATCH v9 04/25] libxc/migration: export read_record for common use Wen Congyang
@ 2016-01-26 20:45   ` Konrad Rzeszutek Wilk
  2016-01-27  0:57     ` Wen Congyang
  0 siblings, 1 reply; 45+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-26 20:45 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On Wed, Dec 30, 2015 at 10:37:34AM +0800, Wen Congyang wrote:
> read_record() could be used by primary to read dirty bitmap
> record sent by secondary under COLO.
> When used by save side, we need to pass the backchannel fd
> instead of ctx->fd to read_record(), so we added a fd param to
> it.

Could you add:

No functional changes.

> 
> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
> CC: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  tools/libxc/xc_sr_common.c  | 49 +++++++++++++++++++++++++++++++++++
>  tools/libxc/xc_sr_common.h  | 14 ++++++++++
>  tools/libxc/xc_sr_restore.c | 63 +--------------------------------------------
>  3 files changed, 64 insertions(+), 62 deletions(-)
> 
> diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
> index 8150140..42ee074 100644
> --- a/tools/libxc/xc_sr_common.c
> +++ b/tools/libxc/xc_sr_common.c
> @@ -89,6 +89,55 @@ int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
>      return -1;
>  }
>  
> +int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec)
> +{
> +    xc_interface *xch = ctx->xch;
> +    struct xc_sr_rhdr rhdr;
> +    size_t datasz;
> +
> +    if ( read_exact(fd, &rhdr, sizeof(rhdr)) )
> +    {
> +        PERROR("Failed to read Record Header from stream");
> +        return -1;
> +    }
> +    else if ( rhdr.length > REC_LENGTH_MAX )
> +    {
> +        ERROR("Record (0x%08x, %s) length %#x exceeds max (%#x)", rhdr.type,
> +              rec_type_to_str(rhdr.type), rhdr.length, REC_LENGTH_MAX);
> +        return -1;
> +    }
> +
> +    datasz = ROUNDUP(rhdr.length, REC_ALIGN_ORDER);
> +
> +    if ( datasz )
> +    {
> +        rec->data = malloc(datasz);
> +
> +        if ( !rec->data )
> +        {
> +            ERROR("Unable to allocate %zu bytes for record data (0x%08x, %s)",
> +                  datasz, rhdr.type, rec_type_to_str(rhdr.type));
> +            return -1;
> +        }
> +
> +        if ( read_exact(fd, rec->data, datasz) )
> +        {
> +            free(rec->data);
> +            rec->data = NULL;
> +            PERROR("Failed to read %zu bytes of data for record (0x%08x, %s)",
> +                   datasz, rhdr.type, rec_type_to_str(rhdr.type));
> +            return -1;
> +        }
> +    }
> +    else
> +        rec->data = NULL;
> +
> +    rec->type   = rhdr.type;
> +    rec->length = rhdr.length;
> +
> +    return 0;
> +};
> +
>  static void __attribute__((unused)) build_assertions(void)
>  {
>      XC_BUILD_BUG_ON(sizeof(struct xc_sr_ihdr) != 24);
> diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
> index bc99e9a..53d6129 100644
> --- a/tools/libxc/xc_sr_common.h
> +++ b/tools/libxc/xc_sr_common.h
> @@ -370,6 +370,20 @@ static inline int write_record(struct xc_sr_context *ctx,
>  }
>  
>  /*
> + * Reads a record from the stream, and fills in the record structure.
> + *
> + * Returns 0 on success and non-0 on failure.
> + *
> + * On success, the records type and size shall be valid.
> + * - If size is 0, data shall be NULL.
> + * - If size is non-0, data shall be a buffer allocated by malloc() which must
> + *   be passed to free() by the caller.
> + *
> + * On failure, the contents of the record structure are undefined.
> + */
> +int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec);
> +
> +/*
>   * This would ideally be private in restore.c, but is needed by
>   * x86_pv_localise_page() if we receive pagetables frames ahead of the
>   * contents of the frames they point at.
> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
> index d4dc501..e543be3 100644
> --- a/tools/libxc/xc_sr_restore.c
> +++ b/tools/libxc/xc_sr_restore.c
> @@ -69,67 +69,6 @@ static int read_headers(struct xc_sr_context *ctx)
>  }
>  
>  /*
> - * Reads a record from the stream, and fills in the record structure.
> - *
> - * Returns 0 on success and non-0 on failure.
> - *
> - * On success, the records type and size shall be valid.
> - * - If size is 0, data shall be NULL.
> - * - If size is non-0, data shall be a buffer allocated by malloc() which must
> - *   be passed to free() by the caller.
> - *
> - * On failure, the contents of the record structure are undefined.
> - */
> -static int read_record(struct xc_sr_context *ctx, struct xc_sr_record *rec)
> -{
> -    xc_interface *xch = ctx->xch;
> -    struct xc_sr_rhdr rhdr;
> -    size_t datasz;
> -
> -    if ( read_exact(ctx->fd, &rhdr, sizeof(rhdr)) )
> -    {
> -        PERROR("Failed to read Record Header from stream");
> -        return -1;
> -    }
> -    else if ( rhdr.length > REC_LENGTH_MAX )
> -    {
> -        ERROR("Record (0x%08x, %s) length %#x exceeds max (%#x)", rhdr.type,
> -              rec_type_to_str(rhdr.type), rhdr.length, REC_LENGTH_MAX);
> -        return -1;
> -    }
> -
> -    datasz = ROUNDUP(rhdr.length, REC_ALIGN_ORDER);
> -
> -    if ( datasz )
> -    {
> -        rec->data = malloc(datasz);
> -
> -        if ( !rec->data )
> -        {
> -            ERROR("Unable to allocate %zu bytes for record data (0x%08x, %s)",
> -                  datasz, rhdr.type, rec_type_to_str(rhdr.type));
> -            return -1;
> -        }
> -
> -        if ( read_exact(ctx->fd, rec->data, datasz) )
> -        {
> -            free(rec->data);
> -            rec->data = NULL;
> -            PERROR("Failed to read %zu bytes of data for record (0x%08x, %s)",
> -                   datasz, rhdr.type, rec_type_to_str(rhdr.type));
> -            return -1;
> -        }
> -    }
> -    else
> -        rec->data = NULL;
> -
> -    rec->type   = rhdr.type;
> -    rec->length = rhdr.length;
> -
> -    return 0;
> -};
> -
> -/*
>   * Is a pfn populated?
>   */
>  static bool pfn_is_populated(const struct xc_sr_context *ctx, xen_pfn_t pfn)
> @@ -646,7 +585,7 @@ static int restore(struct xc_sr_context *ctx)
>  
>      do
>      {
> -        rc = read_record(ctx, &rec);
> +        rc = read_record(ctx, ctx->fd, &rec);
>          if ( rc )
>          {
>              if ( ctx->restore.buffer_all_records )
> -- 
> 2.5.0
> 
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
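
The point of the new fd parameter can be shown with a self-contained sketch: the same reader works on any descriptor, so the restore side can pass ctx->fd and the save side can pass the back channel. The types, the 8-byte alignment, and the length limit below are simplified stand-ins chosen for illustration, not the libxc definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Simplified stand-ins for the libxc record header and record types. */
struct rec_hdr { uint32_t type; uint32_t length; };
struct record  { uint32_t type; uint32_t length; void *data; };

#define ALIGN_ORDER 3u            /* assumption: bodies padded to 8 bytes */
#define LENGTH_MAX  (128u << 20)  /* assumption: upper bound on body size */
#define ROUNDUP(x, order) \
    (((x) + (1u << (order)) - 1) & ~((1u << (order)) - 1))

/* Reads exactly len bytes, retrying on EINTR. Returns 0 on success. */
int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len) {
        ssize_t n = read(fd, p, len);

        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;  /* error or unexpected end of stream */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Reads one record from fd; on success the caller must free rec->data. */
int read_record_from(int fd, struct record *rec)
{
    struct rec_hdr hdr;
    size_t datasz;

    if (read_all(fd, &hdr, sizeof(hdr)) || hdr.length > LENGTH_MAX)
        return -1;

    /* The body on the wire is padded up to the alignment boundary. */
    datasz = ROUNDUP(hdr.length, ALIGN_ORDER);
    rec->data = NULL;

    if (datasz) {
        rec->data = malloc(datasz);
        if (!rec->data)
            return -1;
        if (read_all(fd, rec->data, datasz)) {
            free(rec->data);
            rec->data = NULL;
            return -1;
        }
    }

    rec->type = hdr.type;
    rec->length = hdr.length;
    return 0;
}
```

Because nothing in the reader depends on where the descriptor came from, it can be exercised with a pipe, which is a convenient way to test a back-channel reader without a second host.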

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 09/25] tools/libx{l, c}: introduce should_checkpoint callback
  2015-12-30  2:37 ` [PATCH v9 09/25] tools/libx{l, c}: introduce should_checkpoint callback Wen Congyang
@ 2016-01-26 20:50   ` Konrad Rzeszutek Wilk
  2016-01-26 21:09     ` Konrad Rzeszutek Wilk
  2016-01-27  1:18     ` Wen Congyang
  0 siblings, 2 replies; 45+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-26 20:50 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On Wed, Dec 30, 2015 at 10:37:39AM +0800, Wen Congyang wrote:
> Under COLO, we are doing checkpoint on demand, if this
> callback returns 1, we will take another checkpoint.

So 1 means OK.

> 0 indicates unexpected error.

Why not return an error?
> 
> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  tools/libxc/include/xenguest.h     | 18 ++++++++++++++++++
>  tools/libxl/libxl_save_msgs_gen.pl |  7 ++++---
>  2 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> index bd133af..88d6e13 100644
> --- a/tools/libxc/include/xenguest.h
> +++ b/tools/libxc/include/xenguest.h
> @@ -62,6 +62,15 @@ struct save_callbacks {
>       * 1: take another checkpoint */
>      int (*checkpoint)(void* data);
>  
> +    /*
> +     * Called after the checkpoint callback.
> +     *
> +     * returns:
> +     * 0: terminate checkpointing gracefully

checkpointing terminated gracefully

Why not return -EXX instead ?

> +     * 1: take another checkpoint

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 09/25] tools/libx{l, c}: introduce should_checkpoint callback
  2016-01-26 20:50   ` Konrad Rzeszutek Wilk
@ 2016-01-26 21:09     ` Konrad Rzeszutek Wilk
  2016-01-27  1:03       ` Wen Congyang
  2016-01-27  1:18     ` Wen Congyang
  1 sibling, 1 reply; 45+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-26 21:09 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On Tue, Jan 26, 2016 at 03:50:32PM -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Dec 30, 2015 at 10:37:39AM +0800, Wen Congyang wrote:
> > Under COLO, we are doing checkpoint on demand, if this
> > callback returns 1, we will take another checkpoint.
> 
> So 1 means OK.
> 
> > 0 indicates unexpected error.
> 
> Why not return an error?
> > 
> > Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
> > Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> > ---
> >  tools/libxc/include/xenguest.h     | 18 ++++++++++++++++++
> >  tools/libxl/libxl_save_msgs_gen.pl |  7 ++++---
> >  2 files changed, 22 insertions(+), 3 deletions(-)
> > 
> > diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> > index bd133af..88d6e13 100644
> > --- a/tools/libxc/include/xenguest.h
> > +++ b/tools/libxc/include/xenguest.h
> > @@ -62,6 +62,15 @@ struct save_callbacks {
> >       * 1: take another checkpoint */
> >      int (*checkpoint)(void* data);
> >  
> > +    /*
> > +     * Called after the checkpoint callback.
> > +     *
> > +     * returns:
> > +     * 0: terminate checkpointing gracefully
> 
> checkpointing terminated gracefully
> 
> Why not return -EXX instead ?
> 
> > +     * 1: take another checkpoint

Also, perhaps the function should simply be called 'checkpoint' or
'do_checkpoint' instead of 'should_checkpoint'?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 04/25] libxc/migration: export read_record for common use
  2016-01-26 20:45   ` Konrad Rzeszutek Wilk
@ 2016-01-27  0:57     ` Wen Congyang
  0 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2016-01-27  0:57 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On 01/27/2016 04:45 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Dec 30, 2015 at 10:37:34AM +0800, Wen Congyang wrote:
>> read_record() could be used by primary to read dirty bitmap
>> record sent by secondary under COLO.
>> When used by save side, we need to pass the backchannel fd
>> instead of ctx->fd to read_record(), so we added a fd param to
>> it.
> 
> Could you add:
> 
> No functional changes.

OK, will fix it in the next version.

Thanks
Wen Congyang

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 09/25] tools/libx{l, c}: introduce should_checkpoint callback
  2016-01-26 21:09     ` Konrad Rzeszutek Wilk
@ 2016-01-27  1:03       ` Wen Congyang
  0 siblings, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2016-01-27  1:03 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On 01/27/2016 05:09 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 26, 2016 at 03:50:32PM -0500, Konrad Rzeszutek Wilk wrote:
>> On Wed, Dec 30, 2015 at 10:37:39AM +0800, Wen Congyang wrote:
>>> Under COLO, we are doing checkpoint on demand, if this
>>> callback returns 1, we will take another checkpoint.
>>
>> So 1 means OK.
>>
>>> 0 indicates unexpected error.
>>
>> Why not return an error?
>>>
>>> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>> ---
>>>  tools/libxc/include/xenguest.h     | 18 ++++++++++++++++++
>>>  tools/libxl/libxl_save_msgs_gen.pl |  7 ++++---
>>>  2 files changed, 22 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
>>> index bd133af..88d6e13 100644
>>> --- a/tools/libxc/include/xenguest.h
>>> +++ b/tools/libxc/include/xenguest.h
>>> @@ -62,6 +62,15 @@ struct save_callbacks {
>>>       * 1: take another checkpoint */
>>>      int (*checkpoint)(void* data);
>>>  
>>> +    /*
>>> +     * Called after the checkpoint callback.
>>> +     *
>>> +     * returns:
>>> +     * 0: terminate checkpointing gracefully
>>
>> checkpointing terminated gracefully
>>
>> Why not return -EXX instead ?
>>
>>> +     * 1: take another checkpoint
> 
> Also perhaps the function instead of 'should_checkpoint' should just be
> called 'checkpoint' or 'do_checkpoint' ?

I will check it. IIRC, should_checkpoint() only waits for a new checkpoint.
If so, I think we can call it wait_checkpoint().

Thanks
Wen Congyang

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 09/25] tools/libx{l, c}: introduce should_checkpoint callback
  2016-01-26 20:50   ` Konrad Rzeszutek Wilk
  2016-01-26 21:09     ` Konrad Rzeszutek Wilk
@ 2016-01-27  1:18     ` Wen Congyang
  1 sibling, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2016-01-27  1:18 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On 01/27/2016 04:50 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Dec 30, 2015 at 10:37:39AM +0800, Wen Congyang wrote:
>> Under COLO, we are doing checkpoint on demand, if this
>> callback returns 1, we will take another checkpoint.
> 
> So 1 means OK.
> 
>> 0 indicates unexpected error.
> 
> Why not return an error?
>>
>> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> ---
>>  tools/libxc/include/xenguest.h     | 18 ++++++++++++++++++
>>  tools/libxl/libxl_save_msgs_gen.pl |  7 ++++---
>>  2 files changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
>> index bd133af..88d6e13 100644
>> --- a/tools/libxc/include/xenguest.h
>> +++ b/tools/libxc/include/xenguest.h
>> @@ -62,6 +62,15 @@ struct save_callbacks {
>>       * 1: take another checkpoint */
>>      int (*checkpoint)(void* data);
>>  
>> +    /*
>> +     * Called after the checkpoint callback.
>> +     *
>> +     * returns:
>> +     * 0: terminate checkpointing gracefully
> 
> checkpointing terminated gracefully
> 
> Why not return -EXX instead ?

Other callbacks also use 0 to indicate an error.

Thanks
Wen Congyang
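
To illustrate the convention being discussed, here is a rough sketch of a save loop in which both callbacks use 1 for "take another checkpoint" and 0 for "stop". The struct is a cut-down, hypothetical stand-in for struct save_callbacks, and the loop is not the libxc implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced callback table for illustration only. */
struct sketch_callbacks {
    int (*checkpoint)(void *data);        /* 0: stop, 1: take another */
    int (*should_checkpoint)(void *data); /* 0: stop, 1: take another */
    void *data;
};

/* Runs checkpoints until a callback returns 0; returns how many ran. */
int run_checkpoints(const struct sketch_callbacks *cb)
{
    int taken = 0;

    for (;;) {
        taken++;  /* stand-in for sending one checkpoint */

        if (!cb->checkpoint(cb->data))
            break;

        /* Called after the checkpoint callback, as the comment says. */
        if (cb->should_checkpoint && !cb->should_checkpoint(cb->data))
            break;
    }
    return taken;
}

/* Example callbacks used to exercise the loop. */
int always_continue(void *data)
{
    (void)data;
    return 1;
}

int countdown(void *data)
{
    int *left = data;

    return --(*left) > 0;
}
```

With this convention the caller cannot tell a graceful stop from a failure by the return value alone, which is the gap the -EXX suggestion points at.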

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams
  2016-01-26 20:40   ` Konrad Rzeszutek Wilk
@ 2016-01-27  6:47     ` Wen Congyang
  2016-01-27 11:00       ` Andrew Cooper
  0 siblings, 1 reply; 45+ messages in thread
From: Wen Congyang @ 2016-01-27  6:47 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On 01/27/2016 04:40 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Dec 30, 2015 at 10:37:32AM +0800, Wen Congyang wrote:
>> It is the negotiation record for COLO.
>> Primary->Secondary:
>> control_id      0x00000000: Secondary VM is out of sync, start a new checkpoint
>> Secondary->Primary:
>>                 0x00000001: Secondary VM is suspended
>>                 0x00000002: Secondary VM is ready
>>                 0x00000003: Secondary VM is resumed
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
>> ---
>>  docs/specs/libxl-migration-stream.pandoc | 25 +++++++++++++++++++++++--
>>  tools/libxl/libxl_sr_stream_format.h     | 11 +++++++++++
>>  tools/python/xen/migration/libxl.py      |  9 +++++++++
>>  3 files changed, 43 insertions(+), 2 deletions(-)
>>
>> diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
>> index 2c97d86..5166d66 100644
>> --- a/docs/specs/libxl-migration-stream.pandoc
>> +++ b/docs/specs/libxl-migration-stream.pandoc
>> @@ -1,6 +1,6 @@
>>  % LibXenLight Domain Image Format
>>  % Andrew Cooper <<andrew.cooper3@citrix.com>>
>> -% Revision 1
>> +% Revision 2
>>  
>>  Introduction
>>  ============
>> @@ -119,7 +119,9 @@ type         0x00000000: END
>>  
>>               0x00000004: CHECKPOINT_END
>>  
>> -             0x00000005 - 0x7FFFFFFF: Reserved for future _mandatory_
>> +             0x00000005: CHECKPOINT_STATE
>> +
>> +             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_
> 
> This is in the 'mandatory' records. Should it be part of optional records?
> 
> Would this checkpoint state always be present on non-COLO guest migration?

No. Will be fixed in the next version

Thanks
Wen Congyang



* Re: [PATCH v9 03/25] libxc/migration: Specification update for DIRTY_PFN_LIST records
  2016-01-26 20:44   ` Konrad Rzeszutek Wilk
@ 2016-01-27  6:47     ` Wen Congyang
  2016-01-27  7:12     ` Wen Congyang
  1 sibling, 0 replies; 45+ messages in thread
From: Wen Congyang @ 2016-01-27  6:47 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On 01/27/2016 04:44 AM, Konrad Rzeszutek Wilk wrote:
>> +             0x0000000F: DIRTY_PFN_LIST
>> +
> 
> Perhaps make it part of the optional and prefix it with CHECKPOINT?

Will be fixed in the next version.

Thanks
Wen Congyang

> 
>> +             0x00000010 - 0x7FFFFFFF: Reserved for future _mandatory_
>>               records.
>>  
>>               0x80000000 - 0xFFFFFFFF: Reserved for future _optional_


* Re: [PATCH v9 03/25] libxc/migration: Specification update for DIRTY_PFN_LIST records
  2016-01-26 20:44   ` Konrad Rzeszutek Wilk
  2016-01-27  6:47     ` Wen Congyang
@ 2016-01-27  7:12     ` Wen Congyang
  2016-01-27 10:00       ` Ian Campbell
  1 sibling, 1 reply; 45+ messages in thread
From: Wen Congyang @ 2016-01-27  7:12 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On 01/27/2016 04:44 AM, Konrad Rzeszutek Wilk wrote:
>> +             0x0000000F: DIRTY_PFN_LIST
>> +
> 
> Perhaps make it part of the optional and prefix it with CHECKPOINT?

IIUC, optional record can be ignored, but this record cannot be ignored.

To Andrew Cooper:
Should I mark this record as optional record?

Thanks
Wen Congyang

> 
>> +             0x00000010 - 0x7FFFFFFF: Reserved for future _mandatory_
>>               records.
>>  
>>               0x80000000 - 0xFFFFFFFF: Reserved for future _optional_


* Re: [PATCH v9 03/25] libxc/migration: Specification update for DIRTY_PFN_LIST records
  2016-01-27  7:12     ` Wen Congyang
@ 2016-01-27 10:00       ` Ian Campbell
  2016-01-27 11:01         ` Andrew Cooper
  0 siblings, 1 reply; 45+ messages in thread
From: Ian Campbell @ 2016-01-27 10:00 UTC (permalink / raw)
  To: Wen Congyang, Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Dong Eddie, Andrew Cooper,
	Jiang Yunhong, Ian Jackson, xen devel, Gui Jianfeng,
	Shriram Rajagopalan, Yang Hongyang

On Wed, 2016-01-27 at 15:12 +0800, Wen Congyang wrote:
> On 01/27/2016 04:44 AM, Konrad Rzeszutek Wilk wrote:
> > > +             0x0000000F: DIRTY_PFN_LIST
> > > +
> > 
> > Perhaps make it part of the optional and prefix it with CHECKPOINT?
> 
> IIUC, optional record can be ignored, but this record cannot be ignored.
> 
> To Andrew Cooper:
> Should I mark this record as optional record?

My understanding was that this indicated things for which support was
mandatory (whereas unknown optional ones may be ignored), not that they
must be present in every stream.

IOW placing this in the mandatory flags is correct, since the restorer
cannot simply ignore a checkpoint flag.

Ian.




* Re: [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams
  2016-01-27  6:47     ` Wen Congyang
@ 2016-01-27 11:00       ` Andrew Cooper
  2016-01-27 15:11         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 45+ messages in thread
From: Andrew Cooper @ 2016-01-27 11:00 UTC (permalink / raw)
  To: Wen Congyang, Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Gui Jianfeng,
	Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie,
	Shriram Rajagopalan, Yang Hongyang

On 27/01/16 06:47, Wen Congyang wrote:
> On 01/27/2016 04:40 AM, Konrad Rzeszutek Wilk wrote:
>> On Wed, Dec 30, 2015 at 10:37:32AM +0800, Wen Congyang wrote:
>>> It is the negotiation record for COLO.
>>> Primary->Secondary:
>>> control_id      0x00000000: Secondary VM is out of sync, start a new checkpoint
>>> Secondary->Primary:
>>>                 0x00000001: Secondary VM is suspended
>>>                 0x00000002: Secondary VM is ready
>>>                 0x00000003: Secondary VM is resumed
>>>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
>>> ---
>>>  docs/specs/libxl-migration-stream.pandoc | 25 +++++++++++++++++++++++--
>>>  tools/libxl/libxl_sr_stream_format.h     | 11 +++++++++++
>>>  tools/python/xen/migration/libxl.py      |  9 +++++++++
>>>  3 files changed, 43 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
>>> index 2c97d86..5166d66 100644
>>> --- a/docs/specs/libxl-migration-stream.pandoc
>>> +++ b/docs/specs/libxl-migration-stream.pandoc
>>> @@ -1,6 +1,6 @@
>>>  % LibXenLight Domain Image Format
>>>  % Andrew Cooper <<andrew.cooper3@citrix.com>>
>>> -% Revision 1
>>> +% Revision 2
>>>  
>>>  Introduction
>>>  ============
>>> @@ -119,7 +119,9 @@ type         0x00000000: END
>>>  
>>>               0x00000004: CHECKPOINT_END
>>>  
>>> -             0x00000005 - 0x7FFFFFFF: Reserved for future _mandatory_
>>> +             0x00000005: CHECKPOINT_STATE
>>> +
>>> +             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_
>> This is in the 'mandatory' records. Should it be part of optional records?
>>
>> Would this checkpoint state always be present on non-COLO guest migration?
> No. Will be fixed in the next version

It is correct that CHECKPOINT_STATE is a mandatory record.

Optional records are free for the receiving end to ignore if they
are not understood.

~Andrew


* Re: [PATCH v9 03/25] libxc/migration: Specification update for DIRTY_PFN_LIST records
  2016-01-27 10:00       ` Ian Campbell
@ 2016-01-27 11:01         ` Andrew Cooper
  0 siblings, 0 replies; 45+ messages in thread
From: Andrew Cooper @ 2016-01-27 11:01 UTC (permalink / raw)
  To: Ian Campbell, Wen Congyang, Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Dong Eddie, Gui Jianfeng,
	Jiang Yunhong, Ian Jackson, xen devel, Shriram Rajagopalan,
	Yang Hongyang

On 27/01/16 10:00, Ian Campbell wrote:
> On Wed, 2016-01-27 at 15:12 +0800, Wen Congyang wrote:
>> On 01/27/2016 04:44 AM, Konrad Rzeszutek Wilk wrote:
>>>> +             0x0000000F: DIRTY_PFN_LIST
>>>> +
>>> Perhaps make it part of the optional and prefix it with CHECKPOINT?
>> IIUC, optional record can be ignored, but this record cannot be ignored.
>>
>> To Andrew Cooper:
>> Should I mark this record as optional record?
> My understanding was that this indicated things for which support was
> mandatory (whereas unknown optional ones may be ignored), not that they
> must be present in every stream.
>
> IOW placing this in the mandatory flags is correct, since the restorer
> cannot simply ignore a checkpoint flag.

Both correct on all points.  This should be a mandatory record.

~Andrew


* Re: [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams
  2016-01-27 11:00       ` Andrew Cooper
@ 2016-01-27 15:11         ` Konrad Rzeszutek Wilk
  2016-01-27 15:15           ` Andrew Cooper
  0 siblings, 1 reply; 45+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-27 15:11 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang,
	Gui Jianfeng, Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie,
	Shriram Rajagopalan, Yang Hongyang

On Wed, Jan 27, 2016 at 11:00:24AM +0000, Andrew Cooper wrote:
> On 27/01/16 06:47, Wen Congyang wrote:
> > On 01/27/2016 04:40 AM, Konrad Rzeszutek Wilk wrote:
> >> On Wed, Dec 30, 2015 at 10:37:32AM +0800, Wen Congyang wrote:
> >>> It is the negotiation record for COLO.
> >>> Primary->Secondary:
> >>> control_id      0x00000000: Secondary VM is out of sync, start a new checkpoint
> >>> Secondary->Primary:
> >>>                 0x00000001: Secondary VM is suspended
> >>>                 0x00000002: Secondary VM is ready
> >>>                 0x00000003: Secondary VM is resumed
> >>>
> >>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >>> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
> >>> ---
> >>>  docs/specs/libxl-migration-stream.pandoc | 25 +++++++++++++++++++++++--
> >>>  tools/libxl/libxl_sr_stream_format.h     | 11 +++++++++++
> >>>  tools/python/xen/migration/libxl.py      |  9 +++++++++
> >>>  3 files changed, 43 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
> >>> index 2c97d86..5166d66 100644
> >>> --- a/docs/specs/libxl-migration-stream.pandoc
> >>> +++ b/docs/specs/libxl-migration-stream.pandoc
> >>> @@ -1,6 +1,6 @@
> >>>  % LibXenLight Domain Image Format
> >>>  % Andrew Cooper <<andrew.cooper3@citrix.com>>
> >>> -% Revision 1
> >>> +% Revision 2
> >>>  
> >>>  Introduction
> >>>  ============
> >>> @@ -119,7 +119,9 @@ type         0x00000000: END
> >>>  
> >>>               0x00000004: CHECKPOINT_END
> >>>  
> >>> -             0x00000005 - 0x7FFFFFFF: Reserved for future _mandatory_
> >>> +             0x00000005: CHECKPOINT_STATE
> >>> +
> >>> +             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_
> >> This is in the 'mandatory' records. Should it be part of optional records?
> >>
> >> Would this checkpoint state always be present on non-COLO guest migration?
> > No. Will be fixed in the next version
> 
> It is correct that CHECKPOINT_STATE is a mandatory record.
> 
> Optional records are free for the receiving end to ignore if they
> are not understood.

What you are saying is that the receiving end has to expect this (CHECKPOINT_STATE)
even if there is nothing in it - as its size is zero (because there are
no dirty PFNs to send).

> 
> ~Andrew


* Re: [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams
  2016-01-27 15:11         ` Konrad Rzeszutek Wilk
@ 2016-01-27 15:15           ` Andrew Cooper
  2016-01-27 15:28             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 45+ messages in thread
From: Andrew Cooper @ 2016-01-27 15:15 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang,
	Gui Jianfeng, Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie,
	Shriram Rajagopalan, Yang Hongyang

On 27/01/16 15:11, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 27, 2016 at 11:00:24AM +0000, Andrew Cooper wrote:
>> On 27/01/16 06:47, Wen Congyang wrote:
>>> On 01/27/2016 04:40 AM, Konrad Rzeszutek Wilk wrote:
>>>> On Wed, Dec 30, 2015 at 10:37:32AM +0800, Wen Congyang wrote:
>>>>> It is the negotiation record for COLO.
>>>>> Primary->Secondary:
>>>>> control_id      0x00000000: Secondary VM is out of sync, start a new checkpoint
>>>>> Secondary->Primary:
>>>>>                 0x00000001: Secondary VM is suspended
>>>>>                 0x00000002: Secondary VM is ready
>>>>>                 0x00000003: Secondary VM is resumed
>>>>>
>>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>>> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
>>>>> ---
>>>>>  docs/specs/libxl-migration-stream.pandoc | 25 +++++++++++++++++++++++--
>>>>>  tools/libxl/libxl_sr_stream_format.h     | 11 +++++++++++
>>>>>  tools/python/xen/migration/libxl.py      |  9 +++++++++
>>>>>  3 files changed, 43 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
>>>>> index 2c97d86..5166d66 100644
>>>>> --- a/docs/specs/libxl-migration-stream.pandoc
>>>>> +++ b/docs/specs/libxl-migration-stream.pandoc
>>>>> @@ -1,6 +1,6 @@
>>>>>  % LibXenLight Domain Image Format
>>>>>  % Andrew Cooper <<andrew.cooper3@citrix.com>>
>>>>> -% Revision 1
>>>>> +% Revision 2
>>>>>  
>>>>>  Introduction
>>>>>  ============
>>>>> @@ -119,7 +119,9 @@ type         0x00000000: END
>>>>>  
>>>>>               0x00000004: CHECKPOINT_END
>>>>>  
>>>>> -             0x00000005 - 0x7FFFFFFF: Reserved for future _mandatory_
>>>>> +             0x00000005: CHECKPOINT_STATE
>>>>> +
>>>>> +             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_
>>>> This is in the 'mandatory' records. Should it be part of optional records?
>>>>
>>>> Would this checkpoint state always be present on non-COLO guest migration?
>>> No. Will be fixed in the next version
>> It is correct that CHECKPOINT_STATE is a mandatory record.
>>
>> Optional records are free for the receiving end to ignore if they
>> are not understood.
> What you are saying is that the receiving end has to expect this (CHECKPOINT_STATE)
> even if there is nothing in it - as its size is zero (because there are
> no dirty PFNs to send).

The sole difference between a mandatory record and an optional record is
the receiver's behaviour.

Mandatory records may not be ignored: one that is not understood constitutes a hard error.
Optional records may be ignored, without error, if they are not understood.
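
Going by the ranges in the quoted spec hunk (0x00000000 - 0x7FFFFFFF
mandatory, 0x80000000 - 0xFFFFFFFF optional), the class of a record is
fixed by the top bit of its 32-bit type. A small Python sketch of that
classification follows; the names OPTIONAL_BIT and is_optional are
hypothetical and not taken from libxl.

```python
# Top bit of the 32-bit record type, per the ranges in the spec excerpt.
OPTIONAL_BIT = 0x80000000  # hypothetical name for this sketch


def is_optional(rec_type):
    """Return True if rec_type lies in the optional range of the spec."""
    if not 0 <= rec_type <= 0xFFFFFFFF:
        raise ValueError("record type must fit in 32 bits")
    return bool(rec_type & OPTIONAL_BIT)


# CHECKPOINT_STATE (0x00000005) sits in the mandatory range.
checkpoint_state_is_optional = is_optional(0x00000005)
```

The classification says nothing about whether a record must appear in a
stream; it only tells a receiver what to do with a type it does not know.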

~Andrew


* Re: [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams
  2016-01-27 15:15           ` Andrew Cooper
@ 2016-01-27 15:28             ` Konrad Rzeszutek Wilk
  2016-01-27 15:30               ` Andrew Cooper
  0 siblings, 1 reply; 45+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-27 15:28 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang,
	Gui Jianfeng, Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie,
	Shriram Rajagopalan, Yang Hongyang

On Wed, Jan 27, 2016 at 03:15:47PM +0000, Andrew Cooper wrote:
> On 27/01/16 15:11, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jan 27, 2016 at 11:00:24AM +0000, Andrew Cooper wrote:
> >> On 27/01/16 06:47, Wen Congyang wrote:
> >>> On 01/27/2016 04:40 AM, Konrad Rzeszutek Wilk wrote:
> >>>> On Wed, Dec 30, 2015 at 10:37:32AM +0800, Wen Congyang wrote:
> >>>>> It is the negotiation record for COLO.
> >>>>> Primary->Secondary:
> >>>>> control_id      0x00000000: Secondary VM is out of sync, start a new checkpoint
> >>>>> Secondary->Primary:
> >>>>>                 0x00000001: Secondary VM is suspended
> >>>>>                 0x00000002: Secondary VM is ready
> >>>>>                 0x00000003: Secondary VM is resumed
> >>>>>
> >>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >>>>> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
> >>>>> ---
> >>>>>  docs/specs/libxl-migration-stream.pandoc | 25 +++++++++++++++++++++++--
> >>>>>  tools/libxl/libxl_sr_stream_format.h     | 11 +++++++++++
> >>>>>  tools/python/xen/migration/libxl.py      |  9 +++++++++
> >>>>>  3 files changed, 43 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
> >>>>> index 2c97d86..5166d66 100644
> >>>>> --- a/docs/specs/libxl-migration-stream.pandoc
> >>>>> +++ b/docs/specs/libxl-migration-stream.pandoc
> >>>>> @@ -1,6 +1,6 @@
> >>>>>  % LibXenLight Domain Image Format
> >>>>>  % Andrew Cooper <<andrew.cooper3@citrix.com>>
> >>>>> -% Revision 1
> >>>>> +% Revision 2
> >>>>>  
> >>>>>  Introduction
> >>>>>  ============
> >>>>> @@ -119,7 +119,9 @@ type         0x00000000: END
> >>>>>  
> >>>>>               0x00000004: CHECKPOINT_END
> >>>>>  
> >>>>> -             0x00000005 - 0x7FFFFFFF: Reserved for future _mandatory_
> >>>>> +             0x00000005: CHECKPOINT_STATE
> >>>>> +
> >>>>> +             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_
> >>>> This is in the 'mandatory' records. Should it be part of optional records?
> >>>>
> >>>> Would this checkpoint state always be present on non-COLO guest migration?
> >>> No. Will be fixed in the next version
> >> It is correct that CHECKPOINT_STATE is a mandatory record.
> >>
> >> Optional records are free for the receiving end to ignore if they
> >> are not understood.
> > What you are saying is that the receiving end has to expect this (CHECKPOINT_STATE)
> > even if there is nothing in it - as its size is zero (because there are
> > no dirty PFNs to send).
> 
> The sole difference between a mandatory record and an optional record is
> the receiver's behaviour.
> 
> Mandatory records may not be ignored: one that is not understood constitutes a hard error.
> Optional records may be ignored, without error, if they are not understood.

You are still not answering my question.

Is it a hard error if the mandatory record is zero length?

> 
> ~Andrew


* Re: [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams
  2016-01-27 15:28             ` Konrad Rzeszutek Wilk
@ 2016-01-27 15:30               ` Andrew Cooper
  2016-01-27 16:01                 ` Ian Jackson
  0 siblings, 1 reply; 45+ messages in thread
From: Andrew Cooper @ 2016-01-27 15:30 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang,
	Gui Jianfeng, Jiang Yunhong, Ian Jackson, xen devel, Dong Eddie,
	Shriram Rajagopalan, Yang Hongyang

On 27/01/16 15:28, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 27, 2016 at 03:15:47PM +0000, Andrew Cooper wrote:
>> On 27/01/16 15:11, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Jan 27, 2016 at 11:00:24AM +0000, Andrew Cooper wrote:
>>>> On 27/01/16 06:47, Wen Congyang wrote:
>>>>> On 01/27/2016 04:40 AM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Wed, Dec 30, 2015 at 10:37:32AM +0800, Wen Congyang wrote:
>>>>>>> It is the negotiation record for COLO.
>>>>>>> Primary->Secondary:
>>>>>>> control_id      0x00000000: Secondary VM is out of sync, start a new checkpoint
>>>>>>> Secondary->Primary:
>>>>>>>                 0x00000001: Secondary VM is suspended
>>>>>>>                 0x00000002: Secondary VM is ready
>>>>>>>                 0x00000003: Secondary VM is resumed
>>>>>>>
>>>>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>>>>> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
>>>>>>> ---
>>>>>>>  docs/specs/libxl-migration-stream.pandoc | 25 +++++++++++++++++++++++--
>>>>>>>  tools/libxl/libxl_sr_stream_format.h     | 11 +++++++++++
>>>>>>>  tools/python/xen/migration/libxl.py      |  9 +++++++++
>>>>>>>  3 files changed, 43 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
>>>>>>> index 2c97d86..5166d66 100644
>>>>>>> --- a/docs/specs/libxl-migration-stream.pandoc
>>>>>>> +++ b/docs/specs/libxl-migration-stream.pandoc
>>>>>>> @@ -1,6 +1,6 @@
>>>>>>>  % LibXenLight Domain Image Format
>>>>>>>  % Andrew Cooper <<andrew.cooper3@citrix.com>>
>>>>>>> -% Revision 1
>>>>>>> +% Revision 2
>>>>>>>  
>>>>>>>  Introduction
>>>>>>>  ============
>>>>>>> @@ -119,7 +119,9 @@ type         0x00000000: END
>>>>>>>  
>>>>>>>               0x00000004: CHECKPOINT_END
>>>>>>>  
>>>>>>> -             0x00000005 - 0x7FFFFFFF: Reserved for future _mandatory_
>>>>>>> +             0x00000005: CHECKPOINT_STATE
>>>>>>> +
>>>>>>> +             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_
>>>>>> This is in the 'mandatory' records. Should it be part of optional records?
>>>>>>
>>>>>> Would this checkpoint state always be present on non-COLO guest migration?
>>>>> No. Will be fixed in the next version
>>>> It is correct that CHECKPOINT_STATE is a mandatory record.
>>>>
>>>> Optional records are free for the receiving end to ignore if they
>>>> are not understood.
>>> What you are saying is that the receiving end has to expect this (CHECKPOINT_STATE)
>>> even if there is nothing in it - as its size is zero (because there are
>>> no dirty PFNs to send).
>> The sole difference between a mandatory record and an optional record is
>> the receiver's behaviour.
>>
>> Mandatory records may not be ignored: one that is not understood constitutes a hard error.
>> Optional records may be ignored, without error, if they are not understood.
> You are still not answering my question.
>
> Is it a hard error if the mandatory record is zero length?

Not if the type specifies that a zero length record is permitted.
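
Put another way, whether an empty body is acceptable is a per-type rule
that only a receiver which knows the type can apply. A hedged Python
sketch: the type numbers come from the quoted hunk, but the minimum body
lengths and all names are assumptions for illustration, not values from
the spec.

```python
# Assumed minimum body lengths (in bytes) for two known record types.
MIN_BODY_LEN = {
    0x00000004: 0,  # CHECKPOINT_END: assumed to permit an empty body
    0x00000005: 4,  # CHECKPOINT_STATE: assumed to carry a 32-bit control id
}


def validate_known_record(rec_type, body):
    """Return True if a record of a known type has an acceptable length."""
    if rec_type not in MIN_BODY_LEN:
        raise KeyError("type 0x%08x is not known to this receiver" % rec_type)
    return len(body) >= MIN_BODY_LEN[rec_type]


# An empty body passes for one type and fails for the other.
empty_end_ok = validate_known_record(0x00000004, b"")
empty_state_ok = validate_known_record(0x00000005, b"")
```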

~Andrew


* Re: [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams
  2016-01-27 15:30               ` Andrew Cooper
@ 2016-01-27 16:01                 ` Ian Jackson
  0 siblings, 0 replies; 45+ messages in thread
From: Ian Jackson @ 2016-01-27 16:01 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lars Kurth, Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang,
	Gui Jianfeng, Jiang Yunhong, Dong Eddie, xen devel,
	Shriram Rajagopalan, Yang Hongyang

Andrew Cooper writes ("Re: [Xen-devel] [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams"):
> On 27/01/16 15:28, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jan 27, 2016 at 03:15:47PM +0000, Andrew Cooper wrote:
> >> Mandatory records may not be ignored: one that is not understood
> >> constitutes a hard error.  Optional records may be ignored, without
> >> error, if they are not understood.

`Mandatory' is a somewhat misleading term.  (In a past life we called
these different kinds of protocol extension `harmful' and `harmless'.
A `harmless' extension is one that is safe to ignore if the receiver
does not know about the extension.  A `harmful' extension will always
cause a naive receiver to get a parse error.)

> > You are still not answering my question.
> >
> > Is it a hard error if the mandatory record is zero length?
>
> Not if the type specifies that a zero length record is permitted.

This is misleading.  Obviously the behaviour of a naive receiver
cannot depend on the specification document for the type, because that
specification document was written after the type was invented.


Let me provide an exhaustive case analysis:

  Optional or   Value sent      Receiver knows    Receiver
  Mandatory                     about the type?   behaviour

  Optional or   No record of    Yes               Depends on type
  Mandatory     this type                         spec and semantics;
                                                  receiver may fail
                                                  if information is
                                                  required.

  Optional or   Empty           Yes               Depends on type
  Mandatory     record(s)                         spec and semantics;
                                                  empty record might be
                                                  invalid or semantically
                                                  inappropriate (eg
                                                  inconsistent with
                                                  other info)

  Optional or   Nonempty        Yes               Depends on type
  Mandatory     record(s)                         spec and semantics

  Optional or   No record of    No                Whatever it does
  Mandatory     this type                         normally, obviously



And these are the only cases where `Optional' or `Mandatory' makes a
difference:

  Optional      Empty           No                Record(s) silently
                record(s)                         discarded by receiver

  Optional      Nonempty        No                Record(s) silently
                record(s)                         discarded by receiver

  Mandatory     Empty           No                Receiver ABORTS
                record(s)                         entire operation

  Mandatory     Nonempty        No                Receiver ABORTS
                record(s)                         entire operation



There is nothing special about empty records.  Indeed an empty record
can be used as a flag.
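
The case analysis above can be sketched as a receiver loop. This is an
illustration in Python under stated assumptions, not libxl or libxc code:
receive, StreamAbort and the handlers mapping are hypothetical names, and
records are modelled as (type, body) pairs.

```python
class StreamAbort(Exception):
    """Raised when the receiver meets an unknown mandatory record."""


def receive(records, handlers):
    """Process (type, body) pairs; return the types that were discarded.

    handlers maps each type this receiver knows about to a callable.
    """
    discarded = []
    for rec_type, body in records:
        handler = handlers.get(rec_type)
        if handler is not None:
            # Known type: the type's own spec decides what any body,
            # empty or not, means.
            handler(body)
        elif rec_type & 0x80000000:
            # Unknown optional type: dropped without error. The list is
            # kept only so the example can be inspected.
            discarded.append(rec_type)
        else:
            # Unknown mandatory type: abort, whatever the body length.
            raise StreamAbort("unknown mandatory type 0x%08x" % rec_type)
    return discarded


# An empty record of a known type acting as a flag.
flags_seen = []
dropped = receive(
    [(0x00000005, b""), (0x80000001, b"x"), (0x80000002, b"")],
    {0x00000005: flags_seen.append},
)
```

Note that the body length is never consulted when deciding between
discarding and aborting, which matches the last four rows of the table.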

Ian.


Thread overview: 45+ messages
2015-12-30  2:37 [PATCH v9 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
2015-12-30  2:37 ` [PATCH v9 01/25] docs: add colo readme Wen Congyang
2015-12-30  2:37 ` [PATCH v9 02/25] docs/libxl: Introduce COLO_CONTEXT to support migration v2 colo streams Wen Congyang
2016-01-26 20:40   ` Konrad Rzeszutek Wilk
2016-01-27  6:47     ` Wen Congyang
2016-01-27 11:00       ` Andrew Cooper
2016-01-27 15:11         ` Konrad Rzeszutek Wilk
2016-01-27 15:15           ` Andrew Cooper
2016-01-27 15:28             ` Konrad Rzeszutek Wilk
2016-01-27 15:30               ` Andrew Cooper
2016-01-27 16:01                 ` Ian Jackson
2015-12-30  2:37 ` [PATCH v9 03/25] libxc/migration: Specification update for DIRTY_PFN_LIST records Wen Congyang
2016-01-26 20:44   ` Konrad Rzeszutek Wilk
2016-01-27  6:47     ` Wen Congyang
2016-01-27  7:12     ` Wen Congyang
2016-01-27 10:00       ` Ian Campbell
2016-01-27 11:01         ` Andrew Cooper
2015-12-30  2:37 ` [PATCH v9 04/25] libxc/migration: export read_record for common use Wen Congyang
2016-01-26 20:45   ` Konrad Rzeszutek Wilk
2016-01-27  0:57     ` Wen Congyang
2015-12-30  2:37 ` [PATCH v9 05/25] tools/libxl: add back channel support to write stream Wen Congyang
2015-12-30  2:37 ` [PATCH v9 06/25] tools/libxl: write checkpoint_state records into the stream Wen Congyang
2015-12-30  2:37 ` [PATCH v9 07/25] tools/libxl: add back channel support to read stream Wen Congyang
2015-12-30  2:37 ` [PATCH v9 08/25] tools/libxl: handle checkpoint_state records in a libxl migration v2 " Wen Congyang
2015-12-30  2:37 ` [PATCH v9 09/25] tools/libx{l, c}: introduce should_checkpoint callback Wen Congyang
2016-01-26 20:50   ` Konrad Rzeszutek Wilk
2016-01-26 21:09     ` Konrad Rzeszutek Wilk
2016-01-27  1:03       ` Wen Congyang
2016-01-27  1:18     ` Wen Congyang
2015-12-30  2:37 ` [PATCH v9 10/25] tools/libx{l, c}: add postcopy/suspend callback to restore side Wen Congyang
2015-12-30  2:37 ` [PATCH v9 11/25] secondary vm suspend/resume/checkpoint code Wen Congyang
2015-12-30  2:37 ` [PATCH v9 12/25] primary " Wen Congyang
2015-12-30  2:37 ` [PATCH v9 13/25] libxc/restore: support COLO restore Wen Congyang
2015-12-30  2:37 ` [PATCH v9 14/25] libxc/restore: send dirty pfn list to primary when checkpoint under colo Wen Congyang
2015-12-30  2:37 ` [PATCH v9 15/25] send store gfn and console gfn to xl before resuming secondary vm Wen Congyang
2015-12-30  2:37 ` [PATCH v9 16/25] libxc/save: support COLO save Wen Congyang
2015-12-30  2:37 ` [PATCH v9 17/25] implement the cmdline for COLO Wen Congyang
2015-12-30  2:37 ` [PATCH v9 18/25] Support colo mode for qemu disk Wen Congyang
2015-12-30  2:37 ` [PATCH v9 19/25] COLO: use qemu block replication Wen Congyang
2015-12-30  2:37 ` [PATCH v9 20/25] COLO proxy: implement setup/teardown of COLO proxy module Wen Congyang
2015-12-30  2:37 ` [PATCH v9 21/25] COLO proxy: preresume, postresume and checkpoint Wen Congyang
2015-12-30  2:37 ` [PATCH v9 22/25] COLO nic: implement COLO nic subkind Wen Congyang
2015-12-30  2:37 ` [PATCH v9 23/25] setup and control colo proxy on primary side Wen Congyang
2015-12-30  2:37 ` [PATCH v9 24/25] setup and control colo proxy on secondary side Wen Congyang
2015-12-30  2:37 ` [PATCH v9 25/25] cmdline switches and config vars to control colo-proxy Wen Congyang
