All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints
@ 2016-01-13  9:18 Changlong Xie
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 1/8] unblock backup operations in backing file Changlong Xie
                   ` (9 more replies)
  0 siblings, 10 replies; 34+ messages in thread
From: Changlong Xie @ 2016-01-13  9:18 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: fnstml-hwcolo

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

This patch series is based on the following patch series:
1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v14

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v13

TODO:
1. Continuous block replication. It will be started after basic functions
   are accepted.

Changs Log:
V14:
1. Implement auto complete active commit
2. Implement active commit block job for replication.c
3. Address the comments from Stefan, add replication-specific API and data
   structure, also remove old block layer APIs
V13:
1. Rebase to the newest codes
2. Remove redundant marcos and semicolon in replication.c 
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
   opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
   when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
   its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
   reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
   if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Wen Congyang (8):
  unblock backup operations in backing file
  Store parent BDS in BdrvChild
  Backup: clear all bitmap when doing block checkpoint
  Allow creating backup jobs when opening BDS
  docs: block replication's description
  auto complete active commit
  Implement new driver for block replication
  support replication driver in blockdev-add

 block.c                          |  19 ++
 block/Makefile.objs              |   3 +-
 block/backup.c                   |  14 +
 block/mirror.c                   |  13 +-
 block/replication-comm.c         |  66 +++++
 block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
 blockdev.c                       |   2 +-
 blockjob.c                       |  11 +
 docs/block-replication.txt       | 229 +++++++++++++++
 include/block/block_int.h        |   4 +-
 include/block/blockjob.h         |  12 +
 include/block/replication-comm.h |  50 ++++
 qapi/block-core.json             |  33 ++-
 qemu-img.c                       |   2 +-
 14 files changed, 1038 insertions(+), 10 deletions(-)
 create mode 100644 block/replication-comm.c
 create mode 100644 block/replication.c
 create mode 100644 docs/block-replication.txt
 create mode 100644 include/block/replication-comm.h

-- 
1.9.3

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v14 1/8] unblock backup operations in backing file
  2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
@ 2016-01-13  9:18 ` Changlong Xie
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 2/8] Store parent BDS in BdrvChild Changlong Xie
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Changlong Xie @ 2016-01-13  9:18 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: fnstml-hwcolo

From: Wen Congyang <wency@cn.fujitsu.com>

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
---
 block.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/block.c b/block.c
index e90b6cf..12b5231 100644
--- a/block.c
+++ b/block.c
@@ -1275,6 +1275,24 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
     /* Otherwise we won't be able to commit due to check in bdrv_commit */
     bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
                     bs->backing_blocker);
+    /*
+     * We do backup in 3 ways:
+     * 1. drive backup
+     *    The target bs is new opened, and the source is top BDS
+     * 2. blockdev backup
+     *    Both the source and the target are top BDSes.
+     * 3. internal backup(used for block replication)
+     *    Both the source and the target are backing file
+     *
+     * In case 1, and 2, the backing file is neither the source nor
+     * the target.
+     * In case 3, we will block the top BDS, so there is only one block
+     * job for the top BDS and its backing chain.
+     */
+    bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+                    bs->backing_blocker);
+    bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+                    bs->backing_blocker);
 out:
     bdrv_refresh_limits(bs, NULL);
 }
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v14 2/8] Store parent BDS in BdrvChild
  2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 1/8] unblock backup operations in backing file Changlong Xie
@ 2016-01-13  9:18 ` Changlong Xie
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint Changlong Xie
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Changlong Xie @ 2016-01-13  9:18 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: fnstml-hwcolo

From: Wen Congyang <wency@cn.fujitsu.com>

We need to access the parent BDS to get the root BDS.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
---
 block.c                   | 1 +
 include/block/block_int.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block.c b/block.c
index 12b5231..33d3427 100644
--- a/block.c
+++ b/block.c
@@ -1204,6 +1204,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
     BdrvChild *child = g_new(BdrvChild, 1);
     *child = (BdrvChild) {
         .bs     = child_bs,
+        .parent = parent_bs,
         .name   = g_strdup(child_name),
         .role   = child_role,
     };
diff --git a/include/block/block_int.h b/include/block/block_int.h
index ebe8b1e..19c02b6 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -361,6 +361,7 @@ extern const BdrvChildRole child_format;
 
 struct BdrvChild {
     BlockDriverState *bs;
+    BlockDriverState *parent;
     char *name;
     const BdrvChildRole *role;
     QLIST_ENTRY(BdrvChild) next;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint
  2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 1/8] unblock backup operations in backing file Changlong Xie
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 2/8] Store parent BDS in BdrvChild Changlong Xie
@ 2016-01-13  9:18 ` Changlong Xie
  2016-01-27 16:05   ` Stefan Hajnoczi
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS Changlong Xie
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Changlong Xie @ 2016-01-13  9:18 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang, fnstml-hwcolo

From: Wen Congyang <wency@cn.fujitsu.com>

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
---
 block/backup.c           | 14 ++++++++++++++
 blockjob.c               | 11 +++++++++++
 include/block/blockjob.h | 12 ++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index 705bb77..0a27d01 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -253,11 +253,25 @@ static void backup_abort(BlockJob *job)
     }
 }
 
+static void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+    BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+    if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+        error_setg(errp, "The backup job only supports block checkpoint in"
+                   " sync=none mode");
+        return;
+    }
+
+    hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
     .instance_size  = sizeof(BackupBlockJob),
     .job_type       = BLOCK_JOB_TYPE_BACKUP,
     .set_speed      = backup_set_speed,
     .iostatus_reset = backup_iostatus_reset,
+    .do_checkpoint  = backup_do_checkpoint,
     .commit         = backup_commit,
     .abort          = backup_abort,
 };
diff --git a/blockjob.c b/blockjob.c
index 80adb9d..0c8edfe 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
     QLIST_INSERT_HEAD(&txn->jobs, job, txn_list);
     block_job_txn_ref(txn);
 }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+    if (!job->driver->do_checkpoint) {
+        error_setg(errp, "The job %s doesn't support block checkpoint",
+                   BlockJobType_lookup[job->driver->job_type]);
+        return;
+    }
+
+    job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index d84ccd8..abdba7c 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -70,6 +70,9 @@ typedef struct BlockJobDriver {
      * never both.
      */
     void (*abort)(BlockJob *job);
+
+    /** Optional callback for job types that support checkpoint. */
+    void (*do_checkpoint)(BlockJob *job, Error **errp);
 } BlockJobDriver;
 
 /**
@@ -443,4 +446,13 @@ void block_job_txn_unref(BlockJobTxn *txn);
  */
 void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job);
 
+/**
+ * block_job_do_checkpoint:
+ * @job: The job.
+ * @errp: Error object.
+ *
+ * Do block checkpoint on the specified job.
+ */
+void block_job_do_checkpoint(BlockJob *job, Error **errp);
+
 #endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS
  2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
                   ` (2 preceding siblings ...)
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint Changlong Xie
@ 2016-01-13  9:18 ` Changlong Xie
  2016-01-27 14:04   ` Stefan Hajnoczi
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 5/8] docs: block replication's description Changlong Xie
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Changlong Xie @ 2016-01-13  9:18 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang, fnstml-hwcolo

From: Wen Congyang <wency@cn.fujitsu.com>

When opening BDS, we need to create backup jobs for
image-fleecing.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 58ef2ef..fa05f37 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += backup.o
 
 iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
 iscsi.o-libs       := $(LIBISCSI_LIBS)
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v14 5/8] docs: block replication's description
  2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
                   ` (3 preceding siblings ...)
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS Changlong Xie
@ 2016-01-13  9:18 ` Changlong Xie
  2016-02-02 15:20   ` Eric Blake
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 6/8] auto complete active commit Changlong Xie
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Changlong Xie @ 2016-01-13  9:18 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang, fnstml-hwcolo

From: Wen Congyang <wency@cn.fujitsu.com>

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
---
 docs/block-replication.txt | 229 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 229 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 0000000..d1a231e
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,229 @@
+Block replication
+----------------------------------------
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content (it could be from either "Secondary Write Requests" or
+       previous COW of "Primary Write Requests") in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be buffered in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary    2 filter
+     disk         ^                                                             virtio-blk
+                  |                                                                  ^
+                3 NBD  ------->  3 NBD                                               |
+                client    ||     server                                          2 filter
+                          ||        ^                                                ^
+--------.                 ||        |                                                |
+Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
+--------'                 ||        |          backing        ^       backing
+                          ||        |                         |
+                          ||        |                         |
+                          ||        '-------------------------'
+                          ||           drive-backup sync=none 6
+
+1) The disk on the primary is represented by a block device with two
+children, providing replication between a primary disk and the host that
+runs the secondary VM. The read pattern for quorum can be extended to
+make the primary always read from the local disk instead of going through
+NBD.
+
+2) The new block filter (the name is replication) will control the block
+replication.
+
+3) The secondary disk receives writes from the primary VM through QEMU's
+embedded NBD server (speculative write-through).
+
+4) The disk on the secondary is represented by a custom block device
+(called active-disk). It should start as an empty disk, and the format
+should support bdrv_make_empty() and backing file.
+
+5) The hidden-disk is created automatically. It buffers the original content
+that is modified by the primary VM. It should also start as an empty disk,
+and the driver supports bdrv_make_empty() and backing file.
+
+6) The drive-backup job(sync=none) is run to allow hidden-disk to buffer
+any state that would otherwise be lost by the speculative write-through
+of the NBD server into the secondary disk. So before block replication,
+the primary disk and secondary disk should contain the same data.
+
+== Failure Handling ==
+There are 7 internal errors when block replication is running:
+1. I/O error on primary disk
+2. Forwarding primary write requests failed
+3. Backup failed
+4. I/O error on secondary disk
+5. I/O error on active disk
+6. Making active disk or hidden disk empty failed
+7. Doing failover failed
+In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
+4 and 6, we just report block replication's error to FT/HA manager (which
+decides when to do a new checkpoint, when to do failover).
+In case 7, if active commit failed, we use replication failover failed state
+in Secondary's write operation (what decides which target to write).
+
+== New block driver interface ==
+We add three block driver interfaces to control block replication:
+a. block_replication_start_all()
+   Start block replication, called in migration/checkpoint thread.
+   We must call block_replication_start_all() in secondary QEMU before
+   calling block_replication_start_all() in primary QEMU. The caller
+   must hold the I/O mutex lock if it is in migration/checkpoint
+   thread.
+b. block_replication_do_checkpoint_all()
+   This interface is called after all VM state is transferred to
+   Secondary QEMU. The Disk buffer will be dropped in this interface.
+   The caller must hold the I/O mutex lock if it is in migration/checkpoint
+   thread.
+c. block_replication_stop_all()
+   It is called on failover. We will flush the Disk buffer into
+   Secondary Disk and stop block replication. The vm should be stopped
+   before calling it if you use this API to shutdown the guest, or other
+   things except failover. The caller must hold the I/O mutex lock if it is
+   in migration/checkpoint thread.
+
+== Usage ==
+Primary:
+  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
+         children.0.file.filename=1.raw,\
+         children.0.driver=raw
+
+  Run qmp command in primary qemu:
+    { 'execute': 'human-monitor-command',
+      'arguments': {
+          'command-line': 'drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1,if=none'
+      }
+    }
+    { 'execute': 'x-blockdev-change',
+      'arguments': {
+          'parent': 'colo1',
+          'node': 'nbd_client1'
+      }
+    }
+  Note:
+  1. There should be only one NBD Client for each primary disk.
+  2. host is the secondary physical machine's hostname or IP
+  3. Each disk must have its own export name.
+  4. It is all a single argument to -drive and you should ignore the
+     leading whitespace.
+  5. The qmp command line must be run after running qmp command line in
+     secondary qemu.
+
+Secondary:
+  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
+  -drive if=xxx,driver=replication,mode=secondary,\
+         file.file.filename=active_disk.qcow2,\
+         file.driver=qcow2,\
+         file.backing.file.filename=hidden_disk.qcow2,\
+         file.backing.driver=qcow2,\
+         file.backing.backing=colo1
+
+  Then run qmp command in secondary qemu:
+    { 'execute': 'nbd-server-start',
+      'arguments': {
+          'addr': {
+              'type': 'inet',
+              'data': {
+                  'host': 'xxx',
+                  'port': 'xxx'
+              }
+          }
+      }
+    }
+    { 'execute': 'nbd-server-add',
+      'arguments': {
+          'device': 'colo1',
+          'writable': true
+      }
+    }
+
+  Note:
+  1. The export name in secondary QEMU command line is the secondary
+     disk's id.
+  2. The export name for the same disk must be the same
+  3. The qmp command nbd-server-start and nbd-server-add must be run
+     before running the qmp command migrate on primary QEMU
+  4. Active disk, hidden disk and nbd target's length should be the
+     same.
+  5. It is better to put active disk and hidden disk in ramdisk.
+  6. It is all a single argument to -drive, and you should ignore
+     the leading whitespace.
+
+After Failover:
+Primary:
+  The secondary host is down, so we should run the following qmp command
+  to remove the nbd child from the quorum:
+  { 'execute': 'x-blockdev-change',
+    'arguments': {
+        'parent': 'colo1',
+        'child': 'children.1'
+    }
+  }
+  Note: there is no qmp command to remove the blockdev now
+
+Secondary:
+  The primary host is down, so we should do the following thing:
+  { 'execute': 'nbd-server-stop' }
+
+TODO:
+1. Continuous block replication
+2. Shared disk
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v14 6/8] auto complete active commit
  2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
                   ` (4 preceding siblings ...)
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 5/8] docs: block replication's description Changlong Xie
@ 2016-01-13  9:18 ` Changlong Xie
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication Changlong Xie
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Changlong Xie @ 2016-01-13  9:18 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: fnstml-hwcolo

From: Wen Congyang <wency@cn.fujitsu.com>

Auto complete mirror job in background to prevent from
blocking synchronously

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
---
 block/mirror.c            | 13 +++++++++----
 blockdev.c                |  2 +-
 include/block/block_int.h |  3 ++-
 qemu-img.c                |  2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index f201f2b..b8e51f8 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -716,7 +716,8 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
                              BlockCompletionFunc *cb,
                              void *opaque, Error **errp,
                              const BlockJobDriver *driver,
-                             bool is_none_mode, BlockDriverState *base)
+                             bool is_none_mode, BlockDriverState *base,
+                             bool auto_complete)
 {
     MirrorBlockJob *s;
     BlockDriverState *replaced_bs;
@@ -772,6 +773,9 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
     s->granularity = granularity;
     s->buf_size = ROUND_UP(buf_size, granularity);
     s->unmap = unmap;
+    if (auto_complete) {
+        s->should_complete = true;
+    }
 
     s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
     if (!s->dirty_bitmap) {
@@ -813,14 +817,15 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target,
     mirror_start_job(bs, target, replaces,
                      speed, granularity, buf_size,
                      on_source_error, on_target_error, unmap, cb, opaque, errp,
-                     &mirror_job_driver, is_none_mode, base);
+                     &mirror_job_driver, is_none_mode, base, false);
 }
 
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
                          int64_t speed,
                          BlockdevOnError on_error,
                          BlockCompletionFunc *cb,
-                         void *opaque, Error **errp)
+                         void *opaque, Error **errp,
+                         bool auto_complete)
 {
     int64_t length, base_length;
     int orig_base_flags;
@@ -861,7 +866,7 @@ void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
     bdrv_ref(base);
     mirror_start_job(bs, base, NULL, speed, 0, 0,
                      on_error, on_error, false, cb, opaque, &local_err,
-                     &commit_active_job_driver, false, base);
+                     &commit_active_job_driver, false, base, auto_complete);
     if (local_err) {
         error_propagate(errp, local_err);
         goto error_restore_flags;
diff --git a/blockdev.c b/blockdev.c
index 574a365..603e858 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3038,7 +3038,7 @@ void qmp_block_commit(const char *device,
             goto out;
         }
         commit_active_start(bs, base_bs, speed, on_error, block_job_cb,
-                            bs, &local_err);
+                            bs, &local_err, false);
     } else {
         commit_start(bs, base_bs, top_bs, speed, on_error, block_job_cb, bs,
                      has_backing_file ? backing_file : NULL, &local_err);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 19c02b6..1b2d51a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -634,13 +634,14 @@ void commit_start(BlockDriverState *bs, BlockDriverState *base,
  * @cb: Completion function for the job.
  * @opaque: Opaque pointer value passed to @cb.
  * @errp: Error object.
+ * @auto_complete: Auto complete the job.
  *
  */
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
                          int64_t speed,
                          BlockdevOnError on_error,
                          BlockCompletionFunc *cb,
-                         void *opaque, Error **errp);
+                         void *opaque, Error **errp, bool auto_complete);
 /*
  * mirror_start:
  * @bs: Block device to operate on.
diff --git a/qemu-img.c b/qemu-img.c
index 3d48b4f..4b8b092 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -759,7 +759,7 @@ static int img_commit(int argc, char **argv)
     };
 
     commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
-                        common_block_job_cb, &cbi, &local_err);
+                        common_block_job_cb, &cbi, &local_err, false);
     if (local_err) {
         goto done;
     }
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
                   ` (5 preceding siblings ...)
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 6/8] auto complete active commit Changlong Xie
@ 2016-01-13  9:18 ` Changlong Xie
  2016-01-20  0:04   ` Eric Blake
  2016-01-27 14:46   ` Stefan Hajnoczi
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 8/8] support replication driver in blockdev-add Changlong Xie
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 34+ messages in thread
From: Changlong Xie @ 2016-01-13  9:18 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang, fnstml-hwcolo

From: Wen Congyang <wency@cn.fujitsu.com>

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
---
 block/Makefile.objs              |   1 +
 block/replication-comm.c         |  66 +++++
 block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
 include/block/replication-comm.h |  50 ++++
 qapi/block-core.json             |  13 +
 5 files changed, 720 insertions(+)
 create mode 100644 block/replication-comm.c
 create mode 100644 block/replication.c
 create mode 100644 include/block/replication-comm.h

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fa05f37..7037662 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication-comm.o replication.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/replication-comm.c b/block/replication-comm.c
new file mode 100644
index 0000000..8af748b
--- /dev/null
+++ b/block/replication-comm.c
@@ -0,0 +1,66 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "block/replication-comm.h"
+
+static QLIST_HEAD(, BlockReplicationState) block_replication_states;
+
+BlockReplicationState *block_replication_new(BlockDriverState *bs,
+                                             BlockReplicationOps *ops)
+{
+    BlockReplicationState *brs;
+
+    brs = g_new0(BlockReplicationState, 1);
+    brs->bs = bs;
+    brs->ops = ops;
+    QLIST_INSERT_HEAD(&block_replication_states, brs, node);
+
+    return brs;
+}
+
+void block_replication_remove(BlockReplicationState *brs)
+{
+    QLIST_REMOVE(brs, node);
+    g_free(brs);
+}
+
+void block_replication_start_all(ReplicationMode mode, Error **errp)
+{
+    BlockReplicationState *brs, *next;
+    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
+        if (brs->ops && brs->ops->start) {
+            brs->ops->start(brs, mode, errp);
+        }
+    }
+}
+
+void block_replication_do_checkpoint_all(Error **errp)
+{
+    BlockReplicationState *brs, *next;
+    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
+        if (brs->ops && brs->ops->checkpoint) {
+            brs->ops->checkpoint(brs, errp);
+        }
+    }
+}
+
+void block_replication_stop_all(bool failover, Error **errp)
+{
+    BlockReplicationState *brs, *next;
+    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
+        if (brs->ops && brs->ops->stop) {
+            brs->ops->stop(brs, failover, errp);
+        }
+    }
+}
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 0000000..29c677a
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,590 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "block/blockjob.h"
+#include "block/nbd.h"
+#include "block/replication-comm.h"
+
+typedef struct BDRVReplicationState {
+    ReplicationMode mode;
+    int replication_state;
+    BlockDriverState *active_disk;
+    BlockDriverState *hidden_disk;
+    BlockDriverState *secondary_disk;
+    BlockDriverState *top_bs;
+    BlockReplicationState *brs;
+    Error *blocker;
+    int orig_hidden_flags;
+    int orig_secondary_flags;
+    int error;
+} BDRVReplicationState;
+
+enum {
+    BLOCK_REPLICATION_NONE,             /* block replication is not started */
+    BLOCK_REPLICATION_RUNNING,          /* block replication is running */
+    BLOCK_REPLICATION_FAILOVER,         /* failover is running in background */
+    BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/
+    BLOCK_REPLICATION_DONE,             /* block replication is done(after failover) */
+};
+
+static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
+                              Error **errp);
+static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp);
+static void replication_stop(BlockReplicationState *brs, bool failover,
+                             Error **errp);
+
+#define REPLICATION_MODE        "mode"
+static QemuOptsList replication_runtime_opts = {
+    .name = "replication",
+    .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+    .desc = {
+        {
+            .name = REPLICATION_MODE,
+            .type = QEMU_OPT_STRING,
+        },
+        { /* end of list */ }
+    },
+};
+
+static BlockReplicationOps replication_ops = {
+    .start = replication_start,
+    .checkpoint = replication_do_checkpoint,
+    .stop = replication_stop,
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+                            int flags, Error **errp)
+{
+    int ret;
+    BDRVReplicationState *s = bs->opaque;
+    Error *local_err = NULL;
+    QemuOpts *opts = NULL;
+    const char *mode;
+
+    ret = -EINVAL;
+    opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        goto fail;
+    }
+
+    mode = qemu_opt_get(opts, REPLICATION_MODE);
+    if (!mode) {
+        error_setg(&local_err, "Missing the option mode");
+        goto fail;
+    }
+
+    if (!strcmp(mode, "primary")) {
+        s->mode = REPLICATION_MODE_PRIMARY;
+    } else if (!strcmp(mode, "secondary")) {
+        s->mode = REPLICATION_MODE_SECONDARY;
+    } else {
+        error_setg(&local_err,
+                   "The option mode's value should be primary or secondary");
+        goto fail;
+    }
+
+    s->brs = block_replication_new(bs, &replication_ops);
+
+    ret = 0;
+
+fail:
+    qemu_opts_del(opts);
+    /* propagate error */
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
+    return ret;
+}
+
+static void replication_close(BlockDriverState *bs)
+{
+    BDRVReplicationState *s = bs->opaque;
+
+    if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
+        replication_stop(s->brs, false, NULL);
+        block_replication_remove(s->brs);
+    }
+}
+
+static int64_t replication_getlength(BlockDriverState *bs)
+{
+    return bdrv_getlength(bs->file->bs);
+}
+
+static int replication_get_io_status(BDRVReplicationState *s)
+{
+    switch (s->replication_state) {
+    case BLOCK_REPLICATION_NONE:
+        return -EIO;
+    case BLOCK_REPLICATION_RUNNING:
+        return 0;
+    case BLOCK_REPLICATION_FAILOVER:
+        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 0;
+    case BLOCK_REPLICATION_FAILOVER_FAILED:
+        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1;
+    case BLOCK_REPLICATION_DONE:
+        /*
+         * active commit job completes, and active disk and secondary_disk
+         * is swapped, so we can operate bs->file directly
+         */
+        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 0;
+    default:
+        abort();
+    }
+}
+
+static int replication_return_value(BDRVReplicationState *s, int ret)
+{
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        return ret;
+    }
+
+    if (ret < 0) {
+        s->error = ret;
+        ret = 0;
+    }
+
+    return ret;
+}
+
+static coroutine_fn int replication_co_readv(BlockDriverState *bs,
+                                             int64_t sector_num,
+                                             int remaining_sectors,
+                                             QEMUIOVector *qiov)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int ret;
+
+    if (s->mode == REPLICATION_MODE_PRIMARY) {
+        /* We only use it to forward primary write requests */
+        return -EIO;
+    }
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = bdrv_co_readv(bs->file->bs, sector_num, remaining_sectors, qiov);
+    return replication_return_value(s, ret);
+}
+
+static coroutine_fn int replication_co_writev(BlockDriverState *bs,
+                                              int64_t sector_num,
+                                              int remaining_sectors,
+                                              QEMUIOVector *qiov)
+{
+    BDRVReplicationState *s = bs->opaque;
+    QEMUIOVector hd_qiov;
+    uint64_t bytes_done = 0;
+    BlockDriverState *top = bs->file->bs;
+    BlockDriverState *base = s->secondary_disk;
+    BlockDriverState *target;
+    int ret, n;
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (ret == 0) {
+        ret = bdrv_co_writev(top, sector_num,
+                             remaining_sectors, qiov);
+        return replication_return_value(s, ret);
+    }
+
+    /*
+     * Failover failed, only write to active disk if the sectors
+     * have already been allocated in active disk/hidden disk.
+     */
+    qemu_iovec_init(&hd_qiov, qiov->niov);
+    while (remaining_sectors > 0) {
+        ret = bdrv_is_allocated_above(top, base, sector_num,
+                                      remaining_sectors, &n);
+        if (ret < 0) {
+            return ret;
+        }
+
+        qemu_iovec_reset(&hd_qiov);
+        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * BDRV_SECTOR_SIZE);
+
+        target = ret ? top : base;
+        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
+        if (ret < 0) {
+            return ret;
+        }
+
+        remaining_sectors -= n;
+        sector_num += n;
+        bytes_done += n * BDRV_SECTOR_SIZE;
+    }
+
+    return 0;
+}
+
+static coroutine_fn int replication_co_discard(BlockDriverState *bs,
+                                               int64_t sector_num,
+                                               int nb_sectors)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int ret;
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (ret == 1) {
+        /* It is secondary qemu and failover are running*/
+        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    ret = bdrv_co_discard(bs->file->bs, sector_num, nb_sectors);
+    return replication_return_value(s, ret);
+}
+
+static bool replication_recurse_is_first_non_filter(BlockDriverState *bs,
+                                                    BlockDriverState *candidate)
+{
+    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
+}
+
+static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
+{
+    Error *local_err = NULL;
+    int ret;
+
+    if (!s->secondary_disk->job) {
+        error_setg(errp, "Backup job is cancelled unexpectedly");
+        return;
+    }
+
+    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);
+    if (ret < 0) {
+        error_setg(errp, "Cannot make active disk empty");
+        return;
+    }
+
+    ret = s->hidden_disk->drv->bdrv_make_empty(s->hidden_disk);
+    if (ret < 0) {
+        error_setg(errp, "Cannot make hidden disk empty");
+        return;
+    }
+}
+
+static void reopen_backing_file(BDRVReplicationState *s, bool writable,
+                                Error **errp)
+{
+    BlockReopenQueue *reopen_queue = NULL;
+    int orig_hidden_flags, orig_secondary_flags;
+    int new_hidden_flags, new_secondary_flags;
+    Error *local_err = NULL;
+
+    if (writable) {
+        orig_hidden_flags = bdrv_get_flags(s->hidden_disk);
+        new_hidden_flags = orig_hidden_flags | BDRV_O_RDWR;
+        orig_secondary_flags = bdrv_get_flags(s->secondary_disk);
+        new_secondary_flags = orig_secondary_flags | BDRV_O_RDWR;
+    } else {
+        orig_hidden_flags = s->orig_hidden_flags | BDRV_O_RDWR;
+        new_hidden_flags = s->orig_hidden_flags;
+        orig_secondary_flags = s->orig_secondary_flags | BDRV_O_RDWR;
+        new_secondary_flags = s->orig_secondary_flags;
+    }
+
+    if (orig_hidden_flags != new_hidden_flags) {
+        reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk, NULL,
+                                         new_hidden_flags);
+    }
+
+    if (!(orig_secondary_flags & BDRV_O_RDWR)) {
+        reopen_queue = bdrv_reopen_queue(reopen_queue, s->secondary_disk, NULL,
+                                         new_secondary_flags);
+    }
+
+    if (reopen_queue) {
+        bdrv_reopen_multiple(reopen_queue, &local_err);
+        if (local_err != NULL) {
+            error_propagate(errp, local_err);
+        }
+    }
+}
+
+static void backup_job_cleanup(BDRVReplicationState *s)
+{
+    bdrv_op_unblock_all(s->top_bs, s->blocker);
+    error_free(s->blocker);
+    reopen_backing_file(s, false, NULL);
+}
+
+static void backup_job_completed(void *opaque, int ret)
+{
+    BDRVReplicationState *s = opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_FAILOVER) {
+        /* The backup job is cancelled unexpectedly */
+        s->error = -EIO;
+    }
+
+    backup_job_cleanup(s);
+}
+
+static BlockDriverState *get_top_bs(BlockDriverState *bs)
+{
+    BdrvChild *child;
+
+    while (!bs->blk) {
+        if (QLIST_EMPTY(&bs->parents)) {
+            return NULL;
+        }
+
+        child = QLIST_FIRST(&bs->parents);
+        if (QLIST_NEXT(child, next_parent)) {
+            return NULL;
+        }
+
+        bs = child->parent;
+    }
+
+    return bs;
+}
+
+static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
+                              Error **errp)
+{
+    BlockDriverState *bs = brs->bs;
+    BDRVReplicationState *s = brs->bs->opaque;
+    int64_t active_length, hidden_length, disk_length;
+    AioContext *aio_context;
+    Error *local_err = NULL;
+
+    if (s->replication_state != BLOCK_REPLICATION_NONE) {
+        error_setg(errp, "Block replication is running or done");
+        return;
+    }
+
+    if (s->mode != mode) {
+        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
+                   " but receives %d", s->mode, mode);
+        return;
+    }
+
+    switch (s->mode) {
+    case REPLICATION_MODE_PRIMARY:
+        break;
+    case REPLICATION_MODE_SECONDARY:
+        s->active_disk = bs->file->bs;
+        if (!bs->file->bs->backing) {
+            error_setg(errp, "Active disk doesn't have backing file");
+            return;
+        }
+
+        s->hidden_disk = s->active_disk->backing->bs;
+        if (!s->hidden_disk->backing) {
+            error_setg(errp, "Hidden disk doesn't have backing file");
+            return;
+        }
+
+        s->secondary_disk = s->hidden_disk->backing->bs;
+        if (!s->secondary_disk->blk) {
+            error_setg(errp, "The secondary disk doesn't have block backend");
+            return;
+        }
+
+        s->top_bs = get_top_bs(bs);
+        if (!s->top_bs) {
+            error_setg(errp, "Cannot get the top block driver state to do"
+                       " internal backup");
+            return;
+        }
+
+        /* verify the length */
+        active_length = bdrv_getlength(s->active_disk);
+        hidden_length = bdrv_getlength(s->hidden_disk);
+        disk_length = bdrv_getlength(s->secondary_disk);
+        if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
+            active_length != hidden_length || hidden_length != disk_length) {
+            error_setg(errp, "active disk, hidden disk, secondary disk's length"
+                       " are not the same");
+            return;
+        }
+
+        if (!s->active_disk->drv->bdrv_make_empty ||
+            !s->hidden_disk->drv->bdrv_make_empty) {
+            error_setg(errp,
+                       "active disk or hidden disk doesn't support make_empty");
+            return;
+        }
+
+        /* reopen the backing file in r/w mode */
+        reopen_backing_file(s, true, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+
+        /* start backup job now */
+        error_setg(&s->blocker,
+                   "block device is in use by internal backup job");
+        bdrv_op_block_all(s->top_bs, s->blocker);
+        bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
+        bdrv_ref(s->hidden_disk);
+
+        aio_context = bdrv_get_aio_context(bs);
+        aio_context_acquire(aio_context);
+        backup_start(s->secondary_disk, s->hidden_disk, 0,
+                     MIRROR_SYNC_MODE_NONE, NULL, BLOCKDEV_ON_ERROR_REPORT,
+                     BLOCKDEV_ON_ERROR_REPORT, backup_job_completed,
+                     s, NULL, &local_err);
+        aio_context_release(aio_context);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            backup_job_cleanup(s);
+            bdrv_unref(s->hidden_disk);
+            return;
+        }
+        break;
+    default:
+        abort();
+    }
+
+    s->replication_state = BLOCK_REPLICATION_RUNNING;
+
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        secondary_do_checkpoint(s, errp);
+    }
+
+    s->error = 0;
+}
+
+
+static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp)
+{
+    BDRVReplicationState *s = brs->bs->opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    if (s->error) {
+        error_setg(errp, "I/O error occurs");
+        return;
+    }
+
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        secondary_do_checkpoint(s, errp);
+    }
+}
+
+static void replication_done(void *opaque, int ret)
+{
+    BlockDriverState *bs = opaque;
+    BDRVReplicationState *s = bs->opaque;
+
+    if (ret == 0) {
+        s->replication_state = BLOCK_REPLICATION_DONE;
+
+        /* refresh top bs's filename */
+        bdrv_refresh_filename(bs);
+        s->active_disk = NULL;
+        s->secondary_disk = NULL;
+        s->hidden_disk = NULL;
+        s->error = 0;
+    } else {
+        s->replication_state = BLOCK_REPLICATION_FAILOVER_FAILED;
+        s->error = -EIO;
+    }
+}
+
+static void replication_stop(BlockReplicationState *brs, bool failover, Error **errp)
+{
+    BlockDriverState *bs = brs->bs;
+    BDRVReplicationState *s = brs->bs->opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    switch (s->mode) {
+    case REPLICATION_MODE_PRIMARY:
+        s->replication_state = BLOCK_REPLICATION_DONE;
+        s->error = 0;
+        break;
+    case REPLICATION_MODE_SECONDARY:
+        if (!failover) {
+            /*
+             * This BDS will be closed, and the job should be completed
+             * before the BDS is closed, because we will access hidden
+             * disk, secondary disk in backup_job_completed().
+             */
+            if (s->secondary_disk->job) {
+                block_job_cancel_sync(s->secondary_disk->job);
+            }
+            secondary_do_checkpoint(s, errp);
+            s->replication_state = BLOCK_REPLICATION_DONE;
+            return;
+        }
+
+        s->replication_state = BLOCK_REPLICATION_FAILOVER;
+        if (s->secondary_disk->job) {
+            block_job_cancel(s->secondary_disk->job);
+        }
+
+        commit_active_start(s->active_disk, s->secondary_disk, 0,
+                            BLOCKDEV_ON_ERROR_REPORT, replication_done,
+                            bs, errp, true);
+        break;
+    default:
+        abort();
+    }
+}
+
+BlockDriver bdrv_replication = {
+    .format_name                = "replication",
+    .protocol_name              = "replication",
+    .instance_size              = sizeof(BDRVReplicationState),
+
+    .bdrv_open                  = replication_open,
+    .bdrv_close                 = replication_close,
+
+    .bdrv_getlength             = replication_getlength,
+    .bdrv_co_readv              = replication_co_readv,
+    .bdrv_co_writev             = replication_co_writev,
+    .bdrv_co_discard            = replication_co_discard,
+
+    .is_filter                  = true,
+    .bdrv_recurse_is_first_non_filter = replication_recurse_is_first_non_filter,
+
+    .has_variable_length        = true,
+};
+
+static void bdrv_replication_init(void)
+{
+    bdrv_register(&bdrv_replication);
+}
+
+block_init(bdrv_replication_init);
diff --git a/include/block/replication-comm.h b/include/block/replication-comm.h
new file mode 100644
index 0000000..6483e56
--- /dev/null
+++ b/include/block/replication-comm.h
@@ -0,0 +1,50 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef REPLICATION_H
+#define REPLICATION_H
+
+#include "qemu/queue.h"
+#include "block/block_int.h"
+
+typedef struct BlockReplicationOps BlockReplicationOps;
+typedef struct BlockReplicationState BlockReplicationState;
+typedef void (*Start)(BlockReplicationState *brs, ReplicationMode mode, Error **errp);
+typedef void (*Stop)(BlockReplicationState *brs, bool failover, Error **errp);
+typedef void (*Checkpoint)(BlockReplicationState *brs, Error **errp);
+
+struct BlockReplicationState {
+    BlockDriverState *bs;
+    BlockReplicationOps *ops;
+    QLIST_ENTRY(BlockReplicationState) node;
+};
+
+struct BlockReplicationOps{
+    Start start;
+    Stop stop;
+    Checkpoint checkpoint;
+};
+
+BlockReplicationState *block_replication_new(BlockDriverState *bs,
+                                             BlockReplicationOps *ops);
+
+void block_replication_remove(BlockReplicationState *brs);
+
+void block_replication_start_all(ReplicationMode mode, Error **errp);
+
+void block_replication_do_checkpoint_all(Error **errp);
+
+void block_replication_stop_all(bool failover, Error **errp);
+
+#endif /* REPLICATION_H */
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2e86ede..d73f46b 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1975,6 +1975,19 @@
             '*read-pattern': 'QuorumReadPattern' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.6
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v14 8/8] support replication driver in blockdev-add
  2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
                   ` (6 preceding siblings ...)
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication Changlong Xie
@ 2016-01-13  9:18 ` Changlong Xie
  2016-01-14  1:12 ` [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
  2016-01-24 13:34 ` Wen Congyang
  9 siblings, 0 replies; 34+ messages in thread
From: Changlong Xie @ 2016-01-13  9:18 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang, fnstml-hwcolo

From: Wen Congyang <wency@cn.fujitsu.com>

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 qapi/block-core.json | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index d73f46b..00350ed 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -220,6 +220,7 @@
 #       2.2: 'archipelago' added, 'cow' dropped
 #       2.3: 'host_floppy' deprecated
 #       2.5: 'host_floppy' dropped
+#       2.6: 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1540,6 +1541,7 @@
 # Drivers that are supported in block device operations.
 #
 # @host_device, @host_cdrom: Since 2.1
+# @replication: Since 2.6
 #
 # Since: 2.0
 ##
@@ -1547,8 +1549,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
             'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
             'http', 'https', 'null-aio', 'null-co', 'parallels',
-            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-            'vmdk', 'vpc', 'vvfat' ] }
+            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication',
+            'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsBase
@@ -1988,6 +1990,19 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# Since: 2.6
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
@@ -2024,6 +2039,7 @@
       'quorum':     'BlockdevOptionsQuorum',
       'raw':        'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+      'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
       'tftp':       'BlockdevOptionsFile',
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints
  2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
                   ` (7 preceding siblings ...)
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 8/8] support replication driver in blockdev-add Changlong Xie
@ 2016-01-14  1:12 ` Changlong Xie
  2016-01-19  1:27   ` Hailiang Zhang
  2016-01-24 13:34 ` Wen Congyang
  9 siblings, 1 reply; 34+ messages in thread
From: Changlong Xie @ 2016-01-14  1:12 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Gonglei, zhanghailiang

It seems i missed someone in CC list, add them.

Thanks
	-Xie

On 01/13/2016 05:18 PM, Changlong Xie wrote:
> Block replication is a very important feature which is used for
> continuous checkpoints(for example: COLO).
>
> You can get the detailed information about block replication from here:
> http://wiki.qemu.org/Features/BlockReplication
>
> Usage:
> Please refer to docs/block-replication.txt
>
> This patch series is based on the following patch series:
> 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html
>
> You can get the patch here:
> https://github.com/Pating/qemu/tree/changlox/block-replication-v14
>
> You can get the patch with framework here:
> https://github.com/Pating/qemu/tree/changlox/colo_framework_v13
>
> TODO:
> 1. Continuous block replication. It will be started after basic functions
>     are accepted.
>
> Changs Log:
> V14:
> 1. Implement auto complete active commit
> 2. Implement active commit block job for replication.c
> 3. Address the comments from Stefan, add replication-specific API and data
>     structure, also remove old block layer APIs
> V13:
> 1. Rebase to the newest codes
> 2. Remove redundant marcos and semicolon in replication.c
> 3. Fix typos in block-replication.txt
> V12:
> 1. Rebase to the newest codes
> 2. Use backing reference to replcace 'allow-write-backing-file'
> V11:
> 1. Reopen the backing file when starting blcok replication if it is not
>     opened in R/W mode
> 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
>     when opening backing file
> 3. Block the top BDS so there is only one block job for the top BDS and
>     its backing chain.
> V10:
> 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
>     reference.
> 2. Address the comments from Eric Blake
> V9:
> 1. Update the error messages
> 2. Rebase to the newest qemu
> 3. Split child add/delete support. These patches are sent in another patchset.
> V8:
> 1. Address Alberto Garcia's comments
> V7:
> 1. Implement adding/removing quorum child. Remove the option non-connect.
> 2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion
> V6:
> 1. Rebase to the newest qemu.
> V5:
> 1. Address the comments from Gong Lei
> 2. Speed the failover up. The secondary vm can take over very quickly even
>     if there are too many I/O requests.
> V4:
> 1. Introduce a new driver replication to avoid touch nbd and qcow2.
> V3:
> 1: use error_setg() instead of error_set()
> 2. Add a new block job API
> 3. Active disk, hidden disk and nbd target uses the same AioContext
> 4. Add a testcase to test new hbitmap API
> V2:
> 1. Redesign the secondary qemu(use image-fleecing)
> 2. Use Error objects to return error message
> 3. Address the comments from Max Reitz and Eric Blake
>
> Wen Congyang (8):
>    unblock backup operations in backing file
>    Store parent BDS in BdrvChild
>    Backup: clear all bitmap when doing block checkpoint
>    Allow creating backup jobs when opening BDS
>    docs: block replication's description
>    auto complete active commit
>    Implement new driver for block replication
>    support replication driver in blockdev-add
>
>   block.c                          |  19 ++
>   block/Makefile.objs              |   3 +-
>   block/backup.c                   |  14 +
>   block/mirror.c                   |  13 +-
>   block/replication-comm.c         |  66 +++++
>   block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>   blockdev.c                       |   2 +-
>   blockjob.c                       |  11 +
>   docs/block-replication.txt       | 229 +++++++++++++++
>   include/block/block_int.h        |   4 +-
>   include/block/blockjob.h         |  12 +
>   include/block/replication-comm.h |  50 ++++
>   qapi/block-core.json             |  33 ++-
>   qemu-img.c                       |   2 +-
>   14 files changed, 1038 insertions(+), 10 deletions(-)
>   create mode 100644 block/replication-comm.c
>   create mode 100644 block/replication.c
>   create mode 100644 docs/block-replication.txt
>   create mode 100644 include/block/replication-comm.h
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints
  2016-01-14  1:12 ` [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
@ 2016-01-19  1:27   ` Hailiang Zhang
  0 siblings, 0 replies; 34+ messages in thread
From: Hailiang Zhang @ 2016-01-19  1:27 UTC (permalink / raw)
  To: Changlong Xie, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Kevin Wolf, Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, peter.huangpeng,
	Michael R. Hines, Dr. David Alan Gilbert, Gonglei

ping...

It seems that there is no feedback for a long time, we hope COLO prototype could
be merged in QEMU 2.6, it depends on this series, so please help us.

Thanks.
zhanghailiang

On 2016/1/14 9:12, Changlong Xie wrote:
> It seems i missed someone in CC list, add them.
>
> Thanks
>      -Xie
>
> On 01/13/2016 05:18 PM, Changlong Xie wrote:
>> Block replication is a very important feature which is used for
>> continuous checkpoints(for example: COLO).
>>
>> You can get the detailed information about block replication from here:
>> http://wiki.qemu.org/Features/BlockReplication
>>
>> Usage:
>> Please refer to docs/block-replication.txt
>>
>> This patch series is based on the following patch series:
>> 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html
>>
>> You can get the patch here:
>> https://github.com/Pating/qemu/tree/changlox/block-replication-v14
>>
>> You can get the patch with framework here:
>> https://github.com/Pating/qemu/tree/changlox/colo_framework_v13
>>
>> TODO:
>> 1. Continuous block replication. It will be started after basic functions
>>     are accepted.
>>
>> Changs Log:
>> V14:
>> 1. Implement auto complete active commit
>> 2. Implement active commit block job for replication.c
>> 3. Address the comments from Stefan, add replication-specific API and data
>>     structure, also remove old block layer APIs
>> V13:
>> 1. Rebase to the newest codes
>> 2. Remove redundant marcos and semicolon in replication.c
>> 3. Fix typos in block-replication.txt
>> V12:
>> 1. Rebase to the newest codes
>> 2. Use backing reference to replcace 'allow-write-backing-file'
>> V11:
>> 1. Reopen the backing file when starting blcok replication if it is not
>>     opened in R/W mode
>> 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
>>     when opening backing file
>> 3. Block the top BDS so there is only one block job for the top BDS and
>>     its backing chain.
>> V10:
>> 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
>>     reference.
>> 2. Address the comments from Eric Blake
>> V9:
>> 1. Update the error messages
>> 2. Rebase to the newest qemu
>> 3. Split child add/delete support. These patches are sent in another patchset.
>> V8:
>> 1. Address Alberto Garcia's comments
>> V7:
>> 1. Implement adding/removing quorum child. Remove the option non-connect.
>> 2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion
>> V6:
>> 1. Rebase to the newest qemu.
>> V5:
>> 1. Address the comments from Gong Lei
>> 2. Speed the failover up. The secondary vm can take over very quickly even
>>     if there are too many I/O requests.
>> V4:
>> 1. Introduce a new driver replication to avoid touch nbd and qcow2.
>> V3:
>> 1: use error_setg() instead of error_set()
>> 2. Add a new block job API
>> 3. Active disk, hidden disk and nbd target uses the same AioContext
>> 4. Add a testcase to test new hbitmap API
>> V2:
>> 1. Redesign the secondary qemu(use image-fleecing)
>> 2. Use Error objects to return error message
>> 3. Address the comments from Max Reitz and Eric Blake
>>
>> Wen Congyang (8):
>>    unblock backup operations in backing file
>>    Store parent BDS in BdrvChild
>>    Backup: clear all bitmap when doing block checkpoint
>>    Allow creating backup jobs when opening BDS
>>    docs: block replication's description
>>    auto complete active commit
>>    Implement new driver for block replication
>>    support replication driver in blockdev-add
>>
>>   block.c                          |  19 ++
>>   block/Makefile.objs              |   3 +-
>>   block/backup.c                   |  14 +
>>   block/mirror.c                   |  13 +-
>>   block/replication-comm.c         |  66 +++++
>>   block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>>   blockdev.c                       |   2 +-
>>   blockjob.c                       |  11 +
>>   docs/block-replication.txt       | 229 +++++++++++++++
>>   include/block/block_int.h        |   4 +-
>>   include/block/blockjob.h         |  12 +
>>   include/block/replication-comm.h |  50 ++++
>>   qapi/block-core.json             |  33 ++-
>>   qemu-img.c                       |   2 +-
>>   14 files changed, 1038 insertions(+), 10 deletions(-)
>>   create mode 100644 block/replication-comm.c
>>   create mode 100644 block/replication.c
>>   create mode 100644 docs/block-replication.txt
>>   create mode 100644 include/block/replication-comm.h
>>
>
>
>
> .
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication Changlong Xie
@ 2016-01-20  0:04   ` Eric Blake
  2016-01-20  7:45     ` Changlong Xie
  2016-01-27 14:46   ` Stefan Hajnoczi
  1 sibling, 1 reply; 34+ messages in thread
From: Eric Blake @ 2016-01-20  0:04 UTC (permalink / raw)
  To: Changlong Xie, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Kevin Wolf, Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang, fnstml-hwcolo

[-- Attachment #1: Type: text/plain, Size: 4582 bytes --]

On 01/13/2016 02:18 AM, Changlong Xie wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
> ---
>  block/Makefile.objs              |   1 +
>  block/replication-comm.c         |  66 +++++
>  block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>  include/block/replication-comm.h |  50 ++++
>  qapi/block-core.json             |  13 +
>  5 files changed, 720 insertions(+)
>  create mode 100644 block/replication-comm.c
>  create mode 100644 block/replication.c
>  create mode 100644 include/block/replication-comm.h
> 

Just a high-level overview, mainly on the user-visible interface and
things that caught my eye.

> +++ b/block/replication-comm.c
> @@ -0,0 +1,66 @@
> +/*
> + * Replication Block filter
> + *
> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2015 Intel Corporation
> + * Copyright (c) 2015 FUJITSU LIMITED

Do you want to claim 2016 in any of this?


> +
> +enum {
> +    BLOCK_REPLICATION_NONE,             /* block replication is not started */
> +    BLOCK_REPLICATION_RUNNING,          /* block replication is running */
> +    BLOCK_REPLICATION_FAILOVER,         /* failover is running in background */
> +    BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/

Space before */

> +    BLOCK_REPLICATION_DONE,             /* block replication is done(after failover) */
> +};

Should this be an enum type in a .json file?


> +
> +static int replication_open(BlockDriverState *bs, QDict *options,
> +                            int flags, Error **errp)
> +{

> +
> +fail:
> +    qemu_opts_del(opts);
> +    /* propagate error */
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +    }

It's safe to call error_propagate() unconditionally (you could drop the
'if (local_err)').


> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
> +                                               int64_t sector_num,
> +                                               int nb_sectors)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +    int ret;
> +
> +    ret = replication_get_io_status(s);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (ret == 1) {
> +        /* It is secondary qemu and failover are running*/

Space before */

> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
> +{
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    if (!s->secondary_disk->job) {
> +        error_setg(errp, "Backup job is cancelled unexpectedly");

Maybe s/is/was/

> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
> +                              Error **errp)
> +{
> +    BlockDriverState *bs = brs->bs;
> +    BDRVReplicationState *s = brs->bs->opaque;
> +    int64_t active_length, hidden_length, disk_length;
> +    AioContext *aio_context;
> +    Error *local_err = NULL;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_NONE) {
> +        error_setg(errp, "Block replication is running or done");
> +        return;
> +    }
> +
> +    if (s->mode != mode) {
> +        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
> +                   " but receives %d", s->mode, mode);

s/receives/got/


> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp)
> +{
> +    BDRVReplicationState *s = brs->bs->opaque;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
> +        error_setg(errp, "Block replication is not running");
> +        return;
> +    }
> +
> +    if (s->error) {
> +        error_setg(errp, "I/O error occurs");

s/occurs/occurred/

> +++ b/qapi/block-core.json
> @@ -1975,6 +1975,19 @@
>              '*read-pattern': 'QuorumReadPattern' } }
>  
>  ##
> +# @ReplicationMode
> +#
> +# An enumeration of replication modes.
> +#
> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
> +#
> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
> +#
> +# Since: 2.6
> +##
> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
> +

Interface looks okay.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-01-20  0:04   ` Eric Blake
@ 2016-01-20  7:45     ` Changlong Xie
  0 siblings, 0 replies; 34+ messages in thread
From: Changlong Xie @ 2016-01-20  7:45 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Kevin Wolf, Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang

On 01/20/2016 08:04 AM, Eric Blake wrote:
> On 01/13/2016 02:18 AM, Changlong Xie wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
>> ---
>>   block/Makefile.objs              |   1 +
>>   block/replication-comm.c         |  66 +++++
>>   block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>>   include/block/replication-comm.h |  50 ++++
>>   qapi/block-core.json             |  13 +
>>   5 files changed, 720 insertions(+)
>>   create mode 100644 block/replication-comm.c
>>   create mode 100644 block/replication.c
>>   create mode 100644 include/block/replication-comm.h
>>
>
> Just a high-level overview, mainly on the user-visible interface and
> things that caught my eye.

Hi eric, thanks for your patience.

>
>> +++ b/block/replication-comm.c
>> @@ -0,0 +1,66 @@
>> +/*
>> + * Replication Block filter
>> + *
>> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> + * Copyright (c) 2015 Intel Corporation
>> + * Copyright (c) 2015 FUJITSU LIMITED
>
> Do you want to claim 2016 in any of this?
>
>

Will correct all Copyright issues in next version.

>> +
>> +enum {
>> +    BLOCK_REPLICATION_NONE,             /* block replication is not started */
>> +    BLOCK_REPLICATION_RUNNING,          /* block replication is running */
>> +    BLOCK_REPLICATION_FAILOVER,         /* failover is running in background */
>> +    BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/
>
> Space before */

Will fix this space issue

>
>> +    BLOCK_REPLICATION_DONE,             /* block replication is done(after failover) */
>> +};
>
> Should this be an enum type in a .json file?
>

Since this is just an internal enum that only used in 
block/replication.c, i think we don't need put it in *.json file.

>
>> +
>> +static int replication_open(BlockDriverState *bs, QDict *options,
>> +                            int flags, Error **errp)
>> +{
>
>> +
>> +fail:
>> +    qemu_opts_del(opts);
>> +    /* propagate error */
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +    }
>
> It's safe to call error_propagate() unconditionally (you could drop the
> 'if (local_err)').

You're right, i'll fix all relevant issue in my patchset.

>
>
>> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
>> +                                               int64_t sector_num,
>> +                                               int nb_sectors)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +    int ret;
>> +
>> +    ret = replication_get_io_status(s);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    if (ret == 1) {
>> +        /* It is secondary qemu and failover are running*/
>
> Space before */

Will fix

>
>> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
>> +{
>> +    Error *local_err = NULL;
>> +    int ret;
>> +
>> +    if (!s->secondary_disk->job) {
>> +        error_setg(errp, "Backup job is cancelled unexpectedly");
>
> Maybe s/is/was/
>

will fix

>> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
>> +                              Error **errp)
>> +{
>> +    BlockDriverState *bs = brs->bs;
>> +    BDRVReplicationState *s = brs->bs->opaque;
>> +    int64_t active_length, hidden_length, disk_length;
>> +    AioContext *aio_context;
>> +    Error *local_err = NULL;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_NONE) {
>> +        error_setg(errp, "Block replication is running or done");
>> +        return;
>> +    }
>> +
>> +    if (s->mode != mode) {
>> +        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
>> +                   " but receives %d", s->mode, mode);
>
> s/receives/got/
>

will fix

>
>> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp)
>> +{
>> +    BDRVReplicationState *s = brs->bs->opaque;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
>> +        error_setg(errp, "Block replication is not running");
>> +        return;
>> +    }
>> +
>> +    if (s->error) {
>> +        error_setg(errp, "I/O error occurs");
>
> s/occurs/occurred/
>

Will fix

>> +++ b/qapi/block-core.json
>> @@ -1975,6 +1975,19 @@
>>               '*read-pattern': 'QuorumReadPattern' } }
>>
>>   ##
>> +# @ReplicationMode
>> +#
>> +# An enumeration of replication modes.
>> +#
>> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
>> +#
>> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
>> +
>
> Interface looks okay.


Thanks
	-Xie

>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints
  2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
                   ` (8 preceding siblings ...)
  2016-01-14  1:12 ` [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
@ 2016-01-24 13:34 ` Wen Congyang
  2016-01-26 16:04   ` Stefan Hajnoczi
  9 siblings, 1 reply; 34+ messages in thread
From: Wen Congyang @ 2016-01-24 13:34 UTC (permalink / raw)
  To: Changlong Xie, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Kevin Wolf, Stefan Hajnoczi

Stefan:ping

Do you have time to review this series patchset?

Thanks
Wen Congyang

At 2016/1/13 17:18, Changlong Xie wrote:
> Block replication is a very important feature which is used for
> continuous checkpoints(for example: COLO).
>
> You can get the detailed information about block replication from here:
> http://wiki.qemu.org/Features/BlockReplication
>
> Usage:
> Please refer to docs/block-replication.txt
>
> This patch series is based on the following patch series:
> 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html
>
> You can get the patch here:
> https://github.com/Pating/qemu/tree/changlox/block-replication-v14
>
> You can get the patch with framework here:
> https://github.com/Pating/qemu/tree/changlox/colo_framework_v13
>
> TODO:
> 1. Continuous block replication. It will be started after basic functions
>     are accepted.
>
> Changs Log:
> V14:
> 1. Implement auto complete active commit
> 2. Implement active commit block job for replication.c
> 3. Address the comments from Stefan, add replication-specific API and data
>     structure, also remove old block layer APIs
> V13:
> 1. Rebase to the newest codes
> 2. Remove redundant marcos and semicolon in replication.c
> 3. Fix typos in block-replication.txt
> V12:
> 1. Rebase to the newest codes
> 2. Use backing reference to replcace 'allow-write-backing-file'
> V11:
> 1. Reopen the backing file when starting blcok replication if it is not
>     opened in R/W mode
> 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
>     when opening backing file
> 3. Block the top BDS so there is only one block job for the top BDS and
>     its backing chain.
> V10:
> 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
>     reference.
> 2. Address the comments from Eric Blake
> V9:
> 1. Update the error messages
> 2. Rebase to the newest qemu
> 3. Split child add/delete support. These patches are sent in another patchset.
> V8:
> 1. Address Alberto Garcia's comments
> V7:
> 1. Implement adding/removing quorum child. Remove the option non-connect.
> 2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion
> V6:
> 1. Rebase to the newest qemu.
> V5:
> 1. Address the comments from Gong Lei
> 2. Speed the failover up. The secondary vm can take over very quickly even
>     if there are too many I/O requests.
> V4:
> 1. Introduce a new driver replication to avoid touch nbd and qcow2.
> V3:
> 1: use error_setg() instead of error_set()
> 2. Add a new block job API
> 3. Active disk, hidden disk and nbd target uses the same AioContext
> 4. Add a testcase to test new hbitmap API
> V2:
> 1. Redesign the secondary qemu(use image-fleecing)
> 2. Use Error objects to return error message
> 3. Address the comments from Max Reitz and Eric Blake
>
> Wen Congyang (8):
>    unblock backup operations in backing file
>    Store parent BDS in BdrvChild
>    Backup: clear all bitmap when doing block checkpoint
>    Allow creating backup jobs when opening BDS
>    docs: block replication's description
>    auto complete active commit
>    Implement new driver for block replication
>    support replication driver in blockdev-add
>
>   block.c                          |  19 ++
>   block/Makefile.objs              |   3 +-
>   block/backup.c                   |  14 +
>   block/mirror.c                   |  13 +-
>   block/replication-comm.c         |  66 +++++
>   block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>   blockdev.c                       |   2 +-
>   blockjob.c                       |  11 +
>   docs/block-replication.txt       | 229 +++++++++++++++
>   include/block/block_int.h        |   4 +-
>   include/block/blockjob.h         |  12 +
>   include/block/replication-comm.h |  50 ++++
>   qapi/block-core.json             |  33 ++-
>   qemu-img.c                       |   2 +-
>   14 files changed, 1038 insertions(+), 10 deletions(-)
>   create mode 100644 block/replication-comm.c
>   create mode 100644 block/replication.c
>   create mode 100644 docs/block-replication.txt
>   create mode 100644 include/block/replication-comm.h
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints
  2016-01-24 13:34 ` Wen Congyang
@ 2016-01-26 16:04   ` Stefan Hajnoczi
  0 siblings, 0 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2016-01-26 16:04 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Changlong Xie, Fam Zheng, qemu devel, Max Reitz,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 179 bytes --]

On Sun, Jan 24, 2016 at 09:34:49PM +0800, Wen Congyang wrote:
> Stefan:ping
> 
> Do you have time to review this series patchset?

Yes, I will review it tomorrow.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS Changlong Xie
@ 2016-01-27 14:04   ` Stefan Hajnoczi
  2016-01-28  2:22     ` Changlong Xie
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2016-01-27 14:04 UTC (permalink / raw)
  To: Changlong Xie
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, fnstml-hwcolo, qemu devel,
	Max Reitz, Gonglei, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1386 bytes --]

On Wed, Jan 13, 2016 at 05:18:28PM +0800, Changlong Xie wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> When opening BDS, we need to create backup jobs for
> image-fleecing.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Reviewed-by: Jeff Cody <jcody@redhat.com>
> ---
>  block/Makefile.objs | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 58ef2ef..fa05f37 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
>  block-obj-$(CONFIG_LIBSSH2) += ssh.o
>  block-obj-y += accounting.o
>  block-obj-y += write-threshold.o
> +block-obj-y += backup.o
>  
>  common-obj-y += stream.o
>  common-obj-y += commit.o
> -common-obj-y += backup.o
>  
>  iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
>  iscsi.o-libs       := $(LIBISCSI_LIBS)

The commit message and description seem outdated.  I guess the purpose
of this patch is to link the backup block job into all programs that use
the block layer because you want to add a dependency on the it from core
code.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication Changlong Xie
  2016-01-20  0:04   ` Eric Blake
@ 2016-01-27 14:46   ` Stefan Hajnoczi
  2016-01-28  1:13     ` Wen Congyang
  1 sibling, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2016-01-27 14:46 UTC (permalink / raw)
  To: Changlong Xie
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, fnstml-hwcolo, qemu devel,
	Max Reitz, Gonglei, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 27280 bytes --]

On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
> ---
>  block/Makefile.objs              |   1 +
>  block/replication-comm.c         |  66 +++++
>  block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>  include/block/replication-comm.h |  50 ++++
>  qapi/block-core.json             |  13 +
>  5 files changed, 720 insertions(+)
>  create mode 100644 block/replication-comm.c
>  create mode 100644 block/replication.c
>  create mode 100644 include/block/replication-comm.h
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index fa05f37..7037662 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
>  block-obj-y += accounting.o
>  block-obj-y += write-threshold.o
>  block-obj-y += backup.o
> +block-obj-y += replication-comm.o replication.o
>  
>  common-obj-y += stream.o
>  common-obj-y += commit.o
> diff --git a/block/replication-comm.c b/block/replication-comm.c
> new file mode 100644
> index 0000000..8af748b
> --- /dev/null
> +++ b/block/replication-comm.c
> @@ -0,0 +1,66 @@
> +/*
> + * Replication Block filter

Is the start/stop/checkpoint callback interface only useful for block
replication?

This seems like a generic interface for registering with COLO.  Other
components (networking, etc) might also need start/stop/checkpoint
callbacks.  If that's the case then this code should be outside block/
and the brs->bs field should either be void *opaque or removed (the
caller needs to use container_of()).

> + *
> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2015 Intel Corporation
> + * Copyright (c) 2015 FUJITSU LIMITED
> + *
> + * Author:
> + *   Wen Congyang <wency@cn.fujitsu.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "block/replication-comm.h"
> +
> +static QLIST_HEAD(, BlockReplicationState) block_replication_states;
> +
> +BlockReplicationState *block_replication_new(BlockDriverState *bs,
> +                                             BlockReplicationOps *ops)
> +{
> +    BlockReplicationState *brs;
> +
> +    brs = g_new0(BlockReplicationState, 1);
> +    brs->bs = bs;
> +    brs->ops = ops;
> +    QLIST_INSERT_HEAD(&block_replication_states, brs, node);
> +
> +    return brs;
> +}
> +
> +void block_replication_remove(BlockReplicationState *brs)
> +{
> +    QLIST_REMOVE(brs, node);
> +    g_free(brs);
> +}
> +
> +void block_replication_start_all(ReplicationMode mode, Error **errp)
> +{
> +    BlockReplicationState *brs, *next;
> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
> +        if (brs->ops && brs->ops->start) {
> +            brs->ops->start(brs, mode, errp);
> +        }
> +    }
> +}
> +
> +void block_replication_do_checkpoint_all(Error **errp)
> +{
> +    BlockReplicationState *brs, *next;
> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
> +        if (brs->ops && brs->ops->checkpoint) {
> +            brs->ops->checkpoint(brs, errp);
> +        }
> +    }
> +}
> +
> +void block_replication_stop_all(bool failover, Error **errp)
> +{
> +    BlockReplicationState *brs, *next;
> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
> +        if (brs->ops && brs->ops->stop) {
> +            brs->ops->stop(brs, failover, errp);
> +        }
> +    }
> +}
> diff --git a/block/replication.c b/block/replication.c
> new file mode 100644
> index 0000000..29c677a
> --- /dev/null
> +++ b/block/replication.c
> @@ -0,0 +1,590 @@
> +/*
> + * Replication Block filter
> + *
> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2015 Intel Corporation
> + * Copyright (c) 2015 FUJITSU LIMITED
> + *
> + * Author:
> + *   Wen Congyang <wency@cn.fujitsu.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu-common.h"
> +#include "block/blockjob.h"
> +#include "block/nbd.h"
> +#include "block/replication-comm.h"
> +
> +typedef struct BDRVReplicationState {
> +    ReplicationMode mode;
> +    int replication_state;
> +    BlockDriverState *active_disk;
> +    BlockDriverState *hidden_disk;
> +    BlockDriverState *secondary_disk;
> +    BlockDriverState *top_bs;
> +    BlockReplicationState *brs;
> +    Error *blocker;
> +    int orig_hidden_flags;
> +    int orig_secondary_flags;
> +    int error;
> +} BDRVReplicationState;
> +
> +enum {
> +    BLOCK_REPLICATION_NONE,             /* block replication is not started */
> +    BLOCK_REPLICATION_RUNNING,          /* block replication is running */
> +    BLOCK_REPLICATION_FAILOVER,         /* failover is running in background */
> +    BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/
> +    BLOCK_REPLICATION_DONE,             /* block replication is done(after failover) */
> +};
> +
> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
> +                              Error **errp);
> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp);
> +static void replication_stop(BlockReplicationState *brs, bool failover,
> +                             Error **errp);
> +
> +#define REPLICATION_MODE        "mode"
> +static QemuOptsList replication_runtime_opts = {
> +    .name = "replication",
> +    .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
> +    .desc = {
> +        {
> +            .name = REPLICATION_MODE,
> +            .type = QEMU_OPT_STRING,
> +        },
> +        { /* end of list */ }
> +    },
> +};
> +
> +static BlockReplicationOps replication_ops = {
> +    .start = replication_start,
> +    .checkpoint = replication_do_checkpoint,
> +    .stop = replication_stop,
> +};
> +
> +static int replication_open(BlockDriverState *bs, QDict *options,
> +                            int flags, Error **errp)
> +{
> +    int ret;
> +    BDRVReplicationState *s = bs->opaque;
> +    Error *local_err = NULL;
> +    QemuOpts *opts = NULL;
> +    const char *mode;
> +
> +    ret = -EINVAL;
> +    opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
> +    qemu_opts_absorb_qdict(opts, options, &local_err);
> +    if (local_err) {
> +        goto fail;
> +    }
> +
> +    mode = qemu_opt_get(opts, REPLICATION_MODE);
> +    if (!mode) {
> +        error_setg(&local_err, "Missing the option mode");
> +        goto fail;
> +    }
> +
> +    if (!strcmp(mode, "primary")) {
> +        s->mode = REPLICATION_MODE_PRIMARY;
> +    } else if (!strcmp(mode, "secondary")) {
> +        s->mode = REPLICATION_MODE_SECONDARY;
> +    } else {
> +        error_setg(&local_err,
> +                   "The option mode's value should be primary or secondary");
> +        goto fail;
> +    }
> +
> +    s->brs = block_replication_new(bs, &replication_ops);
> +
> +    ret = 0;
> +
> +fail:
> +    qemu_opts_del(opts);
> +    /* propagate error */
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +    }
> +    return ret;
> +}
> +
> +static void replication_close(BlockDriverState *bs)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +
> +    if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
> +        replication_stop(s->brs, false, NULL);
> +        block_replication_remove(s->brs);
> +    }
> +}
> +
> +static int64_t replication_getlength(BlockDriverState *bs)
> +{
> +    return bdrv_getlength(bs->file->bs);
> +}
> +
> +static int replication_get_io_status(BDRVReplicationState *s)
> +{
> +    switch (s->replication_state) {
> +    case BLOCK_REPLICATION_NONE:
> +        return -EIO;
> +    case BLOCK_REPLICATION_RUNNING:
> +        return 0;
> +    case BLOCK_REPLICATION_FAILOVER:
> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 0;
> +    case BLOCK_REPLICATION_FAILOVER_FAILED:
> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1;
> +    case BLOCK_REPLICATION_DONE:
> +        /*
> +         * active commit job completes, and active disk and secondary_disk
> +         * is swapped, so we can operate bs->file directly
> +         */
> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 0;
> +    default:
> +        abort();
> +    }
> +}
> +
> +static int replication_return_value(BDRVReplicationState *s, int ret)
> +{
> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
> +        return ret;
> +    }
> +
> +    if (ret < 0) {
> +        s->error = ret;
> +        ret = 0;
> +    }
> +
> +    return ret;
> +}
> +
> +static coroutine_fn int replication_co_readv(BlockDriverState *bs,
> +                                             int64_t sector_num,
> +                                             int remaining_sectors,
> +                                             QEMUIOVector *qiov)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +    int ret;
> +
> +    if (s->mode == REPLICATION_MODE_PRIMARY) {
> +        /* We only use it to forward primary write requests */
> +        return -EIO;
> +    }
> +
> +    ret = replication_get_io_status(s);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = bdrv_co_readv(bs->file->bs, sector_num, remaining_sectors, qiov);
> +    return replication_return_value(s, ret);
> +}
> +
> +static coroutine_fn int replication_co_writev(BlockDriverState *bs,
> +                                              int64_t sector_num,
> +                                              int remaining_sectors,
> +                                              QEMUIOVector *qiov)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +    QEMUIOVector hd_qiov;
> +    uint64_t bytes_done = 0;
> +    BlockDriverState *top = bs->file->bs;
> +    BlockDriverState *base = s->secondary_disk;
> +    BlockDriverState *target;
> +    int ret, n;
> +
> +    ret = replication_get_io_status(s);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (ret == 0) {
> +        ret = bdrv_co_writev(top, sector_num,
> +                             remaining_sectors, qiov);
> +        return replication_return_value(s, ret);
> +    }
> +
> +    /*
> +     * Failover failed, only write to active disk if the sectors
> +     * have already been allocated in active disk/hidden disk.
> +     */
> +    qemu_iovec_init(&hd_qiov, qiov->niov);
> +    while (remaining_sectors > 0) {
> +        ret = bdrv_is_allocated_above(top, base, sector_num,
> +                                      remaining_sectors, &n);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        qemu_iovec_reset(&hd_qiov);
> +        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * BDRV_SECTOR_SIZE);
> +
> +        target = ret ? top : base;
> +        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        remaining_sectors -= n;
> +        sector_num += n;
> +        bytes_done += n * BDRV_SECTOR_SIZE;
> +    }
> +
> +    return 0;
> +}
> +
> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
> +                                               int64_t sector_num,
> +                                               int nb_sectors)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +    int ret;
> +
> +    ret = replication_get_io_status(s);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (ret == 1) {
> +        /* It is secondary qemu and failover are running*/
> +        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    ret = bdrv_co_discard(bs->file->bs, sector_num, nb_sectors);
> +    return replication_return_value(s, ret);
> +}
> +
> +static bool replication_recurse_is_first_non_filter(BlockDriverState *bs,
> +                                                    BlockDriverState *candidate)
> +{
> +    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
> +}
> +
> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
> +{
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    if (!s->secondary_disk->job) {
> +        error_setg(errp, "Backup job is cancelled unexpectedly");
> +        return;
> +    }
> +
> +    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);

What happens to in-flight requests to the active and hidden disks?

> +    if (ret < 0) {
> +        error_setg(errp, "Cannot make active disk empty");
> +        return;
> +    }
> +
> +    ret = s->hidden_disk->drv->bdrv_make_empty(s->hidden_disk);
> +    if (ret < 0) {
> +        error_setg(errp, "Cannot make hidden disk empty");
> +        return;
> +    }
> +}
> +
> +static void reopen_backing_file(BDRVReplicationState *s, bool writable,
> +                                Error **errp)
> +{
> +    BlockReopenQueue *reopen_queue = NULL;
> +    int orig_hidden_flags, orig_secondary_flags;
> +    int new_hidden_flags, new_secondary_flags;
> +    Error *local_err = NULL;
> +
> +    if (writable) {
> +        orig_hidden_flags = bdrv_get_flags(s->hidden_disk);
> +        new_hidden_flags = orig_hidden_flags | BDRV_O_RDWR;
> +        orig_secondary_flags = bdrv_get_flags(s->secondary_disk);
> +        new_secondary_flags = orig_secondary_flags | BDRV_O_RDWR;
> +    } else {
> +        orig_hidden_flags = s->orig_hidden_flags | BDRV_O_RDWR;
> +        new_hidden_flags = s->orig_hidden_flags;
> +        orig_secondary_flags = s->orig_secondary_flags | BDRV_O_RDWR;
> +        new_secondary_flags = s->orig_secondary_flags;
> +    }
> +
> +    if (orig_hidden_flags != new_hidden_flags) {
> +        reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk, NULL,
> +                                         new_hidden_flags);
> +    }
> +
> +    if (!(orig_secondary_flags & BDRV_O_RDWR)) {
> +        reopen_queue = bdrv_reopen_queue(reopen_queue, s->secondary_disk, NULL,
> +                                         new_secondary_flags);
> +    }
> +
> +    if (reopen_queue) {
> +        bdrv_reopen_multiple(reopen_queue, &local_err);
> +        if (local_err != NULL) {
> +            error_propagate(errp, local_err);
> +        }
> +    }
> +}
> +
> +static void backup_job_cleanup(BDRVReplicationState *s)
> +{
> +    bdrv_op_unblock_all(s->top_bs, s->blocker);
> +    error_free(s->blocker);
> +    reopen_backing_file(s, false, NULL);
> +}
> +
> +static void backup_job_completed(void *opaque, int ret)
> +{
> +    BDRVReplicationState *s = opaque;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_FAILOVER) {
> +        /* The backup job is cancelled unexpectedly */
> +        s->error = -EIO;
> +    }
> +
> +    backup_job_cleanup(s);
> +}
> +
> +static BlockDriverState *get_top_bs(BlockDriverState *bs)
> +{
> +    BdrvChild *child;
> +
> +    while (!bs->blk) {
> +        if (QLIST_EMPTY(&bs->parents)) {
> +            return NULL;
> +        }
> +
> +        child = QLIST_FIRST(&bs->parents);
> +        if (QLIST_NEXT(child, next_parent)) {
> +            return NULL;
> +        }
> +
> +        bs = child->parent;
> +    }
> +
> +    return bs;
> +}
> +
> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
> +                              Error **errp)
> +{
> +    BlockDriverState *bs = brs->bs;
> +    BDRVReplicationState *s = brs->bs->opaque;
> +    int64_t active_length, hidden_length, disk_length;
> +    AioContext *aio_context;
> +    Error *local_err = NULL;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_NONE) {
> +        error_setg(errp, "Block replication is running or done");
> +        return;
> +    }
> +
> +    if (s->mode != mode) {
> +        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
> +                   " but receives %d", s->mode, mode);
> +        return;
> +    }
> +
> +    switch (s->mode) {
> +    case REPLICATION_MODE_PRIMARY:
> +        break;
> +    case REPLICATION_MODE_SECONDARY:
> +        s->active_disk = bs->file->bs;
> +        if (!bs->file->bs->backing) {
> +            error_setg(errp, "Active disk doesn't have backing file");
> +            return;
> +        }
> +
> +        s->hidden_disk = s->active_disk->backing->bs;
> +        if (!s->hidden_disk->backing) {
> +            error_setg(errp, "Hidden disk doesn't have backing file");
> +            return;
> +        }
> +
> +        s->secondary_disk = s->hidden_disk->backing->bs;
> +        if (!s->secondary_disk->blk) {
> +            error_setg(errp, "The secondary disk doesn't have block backend");
> +            return;
> +        }

Kevin: Is code allowed to stash away BlockDriverState pointers for
convenience or should it keep the BdrvChild pointers instead?  In order
for replication to work as expected, the graph shouldn't change but for
consistency maybe BdrvChild is best.

> +
> +        s->top_bs = get_top_bs(bs);
> +        if (!s->top_bs) {
> +            error_setg(errp, "Cannot get the top block driver state to do"
> +                       " internal backup");
> +            return;
> +        }
> +
> +        /* verify the length */
> +        active_length = bdrv_getlength(s->active_disk);
> +        hidden_length = bdrv_getlength(s->hidden_disk);
> +        disk_length = bdrv_getlength(s->secondary_disk);
> +        if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
> +            active_length != hidden_length || hidden_length != disk_length) {
> +            error_setg(errp, "active disk, hidden disk, secondary disk's length"
> +                       " are not the same");
> +            return;
> +        }
> +
> +        if (!s->active_disk->drv->bdrv_make_empty ||
> +            !s->hidden_disk->drv->bdrv_make_empty) {
> +            error_setg(errp,
> +                       "active disk or hidden disk doesn't support make_empty");
> +            return;
> +        }
> +
> +        /* reopen the backing file in r/w mode */
> +        reopen_backing_file(s, true, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return;
> +        }
> +
> +        /* start backup job now */
> +        error_setg(&s->blocker,
> +                   "block device is in use by internal backup job");
> +        bdrv_op_block_all(s->top_bs, s->blocker);
> +        bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
> +        bdrv_ref(s->hidden_disk);

Why is the explicit reference to hidden_disk (but not secondary_disk or
active_disk) is necessary?

> +
> +        aio_context = bdrv_get_aio_context(bs);
> +        aio_context_acquire(aio_context);
> +        backup_start(s->secondary_disk, s->hidden_disk, 0,
> +                     MIRROR_SYNC_MODE_NONE, NULL, BLOCKDEV_ON_ERROR_REPORT,
> +                     BLOCKDEV_ON_ERROR_REPORT, backup_job_completed,
> +                     s, NULL, &local_err);
> +        aio_context_release(aio_context);

This is the only place that uses AioContext to protect against other
threads accessing BlockDriverState at the same time.  Any operation like
replication_start/stop/checkpoint where you are not sure if the calling
thread is the same thread that holds the AioContext must use
aio_context_acquire/release() to protect against race conditions.

For example, replication_do_checkpoint() might be called from a thread
that doesn't hold AioContext so it must acquire brs->bs AioContext
before accessing BDRVReplicationState.

> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            backup_job_cleanup(s);
> +            bdrv_unref(s->hidden_disk);
> +            return;
> +        }
> +        break;
> +    default:
> +        abort();
> +    }
> +
> +    s->replication_state = BLOCK_REPLICATION_RUNNING;
> +
> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
> +        secondary_do_checkpoint(s, errp);
> +    }
> +
> +    s->error = 0;
> +}
> +
> +
> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp)
> +{
> +    BDRVReplicationState *s = brs->bs->opaque;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
> +        error_setg(errp, "Block replication is not running");
> +        return;
> +    }
> +
> +    if (s->error) {
> +        error_setg(errp, "I/O error occurs");
> +        return;
> +    }
> +
> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
> +        secondary_do_checkpoint(s, errp);
> +    }
> +}
> +
> +static void replication_done(void *opaque, int ret)
> +{
> +    BlockDriverState *bs = opaque;
> +    BDRVReplicationState *s = bs->opaque;
> +
> +    if (ret == 0) {
> +        s->replication_state = BLOCK_REPLICATION_DONE;
> +
> +        /* refresh top bs's filename */
> +        bdrv_refresh_filename(bs);
> +        s->active_disk = NULL;
> +        s->secondary_disk = NULL;
> +        s->hidden_disk = NULL;
> +        s->error = 0;
> +    } else {
> +        s->replication_state = BLOCK_REPLICATION_FAILOVER_FAILED;
> +        s->error = -EIO;
> +    }
> +}
> +
> +static void replication_stop(BlockReplicationState *brs, bool failover, Error **errp)
> +{
> +    BlockDriverState *bs = brs->bs;
> +    BDRVReplicationState *s = brs->bs->opaque;
> +
> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
> +        error_setg(errp, "Block replication is not running");
> +        return;
> +    }
> +
> +    switch (s->mode) {
> +    case REPLICATION_MODE_PRIMARY:
> +        s->replication_state = BLOCK_REPLICATION_DONE;
> +        s->error = 0;
> +        break;
> +    case REPLICATION_MODE_SECONDARY:
> +        if (!failover) {
> +            /*
> +             * This BDS will be closed, and the job should be completed
> +             * before the BDS is closed, because we will access hidden
> +             * disk, secondary disk in backup_job_completed().
> +             */
> +            if (s->secondary_disk->job) {
> +                block_job_cancel_sync(s->secondary_disk->job);
> +            }
> +            secondary_do_checkpoint(s, errp);
> +            s->replication_state = BLOCK_REPLICATION_DONE;
> +            return;
> +        }
> +
> +        s->replication_state = BLOCK_REPLICATION_FAILOVER;
> +        if (s->secondary_disk->job) {
> +            block_job_cancel(s->secondary_disk->job);
> +        }
> +
> +        commit_active_start(s->active_disk, s->secondary_disk, 0,
> +                            BLOCKDEV_ON_ERROR_REPORT, replication_done,
> +                            bs, errp, true);
> +        break;
> +    default:
> +        abort();
> +    }
> +}
> +
> +BlockDriver bdrv_replication = {
> +    .format_name                = "replication",
> +    .protocol_name              = "replication",
> +    .instance_size              = sizeof(BDRVReplicationState),
> +
> +    .bdrv_open                  = replication_open,
> +    .bdrv_close                 = replication_close,
> +
> +    .bdrv_getlength             = replication_getlength,
> +    .bdrv_co_readv              = replication_co_readv,
> +    .bdrv_co_writev             = replication_co_writev,
> +    .bdrv_co_discard            = replication_co_discard,
> +
> +    .is_filter                  = true,
> +    .bdrv_recurse_is_first_non_filter = replication_recurse_is_first_non_filter,
> +
> +    .has_variable_length        = true,
> +};
> +
> +static void bdrv_replication_init(void)
> +{
> +    bdrv_register(&bdrv_replication);
> +}
> +
> +block_init(bdrv_replication_init);
> diff --git a/include/block/replication-comm.h b/include/block/replication-comm.h
> new file mode 100644
> index 0000000..6483e56
> --- /dev/null
> +++ b/include/block/replication-comm.h
> @@ -0,0 +1,50 @@
> +/*
> + * Replication Block filter
> + *
> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2015 Intel Corporation
> + * Copyright (c) 2015 FUJITSU LIMITED
> + *
> + * Author:
> + *   Wen Congyang <wency@cn.fujitsu.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef REPLICATION_H
> +#define REPLICATION_H
> +
> +#include "qemu/queue.h"
> +#include "block/block_int.h"
> +
> +typedef struct BlockReplicationOps BlockReplicationOps;
> +typedef struct BlockReplicationState BlockReplicationState;
> +typedef void (*Start)(BlockReplicationState *brs, ReplicationMode mode, Error **errp);
> +typedef void (*Stop)(BlockReplicationState *brs, bool failover, Error **errp);
> +typedef void (*Checkpoint)(BlockReplicationState *brs, Error **errp);
> +
> +struct BlockReplicationState {
> +    BlockDriverState *bs;
> +    BlockReplicationOps *ops;
> +    QLIST_ENTRY(BlockReplicationState) node;
> +};
> +
> +struct BlockReplicationOps{
> +    Start start;
> +    Stop stop;
> +    Checkpoint checkpoint;
> +};
> +
> +BlockReplicationState *block_replication_new(BlockDriverState *bs,
> +                                             BlockReplicationOps *ops);
> +
> +void block_replication_remove(BlockReplicationState *brs);
> +
> +void block_replication_start_all(ReplicationMode mode, Error **errp);
> +
> +void block_replication_do_checkpoint_all(Error **errp);
> +
> +void block_replication_stop_all(bool failover, Error **errp);
> +
> +#endif /* REPLICATION_H */
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 2e86ede..d73f46b 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1975,6 +1975,19 @@
>              '*read-pattern': 'QuorumReadPattern' } }
>  
>  ##
> +# @ReplicationMode
> +#
> +# An enumeration of replication modes.
> +#
> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
> +#
> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
> +#
> +# Since: 2.6
> +##
> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
> +
> +##
>  # @BlockdevOptions
>  #
>  # Options for creating a block device.
> -- 
> 1.9.3
> 
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint Changlong Xie
@ 2016-01-27 16:05   ` Stefan Hajnoczi
  2016-01-28  2:22     ` Changlong Xie
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2016-01-27 16:05 UTC (permalink / raw)
  To: Changlong Xie
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, fnstml-hwcolo, qemu devel,
	Max Reitz, Gonglei, Stefan Hajnoczi, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1734 bytes --]

On Wed, Jan 13, 2016 at 05:18:27PM +0800, Changlong Xie wrote:
> diff --git a/blockjob.c b/blockjob.c
> index 80adb9d..0c8edfe 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
>      QLIST_INSERT_HEAD(&txn->jobs, job, txn_list);
>      block_job_txn_ref(txn);
>  }
> +
> +void block_job_do_checkpoint(BlockJob *job, Error **errp)
> +{
> +    if (!job->driver->do_checkpoint) {
> +        error_setg(errp, "The job %s doesn't support block checkpoint",
> +                   BlockJobType_lookup[job->driver->job_type]);
> +        return;
> +    }
> +
> +    job->driver->do_checkpoint(job, errp);
> +}
> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
> index d84ccd8..abdba7c 100644
> --- a/include/block/blockjob.h
> +++ b/include/block/blockjob.h
> @@ -70,6 +70,9 @@ typedef struct BlockJobDriver {
>       * never both.
>       */
>      void (*abort)(BlockJob *job);
> +
> +    /** Optional callback for job types that support checkpoint. */
> +    void (*do_checkpoint)(BlockJob *job, Error **errp);

The COLO/replication-specific callbacks have been moved out of
BlockDriver into their own replication struct.  Similar reasoning
applies to BlockJobDriver:

The do_checkpoint() callback is only implemented by one type of job and
its purpose is related to COLO rather than jobs.  This is a strong
indication that this shouldn't be part of the generic BlockJobDriver
struct.

Please drop changes to the generic blockjob interface.  Instead, make
backup_do_checkpoint() public and add assert(job->driver->type ==
BLOCK_JOB_TYPE_BACKUP) into the function.

Then the replication filter can call backup_do_checkpoint() directly.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-01-27 14:46   ` Stefan Hajnoczi
@ 2016-01-28  1:13     ` Wen Congyang
  2016-01-28 15:15       ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Wen Congyang @ 2016-01-28  1:13 UTC (permalink / raw)
  To: Stefan Hajnoczi, Changlong Xie
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, fnstml-hwcolo, qemu devel,
	Max Reitz, Gonglei, Paolo Bonzini

On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
>> ---
>>  block/Makefile.objs              |   1 +
>>  block/replication-comm.c         |  66 +++++
>>  block/replication.c              | 590 +++++++++++++++++++++++++++++++++++++++
>>  include/block/replication-comm.h |  50 ++++
>>  qapi/block-core.json             |  13 +
>>  5 files changed, 720 insertions(+)
>>  create mode 100644 block/replication-comm.c
>>  create mode 100644 block/replication.c
>>  create mode 100644 include/block/replication-comm.h
>>
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index fa05f37..7037662 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
>>  block-obj-y += accounting.o
>>  block-obj-y += write-threshold.o
>>  block-obj-y += backup.o
>> +block-obj-y += replication-comm.o replication.o
>>  
>>  common-obj-y += stream.o
>>  common-obj-y += commit.o
>> diff --git a/block/replication-comm.c b/block/replication-comm.c
>> new file mode 100644
>> index 0000000..8af748b
>> --- /dev/null
>> +++ b/block/replication-comm.c
>> @@ -0,0 +1,66 @@
>> +/*
>> + * Replication Block filter
> 
> Is the start/stop/checkpoint callback interface only useful for block
> replication?
> 
> This seems like a generic interface for registering with COLO.  Other
> components (networking, etc) might also need start/stop/checkpoint
> callbacks.  If that's the case then this code should be outside block/
> and the brs->bs field should either be void *opaque or removed (the
> caller needs to use container_of()).

Yes, we will do it in the next version.

> 
>> + *
>> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> + * Copyright (c) 2015 Intel Corporation
>> + * Copyright (c) 2015 FUJITSU LIMITED
>> + *
>> + * Author:
>> + *   Wen Congyang <wency@cn.fujitsu.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "block/replication-comm.h"
>> +
>> +static QLIST_HEAD(, BlockReplicationState) block_replication_states;
>> +
>> +BlockReplicationState *block_replication_new(BlockDriverState *bs,
>> +                                             BlockReplicationOps *ops)
>> +{
>> +    BlockReplicationState *brs;
>> +
>> +    brs = g_new0(BlockReplicationState, 1);
>> +    brs->bs = bs;
>> +    brs->ops = ops;
>> +    QLIST_INSERT_HEAD(&block_replication_states, brs, node);
>> +
>> +    return brs;
>> +}
>> +
>> +void block_replication_remove(BlockReplicationState *brs)
>> +{
>> +    QLIST_REMOVE(brs, node);
>> +    g_free(brs);
>> +}
>> +
>> +void block_replication_start_all(ReplicationMode mode, Error **errp)
>> +{
>> +    BlockReplicationState *brs, *next;
>> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
>> +        if (brs->ops && brs->ops->start) {
>> +            brs->ops->start(brs, mode, errp);
>> +        }
>> +    }
>> +}
>> +
>> +void block_replication_do_checkpoint_all(Error **errp)
>> +{
>> +    BlockReplicationState *brs, *next;
>> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
>> +        if (brs->ops && brs->ops->checkpoint) {
>> +            brs->ops->checkpoint(brs, errp);
>> +        }
>> +    }
>> +}
>> +
>> +void block_replication_stop_all(bool failover, Error **errp)
>> +{
>> +    BlockReplicationState *brs, *next;
>> +    QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
>> +        if (brs->ops && brs->ops->stop) {
>> +            brs->ops->stop(brs, failover, errp);
>> +        }
>> +    }
>> +}
>> diff --git a/block/replication.c b/block/replication.c
>> new file mode 100644
>> index 0000000..29c677a
>> --- /dev/null
>> +++ b/block/replication.c
>> @@ -0,0 +1,590 @@
>> +/*
>> + * Replication Block filter
>> + *
>> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> + * Copyright (c) 2015 Intel Corporation
>> + * Copyright (c) 2015 FUJITSU LIMITED
>> + *
>> + * Author:
>> + *   Wen Congyang <wency@cn.fujitsu.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu-common.h"
>> +#include "block/blockjob.h"
>> +#include "block/nbd.h"
>> +#include "block/replication-comm.h"
>> +
>> +typedef struct BDRVReplicationState {
>> +    ReplicationMode mode;
>> +    int replication_state;
>> +    BlockDriverState *active_disk;
>> +    BlockDriverState *hidden_disk;
>> +    BlockDriverState *secondary_disk;
>> +    BlockDriverState *top_bs;
>> +    BlockReplicationState *brs;
>> +    Error *blocker;
>> +    int orig_hidden_flags;
>> +    int orig_secondary_flags;
>> +    int error;
>> +} BDRVReplicationState;
>> +
>> +enum {
>> +    BLOCK_REPLICATION_NONE,             /* block replication is not started */
>> +    BLOCK_REPLICATION_RUNNING,          /* block replication is running */
>> +    BLOCK_REPLICATION_FAILOVER,         /* failover is running in background */
>> +    BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed*/
>> +    BLOCK_REPLICATION_DONE,             /* block replication is done(after failover) */
>> +};
>> +
>> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
>> +                              Error **errp);
>> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp);
>> +static void replication_stop(BlockReplicationState *brs, bool failover,
>> +                             Error **errp);
>> +
>> +#define REPLICATION_MODE        "mode"
>> +static QemuOptsList replication_runtime_opts = {
>> +    .name = "replication",
>> +    .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
>> +    .desc = {
>> +        {
>> +            .name = REPLICATION_MODE,
>> +            .type = QEMU_OPT_STRING,
>> +        },
>> +        { /* end of list */ }
>> +    },
>> +};
>> +
>> +static BlockReplicationOps replication_ops = {
>> +    .start = replication_start,
>> +    .checkpoint = replication_do_checkpoint,
>> +    .stop = replication_stop,
>> +};
>> +
>> +static int replication_open(BlockDriverState *bs, QDict *options,
>> +                            int flags, Error **errp)
>> +{
>> +    int ret;
>> +    BDRVReplicationState *s = bs->opaque;
>> +    Error *local_err = NULL;
>> +    QemuOpts *opts = NULL;
>> +    const char *mode;
>> +
>> +    ret = -EINVAL;
>> +    opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
>> +    qemu_opts_absorb_qdict(opts, options, &local_err);
>> +    if (local_err) {
>> +        goto fail;
>> +    }
>> +
>> +    mode = qemu_opt_get(opts, REPLICATION_MODE);
>> +    if (!mode) {
>> +        error_setg(&local_err, "Missing the option mode");
>> +        goto fail;
>> +    }
>> +
>> +    if (!strcmp(mode, "primary")) {
>> +        s->mode = REPLICATION_MODE_PRIMARY;
>> +    } else if (!strcmp(mode, "secondary")) {
>> +        s->mode = REPLICATION_MODE_SECONDARY;
>> +    } else {
>> +        error_setg(&local_err,
>> +                   "The option mode's value should be primary or secondary");
>> +        goto fail;
>> +    }
>> +
>> +    s->brs = block_replication_new(bs, &replication_ops);
>> +
>> +    ret = 0;
>> +
>> +fail:
>> +    qemu_opts_del(opts);
>> +    /* propagate error */
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +    }
>> +    return ret;
>> +}
>> +
>> +static void replication_close(BlockDriverState *bs)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +
>> +    if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
>> +        replication_stop(s->brs, false, NULL);
>> +        block_replication_remove(s->brs);
>> +    }
>> +}
>> +
>> +static int64_t replication_getlength(BlockDriverState *bs)
>> +{
>> +    return bdrv_getlength(bs->file->bs);
>> +}
>> +
>> +static int replication_get_io_status(BDRVReplicationState *s)
>> +{
>> +    switch (s->replication_state) {
>> +    case BLOCK_REPLICATION_NONE:
>> +        return -EIO;
>> +    case BLOCK_REPLICATION_RUNNING:
>> +        return 0;
>> +    case BLOCK_REPLICATION_FAILOVER:
>> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 0;
>> +    case BLOCK_REPLICATION_FAILOVER_FAILED:
>> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1;
>> +    case BLOCK_REPLICATION_DONE:
>> +        /*
>> +         * active commit job completes, and active disk and secondary_disk
>> +         * is swapped, so we can operate bs->file directly
>> +         */
>> +        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 0;
>> +    default:
>> +        abort();
>> +    }
>> +}
>> +
>> +static int replication_return_value(BDRVReplicationState *s, int ret)
>> +{
>> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
>> +        return ret;
>> +    }
>> +
>> +    if (ret < 0) {
>> +        s->error = ret;
>> +        ret = 0;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static coroutine_fn int replication_co_readv(BlockDriverState *bs,
>> +                                             int64_t sector_num,
>> +                                             int remaining_sectors,
>> +                                             QEMUIOVector *qiov)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +    int ret;
>> +
>> +    if (s->mode == REPLICATION_MODE_PRIMARY) {
>> +        /* We only use it to forward primary write requests */
>> +        return -EIO;
>> +    }
>> +
>> +    ret = replication_get_io_status(s);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    ret = bdrv_co_readv(bs->file->bs, sector_num, remaining_sectors, qiov);
>> +    return replication_return_value(s, ret);
>> +}
>> +
>> +static coroutine_fn int replication_co_writev(BlockDriverState *bs,
>> +                                              int64_t sector_num,
>> +                                              int remaining_sectors,
>> +                                              QEMUIOVector *qiov)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +    QEMUIOVector hd_qiov;
>> +    uint64_t bytes_done = 0;
>> +    BlockDriverState *top = bs->file->bs;
>> +    BlockDriverState *base = s->secondary_disk;
>> +    BlockDriverState *target;
>> +    int ret, n;
>> +
>> +    ret = replication_get_io_status(s);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    if (ret == 0) {
>> +        ret = bdrv_co_writev(top, sector_num,
>> +                             remaining_sectors, qiov);
>> +        return replication_return_value(s, ret);
>> +    }
>> +
>> +    /*
>> +     * Failover failed, only write to active disk if the sectors
>> +     * have already been allocated in active disk/hidden disk.
>> +     */
>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
>> +    while (remaining_sectors > 0) {
>> +        ret = bdrv_is_allocated_above(top, base, sector_num,
>> +                                      remaining_sectors, &n);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        qemu_iovec_reset(&hd_qiov);
>> +        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * BDRV_SECTOR_SIZE);
>> +
>> +        target = ret ? top : base;
>> +        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        remaining_sectors -= n;
>> +        sector_num += n;
>> +        bytes_done += n * BDRV_SECTOR_SIZE;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
>> +                                               int64_t sector_num,
>> +                                               int nb_sectors)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +    int ret;
>> +
>> +    ret = replication_get_io_status(s);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    if (ret == 1) {
>> +        /* It is secondary qemu and failover are running*/
>> +        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    ret = bdrv_co_discard(bs->file->bs, sector_num, nb_sectors);
>> +    return replication_return_value(s, ret);
>> +}
>> +
>> +static bool replication_recurse_is_first_non_filter(BlockDriverState *bs,
>> +                                                    BlockDriverState *candidate)
>> +{
>> +    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
>> +}
>> +
>> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
>> +{
>> +    Error *local_err = NULL;
>> +    int ret;
>> +
>> +    if (!s->secondary_disk->job) {
>> +        error_setg(errp, "Backup job is cancelled unexpectedly");
>> +        return;
>> +    }
>> +
>> +    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);
> 
> What happens to in-flight requests to the active and hidden disks?

we MUST call do_checkpoint() when the vm is stopped.

> 
>> +    if (ret < 0) {
>> +        error_setg(errp, "Cannot make active disk empty");
>> +        return;
>> +    }
>> +
>> +    ret = s->hidden_disk->drv->bdrv_make_empty(s->hidden_disk);
>> +    if (ret < 0) {
>> +        error_setg(errp, "Cannot make hidden disk empty");
>> +        return;
>> +    }
>> +}
>> +
>> +static void reopen_backing_file(BDRVReplicationState *s, bool writable,
>> +                                Error **errp)
>> +{
>> +    BlockReopenQueue *reopen_queue = NULL;
>> +    int orig_hidden_flags, orig_secondary_flags;
>> +    int new_hidden_flags, new_secondary_flags;
>> +    Error *local_err = NULL;
>> +
>> +    if (writable) {
>> +        orig_hidden_flags = bdrv_get_flags(s->hidden_disk);
>> +        new_hidden_flags = orig_hidden_flags | BDRV_O_RDWR;
>> +        orig_secondary_flags = bdrv_get_flags(s->secondary_disk);
>> +        new_secondary_flags = orig_secondary_flags | BDRV_O_RDWR;
>> +    } else {
>> +        orig_hidden_flags = s->orig_hidden_flags | BDRV_O_RDWR;
>> +        new_hidden_flags = s->orig_hidden_flags;
>> +        orig_secondary_flags = s->orig_secondary_flags | BDRV_O_RDWR;
>> +        new_secondary_flags = s->orig_secondary_flags;
>> +    }
>> +
>> +    if (orig_hidden_flags != new_hidden_flags) {
>> +        reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk, NULL,
>> +                                         new_hidden_flags);
>> +    }
>> +
>> +    if (!(orig_secondary_flags & BDRV_O_RDWR)) {
>> +        reopen_queue = bdrv_reopen_queue(reopen_queue, s->secondary_disk, NULL,
>> +                                         new_secondary_flags);
>> +    }
>> +
>> +    if (reopen_queue) {
>> +        bdrv_reopen_multiple(reopen_queue, &local_err);
>> +        if (local_err != NULL) {
>> +            error_propagate(errp, local_err);
>> +        }
>> +    }
>> +}
>> +
>> +static void backup_job_cleanup(BDRVReplicationState *s)
>> +{
>> +    bdrv_op_unblock_all(s->top_bs, s->blocker);
>> +    error_free(s->blocker);
>> +    reopen_backing_file(s, false, NULL);
>> +}
>> +
>> +static void backup_job_completed(void *opaque, int ret)
>> +{
>> +    BDRVReplicationState *s = opaque;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_FAILOVER) {
>> +        /* The backup job is cancelled unexpectedly */
>> +        s->error = -EIO;
>> +    }
>> +
>> +    backup_job_cleanup(s);
>> +}
>> +
>> +static BlockDriverState *get_top_bs(BlockDriverState *bs)
>> +{
>> +    BdrvChild *child;
>> +
>> +    while (!bs->blk) {
>> +        if (QLIST_EMPTY(&bs->parents)) {
>> +            return NULL;
>> +        }
>> +
>> +        child = QLIST_FIRST(&bs->parents);
>> +        if (QLIST_NEXT(child, next_parent)) {
>> +            return NULL;
>> +        }
>> +
>> +        bs = child->parent;
>> +    }
>> +
>> +    return bs;
>> +}
>> +
>> +static void replication_start(BlockReplicationState *brs, ReplicationMode mode,
>> +                              Error **errp)
>> +{
>> +    BlockDriverState *bs = brs->bs;
>> +    BDRVReplicationState *s = brs->bs->opaque;
>> +    int64_t active_length, hidden_length, disk_length;
>> +    AioContext *aio_context;
>> +    Error *local_err = NULL;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_NONE) {
>> +        error_setg(errp, "Block replication is running or done");
>> +        return;
>> +    }
>> +
>> +    if (s->mode != mode) {
>> +        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
>> +                   " but receives %d", s->mode, mode);
>> +        return;
>> +    }
>> +
>> +    switch (s->mode) {
>> +    case REPLICATION_MODE_PRIMARY:
>> +        break;
>> +    case REPLICATION_MODE_SECONDARY:
>> +        s->active_disk = bs->file->bs;
>> +        if (!bs->file->bs->backing) {
>> +            error_setg(errp, "Active disk doesn't have backing file");
>> +            return;
>> +        }
>> +
>> +        s->hidden_disk = s->active_disk->backing->bs;
>> +        if (!s->hidden_disk->backing) {
>> +            error_setg(errp, "Hidden disk doesn't have backing file");
>> +            return;
>> +        }
>> +
>> +        s->secondary_disk = s->hidden_disk->backing->bs;
>> +        if (!s->secondary_disk->blk) {
>> +            error_setg(errp, "The secondary disk doesn't have block backend");
>> +            return;
>> +        }
> 
> Kevin: Is code allowed to stash away BlockDriverState pointers for
> convenience or should it keep the BdrvChild pointers instead?  In order
> for replication to work as expected, the graph shouldn't change but for
> consistency maybe BdrvChild is best.
> 
>> +
>> +        s->top_bs = get_top_bs(bs);
>> +        if (!s->top_bs) {
>> +            error_setg(errp, "Cannot get the top block driver state to do"
>> +                       " internal backup");
>> +            return;
>> +        }
>> +
>> +        /* verify the length */
>> +        active_length = bdrv_getlength(s->active_disk);
>> +        hidden_length = bdrv_getlength(s->hidden_disk);
>> +        disk_length = bdrv_getlength(s->secondary_disk);
>> +        if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
>> +            active_length != hidden_length || hidden_length != disk_length) {
>> +            error_setg(errp, "active disk, hidden disk, secondary disk's length"
>> +                       " are not the same");
>> +            return;
>> +        }
>> +
>> +        if (!s->active_disk->drv->bdrv_make_empty ||
>> +            !s->hidden_disk->drv->bdrv_make_empty) {
>> +            error_setg(errp,
>> +                       "active disk or hidden disk doesn't support make_empty");
>> +            return;
>> +        }
>> +
>> +        /* reopen the backing file in r/w mode */
>> +        reopen_backing_file(s, true, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return;
>> +        }
>> +
>> +        /* start backup job now */
>> +        error_setg(&s->blocker,
>> +                   "block device is in use by internal backup job");
>> +        bdrv_op_block_all(s->top_bs, s->blocker);
>> +        bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
>> +        bdrv_ref(s->hidden_disk);
> 
> Why is the explicit reference to hidden_disk (but not secondary_disk or
> active_disk) is necessary?

IIRC, we should reference the backup target before calling backup_start(),
and we will reference the backup source in backup_start().

> 
>> +
>> +        aio_context = bdrv_get_aio_context(bs);
>> +        aio_context_acquire(aio_context);
>> +        backup_start(s->secondary_disk, s->hidden_disk, 0,
>> +                     MIRROR_SYNC_MODE_NONE, NULL, BLOCKDEV_ON_ERROR_REPORT,
>> +                     BLOCKDEV_ON_ERROR_REPORT, backup_job_completed,
>> +                     s, NULL, &local_err);
>> +        aio_context_release(aio_context);
> 
> This is the only place that uses AioContext to protect against other
> threads accessing BlockDriverState at the same time.  Any operation like
> replication_start/stop/checkpoint where you are not sure if the calling
> thread is the same thread that holds the AioContext must use
> aio_context_acquire/release() to protect against race conditions.
> 
> For example, replication_do_checkpoint() might be called from a thread
> that doesn't hold AioContext so it must acquire brs->bs AioContext
> before accessing BDRVReplicationState.

OK, I will check it and fix it in the next version.

Thanks
Wen Congyang

> 
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            backup_job_cleanup(s);
>> +            bdrv_unref(s->hidden_disk);
>> +            return;
>> +        }
>> +        break;
>> +    default:
>> +        abort();
>> +    }
>> +
>> +    s->replication_state = BLOCK_REPLICATION_RUNNING;
>> +
>> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
>> +        secondary_do_checkpoint(s, errp);
>> +    }
>> +
>> +    s->error = 0;
>> +}
>> +
>> +
>> +static void replication_do_checkpoint(BlockReplicationState *brs, Error **errp)
>> +{
>> +    BDRVReplicationState *s = brs->bs->opaque;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
>> +        error_setg(errp, "Block replication is not running");
>> +        return;
>> +    }
>> +
>> +    if (s->error) {
>> +        error_setg(errp, "I/O error occurs");
>> +        return;
>> +    }
>> +
>> +    if (s->mode == REPLICATION_MODE_SECONDARY) {
>> +        secondary_do_checkpoint(s, errp);
>> +    }
>> +}
>> +
>> +static void replication_done(void *opaque, int ret)
>> +{
>> +    BlockDriverState *bs = opaque;
>> +    BDRVReplicationState *s = bs->opaque;
>> +
>> +    if (ret == 0) {
>> +        s->replication_state = BLOCK_REPLICATION_DONE;
>> +
>> +        /* refresh top bs's filename */
>> +        bdrv_refresh_filename(bs);
>> +        s->active_disk = NULL;
>> +        s->secondary_disk = NULL;
>> +        s->hidden_disk = NULL;
>> +        s->error = 0;
>> +    } else {
>> +        s->replication_state = BLOCK_REPLICATION_FAILOVER_FAILED;
>> +        s->error = -EIO;
>> +    }
>> +}
>> +
>> +static void replication_stop(BlockReplicationState *brs, bool failover, Error **errp)
>> +{
>> +    BlockDriverState *bs = brs->bs;
>> +    BDRVReplicationState *s = brs->bs->opaque;
>> +
>> +    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
>> +        error_setg(errp, "Block replication is not running");
>> +        return;
>> +    }
>> +
>> +    switch (s->mode) {
>> +    case REPLICATION_MODE_PRIMARY:
>> +        s->replication_state = BLOCK_REPLICATION_DONE;
>> +        s->error = 0;
>> +        break;
>> +    case REPLICATION_MODE_SECONDARY:
>> +        if (!failover) {
>> +            /*
>> +             * This BDS will be closed, and the job should be completed
>> +             * before the BDS is closed, because we will access hidden
>> +             * disk, secondary disk in backup_job_completed().
>> +             */
>> +            if (s->secondary_disk->job) {
>> +                block_job_cancel_sync(s->secondary_disk->job);
>> +            }
>> +            secondary_do_checkpoint(s, errp);
>> +            s->replication_state = BLOCK_REPLICATION_DONE;
>> +            return;
>> +        }
>> +
>> +        s->replication_state = BLOCK_REPLICATION_FAILOVER;
>> +        if (s->secondary_disk->job) {
>> +            block_job_cancel(s->secondary_disk->job);
>> +        }
>> +
>> +        commit_active_start(s->active_disk, s->secondary_disk, 0,
>> +                            BLOCKDEV_ON_ERROR_REPORT, replication_done,
>> +                            bs, errp, true);
>> +        break;
>> +    default:
>> +        abort();
>> +    }
>> +}
>> +
>> +BlockDriver bdrv_replication = {
>> +    .format_name                = "replication",
>> +    .protocol_name              = "replication",
>> +    .instance_size              = sizeof(BDRVReplicationState),
>> +
>> +    .bdrv_open                  = replication_open,
>> +    .bdrv_close                 = replication_close,
>> +
>> +    .bdrv_getlength             = replication_getlength,
>> +    .bdrv_co_readv              = replication_co_readv,
>> +    .bdrv_co_writev             = replication_co_writev,
>> +    .bdrv_co_discard            = replication_co_discard,
>> +
>> +    .is_filter                  = true,
>> +    .bdrv_recurse_is_first_non_filter = replication_recurse_is_first_non_filter,
>> +
>> +    .has_variable_length        = true,
>> +};
>> +
>> +static void bdrv_replication_init(void)
>> +{
>> +    bdrv_register(&bdrv_replication);
>> +}
>> +
>> +block_init(bdrv_replication_init);
>> diff --git a/include/block/replication-comm.h b/include/block/replication-comm.h
>> new file mode 100644
>> index 0000000..6483e56
>> --- /dev/null
>> +++ b/include/block/replication-comm.h
>> @@ -0,0 +1,50 @@
>> +/*
>> + * Replication Block filter
>> + *
>> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> + * Copyright (c) 2015 Intel Corporation
>> + * Copyright (c) 2015 FUJITSU LIMITED
>> + *
>> + * Author:
>> + *   Wen Congyang <wency@cn.fujitsu.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef REPLICATION_H
>> +#define REPLICATION_H
>> +
>> +#include "qemu/queue.h"
>> +#include "block/block_int.h"
>> +
>> +typedef struct BlockReplicationOps BlockReplicationOps;
>> +typedef struct BlockReplicationState BlockReplicationState;
>> +typedef void (*Start)(BlockReplicationState *brs, ReplicationMode mode, Error **errp);
>> +typedef void (*Stop)(BlockReplicationState *brs, bool failover, Error **errp);
>> +typedef void (*Checkpoint)(BlockReplicationState *brs, Error **errp);
>> +
>> +struct BlockReplicationState {
>> +    BlockDriverState *bs;
>> +    BlockReplicationOps *ops;
>> +    QLIST_ENTRY(BlockReplicationState) node;
>> +};
>> +
>> +struct BlockReplicationOps{
>> +    Start start;
>> +    Stop stop;
>> +    Checkpoint checkpoint;
>> +};
>> +
>> +BlockReplicationState *block_replication_new(BlockDriverState *bs,
>> +                                             BlockReplicationOps *ops);
>> +
>> +void block_replication_remove(BlockReplicationState *brs);
>> +
>> +void block_replication_start_all(ReplicationMode mode, Error **errp);
>> +
>> +void block_replication_do_checkpoint_all(Error **errp);
>> +
>> +void block_replication_stop_all(bool failover, Error **errp);
>> +
>> +#endif /* REPLICATION_H */
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 2e86ede..d73f46b 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -1975,6 +1975,19 @@
>>              '*read-pattern': 'QuorumReadPattern' } }
>>  
>>  ##
>> +# @ReplicationMode
>> +#
>> +# An enumeration of replication modes.
>> +#
>> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
>> +#
>> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
>> +
>> +##
>>  # @BlockdevOptions
>>  #
>>  # Options for creating a block device.
>> -- 
>> 1.9.3
>>
>>
>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint
  2016-01-27 16:05   ` Stefan Hajnoczi
@ 2016-01-28  2:22     ` Changlong Xie
  0 siblings, 0 replies; 34+ messages in thread
From: Changlong Xie @ 2016-01-28  2:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, fnstml-hwcolo, qemu devel,
	Max Reitz, Gonglei, Stefan Hajnoczi, Paolo Bonzini

On 01/28/2016 12:05 AM, Stefan Hajnoczi wrote:
> On Wed, Jan 13, 2016 at 05:18:27PM +0800, Changlong Xie wrote:
>> diff --git a/blockjob.c b/blockjob.c
>> index 80adb9d..0c8edfe 100644
>> --- a/blockjob.c
>> +++ b/blockjob.c
>> @@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
>>       QLIST_INSERT_HEAD(&txn->jobs, job, txn_list);
>>       block_job_txn_ref(txn);
>>   }
>> +
>> +void block_job_do_checkpoint(BlockJob *job, Error **errp)
>> +{
>> +    if (!job->driver->do_checkpoint) {
>> +        error_setg(errp, "The job %s doesn't support block checkpoint",
>> +                   BlockJobType_lookup[job->driver->job_type]);
>> +        return;
>> +    }
>> +
>> +    job->driver->do_checkpoint(job, errp);
>> +}
>> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
>> index d84ccd8..abdba7c 100644
>> --- a/include/block/blockjob.h
>> +++ b/include/block/blockjob.h
>> @@ -70,6 +70,9 @@ typedef struct BlockJobDriver {
>>        * never both.
>>        */
>>       void (*abort)(BlockJob *job);
>> +
>> +    /** Optional callback for job types that support checkpoint. */
>> +    void (*do_checkpoint)(BlockJob *job, Error **errp);
>
> The COLO/replication-specific callbacks have been moved out of
> BlockDriver into their own replication struct.  Similar reasoning
> applies to BlockJobDriver:
>
> The do_checkpoint() callback is only implemented by one type of job and
> its purpose is related to COLO rather than jobs.  This is a strong
> indication that this shouldn't be part of the generic BlockJobDriver
> struct.
>
> Please drop changes to the generic blockjob interface.  Instead, make
> backup_do_checkpoint() public and add assert(job->driver->type ==
> BLOCK_JOB_TYPE_BACKUP) into the function.
>
> Then the replication filter can call backup_do_checkpoint() directly.
>

Will fix it in next version.

Thanks
	-Xie

> Stefan
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS
  2016-01-27 14:04   ` Stefan Hajnoczi
@ 2016-01-28  2:22     ` Changlong Xie
  0 siblings, 0 replies; 34+ messages in thread
From: Changlong Xie @ 2016-01-28  2:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, fnstml-hwcolo, qemu devel,
	Max Reitz, Gonglei, Paolo Bonzini

On 01/27/2016 10:04 PM, Stefan Hajnoczi wrote:
> On Wed, Jan 13, 2016 at 05:18:28PM +0800, Changlong Xie wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> When opening BDS, we need to create backup jobs for
>> image-fleecing.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> Reviewed-by: Jeff Cody <jcody@redhat.com>
>> ---
>>   block/Makefile.objs | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index 58ef2ef..fa05f37 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
>>   block-obj-$(CONFIG_LIBSSH2) += ssh.o
>>   block-obj-y += accounting.o
>>   block-obj-y += write-threshold.o
>> +block-obj-y += backup.o
>>
>>   common-obj-y += stream.o
>>   common-obj-y += commit.o
>> -common-obj-y += backup.o
>>
>>   iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
>>   iscsi.o-libs       := $(LIBISCSI_LIBS)
>
> The commit message and description seem outdated.  I guess the purpose
> of this patch is to link the backup block job into all programs that use
> the block layer because you want to add a dependency on the it from core
> code.
>

Will update it in next version.

Thanks
	-Xie

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-01-28  1:13     ` Wen Congyang
@ 2016-01-28 15:15       ` Stefan Hajnoczi
  2016-01-29  3:13         ` Changlong Xie
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2016-01-28 15:15 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Changlong Xie, Fam Zheng, zhanghailiang,
	fnstml-hwcolo, qemu devel, Max Reitz, Gonglei, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3438 bytes --]

On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> > On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> >> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
> >> +{
> >> +    Error *local_err = NULL;
> >> +    int ret;
> >> +
> >> +    if (!s->secondary_disk->job) {
> >> +        error_setg(errp, "Backup job is cancelled unexpectedly");
> >> +        return;
> >> +    }
> >> +
> >> +    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return;
> >> +    }
> >> +
> >> +    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);
> > 
> > What happens to in-flight requests to the active and hidden disks?
> 
> we MUST call do_checkpoint() when the vm is stopped.

Please document the environment under which the block replication
callback functions run.

I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
time if the disk is slow/failing.  bdrv_drain_all() blocks until all
in-flight I/O requests have completed.  What does the Primary do if the
Secondary becomes unresponsive?

> >> +    switch (s->mode) {
> >> +    case REPLICATION_MODE_PRIMARY:
> >> +        break;
> >> +    case REPLICATION_MODE_SECONDARY:
> >> +        s->active_disk = bs->file->bs;
> >> +        if (!bs->file->bs->backing) {
> >> +            error_setg(errp, "Active disk doesn't have backing file");
> >> +            return;
> >> +        }
> >> +
> >> +        s->hidden_disk = s->active_disk->backing->bs;
> >> +        if (!s->hidden_disk->backing) {
> >> +            error_setg(errp, "Hidden disk doesn't have backing file");
> >> +            return;
> >> +        }
> >> +
> >> +        s->secondary_disk = s->hidden_disk->backing->bs;
> >> +        if (!s->secondary_disk->blk) {
> >> +            error_setg(errp, "The secondary disk doesn't have block backend");
> >> +            return;
> >> +        }
> > 
> > Kevin: Is code allowed to stash away BlockDriverState pointers for
> > convenience or should it keep the BdrvChild pointers instead?  In order
> > for replication to work as expected, the graph shouldn't change but for
> > consistency maybe BdrvChild is best.

I asked Kevin about this on IRC and he agreed that BdrvChild should be
used instead of holding on to BlockDriverState * pointers.  Although
these pointers will not change during replication (if the op blockers
are set up correctly), it's more consistent and certainly safer to go
through BdrvChild.

> >> +        /* start backup job now */
> >> +        error_setg(&s->blocker,
> >> +                   "block device is in use by internal backup job");
> >> +        bdrv_op_block_all(s->top_bs, s->blocker);
> >> +        bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
> >> +        bdrv_ref(s->hidden_disk);
> > 
> > Why is the explicit reference to hidden_disk (but not secondary_disk or
> > active_disk) is necessary?
> 
> IIRC, we should reference the backup target before calling backup_start(),
> and we will reference the backup source in backup_start().

I'm not sure why this is necessary since they are part of the backing
chain.

If it is necessary, please add a comment so it's clear why the reference
is being taken.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-01-28 15:15       ` Stefan Hajnoczi
@ 2016-01-29  3:13         ` Changlong Xie
  2016-01-29 15:46           ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Changlong Xie @ 2016-01-29  3:13 UTC (permalink / raw)
  To: Stefan Hajnoczi, Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, fnstml-hwcolo, qemu devel,
	Max Reitz, Gonglei, Paolo Bonzini

On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>> +static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
>>>> +{
>>>> +    Error *local_err = NULL;
>>>> +    int ret;
>>>> +
>>>> +    if (!s->secondary_disk->job) {
>>>> +        error_setg(errp, "Backup job is cancelled unexpectedly");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
>>>> +    if (local_err) {
>>>> +        error_propagate(errp, local_err);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);
>>>
>>> What happens to in-flight requests to the active and hidden disks?
>>
>> we MUST call do_checkpoint() when the vm is stopped.
>
> Please document the environment under which the block replication
> callback functions run.

OK

>
> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
> in-flight I/O requests have completed.  What does the Primary do if the
> Secondary becomes unresponsive?

Actually, we knew this problem. But currently, there seems no better way 
to resolve it. If you have any ideas?

>
>>>> +    switch (s->mode) {
>>>> +    case REPLICATION_MODE_PRIMARY:
>>>> +        break;
>>>> +    case REPLICATION_MODE_SECONDARY:
>>>> +        s->active_disk = bs->file->bs;
>>>> +        if (!bs->file->bs->backing) {
>>>> +            error_setg(errp, "Active disk doesn't have backing file");
>>>> +            return;
>>>> +        }
>>>> +
>>>> +        s->hidden_disk = s->active_disk->backing->bs;
>>>> +        if (!s->hidden_disk->backing) {
>>>> +            error_setg(errp, "Hidden disk doesn't have backing file");
>>>> +            return;
>>>> +        }
>>>> +
>>>> +        s->secondary_disk = s->hidden_disk->backing->bs;
>>>> +        if (!s->secondary_disk->blk) {
>>>> +            error_setg(errp, "The secondary disk doesn't have block backend");
>>>> +            return;
>>>> +        }
>>>
>>> Kevin: Is code allowed to stash away BlockDriverState pointers for
>>> convenience or should it keep the BdrvChild pointers instead?  In order
>>> for replication to work as expected, the graph shouldn't change but for
>>> consistency maybe BdrvChild is best.
>
> I asked Kevin about this on IRC and he agreed that BdrvChild should be
> used instead of holding on to BlockDriverState * pointers.  Although
> these pointers will not change during replication (if the op blockers
> are set up correctly), it's more consistent and certainly safer to go
> through BdrvChild.

Ok

>
>>>> +        /* start backup job now */
>>>> +        error_setg(&s->blocker,
>>>> +                   "block device is in use by internal backup job");
>>>> +        bdrv_op_block_all(s->top_bs, s->blocker);
>>>> +        bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
>>>> +        bdrv_ref(s->hidden_disk);
>>>
>>> Why is the explicit reference to hidden_disk (but not secondary_disk or
>>> active_disk) is necessary?
>>
>> IIRC, we should reference the backup target before calling backup_start(),
>> and we will reference the backup source in backup_start().
>
> I'm not sure why this is necessary since they are part of the backing
> chain.
>

Just as Wen said, we should reference the backup target before calling 
backup_start() to protect it from destroying, if backup job is stopped 
unexpectedly.

> If it is necessary, please add a comment so it's clear why the reference
> is being taken.
>

Ok

> Stefan
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-01-29  3:13         ` Changlong Xie
@ 2016-01-29 15:46           ` Stefan Hajnoczi
  2016-02-01  1:13             ` Wen Congyang
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2016-01-29 15:46 UTC (permalink / raw)
  To: Changlong Xie
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, fnstml-hwcolo, qemu devel,
	Max Reitz, Gonglei, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1180 bytes --]

On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
> >On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
> >>On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> >>>On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> >I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
> >time if the disk is slow/failing.  bdrv_drain_all() blocks until all
> >in-flight I/O requests have completed.  What does the Primary do if the
> >Secondary becomes unresponsive?
> 
> Actually, we knew this problem. But currently, there seems no better way to
> resolve it. If you have any ideas?

Is it possible to hold the checkpoint information and acknowledge the
checkpoint right away, without waiting for bdrv_drain_all() or any
Secondory guest activity to complete?

I think this really means falling back to microcheckpointing until the
Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
would prevent vcpus from running and when the last pending I/O finishes
the Secondary could apply the last checkpoint.  This approach does not
block QEMU (the monitor, etc).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-01-29 15:46           ` Stefan Hajnoczi
@ 2016-02-01  1:13             ` Wen Congyang
  2016-02-02 14:34               ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Wen Congyang @ 2016-02-01  1:13 UTC (permalink / raw)
  To: Stefan Hajnoczi, Changlong Xie
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, fnstml-hwcolo, qemu devel,
	Max Reitz, Gonglei, Paolo Bonzini

On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>> in-flight I/O requests have completed.  What does the Primary do if the
>>> Secondary becomes unresponsive?
>>
>> Actually, we knew this problem. But currently, there seems no better way to
>> resolve it. If you have any ideas?
> 
> Is it possible to hold the checkpoint information and acknowledge the
> checkpoint right away, without waiting for bdrv_drain_all() or any
> Secondory guest activity to complete?

There is no way to know that secondary becomes unreponsive.

> 
> I think this really means falling back to microcheckpointing until the
> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
> would prevent vcpus from running and when the last pending I/O finishes
> the Secondary could apply the last checkpoint.  This approach does not
> block QEMU (the monitor, etc).
> 

If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
We should do failover in this case.

Thanks
Wen Congyang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-02-01  1:13             ` Wen Congyang
@ 2016-02-02 14:34               ` Stefan Hajnoczi
  2016-02-03  1:29                 ` Wen Congyang
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2016-02-02 14:34 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Changlong Xie, Fam Zheng, zhanghailiang,
	fnstml-hwcolo, qemu devel, Max Reitz, Gonglei, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1968 bytes --]

On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
> > On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
> >> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
> >>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
> >>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> >>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> >>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
> >>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
> >>> in-flight I/O requests have completed.  What does the Primary do if the
> >>> Secondary becomes unresponsive?
> >>
> >> Actually, we knew this problem. But currently, there seems no better way to
> >> resolve it. If you have any ideas?
> > 
> > Is it possible to hold the checkpoint information and acknowledge the
> > checkpoint right away, without waiting for bdrv_drain_all() or any
> > Secondory guest activity to complete?
> 
> There is no way to know that secondary becomes unreponsive.

I meant whether it is necessary for the Secondary to vm_stop() and apply
the checkpoint before acknowledging the checkpoint to the Primary?

> > I think this really means falling back to microcheckpointing until the
> > Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
> > would prevent vcpus from running and when the last pending I/O finishes
> > the Secondary could apply the last checkpoint.  This approach does not
> > block QEMU (the monitor, etc).
> > 
> 
> If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
> We should do failover in this case.

This is dangerous because it means that a delay/failure in the Secondary
would cause the Primary to fail over to the broken Secondary.  All the
more reason not to perform blocking operations on the Secondary in the
checkpoint code path.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/8] docs: block replication's description
  2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 5/8] docs: block replication's description Changlong Xie
@ 2016-02-02 15:20   ` Eric Blake
  2016-02-03  3:18     ` Changlong Xie
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Blake @ 2016-02-02 15:20 UTC (permalink / raw)
  To: Changlong Xie, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Kevin Wolf, Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang, fnstml-hwcolo

[-- Attachment #1: Type: text/plain, Size: 6175 bytes --]

On 01/13/2016 02:18 AM, Changlong Xie wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
> ---
>  docs/block-replication.txt | 229 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 229 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 0000000..d1a231e
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,229 @@
> +Block replication
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.

Do you want to claim 2016 for any of this?

> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +Block replication is used for continuous checkpoints. It is designed
> +for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
> +where the Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is

s/of Primary/of the Primary/

> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of

s/at the time of/during a vmstate/
s/operations of/operations of the/

> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Workflow ==

> +== Architecture ==

> +
> +6) The drive-backup job(sync=none) is run to allow hidden-disk to buffer

Space before ( in English description.

> +any state that would otherwise be lost by the speculative write-through
> +of the NBD server into the secondary disk. So before block replication,
> +the primary disk and secondary disk should contain the same data.
> +
> +== Failure Handling ==

> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw
> +
> +  Run qmp command in primary qemu:
> +    { 'execute': 'human-monitor-command',
> +      'arguments': {
> +          'command-line': 'drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1,if=none'

Eww. We shouldn't ever have to pack a command line as a single QMP
string that needs reparsing.  Instead, you should pass the information
as a nested QMP dictionary, something like:

'arguments': {
  'remote-command': { 'command': 'drive_add', 'name': 'buddy',
                      'driver': 'replication', 'mode': 'primary',
                      'file': { 'driver': 'nbd', 'host': 'xxxx',
  ... } } }

> +      }
> +    }
> +    { 'execute': 'x-blockdev-change',
> +      'arguments': {
> +          'parent': 'colo1',
> +          'node': 'nbd_client1'
> +      }
> +    }
> +  Note:
> +  1. There should be only one NBD Client for each primary disk.
> +  2. host is the secondary physical machine's hostname or IP
> +  3. Each disk must have its own export name.
> +  4. It is all a single argument to -drive and you should ignore the
> +     leading whitespace.
> +  5. The qmp command line must be run after running qmp command line in
> +     secondary qemu.
> +
> +Secondary:
> +  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
> +  -drive if=xxx,driver=replication,mode=secondary,\
> +         file.file.filename=active_disk.qcow2,\
> +         file.driver=qcow2,\
> +         file.backing.file.filename=hidden_disk.qcow2,\
> +         file.backing.driver=qcow2,\
> +         file.backing.backing=colo1
> +
> +  Then run qmp command in secondary qemu:
> +    { 'execute': 'nbd-server-start',
> +      'arguments': {
> +          'addr': {
> +              'type': 'inet',
> +              'data': {
> +                  'host': 'xxx',
> +                  'port': 'xxx'
> +              }
> +          }
> +      }
> +    }
> +    { 'execute': 'nbd-server-add',
> +      'arguments': {
> +          'device': 'colo1',
> +          'writable': true
> +      }
> +    }
> +
> +  Note:
> +  1. The export name in secondary QEMU command line is the secondary
> +     disk's id.
> +  2. The export name for the same disk must be the same
> +  3. The qmp command nbd-server-start and nbd-server-add must be run
> +     before running the qmp command migrate on primary QEMU
> +  4. Active disk, hidden disk and nbd target's length should be the
> +     same.
> +  5. It is better to put active disk and hidden disk in ramdisk.
> +  6. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.
> +
> +After Failover:
> +Primary:
> +  The secondary host is down, so we should run the following qmp command
> +  to remove the nbd child from the quorum:
> +  { 'execute': 'x-blockdev-change',
> +    'arguments': {
> +        'parent': 'colo1',
> +        'child': 'children.1'
> +    }
> +  }
> +  Note: there is no qmp command to remove the blockdev now
> +
> +Secondary:
> +  The primary host is down, so we should do the following thing:
> +  { 'execute': 'nbd-server-stop' }
> +
> +TODO:
> +1. Continuous block replication
> +2. Shared disk
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-02-02 14:34               ` Stefan Hajnoczi
@ 2016-02-03  1:29                 ` Wen Congyang
  2016-02-03  9:32                   ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Wen Congyang @ 2016-02-03  1:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Changlong Xie, Fam Zheng, zhanghailiang,
	fnstml-hwcolo, qemu devel, Max Reitz, Gonglei, Paolo Bonzini

On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>>>> in-flight I/O requests have completed.  What does the Primary do if the
>>>>> Secondary becomes unresponsive?
>>>>
>>>> Actually, we knew this problem. But currently, there seems no better way to
>>>> resolve it. If you have any ideas?
>>>
>>> Is it possible to hold the checkpoint information and acknowledge the
>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>> Secondory guest activity to complete?
>>
>> There is no way to know that secondary becomes unreponsive.
> 
> I meant whether it is necessary for the Secondary to vm_stop() and apply
> the checkpoint before acknowledging the checkpoint to the Primary?

I don't understand this.
Here is the COLO checkpoint flow:

    Primary                                                Secondary
    new checkpoint notice                 --->
    vm_stop()                                              vm_stop()
    vm state(device state, memory, cpu)   --->
                                                           load state
                                          <---             done
    vm_start()                                             vm_start()
> 
>>> I think this really means falling back to microcheckpointing until the
>>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
>>> would prevent vcpus from running and when the last pending I/O finishes
>>> the Secondary could apply the last checkpoint.  This approach does not
>>> block QEMU (the monitor, etc).
>>>
>>
>> If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
>> We should do failover in this case.
> 
> This is dangerous because it means that a delay/failure in the Secondary
> would cause the Primary to fail over to the broken Secondary.  All the
> more reason not to perform blocking operations on the Secondary in the
> checkpoint code path.

If the secondary is broken, primary qemu will take over.

Thanks
Wen Congyang

> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/8] docs: block replication's description
  2016-02-02 15:20   ` Eric Blake
@ 2016-02-03  3:18     ` Changlong Xie
  2016-02-03  3:35       ` Eric Blake
  0 siblings, 1 reply; 34+ messages in thread
From: Changlong Xie @ 2016-02-03  3:18 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Kevin Wolf, Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang, fnstml-hwcolo

On 02/02/2016 11:20 PM, Eric Blake wrote:
> On 01/13/2016 02:18 AM, Changlong Xie wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
>> ---
>>   docs/block-replication.txt | 229 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 229 insertions(+)
>>   create mode 100644 docs/block-replication.txt
>>
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 0000000..d1a231e
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,229 @@
>> +Block replication
>> +----------------------------------------
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>
> Do you want to claim 2016 for any of this?

Will update in next version.

>
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or later.
>> +See the COPYING file in the top-level directory.
>> +
>> +Block replication is used for continuous checkpoints. It is designed
>> +for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
>> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
>> +where the Secondary VM is not running.
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
>
> s/of Primary/of the Primary/

OK

>
>> +identical right after a VM checkpoint, but becomes different as the VM
>> +executes till the next checkpoint. To support disk contents checkpoint,
>> +the modified disk contents in the Secondary VM must be buffered, and are
>> +only dropped at next checkpoint time. To reduce the network transportation
>> +effort at the time of checkpoint, the disk modification operations of
>
> s/at the time of/during a vmstate/
> s/operations of/operations of the/

OK

>
>> +Primary disk are asynchronously forwarded to the Secondary node.
>> +
>> +== Workflow ==
>
>> +== Architecture ==
>
>> +
>> +6) The drive-backup job(sync=none) is run to allow hidden-disk to buffer
>
> Space before ( in English description.

OK

>
>> +any state that would otherwise be lost by the speculative write-through
>> +of the NBD server into the secondary disk. So before block replication,
>> +the primary disk and secondary disk should contain the same data.
>> +
>> +== Failure Handling ==
>
>> +== Usage ==
>> +Primary:
>> +  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
>> +         children.0.file.filename=1.raw,\
>> +         children.0.driver=raw
>> +
>> +  Run qmp command in primary qemu:
>> +    { 'execute': 'human-monitor-command',
>> +      'arguments': {
>> +          'command-line': 'drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1,if=none'
>
> Eww. We shouldn't ever have to pack a command line as . single QMP
> string that needs reparsing.  Instead, you should pass the information
> as a nested QMP dictionary, something like:
>
> 'arguments': {
>    'remote-command': { 'command': 'drive_add', 'name': 'buddy',
>                        'driver': 'replication', 'mode': 'primary',
>                        'file': { 'driver': 'nbd', 'host': 'xxxx',

Hi Eric

What is 'remote-command' here? I just tried below commands, but got some 
errors.

{'execute': 'human-monitor-command',
     'arguments': {
        'command-line': {
             'command': 'drive_add',
             'name': 'buddy',
             'driver': 'replication',
             'mode': 'primary',
             'if': 'none',
             'node-name': 'primary_nbd_node',
             'file': {
                 'driver': 'nbd',
                 'host': '192.168.3.2',
                 'port': '8889',
                 'export': 'colo-disk',
             }

         }
      }
}
{"error": {"class": "GenericError", "desc": "Invalid JSON syntax"}}		


'blockdev-add' doesn't support 'nbd'. So we use 'drive_add' here, and 
it's a hmp command. If i'm right, there seems just one way to execute 
hmp commands in QMP:

================================================================
EQMP

     {
         .name       = "human-monitor-command",
         .args_type  = "command-line:s,cpu-index:i?",
         .mhandler.cmd_new = qmp_marshal_human_monitor_command,
     },

SQMP
human-monitor-command
---------------------

Execute a Human Monitor command.

Arguments:

- command-line: the command name and its arguments, just like the
                 Human Monitor's shell (json-string)
- cpu-index: select the CPU number to be used by commands which access CPU
              data, like 'info registers'. The Monitor selects CPU 0 if this
              argument is not provided (json-int, optional)

Example:

-> { "execute": "human-monitor-command", "arguments": { "command-line": 
"info kvm" } }
<- { "return": "kvm support: enabled\r\n" }
==================================================================

	   	
Thanks
	-Xie


>    ... } } }
>
>> +      }
>> +    }
>> +    { 'execute': 'x-blockdev-change',
>> +      'arguments': {
>> +          'parent': 'colo1',
>> +          'node': 'nbd_client1'
>> +      }
>> +    }
>> +  Note:
>> +  1. There should be only one NBD Client for each primary disk.
>> +  2. host is the secondary physical machine's hostname or IP
>> +  3. Each disk must have its own export name.
>> +  4. It is all a single argument to -drive and you should ignore the
>> +     leading whitespace.
>> +  5. The qmp command line must be run after running qmp command line in
>> +     secondary qemu.
>> +
>> +Secondary:
>> +  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
>> +  -drive if=xxx,driver=replication,mode=secondary,\
>> +         file.file.filename=active_disk.qcow2,\
>> +         file.driver=qcow2,\
>> +         file.backing.file.filename=hidden_disk.qcow2,\
>> +         file.backing.driver=qcow2,\
>> +         file.backing.backing=colo1
>> +
>> +  Then run qmp command in secondary qemu:
>> +    { 'execute': 'nbd-server-start',
>> +      'arguments': {
>> +          'addr': {
>> +              'type': 'inet',
>> +              'data': {
>> +                  'host': 'xxx',
>> +                  'port': 'xxx'
>> +              }
>> +          }
>> +      }
>> +    }
>> +    { 'execute': 'nbd-server-add',
>> +      'arguments': {
>> +          'device': 'colo1',
>> +          'writable': true
>> +      }
>> +    }
>> +
>> +  Note:
>> +  1. The export name in secondary QEMU command line is the secondary
>> +     disk's id.
>> +  2. The export name for the same disk must be the same
>> +  3. The qmp command nbd-server-start and nbd-server-add must be run
>> +     before running the qmp command migrate on primary QEMU
>> +  4. Active disk, hidden disk and nbd target's length should be the
>> +     same.
>> +  5. It is better to put active disk and hidden disk in ramdisk.
>> +  6. It is all a single argument to -drive, and you should ignore
>> +     the leading whitespace.
>> +
>> +After Failover:
>> +Primary:
>> +  The secondary host is down, so we should run the following qmp command
>> +  to remove the nbd child from the quorum:
>> +  { 'execute': 'x-blockdev-change',
>> +    'arguments': {
>> +        'parent': 'colo1',
>> +        'child': 'children.1'
>> +    }
>> +  }
>> +  Note: there is no qmp command to remove the blockdev now
>> +
>> +Secondary:
>> +  The primary host is down, so we should do the following thing:
>> +  { 'execute': 'nbd-server-stop' }
>> +
>> +TODO:
>> +1. Continuous block replication
>> +2. Shared disk
>>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/8] docs: block replication's description
  2016-02-03  3:18     ` Changlong Xie
@ 2016-02-03  3:35       ` Eric Blake
  2016-02-03  3:40         ` Wen Congyang
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Blake @ 2016-02-03  3:35 UTC (permalink / raw)
  To: Changlong Xie, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Kevin Wolf, Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang, fnstml-hwcolo

[-- Attachment #1: Type: text/plain, Size: 2338 bytes --]

On 02/02/2016 08:18 PM, Changlong Xie wrote:
> On 02/02/2016 11:20 PM, Eric Blake wrote:
>> On 01/13/2016 02:18 AM, Changlong Xie wrote:
>>> From: Wen Congyang <wency@cn.fujitsu.com>
>>>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>>> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
>>> ---
>>>   docs/block-replication.txt | 229

>>> +== Usage ==
>>> +Primary:
>>> +  -drive
>>> if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
>>> +         children.0.file.filename=1.raw,\
>>> +         children.0.driver=raw
>>> +
>>> +  Run qmp command in primary qemu:
>>> +    { 'execute': 'human-monitor-command',
>>> +      'arguments': {
>>> +          'command-line': 'drive_add buddy
>>> driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1,if=none'
>>>
>>
>> Eww. We shouldn't ever have to pack a command line as . single QMP
>> string that needs reparsing.  Instead, you should pass the information
>> as a nested QMP dictionary, something like:
>>
>> 'arguments': {
>>    'remote-command': { 'command': 'drive_add', 'name': 'buddy',
>>                        'driver': 'replication', 'mode': 'primary',
>>                        'file': { 'driver': 'nbd', 'host': 'xxxx',
> 
> Hi Eric
> 
> What is 'remote-command' here? I just tried below commands, but got some
> errors.

Oh, I totally missed that this was using the existing
'human-monitor-command' to send an HMP command, instead of trying to
send a formal QMP command.  I thought you were documenting a new QMP
command.

Still, it would be nice to do this command using strict QMP (that would
be via 'blockdev-add') rather than via HMP (an all-in-one text command
crammed into the single 'command-line' argument).

> 
> 'blockdev-add' doesn't support 'nbd'. So we use 'drive_add' here, and
> it's a hmp command. If i'm right, there seems just one way to execute
> hmp commands in QMP:

For 2.6, we _really_ ought to get blockdev-add working for everything.
We're running short on time, though :(

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/8] docs: block replication's description
  2016-02-03  3:35       ` Eric Blake
@ 2016-02-03  3:40         ` Wen Congyang
  0 siblings, 0 replies; 34+ messages in thread
From: Wen Congyang @ 2016-02-03  3:40 UTC (permalink / raw)
  To: Eric Blake, Changlong Xie, qemu devel, Fam Zheng, Max Reitz,
	Paolo Bonzini, Kevin Wolf, Stefan Hajnoczi
  Cc: Gonglei, zhanghailiang, fnstml-hwcolo

On 02/03/2016 11:35 AM, Eric Blake wrote:
> On 02/02/2016 08:18 PM, Changlong Xie wrote:
>> On 02/02/2016 11:20 PM, Eric Blake wrote:
>>> On 01/13/2016 02:18 AM, Changlong Xie wrote:
>>>> From: Wen Congyang <wency@cn.fujitsu.com>
>>>>
>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>>>> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
>>>> ---
>>>>   docs/block-replication.txt | 229
> 
>>>> +== Usage ==
>>>> +Primary:
>>>> +  -drive
>>>> if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
>>>> +         children.0.file.filename=1.raw,\
>>>> +         children.0.driver=raw
>>>> +
>>>> +  Run qmp command in primary qemu:
>>>> +    { 'execute': 'human-monitor-command',
>>>> +      'arguments': {
>>>> +          'command-line': 'drive_add buddy
>>>> driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1,if=none'
>>>>
>>>
>>> Eww. We shouldn't ever have to pack a command line as . single QMP
>>> string that needs reparsing.  Instead, you should pass the information
>>> as a nested QMP dictionary, something like:
>>>
>>> 'arguments': {
>>>    'remote-command': { 'command': 'drive_add', 'name': 'buddy',
>>>                        'driver': 'replication', 'mode': 'primary',
>>>                        'file': { 'driver': 'nbd', 'host': 'xxxx',
>>
>> Hi Eric
>>
>> What is 'remote-command' here? I just tried below commands, but got some
>> errors.
> 
> Oh, I totally missed that this was using the existing
> 'human-monitor-command' to send an HMP command, instead of trying to
> send a formal QMP command.  I thought you were documenting a new QMP
> command.
> 
> Still, it would be nice to do this command using strict QMP (that would
> be via 'blockdev-add') rather than via HMP (an all-in-one text command
> crammed into the single 'command-line' argument).
> 
>>
>> 'blockdev-add' doesn't support 'nbd'. So we use 'drive_add' here, and
>> it's a hmp command. If i'm right, there seems just one way to execute
>> hmp commands in QMP:
> 
> For 2.6, we _really_ ought to get blockdev-add working for everything.
> We're running short on time, though :(
> 

If the qmp command 'blockdev-add' supports nbd in qemu-2.6, we will update
this document when it is suppoted.

Thanks
Wen Congyang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-02-03  1:29                 ` Wen Congyang
@ 2016-02-03  9:32                   ` Stefan Hajnoczi
  2016-02-03  9:55                     ` Wen Congyang
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2016-02-03  9:32 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Changlong Xie, Fam Zheng, zhanghailiang,
	fnstml-hwcolo, qemu devel, Max Reitz, Gonglei, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3535 bytes --]

On Wed, Feb 03, 2016 at 09:29:15AM +0800, Wen Congyang wrote:
> On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
> > On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
> >> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
> >>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
> >>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
> >>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
> >>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> >>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
> >>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
> >>>>> in-flight I/O requests have completed.  What does the Primary do if the
> >>>>> Secondary becomes unresponsive?
> >>>>
> >>>> Actually, we knew this problem. But currently, there seems no better way to
> >>>> resolve it. If you have any ideas?
> >>>
> >>> Is it possible to hold the checkpoint information and acknowledge the
> >>> checkpoint right away, without waiting for bdrv_drain_all() or any
> >>> Secondory guest activity to complete?
> >>
> >> There is no way to know that secondary becomes unreponsive.
> > 
> > I meant whether it is necessary for the Secondary to vm_stop() and apply
> > the checkpoint before acknowledging the checkpoint to the Primary?
> 
> I don't understand this.
> Here is the COLO checkpoint flow:
> 
>     Primary                                                Secondary
>     new checkpoint notice                 --->
>     vm_stop()                                              vm_stop()
>     vm state(device state, memory, cpu)   --->
>                                                            load state
>                                           <---             done
>     vm_start()                                             vm_start()

If the Secondary's vm_stop() call blocks then the Primary is stuck too.

I was wondering whether the Secondary can do:

<---  done
      vm_stop()
      load state

It simply receives the checkpoint data into a buffer and immediately
replies with "done".  vm_stop() and load state is only performed after
sending "done".

The advantage is that the Primary will not be delayed by the Secondary.
It's an approach that doesn't block.

But perhaps it's a problem if the Secondary is slower than the Primary
since the Secondary still needs to complete vm_stop() and load state
before it can resume execution?

> >>> I think this really means falling back to microcheckpointing until the
> >>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
> >>> would prevent vcpus from running and when the last pending I/O finishes
> >>> the Secondary could apply the last checkpoint.  This approach does not
> >>> block QEMU (the monitor, etc).
> >>>
> >>
> >> If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
> >> We should do failover in this case.
> > 
> > This is dangerous because it means that a delay/failure in the Secondary
> > would cause the Primary to fail over to the broken Secondary.  All the
> > more reason not to perform blocking operations on the Secondary in the
> > checkpoint code path.
> 
> If the secondary is broken, primary qemu will take over.

Does the Primary use a timeout between "new checkpoint notice" and
Secondary's "done" so it can move on if the Secondary is unresponsive?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-02-03  9:32                   ` Stefan Hajnoczi
@ 2016-02-03  9:55                     ` Wen Congyang
  2016-02-03 11:25                       ` Hailiang Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Wen Congyang @ 2016-02-03  9:55 UTC (permalink / raw)
  To: Stefan Hajnoczi, zhanghailiang
  Cc: Kevin Wolf, Changlong Xie, Fam Zheng, fnstml-hwcolo, qemu devel,
	Max Reitz, Gonglei, Paolo Bonzini

On 02/03/2016 05:32 PM, Stefan Hajnoczi wrote:
> On Wed, Feb 03, 2016 at 09:29:15AM +0800, Wen Congyang wrote:
>> On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
>>> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>>>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>>>>>> in-flight I/O requests have completed.  What does the Primary do if the
>>>>>>> Secondary becomes unresponsive?
>>>>>>
>>>>>> Actually, we knew this problem. But currently, there seems no better way to
>>>>>> resolve it. If you have any ideas?
>>>>>
>>>>> Is it possible to hold the checkpoint information and acknowledge the
>>>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>>>> Secondory guest activity to complete?
>>>>
>>>> There is no way to know that secondary becomes unreponsive.
>>>
>>> I meant whether it is necessary for the Secondary to vm_stop() and apply
>>> the checkpoint before acknowledging the checkpoint to the Primary?
>>
>> I don't understand this.
>> Here is the COLO checkpoint flow:
>>
>>     Primary                                                Secondary
>>     new checkpoint notice                 --->
>>     vm_stop()                                              vm_stop()
>>     vm state(device state, memory, cpu)   --->
>>                                                            load state
>>                                           <---             done
>>     vm_start()                                             vm_start()
> 
> If the Secondary's vm_stop() call blocks then the Primary is stuck too.
> 
> I was wondering whether the Secondary can do:
> 
> <---  done
>       vm_stop()
>       load state
> 
> It simply receives the checkpoint data into a buffer and immediately
> replies with "done".  vm_stop() and load state is only performed after
> sending "done".

Secondary vm is running, so we should also get the pages that are dirtied
by secondary vm, but not dirtied by primary vm.
We have two ways to do it:
1. Cache all original memory in the secondary qemu
2. Send the dirty pfn list to primary qemu, and get it.

If we ack the checkpoint and the call vm_stop(), we only can select 1. It
means that secondary qemu costs more memory.
In COLO mode, we will compare the output socket, and will do checkpoint if
the application level data is different. If we ack the checkpoint and the
call vm_stop(), the client can not get any more data until secondary vm
is running again. So we still 'wait' the secondary vm.


> 
> The advantage is that the Primary will not be delayed by the Secondary.
> It's an approach that doesn't block.
> 
> But perhaps it's a problem if the Secondary is slower than the Primary
> since the Secondary still needs to complete vm_stop() and load state
> before it can resume execution?
> 
>>>>> I think this really means falling back to microcheckpointing until the
>>>>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
>>>>> would prevent vcpus from running and when the last pending I/O finishes
>>>>> the Secondary could apply the last checkpoint.  This approach does not
>>>>> block QEMU (the monitor, etc).
>>>>>
>>>>
>>>> If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
>>>> We should do failover in this case.
>>>
>>> This is dangerous because it means that a delay/failure in the Secondary
>>> would cause the Primary to fail over to the broken Secondary.  All the
>>> more reason not to perform blocking operations on the Secondary in the
>>> checkpoint code path.
>>
>> If the secondary is broken, primary qemu will take over.
> 
> Does the Primary use a timeout between "new checkpoint notice" and
> Secondary's "done" so it can move on if the Secondary is unresponsive?

To hailiang:
IIRC, we don't use a timeout but I think we can do it. In our design, there is
an exteranl heartbeat to check primary and secondary status, and decide when
to do checkpoint.

Thanks
Wen Congyang


> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
  2016-02-03  9:55                     ` Wen Congyang
@ 2016-02-03 11:25                       ` Hailiang Zhang
  0 siblings, 0 replies; 34+ messages in thread
From: Hailiang Zhang @ 2016-02-03 11:25 UTC (permalink / raw)
  To: Wen Congyang, Stefan Hajnoczi
  Cc: Kevin Wolf, Changlong Xie, Fam Zheng, fnstml-hwcolo, qemu devel,
	peter.huangpeng, Max Reitz, Gonglei, Paolo Bonzini

On 2016/2/3 17:55, Wen Congyang wrote:
> On 02/03/2016 05:32 PM, Stefan Hajnoczi wrote:
>> On Wed, Feb 03, 2016 at 09:29:15AM +0800, Wen Congyang wrote:
>>> On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
>>>> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>>>>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>>>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>>>>>>> in-flight I/O requests have completed.  What does the Primary do if the
>>>>>>>> Secondary becomes unresponsive?
>>>>>>>
>>>>>>> Actually, we knew this problem. But currently, there seems no better way to
>>>>>>> resolve it. If you have any ideas?
>>>>>>
>>>>>> Is it possible to hold the checkpoint information and acknowledge the
>>>>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>>>>> Secondory guest activity to complete?
>>>>>
>>>>> There is no way to know that secondary becomes unreponsive.
>>>>
>>>> I meant whether it is necessary for the Secondary to vm_stop() and apply
>>>> the checkpoint before acknowledging the checkpoint to the Primary?
>>>
>>> I don't understand this.
>>> Here is the COLO checkpoint flow:
>>>
>>>      Primary                                                Secondary
>>>      new checkpoint notice                 --->
>>>      vm_stop()                                              vm_stop()
>>>      vm state(device state, memory, cpu)   --->
>>>                                                             load state
>>>                                            <---             done
>>>      vm_start()                                             vm_start()
>>
>> If the Secondary's vm_stop() call blocks then the Primary is stuck too.
>>
>> I was wondering whether the Secondary can do:
>>
>> <---  done
>>        vm_stop()
>>        load state
>>
>> It simply receives the checkpoint data into a buffer and immediately
>> replies with "done".  vm_stop() and load state is only performed after
>> sending "done".
>
> Secondary vm is running, so we should also get the pages that are dirtied
> by secondary vm, but not dirtied by primary vm.
> We have two ways to do it:
> 1. Cache all original memory in the secondary qemu
> 2. Send the dirty pfn list to primary qemu, and get it.
>
> If we ack the checkpoint and the call vm_stop(), we only can select 1. It
> means that secondary qemu costs more memory.
> In COLO mode, we will compare the output socket, and will do checkpoint if
> the application level data is different. If we ack the checkpoint and the
> call vm_stop(), the client can not get any more data until secondary vm
> is running again. So we still 'wait' the secondary vm.
>
>
>>
>> The advantage is that the Primary will not be delayed by the Secondary.
>> It's an approach that doesn't block.
>>
>> But perhaps it's a problem if the Secondary is slower than the Primary
>> since the Secondary still needs to complete vm_stop() and load state
>> before it can resume execution?
>>
>>>>>> I think this really means falling back to microcheckpointing until the
>>>>>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
>>>>>> would prevent vcpus from running and when the last pending I/O finishes
>>>>>> the Secondary could apply the last checkpoint.  This approach does not
>>>>>> block QEMU (the monitor, etc).
>>>>>>
>>>>>
>>>>> If secondary host becomes unresponsive, it means that we cannot do mocrocheckpointing.
>>>>> We should do failover in this case.
>>>>
>>>> This is dangerous because it means that a delay/failure in the Secondary
>>>> would cause the Primary to fail over to the broken Secondary.  All the
>>>> more reason not to perform blocking operations on the Secondary in the
>>>> checkpoint code path.
>>>
>>> If the secondary is broken, primary qemu will take over.
>>
>> Does the Primary use a timeout between "new checkpoint notice" and
>> Secondary's "done" so it can move on if the Secondary is unresponsive?
>
> To hailiang:
> IIRC, we don't use a timeout but I think we can do it. In our design, there is

Yes, we may need a timeout to help detecting the unresponsive case
which can not be caught by the external heartbeat module.
I will investigate it.

Thanks,
Hailiang

> an exteranl heartbeat to check primary and secondary status, and decide when
> to do checkpoint.
>
> Thanks
> Wen Congyang
>
>
>>
>> Stefan
>>
>
>
>
>
> .
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2016-02-03 11:30 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-13  9:18 [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 1/8] unblock backup operations in backing file Changlong Xie
2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 2/8] Store parent BDS in BdrvChild Changlong Xie
2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 3/8] Backup: clear all bitmap when doing block checkpoint Changlong Xie
2016-01-27 16:05   ` Stefan Hajnoczi
2016-01-28  2:22     ` Changlong Xie
2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 4/8] Allow creating backup jobs when opening BDS Changlong Xie
2016-01-27 14:04   ` Stefan Hajnoczi
2016-01-28  2:22     ` Changlong Xie
2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 5/8] docs: block replication's description Changlong Xie
2016-02-02 15:20   ` Eric Blake
2016-02-03  3:18     ` Changlong Xie
2016-02-03  3:35       ` Eric Blake
2016-02-03  3:40         ` Wen Congyang
2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 6/8] auto complete active commit Changlong Xie
2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication Changlong Xie
2016-01-20  0:04   ` Eric Blake
2016-01-20  7:45     ` Changlong Xie
2016-01-27 14:46   ` Stefan Hajnoczi
2016-01-28  1:13     ` Wen Congyang
2016-01-28 15:15       ` Stefan Hajnoczi
2016-01-29  3:13         ` Changlong Xie
2016-01-29 15:46           ` Stefan Hajnoczi
2016-02-01  1:13             ` Wen Congyang
2016-02-02 14:34               ` Stefan Hajnoczi
2016-02-03  1:29                 ` Wen Congyang
2016-02-03  9:32                   ` Stefan Hajnoczi
2016-02-03  9:55                     ` Wen Congyang
2016-02-03 11:25                       ` Hailiang Zhang
2016-01-13  9:18 ` [Qemu-devel] [PATCH v14 8/8] support replication driver in blockdev-add Changlong Xie
2016-01-14  1:12 ` [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints Changlong Xie
2016-01-19  1:27   ` Hailiang Zhang
2016-01-24 13:34 ` Wen Congyang
2016-01-26 16:04   ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.