All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints
@ 2015-12-02  5:31 Wen Congyang
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 01/10] unblock backup operations in backing file Wen Congyang
                   ` (10 more replies)
  0 siblings, 11 replies; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:31 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Gonglei, zhanghailiang

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

This patch series is based on the following patch series:
1. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04949.html
2. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg06043.html

You can get the patch here:
https://github.com/coloft/qemu/tree/wency/block-replication-v12

You can get the patch with framework here:
https://github.com/coloft/qemu/tree/wency/colo_framework_v11.2

TODO:
1. Continuous block replication. It will be started after basic functions
   are accepted.

Changs Log:
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
   opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
   when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
   its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
   reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
   if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Wen Congyang (10):
  unblock backup operations in backing file
  Store parent BDS in BdrvChild
  Backup: clear all bitmap when doing block checkpoint
  Allow creating backup jobs when opening BDS
  docs: block replication's description
  Add new block driver interfaces to control block replication
  quorum: implement block driver interfaces for block replication
  Implement new driver for block replication
  support replication driver in blockdev-add
  Add a new API to start/stop replication, do checkpoint to all BDSes

 block.c                    | 145 ++++++++++++
 block/Makefile.objs        |   3 +-
 block/backup.c             |  14 ++
 block/quorum.c             |  78 +++++++
 block/replication.c        | 549 +++++++++++++++++++++++++++++++++++++++++++++
 blockjob.c                 |  11 +
 docs/block-replication.txt | 227 +++++++++++++++++++
 include/block/block.h      |   9 +
 include/block/block_int.h  |  15 ++
 include/block/blockjob.h   |  12 +
 qapi/block-core.json       |  34 ++-
 11 files changed, 1093 insertions(+), 4 deletions(-)
 create mode 100644 block/replication.c
 create mode 100644 docs/block-replication.txt

-- 
2.5.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Qemu-devel] [Patch v12 resend 01/10] unblock backup operations in backing file
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
@ 2015-12-02  5:31 ` Wen Congyang
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 02/10] Store parent BDS in BdrvChild Wen Congyang
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:31 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Gonglei, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/block.c b/block.c
index bfc2be8..eaf479a 100644
--- a/block.c
+++ b/block.c
@@ -1275,6 +1275,24 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
     /* Otherwise we won't be able to commit due to check in bdrv_commit */
     bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
                     bs->backing_blocker);
+    /*
+     * We do backup in 3 ways:
+     * 1. drive backup
+     *    The target bs is new opened, and the source is top BDS
+     * 2. blockdev backup
+     *    Both the source and the target are top BDSes.
+     * 3. internal backup(used for block replication)
+     *    Both the source and the target are backing file
+     *
+     * In case 1, and 2, the backing file is neither the source nor
+     * the target.
+     * In case 3, we will block the top BDS, so there is only one block
+     * job for the top BDS and its backing chain.
+     */
+    bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+                    bs->backing_blocker);
+    bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+                    bs->backing_blocker);
 out:
     bdrv_refresh_limits(bs, NULL);
 }
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [Patch v12 resend 02/10] Store parent BDS in BdrvChild
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 01/10] unblock backup operations in backing file Wen Congyang
@ 2015-12-02  5:31 ` Wen Congyang
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 03/10] Backup: clear all bitmap when doing block checkpoint Wen Congyang
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:31 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Gonglei, zhanghailiang

We need to access the parent BDS to get the root BDS.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c                   | 1 +
 include/block/block_int.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block.c b/block.c
index eaf479a..0a0468f 100644
--- a/block.c
+++ b/block.c
@@ -1204,6 +1204,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
     BdrvChild *child = g_new(BdrvChild, 1);
     *child = (BdrvChild) {
         .bs     = child_bs,
+        .parent = parent_bs,
         .name   = g_strdup(child_name),
         .role   = child_role,
     };
diff --git a/include/block/block_int.h b/include/block/block_int.h
index ea20d12..1f56046 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -357,6 +357,7 @@ extern const BdrvChildRole child_format;
 
 struct BdrvChild {
     BlockDriverState *bs;
+    BlockDriverState *parent;
     char *name;
     const BdrvChildRole *role;
     QLIST_ENTRY(BdrvChild) next;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [Patch v12 resend 03/10] Backup: clear all bitmap when doing block checkpoint
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 01/10] unblock backup operations in backing file Wen Congyang
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 02/10] Store parent BDS in BdrvChild Wen Congyang
@ 2015-12-02  5:31 ` Wen Congyang
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 04/10] Allow creating backup jobs when opening BDS Wen Congyang
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:31 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Gonglei, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
---
 block/backup.c           | 14 ++++++++++++++
 blockjob.c               | 11 +++++++++++
 include/block/blockjob.h | 12 ++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index 3b39119..1ca102d 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -253,11 +253,25 @@ static void backup_abort(BlockJob *job)
     }
 }
 
+static void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+    BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+    if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+        error_setg(errp, "The backup job only supports block checkpoint in"
+                   " sync=none mode");
+        return;
+    }
+
+    hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
     .instance_size  = sizeof(BackupBlockJob),
     .job_type       = BLOCK_JOB_TYPE_BACKUP,
     .set_speed      = backup_set_speed,
     .iostatus_reset = backup_iostatus_reset,
+    .do_checkpoint  = backup_do_checkpoint,
     .commit         = backup_commit,
     .abort          = backup_abort,
 };
diff --git a/blockjob.c b/blockjob.c
index 80adb9d..0c8edfe 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
     QLIST_INSERT_HEAD(&txn->jobs, job, txn_list);
     block_job_txn_ref(txn);
 }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+    if (!job->driver->do_checkpoint) {
+        error_setg(errp, "The job %s doesn't support block checkpoint",
+                   BlockJobType_lookup[job->driver->job_type]);
+        return;
+    }
+
+    job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index d84ccd8..abdba7c 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -70,6 +70,9 @@ typedef struct BlockJobDriver {
      * never both.
      */
     void (*abort)(BlockJob *job);
+
+    /** Optional callback for job types that support checkpoint. */
+    void (*do_checkpoint)(BlockJob *job, Error **errp);
 } BlockJobDriver;
 
 /**
@@ -443,4 +446,13 @@ void block_job_txn_unref(BlockJobTxn *txn);
  */
 void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job);
 
+/**
+ * block_job_do_checkpoint:
+ * @job: The job.
+ * @errp: Error object.
+ *
+ * Do block checkpoint on the specified job.
+ */
+void block_job_do_checkpoint(BlockJob *job, Error **errp);
+
 #endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [Patch v12 resend 04/10] Allow creating backup jobs when opening BDS
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
                   ` (2 preceding siblings ...)
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 03/10] Backup: clear all bitmap when doing block checkpoint Wen Congyang
@ 2015-12-02  5:31 ` Wen Congyang
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description Wen Congyang
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:31 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Gonglei, zhanghailiang

When opening BDS, we need to create backup jobs for
image-fleecing.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 58ef2ef..fa05f37 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += backup.o
 
 iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
 iscsi.o-libs       := $(LIBISCSI_LIBS)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
                   ` (3 preceding siblings ...)
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 04/10] Allow creating backup jobs when opening BDS Wen Congyang
@ 2015-12-02  5:31 ` Wen Congyang
  2015-12-23  9:26   ` Stefan Hajnoczi
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 06/10] Add new block driver interfaces to control block replication Wen Congyang
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:31 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Gonglei, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 docs/block-replication.txt | 227 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 0000000..c7bad0e
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,227 @@
+Block replication
+----------------------------------------
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content (it could be from either "Secondary Write Requests" or
+       previous COW of "Primary Write Requests") in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be buffered in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary    2 filter
+     disk         ^                                                             virtio-blk
+                  |                                                                  ^
+                3 NBD  ------->  3 NBD                                               |
+                client    ||     server                                          2 filter
+                          ||        ^                                                ^
+--------.                 ||        |                                                |
+Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
+--------'                 ||        |          backing        ^       backing
+                          ||        |                         |
+                          ||        |                         |
+                          ||        '-------------------------'
+                          ||           drive-backup sync=none 6
+
+1) The disk on the primary is represented by a block device with two
+children, providing replication between a primary disk and the host that
+runs the secondary VM. The read pattern for quorum can be extended to
+make the primary always read from the local disk instead of going through
+NBD.
+
+2) The new block filter (the name is replication) will control the block
+replication.
+
+3) The secondary disk receives writes from the primary VM through QEMU's
+embedded NBD server (speculative write-through).
+
+4) The disk on the secondary is represented by a custom block device
+(called active-disk). It should start as an empty disk, and the format
+should support bdrv_make_empty() and backing file.
+
+5) The hidden-disk is created automatically. It buffers the original content
+that is modified by the primary VM. It should also start as an empty disk,
+and the driver supports bdrv_make_empty() and backing file.
+
+6) The drive-backup job(sync=none) is run to allow hidden-disk to buffer
+any state that would otherwise be lost by the speculative write-through
+of the NBD server into the secondary disk. So before block replication,
+the primary disk and secondary disk should contain the same data.
+
+== Failure Handling ==
+There are 6 internal errors when block replication is running:
+1. I/O error on primary disk
+2. Forwarding primary write requests failed
+3. Backup failed
+4. I/O error on secondary disk
+5. I/O error on active disk
+6. Making active disk or hidden disk empty failed
+In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
+4 and 6, we just report block replication's error to FT/HA manager (which
+decides when to do a new checkpoint, when to do failover).
+There is no internal error when doing failover.
+
+== New block driver interface ==
+We add three block driver interfaces to control block replication:
+a. bdrv_start_replication()
+   Start block replication, called in migration/checkpoint thread.
+   We must call bdrv_start_replication() in secondary QEMU before
+   calling bdrv_start_replication() in primary QEMU. The caller
+   must hold the I/O mutex lock if it is in migration/checkpoint
+   thread.
+b. bdrv_do_checkpoint()
+   This interface is called after all VM state is transferred to
+   Secondary QEMU. The Disk buffer will be dropped in this interface.
+   The caller must hold the I/O mutex lock if it is in migration/checkpoint
+   thread.
+c. bdrv_stop_replication()
+   It is called on failover. We will flush the Disk buffer into
+   Secondary Disk and stop block replication. The vm should be stopped
+   before calling it if you use this API to shutdown the guest, or other
+   things except failover. The caller must hold the I/O mutex lock if it is
+   in migration/checkpoint thread.
+
+== Usage ==
+Primary:
+  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1\
+         children.0.file.filename=1.raw,\
+         children.0.driver=raw,\
+
+  Run qmp command in primary qemu:
+    { 'execute': 'human-monitor-command',
+      'arguments': {
+          'command-line': 'drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1,if=none'
+      }
+    }
+    { 'execute': 'x-blockdev-change',
+      'arguments': {
+          'parent': 'colo1',
+          'node': 'nbd_client1'
+      }
+    }
+  Note:
+  1. There should be only one NBD Client for each primary disk.
+  2. host is the secondary physical machine's hostname or IP
+  3. Each disk must have its own export name.
+  4. It is all a single argument to -drive and you should ignore the
+     leading whitespace.
+  5. The qmp command line must be run after running qmp command line in
+     secondary qemu.
+
+Secondary:
+  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
+  -drive if=xxx,driver=replication,mode=secondary,\
+         file.file.filename=active_disk.qcow2,\
+         file.driver=qcow2,\
+         file.backing.file.filename=hidden_disk.qcow2,\
+         file.backing.driver=qcow2,\
+         file.backing.backing=colo1
+
+  Then run qmp command in secondary qemu:
+    { 'execute': 'nbd-server-start',
+      'arguments': {
+          'addr': {
+              'type': inet',
+              'data': {
+                  'host': 'xxx',
+                  'port': 'xxx'
+              }
+          }
+      }
+    }
+    { 'execute': 'nbd-server-add',
+      'arguments': {
+          'device': 'colo1',
+          'writable': true
+      }
+    }
+
+  Note:
+  1. The export name in secondary QEMU command line is the secondary
+     disk's id.
+  2. The export name for the same disk must be the same
+  3. The qmp command nbd-server-start and nbd-server-add must be run
+     before running the qmp command migrate on primary QEMU
+  4. Active disk, hidden disk and nbd target's length should be the
+     same.
+  5. It is better to put active disk and hidden disk in ramdisk.
+  6. It is all a single argument to -drive, and you should ignore
+     the leading whitespace.
+
+After Failover:
+Primary:
+  The secondary host is down, so we should run the following qmp command
+  to remove the nbd child from the quorum:
+  { 'execute': 'x-blockdev-change',
+    'arguments': {
+        'parent': 'colo1',
+        'child': 'children.1'
+    }
+  }
+  Note: there is no qmp command to remove the blockdev now
+
+Secondary:
+  The primary host is down, so we should do the following thing:
+  { 'execute': 'nbd-server-stop' }
+
+TODO:
+1. Continuous block replication
+2. Shared disk
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [Patch v12 resend 06/10] Add new block driver interfaces to control block replication
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
                   ` (4 preceding siblings ...)
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description Wen Congyang
@ 2015-12-02  5:31 ` Wen Congyang
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 07/10] quorum: implement block driver interfaces for " Wen Congyang
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:31 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Luiz Capitulino, Gonglei, Michael Roth,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c                   | 43 +++++++++++++++++++++++++++++++++++++++++++
 include/block/block.h     |  5 +++++
 include/block/block_int.h | 14 ++++++++++++++
 qapi/block-core.json      | 13 +++++++++++++
 4 files changed, 75 insertions(+)

diff --git a/block.c b/block.c
index 0a0468f..213bee8 100644
--- a/block.c
+++ b/block.c
@@ -4390,3 +4390,46 @@ void bdrv_del_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
 
     parent_bs->drv->bdrv_del_child(parent_bs, child_bs, errp);
 }
+
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                            Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_start_replication) {
+        drv->bdrv_start_replication(bs, mode, errp);
+    } else if (bs->file) {
+        bdrv_start_replication(bs->file->bs, mode, errp);
+    } else {
+        error_setg(errp, "The BDS %s doesn't support starting block"
+                   " replication", bs->filename);
+    }
+}
+
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_do_checkpoint) {
+        drv->bdrv_do_checkpoint(bs, errp);
+    } else if (bs->file) {
+        bdrv_do_checkpoint(bs->file->bs, errp);
+    } else {
+        error_setg(errp, "The BDS %s doesn't support block checkpoint",
+                   bs->filename);
+    }
+}
+
+void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_stop_replication) {
+        drv->bdrv_stop_replication(bs, failover, errp);
+    } else if (bs->file) {
+        bdrv_stop_replication(bs->file->bs, failover, errp);
+    } else {
+        error_setg(errp, "The BDS %s doesn't support stopping block"
+                   " replication", bs->filename);
+    }
+}
diff --git a/include/block/block.h b/include/block/block.h
index 1d3b9c6..cd39d50 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -648,4 +648,9 @@ void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child,
 void bdrv_del_child(BlockDriverState *parent, BlockDriverState *child,
                     Error **errp);
 
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                            Error **errp);
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp);
+void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 1f56046..a6aba8b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -307,6 +307,20 @@ struct BlockDriver {
     void (*bdrv_del_child)(BlockDriverState *parent, BlockDriverState *child,
                            Error **errp);
 
+    void (*bdrv_start_replication)(BlockDriverState *bs, ReplicationMode mode,
+                                   Error **errp);
+    /* Drop Disk buffer when doing checkpoint. */
+    void (*bdrv_do_checkpoint)(BlockDriverState *bs, Error **errp);
+    /*
+     * After failover, we should flush Disk buffer into secondary disk
+     * and stop block replication.
+     *
+     * If the guest is shutdown, we should drop Disk buffer and stop
+     * block representation.
+     */
+    void (*bdrv_stop_replication)(BlockDriverState *bs, bool failover,
+                                  Error **errp);
+
     QLIST_ENTRY(BlockDriver) list;
 };
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index feb8da2..2c6bd3f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1925,6 +1925,19 @@
             '*read-pattern': 'QuorumReadPattern' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.5
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [Patch v12 resend 07/10] quorum: implement block driver interfaces for block replication
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
                   ` (5 preceding siblings ...)
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 06/10] Add new block driver interfaces to control block replication Wen Congyang
@ 2015-12-02  5:31 ` Wen Congyang
  2015-12-02  5:37 ` [Qemu-devel] [Patch v12 resend 08/10] Implement new driver " Wen Congyang
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:31 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Gonglei, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
---
 block/quorum.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index b7df14b..6fa54f3 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -85,6 +85,8 @@ typedef struct BDRVQuorumState {
     int bsize;
 
     QuorumReadPattern read_pattern;
+
+    int replication_index; /* store which child supports block replication */
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -949,6 +951,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     s->bsize = s->num_children;
 
     g_free(opened);
+    s->replication_index = -1;
     goto exit;
 
 close_exit:
@@ -1148,6 +1151,77 @@ static void quorum_refresh_filename(BlockDriverState *bs, QDict *options)
     bs->full_open_options = opts;
 }
 
+static void quorum_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                                     Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int count = 0, i, index;
+    Error *local_err = NULL;
+
+    /*
+     * TODO: support REPLICATION_MODE_SECONDARY if we allow secondary
+     * QEMU becoming primary QEMU.
+     */
+    if (mode != REPLICATION_MODE_PRIMARY) {
+        error_setg(errp, "The replication mode for quorum should be 'primary'");
+        return;
+    }
+
+    if (s->read_pattern != QUORUM_READ_PATTERN_FIFO) {
+        error_setg(errp, "Block replication needs read pattern 'fifo'");
+        return;
+    }
+
+    for (i = 0; i < s->num_children; i++) {
+        bdrv_start_replication(s->children[i]->bs, mode, &local_err);
+        if (local_err) {
+            error_free(local_err);
+            local_err = NULL;
+        } else {
+            count++;
+            index = i;
+        }
+    }
+
+    if (count == 0) {
+        error_setg(errp, "No child supports block replication");
+    } else if (count > 1) {
+        for (i = 0; i < s->num_children; i++) {
+            bdrv_stop_replication(s->children[i]->bs, false, NULL);
+        }
+        error_setg(errp, "Too many children support block replication");
+    } else {
+        s->replication_index = index;
+    }
+}
+
+static void quorum_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+
+    if (s->replication_index < 0) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    bdrv_do_checkpoint(s->children[s->replication_index]->bs, errp);
+}
+
+static void quorum_stop_replication(BlockDriverState *bs, bool failover,
+                                    Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+
+    if (s->replication_index < 0) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    bdrv_stop_replication(s->children[s->replication_index]->bs, failover,
+                          errp);
+    s->replication_index = -1;
+}
+
 static BlockDriver bdrv_quorum = {
     .format_name                        = "quorum",
     .protocol_name                      = "quorum",
@@ -1174,6 +1248,10 @@ static BlockDriver bdrv_quorum = {
 
     .is_filter                          = true,
     .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
+
+    .bdrv_start_replication             = quorum_start_replication,
+    .bdrv_do_checkpoint                 = quorum_do_checkpoint,
+    .bdrv_stop_replication              = quorum_stop_replication,
 };
 
 static void bdrv_quorum_init(void)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [Patch v12 resend 08/10] Implement new driver for block replication
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
                   ` (6 preceding siblings ...)
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 07/10] quorum: implement block driver interfaces for " Wen Congyang
@ 2015-12-02  5:37 ` Wen Congyang
  2015-12-23  9:47   ` Stefan Hajnoczi
  2015-12-02  5:37 ` [Qemu-devel] [Patch v12 resend 09/10] support replication driver in blockdev-add Wen Congyang
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:37 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/Makefile.objs |   1 +
 block/replication.c | 549 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 550 insertions(+)
 create mode 100644 block/replication.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fa05f37..94c1d03 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 0000000..c46c916
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,549 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "block/block_int.h"
+#include "block/blockjob.h"
+#include "block/nbd.h"
+
+typedef struct BDRVReplicationState {
+    ReplicationMode mode;
+    int replication_state;
+    BlockDriverState *active_disk;
+    BlockDriverState *hidden_disk;
+    BlockDriverState *secondary_disk;
+    BlockDriverState *top_bs;
+    Error *blocker;
+    int orig_hidden_flags;
+    int orig_secondary_flags;
+    int error;
+} BDRVReplicationState;
+
+enum {
+    BLOCK_REPLICATION_NONE,     /* block replication is not started */
+    BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+    BLOCK_REPLICATION_DONE,     /* block replication is done(failover) */
+};
+
+#define COMMIT_CLUSTER_BITS 16
+#define COMMIT_CLUSTER_SIZE (1 << COMMIT_CLUSTER_BITS)
+#define COMMIT_SECTORS_PER_CLUSTER (COMMIT_CLUSTER_SIZE / BDRV_SECTOR_SIZE)
+
+static void replication_stop(BlockDriverState *bs, bool failover, Error **errp);
+
+#define REPLICATION_MODE        "mode"
+static QemuOptsList replication_runtime_opts = {
+    .name = "replication",
+    .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+    .desc = {
+        {
+            .name = REPLICATION_MODE,
+            .type = QEMU_OPT_STRING,
+        },
+        { /* end of list */ }
+    },
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+                            int flags, Error **errp)
+{
+    int ret;
+    BDRVReplicationState *s = bs->opaque;;
+    Error *local_err = NULL;
+    QemuOpts *opts = NULL;
+    const char *mode;
+
+    ret = -EINVAL;
+    opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        goto fail;
+    }
+
+    mode = qemu_opt_get(opts, REPLICATION_MODE);
+    if (!mode) {
+        error_setg(&local_err, "Missing the option mode");
+        goto fail;
+    }
+
+    if (!strcmp(mode, "primary")) {
+        s->mode = REPLICATION_MODE_PRIMARY;
+    } else if (!strcmp(mode, "secondary")) {
+        s->mode = REPLICATION_MODE_SECONDARY;
+    } else {
+        error_setg(&local_err,
+                   "The option mode's value should be primary or secondary");
+        goto fail;
+    }
+
+    ret = 0;
+
+fail:
+    qemu_opts_del(opts);
+    /* propagate error */
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
+    return ret;
+}
+
+static void replication_close(BlockDriverState *bs)
+{
+    BDRVReplicationState *s = bs->opaque;
+
+    if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
+        replication_stop(bs, false, NULL);
+    }
+}
+
+static int64_t replication_getlength(BlockDriverState *bs)
+{
+    return bdrv_getlength(bs->file->bs);
+}
+
+static int replication_get_io_status(BDRVReplicationState *s)
+{
+    switch (s->replication_state) {
+    case BLOCK_REPLICATION_NONE:
+        return -EIO;
+    case BLOCK_REPLICATION_RUNNING:
+        return 0;
+    case BLOCK_REPLICATION_DONE:
+        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1;
+    default:
+        abort();
+    }
+}
+
+static int replication_return_value(BDRVReplicationState *s, int ret)
+{
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        return ret;
+    }
+
+    if (ret < 0) {
+        s->error = ret;
+        ret = 0;
+    }
+
+    return ret;
+}
+
+static coroutine_fn int replication_co_readv(BlockDriverState *bs,
+                                             int64_t sector_num,
+                                             int remaining_sectors,
+                                             QEMUIOVector *qiov)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int ret;
+
+    if (s->mode == REPLICATION_MODE_PRIMARY) {
+        /* We only use it to forward primary write requests */
+        return -EIO;
+    }
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /*
+     * After failover, because we don't commit active disk/hidden disk
+     * to secondary disk, so we should read from active disk directly.
+     */
+    ret = bdrv_co_readv(bs->file->bs, sector_num, remaining_sectors, qiov);
+    return replication_return_value(s, ret);
+}
+
+static coroutine_fn int replication_co_writev(BlockDriverState *bs,
+                                              int64_t sector_num,
+                                              int remaining_sectors,
+                                              QEMUIOVector *qiov)
+{
+    BDRVReplicationState *s = bs->opaque;
+    QEMUIOVector hd_qiov;
+    uint64_t bytes_done = 0;
+    BlockDriverState *top = bs->file->bs;
+    BlockDriverState *base = s->secondary_disk;
+    BlockDriverState *target;
+    int ret, n;
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (ret == 0) {
+        ret = bdrv_co_writev(bs->file->bs, sector_num,
+                             remaining_sectors, qiov);
+        return replication_return_value(s, ret);
+    }
+
+    /*
+     * Only write to active disk if the sectors have
+     * already been allocated in active disk/hidden disk.
+     */
+    qemu_iovec_init(&hd_qiov, qiov->niov);
+    while (remaining_sectors > 0) {
+        ret = bdrv_is_allocated_above(top, base, sector_num,
+                                      remaining_sectors, &n);
+        if (ret < 0) {
+            return ret;
+        }
+
+        qemu_iovec_reset(&hd_qiov);
+        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * 512);
+
+        target = ret ? top : base;
+        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
+        if (ret < 0) {
+            return ret;
+        }
+
+        remaining_sectors -= n;
+        sector_num += n;
+        bytes_done += n * BDRV_SECTOR_SIZE;
+    }
+
+    return 0;
+}
+
+static coroutine_fn int replication_co_discard(BlockDriverState *bs,
+                                               int64_t sector_num,
+                                               int nb_sectors)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int ret;
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (ret == 1) {
+        /* It is secondary qemu and we are after failover */
+        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    ret = bdrv_co_discard(bs->file->bs, sector_num, nb_sectors);
+    return replication_return_value(s, ret);
+}
+
+static bool replication_recurse_is_first_non_filter(BlockDriverState *bs,
+                                                    BlockDriverState *candidate)
+{
+    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
+}
+
+static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
+{
+    Error *local_err = NULL;
+    int ret;
+
+    if (!s->secondary_disk->job) {
+        error_setg(errp, "Backup job is cancelled unexpectedly");
+        return;
+    }
+
+    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);
+    if (ret < 0) {
+        error_setg(errp, "Cannot make active disk empty");
+        return;
+    }
+
+    ret = s->hidden_disk->drv->bdrv_make_empty(s->hidden_disk);
+    if (ret < 0) {
+        error_setg(errp, "Cannot make hidden disk empty");
+        return;
+    }
+}
+
+static void reopen_backing_file(BDRVReplicationState *s, bool writable,
+                                Error **errp)
+{
+    BlockReopenQueue *reopen_queue = NULL;
+    int orig_hidden_flags, orig_secondary_flags;
+    int new_hidden_flags, new_secondary_flags;
+    Error *local_err = NULL;
+
+    if (writable) {
+        orig_hidden_flags = bdrv_get_flags(s->hidden_disk);
+        new_hidden_flags = orig_hidden_flags | BDRV_O_RDWR;
+        orig_secondary_flags = bdrv_get_flags(s->secondary_disk);
+        new_secondary_flags = orig_secondary_flags | BDRV_O_RDWR;
+    } else {
+        orig_hidden_flags = s->orig_hidden_flags | BDRV_O_RDWR;
+        new_hidden_flags = s->orig_hidden_flags;
+        orig_secondary_flags = s->orig_secondary_flags | BDRV_O_RDWR;
+        new_secondary_flags = s->orig_secondary_flags;
+    }
+
+    if (orig_hidden_flags != new_hidden_flags) {
+        reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk, NULL,
+                                         new_hidden_flags);
+    }
+
+    if (!(orig_secondary_flags & BDRV_O_RDWR)) {
+        reopen_queue = bdrv_reopen_queue(reopen_queue, s->secondary_disk, NULL,
+                                         new_secondary_flags);
+    }
+
+    if (reopen_queue) {
+        bdrv_reopen_multiple(reopen_queue, &local_err);
+        if (local_err != NULL) {
+            error_propagate(errp, local_err);
+        }
+    }
+}
+
+static void backup_job_cleanup(BDRVReplicationState *s)
+{
+    bdrv_op_unblock_all(s->top_bs, s->blocker);
+    error_free(s->blocker);
+    reopen_backing_file(s, false, NULL);
+}
+
+static void backup_job_completed(void *opaque, int ret)
+{
+    BDRVReplicationState *s = opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_DONE) {
+        /* The backup job is cancelled unexpectedly */
+        s->error = -EIO;
+    }
+
+    backup_job_cleanup(s);
+}
+
+static BlockDriverState *get_top_bs(BlockDriverState *bs)
+{
+    BdrvChild *child;
+
+    while (!bs->blk) {
+        if (QLIST_EMPTY(&bs->parents)) {
+            return NULL;
+        }
+
+        child = QLIST_FIRST(&bs->parents);
+        if (QLIST_NEXT(child, next_parent)) {
+            return NULL;
+        }
+
+        bs = child->parent;
+    }
+
+    return bs;
+}
+
+static void replication_start(BlockDriverState *bs, ReplicationMode mode,
+                              Error **errp)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int64_t active_length, hidden_length, disk_length;
+    AioContext *aio_context;
+    Error *local_err = NULL;
+
+    if (s->replication_state != BLOCK_REPLICATION_NONE) {
+        error_setg(errp, "Block replication is running or done");
+        return;
+    }
+
+    if (s->mode != mode) {
+        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
+                   " but receives %d", s->mode, mode);
+        return;
+    }
+
+    switch (s->mode) {
+    case REPLICATION_MODE_PRIMARY:
+        break;
+    case REPLICATION_MODE_SECONDARY:
+        s->active_disk = bs->file->bs;
+        if (!bs->file->bs->backing) {
+            error_setg(errp, "Active disk doesn't have backing file");
+            return;
+        }
+
+        s->hidden_disk = s->active_disk->backing->bs;
+        if (!s->hidden_disk->backing) {
+            error_setg(errp, "Hidden disk doesn't have backing file");
+            return;
+        }
+
+        s->secondary_disk = s->hidden_disk->backing->bs;
+        if (!s->secondary_disk->blk) {
+            error_setg(errp, "The secondary disk doesn't have block backend");
+            return;
+        }
+
+        s->top_bs = get_top_bs(bs);
+        if (!s->top_bs) {
+            error_setg(errp, "Cannot get the top block driver state to do"
+                       " internal backup");
+            return;
+        }
+
+        /* verify the length */
+        active_length = bdrv_getlength(s->active_disk);
+        hidden_length = bdrv_getlength(s->hidden_disk);
+        disk_length = bdrv_getlength(s->secondary_disk);
+        if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
+            active_length != hidden_length || hidden_length != disk_length) {
+            error_setg(errp, "active disk, hidden disk, secondary disk's length"
+                       " are not the same");
+            return;
+        }
+
+        if (!s->active_disk->drv->bdrv_make_empty ||
+            !s->hidden_disk->drv->bdrv_make_empty) {
+            error_setg(errp,
+                       "active disk or hidden disk doesn't support make_empty");
+            return;
+        }
+
+        /* reopen the backing file in r/w mode */
+        reopen_backing_file(s, true, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+
+        /* start backup job now */
+        error_setg(&s->blocker,
+                   "block device is in use by internal backup job");
+        bdrv_op_block_all(s->top_bs, s->blocker);
+        bdrv_op_unblock(s->top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
+        bdrv_ref(s->hidden_disk);
+
+        aio_context = bdrv_get_aio_context(bs);
+        aio_context_acquire(aio_context);
+        backup_start(s->secondary_disk, s->hidden_disk, 0,
+                     MIRROR_SYNC_MODE_NONE, NULL, BLOCKDEV_ON_ERROR_REPORT,
+                     BLOCKDEV_ON_ERROR_REPORT, backup_job_completed,
+                     s, NULL, &local_err);
+        aio_context_release(aio_context);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            backup_job_cleanup(s);
+            bdrv_unref(s->hidden_disk);
+            return;
+        }
+        break;
+    default:
+        abort();
+    }
+
+    s->replication_state = BLOCK_REPLICATION_RUNNING;
+
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        secondary_do_checkpoint(s, errp);
+    }
+
+    s->error = 0;
+}
+
+static void replication_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BDRVReplicationState *s = bs->opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    if (s->error) {
+        error_setg(errp, "I/O error occurs");
+        return;
+    }
+
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        secondary_do_checkpoint(s, errp);
+    }
+}
+
+static void replication_stop(BlockDriverState *bs, bool failover, Error **errp)
+{
+    BDRVReplicationState *s = bs->opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    s->replication_state = BLOCK_REPLICATION_DONE;
+
+    switch (s->mode) {
+    case REPLICATION_MODE_PRIMARY:
+        break;
+    case REPLICATION_MODE_SECONDARY:
+        if (!failover) {
+            /*
+             * This BDS will be closed, and the job should be completed
+             * before the BDS is closed, because we will access hidden
+             * disk, secondary disk in backup_job_completed().
+             */
+            if (s->secondary_disk->job) {
+                block_job_cancel_sync(s->secondary_disk->job);
+            }
+            secondary_do_checkpoint(s, errp);
+            return;
+        }
+
+        if (s->secondary_disk->job) {
+            block_job_cancel(s->secondary_disk->job);
+        }
+        break;
+    default:
+        abort();
+    }
+}
+
+BlockDriver bdrv_replication = {
+    .format_name                = "replication",
+    .protocol_name              = "replication",
+    .instance_size              = sizeof(BDRVReplicationState),
+
+    .bdrv_open                  = replication_open,
+    .bdrv_close                 = replication_close,
+
+    .bdrv_getlength             = replication_getlength,
+    .bdrv_co_readv              = replication_co_readv,
+    .bdrv_co_writev             = replication_co_writev,
+    .bdrv_co_discard            = replication_co_discard,
+
+    .is_filter                  = true,
+    .bdrv_recurse_is_first_non_filter = replication_recurse_is_first_non_filter,
+
+    .bdrv_start_replication     = replication_start,
+    .bdrv_do_checkpoint         = replication_do_checkpoint,
+    .bdrv_stop_replication      = replication_stop,
+
+    .has_variable_length        = true,
+};
+
+static void bdrv_replication_init(void)
+{
+    bdrv_register(&bdrv_replication);
+}
+
+block_init(bdrv_replication_init);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [Patch v12 resend 09/10] support replication driver in blockdev-add
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
                   ` (7 preceding siblings ...)
  2015-12-02  5:37 ` [Qemu-devel] [Patch v12 resend 08/10] Implement new driver " Wen Congyang
@ 2015-12-02  5:37 ` Wen Congyang
  2015-12-02  5:38 ` [Qemu-devel] [Patch v12 resend 10/10] Add a new API to start/stop replication, do checkpoint to all BDSes Wen Congyang
  2015-12-17  6:22 ` [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
  10 siblings, 0 replies; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:37 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 qapi/block-core.json | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2c6bd3f..acc9f8d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -219,7 +219,7 @@
 #       'qcow2', 'raw', 'tftp', 'vdi', 'vmdk', 'vpc', 'vvfat'
 #       2.2: 'archipelago' added, 'cow' dropped
 #       2.3: 'host_floppy' deprecated
-#       2.5: 'host_floppy' dropped
+#       2.5: 'host_floppy' dropped, 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1492,6 +1492,7 @@
 # Drivers that are supported in block device operations.
 #
 # @host_device, @host_cdrom: Since 2.1
+# @replication: Since 2.5
 #
 # Since: 2.0
 ##
@@ -1499,8 +1500,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
             'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
             'http', 'https', 'null-aio', 'null-co', 'parallels',
-            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-            'vmdk', 'vpc', 'vvfat' ] }
+            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication',
+            'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsBase
@@ -1938,6 +1939,19 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# Since: 2.5
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
@@ -1974,6 +1988,7 @@
       'quorum':     'BlockdevOptionsQuorum',
       'raw':        'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+      'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
       'tftp':       'BlockdevOptionsFile',
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [Patch v12 resend 10/10] Add a new API to start/stop replication, do checkpoint to all BDSes
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
                   ` (8 preceding siblings ...)
  2015-12-02  5:37 ` [Qemu-devel] [Patch v12 resend 09/10] support replication driver in blockdev-add Wen Congyang
@ 2015-12-02  5:38 ` Wen Congyang
  2015-12-17  6:22 ` [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
  10 siblings, 0 replies; 22+ messages in thread
From: Wen Congyang @ 2015-12-02  5:38 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c               | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/block/block.h |  4 +++
 2 files changed, 87 insertions(+)

diff --git a/block.c b/block.c
index 213bee8..09ee7f1 100644
--- a/block.c
+++ b/block.c
@@ -4433,3 +4433,86 @@ void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp)
                    " replication", bs->filename);
     }
 }
+
+void bdrv_start_replication_all(ReplicationMode mode, Error **errp)
+{
+    BlockDriverState *bs = NULL, *temp = NULL;
+    Error *local_err = NULL;
+
+    while ((bs = bdrv_next(bs))) {
+        if (!QLIST_EMPTY(&bs->parents)) {
+            /* It is not top BDS */
+            continue;
+        }
+
+        if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+            continue;
+        }
+
+        bdrv_start_replication(bs, mode, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            goto fail;
+        }
+    }
+
+    return;
+
+fail:
+    while ((temp = bdrv_next(temp)) && bs != temp) {
+        bdrv_stop_replication(temp, false, NULL);
+    }
+}
+
+void bdrv_do_checkpoint_all(Error **errp)
+{
+    BlockDriverState *bs = NULL;
+    Error *local_err = NULL;
+
+    while ((bs = bdrv_next(bs))) {
+        if (!QLIST_EMPTY(&bs->parents)) {
+            /* It is not top BDS */
+            continue;
+        }
+
+        if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+            continue;
+        }
+
+        bdrv_do_checkpoint(bs, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+}
+
+void bdrv_stop_replication_all(bool failover, Error **errp)
+{
+    BlockDriverState *bs = NULL;
+    Error *local_err = NULL;
+
+    while ((bs = bdrv_next(bs))) {
+        if (!QLIST_EMPTY(&bs->parents)) {
+            /* It is not top BDS */
+            continue;
+        }
+
+        if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+            continue;
+        }
+
+        bdrv_stop_replication(bs, failover, &local_err);
+        if (!errp) {
+            /*
+             * The caller doesn't care the result, they just
+             * want to stop all block's replication.
+             */
+            continue;
+        }
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+}
diff --git a/include/block/block.h b/include/block/block.h
index cd39d50..39d246c 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -653,4 +653,8 @@ void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
 void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp);
 void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp);
 
+void bdrv_start_replication_all(ReplicationMode mode, Error **errp);
+void bdrv_do_checkpoint_all(Error **errp);
+void bdrv_stop_replication_all(bool failover, Error **errp);
+
 #endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints
  2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
                   ` (9 preceding siblings ...)
  2015-12-02  5:38 ` [Qemu-devel] [Patch v12 resend 10/10] Add a new API to start/stop replication, do checkpoint to all BDSes Wen Congyang
@ 2015-12-17  6:22 ` Wen Congyang
  2015-12-23 10:04   ` Stefan Hajnoczi
  10 siblings, 1 reply; 22+ messages in thread
From: Wen Congyang @ 2015-12-17  6:22 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Kevin Wolf,
	Stefan Hajnoczi
  Cc: zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei

Stefan:Ping...

What about this feature? I have worked for it about 1 year, but it is still in the
way...

On 12/02/2015 01:31 PM, Wen Congyang wrote:
> Block replication is a very important feature which is used for
> continuous checkpoints(for example: COLO).
> 
> You can get the detailed information about block replication from here:
> http://wiki.qemu.org/Features/BlockReplication
> 
> Usage:
> Please refer to docs/block-replication.txt
> 
> This patch series is based on the following patch series:
> 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04949.html
> 2. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg06043.html
> 
> You can get the patch here:
> https://github.com/coloft/qemu/tree/wency/block-replication-v12
> 
> You can get the patch with framework here:
> https://github.com/coloft/qemu/tree/wency/colo_framework_v11.2
> 
> TODO:
> 1. Continuous block replication. It will be started after basic functions
>    are accepted.
> 
> Changs Log:
> V12:
> 1. Rebase to the newest codes
> 2. Use backing reference to replcace 'allow-write-backing-file'
> V11:
> 1. Reopen the backing file when starting blcok replication if it is not
>    opened in R/W mode
> 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
>    when opening backing file
> 3. Block the top BDS so there is only one block job for the top BDS and
>    its backing chain.
> V10:
> 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
>    reference.
> 2. Address the comments from Eric Blake
> V9:
> 1. Update the error messages
> 2. Rebase to the newest qemu
> 3. Split child add/delete support. These patches are sent in another patchset.
> V8:
> 1. Address Alberto Garcia's comments
> V7:
> 1. Implement adding/removing quorum child. Remove the option non-connect.
> 2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion
> V6:
> 1. Rebase to the newest qemu.
> V5:
> 1. Address the comments from Gong Lei
> 2. Speed the failover up. The secondary vm can take over very quickly even
>    if there are too many I/O requests.
> V4:
> 1. Introduce a new driver replication to avoid touch nbd and qcow2.
> V3:
> 1: use error_setg() instead of error_set()
> 2. Add a new block job API
> 3. Active disk, hidden disk and nbd target uses the same AioContext
> 4. Add a testcase to test new hbitmap API
> V2:
> 1. Redesign the secondary qemu(use image-fleecing)
> 2. Use Error objects to return error message
> 3. Address the comments from Max Reitz and Eric Blake
> 
> Wen Congyang (10):
>   unblock backup operations in backing file
>   Store parent BDS in BdrvChild
>   Backup: clear all bitmap when doing block checkpoint
>   Allow creating backup jobs when opening BDS
>   docs: block replication's description
>   Add new block driver interfaces to control block replication
>   quorum: implement block driver interfaces for block replication
>   Implement new driver for block replication
>   support replication driver in blockdev-add
>   Add a new API to start/stop replication, do checkpoint to all BDSes
> 
>  block.c                    | 145 ++++++++++++
>  block/Makefile.objs        |   3 +-
>  block/backup.c             |  14 ++
>  block/quorum.c             |  78 +++++++
>  block/replication.c        | 549 +++++++++++++++++++++++++++++++++++++++++++++
>  blockjob.c                 |  11 +
>  docs/block-replication.txt | 227 +++++++++++++++++++
>  include/block/block.h      |   9 +
>  include/block/block_int.h  |  15 ++
>  include/block/blockjob.h   |  12 +
>  qapi/block-core.json       |  34 ++-
>  11 files changed, 1093 insertions(+), 4 deletions(-)
>  create mode 100644 block/replication.c
>  create mode 100644 docs/block-replication.txt
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description
  2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description Wen Congyang
@ 2015-12-23  9:26   ` Stefan Hajnoczi
  2016-01-04  6:03     ` Wen Congyang
  2016-01-04 15:51     ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2015-12-23  9:26 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Dr. David Alan Gilbert,
	Gonglei, Paolo Bonzini, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 1373 bytes --]

On Wed, Dec 02, 2015 at 01:31:46PM +0800, Wen Congyang wrote:
> +== Failure Handling ==
> +There are 6 internal errors when block replication is running:
> +1. I/O error on primary disk
> +2. Forwarding primary write requests failed
> +3. Backup failed
> +4. I/O error on secondary disk
> +5. I/O error on active disk
> +6. Making active disk or hidden disk empty failed
> +In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
> +4 and 6, we just report block replication's error to FT/HA manager (which
> +decides when to do a new checkpoint, when to do failover).
> +There is no internal error when doing failover.

Not sure this is true.

Below it says the following for failover: "We will flush the Disk buffer
into Secondary Disk and stop block replication".  Flushing the disk
buffer can result in I/O errors.  This means that failover operations
are not guaranteed to succeed.

In practice I think this is similar to a successful failover followed by
immediately getting I/O errors on the new Primary Disk.  It means that
right after failover there is another failure and the system may not be
able to continue.

So this really only matters in the case where there is a new Secondary
ready after failover.  In that case the user might expect failover to
continue to the new Secondary (Host 3):

   [X]        [X]
  Host 1 <-> Host 2 <-> Host 3

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 08/10] Implement new driver for block replication
  2015-12-02  5:37 ` [Qemu-devel] [Patch v12 resend 08/10] Implement new driver " Wen Congyang
@ 2015-12-23  9:47   ` Stefan Hajnoczi
  2016-01-04  5:50     ` Wen Congyang
  0 siblings, 1 reply; 22+ messages in thread
From: Stefan Hajnoczi @ 2015-12-23  9:47 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Dr. David Alan Gilbert,
	Gonglei, Paolo Bonzini, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 2089 bytes --]

On Wed, Dec 02, 2015 at 01:37:25PM +0800, Wen Congyang wrote:
> +    /*
> +     * Only write to active disk if the sectors have
> +     * already been allocated in active disk/hidden disk.
> +     */
> +    qemu_iovec_init(&hd_qiov, qiov->niov);
> +    while (remaining_sectors > 0) {
> +        ret = bdrv_is_allocated_above(top, base, sector_num,
> +                                      remaining_sectors, &n);

There is a race condition here since multiple I/O requests can be in
flight at the same time.   If two requests touch the same cluster
between top->base then the result of these checks could be unreliable.

The simple but slow solution is to use a CoMutex to serialize requests.

> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        qemu_iovec_reset(&hd_qiov);
> +        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * 512);
> +
> +        target = ret ? top : base;
> +        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        remaining_sectors -= n;
> +        sector_num += n;
> +        bytes_done += n * BDRV_SECTOR_SIZE;
> +    }

I think this can be replaced with an active commit block job that copies
data down from the hidden/active disk to the secondary disk.  It is okay
to keep writing to the secondary disk while the block job is running and
then switch over to the secondary disk once it completes.

> +
> +    return 0;
> +}
> +
> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
> +                                               int64_t sector_num,
> +                                               int nb_sectors)
> +{
> +    BDRVReplicationState *s = bs->opaque;
> +    int ret;
> +
> +    ret = replication_get_io_status(s);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (ret == 1) {
> +        /* It is secondary qemu and we are after failover */
> +        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);

What if the clusters are still allocated in the hidden/active disk?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints
  2015-12-17  6:22 ` [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
@ 2015-12-23 10:04   ` Stefan Hajnoczi
  2016-01-04  5:27     ` Wen Congyang
  0 siblings, 1 reply; 22+ messages in thread
From: Stefan Hajnoczi @ 2015-12-23 10:04 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Dr. David Alan Gilbert,
	Gonglei, Paolo Bonzini, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 1777 bytes --]

On Thu, Dec 17, 2015 at 02:22:14PM +0800, Wen Congyang wrote:
> Stefan:Ping...
> 
> What about this feature? I have worked for it about 1 year, but it is still in the
> way...

The code still has TODOs.  What is the plan for supporting replication
after failover?  This feature seems critical because anyone who wants FT
won't be able to use this code unless it supports FT after the first
failure.

---

Adding new block layer APIs that are replication-specific is not clean.
Only the replication block driver cares about the start/stop/checkpoint
interface.

It is cleaner to have a separate API and data structure for block
replication.

The replication code should define its own BlockReplicationOps struct
and allow objects to register themselves.  Then it's no longer necessary
to modify the core block layer to forward start/stop/checkpoint calls.

Something like:

typedef struct BlockReplicationOps BlockReplicationOps;
typedef struct BlockReplicationState {
    const BlockReplicationOps *ops;
    QLIST_ENTRY(BlockReplicationState) list;
} BlockReplicationState;

typedef struct {
    void start(BlockReplicationState *brs, Error **errp);
    void stop(BlockReplicationState *brs, Error **errp);
    void checkpoint(BlockReplicationState *brs, Error **errp);
} BlockReplicationOps;

static QLIST_HEAD(BlockReplicationState) block_replication_states;

void block_replication_add(BlockReplicationState *brs);
void block_replication_remove(BlockReplicationState *brs);

The replication block driver would add/remove itself.  The quorum block
driver probably doesn't need to be modified (I think in your current
patches you modify it just to forward the start/stop/checkpoint calls to
a particular quorum child).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints
  2015-12-23 10:04   ` Stefan Hajnoczi
@ 2016-01-04  5:27     ` Wen Congyang
  2016-01-04 16:03       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 22+ messages in thread
From: Wen Congyang @ 2016-01-04  5:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Dr. David Alan Gilbert,
	Gonglei, Paolo Bonzini, Max Reitz

On 12/23/2015 06:04 PM, Stefan Hajnoczi wrote:
> On Thu, Dec 17, 2015 at 02:22:14PM +0800, Wen Congyang wrote:
>> Stefan:Ping...
>>
>> What about this feature? I have worked for it about 1 year, but it is still in the
>> way...
> 
> The code still has TODOs.  What is the plan for supporting replication
> after failover?  This feature seems critical because anyone who wants FT
> won't be able to use this code unless it supports FT after the first
> failure.

We have implemented it based on an old version qemu. To keep the logical
simple, we don't post them now. We will post them after this feature is merged
into qemu.

> 
> ---
> 
> Adding new block layer APIs that are replication-specific is not clean.
> Only the replication block driver cares about the start/stop/checkpoint
> interface.
> 
> It is cleaner to have a separate API and data structure for block
> replication.
> 
> The replication code should define its own BlockReplicationOps struct
> and allow objects to register themselves.  Then it's no longer necessary
> to modify the core block layer to forward start/stop/checkpoint calls.
> 
> Something like:
> 
> typedef struct BlockReplicationOps BlockReplicationOps;
> typedef struct BlockReplicationState {
>     const BlockReplicationOps *ops;
>     QLIST_ENTRY(BlockReplicationState) list;
> } BlockReplicationState;
> 
> typedef struct {
>     void start(BlockReplicationState *brs, Error **errp);
>     void stop(BlockReplicationState *brs, Error **errp);
>     void checkpoint(BlockReplicationState *brs, Error **errp);
> } BlockReplicationOps;
> 
> static QLIST_HEAD(BlockReplicationState) block_replication_states;
> 
> void block_replication_add(BlockReplicationState *brs);
> void block_replication_remove(BlockReplicationState *brs);
> 
> The replication block driver would add/remove itself.  The quorum block
> driver probably doesn't need to be modified (I think in your current
> patches you modify it just to forward the start/stop/checkpoint calls to
> a particular quorum child).

Yes, it is the major purpose. We also do some check in the quorum driver: 
we don't allow more than one child support block replication.

Thanks
Wen Congyang

> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 08/10] Implement new driver for block replication
  2015-12-23  9:47   ` Stefan Hajnoczi
@ 2016-01-04  5:50     ` Wen Congyang
  2016-01-26 14:27       ` Stefan Hajnoczi
  0 siblings, 1 reply; 22+ messages in thread
From: Wen Congyang @ 2016-01-04  5:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Dr. David Alan Gilbert,
	Gonglei, Paolo Bonzini, Max Reitz

On 12/23/2015 05:47 PM, Stefan Hajnoczi wrote:
> On Wed, Dec 02, 2015 at 01:37:25PM +0800, Wen Congyang wrote:
>> +    /*
>> +     * Only write to active disk if the sectors have
>> +     * already been allocated in active disk/hidden disk.
>> +     */
>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
>> +    while (remaining_sectors > 0) {
>> +        ret = bdrv_is_allocated_above(top, base, sector_num,
>> +                                      remaining_sectors, &n);
> 
> There is a race condition here since multiple I/O requests can be in
> flight at the same time.   If two requests touch the same cluster
> between top->base then the result of these checks could be unreliable.

I don't think so. When we come here, primary qemu is gone, and failover is
done. We only write to active disk if the sectors have already been allocated
in active disk/hidden disk before failover. So it two requests touch the same
cluster, it is OK, because the function bdrv_is_allocated_above()'s return
value is not changed.

> 
> The simple but slow solution is to use a CoMutex to serialize requests.
> 
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        qemu_iovec_reset(&hd_qiov);
>> +        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * 512);
>> +
>> +        target = ret ? top : base;
>> +        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        remaining_sectors -= n;
>> +        sector_num += n;
>> +        bytes_done += n * BDRV_SECTOR_SIZE;
>> +    }
> 
> I think this can be replaced with an active commit block job that copies
> data down from the hidden/active disk to the secondary disk.  It is okay
> to keep writing to the secondary disk while the block job is running and
> then switch over to the secondary disk once it completes.

Yes, active commit is another choice. IIRC, I don't use it because mirror job
has some problem. It is fixed now(see bdrv_drained_begin()/bdrv_drained_end()
in the mirror job).
We will use mirror job in the next version.

> 
>> +
>> +    return 0;
>> +}
>> +
>> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
>> +                                               int64_t sector_num,
>> +                                               int nb_sectors)
>> +{
>> +    BDRVReplicationState *s = bs->opaque;
>> +    int ret;
>> +
>> +    ret = replication_get_io_status(s);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    if (ret == 1) {
>> +        /* It is secondary qemu and we are after failover */
>> +        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
> 
> What if the clusters are still allocated in the hidden/active disk?
> 

What does discard do? Drop the data that allocated in the disk?
If so, I think I make a misunderstand. I will fix it in the next version.

Thanks
Wen Congyang

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description
  2015-12-23  9:26   ` Stefan Hajnoczi
@ 2016-01-04  6:03     ` Wen Congyang
  2016-01-26 13:57       ` Stefan Hajnoczi
  2016-01-04 15:51     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 22+ messages in thread
From: Wen Congyang @ 2016-01-04  6:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Dr. David Alan Gilbert,
	Gonglei, Paolo Bonzini, Max Reitz

On 12/23/2015 05:26 PM, Stefan Hajnoczi wrote:
> On Wed, Dec 02, 2015 at 01:31:46PM +0800, Wen Congyang wrote:
>> +== Failure Handling ==
>> +There are 6 internal errors when block replication is running:
>> +1. I/O error on primary disk
>> +2. Forwarding primary write requests failed
>> +3. Backup failed
>> +4. I/O error on secondary disk
>> +5. I/O error on active disk
>> +6. Making active disk or hidden disk empty failed
>> +In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
>> +4 and 6, we just report block replication's error to FT/HA manager (which
>> +decides when to do a new checkpoint, when to do failover).
>> +There is no internal error when doing failover.
> 
> Not sure this is true.
> 
> Below it says the following for failover: "We will flush the Disk buffer
> into Secondary Disk and stop block replication".  Flushing the disk
> buffer can result in I/O errors.  This means that failover operations
> are not guaranteed to succeed.

We don't use mirror job now. We may use it in the next version.
Is there any way to know the I/O error when the mirror job is running?
Get the job's status?

> 
> In practice I think this is similar to a successful failover followed by
> immediately getting I/O errors on the new Primary Disk.  It means that
> right after failover there is another failure and the system may not be
> able to continue.

Block replication is not designed for such case. For example, we don't do
failover on primary disk's failure. In such case, we just report the error
to the disk layer(It is the case 1 in the above Failure Handling).

Sorry for the late reply. Your mail is sent at 2015-12-23, but I receive
it at 2016-01-04....

> 
> So this really only matters in the case where there is a new Secondary
> ready after failover.  In that case the user might expect failover to
> continue to the new Secondary (Host 3):
> 
>    [X]        [X]
>   Host 1 <-> Host 2 <-> Host 3
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description
  2015-12-23  9:26   ` Stefan Hajnoczi
  2016-01-04  6:03     ` Wen Congyang
@ 2016-01-04 15:51     ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2016-01-04 15:51 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Michael R. Hines, Max Reitz, Gonglei, Paolo Bonzini,
	zhanghailiang

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Dec 02, 2015 at 01:31:46PM +0800, Wen Congyang wrote:
> > +== Failure Handling ==
> > +There are 6 internal errors when block replication is running:
> > +1. I/O error on primary disk
> > +2. Forwarding primary write requests failed
> > +3. Backup failed
> > +4. I/O error on secondary disk
> > +5. I/O error on active disk
> > +6. Making active disk or hidden disk empty failed
> > +In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
> > +4 and 6, we just report block replication's error to FT/HA manager (which
> > +decides when to do a new checkpoint, when to do failover).
> > +There is no internal error when doing failover.
> 
> Not sure this is true.
> 
> Below it says the following for failover: "We will flush the Disk buffer
> into Secondary Disk and stop block replication".  Flushing the disk
> buffer can result in I/O errors.  This means that failover operations
> are not guaranteed to succeed.
> 
> In practice I think this is similar to a successful failover followed by
> immediately getting I/O errors on the new Primary Disk.  It means that
> right after failover there is another failure and the system may not be
> able to continue.

Yes, I think that's true.

> So this really only matters in the case where there is a new Secondary
> ready after failover.  In that case the user might expect failover to
> continue to the new Secondary (Host 3):
> 
>    [X]        [X]
>   Host 1 <-> Host 2 <-> Host 3

Since COLO is just doing a 1+1 redundency, I think it's not expecting to
cope with a double host failure; it's going to take some time (seconds?) to
sync Host 3 back in when you add it after a failover and the aim would
be not to have distrubed the application for that long, so it should
already be running on Host 2 during that resync.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints
  2016-01-04  5:27     ` Wen Congyang
@ 2016-01-04 16:03       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2016-01-04 16:03 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Max Reitz, Gonglei,
	Stefan Hajnoczi, Paolo Bonzini

* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 12/23/2015 06:04 PM, Stefan Hajnoczi wrote:
> > On Thu, Dec 17, 2015 at 02:22:14PM +0800, Wen Congyang wrote:
> >> Stefan:Ping...
> >>
> >> What about this feature? I have worked for it about 1 year, but it is still in the
> >> way...
> > 
> > The code still has TODOs.  What is the plan for supporting replication
> > after failover?  This feature seems critical because anyone who wants FT
> > won't be able to use this code unless it supports FT after the first
> > failure.
> 
> We have implemented it based on an old version qemu. To keep the logical
> simple, we don't post them now. We will post them after this feature is merged
> into qemu.

It's probably best to post them, or at least say how you intend to do it,
so people can get an understanding of which way you're going.

However, why is anything new needed to continue replication after failover?
Shouldn't it be possible to build the secondary's disk structure in a way
that it can (by device/disk add/remove) into a structure that looks the same
as a primary, and then we just start a new migration to the new secondary?

(There's a separate problem of getting that to work with the rest of COLO as
well)

Dave

> 
> > 
> > ---
> > 
> > Adding new block layer APIs that are replication-specific is not clean.
> > Only the replication block driver cares about the start/stop/checkpoint
> > interface.
> > 
> > It is cleaner to have a separate API and data structure for block
> > replication.
> > 
> > The replication code should define its own BlockReplicationOps struct
> > and allow objects to register themselves.  Then it's no longer necessary
> > to modify the core block layer to forward start/stop/checkpoint calls.
> > 
> > Something like:
> > 
> > typedef struct BlockReplicationOps BlockReplicationOps;
> > typedef struct BlockReplicationState {
> >     const BlockReplicationOps *ops;
> >     QLIST_ENTRY(BlockReplicationState) list;
> > } BlockReplicationState;
> > 
> > typedef struct {
> >     void start(BlockReplicationState *brs, Error **errp);
> >     void stop(BlockReplicationState *brs, Error **errp);
> >     void checkpoint(BlockReplicationState *brs, Error **errp);
> > } BlockReplicationOps;
> > 
> > static QLIST_HEAD(BlockReplicationState) block_replication_states;
> > 
> > void block_replication_add(BlockReplicationState *brs);
> > void block_replication_remove(BlockReplicationState *brs);
> > 
> > The replication block driver would add/remove itself.  The quorum block
> > driver probably doesn't need to be modified (I think in your current
> > patches you modify it just to forward the start/stop/checkpoint calls to
> > a particular quorum child).
> 
> Yes, it is the major purpose. We also do some check in the quorum driver: 
> we don't allow more than one child support block replication.
> 
> Thanks
> Wen Congyang
> 
> > 
> > Stefan
> > 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description
  2016-01-04  6:03     ` Wen Congyang
@ 2016-01-26 13:57       ` Stefan Hajnoczi
  0 siblings, 0 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2016-01-26 13:57 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Dr. David Alan Gilbert,
	Gonglei, Paolo Bonzini, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 2301 bytes --]

On Mon, Jan 04, 2016 at 02:03:16PM +0800, Wen Congyang wrote:
> On 12/23/2015 05:26 PM, Stefan Hajnoczi wrote:
> > On Wed, Dec 02, 2015 at 01:31:46PM +0800, Wen Congyang wrote:
> >> +== Failure Handling ==
> >> +There are 6 internal errors when block replication is running:
> >> +1. I/O error on primary disk
> >> +2. Forwarding primary write requests failed
> >> +3. Backup failed
> >> +4. I/O error on secondary disk
> >> +5. I/O error on active disk
> >> +6. Making active disk or hidden disk empty failed
> >> +In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
> >> +4 and 6, we just report block replication's error to FT/HA manager (which
> >> +decides when to do a new checkpoint, when to do failover).
> >> +There is no internal error when doing failover.
> > 
> > Not sure this is true.
> > 
> > Below it says the following for failover: "We will flush the Disk buffer
> > into Secondary Disk and stop block replication".  Flushing the disk
> > buffer can result in I/O errors.  This means that failover operations
> > are not guaranteed to succeed.
> 
> We don't use mirror job now. We may use it in the next version.
> Is there any way to know the I/O error when the mirror job is running?
> Get the job's status?

Block jobs have an error status which is exposed via QMP.  The block job
emits a QMP event notifying the client.  If the client issues
query-block-jobs it will also see the iostatus field.

I'm not aware of an internal API to monitor QMP events.  It would be
possible to add it but first I wonder why you want to use mirror?

> > In practice I think this is similar to a successful failover followed by
> > immediately getting I/O errors on the new Primary Disk.  It means that
> > right after failover there is another failure and the system may not be
> > able to continue.
> 
> Block replication is not designed for such case. For example, we don't do
> failover on primary disk's failure. In such case, we just report the error
> to the disk layer(It is the case 1 in the above Failure Handling).
> 
> Sorry for the late reply. Your mail is sent at 2015-12-23, but I receive
> it at 2016-01-04....

What is supposed to happen when flushing the Disk Buffer into the
Secondary Disk fails?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Patch v12 resend 08/10] Implement new driver for block replication
  2016-01-04  5:50     ` Wen Congyang
@ 2016-01-26 14:27       ` Stefan Hajnoczi
  0 siblings, 0 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2016-01-26 14:27 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Dr. David Alan Gilbert,
	Gonglei, Paolo Bonzini, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 3619 bytes --]

On Mon, Jan 04, 2016 at 01:50:40PM +0800, Wen Congyang wrote:
> On 12/23/2015 05:47 PM, Stefan Hajnoczi wrote:
> > On Wed, Dec 02, 2015 at 01:37:25PM +0800, Wen Congyang wrote:
> >> +    /*
> >> +     * Only write to active disk if the sectors have
> >> +     * already been allocated in active disk/hidden disk.
> >> +     */
> >> +    qemu_iovec_init(&hd_qiov, qiov->niov);
> >> +    while (remaining_sectors > 0) {
> >> +        ret = bdrv_is_allocated_above(top, base, sector_num,
> >> +                                      remaining_sectors, &n);
> > 
> > There is a race condition here since multiple I/O requests can be in
> > flight at the same time.   If two requests touch the same cluster
> > between top->base then the result of these checks could be unreliable.
> 
> I don't think so. When we come here, primary qemu is gone, and failover is
> done. We only write to active disk if the sectors have already been allocated
> in active disk/hidden disk before failover. So it two requests touch the same
> cluster, it is OK, because the function bdrv_is_allocated_above()'s return
> value is not changed.

You are right.  I didn't realize that there will be no allocating writes
to the active disk.  There is no race condition.

> >> +        if (ret < 0) {
> >> +            return ret;
> >> +        }
> >> +
> >> +        qemu_iovec_reset(&hd_qiov);
> >> +        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * 512);
> >> +
> >> +        target = ret ? top : base;
> >> +        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
> >> +        if (ret < 0) {
> >> +            return ret;
> >> +        }
> >> +
> >> +        remaining_sectors -= n;
> >> +        sector_num += n;
> >> +        bytes_done += n * BDRV_SECTOR_SIZE;
> >> +    }
> > 
> > I think this can be replaced with an active commit block job that copies
> > data down from the hidden/active disk to the secondary disk.  It is okay
> > to keep writing to the secondary disk while the block job is running and
> > then switch over to the secondary disk once it completes.
> 
> Yes, active commit is another choice. IIRC, I don't use it because mirror job
> has some problem. It is fixed now(see bdrv_drained_begin()/bdrv_drained_end()
> in the mirror job).
> We will use mirror job in the next version.

I see, thanks.

> > 
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
> >> +                                               int64_t sector_num,
> >> +                                               int nb_sectors)
> >> +{
> >> +    BDRVReplicationState *s = bs->opaque;
> >> +    int ret;
> >> +
> >> +    ret = replication_get_io_status(s);
> >> +    if (ret < 0) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    if (ret == 1) {
> >> +        /* It is secondary qemu and we are after failover */
> >> +        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
> > 
> > What if the clusters are still allocated in the hidden/active disk?
> > 
> 
> What does discard do? Drop the data that allocated in the disk?
> If so, I think I make a misunderstand. I will fix it in the next version.

If I understand correctly the chain is:

 (base) secondary_disk <- hidden disk <- active disk (top)

Clusters in hidden disk or active disk will remain if you discard in
secondary_disk.

I think you need to consider how discard works with a backing chain (I
have forgotten the details but you can check block.c and block/qcow2.c
to find out).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-01-26 14:28 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-02  5:31 [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 01/10] unblock backup operations in backing file Wen Congyang
2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 02/10] Store parent BDS in BdrvChild Wen Congyang
2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 03/10] Backup: clear all bitmap when doing block checkpoint Wen Congyang
2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 04/10] Allow creating backup jobs when opening BDS Wen Congyang
2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description Wen Congyang
2015-12-23  9:26   ` Stefan Hajnoczi
2016-01-04  6:03     ` Wen Congyang
2016-01-26 13:57       ` Stefan Hajnoczi
2016-01-04 15:51     ` Dr. David Alan Gilbert
2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 06/10] Add new block driver interfaces to control block replication Wen Congyang
2015-12-02  5:31 ` [Qemu-devel] [Patch v12 resend 07/10] quorum: implement block driver interfaces for " Wen Congyang
2015-12-02  5:37 ` [Qemu-devel] [Patch v12 resend 08/10] Implement new driver " Wen Congyang
2015-12-23  9:47   ` Stefan Hajnoczi
2016-01-04  5:50     ` Wen Congyang
2016-01-26 14:27       ` Stefan Hajnoczi
2015-12-02  5:37 ` [Qemu-devel] [Patch v12 resend 09/10] support replication driver in blockdev-add Wen Congyang
2015-12-02  5:38 ` [Qemu-devel] [Patch v12 resend 10/10] Add a new API to start/stop replication, do checkpoint to all BDSes Wen Congyang
2015-12-17  6:22 ` [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints Wen Congyang
2015-12-23 10:04   ` Stefan Hajnoczi
2016-01-04  5:27     ` Wen Congyang
2016-01-04 16:03       ` Dr. David Alan Gilbert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.