[Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints
@ 2015-02-12  3:07 Wen Congyang
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description Wen Congyang
                   ` (14 more replies)
  0 siblings, 15 replies; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Yang Hongyang

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

Usage:
Primary:
  -drive if=xxx,driver=quorum,read-pattern=first,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw,\
         children.1.file.driver=nbd+colo,\
         children.1.file.host=xxx,\
         children.1.file.port=xxx,\
         children.1.file.export=xxx,\
         children.1.driver=raw
  Note:
  1. NBD Client should not be the first child of quorum.
  2. There should be only one NBD Client.
  3. host is the secondary physical machine's hostname or IP
  4. Each disk must have its own export name.

Secondary:
  -drive if=xxx,driver=blkcolo,export=xxx,\
         backing.file.filename=1.raw,\
         backing.driver=raw
  Then run qmp command:
    nbd_server_start host:port
  Note:
  1. The export name for the same disk must be the same in primary
     and secondary QEMU command line
  2. The qmp command nbd_server_start must be run before running the
     qmp command migrate on primary QEMU
  3. Don't use nbd_server_start's other options

You can get the patch here:
https://github.com/wencongyang/qemu-colo/commits/block-replication-v1

Wen Congyang (14):
  docs: block replication's description
  quorom: add a new read pattern
  quorum: ignore 0-length child
  Add new block driver interfaces to control disk replication
  quorom: implement block driver interfaces for block replication
  NBD client: connect to nbd server later
  NBD client: implement block driver interfaces for block replication
  block: add a new API to create a hidden BlockBackend
  block: give backing image its own BlockBackend
  allow the backing image access the origin BlockDriverState
  allow writing to the backing file
  Add disk buffer for block replication
  COW: move cow interfaces to a seperate file
  COLO: implement a new block driver

 Makefile.objs                  |   2 +-
 block.c                        |  53 +++++-
 block/Makefile.objs            |   1 +
 block/backup.c                 |  52 +-----
 block/blkcolo-buffer.c         | 324 ++++++++++++++++++++++++++++++++
 block/blkcolo.c                | 409 +++++++++++++++++++++++++++++++++++++++++
 block/blkcolo.h                |  35 ++++
 block/block-backend.c          |  29 ++-
 block/nbd.c                    | 155 ++++++++++++++--
 block/quorum.c                 |  79 +++++++-
 blockcow.c                     |  52 ++++++
 docs/block-replication.txt     | 129 +++++++++++++
 include/block/block.h          |  27 +++
 include/block/block_int.h      |  14 ++
 include/sysemu/block-backend.h |   2 +
 qapi/block-core.json           |   4 +-
 16 files changed, 1295 insertions(+), 72 deletions(-)
 create mode 100644 block/blkcolo-buffer.c
 create mode 100644 block/blkcolo.c
 create mode 100644 block/blkcolo.h
 create mode 100644 blockcow.c
 create mode 100644 docs/block-replication.txt

-- 
2.1.0

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-12  7:21   ` Fam Zheng
  2015-03-04 16:35   ` Dr. David Alan Gilbert
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 02/14] quorom: add a new read pattern Wen Congyang
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 docs/block-replication.txt | 129 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 0000000..59150b8
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,129 @@
+Block replication
+----------------------------------------
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+The block replication is used for continuous checkpoints. It is designed
+for COLO that Secondary VM is running. It can also be applied for FT/HA
+scene that Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoint. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be buffered in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement COLO block replication from many basic
+blocks that are already in QEMU.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary      2 NBD  ------->  2 NBD
+     disk       client    ||     server                  virtio-blk
+                          ||        ^                         ^
+--------.                 ||        |                         |
+Primary |                 ||  Secondary disk <--------- COLO buffer 3
+--------'                 ||                   backing
+
+1) The disk on the primary is represented by a block device with two
+children, providing replication between a primary disk and the host that
+runs the secondary VM. The read pattern for quorum can be extended to
+make the primary always read from the local disk instead of going through
+NBD.
+
+2) The secondary disk receives writes from the primary VM through QEMU's
+embedded NBD server (speculative write-through).
+
+3) The disk on the secondary is represented by a custom block device
+("COLO buffer"). The disk buffer's backing image is the secondary disk,
+and the disk buffer uses bdrv_add_before_write_notifier to implement
+copy-on-write, similar to block/backup.c.
+
+== New block driver interface ==
+We add three block driver interfaces to control block replication:
+a. bdrv_start_replication()
+   Start block replication, called in migration/checkpoint thread.
+   We must call bdrv_start_replication() in secondary QEMU before
+   calling bdrv_start_replication() in primary QEMU.
+b. bdrv_do_checkpoint()
+   This interface is called after all VM state is transfered to
+   Secondary QEMU. The Disk buffer will be dropped in this interface.
+c. bdrv_stop_replication()
+   It is called when failover. We will flush the Disk buffer into
+   Secondary Disk and stop block replication.
+
+== Usage ==
+Primary:
+  -drive if=xxx,driver=quorum,read-pattern=first,\
+         children.0.file.filename=1.raw,\
+         children.0.driver=raw,\
+         children.1.file.driver=nbd+colo,\
+         children.1.file.host=xxx,\
+         children.1.file.port=xxx,\
+         children.1.file.export=xxx,\
+         children.1.driver=raw
+  Note:
+  1. NBD Client should not be the first child of quorum.
+  2. There should be only one NBD Client.
+  3. host is the secondary physical machine's hostname or IP
+  4. Each disk must have its own export name.
+
+Secondary:
+  -drive if=xxx,driver=blkcolo,export=xxx,\
+         backing.file.filename=1.raw,\
+         backing.driver=raw
+  Then run qmp command:
+    nbd_server_start host:port
+  Note:
+  1. The export name for the same disk must be the same in primary
+     and secondary QEMU command line
+  2. The qmp command nbd_server_start must be run before running the
+     qmp command migrate on primary QEMU
+  3. Don't use nbd_server_start's other options
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 02/14] quorom: add a new read pattern
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-12  6:42   ` Gonglei
                     ` (2 more replies)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 03/14] quorum: ignore 0-length child Wen Congyang
                   ` (12 subsequent siblings)
  14 siblings, 3 replies; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Luiz Capitulino, Gonglei, Yang Hongyang, Michael Roth,
	zhanghailiang

To block replication, we only need to read from the first child.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
---
 block/quorum.c       | 5 +++--
 qapi/block-core.json | 4 +++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 437b122..5ed1ff8 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -286,9 +286,10 @@ static void quorum_aio_cb(void *opaque, int ret)
     BDRVQuorumState *s = acb->common.bs->opaque;
     bool rewrite = false;
 
-    if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
+    if (acb->is_read && s->read_pattern != QUORUM_READ_PATTERN_QUORUM) {
         /* We try to read next child in FIFO order if we fail to read */
-        if (ret < 0 && ++acb->child_iter < s->num_children) {
+        if (s->read_pattern == QUORUM_READ_PATTERN_FIFO &&
+            ret < 0 && ++acb->child_iter < s->num_children) {
             read_fifo_child(acb);
             return;
         }
diff --git a/qapi/block-core.json b/qapi/block-core.json
index a3fdaf0..d6382e9 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1618,9 +1618,11 @@
 #
 # @fifo: read only from the first child that has not failed
 #
+# @first: read only from the first child
+#
 # Since: 2.2
 ##
-{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
+{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo', 'first' ] }
 
 ##
 # @BlockdevOptionsQuorum
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 03/14] quorum: ignore 0-length child
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description Wen Congyang
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 02/14] quorom: add a new read pattern Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 20:43   ` Max Reitz
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 04/14] Add new block driver interfaces to control disk replication Wen Congyang
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

We connect to NBD server when starting block replication, so
the length is 0 before starting block replication.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/quorum.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index 5ed1ff8..e6aff5f 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -734,6 +734,11 @@ static int64_t quorum_getlength(BlockDriverState *bs)
         if (value < 0) {
             return value;
         }
+
+        if (!value) {
+            continue;
+        }
+
         if (value != result) {
             return -EIO;
         }
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 04/14] Add new block driver interfaces to control disk replication
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (2 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 03/14] quorum: ignore 0-length child Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 20:57   ` Max Reitz
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 05/14] quorom: implement block driver interfaces for block replication Wen Congyang
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c                   | 36 ++++++++++++++++++++++++++++++++++++
 include/block/block.h     | 10 ++++++++++
 include/block/block_int.h | 12 ++++++++++++
 3 files changed, 58 insertions(+)

diff --git a/block.c b/block.c
index 210fd5f..2335af1 100644
--- a/block.c
+++ b/block.c
@@ -6156,3 +6156,39 @@ BlockAcctStats *bdrv_get_stats(BlockDriverState *bs)
 {
     return &bs->stats;
 }
+
+int bdrv_start_replication(BlockDriverState *bs, int mode)
+{
+    BlockDriver *drv = bs->drv;
+    if (drv && drv->bdrv_start_replication) {
+        return drv->bdrv_start_replication(bs, mode);
+    } else if (bs->file) {
+        return bdrv_start_replication(bs->file, mode);
+    }
+
+    return -1;
+}
+
+int bdrv_do_checkpoint(BlockDriverState *bs)
+{
+    BlockDriver *drv = bs->drv;
+    if (drv && drv->bdrv_do_checkpoint) {
+        return drv->bdrv_do_checkpoint(bs);
+    } else if (bs->file) {
+        return bdrv_do_checkpoint(bs->file);
+    }
+
+    return -1;
+}
+
+int bdrv_stop_replication(BlockDriverState *bs)
+{
+    BlockDriver *drv = bs->drv;
+    if (drv && drv->bdrv_stop_replication) {
+        return drv->bdrv_stop_replication(bs);
+    } else if (bs->file) {
+        return bdrv_stop_replication(bs->file);
+    }
+
+    return -1;
+}
diff --git a/include/block/block.h b/include/block/block.h
index 321295e..632b9fc 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -557,4 +557,14 @@ void bdrv_flush_io_queue(BlockDriverState *bs);
 
 BlockAcctStats *bdrv_get_stats(BlockDriverState *bs);
 
+/* Checkpoint control, called in migration/checkpoint thread */
+enum {
+    COLO_UNPROTECTED_MODE = 0,
+    COLO_PRIMARY_MODE,
+    COLO_SECONDARY_MODE,
+};
+int bdrv_start_replication(BlockDriverState *bs, int mode);
+int bdrv_do_checkpoint(BlockDriverState *bs);
+int bdrv_stop_replication(BlockDriverState *bs);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 7ad1950..603f704 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -273,6 +273,18 @@ struct BlockDriver {
     void (*bdrv_io_unplug)(BlockDriverState *bs);
     void (*bdrv_flush_io_queue)(BlockDriverState *bs);
 
+
+    /* Checkpoint control, called in migration/checkpoint thread */
+    int (*bdrv_start_replication)(BlockDriverState *bs, int mode);
+    /*
+     * Drop Disk buffer when doing checkpoint.
+     */
+    int (*bdrv_do_checkpoint)(BlockDriverState *bs);
+    /* After failover, we should flush Disk buffer into secondary disk
+     * and stop block replication.
+     */
+    int (*bdrv_stop_replication)(BlockDriverState *bs);
+
     QLIST_ENTRY(BlockDriver) list;
 };
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 05/14] quorom: implement block driver interfaces for block replication
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (3 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 04/14] Add new block driver interfaces to control disk replication Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 21:22   ` Max Reitz
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 06/14] NBD client: connect to nbd server later Wen Congyang
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/quorum.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index e6aff5f..c8479b4 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1070,6 +1070,71 @@ static void quorum_refresh_filename(BlockDriverState *bs)
     bs->full_open_options = opts;
 }
 
+static int quorum_stop_replication(BlockDriverState *bs);
+static int quorum_start_replication(BlockDriverState *bs, int mode)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int ret = -1, i;
+
+    /*
+     * TODO: support COLO_SECONDARY_MODE if we allow secondary
+     * QEMU becoming primary QEMU.
+     */
+    if (mode != COLO_PRIMARY_MODE) {
+        return -1;
+    }
+
+    if (s->read_pattern != QUORUM_READ_PATTERN_FIRST) {
+        return -1;
+    }
+
+    /* NBD client should not be the first child */
+    if (bdrv_start_replication(s->bs[0], mode) == 0) {
+        bdrv_stop_replication(s->bs[0]);
+        return -1;
+    }
+
+    for (i = 1; i < s->num_children; i++) {
+        if (bdrv_start_replication(s->bs[i], mode) == 0) {
+            ret++;
+        }
+    }
+
+    if (ret > 0) {
+        quorum_stop_replication(bs);
+    }
+
+    return ret ? -1 : 0;
+}
+
+static int quorum_do_checkpoint(BlockDriverState *bs)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int i;
+
+    for (i = 1; i < s->num_children; i++) {
+        if (bdrv_do_checkpoint(s->bs[i]) == 0) {
+            return 0;
+        }
+    }
+
+    return -1;
+}
+
+static int quorum_stop_replication(BlockDriverState *bs)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int ret = -1, i;
+
+    for (i = 0; i < s->num_children; i++) {
+        if (bdrv_stop_replication(s->bs[i]) == 0) {
+            ret++;
+        }
+    }
+
+    return ret ? -1 : 0;
+}
+
 static BlockDriver bdrv_quorum = {
     .format_name                        = "quorum",
     .protocol_name                      = "quorum",
@@ -1093,6 +1158,10 @@ static BlockDriver bdrv_quorum = {
 
     .is_filter                          = true,
     .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
+
+    .bdrv_start_replication             = quorum_start_replication,
+    .bdrv_do_checkpoint                 = quorum_do_checkpoint,
+    .bdrv_stop_replication              = quorum_stop_replication,
 };
 
 static void bdrv_quorum_init(void)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 06/14] NBD client: connect to nbd server later
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (4 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 05/14] quorom: implement block driver interfaces for block replication Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 21:31   ` Max Reitz
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 07/14] NBD client: implement block driver interfaces for block replication Wen Congyang
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

The secondary qemu starts later than the primary qemu, so we
cannot connect to nbd server in bdrv_open().

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/nbd.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 87 insertions(+), 13 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index b05d1d0..19b9200 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -44,6 +44,8 @@
 typedef struct BDRVNBDState {
     NbdClientSession client;
     QemuOpts *socket_opts;
+    char *export;
+    bool connected;
 } BDRVNBDState;
 
 static int nbd_parse_uri(const char *filename, QDict *options)
@@ -247,20 +249,10 @@ static int nbd_establish_connection(BlockDriverState *bs, Error **errp)
     return sock;
 }
 
-static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
-                    Error **errp)
+static int nbd_connect_server(BlockDriverState *bs, Error **errp)
 {
     BDRVNBDState *s = bs->opaque;
-    char *export = NULL;
     int result, sock;
-    Error *local_err = NULL;
-
-    /* Pop the config into our state object. Exit if invalid. */
-    nbd_config(s, options, &export, &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-        return -EINVAL;
-    }
 
     /* establish TCP connection, return error if it fails
      * TODO: Configurable retry-until-timeout behaviour.
@@ -271,16 +263,57 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     /* NBD handshake */
-    result = nbd_client_session_init(&s->client, bs, sock, export, errp);
-    g_free(export);
+    result = nbd_client_session_init(&s->client, bs, sock, s->export, errp);
+    g_free(s->export);
+    s->export = NULL;
+    if (!result) {
+        s->connected = true;
+    }
+
     return result;
 }
 
+static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
+                    Error **errp)
+{
+    BDRVNBDState *s = bs->opaque;
+    Error *local_err = NULL;
+
+    /* Pop the config into our state object. Exit if invalid. */
+    nbd_config(s, options, &s->export, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return -EINVAL;
+    }
+
+    return nbd_connect_server(bs, errp);
+}
+
+static int nbd_open_colo(BlockDriverState *bs, QDict *options, int flags,
+                         Error **errp)
+{
+    BDRVNBDState *s = bs->opaque;
+    Error *local_err = NULL;
+
+    /* Pop the config into our state object. Exit if invalid. */
+    nbd_config(s, options, &s->export, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
 static int nbd_co_readv(BlockDriverState *bs, int64_t sector_num,
                         int nb_sectors, QEMUIOVector *qiov)
 {
     BDRVNBDState *s = bs->opaque;
 
+    if (!s->connected) {
+        return -EIO;
+    }
+
     return nbd_client_session_co_readv(&s->client, sector_num,
                                        nb_sectors, qiov);
 }
@@ -290,6 +323,10 @@ static int nbd_co_writev(BlockDriverState *bs, int64_t sector_num,
 {
     BDRVNBDState *s = bs->opaque;
 
+    if (!s->connected) {
+        return 0;
+    }
+
     return nbd_client_session_co_writev(&s->client, sector_num,
                                         nb_sectors, qiov);
 }
@@ -298,6 +335,10 @@ static int nbd_co_flush(BlockDriverState *bs)
 {
     BDRVNBDState *s = bs->opaque;
 
+    if (!s->connected) {
+        return 0;
+    }
+
     return nbd_client_session_co_flush(&s->client);
 }
 
@@ -312,6 +353,10 @@ static int nbd_co_discard(BlockDriverState *bs, int64_t sector_num,
 {
     BDRVNBDState *s = bs->opaque;
 
+    if (!s->connected) {
+        return 0;
+    }
+
     return nbd_client_session_co_discard(&s->client, sector_num,
                                          nb_sectors);
 }
@@ -322,6 +367,7 @@ static void nbd_close(BlockDriverState *bs)
 
     qemu_opts_del(s->socket_opts);
     nbd_client_session_close(&s->client);
+    s->connected = false;
 }
 
 static int64_t nbd_getlength(BlockDriverState *bs)
@@ -335,6 +381,10 @@ static void nbd_detach_aio_context(BlockDriverState *bs)
 {
     BDRVNBDState *s = bs->opaque;
 
+    if (!s->connected) {
+        return;
+    }
+
     nbd_client_session_detach_aio_context(&s->client);
 }
 
@@ -343,6 +393,10 @@ static void nbd_attach_aio_context(BlockDriverState *bs,
 {
     BDRVNBDState *s = bs->opaque;
 
+    if (!s->connected) {
+        return;
+    }
+
     nbd_client_session_attach_aio_context(&s->client, new_context);
 }
 
@@ -445,11 +499,31 @@ static BlockDriver bdrv_nbd_unix = {
     .bdrv_refresh_filename      = nbd_refresh_filename,
 };
 
+static BlockDriver bdrv_nbd_colo = {
+    .format_name                = "nbd+colo",
+    .protocol_name              = "nbd+colo",
+    .instance_size              = sizeof(BDRVNBDState),
+    .bdrv_parse_filename        = nbd_parse_filename,
+    .bdrv_file_open             = nbd_open_colo,
+    .bdrv_co_readv              = nbd_co_readv,
+    .bdrv_co_writev             = nbd_co_writev,
+    .bdrv_close                 = nbd_close,
+    .bdrv_co_flush_to_os        = nbd_co_flush,
+    .bdrv_co_discard            = nbd_co_discard,
+    .bdrv_getlength             = nbd_getlength,
+    .bdrv_detach_aio_context    = nbd_detach_aio_context,
+    .bdrv_attach_aio_context    = nbd_attach_aio_context,
+    .bdrv_refresh_filename      = nbd_refresh_filename,
+
+    .has_variable_length        = true,
+};
+
 static void bdrv_nbd_init(void)
 {
     bdrv_register(&bdrv_nbd);
     bdrv_register(&bdrv_nbd_tcp);
     bdrv_register(&bdrv_nbd_unix);
+    bdrv_register(&bdrv_nbd_colo);
 }
 
 block_init(bdrv_nbd_init);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 07/14] NBD client: implement block driver interfaces for block replication
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (5 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 06/14] NBD client: connect to nbd server later Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 21:41   ` Max Reitz
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 08/14] block: add a new API to create a hidden BlockBackend Wen Congyang
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/nbd.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 19b9200..1ff6ecf 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -445,6 +445,58 @@ static void nbd_refresh_filename(BlockDriverState *bs)
     bs->full_open_options = opts;
 }
 
+static int nbd_start_replication(BlockDriverState *bs, int mode)
+{
+    BDRVNBDState *s = bs->opaque;
+    Error *local_err = NULL;
+    int ret;
+
+    /*
+     * TODO: support COLO_SECONDARY_MODE if we allow secondary
+     * QEMU becoming primary QEMU.
+     */
+    if (mode != COLO_PRIMARY_MODE) {
+        return -1;
+    }
+
+    if (s->connected) {
+        return -1;
+    }
+
+    /* TODO: NBD client should be one child of quorum, how to verify it? */
+    ret = nbd_connect_server(bs, &local_err);
+    if (local_err) {
+        error_free(local_err);
+    }
+
+    return ret;
+}
+
+static int nbd_do_checkpoint(BlockDriverState *bs)
+{
+    BDRVNBDState *s = bs->opaque;
+
+    if (!s->connected) {
+        return -1;
+    }
+
+    return 0;
+}
+
+static int nbd_stop_replication(BlockDriverState *bs)
+{
+    BDRVNBDState *s = bs->opaque;
+
+    if (!s->connected) {
+        return -1;
+    }
+
+    nbd_client_session_close(&s->client);
+    s->connected = false;
+
+    return 0;
+}
+
 static BlockDriver bdrv_nbd = {
     .format_name                = "nbd",
     .protocol_name              = "nbd",
@@ -514,6 +566,9 @@ static BlockDriver bdrv_nbd_colo = {
     .bdrv_detach_aio_context    = nbd_detach_aio_context,
     .bdrv_attach_aio_context    = nbd_attach_aio_context,
     .bdrv_refresh_filename      = nbd_refresh_filename,
+    .bdrv_start_replication     = nbd_start_replication,
+    .bdrv_do_checkpoint         = nbd_do_checkpoint,
+    .bdrv_stop_replication      = nbd_stop_replication,
 
     .has_variable_length        = true,
 };
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 08/14] block: add a new API to create a hidden BlockBackend
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (6 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 07/14] NBD client: implement block driver interfaces for block replication Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 21:48   ` Max Reitz
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 09/14] block: give backing image its own BlockBackend Wen Congyang
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/block-backend.c          | 29 ++++++++++++++++++++++++++++-
 include/sysemu/block-backend.h |  2 ++
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index c28e240..1bbc078 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -72,6 +72,19 @@ BlockBackend *blk_new(const char *name, Error **errp)
 }
 
 /*
+ * Create a new hidden BlockBackend, with a reference count of one.
+ * Return the new BlockBackend on success, null on failure.
+ */
+BlockBackend *blk_hide_new(void)
+{
+    BlockBackend *blk;
+
+    blk = g_new0(BlockBackend, 1);
+    blk->refcnt = 1;
+    return blk;
+}
+
+/*
  * Create a new BlockBackend with a new BlockDriverState attached.
  * Otherwise just like blk_new(), which see.
  */
@@ -91,6 +104,20 @@ BlockBackend *blk_new_with_bs(const char *name, Error **errp)
     return blk;
 }
 
+/*
+ * Create a new hidden BlockBackend with a new BlockDriverState attached.
+ * Otherwise just like blk_hide_new(), which see.
+ */
+BlockBackend *blk_hide_new_with_bs(void)
+{
+    BlockBackend *blk = blk_hide_new();
+    BlockDriverState *bs = bdrv_new();
+
+    blk->bs = bs;
+    bs->blk = blk;
+    return blk;
+}
+
 static void blk_delete(BlockBackend *blk)
 {
     assert(!blk->refcnt);
@@ -102,7 +129,7 @@ static void blk_delete(BlockBackend *blk)
         blk->bs = NULL;
     }
     /* Avoid double-remove after blk_hide_on_behalf_of_do_drive_del() */
-    if (blk->name[0]) {
+    if (blk->name && blk->name[0]) {
         QTAILQ_REMOVE(&blk_backends, blk, link);
     }
     g_free(blk->name);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index aab12b9..acc50f5 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -61,7 +61,9 @@ typedef struct BlockDevOps {
 } BlockDevOps;
 
 BlockBackend *blk_new(const char *name, Error **errp);
+BlockBackend *blk_hide_new(void);
 BlockBackend *blk_new_with_bs(const char *name, Error **errp);
+BlockBackend *blk_hide_new_with_bs(void);
 void blk_ref(BlockBackend *blk);
 void blk_unref(BlockBackend *blk);
 const char *blk_name(BlockBackend *blk);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 09/14] block: give backing image its own BlockBackend
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (7 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 08/14] block: add a new API to create a hidden BlockBackend Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 21:53   ` Max Reitz
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 10/14] allow the backing image access the origin BlockDriverState Wen Congyang
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index 2335af1..a7a8932 100644
--- a/block.c
+++ b/block.c
@@ -1218,6 +1218,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
 {
     char *backing_filename = g_malloc0(PATH_MAX);
     int ret = 0;
+    BlockBackend *backing_blk;
     BlockDriverState *backing_hd;
     Error *local_err = NULL;
 
@@ -1255,7 +1256,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
         goto free_exit;
     }
 
-    backing_hd = bdrv_new();
+    backing_blk = blk_hide_new_with_bs();
+    backing_hd = blk_bs(backing_blk);
 
     if (bs->backing_format[0] != '\0' && !qdict_haskey(options, "driver")) {
         qdict_put(options, "driver", qstring_from_str(bs->backing_format));
@@ -1266,7 +1268,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
                     *backing_filename ? backing_filename : NULL, NULL, options,
                     bdrv_backing_flags(bs->open_flags), NULL, &local_err);
     if (ret < 0) {
-        bdrv_unref(backing_hd);
+        blk_unref(backing_blk);
+        backing_blk = NULL;
         backing_hd = NULL;
         bs->open_flags |= BDRV_O_NO_BACKING;
         error_setg(errp, "Could not open backing file: %s",
@@ -1870,9 +1873,9 @@ void bdrv_close(BlockDriverState *bs)
 
     if (bs->drv) {
         if (bs->backing_hd) {
-            BlockDriverState *backing_hd = bs->backing_hd;
+            BlockBackend *backing_blk = bs->backing_hd->blk;
             bdrv_set_backing_hd(bs, NULL);
-            bdrv_unref(backing_hd);
+            blk_unref(backing_blk);
         }
         bs->drv->bdrv_close(bs);
         g_free(bs->opaque);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 10/14] allow the backing image access the origin BlockDriverState
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (8 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 09/14] block: give backing image its own BlockBackend Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 22:01   ` Max Reitz
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 11/14] allow writing to the backing file Wen Congyang
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Block replication needs this feature.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c                   | 2 ++
 include/block/block_int.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/block.c b/block.c
index a7a8932..067c44b 100644
--- a/block.c
+++ b/block.c
@@ -1181,6 +1181,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
     if (bs->backing_hd) {
         assert(bs->backing_blocker);
         bdrv_op_unblock_all(bs->backing_hd, bs->backing_blocker);
+        bs->backing_hd->origin_file = NULL;
     } else if (backing_hd) {
         error_setg(&bs->backing_blocker,
                    "device is used as backing hd of '%s'",
@@ -1193,6 +1194,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
         bs->backing_blocker = NULL;
         goto out;
     }
+    backing_hd->origin_file = bs;
     bs->open_flags &= ~BDRV_O_NO_BACKING;
     pstrcpy(bs->backing_file, sizeof(bs->backing_file), backing_hd->filename);
     pstrcpy(bs->backing_format, sizeof(bs->backing_format),
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 603f704..9be13a8 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -360,6 +360,8 @@ struct BlockDriverState {
     char exact_filename[PATH_MAX];
 
     BlockDriverState *backing_hd;
+    /* used by backing image */
+    BlockDriverState *origin_file;
     BlockDriverState *file;
 
     NotifierList close_notifiers;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 11/14] allow writing to the backing file
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (9 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 10/14] allow the backing image access the origin BlockDriverState Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 22:03   ` Max Reitz
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 12/14] Add disk buffer for block replication Wen Congyang
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 067c44b..96cf973 100644
--- a/block.c
+++ b/block.c
@@ -856,8 +856,8 @@ static int bdrv_inherited_flags(int flags)
  */
 static int bdrv_backing_flags(int flags)
 {
-    /* backing files always opened read-only */
-    flags &= ~(BDRV_O_RDWR | BDRV_O_COPY_ON_READ);
+    /* backing files are opened read-write for block replication */
+    flags &= ~BDRV_O_COPY_ON_READ;
 
     /* snapshot=on is handled on the top layer */
     flags &= ~(BDRV_O_SNAPSHOT | BDRV_O_TEMPORARY);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 12/14] Add disk buffer for block replication
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (10 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 11/14] allow writing to the backing file Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 22:27   ` Max Reitz
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 13/14] COW: move cow interfaces to a seperate file Wen Congyang
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/Makefile.objs    |   1 +
 block/blkcolo-buffer.c | 324 +++++++++++++++++++++++++++++++++++++++++++++++++
 block/blkcolo.h        |  35 ++++++
 3 files changed, 360 insertions(+)
 create mode 100644 block/blkcolo-buffer.c
 create mode 100644 block/blkcolo.h

diff --git a/block/Makefile.objs b/block/Makefile.objs
index db2933e..1b7b458 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -21,6 +21,7 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
+block-obj-y += blkcolo-buffer.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/blkcolo-buffer.c b/block/blkcolo-buffer.c
new file mode 100644
index 0000000..1f64542
--- /dev/null
+++ b/block/blkcolo-buffer.c
@@ -0,0 +1,324 @@
+/*
+ * Block driver for COLO
+ *
+ * Copyright Fujitsu, Corp. 2015
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
+ *
+ * Authors:
+ *     Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "qemu/queue.h"
+#include "block/block.h"
+#include "block/blkcolo.h"
+
+typedef struct buffered_request_state {
+    uint64_t start_sector;
+    int nb_sectors;
+    void *data;
+    QSIMPLEQ_ENTRY(buffered_request_state) entry;
+} buffered_request_state;
+
+/* common functions */
+/*
+ * The buffered data may eat too much memory, and glibc cannot work
+ * very well in such case.
+ */
+static void *alloc_buffered_data(int nb_sectors)
+{
+    return g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
+}
+
+static void free_buffered_data(void *data)
+{
+    g_free(data);
+}
+
+typedef struct search_brs_state {
+    uint64_t sector;
+    buffered_request_state *prev;
+} search_brs_state;
+
+static buffered_request_state *search_brs(disk_buffer *disk_buffer,
+                                           search_brs_state *sbs)
+{
+    buffered_request_state *brs;
+
+    QSIMPLEQ_FOREACH(brs, &disk_buffer->head, entry) {
+        if (sbs->sector < brs->start_sector) {
+            return NULL;
+        }
+
+        if (sbs->sector < brs->start_sector + brs->nb_sectors) {
+            return brs;
+        }
+
+        sbs->prev = brs;
+    }
+
+    return NULL;
+}
+
+static buffered_request_state *get_next_brs(buffered_request_state *brs)
+{
+    return QSIMPLEQ_NEXT(brs, entry);
+}
+
+static void add_brs_after(disk_buffer *disk_buffer,
+                          buffered_request_state *new_brs,
+                          buffered_request_state *prev)
+{
+    if (!prev) {
+        QSIMPLEQ_INSERT_HEAD(&disk_buffer->head, new_brs, entry);
+    } else {
+        QSIMPLEQ_INSERT_AFTER(&disk_buffer->head, prev, new_brs, entry);
+    }
+}
+
+static bool disk_buffer_empty(disk_buffer *disk_buffer)
+{
+    return QSIMPLEQ_EMPTY(&disk_buffer->head);
+}
+
+/* Disk buffer */
+static buffered_request_state *create_new_brs(QEMUIOVector *qiov,
+                                              uint64_t iov_sector,
+                                              uint64_t sector, int nb_sectors)
+{
+    buffered_request_state *brs;
+
+    brs = g_slice_new(buffered_request_state);
+    brs->start_sector = sector;
+    brs->nb_sectors = nb_sectors;
+    brs->data = alloc_buffered_data(nb_sectors);
+    qemu_iovec_to_buf(qiov, (sector - iov_sector) * BDRV_SECTOR_SIZE,
+                      brs->data, nb_sectors * BDRV_SECTOR_SIZE);
+
+    return brs;
+}
+
+static void free_brs(buffered_request_state *brs)
+{
+    free_buffered_data(brs->data);
+    g_slice_free(buffered_request_state, brs);
+}
+
+bool buffer_has_empty_range(disk_buffer *disk_buffer,
+                            uint64_t sector, int nb_sectors)
+{
+    buffered_request_state *brs;
+    search_brs_state sbs;
+    uint64_t cur_sector = sector;
+
+    if (nb_sectors <= 0) {
+        return false;
+    }
+
+    sbs.sector = sector;
+    sbs.prev = NULL;
+    brs = search_brs(disk_buffer, &sbs);
+    if (!brs) {
+        return true;
+    }
+
+    while (brs && cur_sector < sector + nb_sectors) {
+        if (cur_sector < brs->start_sector) {
+            return true;
+        }
+
+        if (brs->start_sector + brs->nb_sectors >= sector + nb_sectors) {
+            return false;
+        }
+
+        cur_sector = brs->start_sector + brs->nb_sectors;
+        brs = get_next_brs(brs);
+    }
+
+    if (cur_sector < sector + nb_sectors) {
+        return true;
+    } else {
+        return false;
+    }
+}
+
+/* Note: only the sector that exists in the buffer will be overwriten */
+void qiov_read_from_buffer(disk_buffer *disk_buffer, QEMUIOVector *qiov,
+                           uint64_t sector, int nb_sectors)
+{
+    search_brs_state sbs;
+    buffered_request_state *brs;
+    size_t offset, cur_nb_sectors;
+    uint64_t cur_sector = sector;
+    void *buf;
+
+    if (disk_buffer_empty(disk_buffer)) {
+        /* The disk buffer is empty */
+        return;
+    }
+
+    sbs.sector = sector;
+    sbs.prev = NULL;
+    brs = search_brs(disk_buffer, &sbs);
+    if (!brs) {
+        if (!sbs.prev) {
+            brs = QSIMPLEQ_FIRST(&disk_buffer->head);
+        } else {
+            brs = get_next_brs(sbs.prev);
+        }
+    }
+
+    while (brs && cur_sector < sector + nb_sectors) {
+        if (brs->start_sector >= sector + nb_sectors) {
+            break;
+        }
+
+        /* In the first loop, brs->start_sector can be less than sector */
+        if (brs->start_sector < cur_sector) {
+            offset = cur_sector - brs->start_sector;
+            buf = brs->data + offset * BDRV_SECTOR_SIZE;
+        } else {
+            cur_sector = brs->start_sector;
+            offset = 0;
+            buf = brs->data;
+        }
+        if (brs->start_sector + brs->nb_sectors >= sector + nb_sectors) {
+            cur_nb_sectors = sector + nb_sectors - cur_sector;
+        } else {
+            cur_nb_sectors = brs->nb_sectors - offset;
+        }
+        qemu_iovec_from_buf(qiov, (cur_sector - sector) * BDRV_SECTOR_SIZE,
+                            buf, cur_nb_sectors * BDRV_SECTOR_SIZE);
+
+        cur_sector = brs->start_sector + brs->nb_sectors;
+        brs = get_next_brs(brs);
+    }
+}
+
+void qiov_write_to_buffer(disk_buffer *disk_buffer, QEMUIOVector *qiov,
+                          uint64_t sector, int nb_sectors, bool overwrite)
+{
+    search_brs_state sbs;
+    buffered_request_state *brs, *new_brs, *prev;
+    uint64_t cur_sector = sector;
+    int cur_nb_sectors, offset;
+
+    if (disk_buffer_empty(disk_buffer)) {
+        /* The disk buffer is empty */
+        new_brs = create_new_brs(qiov, sector, cur_sector, nb_sectors);
+        add_brs_after(disk_buffer, new_brs, NULL);
+        return;
+    }
+
+    sbs.sector = sector;
+    sbs.prev = NULL;
+    brs = search_brs(disk_buffer, &sbs);
+    if (!sbs.prev) {
+        prev = NULL;
+        brs = QSIMPLEQ_FIRST(&disk_buffer->head);
+    } else {
+        prev = sbs.prev;
+        brs = get_next_brs(sbs.prev);
+    }
+
+    while (brs && cur_sector < sector + nb_sectors) {
+        if (cur_sector < brs->start_sector) {
+            if (sector + nb_sectors <= brs->start_sector) {
+                cur_nb_sectors = sector + nb_sectors - cur_sector;
+            } else {
+                cur_nb_sectors = brs->start_sector - cur_sector;
+            }
+            new_brs = create_new_brs(qiov, sector, cur_sector, cur_nb_sectors);
+            add_brs_after(disk_buffer, new_brs, prev);
+            cur_sector = brs->start_sector;
+        }
+
+        if (cur_sector >= sector + nb_sectors) {
+            break;
+        }
+
+        if (overwrite) {
+            offset = cur_sector - brs->start_sector;
+            if (sector + nb_sectors <= brs->start_sector + brs->nb_sectors) {
+                cur_nb_sectors = sector + nb_sectors - cur_sector;
+            } else {
+                cur_nb_sectors = brs->nb_sectors - offset;
+            }
+            qemu_iovec_to_buf(qiov, (cur_sector - sector) * BDRV_SECTOR_SIZE,
+                              brs->data + offset * BDRV_SECTOR_SIZE,
+                              cur_nb_sectors * BDRV_SECTOR_SIZE);
+        }
+
+        cur_sector = brs->start_sector + brs->nb_sectors;
+
+        prev = brs;
+        brs = get_next_brs(brs);
+    }
+
+    if (cur_sector < sector + nb_sectors) {
+        new_brs = create_new_brs(qiov, sector, cur_sector,
+                                 sector + nb_sectors - cur_sector);
+        add_brs_after(disk_buffer, new_brs, prev);
+    }
+}
+
+struct flushed_data {
+    QEMUIOVector qiov;
+    buffered_request_state *brs;
+};
+
+static void flush_buffered_data_complete(void *opaque, int ret)
+{
+    struct flushed_data *flushed_data = opaque;
+
+    /* We have reported the guest that this write ops successed */
+    assert(ret == 0);
+
+    qemu_iovec_destroy(&flushed_data->qiov);
+    free_brs(flushed_data->brs);
+    g_free(flushed_data);
+}
+
+void flush_buffered_data_to_disk(disk_buffer *disk_buffer,
+                                 BlockDriverState *bs)
+{
+    buffered_request_state *brs, *tmp;
+    struct flushed_data *flushed_data = NULL;
+
+    QSIMPLEQ_FOREACH_SAFE(brs, &disk_buffer->head, entry, tmp) {
+        /* brs is always the head */
+        QSIMPLEQ_REMOVE_HEAD(&disk_buffer->head, entry);
+
+        flushed_data = g_malloc(sizeof(struct flushed_data));
+        qemu_iovec_init(&flushed_data->qiov, 1);
+        qemu_iovec_add(&flushed_data->qiov, brs->data,
+                       brs->nb_sectors * BDRV_SECTOR_SIZE);
+        flushed_data->brs = brs;
+        bdrv_aio_writev(bs, brs->start_sector, &flushed_data->qiov,
+                        brs->nb_sectors, flush_buffered_data_complete,
+                        flushed_data);
+    }
+
+    bdrv_drain_all();
+}
+
+void init_disk_buffer(disk_buffer *disk_buffer)
+{
+    QSIMPLEQ_INIT(&disk_buffer->head);
+}
+
+void clear_all_buffered_data(disk_buffer *disk_buffer)
+{
+    buffered_request_state *brs, *tmp;
+
+    QSIMPLEQ_FOREACH_SAFE(brs, &disk_buffer->head, entry, tmp) {
+        /* brs is always the head */
+        QSIMPLEQ_REMOVE_HEAD(&disk_buffer->head, entry);
+        free_brs(brs);
+    }
+}
diff --git a/block/blkcolo.h b/block/blkcolo.h
new file mode 100644
index 0000000..d8e0d9a
--- /dev/null
+++ b/block/blkcolo.h
@@ -0,0 +1,35 @@
+/*
+ * Block driver for COLO
+ *
+ * Copyright Fujitsu, Corp. 2015
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
+ *
+ * Authors:
+ *     Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef BLOCK_BLKCOLO_H
+#define BLOCK_BLKCOLO_H
+
+typedef struct disk_buffer {
+    QSIMPLEQ_HEAD(, buffered_request_state) head;
+} disk_buffer;
+
+bool buffer_has_empty_range(disk_buffer *disk_buffer,
+                            uint64_t sector, int nb_sectors);
+void qiov_read_from_buffer(disk_buffer *disk_buffer, QEMUIOVector *qiov,
+                           uint64_t sector, int nb_sectors);
+void qiov_write_to_buffer(disk_buffer *disk_buffer, QEMUIOVector *qiov,
+                          uint64_t sector, int nb_sectors, bool overwrite);
+void flush_buffered_data_to_disk(disk_buffer *disk_buffer,
+                                 BlockDriverState *bs);
+
+void init_disk_buffer(disk_buffer *disk_buffer);
+void clear_all_buffered_data(disk_buffer *disk_buffer);
+
+#endif
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 13/14] COW: move cow interfaces to a seperate file
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (11 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 12/14] Add disk buffer for block replication Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 14/14] COLO: implement a new block driver Wen Congyang
  2015-02-18 16:26 ` [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Paolo Bonzini
  14 siblings, 0 replies; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 Makefile.objs         |  2 +-
 block/backup.c        | 52 ++++-----------------------------------------------
 blockcow.c            | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/block/block.h | 17 +++++++++++++++++
 4 files changed, 74 insertions(+), 49 deletions(-)
 create mode 100644 blockcow.c

diff --git a/Makefile.objs b/Makefile.objs
index 28999d3..91bba07 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -7,7 +7,7 @@ util-obj-y = util/ qobject/ qapi/ qapi-types.o qapi-visit.o qapi-event.o
 # block-obj-y is code used by both qemu system emulation and qemu-img
 
 block-obj-y = async.o thread-pool.o
-block-obj-y += nbd.o block.o blockjob.o
+block-obj-y += nbd.o block.o blockjob.o blockcow.o
 block-obj-y += main-loop.o iohandler.o qemu-timer.o
 block-obj-$(CONFIG_POSIX) += aio-posix.o
 block-obj-$(CONFIG_WIN32) += aio-win32.o
diff --git a/block/backup.c b/block/backup.c
index 1c535b1..2816b9a 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -27,13 +27,6 @@
 
 #define SLICE_TIME 100000000ULL /* ns */
 
-typedef struct CowRequest {
-    int64_t start;
-    int64_t end;
-    QLIST_ENTRY(CowRequest) list;
-    CoQueue wait_queue; /* coroutines blocked on this request */
-} CowRequest;
-
 typedef struct BackupBlockJob {
     BlockJob common;
     BlockDriverState *target;
@@ -44,46 +37,9 @@ typedef struct BackupBlockJob {
     CoRwlock flush_rwlock;
     uint64_t sectors_read;
     HBitmap *bitmap;
-    QLIST_HEAD(, CowRequest) inflight_reqs;
+    CowJob cow_job;
 } BackupBlockJob;
 
-/* See if in-flight requests overlap and wait for them to complete */
-static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
-                                                       int64_t start,
-                                                       int64_t end)
-{
-    CowRequest *req;
-    bool retry;
-
-    do {
-        retry = false;
-        QLIST_FOREACH(req, &job->inflight_reqs, list) {
-            if (end > req->start && start < req->end) {
-                qemu_co_queue_wait(&req->wait_queue);
-                retry = true;
-                break;
-            }
-        }
-    } while (retry);
-}
-
-/* Keep track of an in-flight request */
-static void cow_request_begin(CowRequest *req, BackupBlockJob *job,
-                                     int64_t start, int64_t end)
-{
-    req->start = start;
-    req->end = end;
-    qemu_co_queue_init(&req->wait_queue);
-    QLIST_INSERT_HEAD(&job->inflight_reqs, req, list);
-}
-
-/* Forget about a completed request */
-static void cow_request_end(CowRequest *req)
-{
-    QLIST_REMOVE(req, list);
-    qemu_co_queue_restart_all(&req->wait_queue);
-}
-
 static int coroutine_fn backup_do_cow(BlockDriverState *bs,
                                       int64_t sector_num, int nb_sectors,
                                       bool *error_is_read)
@@ -104,8 +60,8 @@ static int coroutine_fn backup_do_cow(BlockDriverState *bs,
 
     trace_backup_do_cow_enter(job, start, sector_num, nb_sectors);
 
-    wait_for_overlapping_requests(job, start, end);
-    cow_request_begin(&cow_request, job, start, end);
+    wait_for_overlapping_requests(&job->cow_job, start, end);
+    cow_request_begin(&cow_request, &job->cow_job, start, end);
 
     for (; start < end; start++) {
         if (hbitmap_get(job->bitmap, start)) {
@@ -255,7 +211,7 @@ static void coroutine_fn backup_run(void *opaque)
     int64_t start, end;
     int ret = 0;
 
-    QLIST_INIT(&job->inflight_reqs);
+    QLIST_INIT(&job->cow_job.inflight_reqs);
     qemu_co_rwlock_init(&job->flush_rwlock);
 
     start = 0;
diff --git a/blockcow.c b/blockcow.c
new file mode 100644
index 0000000..c070a62
--- /dev/null
+++ b/blockcow.c
@@ -0,0 +1,52 @@
+/*
+ * QEMU block WAW
+ *
+ * Copyright Fujitsu, Corp. 2015
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
+ *
+ * Authors:
+ *  Wen Congyang (wency@cn.fujitsu.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "block/block.h"
+
+/* See if in-flight requests overlap and wait for them to complete */
+void coroutine_fn
+wait_for_overlapping_requests(CowJob *job, int64_t start, int64_t end)
+{
+    CowRequest *req;
+    bool retry;
+
+    do {
+        retry = false;
+        QLIST_FOREACH(req, &job->inflight_reqs, list) {
+            if (end > req->start && start < req->end) {
+                qemu_co_queue_wait(&req->wait_queue);
+                retry = true;
+                break;
+            }
+        }
+    } while (retry);
+}
+
+/* Keep track of an in-flight request */
+void cow_request_begin(CowRequest *req, CowJob *job,
+                       int64_t start, int64_t end)
+{
+    req->start = start;
+    req->end = end;
+    qemu_co_queue_init(&req->wait_queue);
+    QLIST_INSERT_HEAD(&job->inflight_reqs, req, list);
+}
+
+/* Forget about a completed request */
+void cow_request_end(CowRequest *req)
+{
+    QLIST_REMOVE(req, list);
+    qemu_co_queue_restart_all(&req->wait_queue);
+}
diff --git a/include/block/block.h b/include/block/block.h
index 632b9fc..0a55373 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -567,4 +567,21 @@ int bdrv_start_replication(BlockDriverState *bs, int mode);
 int bdrv_do_checkpoint(BlockDriverState *bs);
 int bdrv_stop_replication(BlockDriverState *bs);
 
+typedef struct CowRequest {
+    int64_t start;
+    int64_t end;
+    QLIST_ENTRY(CowRequest) list;
+    CoQueue wait_queue; /* coroutines blocked on this request */
+} CowRequest;
+
+typedef struct CowJob {
+    QLIST_HEAD(, CowRequest) inflight_reqs;
+} CowJob;
+
+void coroutine_fn
+wait_for_overlapping_requests(CowJob *job, int64_t start, int64_t end);
+void cow_request_begin(CowRequest *req, CowJob *job,
+                       int64_t start, int64_t end);
+void cow_request_end(CowRequest *req);
+
 #endif
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [Qemu-devel] [RFC PATCH 14/14] COLO: implement a new block driver
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (12 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 13/14] COW: move cow interfaces to a seperate file Wen Congyang
@ 2015-02-12  3:07 ` Wen Congyang
  2015-02-23 22:35   ` Max Reitz
  2015-02-18 16:26 ` [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Paolo Bonzini
  14 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  3:07 UTC (permalink / raw)
  To: qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/Makefile.objs |   2 +-
 block/blkcolo.c     | 409 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 410 insertions(+), 1 deletion(-)
 create mode 100644 block/blkcolo.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 1b7b458..021e891 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -21,7 +21,7 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
-block-obj-y += blkcolo-buffer.o
+block-obj-y += blkcolo-buffer.o blkcolo.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/blkcolo.c b/block/blkcolo.c
new file mode 100644
index 0000000..2f73486
--- /dev/null
+++ b/block/blkcolo.c
@@ -0,0 +1,409 @@
+/*
+ * Block driver for block replication
+ *
+ * Copyright Fujitsu, Corp. 2015
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
+ *
+ * Authors:
+ *     Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "block/block_int.h"
+#include "sysemu/block-backend.h"
+#include "block/blkcolo.h"
+#include "block/nbd.h"
+
+#define COLO_OPT_EXPORT         "export"
+
+#define COLO_CLUSTER_BITS 16
+#define COLO_CLUSTER_SIZE (1 << COLO_CLUSTER_BITS)
+#define COLO_SECTORS_PER_CLUSTER (COLO_CLUSTER_SIZE / BDRV_SECTOR_SIZE)
+
+typedef struct BDRVBlkcoloState BDRVBlkcoloState;
+
+struct BDRVBlkcoloState {
+    BlockDriverState *bs;
+    char *export_name;
+    int mode;
+    disk_buffer disk_buffer;
+    NotifierWithReturn before_write;
+    NBDExport *exp;
+    CowJob cow_job;
+    bool error;
+};
+
+static void colo_svm_init(BDRVBlkcoloState *s);
+static void colo_svm_fini(BDRVBlkcoloState *s);
+
+static int switch_mode(BDRVBlkcoloState *s, int new_mode)
+{
+    if (s->mode == new_mode) {
+        return 0;
+    }
+
+    if (s->mode == COLO_SECONDARY_MODE) {
+        colo_svm_fini(s);
+    }
+
+    s->mode = new_mode;
+    if (s->mode == COLO_SECONDARY_MODE) {
+        colo_svm_init(s);
+    }
+
+    return 0;
+}
+
+/*
+ * Secondary mode functions
+ *
+ * All write requests are forwarded to secondary QEMU from primary QEMU.
+ * The secondary QEMU should do the following things:
+ * 1. Use NBD server to receive and handle the forwarded write requests
+ * 2. Buffer the secondary write requests
+ */
+
+static int coroutine_fn
+colo_svm_co_writev(BlockDriverState *bs, int64_t sector_num,
+                   int nb_sectors, QEMUIOVector *qiov)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    /*
+     * Write the request to the disk buffer. How to limit the
+     * write speed?
+     */
+    qiov_write_to_buffer(&s->disk_buffer, qiov, sector_num, nb_sectors, true);
+
+    return 0;
+}
+
+static int coroutine_fn
+colo_svm_co_readv(BlockDriverState *bs, int64_t sector_num,
+                  int nb_sectors, QEMUIOVector *qiov)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+    int ret;
+
+    /*
+     * Read the sector content from secondary disk first. If the sector
+     * content is buffered, use the buffered content.
+     */
+    ret = bdrv_co_readv(bs->backing_hd, sector_num, nb_sectors, qiov);
+    if (ret) {
+        return ret;
+    }
+
+    /* Read from the buffer */
+    qiov_read_from_buffer(&s->disk_buffer, qiov, sector_num, nb_sectors);
+    return 0;
+}
+
+static int coroutine_fn
+colo_do_cow(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
+{
+    BDRVBlkcoloState *s = bs->origin_file->opaque;
+    CowRequest cow_request;
+    struct iovec iov;
+    QEMUIOVector bounce_qiov;
+    void *bounce_buffer = NULL;
+    int ret = 0;
+    int64_t start, end;
+    int n;
+
+    start = sector_num / COLO_SECTORS_PER_CLUSTER;
+    end = DIV_ROUND_UP(sector_num + nb_sectors, COLO_SECTORS_PER_CLUSTER);
+
+    wait_for_overlapping_requests(&s->cow_job, start, end);
+    cow_request_begin(&cow_request, &s->cow_job, start, end);
+
+    nb_sectors = COLO_SECTORS_PER_CLUSTER;
+    for (; start < end; start++) {
+        sector_num = start * COLO_SECTORS_PER_CLUSTER;
+        if (!buffer_has_empty_range(&s->disk_buffer, sector_num, nb_sectors)) {
+            continue;
+        }
+
+        /* TODO */
+        n = COLO_SECTORS_PER_CLUSTER;
+
+        if (!bounce_buffer) {
+            bounce_buffer = qemu_blockalign(bs, COLO_CLUSTER_SIZE);
+        }
+        iov.iov_base = bounce_buffer;
+        iov.iov_len = n * BDRV_SECTOR_SIZE;
+        qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+
+        ret = bdrv_co_readv(bs, sector_num, n, &bounce_qiov);
+        if (ret < 0) {
+            goto out;
+        }
+
+        qiov_write_to_buffer(&s->disk_buffer, &bounce_qiov,
+                             sector_num, n, false);
+    }
+
+out:
+    cow_request_end(&cow_request);
+    return ret;
+}
+
+static int coroutine_fn
+colo_before_write_notify(NotifierWithReturn *notifier, void *opaque)
+{
+    BdrvTrackedRequest *req = opaque;
+    BlockDriverState *bs = req->bs;
+    BDRVBlkcoloState *s = bs->origin_file->opaque;
+    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
+    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
+    int ret;
+
+    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    ret = colo_do_cow(bs, sector_num, nb_sectors);
+    if (ret) {
+        s->error = true;
+    }
+
+    return ret;
+}
+
+/*
+ * It should be called in the migration/checkpoint thread, and the caller
+ * should be hold io thread lock
+ */
+static int svm_do_checkpoint(BDRVBlkcoloState *s)
+{
+    if (s->error) {
+        /* TODO: we should report the error more earlier */
+        return -1;
+    }
+
+    /* clear disk buffer */
+    clear_all_buffered_data(&s->disk_buffer);
+    return 0;
+}
+
+/* It should be called in the migration/checkpoint thread */
+static void svm_stop_replication(BDRVBlkcoloState *s)
+{
+    /* switch to unprotected mode */
+    switch_mode(s, COLO_UNPROTECTED_MODE);
+}
+
+static void colo_svm_init(BDRVBlkcoloState *s)
+{
+    BlockBackend *blk = s->bs->backing_hd->blk;
+
+    /* Init Disk Buffer */
+    init_disk_buffer(&s->disk_buffer);
+
+    s->before_write.notify = colo_before_write_notify;
+    bdrv_add_before_write_notifier(s->bs->backing_hd, &s->before_write);
+
+    /* start NBD server */
+    s->exp = nbd_export_new(blk, 0, -1, 0, NULL);
+    nbd_export_set_name(s->exp, s->export_name);
+
+    s->error = false;
+    QLIST_INIT(&s->cow_job.inflight_reqs);
+}
+
+static void colo_svm_fini(BDRVBlkcoloState *s)
+{
+    /* stop NBD server */
+    nbd_export_close(s->exp);
+    nbd_export_put(s->exp);
+
+    /* notifier_with_return_remove */
+    notifier_with_return_remove(&s->before_write);
+
+    /* TODO: All pvm write requests have been done? */
+
+    /* flush all buffered data to secondary disk */
+    flush_buffered_data_to_disk(&s->disk_buffer, s->bs->backing_hd);
+}
+
+/* block driver interfaces */
+static QemuOptsList colo_runtime_opts = {
+    .name = "colo",
+    .head = QTAILQ_HEAD_INITIALIZER(colo_runtime_opts.head),
+    .desc = {
+        {
+            .name = COLO_OPT_EXPORT,
+            .type = QEMU_OPT_STRING,
+            .help = "The NBD server name",
+        },
+        { /* end of list */ }
+    },
+};
+
+/*
+ * usage: -drive if=xxx,driver=colo,export=xxx,\
+ *        backing.file.filename=1.raw,\
+ *        backing.driver=raw
+ */
+static int blkcolo_open(BlockDriverState *bs, QDict *options, int flags,
+                        Error **errp)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+    Error *local_err = NULL;
+    QemuOpts *opts = NULL;
+    int ret = 0;
+
+    s->bs = bs;
+
+    opts = qemu_opts_create(&colo_runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        ret = -EINVAL;
+        goto exit;
+    }
+
+    s->export_name = g_strdup(qemu_opt_get(opts, COLO_OPT_EXPORT));
+    if (!s->export_name) {
+        error_setg(&local_err, "Missing the option export");
+        ret = -EINVAL;
+        goto exit;
+    }
+
+exit:
+    qemu_opts_del(opts);
+    /* propagate error */
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
+    return ret;
+}
+
+static void blkcolo_close(BlockDriverState *bs)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    if (s->mode == COLO_SECONDARY_MODE) {
+        switch_mode(s, COLO_UNPROTECTED_MODE);
+    }
+
+    g_free(s->export_name);
+}
+
+static int64_t blkcolo_getlength(BlockDriverState *bs)
+{
+    if (!bs->backing_hd) {
+        return 0;
+    } else {
+        return bdrv_getlength(bs->backing_hd);
+    }
+}
+
+static int blkcolo_co_readv(BlockDriverState *bs, int64_t sector_num,
+                            int nb_sectors, QEMUIOVector *qiov)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    if (s->mode == COLO_SECONDARY_MODE) {
+        return colo_svm_co_readv(bs, sector_num, nb_sectors, qiov);
+    }
+
+    assert(s->mode == COLO_UNPROTECTED_MODE);
+
+    if (!bs->backing_hd) {
+        return -EIO;
+    } else {
+        return bdrv_co_readv(bs->backing_hd, sector_num, nb_sectors, qiov);
+    }
+}
+
+static int blkcolo_co_writev(BlockDriverState *bs, int64_t sector_num,
+                             int nb_sectors, QEMUIOVector *qiov)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    if (s->mode == COLO_SECONDARY_MODE) {
+        return colo_svm_co_writev(bs, sector_num, nb_sectors, qiov);
+    }
+
+    assert(s->mode == COLO_UNPROTECTED_MODE);
+
+    if (!bs->backing_hd) {
+        return -EIO;
+    } else {
+        return bdrv_co_writev(bs->backing_hd, sector_num, nb_sectors, qiov);
+    }
+}
+
+static int blkcolo_start_replication(BlockDriverState *bs, int mode)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    if (mode != COLO_SECONDARY_MODE ||
+        s->mode != COLO_UNPROTECTED_MODE ||
+        !bs->backing_hd) {
+        return -1;
+    }
+
+    if (!blk_is_inserted(bs->backing_hd->blk)) {
+        return -1;
+    }
+
+    if (blk_is_read_only(bs->backing_hd->blk)) {
+        return -1;
+    }
+
+    return switch_mode(s, mode);
+}
+
+static int blkcolo_do_checkpoint(BlockDriverState *bs)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    if (s->mode != COLO_SECONDARY_MODE) {
+        return -1;
+    }
+
+    return svm_do_checkpoint(s);
+}
+
+static int blkcolo_stop_replication(BlockDriverState *bs)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    if (s->mode != COLO_SECONDARY_MODE) {
+        return -1;
+    }
+
+    svm_stop_replication(s);
+    return 0;
+}
+
+static BlockDriver bdrv_blkcolo = {
+    .format_name                = "blkcolo",
+    .protocol_name              = "blkcolo",
+    .instance_size              = sizeof(BDRVBlkcoloState),
+
+    .bdrv_file_open             = blkcolo_open,
+    .bdrv_close                 = blkcolo_close,
+    .bdrv_getlength             = blkcolo_getlength,
+
+    .bdrv_co_readv              = blkcolo_co_readv,
+    .bdrv_co_writev             = blkcolo_co_writev,
+
+    .bdrv_start_replication     = blkcolo_start_replication,
+    .bdrv_do_checkpoint         = blkcolo_do_checkpoint,
+    .bdrv_stop_replication      = blkcolo_stop_replication,
+
+    .supports_backing           = true,
+    .has_variable_length        = true,
+};
+
+static void bdrv_blkcolo_init(void)
+{
+    bdrv_register(&bdrv_blkcolo);
+};
+
+block_init(bdrv_blkcolo_init);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 02/14] quorom: add a new read pattern
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 02/14] quorom: add a new read pattern Wen Congyang
@ 2015-02-12  6:42   ` Gonglei
  2015-02-23 20:36   ` Max Reitz
  2015-02-23 21:56   ` Eric Blake
  2 siblings, 0 replies; 81+ messages in thread
From: Gonglei @ 2015-02-12  6:42 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: zhanghailiang, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Luiz Capitulino, Yang Hongyang, Michael Roth, Lai Jiangshan

On 2015/2/12 11:07, Wen Congyang wrote:
> To block replication, we only need to read from the first child.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>

A little typo in title of patch 2 and patch 5:
s/quorom/quorum/

Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description Wen Congyang
@ 2015-02-12  7:21   ` Fam Zheng
  2015-02-12  7:40     ` Wen Congyang
  2015-03-04 16:35   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 81+ messages in thread
From: Fam Zheng @ 2015-02-12  7:21 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, zhanghailiang

Hi Congyang,

On Thu, 02/12 11:07, Wen Congyang wrote:
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content in the Disk buffer.

I'm a little confused by the tenses ("will be" versus "are") and terms. I am
reading them as "s/will be/are/g"

Why do you need this buffer?

If both primary and secondary write to the same sector, what is saved in the
buffer?

Fam

> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  7:21   ` Fam Zheng
@ 2015-02-12  7:40     ` Wen Congyang
  2015-02-12  8:44       ` Fam Zheng
  0 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  7:40 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, zhanghailiang

On 02/12/2015 03:21 PM, Fam Zheng wrote:
> Hi Congyang,
> 
> On Thu, 02/12 11:07, Wen Congyang wrote:
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> +        +----------------------+            +------------------------+
>> +        |Primary Write Requests|            |Secondary Write Requests|
>> +        +----------------------+            +------------------------+
>> +                  |                                       |
>> +                  |                                      (4)
>> +                  |                                       V
>> +                  |                              /-------------\
>> +                  |      Copy and Forward        |             |
>> +                  |---------(1)----------+       | Disk Buffer |
>> +                  |                      |       |             |
>> +                  |                     (3)      \-------------/
>> +                  |                 speculative      ^
>> +                  |                write through    (2)
>> +                  |                      |           |
>> +                  V                      V           |
>> +           +--------------+           +----------------+
>> +           | Primary Disk |           | Secondary Disk |
>> +           +--------------+           +----------------+
>> +
>> +    1) Primary write requests will be copied and forwarded to Secondary
>> +       QEMU.
>> +    2) Before Primary write requests are written to Secondary disk, the
>> +       original sector content will be read from Secondary disk and
>> +       buffered in the Disk buffer, but it will not overwrite the existing
>> +       sector content in the Disk buffer.
> 
> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> reading them as "s/will be/are/g"
> 
> Why do you need this buffer?

We only sync the disk till next checkpoint. Before next checkpoint, secondary
vm write to the buffer.

> 
> If both primary and secondary write to the same sector, what is saved in the
> buffer?

The primary content will be written to the secondary disk, and the secondary content
is saved in the buffer.

Thanks
Wen Congyang

> 
> Fam
> 
>> +    3) Primary write requests will be written to Secondary disk.
>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>> +       will overwrite the existing sector content in the buffer.
>> +
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  7:40     ` Wen Congyang
@ 2015-02-12  8:44       ` Fam Zheng
  2015-02-12  9:33         ` Wen Congyang
                           ` (4 more replies)
  0 siblings, 5 replies; 81+ messages in thread
From: Fam Zheng @ 2015-02-12  8:44 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On Thu, 02/12 15:40, Wen Congyang wrote:
> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> > Hi Congyang,
> > 
> > On Thu, 02/12 11:07, Wen Congyang wrote:
> >> +== Workflow ==
> >> +The following is the image of block replication workflow:
> >> +
> >> +        +----------------------+            +------------------------+
> >> +        |Primary Write Requests|            |Secondary Write Requests|
> >> +        +----------------------+            +------------------------+
> >> +                  |                                       |
> >> +                  |                                      (4)
> >> +                  |                                       V
> >> +                  |                              /-------------\
> >> +                  |      Copy and Forward        |             |
> >> +                  |---------(1)----------+       | Disk Buffer |
> >> +                  |                      |       |             |
> >> +                  |                     (3)      \-------------/
> >> +                  |                 speculative      ^
> >> +                  |                write through    (2)
> >> +                  |                      |           |
> >> +                  V                      V           |
> >> +           +--------------+           +----------------+
> >> +           | Primary Disk |           | Secondary Disk |
> >> +           +--------------+           +----------------+
> >> +
> >> +    1) Primary write requests will be copied and forwarded to Secondary
> >> +       QEMU.
> >> +    2) Before Primary write requests are written to Secondary disk, the
> >> +       original sector content will be read from Secondary disk and
> >> +       buffered in the Disk buffer, but it will not overwrite the existing
> >> +       sector content in the Disk buffer.
> > 
> > I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> > reading them as "s/will be/are/g"
> > 
> > Why do you need this buffer?
> 
> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> vm write to the buffer.
> 
> > 
> > If both primary and secondary write to the same sector, what is saved in the
> > buffer?
> 
> The primary content will be written to the secondary disk, and the secondary content
> is saved in the buffer.

I wonder if alternatively this is possible with an imaginary "writable backing
image" feature, as described below.

When we have a normal backing chain,

               {virtio-blk dev 'foo'}
                         |
                         |
                         |
    [base] <- [mid] <- (foo)

Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
to an existing image on top,

               {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
                         |                              |
                         |                              |
                         |                              |
    [base] <- [mid] <- (foo)  <---------------------- (bar)

It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
We can utilize an automatic hidden drive-backup target:

               {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
                         |                                                          |
                         |                                                          |
                         v                                                          v

    [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)

                         v                              ^
                         v                              ^
                         v                              ^
                         v                              ^
                         >>>> drive-backup sync=none >>>>

So when guest writes to 'foo', the old data is moved to (hidden target), which
remains unchanged from (bar)'s PoV.

The drive in the middle is called hidden because QEMU creates it automatically,
the naming is arbitrary.

It is interesting because it is a more generalized case of image fleecing,
where the (hidden target) is exposed via NBD server for data scanning (read
only) purpose.

More interestingly, with above facility, it is also possible to create a guest
visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
cheaply. Or call it shadow copy if you will.

Back to the COLO case, the configuration will be very similar:


                      {primary wr}                                                {secondary vm}
                            |                                                           |
                            |                                                           |
                            |                                                           |
                            v                                                           v

   [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)

                            v                              ^
                            v                              ^
                            v                              ^
                            v                              ^
                            >>>> drive-backup sync=none >>>>

The workflow analogue is:

> >> +    1) Primary write requests will be copied and forwarded to Secondary
> >> +       QEMU.

Primary write requests are forwarded to secondary QEMU as well.

> >> +    2) Before Primary write requests are written to Secondary disk, the
> >> +       original sector content will be read from Secondary disk and
> >> +       buffered in the Disk buffer, but it will not overwrite the existing
> >> +       sector content in the Disk buffer.

Before Primary write requests are written to (nbd target), aka the Secondary
disk, the orignal sector content is read from it and copied to (hidden buf
disk) by drive-backup. It obviously will not overwrite the data in (active
disk).

> >> +    3) Primary write requests will be written to Secondary disk.

Primary write requests are written to (nbd target).

> >> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >> +       will overwrite the existing sector content in the buffer.

Secondary write request will be written in (active disk) as usual.

Finally, when checkpoint arrives, if you want to sync with primary, just drop
data in (hidden buf disk) and (active disk); when failover happends, if you
want to promote secondary vm, you can commit (active disk) to (nbd target), and
drop data in (hidden buf disk).

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  8:44       ` Fam Zheng
@ 2015-02-12  9:33         ` Wen Congyang
  2015-02-12  9:44           ` Fam Zheng
  2015-02-12  9:36         ` Hongyang Yang
                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12  9:33 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On 02/12/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |      Copy and Forward        |             |
>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>> +                  |                      |       |             |
>>>> +                  |                     (3)      \-------------/
>>>> +                  |                 speculative      ^
>>>> +                  |                write through    (2)
>>>> +                  |                      |           |
>>>> +                  V                      V           |
>>>> +           +--------------+           +----------------+
>>>> +           | Primary Disk |           | Secondary Disk |
>>>> +           +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>> vm write to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary content
>> is saved in the buffer.
> 
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
> 
> When we have a normal backing chain,
> 
>                {virtio-blk dev 'foo'}
>                          |
>                          |
>                          |
>     [base] <- [mid] <- (foo)
> 
> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> to an existing image on top,
> 
>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>                          |                              |
>                          |                              |
>                          |                              |
>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> 
> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
> 
>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>                          |                                                          |
>                          |                                                          |
>                          v                                                          v
> 
>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> 
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          >>>> drive-backup sync=none >>>>
> 
> So when guest writes to 'foo', the old data is moved to (hidden target), which
> remains unchanged from (bar)'s PoV.
> 
> The drive in the middle is called hidden because QEMU creates it automatically,
> the naming is arbitrary.
> 
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via NBD server for data scanning (read
> only) purpose.
> 
> More interestingly, with above facility, it is also possible to create a guest
> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> cheaply. Or call it shadow copy if you will.
> 
> Back to the COLO case, the configuration will be very similar:
> 
> 
>                       {primary wr}                                                {secondary vm}
>                             |                                                           |
>                             |                                                           |
>                             |                                                           |
>                             v                                                           v
> 
>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> 
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             >>>> drive-backup sync=none >>>>

What is active disk? There are two disk images?

Thanks
Wen Congyang

> 
> The workflow analogue is:
> 
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
> 
> Primary write requests are forwarded to secondary QEMU as well.
> 
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
> 
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the orignal sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
> 
>>>> +    3) Primary write requests will be written to Secondary disk.
> 
> Primary write requests are written to (nbd target).
> 
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
> 
> Secondary write request will be written in (active disk) as usual.
> 
> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> data in (hidden buf disk) and (active disk); when failover happends, if you
> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> drop data in (hidden buf disk).
> 
> Fam
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  8:44       ` Fam Zheng
  2015-02-12  9:33         ` Wen Congyang
@ 2015-02-12  9:36         ` Hongyang Yang
  2015-02-12  9:46           ` Fam Zheng
  2015-02-24  7:50         ` Wen Congyang
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 81+ messages in thread
From: Hongyang Yang @ 2015-02-12  9:36 UTC (permalink / raw)
  To: Fam Zheng, Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	jsnow, zhanghailiang

Hi Fam,

在 02/12/2015 04:44 PM, Fam Zheng 写道:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |      Copy and Forward        |             |
>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>> +                  |                      |       |             |
>>>> +                  |                     (3)      \-------------/
>>>> +                  |                 speculative      ^
>>>> +                  |                write through    (2)
>>>> +                  |                      |           |
>>>> +                  V                      V           |
>>>> +           +--------------+           +----------------+
>>>> +           | Primary Disk |           | Secondary Disk |
>>>> +           +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>> vm write to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary content
>> is saved in the buffer.
>
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
>
> When we have a normal backing chain,
>
>                 {virtio-blk dev 'foo'}
>                           |
>                           |
>                           |
>      [base] <- [mid] <- (foo)
>
> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> to an existing image on top,
>
>                 {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>                           |                              |
>                           |                              |
>                           |                              |
>      [base] <- [mid] <- (foo)  <---------------------- (bar)
>
> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
>
>                 {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>                           |                                                          |
>                           |                                                          |
>                           v                                                          v
>
>      [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>
>                           v                              ^
>                           v                              ^
>                           v                              ^
>                           v                              ^
>                           >>>> drive-backup sync=none >>>>
>
> So when guest writes to 'foo', the old data is moved to (hidden target), which
> remains unchanged from (bar)'s PoV.
>
> The drive in the middle is called hidden because QEMU creates it automatically,
> the naming is arbitrary.
>
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via NBD server for data scanning (read
> only) purpose.
>
> More interestingly, with above facility, it is also possible to create a guest
> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> cheaply. Or call it shadow copy if you will.
>
> Back to the COLO case, the configuration will be very similar:
>
>
>                        {primary wr}                                                {secondary vm}
>                              |                                                           |
>                              |                                                           |
>                              |                                                           |
>                              v                                                           v
>
>     [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>
>                              v                              ^
>                              v                              ^
>                              v                              ^
>                              v                              ^
>                              >>>> drive-backup sync=none >>>>
>
> The workflow analogue is:
>
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>
> Primary write requests are forwarded to secondary QEMU as well.
>
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the orignal sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
>
>>>> +    3) Primary write requests will be written to Secondary disk.
>
> Primary write requests are written to (nbd target).
>
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
>
> Secondary write request will be written in (active disk) as usual.
>
> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> data in (hidden buf disk) and (active disk); when failover happends, if you
> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> drop data in (hidden buf disk).

If I understand correctly, you split the Disk Buffer into a hidden buf disk +
an active disk. What we need to do is only to implement a buf disk(will be
used as hidden buf disk and active disk as mentioned), apart from this, we can
use the existing mechinism like backing-file/drive-backup?

>
> Fam
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  9:33         ` Wen Congyang
@ 2015-02-12  9:44           ` Fam Zheng
  2015-02-12 10:11             ` Wen Congyang
  0 siblings, 1 reply; 81+ messages in thread
From: Fam Zheng @ 2015-02-12  9:44 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On Thu, 02/12 17:33, Wen Congyang wrote:
> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> > On Thu, 02/12 15:40, Wen Congyang wrote:
> >> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>> Hi Congyang,
> >>>
> >>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>> +== Workflow ==
> >>>> +The following is the image of block replication workflow:
> >>>> +
> >>>> +        +----------------------+            +------------------------+
> >>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>> +        +----------------------+            +------------------------+
> >>>> +                  |                                       |
> >>>> +                  |                                      (4)
> >>>> +                  |                                       V
> >>>> +                  |                              /-------------\
> >>>> +                  |      Copy and Forward        |             |
> >>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>> +                  |                      |       |             |
> >>>> +                  |                     (3)      \-------------/
> >>>> +                  |                 speculative      ^
> >>>> +                  |                write through    (2)
> >>>> +                  |                      |           |
> >>>> +                  V                      V           |
> >>>> +           +--------------+           +----------------+
> >>>> +           | Primary Disk |           | Secondary Disk |
> >>>> +           +--------------+           +----------------+
> >>>> +
> >>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>> +       QEMU.
> >>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>> +       original sector content will be read from Secondary disk and
> >>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +       sector content in the Disk buffer.
> >>>
> >>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>> reading them as "s/will be/are/g"
> >>>
> >>> Why do you need this buffer?
> >>
> >> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >> vm write to the buffer.
> >>
> >>>
> >>> If both primary and secondary write to the same sector, what is saved in the
> >>> buffer?
> >>
> >> The primary content will be written to the secondary disk, and the secondary content
> >> is saved in the buffer.
> > 
> > I wonder if alternatively this is possible with an imaginary "writable backing
> > image" feature, as described below.
> > 
> > When we have a normal backing chain,
> > 
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> > 
> > Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> > to an existing image on top,
> > 
> >                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >                          |                              |
> >                          |                              |
> >                          |                              |
> >     [base] <- [mid] <- (foo)  <---------------------- (bar)
> > 
> > It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> > We can utilize an automatic hidden drive-backup target:
> > 
> >                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >                          |                                                          |
> >                          |                                                          |
> >                          v                                                          v
> > 
> >     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> > 
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          >>>> drive-backup sync=none >>>>
> > 
> > So when guest writes to 'foo', the old data is moved to (hidden target), which
> > remains unchanged from (bar)'s PoV.
> > 
> > The drive in the middle is called hidden because QEMU creates it automatically,
> > the naming is arbitrary.
> > 
> > It is interesting because it is a more generalized case of image fleecing,
> > where the (hidden target) is exposed via NBD server for data scanning (read
> > only) purpose.
> > 
> > More interestingly, with above facility, it is also possible to create a guest
> > visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> > cheaply. Or call it shadow copy if you will.
> > 
> > Back to the COLO case, the configuration will be very similar:
> > 
> > 
> >                       {primary wr}                                                {secondary vm}
> >                             |                                                           |
> >                             |                                                           |
> >                             |                                                           |
> >                             v                                                           v
> > 
> >    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> > 
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             >>>> drive-backup sync=none >>>>
> 
> What is active disk? There are two disk images?

It starts as an empty image with (hidden buf disk) as backing file, which in
turn has (nbd target) as backing file.

Fam

> 
> Thanks
> Wen Congyang
> 
> > 
> > The workflow analogue is:
> > 
> >>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>> +       QEMU.
> > 
> > Primary write requests are forwarded to secondary QEMU as well.
> > 
> >>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>> +       original sector content will be read from Secondary disk and
> >>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +       sector content in the Disk buffer.
> > 
> > Before Primary write requests are written to (nbd target), aka the Secondary
> > disk, the orignal sector content is read from it and copied to (hidden buf
> > disk) by drive-backup. It obviously will not overwrite the data in (active
> > disk).
> > 
> >>>> +    3) Primary write requests will be written to Secondary disk.
> > 
> > Primary write requests are written to (nbd target).
> > 
> >>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>> +       will overwrite the existing sector content in the buffer.
> > 
> > Secondary write request will be written in (active disk) as usual.
> > 
> > Finally, when checkpoint arrives, if you want to sync with primary, just drop
> > data in (hidden buf disk) and (active disk); when failover happends, if you
> > want to promote secondary vm, you can commit (active disk) to (nbd target), and
> > drop data in (hidden buf disk).
> > 
> > Fam
> > .
> > 
> 
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  9:36         ` Hongyang Yang
@ 2015-02-12  9:46           ` Fam Zheng
  0 siblings, 0 replies; 81+ messages in thread
From: Fam Zheng @ 2015-02-12  9:46 UTC (permalink / raw)
  To: Hongyang Yang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	jsnow, zhanghailiang

On Thu, 02/12 17:36, Hongyang Yang wrote:
> Hi Fam,
> 
> 在 02/12/2015 04:44 PM, Fam Zheng 写道:
> >On Thu, 02/12 15:40, Wen Congyang wrote:
> >>On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>Hi Congyang,
> >>>
> >>>On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>+== Workflow ==
> >>>>+The following is the image of block replication workflow:
> >>>>+
> >>>>+        +----------------------+            +------------------------+
> >>>>+        |Primary Write Requests|            |Secondary Write Requests|
> >>>>+        +----------------------+            +------------------------+
> >>>>+                  |                                       |
> >>>>+                  |                                      (4)
> >>>>+                  |                                       V
> >>>>+                  |                              /-------------\
> >>>>+                  |      Copy and Forward        |             |
> >>>>+                  |---------(1)----------+       | Disk Buffer |
> >>>>+                  |                      |       |             |
> >>>>+                  |                     (3)      \-------------/
> >>>>+                  |                 speculative      ^
> >>>>+                  |                write through    (2)
> >>>>+                  |                      |           |
> >>>>+                  V                      V           |
> >>>>+           +--------------+           +----------------+
> >>>>+           | Primary Disk |           | Secondary Disk |
> >>>>+           +--------------+           +----------------+
> >>>>+
> >>>>+    1) Primary write requests will be copied and forwarded to Secondary
> >>>>+       QEMU.
> >>>>+    2) Before Primary write requests are written to Secondary disk, the
> >>>>+       original sector content will be read from Secondary disk and
> >>>>+       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>+       sector content in the Disk buffer.
> >>>
> >>>I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>reading them as "s/will be/are/g"
> >>>
> >>>Why do you need this buffer?
> >>
> >>We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >>vm write to the buffer.
> >>
> >>>
> >>>If both primary and secondary write to the same sector, what is saved in the
> >>>buffer?
> >>
> >>The primary content will be written to the secondary disk, and the secondary content
> >>is saved in the buffer.
> >
> >I wonder if alternatively this is possible with an imaginary "writable backing
> >image" feature, as described below.
> >
> >When we have a normal backing chain,
> >
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> >
> >Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> >to an existing image on top,
> >
> >                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >                          |                              |
> >                          |                              |
> >                          |                              |
> >     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >
> >It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> >We can utilize an automatic hidden drive-backup target:
> >
> >                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >                          |                                                          |
> >                          |                                                          |
> >                          v                                                          v
> >
> >     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          >>>> drive-backup sync=none >>>>
> >
> >So when guest writes to 'foo', the old data is moved to (hidden target), which
> >remains unchanged from (bar)'s PoV.
> >
> >The drive in the middle is called hidden because QEMU creates it automatically,
> >the naming is arbitrary.
> >
> >It is interesting because it is a more generalized case of image fleecing,
> >where the (hidden target) is exposed via NBD server for data scanning (read
> >only) purpose.
> >
> >More interestingly, with above facility, it is also possible to create a guest
> >visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> >cheaply. Or call it shadow copy if you will.
> >
> >Back to the COLO case, the configuration will be very similar:
> >
> >
> >                       {primary wr}                                                {secondary vm}
> >                             |                                                           |
> >                             |                                                           |
> >                             |                                                           |
> >                             v                                                           v
> >
> >    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> >
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             >>>> drive-backup sync=none >>>>
> >
> >The workflow analogue is:
> >
> >>>>+    1) Primary write requests will be copied and forwarded to Secondary
> >>>>+       QEMU.
> >
> >Primary write requests are forwarded to secondary QEMU as well.
> >
> >>>>+    2) Before Primary write requests are written to Secondary disk, the
> >>>>+       original sector content will be read from Secondary disk and
> >>>>+       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>+       sector content in the Disk buffer.
> >
> >Before Primary write requests are written to (nbd target), aka the Secondary
> >disk, the orignal sector content is read from it and copied to (hidden buf
> >disk) by drive-backup. It obviously will not overwrite the data in (active
> >disk).
> >
> >>>>+    3) Primary write requests will be written to Secondary disk.
> >
> >Primary write requests are written to (nbd target).
> >
> >>>>+    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>>+       will overwrite the existing sector content in the buffer.
> >
> >Secondary write request will be written in (active disk) as usual.
> >
> >Finally, when checkpoint arrives, if you want to sync with primary, just drop
> >data in (hidden buf disk) and (active disk); when failover happends, if you
> >want to promote secondary vm, you can commit (active disk) to (nbd target), and
> >drop data in (hidden buf disk).
> 
> If I understand correctly, you split the Disk Buffer into a hidden buf disk +
> an active disk. What we need to do is only to implement a buf disk(will be
> used as hidden buf disk and active disk as mentioned), apart from this, we can
> use the existing mechinism like backing-file/drive-backup?
> 

Yes, but you need a separate driver to take care of the buffer logic as
introduced in this series, which is less generic, but does the same thing we
will need in the image fleecing use case.

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  9:44           ` Fam Zheng
@ 2015-02-12 10:11             ` Wen Congyang
  2015-02-12 10:26               ` famz
  0 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-12 10:11 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On 02/12/2015 05:44 PM, Fam Zheng wrote:
> On Thu, 02/12 17:33, Wen Congyang wrote:
>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>> Hi Congyang,
>>>>>
>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>> +== Workflow ==
>>>>>> +The following is the image of block replication workflow:
>>>>>> +
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +                  |                                       |
>>>>>> +                  |                                      (4)
>>>>>> +                  |                                       V
>>>>>> +                  |                              /-------------\
>>>>>> +                  |      Copy and Forward        |             |
>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>> +                  |                      |       |             |
>>>>>> +                  |                     (3)      \-------------/
>>>>>> +                  |                 speculative      ^
>>>>>> +                  |                write through    (2)
>>>>>> +                  |                      |           |
>>>>>> +                  V                      V           |
>>>>>> +           +--------------+           +----------------+
>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>> +           +--------------+           +----------------+
>>>>>> +
>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>> +       QEMU.
>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>> +       original sector content will be read from Secondary disk and
>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>> reading them as "s/will be/are/g"
>>>>>
>>>>> Why do you need this buffer?
>>>>
>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>> vm write to the buffer.
>>>>
>>>>>
>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>> buffer?
>>>>
>>>> The primary content will be written to the secondary disk, and the secondary content
>>>> is saved in the buffer.
>>>
>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>> image" feature, as described below.
>>>
>>> When we have a normal backing chain,
>>>
>>>                {virtio-blk dev 'foo'}
>>>                          |
>>>                          |
>>>                          |
>>>     [base] <- [mid] <- (foo)
>>>
>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>> to an existing image on top,
>>>
>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>                          |                              |
>>>                          |                              |
>>>                          |                              |
>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>
>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>> We can utilize an automatic hidden drive-backup target:
>>>
>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>                          |                                                          |
>>>                          |                                                          |
>>>                          v                                                          v
>>>
>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          >>>> drive-backup sync=none >>>>
>>>
>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>> remains unchanged from (bar)'s PoV.
>>>
>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>> the naming is arbitrary.
>>>
>>> It is interesting because it is a more generalized case of image fleecing,
>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>> only) purpose.
>>>
>>> More interestingly, with above facility, it is also possible to create a guest
>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>> cheaply. Or call it shadow copy if you will.
>>>
>>> Back to the COLO case, the configuration will be very similar:
>>>
>>>
>>>                       {primary wr}                                                {secondary vm}
>>>                             |                                                           |
>>>                             |                                                           |
>>>                             |                                                           |
>>>                             v                                                           v
>>>
>>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>
>>>                             v                              ^
>>>                             v                              ^
>>>                             v                              ^
>>>                             v                              ^
>>>                             >>>> drive-backup sync=none >>>>
>>
>> What is active disk? There are two disk images?
> 
> It starts as an empty image with (hidden buf disk) as backing file, which in
> turn has (nbd target) as backing file.

It's too complicated..., and I don't understand it.
1. What is active disk? Use raw or a new block driver?
2. Hidden buf disk use new block driver?
3. nbd target is hidden buf disk's backing image? If it is opened read-only, we will
   export a nbd with read-only BlockDriverState, but nbd server needs to write it.

Thanks
Wen Congyang

> 
> Fam
> 
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> The workflow analogue is:
>>>
>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>> +       QEMU.
>>>
>>> Primary write requests are forwarded to secondary QEMU as well.
>>>
>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>> +       original sector content will be read from Secondary disk and
>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>> +       sector content in the Disk buffer.
>>>
>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>> disk, the orignal sector content is read from it and copied to (hidden buf
>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>> disk).
>>>
>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>
>>> Primary write requests are written to (nbd target).
>>>
>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>> +       will overwrite the existing sector content in the buffer.
>>>
>>> Secondary write request will be written in (active disk) as usual.
>>>
>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>> data in (hidden buf disk) and (active disk); when failover happends, if you
>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>> drop data in (hidden buf disk).
>>>
>>> Fam
>>> .
>>>
>>
>>
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12 10:11             ` Wen Congyang
@ 2015-02-12 10:26               ` famz
  2015-02-13  5:09                 ` Wen Congyang
  2015-03-03  7:53                 ` Wen Congyang
  0 siblings, 2 replies; 81+ messages in thread
From: famz @ 2015-02-12 10:26 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On Thu, 02/12 18:11, Wen Congyang wrote:
> On 02/12/2015 05:44 PM, Fam Zheng wrote:
> > On Thu, 02/12 17:33, Wen Congyang wrote:
> >> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> >>> On Thu, 02/12 15:40, Wen Congyang wrote:
> >>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>>> Hi Congyang,
> >>>>>
> >>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>>> +== Workflow ==
> >>>>>> +The following is the image of block replication workflow:
> >>>>>> +
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +                  |                                       |
> >>>>>> +                  |                                      (4)
> >>>>>> +                  |                                       V
> >>>>>> +                  |                              /-------------\
> >>>>>> +                  |      Copy and Forward        |             |
> >>>>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>>>> +                  |                      |       |             |
> >>>>>> +                  |                     (3)      \-------------/
> >>>>>> +                  |                 speculative      ^
> >>>>>> +                  |                write through    (2)
> >>>>>> +                  |                      |           |
> >>>>>> +                  V                      V           |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +           | Primary Disk |           | Secondary Disk |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +
> >>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>> +       QEMU.
> >>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>> +       original sector content will be read from Secondary disk and
> >>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>> +       sector content in the Disk buffer.
> >>>>>
> >>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>>> reading them as "s/will be/are/g"
> >>>>>
> >>>>> Why do you need this buffer?
> >>>>
> >>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >>>> vm write to the buffer.
> >>>>
> >>>>>
> >>>>> If both primary and secondary write to the same sector, what is saved in the
> >>>>> buffer?
> >>>>
> >>>> The primary content will be written to the secondary disk, and the secondary content
> >>>> is saved in the buffer.
> >>>
> >>> I wonder if alternatively this is possible with an imaginary "writable backing
> >>> image" feature, as described below.
> >>>
> >>> When we have a normal backing chain,
> >>>
> >>>                {virtio-blk dev 'foo'}
> >>>                          |
> >>>                          |
> >>>                          |
> >>>     [base] <- [mid] <- (foo)
> >>>
> >>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> >>> to an existing image on top,
> >>>
> >>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >>>                          |                              |
> >>>                          |                              |
> >>>                          |                              |
> >>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >>>
> >>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> >>> We can utilize an automatic hidden drive-backup target:
> >>>
> >>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >>>                          |                                                          |
> >>>                          |                                                          |
> >>>                          v                                                          v
> >>>
> >>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >>>
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          >>>> drive-backup sync=none >>>>
> >>>
> >>> So when guest writes to 'foo', the old data is moved to (hidden target), which
> >>> remains unchanged from (bar)'s PoV.
> >>>
> >>> The drive in the middle is called hidden because QEMU creates it automatically,
> >>> the naming is arbitrary.
> >>>
> >>> It is interesting because it is a more generalized case of image fleecing,
> >>> where the (hidden target) is exposed via NBD server for data scanning (read
> >>> only) purpose.
> >>>
> >>> More interestingly, with above facility, it is also possible to create a guest
> >>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> >>> cheaply. Or call it shadow copy if you will.
> >>>
> >>> Back to the COLO case, the configuration will be very similar:
> >>>
> >>>
> >>>                       {primary wr}                                                {secondary vm}
> >>>                             |                                                           |
> >>>                             |                                                           |
> >>>                             |                                                           |
> >>>                             v                                                           v
> >>>
> >>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> >>>
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             >>>> drive-backup sync=none >>>>
> >>
> >> What is active disk? There are two disk images?
> > 
> > It starts as an empty image with (hidden buf disk) as backing file, which in
> > turn has (nbd target) as backing file.
> 
> It's too complicated..., and I don't understand it.
> 1. What is active disk? Use raw or a new block driver?

It is an empty qcow2 image with the same lenght as your Secondary Disk.

> 2. Hidden buf disk use new block driver?

It is an empty qcow2 image with the same lenght as your Secondary Disk, too.

> 3. nbd target is hidden buf disk's backing image? If it is opened read-only, we will
>    export a nbd with read-only BlockDriverState, but nbd server needs to write it.

NBD target is your Secondary Disk. It is opened read-write.

The patches to enable opening it as read-write, and starting drive-backup
between it and hidden buf disk, are all work in progress (the core concept) of
image fleecing.

Fam

> >>>
> >>> The workflow analogue is:
> >>>
> >>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>> +       QEMU.
> >>>
> >>> Primary write requests are forwarded to secondary QEMU as well.
> >>>
> >>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>> +       original sector content will be read from Secondary disk and
> >>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>> +       sector content in the Disk buffer.
> >>>
> >>> Before Primary write requests are written to (nbd target), aka the Secondary
> >>> disk, the orignal sector content is read from it and copied to (hidden buf
> >>> disk) by drive-backup. It obviously will not overwrite the data in (active
> >>> disk).
> >>>
> >>>>>> +    3) Primary write requests will be written to Secondary disk.
> >>>
> >>> Primary write requests are written to (nbd target).
> >>>
> >>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>>>> +       will overwrite the existing sector content in the buffer.
> >>>
> >>> Secondary write request will be written in (active disk) as usual.
> >>>
> >>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> >>> data in (hidden buf disk) and (active disk); when failover happends, if you
> >>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> >>> drop data in (hidden buf disk).
> >>>
> >>> Fam
> >>> .
> >>>
> >>
> >>
> > .
> > 
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12 10:26               ` famz
@ 2015-02-13  5:09                 ` Wen Congyang
  2015-02-13  7:01                   ` Fam Zheng
  2015-03-03  7:53                 ` Wen Congyang
  1 sibling, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-13  5:09 UTC (permalink / raw)
  To: famz
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On 02/12/2015 06:26 PM, famz@redhat.com wrote:
> On Thu, 02/12 18:11, Wen Congyang wrote:
>> On 02/12/2015 05:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 17:33, Wen Congyang wrote:
>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>> Hi Congyang,
>>>>>>>
>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>> +== Workflow ==
>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>> +
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +                  |                                       |
>>>>>>>> +                  |                                      (4)
>>>>>>>> +                  |                                       V
>>>>>>>> +                  |                              /-------------\
>>>>>>>> +                  |      Copy and Forward        |             |
>>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>>>> +                  |                      |       |             |
>>>>>>>> +                  |                     (3)      \-------------/
>>>>>>>> +                  |                 speculative      ^
>>>>>>>> +                  |                write through    (2)
>>>>>>>> +                  |                      |           |
>>>>>>>> +                  V                      V           |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>
>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>> reading them as "s/will be/are/g"
>>>>>>>
>>>>>>> Why do you need this buffer?
>>>>>>
>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>> vm write to the buffer.
>>>>>>
>>>>>>>
>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>> buffer?
>>>>>>
>>>>>> The primary content will be written to the secondary disk, and the secondary content
>>>>>> is saved in the buffer.
>>>>>
>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>> image" feature, as described below.
>>>>>
>>>>> When we have a normal backing chain,
>>>>>
>>>>>                {virtio-blk dev 'foo'}
>>>>>                          |
>>>>>                          |
>>>>>                          |
>>>>>     [base] <- [mid] <- (foo)
>>>>>
>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>> to an existing image on top,
>>>>>
>>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>>>
>>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>
>>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>                          |                                                          |
>>>>>                          |                                                          |
>>>>>                          v                                                          v
>>>>>
>>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>>>
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>
>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>> remains unchanged from (bar)'s PoV.
>>>>>
>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>> the naming is arbitrary.
>>>>>
>>>>> It is interesting because it is a more generalized case of image fleecing,
>>>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>>>> only) purpose.
>>>>>
>>>>> More interestingly, with above facility, it is also possible to create a guest
>>>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>>>> cheaply. Or call it shadow copy if you will.
>>>>>
>>>>> Back to the COLO case, the configuration will be very similar:
>>>>>
>>>>>
>>>>>                       {primary wr}                                                {secondary vm}
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             v                                                           v
>>>>>
>>>>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>>>
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             >>>> drive-backup sync=none >>>>
>>>>
>>>> What is active disk? There are two disk images?
>>>
>>> It starts as an empty image with (hidden buf disk) as backing file, which in
>>> turn has (nbd target) as backing file.
>>
>> It's too complicated..., and I don't understand it.
>> 1. What is active disk? Use raw or a new block driver?
> 
> It is an empty qcow2 image with the same lenght as your Secondary Disk.
> 
>> 2. Hidden buf disk use new block driver?
> 
> It is an empty qcow2 image with the same lenght as your Secondary Disk, too.
> 
>> 3. nbd target is hidden buf disk's backing image? If it is opened read-only, we will
>>    export a nbd with read-only BlockDriverState, but nbd server needs to write it.
> 
> NBD target is your Secondary Disk. It is opened read-write.
> 
> The patches to enable opening it as read-write, and starting drive-backup
> between it and hidden buf disk, are all work in progress (the core concept) of
> image fleecing.

What is image fleecing? Are you implementing it now?

Thanks
Wen Congyang

> 
> Fam
> 
>>>>>
>>>>> The workflow analogue is:
>>>>>
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>
>>>>> Primary write requests are forwarded to secondary QEMU as well.
>>>>>
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>>>> disk, the orignal sector content is read from it and copied to (hidden buf
>>>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>>>> disk).
>>>>>
>>>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>>>
>>>>> Primary write requests are written to (nbd target).
>>>>>
>>>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>>>> +       will overwrite the existing sector content in the buffer.
>>>>>
>>>>> Secondary write request will be written in (active disk) as usual.
>>>>>
>>>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>>>> data in (hidden buf disk) and (active disk); when failover happends, if you
>>>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>>>> drop data in (hidden buf disk).
>>>>>
>>>>> Fam
>>>>> .
>>>>>
>>>>
>>>>
>>> .
>>>
>>
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-13  5:09                 ` Wen Congyang
@ 2015-02-13  7:01                   ` Fam Zheng
  2015-02-13 20:29                     ` John Snow
  0 siblings, 1 reply; 81+ messages in thread
From: Fam Zheng @ 2015-02-13  7:01 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On Fri, 02/13 13:09, Wen Congyang wrote:
> What is image fleecing?
> 

It's the name of the feature which enables the built-in NBD server to exporting
a thin point-in-time snapshot created via drive-backup sync=none.

It's for host side data scanning tool to access a disk snapshot of running VM.
The workflow in theory is:

1. guest uses "disk0" as its virtio-blk device.

2. in qmp, use blockdev-add (drive-backup) to add an empty "target0" qcow2
image, that uses "disk0" as its backing file, and use nbd-server-add to export
this empty image with NBD. This way, all reads coming from NBD client will
produce data of "disk0".

3. in qmp, start blockdev-backup from "disk0" to "target0" with "sync=none".
After this point, all guest data written to "disk0" will COW the original data
to "target0", in other words, reading "target0" will effectively produce a
point-in-time snapshot of the time when blockdev-backup started.

4. after step 3, the disk data seen by the NBD client is the stable snapshot.
Because of the COW mechanism in blockdev-backup, "target0" is thin, and can be
dropped once the inspection process is done.

>  Are you implementing it now?

I worked on it. Most parts of the series is merged, the remaining part is
relatively small, namely to

1) enable adding "target0" in step 2 (currently in blockdev-add it's not
possible to reference an existing drive as backing file);

2) enable "blockdev-backup" from "disk0" to "target0", which is obviously not
possible because 1) is not done.

I do have the patches at my tree, just that they need to be refreshed. :)

https://github.com/famz/qemu/tree/image-fleecing

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-13  7:01                   ` Fam Zheng
@ 2015-02-13 20:29                     ` John Snow
  0 siblings, 0 replies; 81+ messages in thread
From: John Snow @ 2015-02-13 20:29 UTC (permalink / raw)
  To: Fam Zheng, Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, zhanghailiang



On 02/13/2015 02:01 AM, Fam Zheng wrote:
> On Fri, 02/13 13:09, Wen Congyang wrote:
>> What is image fleecing?
>>
>
> It's the name of the feature which enables the built-in NBD server to exporting
> a thin point-in-time snapshot created via drive-backup sync=none.
>
> It's for host side data scanning tool to access a disk snapshot of running VM.
> The workflow in theory is:
>
> 1. guest uses "disk0" as its virtio-blk device.
>
> 2. in qmp, use blockdev-add (drive-backup) to add an empty "target0" qcow2
> image, that uses "disk0" as its backing file, and use nbd-server-add to export
> this empty image with NBD. This way, all reads coming from NBD client will
> produce data of "disk0".
>
> 3. in qmp, start blockdev-backup from "disk0" to "target0" with "sync=none".
> After this point, all guest data written to "disk0" will COW the original data
> to "target0", in other words, reading "target0" will effectively produce a
> point-in-time snapshot of the time when blockdev-backup started.
>
> 4. after step 3, the disk data seen by the NBD client is the stable snapshot.
> Because of the COW mechanism in blockdev-backup, "target0" is thin, and can be
> dropped once the inspection process is done.
>
>>   Are you implementing it now?
>
> I worked on it. Most parts of the series is merged, the remaining part is
> relatively small, namely to
>
> 1) enable adding "target0" in step 2 (currently in blockdev-add it's not
> possible to reference an existing drive as backing file);
>
> 2) enable "blockdev-backup" from "disk0" to "target0", which is obviously not
> possible because 1) is not done.
>
> I do have the patches at my tree, just that they need to be refreshed. :)
>
> https://github.com/famz/qemu/tree/image-fleecing
>
> Fam
>

I had intended to pick up these patches after I got incremental backup 
working, as Fam had started both of these projects and I inherited them 
-- though I hadn't begun work in earnest on refining and testing this 
particular feature yet.

--js

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints
  2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (13 preceding siblings ...)
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 14/14] COLO: implement a new block driver Wen Congyang
@ 2015-02-18 16:26 ` Paolo Bonzini
  14 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-02-18 16:26 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Max Reitz, Yang Hongyang



On 12/02/2015 04:07, Wen Congyang wrote:
> Wen Congyang (14):
>   docs: block replication's description
>   quorom: add a new read pattern
>   quorum: ignore 0-length child
>   Add new block driver interfaces to control disk replication
>   quorom: implement block driver interfaces for block replication
>   NBD client: connect to nbd server later
>   NBD client: implement block driver interfaces for block replication
>   block: add a new API to create a hidden BlockBackend
>   block: give backing image its own BlockBackend
>   allow the backing image access the origin BlockDriverState
>   allow writing to the backing file
>   Add disk buffer for block replication
>   COW: move cow interfaces to a seperate file
>   COLO: implement a new block driver

Hi Wen, sorry for the delay.

Kevin and Max need to review this series, as they are the most
comfortable with BlockDriverState vs. BlockBackend.

I suspect you cannot unconditionally give a separate BlockBackend to
each backing image, but they can hopefully suggest how to proceed.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 02/14] quorom: add a new read pattern
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 02/14] quorom: add a new read pattern Wen Congyang
  2015-02-12  6:42   ` Gonglei
@ 2015-02-23 20:36   ` Max Reitz
  2015-02-23 21:56   ` Eric Blake
  2 siblings, 0 replies; 81+ messages in thread
From: Max Reitz @ 2015-02-23 20:36 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Luiz Capitulino, Gonglei, Yang Hongyang, Michael Roth,
	zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> To block replication, we only need to read from the first child.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
> ---
>   block/quorum.c       | 5 +++--
>   qapi/block-core.json | 4 +++-
>   2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index 437b122..5ed1ff8 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -286,9 +286,10 @@ static void quorum_aio_cb(void *opaque, int ret)
>       BDRVQuorumState *s = acb->common.bs->opaque;
>       bool rewrite = false;
>   
> -    if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
> +    if (acb->is_read && s->read_pattern != QUORUM_READ_PATTERN_QUORUM) {

Maybe I'd prefer "&& (s->read_pattern == QUORUM_READ_PATTERN_FIFO || 
s->read_pattern == QUORUM_READ_PATTERN_FIRST)"; but it does fit with 
what quorum_aio_readv() does, so I'm fine with it.

>           /* We try to read next child in FIFO order if we fail to read */

However, I think this comment should be modified, because in fact we do 
not try to read the next child if s->read_pattern == 
QUORUM_READ_PATTERN_FIRST.

> -        if (ret < 0 && ++acb->child_iter < s->num_children) {
> +        if (s->read_pattern == QUORUM_READ_PATTERN_FIFO &&
> +            ret < 0 && ++acb->child_iter < s->num_children) {
>               read_fifo_child(acb);
>               return;
>           }
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index a3fdaf0..d6382e9 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1618,9 +1618,11 @@
>   #
>   # @fifo: read only from the first child that has not failed
>   #
> +# @first: read only from the first child

There should be version info here (like "(Since 2.3)").

Max

> +#
>   # Since: 2.2
>   ##
> -{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
> +{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo', 'first' ] }
>   
>   ##
>   # @BlockdevOptionsQuorum

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 03/14] quorum: ignore 0-length child
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 03/14] quorum: ignore 0-length child Wen Congyang
@ 2015-02-23 20:43   ` Max Reitz
  2015-02-24  2:33     ` Wen Congyang
  2015-03-18  5:29     ` Wen Congyang
  0 siblings, 2 replies; 81+ messages in thread
From: Max Reitz @ 2015-02-23 20:43 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> We connect to NBD server when starting block replication, so
> the length is 0 before starting block replication.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block/quorum.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index 5ed1ff8..e6aff5f 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -734,6 +734,11 @@ static int64_t quorum_getlength(BlockDriverState *bs)
>           if (value < 0) {
>               return value;
>           }
> +
> +        if (!value) {
> +            continue;
> +        }
> +
>           if (value != result) {
>               return -EIO;
>           }

Hm, what do you think about some specific error value returned by your 
delayed NBD implementation? Like -ENOTCONN or something like that? Then 
we'd be able to discern a real 0-length block device from a 
not-yet-connected NBD server.

Also, while you did write that one shouldn't be using the NBD client as 
the first quorum child, I think we should try to support that case 
anyway. For this patch, that means accepting that 
bdrv_getlength(s->bs[0]) may be off.

Max

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 04/14] Add new block driver interfaces to control disk replication
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 04/14] Add new block driver interfaces to control disk replication Wen Congyang
@ 2015-02-23 20:57   ` Max Reitz
  2015-02-23 21:58     ` Eric Blake
  0 siblings, 1 reply; 81+ messages in thread
From: Max Reitz @ 2015-02-23 20:57 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block.c                   | 36 ++++++++++++++++++++++++++++++++++++
>   include/block/block.h     | 10 ++++++++++
>   include/block/block_int.h | 12 ++++++++++++
>   3 files changed, 58 insertions(+)
>
> diff --git a/block.c b/block.c
> index 210fd5f..2335af1 100644
> --- a/block.c
> +++ b/block.c
> @@ -6156,3 +6156,39 @@ BlockAcctStats *bdrv_get_stats(BlockDriverState *bs)
>   {
>       return &bs->stats;
>   }
> +
> +int bdrv_start_replication(BlockDriverState *bs, int mode)
> +{
> +    BlockDriver *drv = bs->drv;
> +    if (drv && drv->bdrv_start_replication) {
> +        return drv->bdrv_start_replication(bs, mode);
> +    } else if (bs->file) {
> +        return bdrv_start_replication(bs->file, mode);
> +    }
> +
> +    return -1;

I'd prefer returning -errno here (like -ENOTSUP). Alternatively, you may 
want to use Error objects (which would probably actually be the 
preferable way).

> +}
> +
> +int bdrv_do_checkpoint(BlockDriverState *bs)
> +{
> +    BlockDriver *drv = bs->drv;
> +    if (drv && drv->bdrv_do_checkpoint) {
> +        return drv->bdrv_do_checkpoint(bs);
> +    } else if (bs->file) {
> +        return bdrv_do_checkpoint(bs->file);
> +    }
> +
> +    return -1;

Same here.

> +}
> +
> +int bdrv_stop_replication(BlockDriverState *bs)
> +{
> +    BlockDriver *drv = bs->drv;
> +    if (drv && drv->bdrv_stop_replication) {
> +        return drv->bdrv_stop_replication(bs);
> +    } else if (bs->file) {
> +        return bdrv_stop_replication(bs->file);
> +    }
> +
> +    return -1;

And here.

> +}
> diff --git a/include/block/block.h b/include/block/block.h
> index 321295e..632b9fc 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -557,4 +557,14 @@ void bdrv_flush_io_queue(BlockDriverState *bs);
>   
>   BlockAcctStats *bdrv_get_stats(BlockDriverState *bs);
>   
> +/* Checkpoint control, called in migration/checkpoint thread */
> +enum {
> +    COLO_UNPROTECTED_MODE = 0,
> +    COLO_PRIMARY_MODE,
> +    COLO_SECONDARY_MODE,
> +};

I have a feeling that you may want to define these values through QAPI...

There's nothing wrong with this patch, but I don't yet really know what 
you want to do with these functions (the doc didn't really help me with 
them), so I'll have to look into the rest of the series before I can 
really say something useful about it.

Max

> +int bdrv_start_replication(BlockDriverState *bs, int mode);
> +int bdrv_do_checkpoint(BlockDriverState *bs);
> +int bdrv_stop_replication(BlockDriverState *bs);
> +
>   #endif
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 7ad1950..603f704 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -273,6 +273,18 @@ struct BlockDriver {
>       void (*bdrv_io_unplug)(BlockDriverState *bs);
>       void (*bdrv_flush_io_queue)(BlockDriverState *bs);
>   
> +
> +    /* Checkpoint control, called in migration/checkpoint thread */
> +    int (*bdrv_start_replication)(BlockDriverState *bs, int mode);
> +    /*
> +     * Drop Disk buffer when doing checkpoint.
> +     */
> +    int (*bdrv_do_checkpoint)(BlockDriverState *bs);
> +    /* After failover, we should flush Disk buffer into secondary disk
> +     * and stop block replication.
> +     */
> +    int (*bdrv_stop_replication)(BlockDriverState *bs);
> +
>       QLIST_ENTRY(BlockDriver) list;
>   };
>   

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 05/14] quorom: implement block driver interfaces for block replication
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 05/14] quorom: implement block driver interfaces for block replication Wen Congyang
@ 2015-02-23 21:22   ` Max Reitz
  0 siblings, 0 replies; 81+ messages in thread
From: Max Reitz @ 2015-02-23 21:22 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block/quorum.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 69 insertions(+)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index e6aff5f..c8479b4 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -1070,6 +1070,71 @@ static void quorum_refresh_filename(BlockDriverState *bs)
>       bs->full_open_options = opts;
>   }
>   
> +static int quorum_stop_replication(BlockDriverState *bs);
> +static int quorum_start_replication(BlockDriverState *bs, int mode)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    int ret = -1, i;

Again, I'd prefer it if you used -errno (or the Error API).

> +
> +    /*
> +     * TODO: support COLO_SECONDARY_MODE if we allow secondary
> +     * QEMU becoming primary QEMU.
> +     */
> +    if (mode != COLO_PRIMARY_MODE) {
> +        return -1;
> +    }
> +
> +    if (s->read_pattern != QUORUM_READ_PATTERN_FIRST) {
> +        return -1;
> +    }
> +
> +    /* NBD client should not be the first child */
> +    if (bdrv_start_replication(s->bs[0], mode) == 0) {

If you allow the NBD client to be the first child you can probably drop 
this block (and start from "i = 0" in the for loop).

> +        bdrv_stop_replication(s->bs[0]);
> +        return -1;
> +    }
> +
> +    for (i = 1; i < s->num_children; i++) {
> +        if (bdrv_start_replication(s->bs[i], mode) == 0) {
> +            ret++;
> +        }
> +    }
> +
> +    if (ret > 0) {
> +        quorum_stop_replication(bs);
> +    }

I think it would be easier to read if you had an additional "count" 
variable which is set to 0 before the for loop and then incremented 
(instead of ret). This would then be "if (count > 1)".

> +
> +    return ret ? -1 : 0;

And this would be "return count == 0 ? 0 : -ENOTSUP" or something.

But apart from that, what's so bad about having multiple children which 
support bdrv_start_replication()? I mean, other than "It's not what we 
intended".

Max

> +}
> +
> +static int quorum_do_checkpoint(BlockDriverState *bs)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    int i;
> +
> +    for (i = 1; i < s->num_children; i++) {
> +        if (bdrv_do_checkpoint(s->bs[i]) == 0) {
> +            return 0;
> +        }
> +    }
> +
> +    return -1;
> +}
> +
> +static int quorum_stop_replication(BlockDriverState *bs)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    int ret = -1, i;
> +
> +    for (i = 0; i < s->num_children; i++) {
> +        if (bdrv_stop_replication(s->bs[i]) == 0) {
> +            ret++;
> +        }
> +    }
> +
> +    return ret ? -1 : 0;
> +}
> +
>   static BlockDriver bdrv_quorum = {
>       .format_name                        = "quorum",
>       .protocol_name                      = "quorum",
> @@ -1093,6 +1158,10 @@ static BlockDriver bdrv_quorum = {
>   
>       .is_filter                          = true,
>       .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
> +
> +    .bdrv_start_replication             = quorum_start_replication,
> +    .bdrv_do_checkpoint                 = quorum_do_checkpoint,
> +    .bdrv_stop_replication              = quorum_stop_replication,
>   };
>   
>   static void bdrv_quorum_init(void)

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/14] NBD client: connect to nbd server later
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 06/14] NBD client: connect to nbd server later Wen Congyang
@ 2015-02-23 21:31   ` Max Reitz
  2015-02-25  2:23     ` Wen Congyang
  0 siblings, 1 reply; 81+ messages in thread
From: Max Reitz @ 2015-02-23 21:31 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> The secondary qemu starts later than the primary qemu, so we
> cannot connect to nbd server in bdrv_open().
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block/nbd.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 87 insertions(+), 13 deletions(-)
>
> diff --git a/block/nbd.c b/block/nbd.c
> index b05d1d0..19b9200 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -44,6 +44,8 @@
>   typedef struct BDRVNBDState {
>       NbdClientSession client;
>       QemuOpts *socket_opts;
> +    char *export;
> +    bool connected;
>   } BDRVNBDState;
>   
>   static int nbd_parse_uri(const char *filename, QDict *options)
> @@ -247,20 +249,10 @@ static int nbd_establish_connection(BlockDriverState *bs, Error **errp)
>       return sock;
>   }
>   
> -static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
> -                    Error **errp)
> +static int nbd_connect_server(BlockDriverState *bs, Error **errp)
>   {
>       BDRVNBDState *s = bs->opaque;
> -    char *export = NULL;
>       int result, sock;
> -    Error *local_err = NULL;
> -
> -    /* Pop the config into our state object. Exit if invalid. */
> -    nbd_config(s, options, &export, &local_err);
> -    if (local_err) {
> -        error_propagate(errp, local_err);
> -        return -EINVAL;
> -    }
>   
>       /* establish TCP connection, return error if it fails
>        * TODO: Configurable retry-until-timeout behaviour.
> @@ -271,16 +263,57 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
>       }
>   
>       /* NBD handshake */
> -    result = nbd_client_session_init(&s->client, bs, sock, export, errp);
> -    g_free(export);
> +    result = nbd_client_session_init(&s->client, bs, sock, s->export, errp);
> +    g_free(s->export);
> +    s->export = NULL;
> +    if (!result) {
> +        s->connected = true;
> +    }
> +
>       return result;
>   }
>   
> +static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
> +                    Error **errp)
> +{
> +    BDRVNBDState *s = bs->opaque;
> +    Error *local_err = NULL;
> +
> +    /* Pop the config into our state object. Exit if invalid. */
> +    nbd_config(s, options, &s->export, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return -EINVAL;
> +    }
> +
> +    return nbd_connect_server(bs, errp);
> +}
> +
> +static int nbd_open_colo(BlockDriverState *bs, QDict *options, int flags,
> +                         Error **errp)
> +{
> +    BDRVNBDState *s = bs->opaque;
> +    Error *local_err = NULL;
> +
> +    /* Pop the config into our state object. Exit if invalid. */
> +    nbd_config(s, options, &s->export, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> +
>   static int nbd_co_readv(BlockDriverState *bs, int64_t sector_num,
>                           int nb_sectors, QEMUIOVector *qiov)
>   {
>       BDRVNBDState *s = bs->opaque;
>   
> +    if (!s->connected) {
> +        return -EIO;
> +    }
> +
>       return nbd_client_session_co_readv(&s->client, sector_num,
>                                          nb_sectors, qiov);
>   }
> @@ -290,6 +323,10 @@ static int nbd_co_writev(BlockDriverState *bs, int64_t sector_num,
>   {
>       BDRVNBDState *s = bs->opaque;
>   
> +    if (!s->connected) {
> +        return 0;
> +    }

Would it break anything to return -EIO here as well? (And in all the 
following functions)

> +
>       return nbd_client_session_co_writev(&s->client, sector_num,
>                                           nb_sectors, qiov);
>   }
> @@ -298,6 +335,10 @@ static int nbd_co_flush(BlockDriverState *bs)
>   {
>       BDRVNBDState *s = bs->opaque;
>   
> +    if (!s->connected) {
> +        return 0;
> +    }
> +
>       return nbd_client_session_co_flush(&s->client);
>   }
>   
> @@ -312,6 +353,10 @@ static int nbd_co_discard(BlockDriverState *bs, int64_t sector_num,
>   {
>       BDRVNBDState *s = bs->opaque;
>   
> +    if (!s->connected) {
> +        return 0;
> +    }
> +
>       return nbd_client_session_co_discard(&s->client, sector_num,
>                                            nb_sectors);
>   }
> @@ -322,6 +367,7 @@ static void nbd_close(BlockDriverState *bs)
>   
>       qemu_opts_del(s->socket_opts);
>       nbd_client_session_close(&s->client);
> +    s->connected = false;
>   }
>   

As I proposed before, can you make nbd_getlength() return -ENOTCONN or 
something unique in case s->connected is false? I think that'd be better 
than returning 0 (which is a valid value).

Max

>   static int64_t nbd_getlength(BlockDriverState *bs)
> @@ -335,6 +381,10 @@ static void nbd_detach_aio_context(BlockDriverState *bs)
>   {
>       BDRVNBDState *s = bs->opaque;
>   
> +    if (!s->connected) {
> +        return;
> +    }
> +
>       nbd_client_session_detach_aio_context(&s->client);
>   }
>   
> @@ -343,6 +393,10 @@ static void nbd_attach_aio_context(BlockDriverState *bs,
>   {
>       BDRVNBDState *s = bs->opaque;
>   
> +    if (!s->connected) {
> +        return;
> +    }
> +
>       nbd_client_session_attach_aio_context(&s->client, new_context);
>   }
>   
> @@ -445,11 +499,31 @@ static BlockDriver bdrv_nbd_unix = {
>       .bdrv_refresh_filename      = nbd_refresh_filename,
>   };
>   
> +static BlockDriver bdrv_nbd_colo = {
> +    .format_name                = "nbd+colo",
> +    .protocol_name              = "nbd+colo",
> +    .instance_size              = sizeof(BDRVNBDState),
> +    .bdrv_parse_filename        = nbd_parse_filename,
> +    .bdrv_file_open             = nbd_open_colo,
> +    .bdrv_co_readv              = nbd_co_readv,
> +    .bdrv_co_writev             = nbd_co_writev,
> +    .bdrv_close                 = nbd_close,
> +    .bdrv_co_flush_to_os        = nbd_co_flush,
> +    .bdrv_co_discard            = nbd_co_discard,
> +    .bdrv_getlength             = nbd_getlength,
> +    .bdrv_detach_aio_context    = nbd_detach_aio_context,
> +    .bdrv_attach_aio_context    = nbd_attach_aio_context,
> +    .bdrv_refresh_filename      = nbd_refresh_filename,
> +
> +    .has_variable_length        = true,
> +};
> +
>   static void bdrv_nbd_init(void)
>   {
>       bdrv_register(&bdrv_nbd);
>       bdrv_register(&bdrv_nbd_tcp);
>       bdrv_register(&bdrv_nbd_unix);
> +    bdrv_register(&bdrv_nbd_colo);
>   }
>   
>   block_init(bdrv_nbd_init);

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 07/14] NBD client: implement block driver interfaces for block replication
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 07/14] NBD client: implement block driver interfaces for block replication Wen Congyang
@ 2015-02-23 21:41   ` Max Reitz
  2015-02-26 14:08     ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Max Reitz @ 2015-02-23 21:41 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block/nbd.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 55 insertions(+)

So by now to me it looks like you're using bdrv_start_replication() on 
the primary VM to start transferring data through the NBD connection to 
the secondary VM.

I guess this is what you were discussing with Fam and John, whether 
there'd be a better way to do it by using functionality that is already 
(or is about to become) part of qemu, right?

> diff --git a/block/nbd.c b/block/nbd.c
> index 19b9200..1ff6ecf 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -445,6 +445,58 @@ static void nbd_refresh_filename(BlockDriverState *bs)
>       bs->full_open_options = opts;
>   }
>   
> +static int nbd_start_replication(BlockDriverState *bs, int mode)
> +{
> +    BDRVNBDState *s = bs->opaque;
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    /*
> +     * TODO: support COLO_SECONDARY_MODE if we allow secondary
> +     * QEMU becoming primary QEMU.
> +     */
> +    if (mode != COLO_PRIMARY_MODE) {
> +        return -1;

Once again, I'd like -ENOTSUP more (or -EINVAL or whatever you prefer).

> +    }
> +
> +    if (s->connected) {
> +        return -1;
> +    }
> +
> +    /* TODO: NBD client should be one child of quorum, how to verify it? */

Again, why would you care about that? Other than "It's how it's intended 
to be used".

> +    ret = nbd_connect_server(bs, &local_err);
> +    if (local_err) {
> +        error_free(local_err);
> +    }

If you'd use the Error API you'd be able to propagate the error.

Max

> +
> +    return ret;
> +}
> +
> +static int nbd_do_checkpoint(BlockDriverState *bs)
> +{
> +    BDRVNBDState *s = bs->opaque;
> +
> +    if (!s->connected) {
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static int nbd_stop_replication(BlockDriverState *bs)
> +{
> +    BDRVNBDState *s = bs->opaque;
> +
> +    if (!s->connected) {
> +        return -1;
> +    }
> +
> +    nbd_client_session_close(&s->client);
> +    s->connected = false;
> +
> +    return 0;
> +}
> +
>   static BlockDriver bdrv_nbd = {
>       .format_name                = "nbd",
>       .protocol_name              = "nbd",
> @@ -514,6 +566,9 @@ static BlockDriver bdrv_nbd_colo = {
>       .bdrv_detach_aio_context    = nbd_detach_aio_context,
>       .bdrv_attach_aio_context    = nbd_attach_aio_context,
>       .bdrv_refresh_filename      = nbd_refresh_filename,
> +    .bdrv_start_replication     = nbd_start_replication,
> +    .bdrv_do_checkpoint         = nbd_do_checkpoint,
> +    .bdrv_stop_replication      = nbd_stop_replication,
>   
>       .has_variable_length        = true,
>   };

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 08/14] block: add a new API to create a hidden BlockBackend
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 08/14] block: add a new API to create a hidden BlockBackend Wen Congyang
@ 2015-02-23 21:48   ` Max Reitz
  0 siblings, 0 replies; 81+ messages in thread
From: Max Reitz @ 2015-02-23 21:48 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block/block-backend.c          | 29 ++++++++++++++++++++++++++++-
>   include/sysemu/block-backend.h |  2 ++
>   2 files changed, 30 insertions(+), 1 deletion(-)

Hm, I'm currently working on a series that (among other things) separate 
the list of monitor-owned BlockBackends and the list of all BBs. That 
should be helpful here; but Paolo said it might not be necessary to 
create a BB at all, so... Maybe he's right. I'll have to look into the 
remaining patches first.

Max

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/14] block: give backing image its own BlockBackend
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 09/14] block: give backing image its own BlockBackend Wen Congyang
@ 2015-02-23 21:53   ` Max Reitz
  0 siblings, 0 replies; 81+ messages in thread
From: Max Reitz @ 2015-02-23 21:53 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)

Our current stance on BlockBackends is (as far as I know, anyway) that a 
BlockBackend always comes with a user. In case you're creating a 
BlockBackend through -drive or blockdev-add, the user is the monitor 
(which can use it to attach it to a device, for instance). In this case, 
there is no user and nobody holds a reference to the BB (other than the 
BDS, but that doesn't count).

Therefore, this patch doesn't look quite right.

Max

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 02/14] quorom: add a new read pattern
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 02/14] quorom: add a new read pattern Wen Congyang
  2015-02-12  6:42   ` Gonglei
  2015-02-23 20:36   ` Max Reitz
@ 2015-02-23 21:56   ` Eric Blake
  2 siblings, 0 replies; 81+ messages in thread
From: Eric Blake @ 2015-02-23 21:56 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Luiz Capitulino, Gonglei, Yang Hongyang, Michael Roth,
	zhanghailiang

[-- Attachment #1: Type: text/plain, Size: 936 bytes --]

On 02/11/2015 08:07 PM, Wen Congyang wrote:
> To block replication, we only need to read from the first child.

s/quorom/quorum/ in the subject line

s/To block/For block/

> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
> ---
>  block/quorum.c       | 5 +++--
>  qapi/block-core.json | 4 +++-
>  2 files changed, 6 insertions(+), 3 deletions(-)

> +++ b/qapi/block-core.json
> @@ -1618,9 +1618,11 @@
>  #
>  # @fifo: read only from the first child that has not failed
>  #
> +# @first: read only from the first child
> +

Missing a 'since 2.3' designation.  Otherwise looks okay.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 04/14] Add new block driver interfaces to control disk replication
  2015-02-23 20:57   ` Max Reitz
@ 2015-02-23 21:58     ` Eric Blake
  0 siblings, 0 replies; 81+ messages in thread
From: Eric Blake @ 2015-02-23 21:58 UTC (permalink / raw)
  To: Max Reitz, Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi,
	Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

[-- Attachment #1: Type: text/plain, Size: 1674 bytes --]

On 02/23/2015 01:57 PM, Max Reitz wrote:
> On 2015-02-11 at 22:07, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>   block.c                   | 36 ++++++++++++++++++++++++++++++++++++
>>   include/block/block.h     | 10 ++++++++++
>>   include/block/block_int.h | 12 ++++++++++++
>>   3 files changed, 58 insertions(+)
>>

>> +++ b/include/block/block.h
>> @@ -557,4 +557,14 @@ void bdrv_flush_io_queue(BlockDriverState *bs);
>>     BlockAcctStats *bdrv_get_stats(BlockDriverState *bs);
>>   +/* Checkpoint control, called in migration/checkpoint thread */
>> +enum {
>> +    COLO_UNPROTECTED_MODE = 0,
>> +    COLO_PRIMARY_MODE,
>> +    COLO_SECONDARY_MODE,
>> +};
> 
> I have a feeling that you may want to define these values through QAPI...

especially if you intend for a QMP command to output which mode a colo
disk is in.


>> +++ b/include/block/block_int.h
>> @@ -273,6 +273,18 @@ struct BlockDriver {
>>       void (*bdrv_io_unplug)(BlockDriverState *bs);
>>       void (*bdrv_flush_io_queue)(BlockDriverState *bs);
>>   +
>> +    /* Checkpoint control, called in migration/checkpoint thread */
>> +    int (*bdrv_start_replication)(BlockDriverState *bs, int mode);
>> +    /*
>> +     * Drop Disk buffer when doing checkpoint.
>> +     */
>> +    int (*bdrv_do_checkpoint)(BlockDriverState *bs);

Inconsistent comment style between one-line and multi-line.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 10/14] allow the backing image access the origin BlockDriverState
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 10/14] allow the backing image access the origin BlockDriverState Wen Congyang
@ 2015-02-23 22:01   ` Max Reitz
  0 siblings, 0 replies; 81+ messages in thread
From: Max Reitz @ 2015-02-23 22:01 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> Block replication needs this feature.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block.c                   | 2 ++
>   include/block/block_int.h | 2 ++
>   2 files changed, 4 insertions(+)
>
> diff --git a/block.c b/block.c
> index a7a8932..067c44b 100644
> --- a/block.c
> +++ b/block.c
> @@ -1181,6 +1181,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
>       if (bs->backing_hd) {
>           assert(bs->backing_blocker);
>           bdrv_op_unblock_all(bs->backing_hd, bs->backing_blocker);
> +        bs->backing_hd->origin_file = NULL;

Seems more like "backed_file" to me. Can you explain to me where "origin 
file" comes from?

Since apparently one BDS can be used as a backing file by only at most 
one other BDS, the patch seems fine to me (other than the naming issue).

Max

>       } else if (backing_hd) {
>           error_setg(&bs->backing_blocker,
>                      "device is used as backing hd of '%s'",
> @@ -1193,6 +1194,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
>           bs->backing_blocker = NULL;
>           goto out;
>       }
> +    backing_hd->origin_file = bs;
>       bs->open_flags &= ~BDRV_O_NO_BACKING;
>       pstrcpy(bs->backing_file, sizeof(bs->backing_file), backing_hd->filename);
>       pstrcpy(bs->backing_format, sizeof(bs->backing_format),
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 603f704..9be13a8 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -360,6 +360,8 @@ struct BlockDriverState {
>       char exact_filename[PATH_MAX];
>   
>       BlockDriverState *backing_hd;
> +    /* used by backing image */
> +    BlockDriverState *origin_file;
>       BlockDriverState *file;
>   
>       NotifierList close_notifiers;

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 11/14] allow writing to the backing file
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 11/14] allow writing to the backing file Wen Congyang
@ 2015-02-23 22:03   ` Max Reitz
  2015-02-26 14:15     ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Max Reitz @ 2015-02-23 22:03 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)

I don't think this is a good idea. With this patch, every time you open 
a COW file (with a backing file) R/W, the backing file will be writable. 
I'd rather like a way to explicitly overwrite the R/W mode of the 
backing file; but by default, in my opinion, it should stay read-only.

Max

> diff --git a/block.c b/block.c
> index 067c44b..96cf973 100644
> --- a/block.c
> +++ b/block.c
> @@ -856,8 +856,8 @@ static int bdrv_inherited_flags(int flags)
>    */
>   static int bdrv_backing_flags(int flags)
>   {
> -    /* backing files always opened read-only */
> -    flags &= ~(BDRV_O_RDWR | BDRV_O_COPY_ON_READ);
> +    /* backing files are opened read-write for block replication */
> +    flags &= ~BDRV_O_COPY_ON_READ;
>   
>       /* snapshot=on is handled on the top layer */
>       flags &= ~(BDRV_O_SNAPSHOT | BDRV_O_TEMPORARY);

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 12/14] Add disk buffer for block replication
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 12/14] Add disk buffer for block replication Wen Congyang
@ 2015-02-23 22:27   ` Max Reitz
  0 siblings, 0 replies; 81+ messages in thread
From: Max Reitz @ 2015-02-23 22:27 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block/Makefile.objs    |   1 +
>   block/blkcolo-buffer.c | 324 +++++++++++++++++++++++++++++++++++++++++++++++++
>   block/blkcolo.h        |  35 ++++++
>   3 files changed, 360 insertions(+)
>   create mode 100644 block/blkcolo-buffer.c
>   create mode 100644 block/blkcolo.h

In general: Can you please add some prefix to the non-static functions, 
like colo_*?

As for the design questions regarding this block driver, I should 
probably leave that to Fam and John (because they seemed to have an idea 
how to approach the issue at hand using a different implementation, 
based on functionality that's already (or close to become) part of qemu).

Therefore, I'm hesitating to review this patch (and the following ones) 
until you've reached a conclusion on how to proceed.

Max

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 14/14] COLO: implement a new block driver
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 14/14] COLO: implement a new block driver Wen Congyang
@ 2015-02-23 22:35   ` Max Reitz
  0 siblings, 0 replies; 81+ messages in thread
From: Max Reitz @ 2015-02-23 22:35 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-11 at 22:07, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>   block/Makefile.objs |   2 +-
>   block/blkcolo.c     | 409 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 410 insertions(+), 1 deletion(-)
>   create mode 100644 block/blkcolo.c

Seeing what you want to use the BlockBackend for (which is, set up an 
NBD server): I think it's best to create the BlockBackend the moment the 
NBD server is created, and destroy it the moment the NBD server is 
stopped (that is, create it in colo_svm_init() and destroy it in 
colo_svm_fini()).

Max

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 03/14] quorum: ignore 0-length child
  2015-02-23 20:43   ` Max Reitz
@ 2015-02-24  2:33     ` Wen Congyang
  2015-03-18  5:29     ` Wen Congyang
  1 sibling, 0 replies; 81+ messages in thread
From: Wen Congyang @ 2015-02-24  2:33 UTC (permalink / raw)
  To: Max Reitz, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 02/24/2015 04:43 AM, Max Reitz wrote:
> On 2015-02-11 at 22:07, Wen Congyang wrote:
>> We connect to NBD server when starting block replication, so
>> the length is 0 before starting block replication.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>   block/quorum.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/block/quorum.c b/block/quorum.c
>> index 5ed1ff8..e6aff5f 100644
>> --- a/block/quorum.c
>> +++ b/block/quorum.c
>> @@ -734,6 +734,11 @@ static int64_t quorum_getlength(BlockDriverState *bs)
>>           if (value < 0) {
>>               return value;
>>           }
>> +
>> +        if (!value) {
>> +            continue;
>> +        }
>> +
>>           if (value != result) {
>>               return -EIO;
>>           }
> 
> Hm, what do you think about some specific error value returned by your delayed NBD implementation? Like -ENOTCONN or something like that? Then we'd be able to discern a real 0-length block device from a not-yet-connected NBD server.
> 
> Also, while you did write that one shouldn't be using the NBD client as the first quorum child, I think we should try to support that case anyway. For this patch, that means accepting that bdrv_getlength(s->bs[0]) may be off.

Good idea. I will try it.

Thanks
Wen Congyang

> 
> Max
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  8:44       ` Fam Zheng
  2015-02-12  9:33         ` Wen Congyang
  2015-02-12  9:36         ` Hongyang Yang
@ 2015-02-24  7:50         ` Wen Congyang
  2015-02-25  2:46           ` Fam Zheng
  2015-02-25  8:11         ` Wen Congyang
  2015-02-25  9:10         ` Wen Congyang
  4 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-24  7:50 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On 02/12/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |      Copy and Forward        |             |
>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>> +                  |                      |       |             |
>>>> +                  |                     (3)      \-------------/
>>>> +                  |                 speculative      ^
>>>> +                  |                write through    (2)
>>>> +                  |                      |           |
>>>> +                  V                      V           |
>>>> +           +--------------+           +----------------+
>>>> +           | Primary Disk |           | Secondary Disk |
>>>> +           +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>> vm write to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary content
>> is saved in the buffer.
> 
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
> 
> When we have a normal backing chain,
> 
>                {virtio-blk dev 'foo'}
>                          |
>                          |
>                          |
>     [base] <- [mid] <- (foo)
> 
> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> to an existing image on top,
> 
>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>                          |                              |
>                          |                              |
>                          |                              |
>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> 
> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
> 
>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>                          |                                                          |
>                          |                                                          |
>                          v                                                          v
> 
>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> 
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          >>>> drive-backup sync=none >>>>
> 
> So when guest writes to 'foo', the old data is moved to (hidden target), which
> remains unchanged from (bar)'s PoV.
> 
> The drive in the middle is called hidden because QEMU creates it automatically,
> the naming is arbitrary.

I don't understand this. In which function, the hidden target is created automatically?

Thanks
Wen Congyang

> 
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via NBD server for data scanning (read
> only) purpose.
> 
> More interestingly, with above facility, it is also possible to create a guest
> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> cheaply. Or call it shadow copy if you will.
> 
> Back to the COLO case, the configuration will be very similar:
> 
> 
>                       {primary wr}                                                {secondary vm}
>                             |                                                           |
>                             |                                                           |
>                             |                                                           |
>                             v                                                           v
> 
>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> 
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             >>>> drive-backup sync=none >>>>
> 
> The workflow analogue is:
> 
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
> 
> Primary write requests are forwarded to secondary QEMU as well.
> 
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
> 
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the orignal sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
> 
>>>> +    3) Primary write requests will be written to Secondary disk.
> 
> Primary write requests are written to (nbd target).
> 
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
> 
> Secondary write request will be written in (active disk) as usual.
> 
> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> data in (hidden buf disk) and (active disk); when failover happends, if you
> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> drop data in (hidden buf disk).
> 
> Fam
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/14] NBD client: connect to nbd server later
  2015-02-23 21:31   ` Max Reitz
@ 2015-02-25  2:23     ` Wen Congyang
  2015-02-25 14:22       ` Max Reitz
  0 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-25  2:23 UTC (permalink / raw)
  To: Max Reitz, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 02/24/2015 05:31 AM, Max Reitz wrote:
> On 2015-02-11 at 22:07, Wen Congyang wrote:
>> The secondary qemu starts later than the primary qemu, so we
>> cannot connect to nbd server in bdrv_open().
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>   block/nbd.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++--------
>>   1 file changed, 87 insertions(+), 13 deletions(-)
>>
>> diff --git a/block/nbd.c b/block/nbd.c
>> index b05d1d0..19b9200 100644
>> --- a/block/nbd.c
>> +++ b/block/nbd.c
>> @@ -44,6 +44,8 @@
>>   typedef struct BDRVNBDState {
>>       NbdClientSession client;
>>       QemuOpts *socket_opts;
>> +    char *export;
>> +    bool connected;
>>   } BDRVNBDState;
>>     static int nbd_parse_uri(const char *filename, QDict *options)
>> @@ -247,20 +249,10 @@ static int nbd_establish_connection(BlockDriverState *bs, Error **errp)
>>       return sock;
>>   }
>>   -static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
>> -                    Error **errp)
>> +static int nbd_connect_server(BlockDriverState *bs, Error **errp)
>>   {
>>       BDRVNBDState *s = bs->opaque;
>> -    char *export = NULL;
>>       int result, sock;
>> -    Error *local_err = NULL;
>> -
>> -    /* Pop the config into our state object. Exit if invalid. */
>> -    nbd_config(s, options, &export, &local_err);
>> -    if (local_err) {
>> -        error_propagate(errp, local_err);
>> -        return -EINVAL;
>> -    }
>>         /* establish TCP connection, return error if it fails
>>        * TODO: Configurable retry-until-timeout behaviour.
>> @@ -271,16 +263,57 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
>>       }
>>         /* NBD handshake */
>> -    result = nbd_client_session_init(&s->client, bs, sock, export, errp);
>> -    g_free(export);
>> +    result = nbd_client_session_init(&s->client, bs, sock, s->export, errp);
>> +    g_free(s->export);
>> +    s->export = NULL;
>> +    if (!result) {
>> +        s->connected = true;
>> +    }
>> +
>>       return result;
>>   }
>>   +static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
>> +                    Error **errp)
>> +{
>> +    BDRVNBDState *s = bs->opaque;
>> +    Error *local_err = NULL;
>> +
>> +    /* Pop the config into our state object. Exit if invalid. */
>> +    nbd_config(s, options, &s->export, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return -EINVAL;
>> +    }
>> +
>> +    return nbd_connect_server(bs, errp);
>> +}
>> +
>> +static int nbd_open_colo(BlockDriverState *bs, QDict *options, int flags,
>> +                         Error **errp)
>> +{
>> +    BDRVNBDState *s = bs->opaque;
>> +    Error *local_err = NULL;
>> +
>> +    /* Pop the config into our state object. Exit if invalid. */
>> +    nbd_config(s, options, &s->export, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return -EINVAL;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   static int nbd_co_readv(BlockDriverState *bs, int64_t sector_num,
>>                           int nb_sectors, QEMUIOVector *qiov)
>>   {
>>       BDRVNBDState *s = bs->opaque;
>>   +    if (!s->connected) {
>> +        return -EIO;
>> +    }
>> +
>>       return nbd_client_session_co_readv(&s->client, sector_num,
>>                                          nb_sectors, qiov);
>>   }
>> @@ -290,6 +323,10 @@ static int nbd_co_writev(BlockDriverState *bs, int64_t sector_num,
>>   {
>>       BDRVNBDState *s = bs->opaque;
>>   +    if (!s->connected) {
>> +        return 0;
>> +    }
> 
> Would it break anything to return -EIO here as well? (And in all the following functions)

1. nbd_co_writev()
   If one child returns error, quorum will report it. There may be many write requests before
   we connect to nbd server, so there are too many qapi events...
2. nbd_co_flush()
   If quorum only have two children, and nbd client is the last one, quorum_co_flush()
   will return -EIO.
3. nbd_co_discard()
   quorum doens't call bdrv_co_discard(), so it is OK to return -EIO here.

So only nbd_co_discard() can return -EIO.

Thanks
Wen Congyang

> 
>> +
>>       return nbd_client_session_co_writev(&s->client, sector_num,
>>                                           nb_sectors, qiov);
>>   }
>> @@ -298,6 +335,10 @@ static int nbd_co_flush(BlockDriverState *bs)
>>   {
>>       BDRVNBDState *s = bs->opaque;
>>   +    if (!s->connected) {
>> +        return 0;
>> +    }
>> +
>>       return nbd_client_session_co_flush(&s->client);
>>   }
>>   @@ -312,6 +353,10 @@ static int nbd_co_discard(BlockDriverState *bs, int64_t sector_num,
>>   {
>>       BDRVNBDState *s = bs->opaque;
>>   +    if (!s->connected) {
>> +        return 0;
>> +    }
>> +
>>       return nbd_client_session_co_discard(&s->client, sector_num,
>>                                            nb_sectors);
>>   }
>> @@ -322,6 +367,7 @@ static void nbd_close(BlockDriverState *bs)
>>         qemu_opts_del(s->socket_opts);
>>       nbd_client_session_close(&s->client);
>> +    s->connected = false;
>>   }
>>   
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-24  7:50         ` Wen Congyang
@ 2015-02-25  2:46           ` Fam Zheng
  2015-02-25  8:36             ` Wen Congyang
  2015-02-26  6:38             ` Wen Congyang
  0 siblings, 2 replies; 81+ messages in thread
From: Fam Zheng @ 2015-02-25  2:46 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On Tue, 02/24 15:50, Wen Congyang wrote:
> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> > On Thu, 02/12 15:40, Wen Congyang wrote:
> >> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>> Hi Congyang,
> >>>
> >>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>> +== Workflow ==
> >>>> +The following is the image of block replication workflow:
> >>>> +
> >>>> +        +----------------------+            +------------------------+
> >>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>> +        +----------------------+            +------------------------+
> >>>> +                  |                                       |
> >>>> +                  |                                      (4)
> >>>> +                  |                                       V
> >>>> +                  |                              /-------------\
> >>>> +                  |      Copy and Forward        |             |
> >>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>> +                  |                      |       |             |
> >>>> +                  |                     (3)      \-------------/
> >>>> +                  |                 speculative      ^
> >>>> +                  |                write through    (2)
> >>>> +                  |                      |           |
> >>>> +                  V                      V           |
> >>>> +           +--------------+           +----------------+
> >>>> +           | Primary Disk |           | Secondary Disk |
> >>>> +           +--------------+           +----------------+
> >>>> +
> >>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>> +       QEMU.
> >>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>> +       original sector content will be read from Secondary disk and
> >>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +       sector content in the Disk buffer.
> >>>
> >>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>> reading them as "s/will be/are/g"
> >>>
> >>> Why do you need this buffer?
> >>
> >> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >> vm write to the buffer.
> >>
> >>>
> >>> If both primary and secondary write to the same sector, what is saved in the
> >>> buffer?
> >>
> >> The primary content will be written to the secondary disk, and the secondary content
> >> is saved in the buffer.
> > 
> > I wonder if alternatively this is possible with an imaginary "writable backing
> > image" feature, as described below.
> > 
> > When we have a normal backing chain,
> > 
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> > 
> > Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> > to an existing image on top,
> > 
> >                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >                          |                              |
> >                          |                              |
> >                          |                              |
> >     [base] <- [mid] <- (foo)  <---------------------- (bar)
> > 
> > It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> > We can utilize an automatic hidden drive-backup target:
> > 
> >                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >                          |                                                          |
> >                          |                                                          |
> >                          v                                                          v
> > 
> >     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> > 
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          >>>> drive-backup sync=none >>>>
> > 
> > So when guest writes to 'foo', the old data is moved to (hidden target), which
> > remains unchanged from (bar)'s PoV.
> > 
> > The drive in the middle is called hidden because QEMU creates it automatically,
> > the naming is arbitrary.
> 
> I don't understand this. In which function, the hidden target is created automatically?
> 

It's to be determined. This part is only in my mind :)

Fam

> 
> > 
> > It is interesting because it is a more generalized case of image fleecing,
> > where the (hidden target) is exposed via NBD server for data scanning (read
> > only) purpose.
> > 
> > More interestingly, with above facility, it is also possible to create a guest
> > visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> > cheaply. Or call it shadow copy if you will.
> > 
> > Back to the COLO case, the configuration will be very similar:
> > 
> > 
> >                       {primary wr}                                                {secondary vm}
> >                             |                                                           |
> >                             |                                                           |
> >                             |                                                           |
> >                             v                                                           v
> > 
> >    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> > 
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             >>>> drive-backup sync=none >>>>
> > 
> > The workflow analogue is:
> > 
> >>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>> +       QEMU.
> > 
> > Primary write requests are forwarded to secondary QEMU as well.
> > 
> >>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>> +       original sector content will be read from Secondary disk and
> >>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +       sector content in the Disk buffer.
> > 
> > Before Primary write requests are written to (nbd target), aka the Secondary
> > disk, the orignal sector content is read from it and copied to (hidden buf
> > disk) by drive-backup. It obviously will not overwrite the data in (active
> > disk).
> > 
> >>>> +    3) Primary write requests will be written to Secondary disk.
> > 
> > Primary write requests are written to (nbd target).
> > 
> >>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>> +       will overwrite the existing sector content in the buffer.
> > 
> > Secondary write request will be written in (active disk) as usual.
> > 
> > Finally, when checkpoint arrives, if you want to sync with primary, just drop
> > data in (hidden buf disk) and (active disk); when failover happends, if you
> > want to promote secondary vm, you can commit (active disk) to (nbd target), and
> > drop data in (hidden buf disk).
> > 
> > Fam
> > .
> > 
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  8:44       ` Fam Zheng
                           ` (2 preceding siblings ...)
  2015-02-24  7:50         ` Wen Congyang
@ 2015-02-25  8:11         ` Wen Congyang
  2015-02-25  8:18           ` Fam Zheng
  2015-02-25  9:10         ` Wen Congyang
  4 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-25  8:11 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On 02/12/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |      Copy and Forward        |             |
>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>> +                  |                      |       |             |
>>>> +                  |                     (3)      \-------------/
>>>> +                  |                 speculative      ^
>>>> +                  |                write through    (2)
>>>> +                  |                      |           |
>>>> +                  V                      V           |
>>>> +           +--------------+           +----------------+
>>>> +           | Primary Disk |           | Secondary Disk |
>>>> +           +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>> vm write to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary content
>> is saved in the buffer.
> 
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
> 
> When we have a normal backing chain,
> 
>                {virtio-blk dev 'foo'}
>                          |
>                          |
>                          |
>     [base] <- [mid] <- (foo)

foo's backing is mid, and mid's backing is base?

The foo is a base's snapshot?

Thanks
Wen Congyang

> 
> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> to an existing image on top,
> 
>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>                          |                              |
>                          |                              |
>                          |                              |
>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> 
> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
> 
>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>                          |                                                          |
>                          |                                                          |
>                          v                                                          v
> 
>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> 
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          >>>> drive-backup sync=none >>>>
> 
> So when guest writes to 'foo', the old data is moved to (hidden target), which
> remains unchanged from (bar)'s PoV.
> 
> The drive in the middle is called hidden because QEMU creates it automatically,
> the naming is arbitrary.
> 
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via NBD server for data scanning (read
> only) purpose.
> 
> More interestingly, with above facility, it is also possible to create a guest
> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> cheaply. Or call it shadow copy if you will.
> 
> Back to the COLO case, the configuration will be very similar:
> 
> 
>                       {primary wr}                                                {secondary vm}
>                             |                                                           |
>                             |                                                           |
>                             |                                                           |
>                             v                                                           v
> 
>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> 
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             >>>> drive-backup sync=none >>>>
> 
> The workflow analogue is:
> 
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
> 
> Primary write requests are forwarded to secondary QEMU as well.
> 
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
> 
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the orignal sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
> 
>>>> +    3) Primary write requests will be written to Secondary disk.
> 
> Primary write requests are written to (nbd target).
> 
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
> 
> Secondary write request will be written in (active disk) as usual.
> 
> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> data in (hidden buf disk) and (active disk); when failover happends, if you
> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> drop data in (hidden buf disk).
> 
> Fam
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-25  8:11         ` Wen Congyang
@ 2015-02-25  8:18           ` Fam Zheng
  0 siblings, 0 replies; 81+ messages in thread
From: Fam Zheng @ 2015-02-25  8:18 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On Wed, 02/25 16:11, Wen Congyang wrote:
> > 
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> 
> foo's backing is mid, and mid's backing is base?

Yes.

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-25  2:46           ` Fam Zheng
@ 2015-02-25  8:36             ` Wen Congyang
  2015-02-25  8:58               ` Fam Zheng
  2015-02-26  6:38             ` Wen Congyang
  1 sibling, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-25  8:36 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On 02/25/2015 10:46 AM, Fam Zheng wrote:
> On Tue, 02/24 15:50, Wen Congyang wrote:
>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>> Hi Congyang,
>>>>>
>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>> +== Workflow ==
>>>>>> +The following is the image of block replication workflow:
>>>>>> +
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +                  |                                       |
>>>>>> +                  |                                      (4)
>>>>>> +                  |                                       V
>>>>>> +                  |                              /-------------\
>>>>>> +                  |      Copy and Forward        |             |
>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>> +                  |                      |       |             |
>>>>>> +                  |                     (3)      \-------------/
>>>>>> +                  |                 speculative      ^
>>>>>> +                  |                write through    (2)
>>>>>> +                  |                      |           |
>>>>>> +                  V                      V           |
>>>>>> +           +--------------+           +----------------+
>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>> +           +--------------+           +----------------+
>>>>>> +
>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>> +       QEMU.
>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>> +       original sector content will be read from Secondary disk and
>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>> reading them as "s/will be/are/g"
>>>>>
>>>>> Why do you need this buffer?
>>>>
>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>> vm write to the buffer.
>>>>
>>>>>
>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>> buffer?
>>>>
>>>> The primary content will be written to the secondary disk, and the secondary content
>>>> is saved in the buffer.
>>>
>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>> image" feature, as described below.
>>>
>>> When we have a normal backing chain,
>>>
>>>                {virtio-blk dev 'foo'}
>>>                          |
>>>                          |
>>>                          |
>>>     [base] <- [mid] <- (foo)
>>>
>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>> to an existing image on top,
>>>
>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>                          |                              |
>>>                          |                              |
>>>                          |                              |
>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>
>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>> We can utilize an automatic hidden drive-backup target:
>>>
>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>                          |                                                          |
>>>                          |                                                          |
>>>                          v                                                          v
>>>
>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          >>>> drive-backup sync=none >>>>
>>>
>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>> remains unchanged from (bar)'s PoV.
>>>
>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>> the naming is arbitrary.
>>
>> I don't understand this. In which function, the hidden target is created automatically?
>>
> 
> It's to be determined. This part is only in my mind :)

Does hidden target is only used for COLO?

Thanks
Wen Congyang

> 
> Fam
> 
>>
>>>
>>> It is interesting because it is a more generalized case of image fleecing,
>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>> only) purpose.
>>>
>>> More interestingly, with above facility, it is also possible to create a guest
>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>> cheaply. Or call it shadow copy if you will.
>>>
>>> Back to the COLO case, the configuration will be very similar:
>>>
>>>
>>>                       {primary wr}                                                {secondary vm}
>>>                             |                                                           |
>>>                             |                                                           |
>>>                             |                                                           |
>>>                             v                                                           v
>>>
>>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>
>>>                             v                              ^
>>>                             v                              ^
>>>                             v                              ^
>>>                             v                              ^
>>>                             >>>> drive-backup sync=none >>>>
>>>
>>> The workflow analogue is:
>>>
>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>> +       QEMU.
>>>
>>> Primary write requests are forwarded to secondary QEMU as well.
>>>
>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>> +       original sector content will be read from Secondary disk and
>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>> +       sector content in the Disk buffer.
>>>
>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>> disk, the orignal sector content is read from it and copied to (hidden buf
>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>> disk).
>>>
>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>
>>> Primary write requests are written to (nbd target).
>>>
>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>> +       will overwrite the existing sector content in the buffer.
>>>
>>> Secondary write request will be written in (active disk) as usual.
>>>
>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>> data in (hidden buf disk) and (active disk); when failover happends, if you
>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>> drop data in (hidden buf disk).
>>>
>>> Fam
>>> .
>>>
>>
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-25  8:36             ` Wen Congyang
@ 2015-02-25  8:58               ` Fam Zheng
  2015-02-25  9:58                 ` Wen Congyang
  0 siblings, 1 reply; 81+ messages in thread
From: Fam Zheng @ 2015-02-25  8:58 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On Wed, 02/25 16:36, Wen Congyang wrote:
> On 02/25/2015 10:46 AM, Fam Zheng wrote:
> > On Tue, 02/24 15:50, Wen Congyang wrote:
> >> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> >>> On Thu, 02/12 15:40, Wen Congyang wrote:
> >>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>>> Hi Congyang,
> >>>>>
> >>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>>> +== Workflow ==
> >>>>>> +The following is the image of block replication workflow:
> >>>>>> +
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +                  |                                       |
> >>>>>> +                  |                                      (4)
> >>>>>> +                  |                                       V
> >>>>>> +                  |                              /-------------\
> >>>>>> +                  |      Copy and Forward        |             |
> >>>>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>>>> +                  |                      |       |             |
> >>>>>> +                  |                     (3)      \-------------/
> >>>>>> +                  |                 speculative      ^
> >>>>>> +                  |                write through    (2)
> >>>>>> +                  |                      |           |
> >>>>>> +                  V                      V           |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +           | Primary Disk |           | Secondary Disk |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +
> >>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>> +       QEMU.
> >>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>> +       original sector content will be read from Secondary disk and
> >>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>> +       sector content in the Disk buffer.
> >>>>>
> >>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>>> reading them as "s/will be/are/g"
> >>>>>
> >>>>> Why do you need this buffer?
> >>>>
> >>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >>>> vm write to the buffer.
> >>>>
> >>>>>
> >>>>> If both primary and secondary write to the same sector, what is saved in the
> >>>>> buffer?
> >>>>
> >>>> The primary content will be written to the secondary disk, and the secondary content
> >>>> is saved in the buffer.
> >>>
> >>> I wonder if alternatively this is possible with an imaginary "writable backing
> >>> image" feature, as described below.
> >>>
> >>> When we have a normal backing chain,
> >>>
> >>>                {virtio-blk dev 'foo'}
> >>>                          |
> >>>                          |
> >>>                          |
> >>>     [base] <- [mid] <- (foo)
> >>>
> >>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> >>> to an existing image on top,
> >>>
> >>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >>>                          |                              |
> >>>                          |                              |
> >>>                          |                              |
> >>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >>>
> >>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> >>> We can utilize an automatic hidden drive-backup target:
> >>>
> >>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >>>                          |                                                          |
> >>>                          |                                                          |
> >>>                          v                                                          v
> >>>
> >>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >>>
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          >>>> drive-backup sync=none >>>>
> >>>
> >>> So when guest writes to 'foo', the old data is moved to (hidden target), which
> >>> remains unchanged from (bar)'s PoV.
> >>>
> >>> The drive in the middle is called hidden because QEMU creates it automatically,
> >>> the naming is arbitrary.
> >>
> >> I don't understand this. In which function, the hidden target is created automatically?
> >>
> > 
> > It's to be determined. This part is only in my mind :)
> 
> Does hidden target is only used for COLO?
> 

I'm not sure I get your question.

In this case yes, this is a dedicate target that's only written to by COLO's
secondary VM.

In other general cases, this infrastructure could also be used for backup or
image fleecing.

Fam

> 
> > 
> > Fam
> > 
> >>
> >>>
> >>> It is interesting because it is a more generalized case of image fleecing,
> >>> where the (hidden target) is exposed via NBD server for data scanning (read
> >>> only) purpose.
> >>>
> >>> More interestingly, with above facility, it is also possible to create a guest
> >>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> >>> cheaply. Or call it shadow copy if you will.
> >>>
> >>> Back to the COLO case, the configuration will be very similar:
> >>>
> >>>
> >>>                       {primary wr}                                                {secondary vm}
> >>>                             |                                                           |
> >>>                             |                                                           |
> >>>                             |                                                           |
> >>>                             v                                                           v
> >>>
> >>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> >>>
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             v                              ^
> >>>                             >>>> drive-backup sync=none >>>>
> >>>
> >>> The workflow analogue is:
> >>>
> >>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>> +       QEMU.
> >>>
> >>> Primary write requests are forwarded to secondary QEMU as well.
> >>>
> >>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>> +       original sector content will be read from Secondary disk and
> >>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>> +       sector content in the Disk buffer.
> >>>
> >>> Before Primary write requests are written to (nbd target), aka the Secondary
> >>> disk, the orignal sector content is read from it and copied to (hidden buf
> >>> disk) by drive-backup. It obviously will not overwrite the data in (active
> >>> disk).
> >>>
> >>>>>> +    3) Primary write requests will be written to Secondary disk.
> >>>
> >>> Primary write requests are written to (nbd target).
> >>>
> >>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>>>> +       will overwrite the existing sector content in the buffer.
> >>>
> >>> Secondary write request will be written in (active disk) as usual.
> >>>
> >>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> >>> data in (hidden buf disk) and (active disk); when failover happends, if you
> >>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> >>> drop data in (hidden buf disk).
> >>>
> >>> Fam
> >>> .
> >>>
> >>
> > .
> > 
> 
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  8:44       ` Fam Zheng
                           ` (3 preceding siblings ...)
  2015-02-25  8:11         ` Wen Congyang
@ 2015-02-25  9:10         ` Wen Congyang
  2015-02-25  9:45           ` Fam Zheng
  4 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-25  9:10 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On 02/12/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |      Copy and Forward        |             |
>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>> +                  |                      |       |             |
>>>> +                  |                     (3)      \-------------/
>>>> +                  |                 speculative      ^
>>>> +                  |                write through    (2)
>>>> +                  |                      |           |
>>>> +                  V                      V           |
>>>> +           +--------------+           +----------------+
>>>> +           | Primary Disk |           | Secondary Disk |
>>>> +           +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>> vm write to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary content
>> is saved in the buffer.
> 
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
> 
> When we have a normal backing chain,
> 
>                {virtio-blk dev 'foo'}
>                          |
>                          |
>                          |
>     [base] <- [mid] <- (foo)
> 
> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> to an existing image on top,
> 
>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>                          |                              |
>                          |                              |
>                          |                              |
>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> 
> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
> 
>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>                          |                                                          |
>                          |                                                          |
>                          v                                                          v
> 
>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> 
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          v                              ^
>                          >>>> drive-backup sync=none >>>>
> 
> So when guest writes to 'foo', the old data is moved to (hidden target), which
> remains unchanged from (bar)'s PoV.
> 
> The drive in the middle is called hidden because QEMU creates it automatically,
> the naming is arbitrary.
> 
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via NBD server for data scanning (read
> only) purpose.
> 
> More interestingly, with above facility, it is also possible to create a guest
> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> cheaply. Or call it shadow copy if you will.
> 
> Back to the COLO case, the configuration will be very similar:
> 
> 
>                       {primary wr}                                                {secondary vm}
>                             |                                                           |
>                             |                                                           |
>                             |                                                           |
>                             v                                                           v
> 
>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> 
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             v                              ^
>                             >>>> drive-backup sync=none >>>>

Why nbd target has backing image ever?

> 
> The workflow analogue is:
> 
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
> 
> Primary write requests are forwarded to secondary QEMU as well.
> 
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
> 
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the orignal sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
> 
>>>> +    3) Primary write requests will be written to Secondary disk.
> 
> Primary write requests are written to (nbd target).
> 
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
> 
> Secondary write request will be written in (active disk) as usual.
> 
> Finally, when checkpoint arrives, if you want to sync with primary, just drop
> data in (hidden buf disk) and (active disk); when failover happends, if you
> want to promote secondary vm, you can commit (active disk) to (nbd target), and
> drop data in (hidden buf disk).

We cannot drop data in (hidden buf disk). Commit (hidden buf disk) to (nbd target)
first, and then commit (active disk) to (nbd target).

Thanks
Wen Congyang

> 
> Fam
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-25  9:10         ` Wen Congyang
@ 2015-02-25  9:45           ` Fam Zheng
  0 siblings, 0 replies; 81+ messages in thread
From: Fam Zheng @ 2015-02-25  9:45 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On Wed, 02/25 17:10, Wen Congyang wrote:
> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> > On Thu, 02/12 15:40, Wen Congyang wrote:
> >> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>> Hi Congyang,
> >>>
> >>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>> +== Workflow ==
> >>>> +The following is the image of block replication workflow:
> >>>> +
> >>>> +        +----------------------+            +------------------------+
> >>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>> +        +----------------------+            +------------------------+
> >>>> +                  |                                       |
> >>>> +                  |                                      (4)
> >>>> +                  |                                       V
> >>>> +                  |                              /-------------\
> >>>> +                  |      Copy and Forward        |             |
> >>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>> +                  |                      |       |             |
> >>>> +                  |                     (3)      \-------------/
> >>>> +                  |                 speculative      ^
> >>>> +                  |                write through    (2)
> >>>> +                  |                      |           |
> >>>> +                  V                      V           |
> >>>> +           +--------------+           +----------------+
> >>>> +           | Primary Disk |           | Secondary Disk |
> >>>> +           +--------------+           +----------------+
> >>>> +
> >>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>> +       QEMU.
> >>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>> +       original sector content will be read from Secondary disk and
> >>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +       sector content in the Disk buffer.
> >>>
> >>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>> reading them as "s/will be/are/g"
> >>>
> >>> Why do you need this buffer?
> >>
> >> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >> vm write to the buffer.
> >>
> >>>
> >>> If both primary and secondary write to the same sector, what is saved in the
> >>> buffer?
> >>
> >> The primary content will be written to the secondary disk, and the secondary content
> >> is saved in the buffer.
> > 
> > I wonder if alternatively this is possible with an imaginary "writable backing
> > image" feature, as described below.
> > 
> > When we have a normal backing chain,
> > 
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> > 
> > Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> > to an existing image on top,
> > 
> >                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >                          |                              |
> >                          |                              |
> >                          |                              |
> >     [base] <- [mid] <- (foo)  <---------------------- (bar)
> > 
> > It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> > We can utilize an automatic hidden drive-backup target:
> > 
> >                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >                          |                                                          |
> >                          |                                                          |
> >                          v                                                          v
> > 
> >     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> > 
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          >>>> drive-backup sync=none >>>>
> > 
> > So when guest writes to 'foo', the old data is moved to (hidden target), which
> > remains unchanged from (bar)'s PoV.
> > 
> > The drive in the middle is called hidden because QEMU creates it automatically,
> > the naming is arbitrary.
> > 
> > It is interesting because it is a more generalized case of image fleecing,
> > where the (hidden target) is exposed via NBD server for data scanning (read
> > only) purpose.
> > 
> > More interestingly, with above facility, it is also possible to create a guest
> > visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
> > cheaply. Or call it shadow copy if you will.
> > 
> > Back to the COLO case, the configuration will be very similar:
> > 
> > 
> >                       {primary wr}                                                {secondary vm}
> >                             |                                                           |
> >                             |                                                           |
> >                             |                                                           |
> >                             v                                                           v
> > 
> >    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> > 
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             >>>> drive-backup sync=none >>>>
> 
> Why nbd target has backing image ever?

It's not strictly necessary, depending on your VM disk configuration. (for
example at the time of vm booting, your image already points to a backing file,
etc.

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-25  8:58               ` Fam Zheng
@ 2015-02-25  9:58                 ` Wen Congyang
  0 siblings, 0 replies; 81+ messages in thread
From: Wen Congyang @ 2015-02-25  9:58 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On 02/25/2015 04:58 PM, Fam Zheng wrote:
> On Wed, 02/25 16:36, Wen Congyang wrote:
>> On 02/25/2015 10:46 AM, Fam Zheng wrote:
>>> On Tue, 02/24 15:50, Wen Congyang wrote:
>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>> Hi Congyang,
>>>>>>>
>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>> +== Workflow ==
>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>> +
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +                  |                                       |
>>>>>>>> +                  |                                      (4)
>>>>>>>> +                  |                                       V
>>>>>>>> +                  |                              /-------------\
>>>>>>>> +                  |      Copy and Forward        |             |
>>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>>>> +                  |                      |       |             |
>>>>>>>> +                  |                     (3)      \-------------/
>>>>>>>> +                  |                 speculative      ^
>>>>>>>> +                  |                write through    (2)
>>>>>>>> +                  |                      |           |
>>>>>>>> +                  V                      V           |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>
>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>> reading them as "s/will be/are/g"
>>>>>>>
>>>>>>> Why do you need this buffer?
>>>>>>
>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>> vm write to the buffer.
>>>>>>
>>>>>>>
>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>> buffer?
>>>>>>
>>>>>> The primary content will be written to the secondary disk, and the secondary content
>>>>>> is saved in the buffer.
>>>>>
>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>> image" feature, as described below.
>>>>>
>>>>> When we have a normal backing chain,
>>>>>
>>>>>                {virtio-blk dev 'foo'}
>>>>>                          |
>>>>>                          |
>>>>>                          |
>>>>>     [base] <- [mid] <- (foo)
>>>>>
>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>> to an existing image on top,
>>>>>
>>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>>>
>>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>
>>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>                          |                                                          |
>>>>>                          |                                                          |
>>>>>                          v                                                          v
>>>>>
>>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>>>
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>
>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>> remains unchanged from (bar)'s PoV.
>>>>>
>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>> the naming is arbitrary.
>>>>
>>>> I don't understand this. In which function, the hidden target is created automatically?
>>>>
>>>
>>> It's to be determined. This part is only in my mind :)
>>
>> Does hidden target is only used for COLO?
>>
> 
> I'm not sure I get your question.
> 
> In this case yes, this is a dedicate target that's only written to by COLO's
> secondary VM.
> 
> In other general cases, this infrastructure could also be used for backup or
> image fleecing.

In COLO case, we can create (hidden buf disk) when starting block replication.
In other general cases, I don't know when to create (hidden buf disk)?

Thanks
Wen Congyang

> 
> Fam
> 
>>
>>>
>>> Fam
>>>
>>>>
>>>>>
>>>>> It is interesting because it is a more generalized case of image fleecing,
>>>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>>>> only) purpose.
>>>>>
>>>>> More interestingly, with above facility, it is also possible to create a guest
>>>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>>>> cheaply. Or call it shadow copy if you will.
>>>>>
>>>>> Back to the COLO case, the configuration will be very similar:
>>>>>
>>>>>
>>>>>                       {primary wr}                                                {secondary vm}
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             v                                                           v
>>>>>
>>>>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>>>
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             >>>> drive-backup sync=none >>>>
>>>>>
>>>>> The workflow analogue is:
>>>>>
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>
>>>>> Primary write requests are forwarded to secondary QEMU as well.
>>>>>
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>>>> disk, the orignal sector content is read from it and copied to (hidden buf
>>>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>>>> disk).
>>>>>
>>>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>>>
>>>>> Primary write requests are written to (nbd target).
>>>>>
>>>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>>>> +       will overwrite the existing sector content in the buffer.
>>>>>
>>>>> Secondary write request will be written in (active disk) as usual.
>>>>>
>>>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>>>> data in (hidden buf disk) and (active disk); when failover happends, if you
>>>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>>>> drop data in (hidden buf disk).
>>>>>
>>>>> Fam
>>>>> .
>>>>>
>>>>
>>> .
>>>
>>
>>
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/14] NBD client: connect to nbd server later
  2015-02-25  2:23     ` Wen Congyang
@ 2015-02-25 14:22       ` Max Reitz
  2015-02-26 14:07         ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Max Reitz @ 2015-02-25 14:22 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-02-24 at 21:23, Wen Congyang wrote:
> On 02/24/2015 05:31 AM, Max Reitz wrote:
>> On 2015-02-11 at 22:07, Wen Congyang wrote:
>>> The secondary qemu starts later than the primary qemu, so we
>>> cannot connect to nbd server in bdrv_open().
>>>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>>> ---
>>>    block/nbd.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++--------
>>>    1 file changed, 87 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/block/nbd.c b/block/nbd.c
>>> index b05d1d0..19b9200 100644
>>> --- a/block/nbd.c
>>> +++ b/block/nbd.c
>>> @@ -44,6 +44,8 @@
>>>    typedef struct BDRVNBDState {
>>>        NbdClientSession client;
>>>        QemuOpts *socket_opts;
>>> +    char *export;
>>> +    bool connected;
>>>    } BDRVNBDState;
>>>      static int nbd_parse_uri(const char *filename, QDict *options)
>>> @@ -247,20 +249,10 @@ static int nbd_establish_connection(BlockDriverState *bs, Error **errp)
>>>        return sock;
>>>    }
>>>    -static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
>>> -                    Error **errp)
>>> +static int nbd_connect_server(BlockDriverState *bs, Error **errp)
>>>    {
>>>        BDRVNBDState *s = bs->opaque;
>>> -    char *export = NULL;
>>>        int result, sock;
>>> -    Error *local_err = NULL;
>>> -
>>> -    /* Pop the config into our state object. Exit if invalid. */
>>> -    nbd_config(s, options, &export, &local_err);
>>> -    if (local_err) {
>>> -        error_propagate(errp, local_err);
>>> -        return -EINVAL;
>>> -    }
>>>          /* establish TCP connection, return error if it fails
>>>         * TODO: Configurable retry-until-timeout behaviour.
>>> @@ -271,16 +263,57 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
>>>        }
>>>          /* NBD handshake */
>>> -    result = nbd_client_session_init(&s->client, bs, sock, export, errp);
>>> -    g_free(export);
>>> +    result = nbd_client_session_init(&s->client, bs, sock, s->export, errp);
>>> +    g_free(s->export);
>>> +    s->export = NULL;
>>> +    if (!result) {
>>> +        s->connected = true;
>>> +    }
>>> +
>>>        return result;
>>>    }
>>>    +static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
>>> +                    Error **errp)
>>> +{
>>> +    BDRVNBDState *s = bs->opaque;
>>> +    Error *local_err = NULL;
>>> +
>>> +    /* Pop the config into our state object. Exit if invalid. */
>>> +    nbd_config(s, options, &s->export, &local_err);
>>> +    if (local_err) {
>>> +        error_propagate(errp, local_err);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    return nbd_connect_server(bs, errp);
>>> +}
>>> +
>>> +static int nbd_open_colo(BlockDriverState *bs, QDict *options, int flags,
>>> +                         Error **errp)
>>> +{
>>> +    BDRVNBDState *s = bs->opaque;
>>> +    Error *local_err = NULL;
>>> +
>>> +    /* Pop the config into our state object. Exit if invalid. */
>>> +    nbd_config(s, options, &s->export, &local_err);
>>> +    if (local_err) {
>>> +        error_propagate(errp, local_err);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>    static int nbd_co_readv(BlockDriverState *bs, int64_t sector_num,
>>>                            int nb_sectors, QEMUIOVector *qiov)
>>>    {
>>>        BDRVNBDState *s = bs->opaque;
>>>    +    if (!s->connected) {
>>> +        return -EIO;
>>> +    }
>>> +
>>>        return nbd_client_session_co_readv(&s->client, sector_num,
>>>                                           nb_sectors, qiov);
>>>    }
>>> @@ -290,6 +323,10 @@ static int nbd_co_writev(BlockDriverState *bs, int64_t sector_num,
>>>    {
>>>        BDRVNBDState *s = bs->opaque;
>>>    +    if (!s->connected) {
>>> +        return 0;
>>> +    }
>> Would it break anything to return -EIO here as well? (And in all the following functions)
> 1. nbd_co_writev()
>     If one child returns error, quorum will report it. There may be many write requests before
>     we connect to nbd server, so there are too many qapi events...
> 2. nbd_co_flush()
>     If quorum only have two children, and nbd client is the last one, quorum_co_flush()
>     will return -EIO.
> 3. nbd_co_discard()
>     quorum doens't call bdrv_co_discard(), so it is OK to return -EIO here.
>
> So only nbd_co_discard() can return -EIO.

Hm, okay. How about adding an option to quorum for ignoring errors from 
a specific child? It's probably not possible to do something like 
"children.1.ignore-errors=true", but maybe you can just ignore errors in 
quorum from any but the first child if the read pattern is set to 
"first", that would make sense to me.

But if you don't want to do that, I guess just making NBD some kind of 
/dev/null before it's connected should be fine.

Max

>>> +
>>>        return nbd_client_session_co_writev(&s->client, sector_num,
>>>                                            nb_sectors, qiov);
>>>    }
>>> @@ -298,6 +335,10 @@ static int nbd_co_flush(BlockDriverState *bs)
>>>    {
>>>        BDRVNBDState *s = bs->opaque;
>>>    +    if (!s->connected) {
>>> +        return 0;
>>> +    }
>>> +
>>>        return nbd_client_session_co_flush(&s->client);
>>>    }
>>>    @@ -312,6 +353,10 @@ static int nbd_co_discard(BlockDriverState *bs, int64_t sector_num,
>>>    {
>>>        BDRVNBDState *s = bs->opaque;
>>>    +    if (!s->connected) {
>>> +        return 0;
>>> +    }
>>> +
>>>        return nbd_client_session_co_discard(&s->client, sector_num,
>>>                                             nb_sectors);
>>>    }
>>> @@ -322,6 +367,7 @@ static void nbd_close(BlockDriverState *bs)
>>>          qemu_opts_del(s->socket_opts);
>>>        nbd_client_session_close(&s->client);
>>> +    s->connected = false;
>>>    }
>>>    

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-25  2:46           ` Fam Zheng
  2015-02-25  8:36             ` Wen Congyang
@ 2015-02-26  6:38             ` Wen Congyang
  2015-02-26  8:44               ` Fam Zheng
  1 sibling, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-26  6:38 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On 02/25/2015 10:46 AM, Fam Zheng wrote:
> On Tue, 02/24 15:50, Wen Congyang wrote:
>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>> Hi Congyang,
>>>>>
>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>> +== Workflow ==
>>>>>> +The following is the image of block replication workflow:
>>>>>> +
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>> +        +----------------------+            +------------------------+
>>>>>> +                  |                                       |
>>>>>> +                  |                                      (4)
>>>>>> +                  |                                       V
>>>>>> +                  |                              /-------------\
>>>>>> +                  |      Copy and Forward        |             |
>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>> +                  |                      |       |             |
>>>>>> +                  |                     (3)      \-------------/
>>>>>> +                  |                 speculative      ^
>>>>>> +                  |                write through    (2)
>>>>>> +                  |                      |           |
>>>>>> +                  V                      V           |
>>>>>> +           +--------------+           +----------------+
>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>> +           +--------------+           +----------------+
>>>>>> +
>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>> +       QEMU.
>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>> +       original sector content will be read from Secondary disk and
>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>> reading them as "s/will be/are/g"
>>>>>
>>>>> Why do you need this buffer?
>>>>
>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>> vm write to the buffer.
>>>>
>>>>>
>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>> buffer?
>>>>
>>>> The primary content will be written to the secondary disk, and the secondary content
>>>> is saved in the buffer.
>>>
>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>> image" feature, as described below.
>>>
>>> When we have a normal backing chain,
>>>
>>>                {virtio-blk dev 'foo'}
>>>                          |
>>>                          |
>>>                          |
>>>     [base] <- [mid] <- (foo)
>>>
>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>> to an existing image on top,
>>>
>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>                          |                              |
>>>                          |                              |
>>>                          |                              |
>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>
>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>> We can utilize an automatic hidden drive-backup target:
>>>
>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>                          |                                                          |
>>>                          |                                                          |
>>>                          v                                                          v
>>>
>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          v                              ^
>>>                          >>>> drive-backup sync=none >>>>
>>>
>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>> remains unchanged from (bar)'s PoV.
>>>
>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>> the naming is arbitrary.
>>
>> I don't understand this. In which function, the hidden target is created automatically?
>>
> 
> It's to be determined. This part is only in my mind :)

What about this:
-drive file=nbd-target,if=none,id=nbd-target0 \
-drive file=active-disk,if=virtio,driver=qcow2,backing.file.filename=hidden-disk,backing.driver=qcow2,backing.backing=nbd-target0

Thanks
Wen Congyang

> 
> Fam
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-26  6:38             ` Wen Congyang
@ 2015-02-26  8:44               ` Fam Zheng
  2015-02-26  9:07                 ` Wen Congyang
  0 siblings, 1 reply; 81+ messages in thread
From: Fam Zheng @ 2015-02-26  8:44 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On Thu, 02/26 14:38, Wen Congyang wrote:
> On 02/25/2015 10:46 AM, Fam Zheng wrote:
> > On Tue, 02/24 15:50, Wen Congyang wrote:
> >> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> >>> On Thu, 02/12 15:40, Wen Congyang wrote:
> >>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>>> Hi Congyang,
> >>>>>
> >>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>>> +== Workflow ==
> >>>>>> +The following is the image of block replication workflow:
> >>>>>> +
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>>>> +        +----------------------+            +------------------------+
> >>>>>> +                  |                                       |
> >>>>>> +                  |                                      (4)
> >>>>>> +                  |                                       V
> >>>>>> +                  |                              /-------------\
> >>>>>> +                  |      Copy and Forward        |             |
> >>>>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>>>> +                  |                      |       |             |
> >>>>>> +                  |                     (3)      \-------------/
> >>>>>> +                  |                 speculative      ^
> >>>>>> +                  |                write through    (2)
> >>>>>> +                  |                      |           |
> >>>>>> +                  V                      V           |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +           | Primary Disk |           | Secondary Disk |
> >>>>>> +           +--------------+           +----------------+
> >>>>>> +
> >>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>> +       QEMU.
> >>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>> +       original sector content will be read from Secondary disk and
> >>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>> +       sector content in the Disk buffer.
> >>>>>
> >>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>>> reading them as "s/will be/are/g"
> >>>>>
> >>>>> Why do you need this buffer?
> >>>>
> >>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >>>> vm write to the buffer.
> >>>>
> >>>>>
> >>>>> If both primary and secondary write to the same sector, what is saved in the
> >>>>> buffer?
> >>>>
> >>>> The primary content will be written to the secondary disk, and the secondary content
> >>>> is saved in the buffer.
> >>>
> >>> I wonder if alternatively this is possible with an imaginary "writable backing
> >>> image" feature, as described below.
> >>>
> >>> When we have a normal backing chain,
> >>>
> >>>                {virtio-blk dev 'foo'}
> >>>                          |
> >>>                          |
> >>>                          |
> >>>     [base] <- [mid] <- (foo)
> >>>
> >>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> >>> to an existing image on top,
> >>>
> >>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >>>                          |                              |
> >>>                          |                              |
> >>>                          |                              |
> >>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >>>
> >>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> >>> We can utilize an automatic hidden drive-backup target:
> >>>
> >>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >>>                          |                                                          |
> >>>                          |                                                          |
> >>>                          v                                                          v
> >>>
> >>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >>>
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          v                              ^
> >>>                          >>>> drive-backup sync=none >>>>
> >>>
> >>> So when guest writes to 'foo', the old data is moved to (hidden target), which
> >>> remains unchanged from (bar)'s PoV.
> >>>
> >>> The drive in the middle is called hidden because QEMU creates it automatically,
> >>> the naming is arbitrary.
> >>
> >> I don't understand this. In which function, the hidden target is created automatically?
> >>
> > 
> > It's to be determined. This part is only in my mind :)
> 
> What about this:
> -drive file=nbd-target,if=none,id=nbd-target0 \
> -drive file=active-disk,if=virtio,driver=qcow2,backing.file.filename=hidden-disk,backing.driver=qcow2,backing.backing=nbd-target0
> 

It's close. I suppose backing.backing is referencing another drive as its
backing_hd, then you cannot have the other backing.file.* option - they
conflict. It would be something along:

-drive file=nbd-target,if=none,id=nbd-target0 \
-drive file=hidden-disk,if=none,id=hidden0,backing.backing=nbd-target0 \
-drive file=active-disk,if=virtio,driver=qcow2,backing.backing=hidden0

Or for simplicity, s/backing.backing=/backing=/g

Yes, adding these "backing=$drive_id" option is also exactly what we expect
in order to support image-fleecing, but we haven't figured how to allow that
without breaking other qmp operations like block jobs, etc.

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-26  8:44               ` Fam Zheng
@ 2015-02-26  9:07                 ` Wen Congyang
  2015-02-26 10:02                   ` Fam Zheng
  0 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-26  9:07 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On 02/26/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/26 14:38, Wen Congyang wrote:
>> On 02/25/2015 10:46 AM, Fam Zheng wrote:
>>> On Tue, 02/24 15:50, Wen Congyang wrote:
>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>> Hi Congyang,
>>>>>>>
>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>> +== Workflow ==
>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>> +
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +                  |                                       |
>>>>>>>> +                  |                                      (4)
>>>>>>>> +                  |                                       V
>>>>>>>> +                  |                              /-------------\
>>>>>>>> +                  |      Copy and Forward        |             |
>>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>>>> +                  |                      |       |             |
>>>>>>>> +                  |                     (3)      \-------------/
>>>>>>>> +                  |                 speculative      ^
>>>>>>>> +                  |                write through    (2)
>>>>>>>> +                  |                      |           |
>>>>>>>> +                  V                      V           |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>
>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>> reading them as "s/will be/are/g"
>>>>>>>
>>>>>>> Why do you need this buffer?
>>>>>>
>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>> vm write to the buffer.
>>>>>>
>>>>>>>
>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>> buffer?
>>>>>>
>>>>>> The primary content will be written to the secondary disk, and the secondary content
>>>>>> is saved in the buffer.
>>>>>
>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>> image" feature, as described below.
>>>>>
>>>>> When we have a normal backing chain,
>>>>>
>>>>>                {virtio-blk dev 'foo'}
>>>>>                          |
>>>>>                          |
>>>>>                          |
>>>>>     [base] <- [mid] <- (foo)
>>>>>
>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>> to an existing image on top,
>>>>>
>>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>>>
>>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>
>>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>                          |                                                          |
>>>>>                          |                                                          |
>>>>>                          v                                                          v
>>>>>
>>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>>>
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>
>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>> remains unchanged from (bar)'s PoV.
>>>>>
>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>> the naming is arbitrary.
>>>>
>>>> I don't understand this. In which function, the hidden target is created automatically?
>>>>
>>>
>>> It's to be determined. This part is only in my mind :)
>>
>> What about this:
>> -drive file=nbd-target,if=none,id=nbd-target0 \
>> -drive file=active-disk,if=virtio,driver=qcow2,backing.file.filename=hidden-disk,backing.driver=qcow2,backing.backing=nbd-target0
>>
> 
> It's close. I suppose backing.backing is referencing another drive as its
> backing_hd, then you cannot have the other backing.file.* option - they
> conflict. It would be something along:
> 
> -drive file=nbd-target,if=none,id=nbd-target0 \
> -drive file=hidden-disk,if=none,id=hidden0,backing.backing=nbd-target0 \
> -drive file=active-disk,if=virtio,driver=qcow2,backing.backing=hidden0
> 
> Or for simplicity, s/backing.backing=/backing=/g

If using backing=drive_id, backing.backing and backing.file.* are not conflict.
backing.backing=$drive_id means that: backing file's backing file's id is $drive_id.

> 
> Yes, adding these "backing=$drive_id" option is also exactly what we expect
> in order to support image-fleecing, but we haven't figured how to allow that
> without breaking other qmp operations like block jobs, etc.

I don't understand this. In which case, qmp operations will be broken? Can you give
me some examples?

Thanks
Wen Congyang

> 
> Fam
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-26  9:07                 ` Wen Congyang
@ 2015-02-26 10:02                   ` Fam Zheng
  2015-02-27  2:27                     ` Wen Congyang
  0 siblings, 1 reply; 81+ messages in thread
From: Fam Zheng @ 2015-02-26 10:02 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On Thu, 02/26 17:07, Wen Congyang wrote:
> On 02/26/2015 04:44 PM, Fam Zheng wrote:
> > On Thu, 02/26 14:38, Wen Congyang wrote:
> >> On 02/25/2015 10:46 AM, Fam Zheng wrote:
> >>> On Tue, 02/24 15:50, Wen Congyang wrote:
> >>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> >>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
> >>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>>>>> Hi Congyang,
> >>>>>>>
> >>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>>>>> +== Workflow ==
> >>>>>>>> +The following is the image of block replication workflow:
> >>>>>>>> +
> >>>>>>>> +        +----------------------+            +------------------------+
> >>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
> >>>>>>>> +        +----------------------+            +------------------------+
> >>>>>>>> +                  |                                       |
> >>>>>>>> +                  |                                      (4)
> >>>>>>>> +                  |                                       V
> >>>>>>>> +                  |                              /-------------\
> >>>>>>>> +                  |      Copy and Forward        |             |
> >>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
> >>>>>>>> +                  |                      |       |             |
> >>>>>>>> +                  |                     (3)      \-------------/
> >>>>>>>> +                  |                 speculative      ^
> >>>>>>>> +                  |                write through    (2)
> >>>>>>>> +                  |                      |           |
> >>>>>>>> +                  V                      V           |
> >>>>>>>> +           +--------------+           +----------------+
> >>>>>>>> +           | Primary Disk |           | Secondary Disk |
> >>>>>>>> +           +--------------+           +----------------+
> >>>>>>>> +
> >>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
> >>>>>>>> +       QEMU.
> >>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
> >>>>>>>> +       original sector content will be read from Secondary disk and
> >>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>>>>> +       sector content in the Disk buffer.
> >>>>>>>
> >>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>>>>> reading them as "s/will be/are/g"
> >>>>>>>
> >>>>>>> Why do you need this buffer?
> >>>>>>
> >>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
> >>>>>> vm write to the buffer.
> >>>>>>
> >>>>>>>
> >>>>>>> If both primary and secondary write to the same sector, what is saved in the
> >>>>>>> buffer?
> >>>>>>
> >>>>>> The primary content will be written to the secondary disk, and the secondary content
> >>>>>> is saved in the buffer.
> >>>>>
> >>>>> I wonder if alternatively this is possible with an imaginary "writable backing
> >>>>> image" feature, as described below.
> >>>>>
> >>>>> When we have a normal backing chain,
> >>>>>
> >>>>>                {virtio-blk dev 'foo'}
> >>>>>                          |
> >>>>>                          |
> >>>>>                          |
> >>>>>     [base] <- [mid] <- (foo)
> >>>>>
> >>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
> >>>>> to an existing image on top,
> >>>>>
> >>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >>>>>                          |                              |
> >>>>>                          |                              |
> >>>>>                          |                              |
> >>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >>>>>
> >>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
> >>>>> We can utilize an automatic hidden drive-backup target:
> >>>>>
> >>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >>>>>                          |                                                          |
> >>>>>                          |                                                          |
> >>>>>                          v                                                          v
> >>>>>
> >>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >>>>>
> >>>>>                          v                              ^
> >>>>>                          v                              ^
> >>>>>                          v                              ^
> >>>>>                          v                              ^
> >>>>>                          >>>> drive-backup sync=none >>>>
> >>>>>
> >>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
> >>>>> remains unchanged from (bar)'s PoV.
> >>>>>
> >>>>> The drive in the middle is called hidden because QEMU creates it automatically,
> >>>>> the naming is arbitrary.
> >>>>
> >>>> I don't understand this. In which function, the hidden target is created automatically?
> >>>>
> >>>
> >>> It's to be determined. This part is only in my mind :)
> >>
> >> What about this:
> >> -drive file=nbd-target,if=none,id=nbd-target0 \
> >> -drive file=active-disk,if=virtio,driver=qcow2,backing.file.filename=hidden-disk,backing.driver=qcow2,backing.backing=nbd-target0
> >>
> > 
> > It's close. I suppose backing.backing is referencing another drive as its
> > backing_hd, then you cannot have the other backing.file.* option - they
> > conflict. It would be something along:
> > 
> > -drive file=nbd-target,if=none,id=nbd-target0 \
> > -drive file=hidden-disk,if=none,id=hidden0,backing.backing=nbd-target0 \
> > -drive file=active-disk,if=virtio,driver=qcow2,backing.backing=hidden0
> > 
> > Or for simplicity, s/backing.backing=/backing=/g
> 
> If using backing=drive_id, backing.backing and backing.file.* are not conflict.
> backing.backing=$drive_id means that: backing file's backing file's id is $drive_id.

I see.

> 
> > 
> > Yes, adding these "backing=$drive_id" option is also exactly what we expect
> > in order to support image-fleecing, but we haven't figured how to allow that
> > without breaking other qmp operations like block jobs, etc.
> 
> I don't understand this. In which case, qmp operations will be broken? Can you give
> me some examples?
> 

I don't mean there is a fundamental stopper for this, but in order to relax the
assumption that "only top BDS can have a BlockBackend", we need to think
through the whole block layer, and add new finer checks/restrictions where it's
necessary, otherwise it will be a mess to allow arbitrary backing reference.

Some random questions I'm now aware of:

1. nbd-target0 is writable here, without the drive-backup, hidden0 could be
corrupted by writings to it. So there need to be a new convention and
invariance to follow.

2. in qmp, block-commit hidden0 to nbd-target0 or it's backing file, will
corrupt data (from nbd-target0's perspective).

3. unclear implications of "change" and "eject" when there is backing
reference.

4. can a drive be backing referenced by more than one other drives?

Just two cents, and I still need to think about it systematically.

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/14] NBD client: connect to nbd server later
  2015-02-25 14:22       ` Max Reitz
@ 2015-02-26 14:07         ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-02-26 14:07 UTC (permalink / raw)
  To: Max Reitz, Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang



On 25/02/2015 15:22, Max Reitz wrote:
> 3. nbd_co_discard()
>     quorum doens't call bdrv_co_discard(), so it is OK to return -EIO here. 

That can change, so I think you should return -EIO either everywhere or
nowhere.  Which probably means nowhere.

> Hm, okay. How about adding an option to quorum for ignoring errors from
> a specific child? It's probably not possible to do something like
> "children.1.ignore-errors=true", but maybe you can just ignore errors in
> quorum from any but the first child if the read pattern is set to
> "first", that would make sense to me.
> 
> But if you don't want to do that, I guess just making NBD some kind of
> /dev/null before it's connected should be fine.

I think what Wen is doing is okay, especially since it's only the
special nbd+colo:// URIs that are acting as /dev/null.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 07/14] NBD client: implement block driver interfaces for block replication
  2015-02-23 21:41   ` Max Reitz
@ 2015-02-26 14:08     ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-02-26 14:08 UTC (permalink / raw)
  To: Max Reitz, Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang



On 23/02/2015 22:41, Max Reitz wrote:
>>
>>   }
>>   +static int nbd_start_replication(BlockDriverState *bs, int mode)
>> +{
>> +    BDRVNBDState *s = bs->opaque;
>> +    Error *local_err = NULL;
>> +    int ret;
>> +
>> +    /*
>> +     * TODO: support COLO_SECONDARY_MODE if we allow secondary
>> +     * QEMU becoming primary QEMU.
>> +     */
>> +    if (mode != COLO_PRIMARY_MODE) {
>> +        return -1;
> 
> Once again, I'd like -ENOTSUP more (or -EINVAL or whatever you prefer).

Using the Error API is the right thing to do here.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 11/14] allow writing to the backing file
  2015-02-23 22:03   ` Max Reitz
@ 2015-02-26 14:15     ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-02-26 14:15 UTC (permalink / raw)
  To: Max Reitz, Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang



On 23/02/2015 23:03, Max Reitz wrote:
> On 2015-02-11 at 22:07, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>   block.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> I don't think this is a good idea. With this patch, every time you open
> a COW file (with a backing file) R/W, the backing file will be writable.
> I'd rather like a way to explicitly overwrite the R/W mode of the
> backing file; but by default, in my opinion, it should stay read-only.

I agree.

Perhaps blkcolo_open or colo_svm_init can take care of setting
BDRV_O_RDWR on the backing file?  They could also use bdrv_reopen.

Paolo

> Max
> 
>> diff --git a/block.c b/block.c
>> index 067c44b..96cf973 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -856,8 +856,8 @@ static int bdrv_inherited_flags(int flags)
>>    */
>>   static int bdrv_backing_flags(int flags)
>>   {
>> -    /* backing files always opened read-only */
>> -    flags &= ~(BDRV_O_RDWR | BDRV_O_COPY_ON_READ);
>> +    /* backing files are opened read-write for block replication */
>> +    flags &= ~BDRV_O_COPY_ON_READ;
>>         /* snapshot=on is handled on the top layer */
>>       flags &= ~(BDRV_O_SNAPSHOT | BDRV_O_TEMPORARY);
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-26 10:02                   ` Fam Zheng
@ 2015-02-27  2:27                     ` Wen Congyang
  2015-02-27  2:32                       ` Fam Zheng
  0 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-02-27  2:27 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On 02/26/2015 06:02 PM, Fam Zheng wrote:
> On Thu, 02/26 17:07, Wen Congyang wrote:
>> On 02/26/2015 04:44 PM, Fam Zheng wrote:
>>> On Thu, 02/26 14:38, Wen Congyang wrote:
>>>> On 02/25/2015 10:46 AM, Fam Zheng wrote:
>>>>> On Tue, 02/24 15:50, Wen Congyang wrote:
>>>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>>>> Hi Congyang,
>>>>>>>>>
>>>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>>>> +== Workflow ==
>>>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>>>> +
>>>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>>>> +                  |                                       |
>>>>>>>>>> +                  |                                      (4)
>>>>>>>>>> +                  |                                       V
>>>>>>>>>> +                  |                              /-------------\
>>>>>>>>>> +                  |      Copy and Forward        |             |
>>>>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>>>>>> +                  |                      |       |             |
>>>>>>>>>> +                  |                     (3)      \-------------/
>>>>>>>>>> +                  |                 speculative      ^
>>>>>>>>>> +                  |                write through    (2)
>>>>>>>>>> +                  |                      |           |
>>>>>>>>>> +                  V                      V           |
>>>>>>>>>> +           +--------------+           +----------------+
>>>>>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>>>>>> +           +--------------+           +----------------+
>>>>>>>>>> +
>>>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>>>> +       QEMU.
>>>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>>>
>>>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>>>> reading them as "s/will be/are/g"
>>>>>>>>>
>>>>>>>>> Why do you need this buffer?
>>>>>>>>
>>>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>>>> vm write to the buffer.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>>>> buffer?
>>>>>>>>
>>>>>>>> The primary content will be written to the secondary disk, and the secondary content
>>>>>>>> is saved in the buffer.
>>>>>>>
>>>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>>>> image" feature, as described below.
>>>>>>>
>>>>>>> When we have a normal backing chain,
>>>>>>>
>>>>>>>                {virtio-blk dev 'foo'}
>>>>>>>                          |
>>>>>>>                          |
>>>>>>>                          |
>>>>>>>     [base] <- [mid] <- (foo)
>>>>>>>
>>>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>>>> to an existing image on top,
>>>>>>>
>>>>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>>>                          |                              |
>>>>>>>                          |                              |
>>>>>>>                          |                              |
>>>>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>>>>>
>>>>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>>>
>>>>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>>>                          |                                                          |
>>>>>>>                          |                                                          |
>>>>>>>                          v                                                          v
>>>>>>>
>>>>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>>>>>
>>>>>>>                          v                              ^
>>>>>>>                          v                              ^
>>>>>>>                          v                              ^
>>>>>>>                          v                              ^
>>>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>>>
>>>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>>>> remains unchanged from (bar)'s PoV.
>>>>>>>
>>>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>>>> the naming is arbitrary.
>>>>>>
>>>>>> I don't understand this. In which function, the hidden target is created automatically?
>>>>>>
>>>>>
>>>>> It's to be determined. This part is only in my mind :)
>>>>
>>>> What about this:
>>>> -drive file=nbd-target,if=none,id=nbd-target0 \
>>>> -drive file=active-disk,if=virtio,driver=qcow2,backing.file.filename=hidden-disk,backing.driver=qcow2,backing.backing=nbd-target0
>>>>
>>>
>>> It's close. I suppose backing.backing is referencing another drive as its
>>> backing_hd, then you cannot have the other backing.file.* option - they
>>> conflict. It would be something along:
>>>
>>> -drive file=nbd-target,if=none,id=nbd-target0 \
>>> -drive file=hidden-disk,if=none,id=hidden0,backing.backing=nbd-target0 \
>>> -drive file=active-disk,if=virtio,driver=qcow2,backing.backing=hidden0
>>>
>>> Or for simplicity, s/backing.backing=/backing=/g
>>
>> If using backing=drive_id, backing.backing and backing.file.* are not conflict.
>> backing.backing=$drive_id means that: backing file's backing file's id is $drive_id.
> 
> I see.
> 
>>
>>>
>>> Yes, adding these "backing=$drive_id" option is also exactly what we expect
>>> in order to support image-fleecing, but we haven't figured how to allow that
>>> without breaking other qmp operations like block jobs, etc.
>>
>> I don't understand this. In which case, qmp operations will be broken? Can you give
>> me some examples?
>>
> 
> I don't mean there is a fundamental stopper for this, but in order to relax the
> assumption that "only top BDS can have a BlockBackend", we need to think
> through the whole block layer, and add new finer checks/restrictions where it's
> necessary, otherwise it will be a mess to allow arbitrary backing reference.
> 
> Some random questions I'm now aware of:
> 
> 1. nbd-target0 is writable here, without the drive-backup, hidden0 could be
> corrupted by writings to it. So there need to be a new convention and
> invariance to follow.

Hmm, I understand while the hidden-disk should be opened automatically now.
If we use backing reference, I think we should open a hindden-disk, and set
drive backup automatically. Block any conflict operations(commit, change, eject?)

> 
> 2. in qmp, block-commit hidden0 to nbd-target0 or it's backing file, will
> corrupt data (from nbd-target0's perspective).
> 
> 3. unclear implications of "change" and "eject" when there is backing
> reference.
> 
> 4. can a drive be backing referenced by more than one other drives?

We can forbid it first.

Thanks
Wen Congyang

> 
> Just two cents, and I still need to think about it systematically.
> 
> Fam
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-27  2:27                     ` Wen Congyang
@ 2015-02-27  2:32                       ` Fam Zheng
  0 siblings, 0 replies; 81+ messages in thread
From: Fam Zheng @ 2015-02-27  2:32 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On Fri, 02/27 10:27, Wen Congyang wrote:
> > 1. nbd-target0 is writable here, without the drive-backup, hidden0 could be
> > corrupted by writings to it. So there need to be a new convention and
> > invariance to follow.
> 
> Hmm, I understand while the hidden-disk should be opened automatically now.
> If we use backing reference, I think we should open a hindden-disk, and set
> drive backup automatically. Block any conflict operations(commit, change, eject?)

This might be a good idea.

> 
> > 
> > 2. in qmp, block-commit hidden0 to nbd-target0 or it's backing file, will
> > corrupt data (from nbd-target0's perspective).
> > 
> > 3. unclear implications of "change" and "eject" when there is backing
> > reference.
> > 
> > 4. can a drive be backing referenced by more than one other drives?
> 
> We can forbid it first.
> 

Yes, probably with a new op blocker type.

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12 10:26               ` famz
  2015-02-13  5:09                 ` Wen Congyang
@ 2015-03-03  7:53                 ` Wen Congyang
  2015-03-03  7:59                   ` Fam Zheng
  1 sibling, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-03-03  7:53 UTC (permalink / raw)
  To: famz
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On 02/12/2015 06:26 PM, famz@redhat.com wrote:
> On Thu, 02/12 18:11, Wen Congyang wrote:
>> On 02/12/2015 05:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 17:33, Wen Congyang wrote:
>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>> Hi Congyang,
>>>>>>>
>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>> +== Workflow ==
>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>> +
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +                  |                                       |
>>>>>>>> +                  |                                      (4)
>>>>>>>> +                  |                                       V
>>>>>>>> +                  |                              /-------------\
>>>>>>>> +                  |      Copy and Forward        |             |
>>>>>>>> +                  |---------(1)----------+       | Disk Buffer |
>>>>>>>> +                  |                      |       |             |
>>>>>>>> +                  |                     (3)      \-------------/
>>>>>>>> +                  |                 speculative      ^
>>>>>>>> +                  |                write through    (2)
>>>>>>>> +                  |                      |           |
>>>>>>>> +                  V                      V           |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +           | Primary Disk |           | Secondary Disk |
>>>>>>>> +           +--------------+           +----------------+
>>>>>>>> +
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>
>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>> reading them as "s/will be/are/g"
>>>>>>>
>>>>>>> Why do you need this buffer?
>>>>>>
>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>> vm write to the buffer.
>>>>>>
>>>>>>>
>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>> buffer?
>>>>>>
>>>>>> The primary content will be written to the secondary disk, and the secondary content
>>>>>> is saved in the buffer.
>>>>>
>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>> image" feature, as described below.
>>>>>
>>>>> When we have a normal backing chain,
>>>>>
>>>>>                {virtio-blk dev 'foo'}
>>>>>                          |
>>>>>                          |
>>>>>                          |
>>>>>     [base] <- [mid] <- (foo)
>>>>>
>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>> to an existing image on top,
>>>>>
>>>>>                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>                          |                              |
>>>>>     [base] <- [mid] <- (foo)  <---------------------- (bar)
>>>>>
>>>>> It's important to make sure that writes to 'foo' doesn't break data for 'bar'.
>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>
>>>>>                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>                          |                                                          |
>>>>>                          |                                                          |
>>>>>                          v                                                          v
>>>>>
>>>>>     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
>>>>>
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>
>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>> remains unchanged from (bar)'s PoV.
>>>>>
>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>> the naming is arbitrary.
>>>>>
>>>>> It is interesting because it is a more generalized case of image fleecing,
>>>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>>>> only) purpose.
>>>>>
>>>>> More interestingly, with above facility, it is also possible to create a guest
>>>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>>>> cheaply. Or call it shadow copy if you will.
>>>>>
>>>>> Back to the COLO case, the configuration will be very similar:
>>>>>
>>>>>
>>>>>                       {primary wr}                                                {secondary vm}
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             |                                                           |
>>>>>                             v                                                           v
>>>>>
>>>>>    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>>>
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             v                              ^
>>>>>                             >>>> drive-backup sync=none >>>>
>>>>
>>>> What is active disk? There are two disk images?
>>>
>>> It starts as an empty image with (hidden buf disk) as backing file, which in
>>> turn has (nbd target) as backing file.
>>
>> It's too complicated..., and I don't understand it.
>> 1. What is active disk? Use raw or a new block driver?
> 
> It is an empty qcow2 image with the same lenght as your Secondary Disk.

I test qcow2_make_empty()'s performance. The result shows that it may
take about 100ms(normal sata disk). It is not acceptable for COLO. So
I think disk buff is necessary(just use it to replace qcow2).

Thanks
Wen Congyang

> 
>> 2. Hidden buf disk use new block driver?
> 
> It is an empty qcow2 image with the same lenght as your Secondary Disk, too.
> 
>> 3. nbd target is hidden buf disk's backing image? If it is opened read-only, we will
>>    export a nbd with read-only BlockDriverState, but nbd server needs to write it.
> 
> NBD target is your Secondary Disk. It is opened read-write.
> 
> The patches to enable opening it as read-write, and starting drive-backup
> between it and hidden buf disk, are all work in progress (the core concept) of
> image fleecing.
> 
> Fam
> 
>>>>>
>>>>> The workflow analogue is:
>>>>>
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>
>>>>> Primary write requests are forwarded to secondary QEMU as well.
>>>>>
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>>>> disk, the orignal sector content is read from it and copied to (hidden buf
>>>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>>>> disk).
>>>>>
>>>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>>>
>>>>> Primary write requests are written to (nbd target).
>>>>>
>>>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>>>> +       will overwrite the existing sector content in the buffer.
>>>>>
>>>>> Secondary write request will be written in (active disk) as usual.
>>>>>
>>>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>>>> data in (hidden buf disk) and (active disk); when failover happends, if you
>>>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>>>> drop data in (hidden buf disk).
>>>>>
>>>>> Fam
>>>>> .
>>>>>
>>>>
>>>>
>>> .
>>>
>>
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-03  7:53                 ` Wen Congyang
@ 2015-03-03  7:59                   ` Fam Zheng
  2015-03-03 12:12                     ` Wen Congyang
  2015-03-11  6:44                     ` Wen Congyang
  0 siblings, 2 replies; 81+ messages in thread
From: Fam Zheng @ 2015-03-03  7:59 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On Tue, 03/03 15:53, Wen Congyang wrote:
> I test qcow2_make_empty()'s performance. The result shows that it may
> take about 100ms(normal sata disk). It is not acceptable for COLO. So
> I think disk buff is necessary(just use it to replace qcow2).

Why not tmpfs or ramdisk?

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-03  7:59                   ` Fam Zheng
@ 2015-03-03 12:12                     ` Wen Congyang
  2015-03-11  6:44                     ` Wen Congyang
  1 sibling, 0 replies; 81+ messages in thread
From: Wen Congyang @ 2015-03-03 12:12 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On 03/03/2015 03:59 PM, Fam Zheng wrote:
> On Tue, 03/03 15:53, Wen Congyang wrote:
>> I test qcow2_make_empty()'s performance. The result shows that it may
>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>> I think disk buff is necessary(just use it to replace qcow2).
> 
> Why not tmpfs or ramdisk?

I test it, and it only takes 2-3ms.

Thanks
Wen Congyang

> 
> Fam
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description Wen Congyang
  2015-02-12  7:21   ` Fam Zheng
@ 2015-03-04 16:35   ` Dr. David Alan Gilbert
  2015-03-05  1:03     ` Wen Congyang
  1 sibling, 1 reply; 81+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-04 16:35 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Gonglei, Stefan Hajnoczi, Paolo Bonzini, Yang Hongyang,
	zhanghailiang

* Wen Congyang (wency@cn.fujitsu.com) wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>

Hi,

> ---
>  docs/block-replication.txt | 129 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 129 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 0000000..59150b8
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,129 @@
> +Block replication
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +The block replication is used for continuous checkpoints. It is designed
> +for COLO that Secondary VM is running. It can also be applied for FT/HA
> +scene that Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.

Can you explain how the block data is synchronised with the main checkpoint
stream?  i.e. when the secondary receives a new checkpoint how does it know
it's received all of the block writes from the primary associated with that
checkpoint and that all the following writes that it receives are for the
next checkpoint period?

Dave

> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content in the Disk buffer.
> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from many basic
> +blocks that are already in QEMU.
> +
> +         virtio-blk       ||
> +             ^            ||                            .----------
> +             |            ||                            | Secondary
> +        1 Quorum          ||                            '----------
> +         /      \         ||
> +        /        \        ||
> +   Primary      2 NBD  ------->  2 NBD
> +     disk       client    ||     server                  virtio-blk
> +                          ||        ^                         ^
> +--------.                 ||        |                         |
> +Primary |                 ||  Secondary disk <--------- COLO buffer 3
> +--------'                 ||                   backing
> +
> +1) The disk on the primary is represented by a block device with two
> +children, providing replication between a primary disk and the host that
> +runs the secondary VM. The read pattern for quorum can be extended to
> +make the primary always read from the local disk instead of going through
> +NBD.
> +
> +2) The secondary disk receives writes from the primary VM through QEMU's
> +embedded NBD server (speculative write-through).
> +
> +3) The disk on the secondary is represented by a custom block device
> +("COLO buffer"). The disk buffer's backing image is the secondary disk,
> +and the disk buffer uses bdrv_add_before_write_notifier to implement
> +copy-on-write, similar to block/backup.c.
> +
> +== New block driver interface ==
> +We add three block driver interfaces to control block replication:
> +a. bdrv_start_replication()
> +   Start block replication, called in migration/checkpoint thread.
> +   We must call bdrv_start_replication() in secondary QEMU before
> +   calling bdrv_start_replication() in primary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transfered to
> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> +c. bdrv_stop_replication()
> +   It is called when failover. We will flush the Disk buffer into
> +   Secondary Disk and stop block replication.
> +
> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=first,\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw,\
> +         children.1.file.driver=nbd+colo,\
> +         children.1.file.host=xxx,\
> +         children.1.file.port=xxx,\
> +         children.1.file.export=xxx,\
> +         children.1.driver=raw
> +  Note:
> +  1. NBD Client should not be the first child of quorum.
> +  2. There should be only one NBD Client.
> +  3. host is the secondary physical machine's hostname or IP
> +  4. Each disk must have its own export name.
> +
> +Secondary:
> +  -drive if=xxx,driver=blkcolo,export=xxx,\
> +         backing.file.filename=1.raw,\
> +         backing.driver=raw
> +  Then run qmp command:
> +    nbd_server_start host:port
> +  Note:
> +  1. The export name for the same disk must be the same in primary
> +     and secondary QEMU command line
> +  2. The qmp command nbd_server_start must be run before running the
> +     qmp command migrate on primary QEMU
> +  3. Don't use nbd_server_start's other options
> -- 
> 2.1.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-04 16:35   ` Dr. David Alan Gilbert
@ 2015-03-05  1:03     ` Wen Congyang
  2015-03-05 19:04       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-03-05  1:03 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Gonglei, Stefan Hajnoczi, Paolo Bonzini, Yang Hongyang,
	zhanghailiang

On 03/05/2015 12:35 AM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (wency@cn.fujitsu.com) wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> 
> Hi,
> 
>> ---
>>  docs/block-replication.txt | 129 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 129 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 0000000..59150b8
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,129 @@
>> +Block replication
>> +----------------------------------------
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or later.
>> +See the COPYING file in the top-level directory.
>> +
>> +The block replication is used for continuous checkpoints. It is designed
>> +for COLO that Secondary VM is running. It can also be applied for FT/HA
>> +scene that Secondary VM is not running.
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
>> +identical right after a VM checkpoint, but becomes different as the VM
>> +executes till the next checkpoint. To support disk contents checkpoint,
>> +the modified disk contents in the Secondary VM must be buffered, and are
>> +only dropped at next checkpoint time. To reduce the network transportation
>> +effort at the time of checkpoint, the disk modification operations of
>> +Primary disk are asynchronously forwarded to the Secondary node.
> 
> Can you explain how the block data is synchronised with the main checkpoint
> stream?  i.e. when the secondary receives a new checkpoint how does it know
> it's received all of the block writes from the primary associated with that
> checkpoint and that all the following writes that it receives are for the
> next checkpoint period?

NBD server will do it. Writing to NBD client will return after NBD server replies
the result(ACK or error).

Thanks
Wen Congyang

> 
> Dave
> 
>> +
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> +        +----------------------+            +------------------------+
>> +        |Primary Write Requests|            |Secondary Write Requests|
>> +        +----------------------+            +------------------------+
>> +                  |                                       |
>> +                  |                                      (4)
>> +                  |                                       V
>> +                  |                              /-------------\
>> +                  |      Copy and Forward        |             |
>> +                  |---------(1)----------+       | Disk Buffer |
>> +                  |                      |       |             |
>> +                  |                     (3)      \-------------/
>> +                  |                 speculative      ^
>> +                  |                write through    (2)
>> +                  |                      |           |
>> +                  V                      V           |
>> +           +--------------+           +----------------+
>> +           | Primary Disk |           | Secondary Disk |
>> +           +--------------+           +----------------+
>> +
>> +    1) Primary write requests will be copied and forwarded to Secondary
>> +       QEMU.
>> +    2) Before Primary write requests are written to Secondary disk, the
>> +       original sector content will be read from Secondary disk and
>> +       buffered in the Disk buffer, but it will not overwrite the existing
>> +       sector content in the Disk buffer.
>> +    3) Primary write requests will be written to Secondary disk.
>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>> +       will overwrite the existing sector content in the buffer.
>> +
>> +== Architecture ==
>> +We are going to implement COLO block replication from many basic
>> +blocks that are already in QEMU.
>> +
>> +         virtio-blk       ||
>> +             ^            ||                            .----------
>> +             |            ||                            | Secondary
>> +        1 Quorum          ||                            '----------
>> +         /      \         ||
>> +        /        \        ||
>> +   Primary      2 NBD  ------->  2 NBD
>> +     disk       client    ||     server                  virtio-blk
>> +                          ||        ^                         ^
>> +--------.                 ||        |                         |
>> +Primary |                 ||  Secondary disk <--------- COLO buffer 3
>> +--------'                 ||                   backing
>> +
>> +1) The disk on the primary is represented by a block device with two
>> +children, providing replication between a primary disk and the host that
>> +runs the secondary VM. The read pattern for quorum can be extended to
>> +make the primary always read from the local disk instead of going through
>> +NBD.
>> +
>> +2) The secondary disk receives writes from the primary VM through QEMU's
>> +embedded NBD server (speculative write-through).
>> +
>> +3) The disk on the secondary is represented by a custom block device
>> +("COLO buffer"). The disk buffer's backing image is the secondary disk,
>> +and the disk buffer uses bdrv_add_before_write_notifier to implement
>> +copy-on-write, similar to block/backup.c.
>> +
>> +== New block driver interface ==
>> +We add three block driver interfaces to control block replication:
>> +a. bdrv_start_replication()
>> +   Start block replication, called in migration/checkpoint thread.
>> +   We must call bdrv_start_replication() in secondary QEMU before
>> +   calling bdrv_start_replication() in primary QEMU.
>> +b. bdrv_do_checkpoint()
>> +   This interface is called after all VM state is transfered to
>> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
>> +c. bdrv_stop_replication()
>> +   It is called when failover. We will flush the Disk buffer into
>> +   Secondary Disk and stop block replication.
>> +
>> +== Usage ==
>> +Primary:
>> +  -drive if=xxx,driver=quorum,read-pattern=first,\
>> +         children.0.file.filename=1.raw,\
>> +         children.0.driver=raw,\
>> +         children.1.file.driver=nbd+colo,\
>> +         children.1.file.host=xxx,\
>> +         children.1.file.port=xxx,\
>> +         children.1.file.export=xxx,\
>> +         children.1.driver=raw
>> +  Note:
>> +  1. NBD Client should not be the first child of quorum.
>> +  2. There should be only one NBD Client.
>> +  3. host is the secondary physical machine's hostname or IP
>> +  4. Each disk must have its own export name.
>> +
>> +Secondary:
>> +  -drive if=xxx,driver=blkcolo,export=xxx,\
>> +         backing.file.filename=1.raw,\
>> +         backing.driver=raw
>> +  Then run qmp command:
>> +    nbd_server_start host:port
>> +  Note:
>> +  1. The export name for the same disk must be the same in primary
>> +     and secondary QEMU command line
>> +  2. The qmp command nbd_server_start must be run before running the
>> +     qmp command migrate on primary QEMU
>> +  3. Don't use nbd_server_start's other options
>> -- 
>> 2.1.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-05  1:03     ` Wen Congyang
@ 2015-03-05 19:04       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 81+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-05 19:04 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Gonglei, Stefan Hajnoczi, Paolo Bonzini, Yang Hongyang,
	zhanghailiang

* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/05/2015 12:35 AM, Dr. David Alan Gilbert wrote:
> > * Wen Congyang (wency@cn.fujitsu.com) wrote:
> >> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> >> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> >> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> > 
> > Hi,
> > 
> >> ---
> >>  docs/block-replication.txt | 129 +++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 129 insertions(+)
> >>  create mode 100644 docs/block-replication.txt
> >>
> >> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> >> new file mode 100644
> >> index 0000000..59150b8
> >> --- /dev/null
> >> +++ b/docs/block-replication.txt
> >> @@ -0,0 +1,129 @@
> >> +Block replication
> >> +----------------------------------------
> >> +Copyright Fujitsu, Corp. 2015
> >> +Copyright (c) 2015 Intel Corporation
> >> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
> >> +
> >> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> >> +See the COPYING file in the top-level directory.
> >> +
> >> +The block replication is used for continuous checkpoints. It is designed
> >> +for COLO that Secondary VM is running. It can also be applied for FT/HA
> >> +scene that Secondary VM is not running.
> >> +
> >> +This document gives an overview of block replication's design.
> >> +
> >> +== Background ==
> >> +High availability solutions such as micro checkpoint and COLO will do
> >> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
> >> +identical right after a VM checkpoint, but becomes different as the VM
> >> +executes till the next checkpoint. To support disk contents checkpoint,
> >> +the modified disk contents in the Secondary VM must be buffered, and are
> >> +only dropped at next checkpoint time. To reduce the network transportation
> >> +effort at the time of checkpoint, the disk modification operations of
> >> +Primary disk are asynchronously forwarded to the Secondary node.
> > 
> > Can you explain how the block data is synchronised with the main checkpoint
> > stream?  i.e. when the secondary receives a new checkpoint how does it know
> > it's received all of the block writes from the primary associated with that
> > checkpoint and that all the following writes that it receives are for the
> > next checkpoint period?
> 
> NBD server will do it. Writing to NBD client will return after NBD server replies
> the result(ACK or error).

Ah OK, so if the NBD client is synchronous then yes I can see that;
(I was confused by the word 'asynchronously' in your description above
but I guess that means asynchronous to the checkpoint stream).
I see that 'do_colo_transaction' keeps the primary stopped until after
the secondary does blk_do_checkpoint and then sends 'LOADED'.

I think yes that should work; although potentially you could make it faster;
since the primary doesn't need to know that it's write has been commited
until the next checkpoint, and if you could mark the separation in the two
checkpoints, then you could start the primary running again earlier.  But that's
all more complicated; this should work OK.

Thanks for the explanation,

Dave

> Thanks
> Wen Congyang
> 
> > 
> > Dave
> > 
> >> +
> >> +== Workflow ==
> >> +The following is the image of block replication workflow:
> >> +
> >> +        +----------------------+            +------------------------+
> >> +        |Primary Write Requests|            |Secondary Write Requests|
> >> +        +----------------------+            +------------------------+
> >> +                  |                                       |
> >> +                  |                                      (4)
> >> +                  |                                       V
> >> +                  |                              /-------------\
> >> +                  |      Copy and Forward        |             |
> >> +                  |---------(1)----------+       | Disk Buffer |
> >> +                  |                      |       |             |
> >> +                  |                     (3)      \-------------/
> >> +                  |                 speculative      ^
> >> +                  |                write through    (2)
> >> +                  |                      |           |
> >> +                  V                      V           |
> >> +           +--------------+           +----------------+
> >> +           | Primary Disk |           | Secondary Disk |
> >> +           +--------------+           +----------------+
> >> +
> >> +    1) Primary write requests will be copied and forwarded to Secondary
> >> +       QEMU.
> >> +    2) Before Primary write requests are written to Secondary disk, the
> >> +       original sector content will be read from Secondary disk and
> >> +       buffered in the Disk buffer, but it will not overwrite the existing
> >> +       sector content in the Disk buffer.
> >> +    3) Primary write requests will be written to Secondary disk.
> >> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >> +       will overwrite the existing sector content in the buffer.
> >> +
> >> +== Architecture ==
> >> +We are going to implement COLO block replication from many basic
> >> +blocks that are already in QEMU.
> >> +
> >> +         virtio-blk       ||
> >> +             ^            ||                            .----------
> >> +             |            ||                            | Secondary
> >> +        1 Quorum          ||                            '----------
> >> +         /      \         ||
> >> +        /        \        ||
> >> +   Primary      2 NBD  ------->  2 NBD
> >> +     disk       client    ||     server                  virtio-blk
> >> +                          ||        ^                         ^
> >> +--------.                 ||        |                         |
> >> +Primary |                 ||  Secondary disk <--------- COLO buffer 3
> >> +--------'                 ||                   backing
> >> +
> >> +1) The disk on the primary is represented by a block device with two
> >> +children, providing replication between a primary disk and the host that
> >> +runs the secondary VM. The read pattern for quorum can be extended to
> >> +make the primary always read from the local disk instead of going through
> >> +NBD.
> >> +
> >> +2) The secondary disk receives writes from the primary VM through QEMU's
> >> +embedded NBD server (speculative write-through).
> >> +
> >> +3) The disk on the secondary is represented by a custom block device
> >> +("COLO buffer"). The disk buffer's backing image is the secondary disk,
> >> +and the disk buffer uses bdrv_add_before_write_notifier to implement
> >> +copy-on-write, similar to block/backup.c.
> >> +
> >> +== New block driver interface ==
> >> +We add three block driver interfaces to control block replication:
> >> +a. bdrv_start_replication()
> >> +   Start block replication, called in migration/checkpoint thread.
> >> +   We must call bdrv_start_replication() in secondary QEMU before
> >> +   calling bdrv_start_replication() in primary QEMU.
> >> +b. bdrv_do_checkpoint()
> >> +   This interface is called after all VM state is transfered to
> >> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> >> +c. bdrv_stop_replication()
> >> +   It is called when failover. We will flush the Disk buffer into
> >> +   Secondary Disk and stop block replication.
> >> +
> >> +== Usage ==
> >> +Primary:
> >> +  -drive if=xxx,driver=quorum,read-pattern=first,\
> >> +         children.0.file.filename=1.raw,\
> >> +         children.0.driver=raw,\
> >> +         children.1.file.driver=nbd+colo,\
> >> +         children.1.file.host=xxx,\
> >> +         children.1.file.port=xxx,\
> >> +         children.1.file.export=xxx,\
> >> +         children.1.driver=raw
> >> +  Note:
> >> +  1. NBD Client should not be the first child of quorum.
> >> +  2. There should be only one NBD Client.
> >> +  3. host is the secondary physical machine's hostname or IP
> >> +  4. Each disk must have its own export name.
> >> +
> >> +Secondary:
> >> +  -drive if=xxx,driver=blkcolo,export=xxx,\
> >> +         backing.file.filename=1.raw,\
> >> +         backing.driver=raw
> >> +  Then run qmp command:
> >> +    nbd_server_start host:port
> >> +  Note:
> >> +  1. The export name for the same disk must be the same in primary
> >> +     and secondary QEMU command line
> >> +  2. The qmp command nbd_server_start must be run before running the
> >> +     qmp command migrate on primary QEMU
> >> +  3. Don't use nbd_server_start's other options
> >> -- 
> >> 2.1.0
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > .
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-03  7:59                   ` Fam Zheng
  2015-03-03 12:12                     ` Wen Congyang
@ 2015-03-11  6:44                     ` Wen Congyang
  2015-03-11  6:49                       ` Fam Zheng
  1 sibling, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-03-11  6:44 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On 03/03/2015 03:59 PM, Fam Zheng wrote:
> On Tue, 03/03 15:53, Wen Congyang wrote:
>> I test qcow2_make_empty()'s performance. The result shows that it may
>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>> I think disk buff is necessary(just use it to replace qcow2).
> 
> Why not tmpfs or ramdisk?

Another problem:
After failover, secondary write request will be written in (active disk)?
It is better to write request to (nbd target). Is there any feature can
be reused to implement it?

Thanks
Wen Congyang

> 
> Fam
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-11  6:44                     ` Wen Congyang
@ 2015-03-11  6:49                       ` Fam Zheng
  2015-03-11  7:01                         ` Wen Congyang
  2015-03-13  9:01                         ` Wen Congyang
  0 siblings, 2 replies; 81+ messages in thread
From: Fam Zheng @ 2015-03-11  6:49 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On Wed, 03/11 14:44, Wen Congyang wrote:
> On 03/03/2015 03:59 PM, Fam Zheng wrote:
> > On Tue, 03/03 15:53, Wen Congyang wrote:
> >> I test qcow2_make_empty()'s performance. The result shows that it may
> >> take about 100ms(normal sata disk). It is not acceptable for COLO. So
> >> I think disk buff is necessary(just use it to replace qcow2).
> > 
> > Why not tmpfs or ramdisk?
> 
> Another problem:
> After failover, secondary write request will be written in (active disk)?
> It is better to write request to (nbd target). Is there any feature can
> be reused to implement it?

You can use block commit or stream to move the data.

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-11  6:49                       ` Fam Zheng
@ 2015-03-11  7:01                         ` Wen Congyang
  2015-03-11  7:04                           ` Fam Zheng
  2015-03-13  9:01                         ` Wen Congyang
  1 sibling, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-03-11  7:01 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On 03/11/2015 02:49 PM, Fam Zheng wrote:
> On Wed, 03/11 14:44, Wen Congyang wrote:
>> On 03/03/2015 03:59 PM, Fam Zheng wrote:
>>> On Tue, 03/03 15:53, Wen Congyang wrote:
>>>> I test qcow2_make_empty()'s performance. The result shows that it may
>>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>>>> I think disk buff is necessary(just use it to replace qcow2).
>>>
>>> Why not tmpfs or ramdisk?
>>
>> Another problem:
>> After failover, secondary write request will be written in (active disk)?
>> It is better to write request to (nbd target). Is there any feature can
>> be reused to implement it?
> 
> You can use block commit or stream to move the data.

When doing failover, we can use it to move the data. After failover,
I need an endless job to move the data.

Thanks
Wen Congyang

> 
> Fam
> 
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-11  7:01                         ` Wen Congyang
@ 2015-03-11  7:04                           ` Fam Zheng
  2015-03-11  7:12                             ` Wen Congyang
  0 siblings, 1 reply; 81+ messages in thread
From: Fam Zheng @ 2015-03-11  7:04 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On Wed, 03/11 15:01, Wen Congyang wrote:
> On 03/11/2015 02:49 PM, Fam Zheng wrote:
> > On Wed, 03/11 14:44, Wen Congyang wrote:
> >> On 03/03/2015 03:59 PM, Fam Zheng wrote:
> >>> On Tue, 03/03 15:53, Wen Congyang wrote:
> >>>> I test qcow2_make_empty()'s performance. The result shows that it may
> >>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
> >>>> I think disk buff is necessary(just use it to replace qcow2).
> >>>
> >>> Why not tmpfs or ramdisk?
> >>
> >> Another problem:
> >> After failover, secondary write request will be written in (active disk)?
> >> It is better to write request to (nbd target). Is there any feature can
> >> be reused to implement it?
> > 
> > You can use block commit or stream to move the data.
> 
> When doing failover, we can use it to move the data. After failover,
> I need an endless job to move the data.
> 

I see what you mean. After failover, does the nbd server receive more data
(i.e. do you need a buffer to stash data from the other side)? If you commit
(active disk) to (nbd target), all the writes will go to a single image.

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-11  7:04                           ` Fam Zheng
@ 2015-03-11  7:12                             ` Wen Congyang
  0 siblings, 0 replies; 81+ messages in thread
From: Wen Congyang @ 2015-03-11  7:12 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On 03/11/2015 03:04 PM, Fam Zheng wrote:
> On Wed, 03/11 15:01, Wen Congyang wrote:
>> On 03/11/2015 02:49 PM, Fam Zheng wrote:
>>> On Wed, 03/11 14:44, Wen Congyang wrote:
>>>> On 03/03/2015 03:59 PM, Fam Zheng wrote:
>>>>> On Tue, 03/03 15:53, Wen Congyang wrote:
>>>>>> I test qcow2_make_empty()'s performance. The result shows that it may
>>>>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>>>>>> I think disk buff is necessary(just use it to replace qcow2).
>>>>>
>>>>> Why not tmpfs or ramdisk?
>>>>
>>>> Another problem:
>>>> After failover, secondary write request will be written in (active disk)?
>>>> It is better to write request to (nbd target). Is there any feature can
>>>> be reused to implement it?
>>>
>>> You can use block commit or stream to move the data.
>>
>> When doing failover, we can use it to move the data. After failover,
>> I need an endless job to move the data.
>>
> 
> I see what you mean. After failover, does the nbd server receive more data
> (i.e. do you need a buffer to stash data from the other side)? If you commit
> (active disk) to (nbd target), all the writes will go to a single image.

After failover(primary host downs), only secondary qemu works, and nbd server
doesn't receive any more data.

Thanks
Wen Congyang

> 
> Fam
> 
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-11  6:49                       ` Fam Zheng
  2015-03-11  7:01                         ` Wen Congyang
@ 2015-03-13  9:01                         ` Wen Congyang
  2015-03-13  9:05                           ` Fam Zheng
  1 sibling, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-03-13  9:01 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, qemu devel, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, jsnow, zhanghailiang

On 03/11/2015 02:49 PM, Fam Zheng wrote:
> On Wed, 03/11 14:44, Wen Congyang wrote:
>> On 03/03/2015 03:59 PM, Fam Zheng wrote:
>>> On Tue, 03/03 15:53, Wen Congyang wrote:
>>>> I test qcow2_make_empty()'s performance. The result shows that it may
>>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>>>> I think disk buff is necessary(just use it to replace qcow2).
>>>
>>> Why not tmpfs or ramdisk?
>>
>> Another problem:
>> After failover, secondary write request will be written in (active disk)?
>> It is better to write request to (nbd target). Is there any feature can
>> be reused to implement it?
> 
> You can use block commit or stream to move the data.

Can the job stream move the data? I don't find the write ops in block/stream.c.

Thanks
Wen Congyang

> 
> Fam
> 
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-13  9:01                         ` Wen Congyang
@ 2015-03-13  9:05                           ` Fam Zheng
  2015-03-16  6:19                             ` Wen Congyang
  0 siblings, 1 reply; 81+ messages in thread
From: Fam Zheng @ 2015-03-13  9:05 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On Fri, 03/13 17:01, Wen Congyang wrote:
> On 03/11/2015 02:49 PM, Fam Zheng wrote:
> > On Wed, 03/11 14:44, Wen Congyang wrote:
> >> On 03/03/2015 03:59 PM, Fam Zheng wrote:
> >>> On Tue, 03/03 15:53, Wen Congyang wrote:
> >>>> I test qcow2_make_empty()'s performance. The result shows that it may
> >>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
> >>>> I think disk buff is necessary(just use it to replace qcow2).
> >>>
> >>> Why not tmpfs or ramdisk?
> >>
> >> Another problem:
> >> After failover, secondary write request will be written in (active disk)?
> >> It is better to write request to (nbd target). Is there any feature can
> >> be reused to implement it?
> > 
> > You can use block commit or stream to move the data.
> 
> Can the job stream move the data? I don't find the write ops in block/stream.c.

It is bdrv_co_copy_on_readv that moves data.

Fam

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-13  9:05                           ` Fam Zheng
@ 2015-03-16  6:19                             ` Wen Congyang
  2015-03-25 12:41                               ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-03-16  6:19 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, jsnow, zhanghailiang

On 03/13/2015 05:05 PM, Fam Zheng wrote:
> On Fri, 03/13 17:01, Wen Congyang wrote:
>> On 03/11/2015 02:49 PM, Fam Zheng wrote:
>>> On Wed, 03/11 14:44, Wen Congyang wrote:
>>>> On 03/03/2015 03:59 PM, Fam Zheng wrote:
>>>>> On Tue, 03/03 15:53, Wen Congyang wrote:
>>>>>> I test qcow2_make_empty()'s performance. The result shows that it may
>>>>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>>>>>> I think disk buff is necessary(just use it to replace qcow2).
>>>>>
>>>>> Why not tmpfs or ramdisk?
>>>>
>>>> Another problem:
>>>> After failover, secondary write request will be written in (active disk)?
>>>> It is better to write request to (nbd target). Is there any feature can
>>>> be reused to implement it?
>>>
>>> You can use block commit or stream to move the data.
>>
>> Can the job stream move the data? I don't find the write ops in block/stream.c.
> 
> It is bdrv_co_copy_on_readv that moves data.

Does the job stream move the data from base to top?

Thanks
Wen Congyang

> 
> Fam
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 03/14] quorum: ignore 0-length child
  2015-02-23 20:43   ` Max Reitz
  2015-02-24  2:33     ` Wen Congyang
@ 2015-03-18  5:29     ` Wen Congyang
  2015-03-18 12:57       ` Max Reitz
  1 sibling, 1 reply; 81+ messages in thread
From: Wen Congyang @ 2015-03-18  5:29 UTC (permalink / raw)
  To: Max Reitz, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 02/24/2015 04:43 AM, Max Reitz wrote:
> On 2015-02-11 at 22:07, Wen Congyang wrote:
>> We connect to NBD server when starting block replication, so
>> the length is 0 before starting block replication.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>   block/quorum.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/block/quorum.c b/block/quorum.c
>> index 5ed1ff8..e6aff5f 100644
>> --- a/block/quorum.c
>> +++ b/block/quorum.c
>> @@ -734,6 +734,11 @@ static int64_t quorum_getlength(BlockDriverState *bs)
>>           if (value < 0) {
>>               return value;
>>           }
>> +
>> +        if (!value) {
>> +            continue;
>> +        }
>> +
>>           if (value != result) {
>>               return -EIO;
>>           }
> 
> Hm, what do you think about some specific error value returned by your delayed NBD implementation? Like -ENOTCONN or something like that? Then we'd be able to discern a real 0-length block device from a not-yet-connected NBD server.

In my newest test, it cannot return -ENOTCONN, otherwise bdrv_open() will fail when we open the child.
Here is the backtrace:
(gdb) bt
#0  bdrv_open_common (bs=0x5555563b2230, file=0x0, options=0x5555563b55a0, flags=57410, drv=0x555555e90c00, errp=0x7fffffffd460) at block.c:1070
#1  0x000055555595ea72 in bdrv_open (pbs=0x7fffffffd5f8, filename=0x0, reference=0x0, options=0x5555563b55a0, flags=57410, drv=0x555555e90c00, errp=0x7fffffffd5e0) at block.c:1677
#2  0x000055555595e3a9 in bdrv_open_image (pbs=0x7fffffffd5f8, filename=0x0, options=0x5555563a6730, bdref_key=0x555555a86b4c "file", flags=57410, allow_none=true, errp=0x7fffffffd5e0) at block.c:1481
#3  0x000055555595e9ae in bdrv_open (pbs=0x555556388008, filename=0x0, reference=0x0, options=0x5555563a6730, flags=8258, drv=0x555555e8b800, errp=0x7fffffffd6b8) at block.c:1655
#4  0x00005555559b0058 in quorum_open (bs=0x55555639bd90, options=0x55555639f100, flags=8258, errp=0x7fffffffd758) at block/quorum.c:1000
#5  0x000055555595d0b8 in bdrv_open_common (bs=0x55555639bd90, file=0x0, options=0x55555639f100, flags=8258, drv=0x555555e8e5c0, errp=0x7fffffffd840) at block.c:1045
#6  0x000055555595ea72 in bdrv_open (pbs=0x55555639bd50, filename=0x0, reference=0x0, options=0x55555639f100, flags=8258, drv=0x555555e8e5c0, errp=0x7fffffffdb70) at block.c:1677
#7  0x00005555559b3bd3 in blk_new_open (name=0x55555639bc60 "virtio0", filename=0x0, reference=0x0, options=0x55555639a420, flags=66, errp=0x7fffffffdb70) at block/block-backend.c:129
#8  0x0000555555754f78 in blockdev_init (file=0x0, bs_opts=0x55555639a420, errp=0x7fffffffdb70) at blockdev.c:536
#9  0x0000555555755d90 in drive_new (all_opts=0x5555563777e0, block_default_type=IF_IDE) at blockdev.c:971
#10 0x000055555576b1f0 in drive_init_func (opts=0x5555563777e0, opaque=0x555556372b48) at vl.c:1104
#11 0x0000555555a1b019 in qemu_opts_foreach (list=0x555555e44060, func=0x55555576b1ba <drive_init_func>, opaque=0x555556372b48, abort_on_failure=1) at util/qemu-option.c:1059
#12 0x00005555557743dd in main (argc=25, argv=0x7fffffffe0c8, envp=0x7fffffffe198) at vl.c:4191

refresh_total_sectors() will fail if we return -ENOTCONN.

Thanks
Wen Congyang

> 
> Also, while you did write that one shouldn't be using the NBD client as the first quorum child, I think we should try to support that case anyway. For this patch, that means accepting that bdrv_getlength(s->bs[0]) may be off.
> 
> Max
> .
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 03/14] quorum: ignore 0-length child
  2015-03-18  5:29     ` Wen Congyang
@ 2015-03-18 12:57       ` Max Reitz
  0 siblings, 0 replies; 81+ messages in thread
From: Max Reitz @ 2015-03-18 12:57 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Kevin Wolf, Stefan Hajnoczi, Paolo Bonzini
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Gonglei, Yang Hongyang, zhanghailiang

On 2015-03-18 at 01:29, Wen Congyang wrote:
> On 02/24/2015 04:43 AM, Max Reitz wrote:
>> On 2015-02-11 at 22:07, Wen Congyang wrote:
>>> We connect to NBD server when starting block replication, so
>>> the length is 0 before starting block replication.
>>>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>>> ---
>>>    block/quorum.c | 5 +++++
>>>    1 file changed, 5 insertions(+)
>>>
>>> diff --git a/block/quorum.c b/block/quorum.c
>>> index 5ed1ff8..e6aff5f 100644
>>> --- a/block/quorum.c
>>> +++ b/block/quorum.c
>>> @@ -734,6 +734,11 @@ static int64_t quorum_getlength(BlockDriverState *bs)
>>>            if (value < 0) {
>>>                return value;
>>>            }
>>> +
>>> +        if (!value) {
>>> +            continue;
>>> +        }
>>> +
>>>            if (value != result) {
>>>                return -EIO;
>>>            }
>> Hm, what do you think about some specific error value returned by your delayed NBD implementation? Like -ENOTCONN or something like that? Then we'd be able to discern a real 0-length block device from a not-yet-connected NBD server.
> In my newest test, it cannot return -ENOTCONN, otherwise bdrv_open() will fail when we open the child.
> Here is the backtrace:
> (gdb) bt
> #0  bdrv_open_common (bs=0x5555563b2230, file=0x0, options=0x5555563b55a0, flags=57410, drv=0x555555e90c00, errp=0x7fffffffd460) at block.c:1070
> #1  0x000055555595ea72 in bdrv_open (pbs=0x7fffffffd5f8, filename=0x0, reference=0x0, options=0x5555563b55a0, flags=57410, drv=0x555555e90c00, errp=0x7fffffffd5e0) at block.c:1677
> #2  0x000055555595e3a9 in bdrv_open_image (pbs=0x7fffffffd5f8, filename=0x0, options=0x5555563a6730, bdref_key=0x555555a86b4c "file", flags=57410, allow_none=true, errp=0x7fffffffd5e0) at block.c:1481
> #3  0x000055555595e9ae in bdrv_open (pbs=0x555556388008, filename=0x0, reference=0x0, options=0x5555563a6730, flags=8258, drv=0x555555e8b800, errp=0x7fffffffd6b8) at block.c:1655
> #4  0x00005555559b0058 in quorum_open (bs=0x55555639bd90, options=0x55555639f100, flags=8258, errp=0x7fffffffd758) at block/quorum.c:1000
> #5  0x000055555595d0b8 in bdrv_open_common (bs=0x55555639bd90, file=0x0, options=0x55555639f100, flags=8258, drv=0x555555e8e5c0, errp=0x7fffffffd840) at block.c:1045
> #6  0x000055555595ea72 in bdrv_open (pbs=0x55555639bd50, filename=0x0, reference=0x0, options=0x55555639f100, flags=8258, drv=0x555555e8e5c0, errp=0x7fffffffdb70) at block.c:1677
> #7  0x00005555559b3bd3 in blk_new_open (name=0x55555639bc60 "virtio0", filename=0x0, reference=0x0, options=0x55555639a420, flags=66, errp=0x7fffffffdb70) at block/block-backend.c:129
> #8  0x0000555555754f78 in blockdev_init (file=0x0, bs_opts=0x55555639a420, errp=0x7fffffffdb70) at blockdev.c:536
> #9  0x0000555555755d90 in drive_new (all_opts=0x5555563777e0, block_default_type=IF_IDE) at blockdev.c:971
> #10 0x000055555576b1f0 in drive_init_func (opts=0x5555563777e0, opaque=0x555556372b48) at vl.c:1104
> #11 0x0000555555a1b019 in qemu_opts_foreach (list=0x555555e44060, func=0x55555576b1ba <drive_init_func>, opaque=0x555556372b48, abort_on_failure=1) at util/qemu-option.c:1059
> #12 0x00005555557743dd in main (argc=25, argv=0x7fffffffe0c8, envp=0x7fffffffe198) at vl.c:4191
>
> refresh_total_sectors() will fail if we return -ENOTCONN.

Okay, then 0 will be fine, too.

Max

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
  2015-03-16  6:19                             ` Wen Congyang
@ 2015-03-25 12:41                               ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2015-03-25 12:41 UTC (permalink / raw)
  To: Wen Congyang, Fam Zheng
  Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	jsnow, zhanghailiang



On 16/03/2015 07:19, Wen Congyang wrote:
> On 03/13/2015 05:05 PM, Fam Zheng wrote:
>> On Fri, 03/13 17:01, Wen Congyang wrote:
>>> On 03/11/2015 02:49 PM, Fam Zheng wrote:
>>>> On Wed, 03/11 14:44, Wen Congyang wrote:
>>>>> On 03/03/2015 03:59 PM, Fam Zheng wrote:
>>>>>> On Tue, 03/03 15:53, Wen Congyang wrote:
>>>>>>> I test qcow2_make_empty()'s performance. The result shows that it may
>>>>>>> take about 100ms(normal sata disk). It is not acceptable for COLO. So
>>>>>>> I think disk buff is necessary(just use it to replace qcow2).
>>>>>>
>>>>>> Why not tmpfs or ramdisk?
>>>>>
>>>>> Another problem:
>>>>> After failover, secondary write request will be written in (active disk)?
>>>>> It is better to write request to (nbd target). Is there any feature can
>>>>> be reused to implement it?
>>>>
>>>> You can use block commit or stream to move the data.
>>>
>>> Can the job stream move the data? I don't find the write ops in block/stream.c.
>>
>> It is bdrv_co_copy_on_readv that moves data.
> 
> Does the job stream move the data from base to top?

Yes.  block-commit goes in the other direction.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2015-03-25 12:42 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-12  3:07 [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Wen Congyang
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description Wen Congyang
2015-02-12  7:21   ` Fam Zheng
2015-02-12  7:40     ` Wen Congyang
2015-02-12  8:44       ` Fam Zheng
2015-02-12  9:33         ` Wen Congyang
2015-02-12  9:44           ` Fam Zheng
2015-02-12 10:11             ` Wen Congyang
2015-02-12 10:26               ` famz
2015-02-13  5:09                 ` Wen Congyang
2015-02-13  7:01                   ` Fam Zheng
2015-02-13 20:29                     ` John Snow
2015-03-03  7:53                 ` Wen Congyang
2015-03-03  7:59                   ` Fam Zheng
2015-03-03 12:12                     ` Wen Congyang
2015-03-11  6:44                     ` Wen Congyang
2015-03-11  6:49                       ` Fam Zheng
2015-03-11  7:01                         ` Wen Congyang
2015-03-11  7:04                           ` Fam Zheng
2015-03-11  7:12                             ` Wen Congyang
2015-03-13  9:01                         ` Wen Congyang
2015-03-13  9:05                           ` Fam Zheng
2015-03-16  6:19                             ` Wen Congyang
2015-03-25 12:41                               ` Paolo Bonzini
2015-02-12  9:36         ` Hongyang Yang
2015-02-12  9:46           ` Fam Zheng
2015-02-24  7:50         ` Wen Congyang
2015-02-25  2:46           ` Fam Zheng
2015-02-25  8:36             ` Wen Congyang
2015-02-25  8:58               ` Fam Zheng
2015-02-25  9:58                 ` Wen Congyang
2015-02-26  6:38             ` Wen Congyang
2015-02-26  8:44               ` Fam Zheng
2015-02-26  9:07                 ` Wen Congyang
2015-02-26 10:02                   ` Fam Zheng
2015-02-27  2:27                     ` Wen Congyang
2015-02-27  2:32                       ` Fam Zheng
2015-02-25  8:11         ` Wen Congyang
2015-02-25  8:18           ` Fam Zheng
2015-02-25  9:10         ` Wen Congyang
2015-02-25  9:45           ` Fam Zheng
2015-03-04 16:35   ` Dr. David Alan Gilbert
2015-03-05  1:03     ` Wen Congyang
2015-03-05 19:04       ` Dr. David Alan Gilbert
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 02/14] quorom: add a new read pattern Wen Congyang
2015-02-12  6:42   ` Gonglei
2015-02-23 20:36   ` Max Reitz
2015-02-23 21:56   ` Eric Blake
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 03/14] quorum: ignore 0-length child Wen Congyang
2015-02-23 20:43   ` Max Reitz
2015-02-24  2:33     ` Wen Congyang
2015-03-18  5:29     ` Wen Congyang
2015-03-18 12:57       ` Max Reitz
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 04/14] Add new block driver interfaces to control disk replication Wen Congyang
2015-02-23 20:57   ` Max Reitz
2015-02-23 21:58     ` Eric Blake
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 05/14] quorom: implement block driver interfaces for block replication Wen Congyang
2015-02-23 21:22   ` Max Reitz
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 06/14] NBD client: connect to nbd server later Wen Congyang
2015-02-23 21:31   ` Max Reitz
2015-02-25  2:23     ` Wen Congyang
2015-02-25 14:22       ` Max Reitz
2015-02-26 14:07         ` Paolo Bonzini
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 07/14] NBD client: implement block driver interfaces for block replication Wen Congyang
2015-02-23 21:41   ` Max Reitz
2015-02-26 14:08     ` Paolo Bonzini
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 08/14] block: add a new API to create a hidden BlockBackend Wen Congyang
2015-02-23 21:48   ` Max Reitz
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 09/14] block: give backing image its own BlockBackend Wen Congyang
2015-02-23 21:53   ` Max Reitz
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 10/14] allow the backing image access the origin BlockDriverState Wen Congyang
2015-02-23 22:01   ` Max Reitz
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 11/14] allow writing to the backing file Wen Congyang
2015-02-23 22:03   ` Max Reitz
2015-02-26 14:15     ` Paolo Bonzini
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 12/14] Add disk buffer for block replication Wen Congyang
2015-02-23 22:27   ` Max Reitz
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 13/14] COW: move cow interfaces to a seperate file Wen Congyang
2015-02-12  3:07 ` [Qemu-devel] [RFC PATCH 14/14] COLO: implement a new block driver Wen Congyang
2015-02-23 22:35   ` Max Reitz
2015-02-18 16:26 ` [Qemu-devel] [RFC PATCH 00/14] Block replication for continuous checkpoints Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.