[Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints
@ 2015-04-03 10:01 Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description Wen Congyang
                   ` (13 more replies)
  0 siblings, 14 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Yang Hongyang

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

Usage:
Please refer to docs/block-replication.txt

You can get the patch here:
https://github.com/wencongyang/qemu-colo/commits/block-replication-v3

You can get the patch with the other COLO patches here:
https://github.com/wencongyang/qemu-colo/tree/colo_huawei_v4.4

Changs Log:
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Wen Congyang (14):
  docs: block replication's description
  quorum: allow ignoring child errors
  NBD client: connect to nbd server later
  Add new block driver interfaces to control block replication
  quorum: implement block driver interfaces for block replication
  NBD client: implement block driver interfaces for block replication
  allow writing to the backing file
  Allow creating backup jobs when opening BDS
  block: Parse "backing_reference" option to reference existing BDS
  util/hbitmap: Add an API to reset all set bits in hbitmap
  Backup: clear all bitmap when doing block checkpoint
  qcow2: support colo
  skip nbd_target when starting block replication
  Don't allow a disk use backing reference target

 block.c                    | 246 +++++++++++++++++++++++-
 block/backup.c             |  13 ++
 block/nbd.c                | 171 +++++++++++++++--
 block/qcow2.c              | 454 ++++++++++++++++++++++++++++++++++++++++++++-
 block/qcow2.h              |   6 +
 block/quorum.c             | 141 +++++++++++++-
 blockjob.c                 |  10 +
 docs/block-replication.txt | 153 +++++++++++++++
 include/block/block.h      |   6 +
 include/block/block_int.h  |  12 ++
 include/block/blockjob.h   |  12 ++
 include/qemu/hbitmap.h     |   8 +
 qapi/block.json            |  16 ++
 stubs/Makefile.objs        |   1 +
 stubs/backup.c             |  11 ++
 tests/qemu-iotests/051     |  13 ++
 tests/qemu-iotests/051.out |  13 ++
 tests/test-hbitmap.c       |  39 ++++
 util/hbitmap.c             |  16 ++
 19 files changed, 1314 insertions(+), 27 deletions(-)
 create mode 100644 docs/block-replication.txt
 create mode 100644 stubs/backup.c

-- 
2.1.0

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-20 15:30   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  2015-04-22 10:39   ` [Qemu-devel] " Dr. David Alan Gilbert
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 02/14] quorum: allow ignoring child errors Wen Congyang
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 0000000..4426ffc
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,153 @@
+Block replication
+----------------------------------------
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content(it could be from either "Secondary Write Requests" or
+       previous COW of "Primary Write Requests") in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be buffered in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement COLO block replication from many basic
+blocks that are already in QEMU.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary      2 NBD  ------->  2 NBD
+     disk       client    ||     server                                         virtio-blk
+                          ||        ^                                                ^
+--------.                 ||        |                                                |
+Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
+--------'                 ||        |          backing        ^       backing
+                          ||        |                         |
+                          ||        |                         |
+                          ||        '-------------------------'
+                          ||           drive-backup sync=none
+
+1) The disk on the primary is represented by a block device with two
+children, providing replication between a primary disk and the host that
+runs the secondary VM. The read pattern for quorum can be extended to
+make the primary always read from the local disk instead of going through
+NBD.
+
+2) The secondary disk receives writes from the primary VM through QEMU's
+embedded NBD server (speculative write-through).
+
+3) The disk on the secondary is represented by a custom block device
+(called active-disk). It should be an empty disk, and the format should
+be qcow2.
+
+4) The hidden-disk is created automatically. It buffers the original content
+that is modified by the primary VM. It should also be an empty disk, and
+the driver supports bdrv_make_empty().
+
+== New block driver interface ==
+We add three block driver interfaces to control block replication:
+a. bdrv_start_replication()
+   Start block replication, called in migration/checkpoint thread.
+   We must call bdrv_start_replication() in secondary QEMU before
+   calling bdrv_start_replication() in primary QEMU.
+b. bdrv_do_checkpoint()
+   This interface is called after all VM state is transferred to
+   Secondary QEMU. The Disk buffer will be dropped in this interface.
+   The caller must hold the I/O mutex lock if it is in migration/checkpoint
+   thread.
+c. bdrv_stop_replication()
+   It is called on failover. We will flush the Disk buffer into
+   Secondary Disk and stop block replication. The vm should be stopped
+   before calling it. The caller must hold the I/O mutex lock if it is
+   in migration/checkpoint thread.
+
+== Usage ==
+Primary:
+  -drive if=xxx,driver=quorum,read-pattern=fifo,\
+         children.0.file.filename=1.raw,\
+         children.0.driver=raw,\
+         children.1.file.driver=nbd+colo,\
+         children.1.file.host=xxx,\
+         children.1.file.port=xxx,\
+         children.1.file.export=xxx,\
+         children.1.driver=raw,\
+         children.1.ignore-errors=on
+  Note:
+  1. NBD Client should not be the first child of quorum.
+  2. There should be only one NBD Client.
+  3. host is the secondary physical machine's hostname or IP
+  4. Each disk must have its own export name.
+  5. It is all a single argument to -drive, and you should ignore
+     the leading whitespace.
+
+Secondary:
+  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
+  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
+         backing_reference.drive_id=nbd_target1,\
+         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
+         backing_reference.hidden-disk.driver=qcow2,\
+         backing_reference.hidden-disk.allow-write-backing-file=on
+  Then run qmp command:
+    nbd_server_start host:port
+  Note:
+  1. The export name for the same disk must be the same in primary
+     and secondary QEMU command line
+  2. The qmp command nbd-server-start must be run before running the
+     qmp command migrate on primary QEMU
+  3. Don't use nbd-server-start's other options
+  4. Active disk, hidden disk and nbd target's length should be the
+     same.
+  5. It is better to put active disk and hidden disk in ramdisk.
+  6. It is all a single argument to -drive, and you should ignore
+     the leading whitespace.
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 02/14] quorum: allow ignoring child errors
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 03/14] NBD client: connect to nbd server later Wen Congyang
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

If the child is not ready, read/write/getlength/flush will
return -errno. It is not critical error, and can be ignored:
1. read/write:
   Just not report the error event.
2. getlength:
   just ignore it. If all children's getlength return -errno,
   and be ignored, return -EIO.
3. flush:
   Just ignore it. If all children's getlength return -errno,
   and be ignored, return 0.

Usage: children.x.ignore-errors=true

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/quorum.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 7 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 437b122..ca4f970 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -30,6 +30,7 @@
 #define QUORUM_OPT_BLKVERIFY      "blkverify"
 #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
 #define QUORUM_OPT_READ_PATTERN   "read-pattern"
+#define QUORUM_CHILDREN_OPT_IGNORE_ERRORS   "ignore-errors"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -65,6 +66,7 @@ typedef struct QuorumVotes {
 /* the following structure holds the state of one quorum instance */
 typedef struct BDRVQuorumState {
     BlockDriverState **bs; /* children BlockDriverStates */
+    bool *ignore_errors;   /* ignore children's error? */
     int num_children;      /* children count */
     int threshold;         /* if less than threshold children reads gave the
                             * same result a quorum error occurs.
@@ -97,6 +99,7 @@ typedef struct QuorumChildRequest {
     uint8_t *buf;
     int ret;
     QuorumAIOCB *parent;
+    int index;
 } QuorumChildRequest;
 
 /* Quorum will use the following structure to track progress of each read/write
@@ -209,6 +212,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
         acb->qcrs[i].buf = NULL;
         acb->qcrs[i].ret = 0;
         acb->qcrs[i].parent = acb;
+        acb->qcrs[i].index = i;
     }
 
     return acb;
@@ -305,7 +309,7 @@ static void quorum_aio_cb(void *opaque, int ret)
     acb->count++;
     if (ret == 0) {
         acb->success_count++;
-    } else {
+    } else if (!s->ignore_errors[sacb->index]) {
         quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret);
     }
     assert(acb->count <= s->num_children);
@@ -720,19 +724,31 @@ static BlockAIOCB *quorum_aio_writev(BlockDriverState *bs,
 static int64_t quorum_getlength(BlockDriverState *bs)
 {
     BDRVQuorumState *s = bs->opaque;
-    int64_t result;
+    int64_t result = -EIO;
     int i;
 
     /* check that all file have the same length */
-    result = bdrv_getlength(s->bs[0]);
-    if (result < 0) {
-        return result;
-    }
-    for (i = 1; i < s->num_children; i++) {
+    for (i = 0; i < s->num_children; i++) {
         int64_t value = bdrv_getlength(s->bs[i]);
+
         if (value < 0) {
             return value;
         }
+
+        if (value == 0 && s->ignore_errors[i]) {
+            /*
+             * If the child is not ready, it cannot return -errno,
+             * otherwise refresh_total_sectors() will fail when
+             * we open the child.
+             */
+            continue;
+        }
+
+        if (result == -EIO) {
+            result = value;
+            continue;
+        }
+
         if (value != result) {
             return -EIO;
         }
@@ -770,6 +786,9 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs)
 
     for (i = 0; i < s->num_children; i++) {
         result = bdrv_co_flush(s->bs[i]);
+        if (result < 0 && s->ignore_errors[i]) {
+            result = 0;
+        }
         result_value.l = result;
         quorum_count_vote(&error_votes, &result_value, i);
     }
@@ -844,6 +863,19 @@ static QemuOptsList quorum_runtime_opts = {
     },
 };
 
+static QemuOptsList quorum_children_common_opts = {
+    .name = "quorum children",
+    .head = QTAILQ_HEAD_INITIALIZER(quorum_children_common_opts.head),
+    .desc = {
+        {
+            .name = QUORUM_CHILDREN_OPT_IGNORE_ERRORS,
+            .type = QEMU_OPT_BOOL,
+            .help = "ignore child I/O error",
+        },
+        { /* end of list */ }
+    },
+};
+
 static int parse_read_pattern(const char *opt)
 {
     int i;
@@ -939,11 +971,14 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     /* allocate the children BlockDriverState array */
     s->bs = g_new0(BlockDriverState *, s->num_children);
     opened = g_new0(bool, s->num_children);
+    s->ignore_errors = g_new0(bool, s->num_children);
 
     for (i = 0, lentry = qlist_first(list); lentry;
          lentry = qlist_next(lentry), i++) {
         QDict *d;
         QString *string;
+        QemuOpts *children_opts = NULL;
+        bool value;
 
         switch (qobject_type(lentry->value))
         {
@@ -951,6 +986,20 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
             case QTYPE_QDICT:
                 d = qobject_to_qdict(lentry->value);
                 QINCREF(d);
+
+                children_opts = qemu_opts_create(&quorum_children_common_opts,
+                                                 NULL, 0, &error_abort);
+                qemu_opts_absorb_qdict(children_opts, d, &local_err);
+                if (local_err) {
+                    ret = -EINVAL;
+                    qemu_opts_del(children_opts);
+                    goto close_exit;
+                }
+                value = qemu_opt_get_bool(children_opts,
+                                          QUORUM_CHILDREN_OPT_IGNORE_ERRORS,
+                                          false);
+                s->ignore_errors[i]  = value;
+                qemu_opts_del(children_opts);
                 ret = bdrv_open(&s->bs[i], NULL, NULL, d, flags, NULL,
                                 &local_err);
                 break;
@@ -1008,6 +1057,7 @@ static void quorum_close(BlockDriverState *bs)
     }
 
     g_free(s->bs);
+    g_free(s->ignore_errors);
 }
 
 static void quorum_detach_aio_context(BlockDriverState *bs)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 03/14] NBD client: connect to nbd server later
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 02/14] quorum: allow ignoring child errors Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 04/14] Add new block driver interfaces to control block replication Wen Congyang
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

The secondary qemu starts later than the primary qemu, so we
cannot connect to nbd server in bdrv_open().

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/nbd.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 108 insertions(+), 14 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 2176186..3faf865 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -44,6 +44,8 @@
 typedef struct BDRVNBDState {
     NbdClientSession client;
     QemuOpts *socket_opts;
+    char *export;
+    bool connected;
 } BDRVNBDState;
 
 static int nbd_parse_uri(const char *filename, QDict *options)
@@ -254,50 +256,95 @@ static int nbd_establish_connection(BlockDriverState *bs, Error **errp)
     return sock;
 }
 
-static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
-                    Error **errp)
+static int nbd_connect_server(BlockDriverState *bs, Error **errp)
 {
     BDRVNBDState *s = bs->opaque;
-    char *export = NULL;
     int result, sock;
-    Error *local_err = NULL;
-
-    /* Pop the config into our state object. Exit if invalid. */
-    nbd_config(s, options, &export, &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-        return -EINVAL;
-    }
 
     /* establish TCP connection, return error if it fails
      * TODO: Configurable retry-until-timeout behaviour.
      */
     sock = nbd_establish_connection(bs, errp);
     if (sock < 0) {
-        g_free(export);
+        g_free(s->export);
         return sock;
     }
 
     /* NBD handshake */
-    result = nbd_client_init(bs, sock, export, errp);
-    g_free(export);
+    result = nbd_client_init(bs, sock, s->export, errp);
+    g_free(s->export);
+    s->export = NULL;
+    if (!result) {
+        s->connected = true;
+    }
+
     return result;
 }
 
+static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
+                    Error **errp)
+{
+    BDRVNBDState *s = bs->opaque;
+    Error *local_err = NULL;
+
+    /* Pop the config into our state object. Exit if invalid. */
+    nbd_config(s, options, &s->export, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return -EINVAL;
+    }
+
+    return nbd_connect_server(bs, errp);
+}
+
+static int nbd_open_colo(BlockDriverState *bs, QDict *options, int flags,
+                         Error **errp)
+{
+    BDRVNBDState *s = bs->opaque;
+    Error *local_err = NULL;
+
+    /* Pop the config into our state object. Exit if invalid. */
+    nbd_config(s, options, &s->export, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
 static int nbd_co_readv(BlockDriverState *bs, int64_t sector_num,
                         int nb_sectors, QEMUIOVector *qiov)
 {
+    BDRVNBDState *s = bs->opaque;
+
+    if (!s->connected) {
+        return -EIO;
+    }
+
     return nbd_client_co_readv(bs, sector_num, nb_sectors, qiov);
 }
 
 static int nbd_co_writev(BlockDriverState *bs, int64_t sector_num,
                          int nb_sectors, QEMUIOVector *qiov)
 {
+    BDRVNBDState *s = bs->opaque;
+
+    if (!s->connected) {
+        return -EIO;
+    }
+
     return nbd_client_co_writev(bs, sector_num, nb_sectors, qiov);
 }
 
 static int nbd_co_flush(BlockDriverState *bs)
 {
+    BDRVNBDState *s = bs->opaque;
+
+    if (!s->connected) {
+        return -EIO;
+    }
+
     return nbd_client_co_flush(bs);
 }
 
@@ -310,6 +357,12 @@ static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
 static int nbd_co_discard(BlockDriverState *bs, int64_t sector_num,
                           int nb_sectors)
 {
+    BDRVNBDState *s = bs->opaque;
+
+    if (!s->connected) {
+        return -EIO;
+    }
+
     return nbd_client_co_discard(bs, sector_num, nb_sectors);
 }
 
@@ -319,23 +372,44 @@ static void nbd_close(BlockDriverState *bs)
 
     qemu_opts_del(s->socket_opts);
     nbd_client_close(bs);
+    s->connected = false;
 }
 
 static int64_t nbd_getlength(BlockDriverState *bs)
 {
     BDRVNBDState *s = bs->opaque;
 
+    if (!s->connected) {
+        /*
+         * We cannot return -ENOTCONN, otherwise refresh_total_sectors()
+         * will fail, and we cannot open nbd client.
+         */
+        return 0;
+    }
+
     return s->client.size;
 }
 
 static void nbd_detach_aio_context(BlockDriverState *bs)
 {
+    BDRVNBDState *s = bs->opaque;
+
+    if (!s->connected) {
+        return;
+    }
+
     nbd_client_detach_aio_context(bs);
 }
 
 static void nbd_attach_aio_context(BlockDriverState *bs,
                                    AioContext *new_context)
 {
+    BDRVNBDState *s = bs->opaque;
+
+    if (!s->connected) {
+        return;
+    }
+
     nbd_client_attach_aio_context(bs, new_context);
 }
 
@@ -438,11 +512,31 @@ static BlockDriver bdrv_nbd_unix = {
     .bdrv_refresh_filename      = nbd_refresh_filename,
 };
 
+static BlockDriver bdrv_nbd_colo = {
+    .format_name                = "nbd+colo",
+    .protocol_name              = "nbd+colo",
+    .instance_size              = sizeof(BDRVNBDState),
+    .bdrv_parse_filename        = nbd_parse_filename,
+    .bdrv_file_open             = nbd_open_colo,
+    .bdrv_co_readv              = nbd_co_readv,
+    .bdrv_co_writev             = nbd_co_writev,
+    .bdrv_close                 = nbd_close,
+    .bdrv_co_flush_to_os        = nbd_co_flush,
+    .bdrv_co_discard            = nbd_co_discard,
+    .bdrv_getlength             = nbd_getlength,
+    .bdrv_detach_aio_context    = nbd_detach_aio_context,
+    .bdrv_attach_aio_context    = nbd_attach_aio_context,
+    .bdrv_refresh_filename      = nbd_refresh_filename,
+
+    .has_variable_length        = true,
+};
+
 static void bdrv_nbd_init(void)
 {
     bdrv_register(&bdrv_nbd);
     bdrv_register(&bdrv_nbd_tcp);
     bdrv_register(&bdrv_nbd_unix);
+    bdrv_register(&bdrv_nbd_colo);
 }
 
 block_init(bdrv_nbd_init);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 04/14] Add new block driver interfaces to control block replication
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (2 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 03/14] NBD client: connect to nbd server later Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-22 12:56   ` Eric Blake
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 05/14] quorum: implement block driver interfaces for " Wen Congyang
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Luiz Capitulino, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, Michael Roth, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c                   | 40 ++++++++++++++++++++++++++++++++++++++++
 include/block/block.h     |  5 +++++
 include/block/block_int.h | 11 +++++++++++
 qapi/block.json           | 16 ++++++++++++++++
 4 files changed, 72 insertions(+)

diff --git a/block.c b/block.c
index f2f8ae7..82e139b 100644
--- a/block.c
+++ b/block.c
@@ -6229,3 +6229,43 @@ BlockAcctStats *bdrv_get_stats(BlockDriverState *bs)
 {
     return &bs->stats;
 }
+
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                            Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_start_replication) {
+        drv->bdrv_start_replication(bs, mode, errp);
+    } else if (bs->file) {
+        bdrv_start_replication(bs->file, mode, errp);
+    } else {
+        error_setg(errp, "this feature or command is not currently supported");
+    }
+}
+
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_do_checkpoint) {
+        drv->bdrv_do_checkpoint(bs, errp);
+    } else if (bs->file) {
+        bdrv_do_checkpoint(bs->file, errp);
+    } else {
+        error_setg(errp, "this feature or command is not currently supported");
+    }
+}
+
+void bdrv_stop_replication(BlockDriverState *bs, Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_stop_replication) {
+        drv->bdrv_stop_replication(bs, errp);
+    } else if (bs->file) {
+        bdrv_stop_replication(bs->file, errp);
+    } else {
+        error_setg(errp, "this feature or command is not currently supported");
+    }
+}
diff --git a/include/block/block.h b/include/block/block.h
index 4c57d63..2f8f361 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -569,4 +569,9 @@ void bdrv_flush_io_queue(BlockDriverState *bs);
 
 BlockAcctStats *bdrv_get_stats(BlockDriverState *bs);
 
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                            Error **errp);
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp);
+void bdrv_stop_replication(BlockDriverState *bs, Error **errp);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index dccb092..b15d5b8 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -290,6 +290,17 @@ struct BlockDriver {
      */
     int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);
 
+
+    void (*bdrv_start_replication)(BlockDriverState *bs, ReplicationMode mode,
+                                   Error **errp);
+    /* Drop Disk buffer when doing checkpoint. */
+    void (*bdrv_do_checkpoint)(BlockDriverState *bs, Error **errp);
+    /*
+     * After failover, we should flush Disk buffer into secondary disk
+     * and stop block replication.
+     */
+    void (*bdrv_stop_replication)(BlockDriverState *bs, Error **errp);
+
     QLIST_ENTRY(BlockDriver) list;
 };
 
diff --git a/qapi/block.json b/qapi/block.json
index e313465..7884cc6 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -40,6 +40,22 @@
   'data': ['auto', 'none', 'lba', 'large', 'rechs']}
 
 ##
+# @COLOMode
+#
+# An enumeration of replication modes.
+#
+# @unprotected: COLO is not started or after failover.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.4
+##
+{ 'enum' : 'ReplicationMode',
+  'data' : ['unprotected', 'primary', 'secondary']}
+
+##
 # @BlockdevSnapshotInternal
 #
 # @device: the name of the device to generate the snapshot from
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 05/14] quorum: implement block driver interfaces for block replication
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (3 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 04/14] Add new block driver interfaces to control block replication Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 06/14] NBD client: " Wen Congyang
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/quorum.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index ca4f970..7f19ec6 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -84,6 +84,8 @@ typedef struct BDRVQuorumState {
                             */
 
     QuorumReadPattern read_pattern;
+
+    int colo_index; /* store which child supports block replication */
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -1024,6 +1026,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     g_free(opened);
+    s->colo_index = -1;
     goto exit;
 
 close_exit:
@@ -1114,6 +1117,76 @@ static void quorum_refresh_filename(BlockDriverState *bs)
     bs->full_open_options = opts;
 }
 
+static void quorum_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                                     Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int count = 0, i, index;
+    Error *local_err = NULL;
+
+    /*
+     * TODO: support REPLICATION_MODE_SECONDARY if we allow secondary
+     * QEMU becoming primary QEMU.
+     */
+    if (mode != REPLICATION_MODE_PRIMARY) {
+        error_setg(errp, "Invalid parameter '%s'", "mode");
+        return;
+    }
+
+    if (s->read_pattern != QUORUM_READ_PATTERN_FIFO) {
+        error_setg(errp, "Invalid parameter '%s'", "read pattern");
+        return;
+    }
+
+    for (i = 0; i < s->num_children; i++) {
+        bdrv_start_replication(s->bs[i], mode, &local_err);
+        if (local_err) {
+            error_free(local_err);
+            local_err = NULL;
+        } else {
+            count++;
+            index = i;
+        }
+    }
+
+    if (count == 0) {
+        /* No child supports block replication */
+        error_setg(errp, "this feature or command is not currently supported");
+    } else if (count > 1) {
+        for (i = 0; i < s->num_children; i++) {
+            bdrv_stop_replication(s->bs[i], NULL);
+        }
+        error_setg(errp, "too many children support block replication");
+    } else {
+        s->colo_index = index;
+    }
+}
+
+static void quorum_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+
+    if (s->colo_index < 0) {
+        error_setg(errp, "Block replication is not started");
+        return;
+    }
+
+    bdrv_do_checkpoint(s->bs[s->colo_index], errp);
+}
+
+static void quorum_stop_replication(BlockDriverState *bs, Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+
+    if (s->colo_index < 0) {
+        error_setg(errp, "Block replication is not started");
+        return;
+    }
+
+    bdrv_stop_replication(s->bs[s->colo_index], errp);
+    s->colo_index = -1;
+}
+
 static BlockDriver bdrv_quorum = {
     .format_name                        = "quorum",
     .protocol_name                      = "quorum",
@@ -1137,6 +1210,10 @@ static BlockDriver bdrv_quorum = {
 
     .is_filter                          = true,
     .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
+
+    .bdrv_start_replication             = quorum_start_replication,
+    .bdrv_do_checkpoint                 = quorum_do_checkpoint,
+    .bdrv_stop_replication              = quorum_stop_replication,
 };
 
 static void bdrv_quorum_init(void)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 06/14] NBD client: implement block driver interfaces for block replication
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (4 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 05/14] quorum: implement block driver interfaces for " Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 07/14] allow writing to the backing file Wen Congyang
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/nbd.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 3faf865..269ffe0 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -458,6 +458,52 @@ static void nbd_refresh_filename(BlockDriverState *bs)
     bs->full_open_options = opts;
 }
 
+static void nbd_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                                  Error **errp)
+{
+    BDRVNBDState *s = bs->opaque;
+
+    /*
+     * TODO: support REPLICATION_MODE_SECONDARY if we allow secondary
+     * QEMU becoming primary QEMU.
+     */
+    if (mode != REPLICATION_MODE_PRIMARY) {
+        error_setg(errp, "Invalid parameter '%s'", "mode");
+        return;
+    }
+
+    if (s->connected) {
+        error_setg(errp, "The connection is established");
+        return;
+    }
+
+    /* TODO: NBD client should be one child of quorum, how to verify it? */
+    nbd_connect_server(bs, errp);
+}
+
+static void nbd_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BDRVNBDState *s = bs->opaque;
+
+    if (!s->connected) {
+        error_setg(errp, "The connection is not established");
+        return;
+    }
+}
+
+static void nbd_stop_replication(BlockDriverState *bs, Error **errp)
+{
+    BDRVNBDState *s = bs->opaque;
+
+    if (!s->connected) {
+        error_setg(errp, "The connection is not established");
+        return;
+    }
+
+    nbd_client_close(bs);
+    s->connected = false;
+}
+
 static BlockDriver bdrv_nbd = {
     .format_name                = "nbd",
     .protocol_name              = "nbd",
@@ -527,6 +573,9 @@ static BlockDriver bdrv_nbd_colo = {
     .bdrv_detach_aio_context    = nbd_detach_aio_context,
     .bdrv_attach_aio_context    = nbd_attach_aio_context,
     .bdrv_refresh_filename      = nbd_refresh_filename,
+    .bdrv_start_replication     = nbd_start_replication,
+    .bdrv_do_checkpoint         = nbd_do_checkpoint,
+    .bdrv_stop_replication      = nbd_stop_replication,
 
     .has_variable_length        = true,
 };
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 07/14] allow writing to the backing file
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (5 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 06/14] NBD client: " Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 08/14] Allow creating backup jobs when opening BDS Wen Congyang
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 82e139b..23f0a81 100644
--- a/block.c
+++ b/block.c
@@ -1247,6 +1247,20 @@ out:
     bdrv_refresh_limits(bs, NULL);
 }
 
+#define ALLOW_WRITE_BACKING_FILE    "allow-write-backing-file"
+static QemuOptsList backing_file_opts = {
+    .name = "backing_file",
+    .head = QTAILQ_HEAD_INITIALIZER(backing_file_opts.head),
+    .desc = {
+        {
+            .name = ALLOW_WRITE_BACKING_FILE,
+            .type = QEMU_OPT_BOOL,
+            .help = "allow write to backing file",
+        },
+        { /* end of list */ }
+    },
+};
+
 /*
  * Opens the backing file for a BlockDriverState if not yet open
  *
@@ -1261,6 +1275,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
     int ret = 0;
     BlockDriverState *backing_hd;
     Error *local_err = NULL;
+    QemuOpts *opts = NULL;
+    int flags;
 
     if (bs->backing_hd != NULL) {
         QDECREF(options);
@@ -1273,6 +1289,19 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
     }
 
     bs->open_flags &= ~BDRV_O_NO_BACKING;
+    flags = bdrv_backing_flags(bs->open_flags);
+
+    opts = qemu_opts_create(&backing_file_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        ret = -EINVAL;
+        goto free_exit;
+    }
+    if (qemu_opt_get_bool(opts, ALLOW_WRITE_BACKING_FILE, false)) {
+        flags |= BDRV_O_RDWR;
+    }
+    qemu_opts_del(opts);
+
     if (qdict_haskey(options, "file.filename")) {
         backing_filename[0] = '\0';
     } else if (bs->backing_file[0] == '\0' && qdict_size(options) == 0) {
@@ -1305,7 +1334,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
     assert(bs->backing_hd == NULL);
     ret = bdrv_open(&backing_hd,
                     *backing_filename ? backing_filename : NULL, NULL, options,
-                    bdrv_backing_flags(bs->open_flags), NULL, &local_err);
+                    flags, NULL, &local_err);
     if (ret < 0) {
         bdrv_unref(backing_hd);
         backing_hd = NULL;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 08/14] Allow creating backup jobs when opening BDS
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (6 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 07/14] allow writing to the backing file Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 11:06   ` Paolo Bonzini
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 09/14] block: Parse "backing_reference" option to reference existing BDS Wen Congyang
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

When opening BDS, we need to create backup jobs for
image-fleecing. This feature is not used by qemu-img,
qemu-io or qemu-nbd. So just adding the referenced
functions to stubs.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 stubs/Makefile.objs |  1 +
 stubs/backup.c      | 11 +++++++++++
 2 files changed, 12 insertions(+)
 create mode 100644 stubs/backup.c

diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 8beff4c..5ae2214 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -39,3 +39,4 @@ stub-obj-$(CONFIG_WIN32) += fd-register.o
 stub-obj-y += cpus.o
 stub-obj-y += kvm.o
 stub-obj-y += qmp_pc_dimm_device_list.o
+stub-obj-y += backup.o
diff --git a/stubs/backup.c b/stubs/backup.c
new file mode 100644
index 0000000..3ebcc71
--- /dev/null
+++ b/stubs/backup.c
@@ -0,0 +1,11 @@
+#include "block/block_int.h"
+
+void backup_start(BlockDriverState *bs, BlockDriverState *target,
+                  int64_t speed, MirrorSyncMode sync_mode,
+                  BlockdevOnError on_source_error,
+                  BlockdevOnError on_target_error,
+                  BlockCompletionFunc *cb, void *opaque,
+                  Error **errp)
+{
+    error_setg(errp, "this feature or command is not currently supported");
+}
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 09/14] block: Parse "backing_reference" option to reference existing BDS
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (7 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 08/14] Allow creating backup jobs when opening BDS Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap Wen Congyang
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

Usage:
-drive file=xxx,id=Y, \
-drive file=xxxx,id=X,backing_reference.drive_id=Y,backing_reference.hidden-disk.*

It will create such backing chain:
               {virtio-blk dev 'Y'}                                      {virtio-blk dev 'X'}
                         |                                                          |
                         |                                                          |
                         v                                                          v

    [base] <- [mid] <- ( Y )  <----------------- (hidden target) <--------------- ( X )

                         v                              ^
                         v                              ^
                         v                              ^
                         v                              ^
                         >>>> drive-backup sync=none >>>>

X's backing file is hidden-disk, and hidden-disk's backing file is Y.
Disk Y may be opened or reopened in read-write mode, so A block backup
job is automatically created: source is Y and target is hidden disk.
Active disk X, hidden disk, and Y are all on the same AioContext.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c                    | 154 ++++++++++++++++++++++++++++++++++++++++++++-
 include/block/block.h      |   1 +
 include/block/block_int.h  |   1 +
 tests/qemu-iotests/051     |  13 ++++
 tests/qemu-iotests/051.out |  13 ++++
 5 files changed, 179 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 23f0a81..2df950b 100644
--- a/block.c
+++ b/block.c
@@ -1351,6 +1351,119 @@ free_exit:
     return ret;
 }
 
+static void backing_reference_completed(void *opaque, int ret)
+{
+    BlockDriverState *hidden_disk = opaque;
+
+    assert(!hidden_disk->backing_reference);
+}
+
+static int bdrv_open_backing_reference_file(BlockDriverState *bs,
+                                            QDict *options, Error **errp)
+{
+    const char *backing_name;
+    QDict *hidden_disk_options = NULL;
+    BlockDriverState *backing_hd, *hidden_disk;
+    BlockBackend *backing_blk;
+    AioContext *aio_context;
+    Error *local_err = NULL;
+    int ret = 0;
+
+    backing_name = qdict_get_try_str(options, "drive_id");
+    if (!backing_name) {
+        error_setg(errp, "Backing reference needs option drive_id");
+        ret = -EINVAL;
+        goto free_exit;
+    }
+    qdict_del(options, "drive_id");
+
+    qdict_extract_subqdict(options, &hidden_disk_options, "hidden-disk.");
+    if (!qdict_size(hidden_disk_options)) {
+        error_setg(errp, "Backing reference needs option hidden-disk.*");
+        ret = -EINVAL;
+        goto free_exit;
+    }
+
+    if (qdict_size(options)) {
+        const QDictEntry *entry = qdict_first(options);
+        error_setg(errp, "Backing reference used by '%s' doesn't support "
+                   "the option '%s'", bdrv_get_device_name(bs), entry->key);
+        ret = -EINVAL;
+        goto free_exit;
+    }
+
+    backing_blk = blk_by_name(backing_name);
+    if (!backing_blk) {
+        error_set(errp, QERR_DEVICE_NOT_FOUND, backing_name);
+        ret = -ENOENT;
+        goto free_exit;
+    }
+
+    backing_hd = blk_bs(backing_blk);
+    /* Backing reference itself? */
+    if (backing_hd == bs || bdrv_find_overlay(backing_hd, bs)) {
+        error_setg(errp, "Backing reference itself");
+        ret = -EINVAL;
+        goto free_exit;
+    }
+
+    if (bdrv_op_is_blocked(backing_hd, BLOCK_OP_TYPE_BACKING_REFERENCE,
+                           errp)) {
+        ret = -EBUSY;
+        goto free_exit;
+    }
+
+    /* hidden-disk is bs's backing file */
+    ret = bdrv_open_backing_file(bs, hidden_disk_options, errp);
+    hidden_disk_options = NULL;
+    if (ret < 0) {
+        goto free_exit;
+    }
+
+    hidden_disk = bs->backing_hd;
+    if (!hidden_disk->drv || !hidden_disk->drv->supports_backing) {
+        ret = -EINVAL;
+        error_setg(errp, "Hidden disk's driver doesn't support backing files");
+        goto free_exit;
+    }
+
+    bdrv_set_backing_hd(hidden_disk, backing_hd);
+    bdrv_ref(backing_hd);
+
+    /*
+     * backing hd may be opened or reopened in read-write mode, so we
+     * should backup backing hd to hidden disk
+     */
+    bdrv_op_unblock(hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
+                    bs->backing_blocker);
+    bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+                    hidden_disk->backing_blocker);
+
+    bdrv_ref(hidden_disk);
+
+    aio_context = bdrv_get_aio_context(bs);
+    aio_context_acquire(aio_context);
+    bdrv_set_aio_context(backing_hd, aio_context);
+    backup_start(backing_hd, hidden_disk, 0, MIRROR_SYNC_MODE_NONE,
+                 BLOCKDEV_ON_ERROR_REPORT, BLOCKDEV_ON_ERROR_REPORT,
+                 backing_reference_completed, hidden_disk, &local_err);
+    aio_context_release(aio_context);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        bdrv_unref(hidden_disk);
+        /* FIXME, use which errno? */
+        ret = -EIO;
+        goto free_exit;
+    }
+
+    bs->backing_reference = true;
+
+free_exit:
+    QDECREF(hidden_disk_options);
+    QDECREF(options);
+    return ret;
+}
+
 /*
  * Opens a disk image whose options are given as BlockdevRef in another block
  * device's options.
@@ -1604,13 +1717,37 @@ int bdrv_open(BlockDriverState **pbs, const char *filename,
 
     /* If there is a backing file, use it */
     if ((flags & BDRV_O_NO_BACKING) == 0) {
-        QDict *backing_options;
+        QDict *backing_options, *backing_reference_options;
 
+        qdict_extract_subqdict(options, &backing_reference_options,
+                               "backing_reference.");
         qdict_extract_subqdict(options, &backing_options, "backing.");
-        ret = bdrv_open_backing_file(bs, backing_options, &local_err);
-        if (ret < 0) {
+
+        if (qdict_size(backing_reference_options) &&
+            qdict_size(backing_options)) {
+            error_setg(&local_err,
+                       "Option \"backing_reference.*\" and \"backing.*\""
+                       " cannot be used together");
+            ret = -EINVAL;
+            QDECREF(backing_reference_options);
+            QDECREF(backing_options);
             goto close_and_fail;
         }
+        if (qdict_size(backing_reference_options)) {
+            QDECREF(backing_options);
+            ret = bdrv_open_backing_reference_file(bs,
+                                                   backing_reference_options,
+                                                   &local_err);
+            if (ret) {
+                goto close_and_fail;
+            }
+        } else {
+            QDECREF(backing_reference_options);
+            ret = bdrv_open_backing_file(bs, backing_options, &local_err);
+            if (ret < 0) {
+                goto close_and_fail;
+            }
+        }
     }
 
     bdrv_refresh_filename(bs);
@@ -1941,6 +2078,14 @@ void bdrv_close(BlockDriverState *bs)
     if (bs->drv) {
         if (bs->backing_hd) {
             BlockDriverState *backing_hd = bs->backing_hd;
+            if (bs->backing_reference) {
+                assert(backing_hd->backing_hd);
+                if (backing_hd->backing_hd->job) {
+                    block_job_cancel(backing_hd->backing_hd->job);
+                }
+                bdrv_set_backing_hd(backing_hd, NULL);
+                bdrv_unref(backing_hd->backing_hd);
+            }
             bdrv_set_backing_hd(bs, NULL);
             bdrv_unref(backing_hd);
         }
@@ -5913,6 +6058,9 @@ void bdrv_attach_aio_context(BlockDriverState *bs,
 
     bs->aio_context = new_context;
 
+    if (bs->backing_reference) {
+        bdrv_attach_aio_context(bs->backing_hd->backing_hd, new_context);
+    }
     if (bs->backing_hd) {
         bdrv_attach_aio_context(bs->backing_hd, new_context);
     }
diff --git a/include/block/block.h b/include/block/block.h
index 2f8f361..d2fc77e 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -159,6 +159,7 @@ typedef enum BlockOpType {
     BLOCK_OP_TYPE_RESIZE,
     BLOCK_OP_TYPE_STREAM,
     BLOCK_OP_TYPE_REPLACE,
+    BLOCK_OP_TYPE_BACKING_REFERENCE,
     BLOCK_OP_TYPE_MAX,
 } BlockOpType;
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index b15d5b8..ca2c2f4 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -375,6 +375,7 @@ struct BlockDriverState {
     QDict *full_open_options;
     char exact_filename[PATH_MAX];
 
+    bool backing_reference;
     BlockDriverState *backing_hd;
     BlockDriverState *file;
 
diff --git a/tests/qemu-iotests/051 b/tests/qemu-iotests/051
index 0360f37..fd67f40 100755
--- a/tests/qemu-iotests/051
+++ b/tests/qemu-iotests/051
@@ -116,6 +116,19 @@ run_qemu -drive file="$TEST_IMG",file.backing.driver=file,file.backing.filename=
 run_qemu -drive file="$TEST_IMG",file.backing.driver=qcow2,file.backing.file.filename="$TEST_IMG.orig"
 
 echo
+echo === Backing file reference ===
+echo
+
+run_qemu -drive file="$TEST_IMG",if=none,id=drive0 \
+    -drive file="$TEST_IMG",driver=qcow2,backing_reference.drive_id=drive0,backing_reference.hidden-disk.filename="$TEST_IMG.hidden"
+
+run_qemu -drive file="$TEST_IMG",if=none,id=drive0 \
+    -drive file="$TEST_IMG",driver=qcow2,backing_reference.drive_id=drive0,backing_reference.hidden-disk.filename="$TEST_IMG.hidden",backing.file.filename="$TEST_IMG.orig"
+
+run_qemu -drive file="$TEST_IMG",if=none,id=drive0 \
+    -drive file="$TEST_IMG",driver=qcow2,file.backing_reference.drive_id=drive0,file.backing_reference.hidden-disk.filename="$TEST_IMG.hidden",file.backing.file.filename="$TEST_IMG.orig"
+
+echo
 echo === Enable and disable lazy refcounting on the command line, plus some invalid values ===
 echo
 
diff --git a/tests/qemu-iotests/051.out b/tests/qemu-iotests/051.out
index 2890eac..cb8340b 100644
--- a/tests/qemu-iotests/051.out
+++ b/tests/qemu-iotests/051.out
@@ -75,6 +75,19 @@ Testing: -drive file=TEST_DIR/t.qcow2,file.backing.driver=qcow2,file.backing.fil
 QEMU_PROG: -drive file=TEST_DIR/t.qcow2,file.backing.driver=qcow2,file.backing.file.filename=TEST_DIR/t.qcow2.orig: Driver doesn't support backing files
 
 
+=== Backing file reference ===
+
+Testing: -drive file=TEST_DIR/t.qcow2,if=none,id=drive0 -drive file=TEST_DIR/t.qcow2,driver=qcow2,backing_reference.drive_id=drive0,backing_reference.hidden-disk.filename=TEST_DIR/t.qcow2.hidden
+QEMU X.Y.Z monitor - type 'help' for more information
+(qemu) q^[[K^[[Dqu^[[K^[[D^[[Dqui^[[K^[[D^[[D^[[Dquit^[[K
+
+Testing: -drive file=TEST_DIR/t.qcow2,if=none,id=drive0 -drive file=TEST_DIR/t.qcow2,driver=qcow2,backing_reference.drive_id=drive0,backing_reference.hidden-disk.filename=TEST_DIR/t.qcow2.hidden,backing.file.filename=TEST_DIR/t.qcow2.orig
+QEMU_PROG: -drive file=TEST_DIR/t.qcow2,driver=qcow2,backing=drive0,backing.file.filename=TEST_DIR/t.qcow2.orig: Option "backing_reference.*" and "backing.*" cannot be used together
+
+Testing: -drive file=TEST_DIR/t.qcow2,if=none,id=drive0 -drive file=TEST_DIR/t.qcow2,driver=qcow2,file.backing_reference.drive_id=drive0,file.backing_reference.hidden-disk.filename=TEST_DIR/t.qcow2.hidden,file.backing.file.filename=TEST_DIR/t.qcow2.orig
+QEMU_PROG: -drive file=TEST_DIR/t.qcow2,driver=qcow2,file.backing=drive0,file.backing.file.filename=TEST_DIR/t.qcow2.orig: Option "backing_reference.*" and "backing.*" cannot be used together
+
+
 === Enable and disable lazy refcounting on the command line, plus some invalid values ===
 
 Testing: -drive file=TEST_DIR/t.qcow2,format=qcow2,lazy-refcounts=on
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (8 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 09/14] block: Parse "backing_reference" option to reference existing BDS Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 11:05   ` Paolo Bonzini
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 11/14] Backup: clear all bitmap when doing block checkpoint Wen Congyang
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 include/qemu/hbitmap.h |  8 ++++++++
 tests/test-hbitmap.c   | 39 +++++++++++++++++++++++++++++++++++++++
 util/hbitmap.c         | 16 ++++++++++++++++
 3 files changed, 63 insertions(+)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index 550d7ce..95a55e4 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -109,6 +109,14 @@ void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count);
 void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count);
 
 /**
+ * hbitmap_reset_all:
+ * @hb: HBitmap to operate on.
+ *
+ * Reset all bits in an HBitmap.
+ */
+void hbitmap_reset_all(HBitmap *hb);
+
+/**
  * hbitmap_get:
  * @hb: HBitmap to operate on.
  * @item: Bit to query (0-based).
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index 8c902f2..1f0078a 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -11,6 +11,7 @@
 
 #include <glib.h>
 #include <stdarg.h>
+#include <string.h>
 #include "qemu/hbitmap.h"
 
 #define LOG_BITS_PER_LONG          (BITS_PER_LONG == 32 ? 5 : 6)
@@ -143,6 +144,23 @@ static void hbitmap_test_reset(TestHBitmapData *data,
     }
 }
 
+static void hbitmap_test_reset_all(TestHBitmapData *data)
+{
+    size_t n;
+
+    hbitmap_reset_all(data->hb);
+
+    n = (data->size + BITS_PER_LONG - 1) / BITS_PER_LONG;
+    if (n == 0) {
+        n = 1;
+    }
+    memset(data->bits, 0, n * sizeof(unsigned long));
+
+    if (data->granularity == 0) {
+        hbitmap_test_check(data, 0);
+    }
+}
+
 static void hbitmap_test_check_get(TestHBitmapData *data)
 {
     uint64_t count = 0;
@@ -323,6 +341,26 @@ static void test_hbitmap_reset(TestHBitmapData *data,
     hbitmap_test_set(data, L3 / 2, L3);
 }
 
+static void test_hbitmap_reset_all(TestHBitmapData *data,
+                                   const void *unused)
+{
+    hbitmap_test_init(data, L3 * 2, 0);
+    hbitmap_test_set(data, L1 - 1, L1 + 2);
+    hbitmap_test_reset_all(data);
+    hbitmap_test_set(data, 0, L1 * 3);
+    hbitmap_test_reset_all(data);
+    hbitmap_test_set(data, L2, L1);
+    hbitmap_test_reset_all(data);
+    hbitmap_test_set(data, L2, L3 - L2 + 1);
+    hbitmap_test_reset_all(data);
+    hbitmap_test_set(data, L3 - 1, 3);
+    hbitmap_test_reset_all(data);
+    hbitmap_test_set(data, 0, L3 * 2);
+    hbitmap_test_reset_all(data);
+    hbitmap_test_set(data, L3 / 2, L3);
+    hbitmap_test_reset_all(data);
+}
+
 static void test_hbitmap_granularity(TestHBitmapData *data,
                                      const void *unused)
 {
@@ -394,6 +432,7 @@ int main(int argc, char **argv)
     hbitmap_test_add("/hbitmap/set/overlap", test_hbitmap_set_overlap);
     hbitmap_test_add("/hbitmap/reset/empty", test_hbitmap_reset_empty);
     hbitmap_test_add("/hbitmap/reset/general", test_hbitmap_reset);
+    hbitmap_test_add("/hbitmap/reset/all", test_hbitmap_reset_all);
     hbitmap_test_add("/hbitmap/granularity", test_hbitmap_granularity);
     g_test_run();
 
diff --git a/util/hbitmap.c b/util/hbitmap.c
index ab13971..acce93c 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -353,6 +353,22 @@ void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count)
     hb_reset_between(hb, HBITMAP_LEVELS - 1, start, last);
 }
 
+void hbitmap_reset_all(HBitmap *hb)
+{
+    uint64_t size = hb->size;
+    unsigned int i;
+
+    /* Same as hbitmap_alloc() except memset() */
+    for (i = HBITMAP_LEVELS; --i >= 1; ) {
+        size = MAX((size + BITS_PER_LONG - 1) >> BITS_PER_LEVEL, 1);
+        memset(hb->levels[i], 0, size * sizeof(unsigned long));
+    }
+
+    assert(size == 1);
+    hb->levels[0][0] = 1UL << (BITS_PER_LONG - 1);
+    hb->count = 0;
+}
+
 bool hbitmap_get(const HBitmap *hb, uint64_t item)
 {
     /* Compute position and bit in the last layer.  */
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 11/14] Backup: clear all bitmap when doing block checkpoint
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (9 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 11:09   ` Paolo Bonzini
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 12/14] qcow2: support colo Wen Congyang
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jeff Cody, Jiang Yunhong,
	Dong Eddie, Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi,
	Yang Hongyang, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Cc: Jeff Cody <jcody@redhat.com>
---
 block/backup.c           | 13 +++++++++++++
 blockjob.c               | 10 ++++++++++
 include/block/blockjob.h | 12 ++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index 1c535b1..e8b8931 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -208,11 +208,24 @@ static void backup_iostatus_reset(BlockJob *job)
     bdrv_iostatus_reset(s->target);
 }
 
+static void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+    BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+    if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+        error_setg(errp, "this feature or command is not currently supported");
+        return;
+    }
+
+    hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
     .instance_size  = sizeof(BackupBlockJob),
     .job_type       = BLOCK_JOB_TYPE_BACKUP,
     .set_speed      = backup_set_speed,
     .iostatus_reset = backup_iostatus_reset,
+    .do_checkpoint  = backup_do_checkpoint,
 };
 
 static BlockErrorAction backup_error_action(BackupBlockJob *job,
diff --git a/blockjob.c b/blockjob.c
index ba2255d..dbac81e 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -388,3 +388,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
 
     qemu_bh_schedule(data->bh);
 }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+    if (!job->driver->do_checkpoint) {
+        error_setg(errp, "this feature or command is not currently supported");
+        return;
+    }
+
+    job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index b6d4ebb..c6f1cad 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -50,6 +50,9 @@ typedef struct BlockJobDriver {
      * manually.
      */
     void (*complete)(BlockJob *job, Error **errp);
+
+    /** Optional callback for job types that support checkpoint. */
+    void (*do_checkpoint)(BlockJob *job, Error **errp);
 } BlockJobDriver;
 
 /**
@@ -334,4 +337,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
                                   BlockJobDeferToMainLoopFn *fn,
                                   void *opaque);
 
+/**
+ * block_job_do_checkpoint:
+ * @job: The job.
+ * @errp: Error object.
+ *
+ * Do block checkpoint on the specified job.
+ */
+void block_job_do_checkpoint(BlockJob *job, Error **errp);
+
 #endif
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 12/14] qcow2: support colo
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (10 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 11/14] Backup: clear all bitmap when doing block checkpoint Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 13/14] skip nbd_target when starting block replication Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 14/14] Don't allow a disk use backing reference target Wen Congyang
  13 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/qcow2.c | 454 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/qcow2.h |   6 +
 2 files changed, 458 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 32bdf75..f767a12 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -35,6 +35,7 @@
 #include "qapi-event.h"
 #include "trace.h"
 #include "qemu/option_int.h"
+#include "block/blockjob.h"
 
 /*
   Differences with QCOW:
@@ -1496,7 +1497,7 @@ static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
     memset(s, 0, sizeof(BDRVQcowState));
     options = qdict_clone_shallow(bs->options);
 
-    ret = qcow2_open(bs, options, flags, &local_err);
+    ret = bs->drv->bdrv_open(bs, options, flags, &local_err);
     QDECREF(options);
     if (local_err) {
         error_setg(errp, "Could not reopen qcow2 layer: %s",
@@ -1682,7 +1683,7 @@ int qcow2_update_header(BlockDriverState *bs)
     buflen -= ret;
 
     /* Backing file name */
-    if (*bs->backing_file) {
+    if (*bs->backing_file && !bs->backing_reference) {
         size_t backing_file_len = strlen(bs->backing_file);
 
         if (buflen < backing_file_len) {
@@ -2947,9 +2948,458 @@ BlockDriver bdrv_qcow2 = {
     .bdrv_amend_options  = qcow2_amend_options,
 };
 
+/*********************************************************/
+/*
+ * qcow2 colo functions.
+ *
+ * Note:
+ * 1. The image format is qcow2, but it is only used for block replication.
+ * 2. The image is created by qcow2, not qcow+colo.
+ * 3. The image is an empty image.
+ * 4. The image doesn't contain any snapshot.
+ * 5. The image doesn't contain backing_file in image header.
+ * 6. Active disk and hidden disk use this driver.
+ * 7. The size of Active disk, hidden disk, nbd target should be the same.
+ */
+
+enum {
+    COLO_NONE,      /* block replication is not started */
+    COLO_RUNNING,   /* block replication is running */
+    COLO_DONE,      /* block replication is done(failover) */
+};
+
+static int qcow2_colo_probe(const uint8_t *buf, int buf_size,
+                            const char *filename)
+{
+    /* Use qcow2 as default */
+    return 0;
+}
+
+#define COLO_OPT_EXPORT         "export"
+static QemuOptsList qcow2_colo_runtime_opts = {
+    .name = "qcow2+colo",
+    .head = QTAILQ_HEAD_INITIALIZER(qcow2_colo_runtime_opts.head),
+    .desc = {
+        {
+            .name = COLO_OPT_EXPORT,
+            .type = QEMU_OPT_STRING,
+            .help = "The NBD server name",
+        },
+        { /* end of list */ }
+    },
+};
+
+/*
+ * usage: -drive if=xxx,driver=qcow2+colo,export=xxx,\
+ *        backing_reference.drive_id=xxxx,backing_reference.hidden-disk.*
+ */
+static int qcow2_colo_open(BlockDriverState *bs, QDict *options, int flags,
+                           Error **errp)
+{
+    int ret;
+    BDRVQcowState *s = bs->opaque;;
+    Error *local_err = NULL;
+    QemuOpts *opts = NULL;
+
+    ret = qcow2_open(bs, options, flags, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = -ENOTSUP;
+    if (s->nb_snapshots) {
+        error_setg(errp, "qcow2+colo doesn't support snapshot");
+        goto fail;
+    }
+
+    if (!bs->backing_hd && bs->backing_file[0] != '\0') {
+        error_setg(errp,
+                   "qcow2+colo doesn't support backing_file in image header");
+        goto fail;
+    }
+
+    opts = qemu_opts_create(&qcow2_colo_runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    s->export_name = g_strdup(qemu_opt_get(opts, COLO_OPT_EXPORT));
+    if (!s->export_name) {
+        error_setg(&local_err, "Missing the option export");
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    return 0;
+
+fail:
+    qcow2_close(bs);
+    qemu_opts_del(opts);
+    /* propagate error */
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
+    return ret;
+}
+
+static coroutine_fn int qcow2_colo_co_readv(BlockDriverState *bs,
+                                            int64_t sector_num,
+                                            int remaining_sectors,
+                                            QEMUIOVector *qiov)
+{
+    BDRVQcowState *s = bs->opaque;
+    BlockDriverState *nbd_target;
+
+    switch (s->colo_state) {
+    case COLO_NONE:
+        return -EIO;
+    case COLO_RUNNING:
+        return qcow2_co_readv(bs, sector_num, remaining_sectors, qiov);
+    case COLO_DONE:
+        nbd_target = bs->backing_hd->backing_hd;
+        return bdrv_co_readv(nbd_target, sector_num, remaining_sectors, qiov);
+    default:
+        abort();
+    }
+}
+
+static coroutine_fn int qcow2_colo_co_writev(BlockDriverState *bs,
+                                             int64_t sector_num,
+                                             int remaining_sectors,
+                                             QEMUIOVector *qiov)
+{
+    BDRVQcowState *s = bs->opaque;
+    BlockDriverState *nbd_target;
+
+    switch (s->colo_state) {
+    case COLO_NONE:
+        return -EIO;
+    case COLO_RUNNING:
+        return qcow2_co_writev(bs, sector_num, remaining_sectors, qiov);
+    case COLO_DONE:
+        nbd_target = bs->backing_hd->backing_hd;
+        return bdrv_co_writev(nbd_target, sector_num, remaining_sectors, qiov);
+    default:
+        abort();
+    }
+}
+
+static coroutine_fn int qcow2_colo_co_write_zeroes(BlockDriverState *bs,
+                                                   int64_t sector_num,
+                                                   int nb_sectors,
+                                                   BdrvRequestFlags flags)
+{
+    BDRVQcowState *s = bs->opaque;
+    BlockDriverState *nbd_target;
+
+    switch (s->colo_state) {
+    case COLO_NONE:
+        return -EIO;
+    case COLO_RUNNING:
+        return qcow2_co_write_zeroes(bs, sector_num, nb_sectors, flags);
+    case COLO_DONE:
+        nbd_target = bs->backing_hd->backing_hd;
+        return bdrv_co_write_zeroes(nbd_target, sector_num, nb_sectors, flags);
+    default:
+        abort();
+    }
+}
+
+static coroutine_fn int qcow2_colo_co_flush_to_os(BlockDriverState *bs)
+{
+    BDRVQcowState *s = bs->opaque;
+    BlockDriverState *nbd_target;
+
+    switch (s->colo_state) {
+    case COLO_NONE:
+        return -EIO;
+    case COLO_RUNNING:
+        return qcow2_co_flush_to_os(bs);
+    case COLO_DONE:
+        nbd_target = bs->backing_hd->backing_hd;
+        return bdrv_co_flush(nbd_target);
+    default:
+        abort();
+    }
+}
+
+static coroutine_fn int qcow2_colo_co_discard(BlockDriverState *bs,
+                                              int64_t sector_num,
+                                              int nb_sectors)
+{
+    BDRVQcowState *s = bs->opaque;
+    BlockDriverState *nbd_target;
+
+    switch (s->colo_state) {
+    case COLO_NONE:
+        return -EIO;
+    case COLO_RUNNING:
+        return qcow2_co_discard(bs, sector_num, nb_sectors);
+    case COLO_DONE:
+        nbd_target = bs->backing_hd->backing_hd;
+        return bdrv_co_discard(nbd_target, sector_num, nb_sectors);
+    default:
+        abort();
+    }
+}
+
+static int qcow2_colo_write_compressed(BlockDriverState *bs, int64_t sector_num,
+                                       const uint8_t *buf, int nb_sectors)
+{
+    BDRVQcowState *s = bs->opaque;
+    BlockDriverState *nbd_target;
+
+    switch (s->colo_state) {
+    case COLO_NONE:
+        return -EIO;
+    case COLO_RUNNING:
+        return qcow2_write_compressed(bs, sector_num, buf, nb_sectors);
+    case COLO_DONE:
+        nbd_target = bs->backing_hd->backing_hd;
+        return bdrv_write_compressed(nbd_target, sector_num, buf, nb_sectors);
+    default:
+        abort();
+    }
+}
+
+static void qcow2_colo_start_replication(BlockDriverState *bs,
+                                         ReplicationMode mode, Error **errp)
+{
+    BDRVQcowState *s = bs->opaque;
+    BlockDriverState *nbd_target, *hidden_disk;
+    Error *local_err = NULL;
+    int64_t active_length, hidden_length, nbd_length;
+
+    /*
+     * TODO: support REPLICATION_MODE_PRIMARY if we allow secondary
+     * QEMU becoming primary QEMU.
+     */
+    if (mode != REPLICATION_MODE_SECONDARY) {
+        error_setg(errp, "Invalid parameter '%s'", "mode");
+        return;
+    }
+
+    if (!bs->backing_reference) {
+        error_setg(errp, "this feature or command is not currently supported");
+        return;
+    }
+
+    if (s->colo_state == COLO_RUNNING) {
+        error_setg(errp, "Block replication is running");
+        return;
+    } else if (s->colo_state == COLO_DONE) {
+        error_setg(errp, "Cannot restart block replication");
+        return;
+    }
+
+    nbd_target = bs->backing_hd->backing_hd;
+    if (!nbd_target->job ||
+        nbd_target->job->driver->job_type != BLOCK_JOB_TYPE_BACKUP) {
+        error_setg(errp, "Backup job is cancelled unexpectedly");
+        return;
+    }
+
+    hidden_disk = bs->backing_hd;
+    nbd_target = hidden_disk->backing_hd;
+
+    /* verify the length */
+    active_length = bdrv_getlength(bs);
+    hidden_length = bdrv_getlength(hidden_disk);
+    nbd_length = bdrv_getlength(nbd_target);
+    if (active_length < 0 || hidden_length < 0 || nbd_length < 0 ||
+        active_length != hidden_length || hidden_length != nbd_length) {
+        error_setg(errp, "active disk, hidden disk, nbd target's length are "
+                   "not the same");
+        return;
+    }
+
+    if (!hidden_disk->drv->bdrv_make_empty) {
+        error_set(errp, QERR_UNSUPPORTED);
+        return;
+    }
+
+    /* start NBD server */
+    s->exp = nbd_export_new(nbd_target->blk, 0, -1, 0, NULL, &local_err);
+    if (!s->exp) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    nbd_export_set_name(s->exp, s->export_name);
+
+    s->colo_state = COLO_RUNNING;
+}
+
+static void qcow2_colo_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BDRVQcowState *s = bs->opaque;
+    BlockDriverState *hidden_disk, *nbd_target;
+    Error *local_err = NULL;
+    int ret;
+
+    if (s->colo_state != COLO_RUNNING) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    hidden_disk = bs->backing_hd;
+    nbd_target = hidden_disk->backing_hd;
+    if (!nbd_target->job) {
+        error_setg(errp, "Backup job is cancelled unexpectedly");
+        return;
+    }
+
+    block_job_do_checkpoint(nbd_target->job, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    ret = qcow2_make_empty(bs);
+    if (ret < 0) {
+        error_setg(errp, "Cannot make active disk empty");
+        return;
+    }
+
+    ret = hidden_disk->drv->bdrv_make_empty(hidden_disk);
+    if (ret < 0) {
+        error_setg(errp, "Cannot make hidden disk empty");
+        return;
+    }
+}
+
+/*
+ * TODO: Use blockjob?
+ */
+static void commit_data(BlockDriverState *from, BlockDriverState *to,
+                        int commit_buffer_sectors, Error **errp)
+{
+    int64_t len, target_length;
+    int64_t sector_num, end;
+    void *buf = NULL;
+    int n = 0, ret;
+
+    len = bdrv_getlength(from);
+    target_length = bdrv_getlength(to);
+    if (len < 0 || target_length < 0) {
+        /* should not happen */
+        error_setg(errp, "An undefined error has occurred");
+        return;
+    }
+
+    assert(len == target_length);
+    end = len >> BDRV_SECTOR_BITS;
+    buf = qemu_blockalign(from, commit_buffer_sectors << BDRV_SECTOR_BITS);
+
+    for (sector_num = 0; sector_num < end; sector_num += n) {
+        ret = bdrv_is_allocated(from, sector_num, commit_buffer_sectors, &n);
+        if (ret < 0) {
+            error_setg(errp, "An undefined error has occurred");
+            return;
+        }
+
+        if (ret == 0) {
+            continue;
+        }
+
+        ret = bdrv_read(from, sector_num, buf, n);
+        if (ret) {
+            error_setg(errp, "An IO error has occurred");
+            return;
+        }
+
+        ret = bdrv_write(to, sector_num, buf, n);
+        if (ret) {
+            error_setg(errp, "An IO error has occurred");
+            return;
+        }
+    }
+}
+
+static void qcow2_colo_stop_replication(BlockDriverState *bs, Error **errp)
+{
+    BDRVQcowState *s = bs->opaque;
+    BlockDriverState *nbd_target;
+    Error *local_err = NULL;
+
+    if (s->colo_state != COLO_RUNNING) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    /* stop NBD server */
+    nbd_export_close(s->exp);
+    nbd_export_put(s->exp);
+
+    nbd_target = bs->backing_hd->backing_hd;
+
+    if (!nbd_target->job ||
+        nbd_target->job->driver->job_type != BLOCK_JOB_TYPE_BACKUP) {
+        error_setg(errp, "Backup job is cancelled unexpectedly");
+        return;
+    }
+
+    block_job_cancel(nbd_target->job);
+
+    /* commit data from active disk to hidden disk*/
+    commit_data(bs, bs->backing_hd, s->cluster_sectors, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    /* commit data from hidden disk to nbd target */
+    commit_data(bs->backing_hd, nbd_target, s->cluster_sectors, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
+BlockDriver bdrv_qcow2_colo = {
+    .format_name        = "qcow2+colo",
+    .instance_size      = sizeof(BDRVQcowState),
+    .bdrv_probe         = qcow2_colo_probe,
+    .bdrv_open          = qcow2_colo_open,
+    .bdrv_close         = qcow2_close,
+    .bdrv_reopen_prepare  = qcow2_reopen_prepare,
+    .bdrv_has_zero_init = bdrv_has_zero_init_1,
+    .bdrv_co_get_block_status = qcow2_co_get_block_status,
+    .bdrv_set_key       = qcow2_set_key,
+
+    .bdrv_co_readv          = qcow2_colo_co_readv,
+    .bdrv_co_writev         = qcow2_colo_co_writev,
+    .bdrv_co_flush_to_os    = qcow2_colo_co_flush_to_os,
+
+    .bdrv_co_write_zeroes   = qcow2_colo_co_write_zeroes,
+    .bdrv_co_discard        = qcow2_colo_co_discard,
+    .bdrv_write_compressed  = qcow2_colo_write_compressed,
+    .bdrv_make_empty        = qcow2_make_empty,
+
+    .bdrv_get_info          = qcow2_get_info,
+    .bdrv_get_specific_info = qcow2_get_specific_info,
+
+    .bdrv_save_vmstate    = qcow2_save_vmstate,
+    .bdrv_load_vmstate    = qcow2_load_vmstate,
+
+    .supports_backing           = true,
+
+    .bdrv_refresh_limits        = qcow2_refresh_limits,
+    .bdrv_invalidate_cache      = qcow2_invalidate_cache,
+
+    .bdrv_check          = qcow2_check,
+    .bdrv_amend_options  = qcow2_amend_options,
+
+    .bdrv_start_replication     = qcow2_colo_start_replication,
+    .bdrv_do_checkpoint         = qcow2_colo_do_checkpoint,
+    .bdrv_stop_replication      = qcow2_colo_stop_replication,
+};
+
 static void bdrv_qcow2_init(void)
 {
     bdrv_register(&bdrv_qcow2);
+    bdrv_register(&bdrv_qcow2_colo);
 }
 
 block_init(bdrv_qcow2_init);
diff --git a/block/qcow2.h b/block/qcow2.h
index aa6d367..9d5e260 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -27,6 +27,7 @@
 
 #include "qemu/aes.h"
 #include "block/coroutine.h"
+#include "block/nbd.h"
 
 //#define DEBUG_ALLOC
 //#define DEBUG_ALLOC2
@@ -283,6 +284,11 @@ typedef struct BDRVQcowState {
     QLIST_HEAD(, Qcow2UnknownHeaderExtension) unknown_header_ext;
     QTAILQ_HEAD (, Qcow2DiscardRegion) discards;
     bool cache_discards;
+
+    /* Used for block replication */
+    int colo_state;
+    const char *export_name;
+    NBDExport *exp;
 } BDRVQcowState;
 
 struct QCowAIOCB;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 13/14] skip nbd_target when starting block replication
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (11 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 12/14] qcow2: support colo Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 14/14] Don't allow a disk use backing reference target Wen Congyang
  13 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/block.c b/block.c
index 2df950b..0bf097d 100644
--- a/block.c
+++ b/block.c
@@ -6412,6 +6412,10 @@ void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
 {
     BlockDriver *drv = bs->drv;
 
+    if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKING_REFERENCE, NULL)) {
+        return;
+    }
+
     if (drv && drv->bdrv_start_replication) {
         drv->bdrv_start_replication(bs, mode, errp);
     } else if (bs->file) {
@@ -6425,6 +6429,10 @@ void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp)
 {
     BlockDriver *drv = bs->drv;
 
+    if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKING_REFERENCE, NULL)) {
+        return;
+    }
+
     if (drv && drv->bdrv_do_checkpoint) {
         drv->bdrv_do_checkpoint(bs, errp);
     } else if (bs->file) {
@@ -6438,6 +6446,10 @@ void bdrv_stop_replication(BlockDriverState *bs, Error **errp)
 {
     BlockDriver *drv = bs->drv;
 
+    if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKING_REFERENCE, NULL)) {
+        return;
+    }
+
     if (drv && drv->bdrv_stop_replication) {
         drv->bdrv_stop_replication(bs, errp);
     } else if (bs->file) {
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PATCH COLO v3 14/14] Don't allow a disk use backing reference target
  2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
                   ` (12 preceding siblings ...)
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 13/14] skip nbd_target when starting block replication Wen Congyang
@ 2015-04-03 10:01 ` Wen Congyang
  13 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-03 10:01 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, qemu block, Lai Jiangshan, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block.c b/block.c
index 0bf097d..6b8bf63 100644
--- a/block.c
+++ b/block.c
@@ -1400,6 +1400,14 @@ static int bdrv_open_backing_reference_file(BlockDriverState *bs,
     }
 
     backing_hd = blk_bs(backing_blk);
+    /* Don't allow a disk use backing reference target */
+    ret = blk_attach_dev(backing_hd->blk, bs);
+    if (ret < 0) {
+        error_setg(errp, "backing_hd %s is used by the other device model",
+                   backing_name);
+        goto free_exit;
+    }
+
     /* Backing reference itself? */
     if (backing_hd == bs || bdrv_find_overlay(backing_hd, bs)) {
         error_setg(errp, "Backing reference itself");
@@ -2083,6 +2091,7 @@ void bdrv_close(BlockDriverState *bs)
                 if (backing_hd->backing_hd->job) {
                     block_job_cancel(backing_hd->backing_hd->job);
                 }
+                blk_detach_dev(backing_hd->backing_hd->blk, bs);
                 bdrv_set_backing_hd(backing_hd, NULL);
                 bdrv_unref(backing_hd->backing_hd);
             }
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap Wen Congyang
@ 2015-04-03 11:05   ` Paolo Bonzini
  2015-05-01 16:47     ` [Qemu-devel] [Qemu-block] " John Snow
  0 siblings, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-03 11:05 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz
  Cc: Kevin Wolf, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang



On 03/04/2015 12:01, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  include/qemu/hbitmap.h |  8 ++++++++
>  tests/test-hbitmap.c   | 39 +++++++++++++++++++++++++++++++++++++++
>  util/hbitmap.c         | 16 ++++++++++++++++
>  3 files changed, 63 insertions(+)
> 
> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
> index 550d7ce..95a55e4 100644
> --- a/include/qemu/hbitmap.h
> +++ b/include/qemu/hbitmap.h
> @@ -109,6 +109,14 @@ void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count);
>  void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count);
>  
>  /**
> + * hbitmap_reset_all:
> + * @hb: HBitmap to operate on.
> + *
> + * Reset all bits in an HBitmap.
> + */
> +void hbitmap_reset_all(HBitmap *hb);
> +
> +/**
>   * hbitmap_get:
>   * @hb: HBitmap to operate on.
>   * @item: Bit to query (0-based).
> diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
> index 8c902f2..1f0078a 100644
> --- a/tests/test-hbitmap.c
> +++ b/tests/test-hbitmap.c
> @@ -11,6 +11,7 @@
>  
>  #include <glib.h>
>  #include <stdarg.h>
> +#include <string.h>
>  #include "qemu/hbitmap.h"
>  
>  #define LOG_BITS_PER_LONG          (BITS_PER_LONG == 32 ? 5 : 6)
> @@ -143,6 +144,23 @@ static void hbitmap_test_reset(TestHBitmapData *data,
>      }
>  }
>  
> +static void hbitmap_test_reset_all(TestHBitmapData *data)
> +{
> +    size_t n;
> +
> +    hbitmap_reset_all(data->hb);
> +
> +    n = (data->size + BITS_PER_LONG - 1) / BITS_PER_LONG;
> +    if (n == 0) {
> +        n = 1;
> +    }
> +    memset(data->bits, 0, n * sizeof(unsigned long));
> +
> +    if (data->granularity == 0) {
> +        hbitmap_test_check(data, 0);
> +    }
> +}
> +
>  static void hbitmap_test_check_get(TestHBitmapData *data)
>  {
>      uint64_t count = 0;
> @@ -323,6 +341,26 @@ static void test_hbitmap_reset(TestHBitmapData *data,
>      hbitmap_test_set(data, L3 / 2, L3);
>  }
>  
> +static void test_hbitmap_reset_all(TestHBitmapData *data,
> +                                   const void *unused)
> +{
> +    hbitmap_test_init(data, L3 * 2, 0);
> +    hbitmap_test_set(data, L1 - 1, L1 + 2);
> +    hbitmap_test_reset_all(data);
> +    hbitmap_test_set(data, 0, L1 * 3);
> +    hbitmap_test_reset_all(data);
> +    hbitmap_test_set(data, L2, L1);
> +    hbitmap_test_reset_all(data);
> +    hbitmap_test_set(data, L2, L3 - L2 + 1);
> +    hbitmap_test_reset_all(data);
> +    hbitmap_test_set(data, L3 - 1, 3);
> +    hbitmap_test_reset_all(data);
> +    hbitmap_test_set(data, 0, L3 * 2);
> +    hbitmap_test_reset_all(data);
> +    hbitmap_test_set(data, L3 / 2, L3);
> +    hbitmap_test_reset_all(data);
> +}
> +
>  static void test_hbitmap_granularity(TestHBitmapData *data,
>                                       const void *unused)
>  {
> @@ -394,6 +432,7 @@ int main(int argc, char **argv)
>      hbitmap_test_add("/hbitmap/set/overlap", test_hbitmap_set_overlap);
>      hbitmap_test_add("/hbitmap/reset/empty", test_hbitmap_reset_empty);
>      hbitmap_test_add("/hbitmap/reset/general", test_hbitmap_reset);
> +    hbitmap_test_add("/hbitmap/reset/all", test_hbitmap_reset_all);
>      hbitmap_test_add("/hbitmap/granularity", test_hbitmap_granularity);
>      g_test_run();
>  
> diff --git a/util/hbitmap.c b/util/hbitmap.c
> index ab13971..acce93c 100644
> --- a/util/hbitmap.c
> +++ b/util/hbitmap.c
> @@ -353,6 +353,22 @@ void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count)
>      hb_reset_between(hb, HBITMAP_LEVELS - 1, start, last);
>  }
>  
> +void hbitmap_reset_all(HBitmap *hb)
> +{
> +    uint64_t size = hb->size;
> +    unsigned int i;
> +
> +    /* Same as hbitmap_alloc() except memset() */
> +    for (i = HBITMAP_LEVELS; --i >= 1; ) {
> +        size = MAX((size + BITS_PER_LONG - 1) >> BITS_PER_LEVEL, 1);
> +        memset(hb->levels[i], 0, size * sizeof(unsigned long));
> +    }
> +
> +    assert(size == 1);
> +    hb->levels[0][0] = 1UL << (BITS_PER_LONG - 1);
> +    hb->count = 0;
> +}
> +
>  bool hbitmap_get(const HBitmap *hb, uint64_t item)
>  {
>      /* Compute position and bit in the last layer.  */
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 08/14] Allow creating backup jobs when opening BDS
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 08/14] Allow creating backup jobs when opening BDS Wen Congyang
@ 2015-04-03 11:06   ` Paolo Bonzini
  0 siblings, 0 replies; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-03 11:06 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz
  Cc: Kevin Wolf, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang



On 03/04/2015 12:01, Wen Congyang wrote:
> When opening BDS, we need to create backup jobs for
> image-fleecing. This feature is not used by qemu-img,
> qemu-io or qemu-nbd. So just adding the referenced
> functions to stubs.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  stubs/Makefile.objs |  1 +
>  stubs/backup.c      | 11 +++++++++++
>  2 files changed, 12 insertions(+)
>  create mode 100644 stubs/backup.c
> 
> diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
> index 8beff4c..5ae2214 100644
> --- a/stubs/Makefile.objs
> +++ b/stubs/Makefile.objs
> @@ -39,3 +39,4 @@ stub-obj-$(CONFIG_WIN32) += fd-register.o
>  stub-obj-y += cpus.o
>  stub-obj-y += kvm.o
>  stub-obj-y += qmp_pc_dimm_device_list.o
> +stub-obj-y += backup.o
> diff --git a/stubs/backup.c b/stubs/backup.c
> new file mode 100644
> index 0000000..3ebcc71
> --- /dev/null
> +++ b/stubs/backup.c
> @@ -0,0 +1,11 @@
> +#include "block/block_int.h"
> +
> +void backup_start(BlockDriverState *bs, BlockDriverState *target,
> +                  int64_t speed, MirrorSyncMode sync_mode,
> +                  BlockdevOnError on_source_error,
> +                  BlockdevOnError on_target_error,
> +                  BlockCompletionFunc *cb, void *opaque,
> +                  Error **errp)
> +{
> +    error_setg(errp, "this feature or command is not currently supported");
> +}
> 

I wouldn't mind just moving block/backup.o from common-objs-y to
block-objs-y.

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 11/14] Backup: clear all bitmap when doing block checkpoint
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 11/14] Backup: clear all bitmap when doing block checkpoint Wen Congyang
@ 2015-04-03 11:09   ` Paolo Bonzini
  2015-04-07  1:45     ` Wen Congyang
  0 siblings, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-03 11:09 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz
  Cc: Kevin Wolf, Lai Jiangshan, qemu block, Jeff Cody, Jiang Yunhong,
	Dong Eddie, Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi,
	Yang Hongyang, zhanghailiang



On 03/04/2015 12:01, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Cc: Jeff Cody <jcody@redhat.com>
> ---
>  block/backup.c           | 13 +++++++++++++
>  blockjob.c               | 10 ++++++++++
>  include/block/blockjob.h | 12 ++++++++++++
>  3 files changed, 35 insertions(+)
> 
> diff --git a/block/backup.c b/block/backup.c
> index 1c535b1..e8b8931 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -208,11 +208,24 @@ static void backup_iostatus_reset(BlockJob *job)
>      bdrv_iostatus_reset(s->target);
>  }
>  
> +static void backup_do_checkpoint(BlockJob *job, Error **errp)
> +{
> +    BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
> +
> +    if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
> +        error_setg(errp, "this feature or command is not currently supported");
> +        return;
> +    }
> +
> +    hbitmap_reset_all(backup_job->bitmap);
> +}
> +
>  static const BlockJobDriver backup_job_driver = {
>      .instance_size  = sizeof(BackupBlockJob),
>      .job_type       = BLOCK_JOB_TYPE_BACKUP,
>      .set_speed      = backup_set_speed,
>      .iostatus_reset = backup_iostatus_reset,
> +    .do_checkpoint  = backup_do_checkpoint,
>  };
>  
>  static BlockErrorAction backup_error_action(BackupBlockJob *job,
> diff --git a/blockjob.c b/blockjob.c
> index ba2255d..dbac81e 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -388,3 +388,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
>  
>      qemu_bh_schedule(data->bh);
>  }
> +
> +void block_job_do_checkpoint(BlockJob *job, Error **errp)
> +{
> +    if (!job->driver->do_checkpoint) {
> +        error_setg(errp, "this feature or command is not currently supported");
> +        return;
> +    }
> +
> +    job->driver->do_checkpoint(job, errp);
> +}
> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
> index b6d4ebb..c6f1cad 100644
> --- a/include/block/blockjob.h
> +++ b/include/block/blockjob.h
> @@ -50,6 +50,9 @@ typedef struct BlockJobDriver {
>       * manually.
>       */
>      void (*complete)(BlockJob *job, Error **errp);
> +
> +    /** Optional callback for job types that support checkpoint. */
> +    void (*do_checkpoint)(BlockJob *job, Error **errp);
>  } BlockJobDriver;
>  
>  /**
> @@ -334,4 +337,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
>                                    BlockJobDeferToMainLoopFn *fn,
>                                    void *opaque);
>  
> +/**
> + * block_job_do_checkpoint:
> + * @job: The job.
> + * @errp: Error object.
> + *
> + * Do block checkpoint on the specified job.
> + */
> +void block_job_do_checkpoint(BlockJob *job, Error **errp);
> +
>  #endif
> 

Does this only run on the secondary, or also on the primary?  What
happens if you use a block job on the primary?

Perhaps a variant of backup_job_driver is needed for COLO, and the
default behavior of block_job_do_checkpoint should be to do nothing?

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 11/14] Backup: clear all bitmap when doing block checkpoint
  2015-04-03 11:09   ` Paolo Bonzini
@ 2015-04-07  1:45     ` Wen Congyang
  0 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-07  1:45 UTC (permalink / raw)
  To: Paolo Bonzini, qemu devel, Fam Zheng, Max Reitz
  Cc: Kevin Wolf, Lai Jiangshan, qemu block, Jeff Cody, Jiang Yunhong,
	Dong Eddie, Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi,
	Yang Hongyang, zhanghailiang

On 04/03/2015 07:09 PM, Paolo Bonzini wrote:
> 
> 
> On 03/04/2015 12:01, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Cc: Jeff Cody <jcody@redhat.com>
>> ---
>>  block/backup.c           | 13 +++++++++++++
>>  blockjob.c               | 10 ++++++++++
>>  include/block/blockjob.h | 12 ++++++++++++
>>  3 files changed, 35 insertions(+)
>>
>> diff --git a/block/backup.c b/block/backup.c
>> index 1c535b1..e8b8931 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
>> @@ -208,11 +208,24 @@ static void backup_iostatus_reset(BlockJob *job)
>>      bdrv_iostatus_reset(s->target);
>>  }
>>  
>> +static void backup_do_checkpoint(BlockJob *job, Error **errp)
>> +{
>> +    BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
>> +
>> +    if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
>> +        error_setg(errp, "this feature or command is not currently supported");
>> +        return;
>> +    }
>> +
>> +    hbitmap_reset_all(backup_job->bitmap);
>> +}
>> +
>>  static const BlockJobDriver backup_job_driver = {
>>      .instance_size  = sizeof(BackupBlockJob),
>>      .job_type       = BLOCK_JOB_TYPE_BACKUP,
>>      .set_speed      = backup_set_speed,
>>      .iostatus_reset = backup_iostatus_reset,
>> +    .do_checkpoint  = backup_do_checkpoint,
>>  };
>>  
>>  static BlockErrorAction backup_error_action(BackupBlockJob *job,
>> diff --git a/blockjob.c b/blockjob.c
>> index ba2255d..dbac81e 100644
>> --- a/blockjob.c
>> +++ b/blockjob.c
>> @@ -388,3 +388,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
>>  
>>      qemu_bh_schedule(data->bh);
>>  }
>> +
>> +void block_job_do_checkpoint(BlockJob *job, Error **errp)
>> +{
>> +    if (!job->driver->do_checkpoint) {
>> +        error_setg(errp, "this feature or command is not currently supported");
>> +        return;
>> +    }
>> +
>> +    job->driver->do_checkpoint(job, errp);
>> +}
>> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
>> index b6d4ebb..c6f1cad 100644
>> --- a/include/block/blockjob.h
>> +++ b/include/block/blockjob.h
>> @@ -50,6 +50,9 @@ typedef struct BlockJobDriver {
>>       * manually.
>>       */
>>      void (*complete)(BlockJob *job, Error **errp);
>> +
>> +    /** Optional callback for job types that support checkpoint. */
>> +    void (*do_checkpoint)(BlockJob *job, Error **errp);
>>  } BlockJobDriver;
>>  
>>  /**
>> @@ -334,4 +337,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
>>                                    BlockJobDeferToMainLoopFn *fn,
>>                                    void *opaque);
>>  
>> +/**
>> + * block_job_do_checkpoint:
>> + * @job: The job.
>> + * @errp: Error object.
>> + *
>> + * Do block checkpoint on the specified job.
>> + */
>> +void block_job_do_checkpoint(BlockJob *job, Error **errp);
>> +
>>  #endif
>>
> 
> Does this only run on the secondary, or also on the primary?  What
> happens if you use a block job on the primary?

It is only for secondary. This new blockjob API is only called in qcow2+colo,
which is used for secondary qemu. So primary qemu will not come here.
If it happens, it is a bug. Should we check it here?

Thanks
Wen Congyang

> 
> Perhaps a variant of backup_job_driver is needed for COLO, and the
> default behavior of block_job_do_checkpoint should be to do nothing?
> 
> Paolo
> .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description Wen Congyang
@ 2015-04-20 15:30   ` Stefan Hajnoczi
  2015-04-21  1:25     ` Wen Congyang
  2015-04-22 10:39   ` [Qemu-devel] " Dr. David Alan Gilbert
  1 sibling, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2015-04-20 15:30 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, Dr. David Alan Gilbert, zhanghailiang

[-- Attachment #1: Type: text/plain, Size: 10120 bytes --]

On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 153 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 0000000..4426ffc
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,153 @@
> +Block replication
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +Block replication is used for continuous checkpoints. It is designed
> +for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
> +where the Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content(it could be from either "Secondary Write Requests" or
> +       previous COW of "Primary Write Requests") in the Disk buffer.
> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from many basic
> +blocks that are already in QEMU.
> +
> +         virtio-blk       ||
> +             ^            ||                            .----------
> +             |            ||                            | Secondary
> +        1 Quorum          ||                            '----------
> +         /      \         ||
> +        /        \        ||
> +   Primary      2 NBD  ------->  2 NBD
> +     disk       client    ||     server                                         virtio-blk
> +                          ||        ^                                                ^
> +--------.                 ||        |                                                |
> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
> +--------'                 ||        |          backing        ^       backing
> +                          ||        |                         |
> +                          ||        |                         |
> +                          ||        '-------------------------'
> +                          ||           drive-backup sync=none

Nice to see that you've been able to construct the replication flow from
existing block layer features!

> +1) The disk on the primary is represented by a block device with two
> +children, providing replication between a primary disk and the host that
> +runs the secondary VM. The read pattern for quorum can be extended to
> +make the primary always read from the local disk instead of going through
> +NBD.
> +
> +2) The secondary disk receives writes from the primary VM through QEMU's
> +embedded NBD server (speculative write-through).
> +
> +3) The disk on the secondary is represented by a custom block device
> +(called active-disk). It should be an empty disk, and the format should
> +be qcow2.
> +
> +4) The hidden-disk is created automatically. It buffers the original content
> +that is modified by the primary VM. It should also be an empty disk, and
> +the driver supports bdrv_make_empty().
> +
> +== New block driver interface ==
> +We add three block driver interfaces to control block replication:
> +a. bdrv_start_replication()
> +   Start block replication, called in migration/checkpoint thread.
> +   We must call bdrv_start_replication() in secondary QEMU before
> +   calling bdrv_start_replication() in primary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transferred to
> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
> +   thread.
> +c. bdrv_stop_replication()
> +   It is called on failover. We will flush the Disk buffer into
> +   Secondary Disk and stop block replication. The vm should be stopped
> +   before calling it. The caller must hold the I/O mutex lock if it is
> +   in migration/checkpoint thread.

I understand the general flow but this description does not demonstrate
that failover works or what happens when internal operations fail (e.g.
during checkpoint commit or during failover).  Since fault tolerance is
the goal, it is necessary to list the failure scenarios explicitly and
show that the design handles them.  With that level of planning, some
cases will probably be missed in the code and the system won't actually
be fault tolerant.

One general question about the design: the Secondary host needs 3x
storage space since it has the Secondary Disk, hidden-disk, and
active-disk.  Each image requires a certain amount of space depending on
writes or COW operations.  Is 3x the upper bound or is there a way to
reduce the bound?

The bound is important since large amounts of data become a bottleneck
for writeout/commit operations.  They could cause downtime if the guest
is blocked until the entire Disk Buffer has been written to the
Secondary Disk during failover, for example.

> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw,\
> +         children.1.file.driver=nbd+colo,\
> +         children.1.file.host=xxx,\
> +         children.1.file.port=xxx,\
> +         children.1.file.export=xxx,\
> +         children.1.driver=raw,\
> +         children.1.ignore-errors=on
> +  Note:
> +  1. NBD Client should not be the first child of quorum.
> +  2. There should be only one NBD Client.
> +  3. host is the secondary physical machine's hostname or IP
> +  4. Each disk must have its own export name.
> +  5. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.
> +
> +Secondary:
> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
> +         backing_reference.drive_id=nbd_target1,\
> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
> +         backing_reference.hidden-disk.driver=qcow2,\
> +         backing_reference.hidden-disk.allow-write-backing-file=on
> +  Then run qmp command:
> +    nbd_server_start host:port
> +  Note:
> +  1. The export name for the same disk must be the same in primary
> +     and secondary QEMU command line
> +  2. The qmp command nbd-server-start must be run before running the
> +     qmp command migrate on primary QEMU
> +  3. Don't use nbd-server-start's other options
> +  4. Active disk, hidden disk and nbd target's length should be the
> +     same.
> +  5. It is better to put active disk and hidden disk in ramdisk.
> +  6. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.

Please do not introduce "<name>+colo" block drivers.  This approach is
invasive and makes block replication specific to only a few block
drivers, e.g. NBD or qcow2.

A cleaner approach is a QMP command or -drive options that work for any
BlockDriverState.

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-20 15:30   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2015-04-21  1:25     ` Wen Congyang
  2015-04-21 15:28       ` Paolo Bonzini
  2015-04-22  9:29       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  0 siblings, 2 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-21  1:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, Dr. David Alan Gilbert, zhanghailiang

On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
> On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 153 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 0000000..4426ffc
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,153 @@
>> +Block replication
>> +----------------------------------------
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or later.
>> +See the COPYING file in the top-level directory.
>> +
>> +Block replication is used for continuous checkpoints. It is designed
>> +for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
>> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
>> +where the Secondary VM is not running.
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
>> +identical right after a VM checkpoint, but becomes different as the VM
>> +executes till the next checkpoint. To support disk contents checkpoint,
>> +the modified disk contents in the Secondary VM must be buffered, and are
>> +only dropped at next checkpoint time. To reduce the network transportation
>> +effort at the time of checkpoint, the disk modification operations of
>> +Primary disk are asynchronously forwarded to the Secondary node.
>> +
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> +        +----------------------+            +------------------------+
>> +        |Primary Write Requests|            |Secondary Write Requests|
>> +        +----------------------+            +------------------------+
>> +                  |                                       |
>> +                  |                                      (4)
>> +                  |                                       V
>> +                  |                              /-------------\
>> +                  |      Copy and Forward        |             |
>> +                  |---------(1)----------+       | Disk Buffer |
>> +                  |                      |       |             |
>> +                  |                     (3)      \-------------/
>> +                  |                 speculative      ^
>> +                  |                write through    (2)
>> +                  |                      |           |
>> +                  V                      V           |
>> +           +--------------+           +----------------+
>> +           | Primary Disk |           | Secondary Disk |
>> +           +--------------+           +----------------+
>> +
>> +    1) Primary write requests will be copied and forwarded to Secondary
>> +       QEMU.
>> +    2) Before Primary write requests are written to Secondary disk, the
>> +       original sector content will be read from Secondary disk and
>> +       buffered in the Disk buffer, but it will not overwrite the existing
>> +       sector content(it could be from either "Secondary Write Requests" or
>> +       previous COW of "Primary Write Requests") in the Disk buffer.
>> +    3) Primary write requests will be written to Secondary disk.
>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>> +       will overwrite the existing sector content in the buffer.
>> +
>> +== Architecture ==
>> +We are going to implement COLO block replication from many basic
>> +blocks that are already in QEMU.
>> +
>> +         virtio-blk       ||
>> +             ^            ||                            .----------
>> +             |            ||                            | Secondary
>> +        1 Quorum          ||                            '----------
>> +         /      \         ||
>> +        /        \        ||
>> +   Primary      2 NBD  ------->  2 NBD
>> +     disk       client    ||     server                                         virtio-blk
>> +                          ||        ^                                                ^
>> +--------.                 ||        |                                                |
>> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
>> +--------'                 ||        |          backing        ^       backing
>> +                          ||        |                         |
>> +                          ||        |                         |
>> +                          ||        '-------------------------'
>> +                          ||           drive-backup sync=none
> 
> Nice to see that you've been able to construct the replication flow from
> existing block layer features!
> 
>> +1) The disk on the primary is represented by a block device with two
>> +children, providing replication between a primary disk and the host that
>> +runs the secondary VM. The read pattern for quorum can be extended to
>> +make the primary always read from the local disk instead of going through
>> +NBD.
>> +
>> +2) The secondary disk receives writes from the primary VM through QEMU's
>> +embedded NBD server (speculative write-through).
>> +
>> +3) The disk on the secondary is represented by a custom block device
>> +(called active-disk). It should be an empty disk, and the format should
>> +be qcow2.
>> +
>> +4) The hidden-disk is created automatically. It buffers the original content
>> +that is modified by the primary VM. It should also be an empty disk, and
>> +the driver supports bdrv_make_empty().
>> +
>> +== New block driver interface ==
>> +We add three block driver interfaces to control block replication:
>> +a. bdrv_start_replication()
>> +   Start block replication, called in migration/checkpoint thread.
>> +   We must call bdrv_start_replication() in secondary QEMU before
>> +   calling bdrv_start_replication() in primary QEMU.
>> +b. bdrv_do_checkpoint()
>> +   This interface is called after all VM state is transferred to
>> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
>> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
>> +   thread.
>> +c. bdrv_stop_replication()
>> +   It is called on failover. We will flush the Disk buffer into
>> +   Secondary Disk and stop block replication. The vm should be stopped
>> +   before calling it. The caller must hold the I/O mutex lock if it is
>> +   in migration/checkpoint thread.
> 
> I understand the general flow but this description does not demonstrate
> that failover works or what happens when internal operations fail (e.g.
> during checkpoint commit or during failover).  Since fault tolerance is
> the goal, it is necessary to list the failure scenarios explicitly and
> show that the design handles them.  With that level of planning, some
> cases will probably be missed in the code and the system won't actually
> be fault tolerant.

OK, I will add the description about failover.

> 
> One general question about the design: the Secondary host needs 3x
> storage space since it has the Secondary Disk, hidden-disk, and
> active-disk.  Each image requires a certain amount of space depending on
> writes or COW operations.  Is 3x the upper bound or is there a way to
> reduce the bound?

active disk and hidden disk are temp file. It will be maked empty in
bdrv_do_checkpoint(). Their format is qcow2 now, so it doesn't need too
many spaces if we do checkpoint periodically.

> 
> The bound is important since large amounts of data become a bottleneck
> for writeout/commit operations.  They could cause downtime if the guest
> is blocked until the entire Disk Buffer has been written to the
> Secondary Disk during failover, for example.

OK, I will test it. In my test, vm_stop() will take about 2-3 seconds if
I run filebench in the guest. Is there anyway to speed it up?

> 
>> +== Usage ==
>> +Primary:
>> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
>> +         children.0.file.filename=1.raw,\
>> +         children.0.driver=raw,\
>> +         children.1.file.driver=nbd+colo,\
>> +         children.1.file.host=xxx,\
>> +         children.1.file.port=xxx,\
>> +         children.1.file.export=xxx,\
>> +         children.1.driver=raw,\
>> +         children.1.ignore-errors=on
>> +  Note:
>> +  1. NBD Client should not be the first child of quorum.
>> +  2. There should be only one NBD Client.
>> +  3. host is the secondary physical machine's hostname or IP
>> +  4. Each disk must have its own export name.
>> +  5. It is all a single argument to -drive, and you should ignore
>> +     the leading whitespace.
>> +
>> +Secondary:
>> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
>> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
>> +         backing_reference.drive_id=nbd_target1,\
>> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
>> +         backing_reference.hidden-disk.driver=qcow2,\
>> +         backing_reference.hidden-disk.allow-write-backing-file=on
>> +  Then run qmp command:
>> +    nbd_server_start host:port
>> +  Note:
>> +  1. The export name for the same disk must be the same in primary
>> +     and secondary QEMU command line
>> +  2. The qmp command nbd-server-start must be run before running the
>> +     qmp command migrate on primary QEMU
>> +  3. Don't use nbd-server-start's other options
>> +  4. Active disk, hidden disk and nbd target's length should be the
>> +     same.
>> +  5. It is better to put active disk and hidden disk in ramdisk.
>> +  6. It is all a single argument to -drive, and you should ignore
>> +     the leading whitespace.
> 
> Please do not introduce "<name>+colo" block drivers.  This approach is
> invasive and makes block replication specific to only a few block
> drivers, e.g. NBD or qcow2.

NBD is used to connect to secondary qemu, so it must be used. But the primary
qemu uses quorum, so the primary disk can be any format.
The secondary disk is nbd target, and it can also be any format. The cache
disk(active disk/hidden disk) is an empty disk, and it is created before run
COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
supports backing file. But the driver should be updated to support colo mode.

> 
> A cleaner approach is a QMP command or -drive options that work for any
> BlockDriverState.

OK, I will add a new drive option to avoid use "<name>+colo".

Thanks
Wen Congyang

> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-21  1:25     ` Wen Congyang
@ 2015-04-21 15:28       ` Paolo Bonzini
  2015-04-22  9:18         ` Stefan Hajnoczi
  2015-04-22  9:31         ` Kevin Wolf
  2015-04-22  9:29       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  1 sibling, 2 replies; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-21 15:28 UTC (permalink / raw)
  To: Wen Congyang, Stefan Hajnoczi
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang



On 21/04/2015 03:25, Wen Congyang wrote:
>> > Please do not introduce "<name>+colo" block drivers.  This approach is
>> > invasive and makes block replication specific to only a few block
>> > drivers, e.g. NBD or qcow2.
> NBD is used to connect to secondary qemu, so it must be used. But the primary
> qemu uses quorum, so the primary disk can be any format.
> The secondary disk is nbd target, and it can also be any format. The cache
> disk(active disk/hidden disk) is an empty disk, and it is created before run
> COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
> supports backing file. But the driver should be updated to support colo mode.
> 
> > A cleaner approach is a QMP command or -drive options that work for any
> > BlockDriverState.
> 
> OK, I will add a new drive option to avoid use "<name>+colo".

Actually I liked the "foo+colo" names.

These are just internal details of the implementations and the
primary/secondary disks actually can be any format.

Stefan, what was your worry with the +colo block drivers?

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-21 15:28       ` Paolo Bonzini
@ 2015-04-22  9:18         ` Stefan Hajnoczi
  2015-04-22  9:28           ` Wen Congyang
  2015-04-22  9:31         ` Kevin Wolf
  1 sibling, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2015-04-22  9:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

[-- Attachment #1: Type: text/plain, Size: 1303 bytes --]

On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
> On 21/04/2015 03:25, Wen Congyang wrote:
> >> > Please do not introduce "<name>+colo" block drivers.  This approach is
> >> > invasive and makes block replication specific to only a few block
> >> > drivers, e.g. NBD or qcow2.
> > NBD is used to connect to secondary qemu, so it must be used. But the primary
> > qemu uses quorum, so the primary disk can be any format.
> > The secondary disk is nbd target, and it can also be any format. The cache
> > disk(active disk/hidden disk) is an empty disk, and it is created before run
> > COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
> > supports backing file. But the driver should be updated to support colo mode.
> > 
> > > A cleaner approach is a QMP command or -drive options that work for any
> > > BlockDriverState.
> > 
> > OK, I will add a new drive option to avoid use "<name>+colo".
> 
> Actually I liked the "foo+colo" names.
> 
> These are just internal details of the implementations and the
> primary/secondary disks actually can be any format.
> 
> Stefan, what was your worry with the +colo block drivers?

Why does NBD need to know about COLO?  It should be possible to use
iSCSI or other protocols too.

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-22  9:18         ` Stefan Hajnoczi
@ 2015-04-22  9:28           ` Wen Congyang
  2015-04-23  9:55             ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-04-22  9:28 UTC (permalink / raw)
  To: Stefan Hajnoczi, Paolo Bonzini
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
> On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
>> On 21/04/2015 03:25, Wen Congyang wrote:
>>>>> Please do not introduce "<name>+colo" block drivers.  This approach is
>>>>> invasive and makes block replication specific to only a few block
>>>>> drivers, e.g. NBD or qcow2.
>>> NBD is used to connect to secondary qemu, so it must be used. But the primary
>>> qemu uses quorum, so the primary disk can be any format.
>>> The secondary disk is nbd target, and it can also be any format. The cache
>>> disk(active disk/hidden disk) is an empty disk, and it is created before run
>>> COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
>>> supports backing file. But the driver should be updated to support colo mode.
>>>
>>>> A cleaner approach is a QMP command or -drive options that work for any
>>>> BlockDriverState.
>>>
>>> OK, I will add a new drive option to avoid use "<name>+colo".
>>
>> Actually I liked the "foo+colo" names.
>>
>> These are just internal details of the implementations and the
>> primary/secondary disks actually can be any format.
>>
>> Stefan, what was your worry with the +colo block drivers?
> 
> Why does NBD need to know about COLO?  It should be possible to use
> iSCSI or other protocols too.

Hmm, if you want to use iSCSI or other protocols, you should update the driver
to implement block replication's control interface.

Currently, we only support nbd now.

Thanks
Wen Congyang

> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-21  1:25     ` Wen Congyang
  2015-04-21 15:28       ` Paolo Bonzini
@ 2015-04-22  9:29       ` Stefan Hajnoczi
  2015-04-22  9:42         ` Wen Congyang
  1 sibling, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2015-04-22  9:29 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, Dr. David Alan Gilbert, zhanghailiang

[-- Attachment #1: Type: text/plain, Size: 1886 bytes --]

On Tue, Apr 21, 2015 at 09:25:59AM +0800, Wen Congyang wrote:
> On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
> > On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
> > One general question about the design: the Secondary host needs 3x
> > storage space since it has the Secondary Disk, hidden-disk, and
> > active-disk.  Each image requires a certain amount of space depending on
> > writes or COW operations.  Is 3x the upper bound or is there a way to
> > reduce the bound?
> 
> active disk and hidden disk are temp file. It will be maked empty in
> bdrv_do_checkpoint(). Their format is qcow2 now, so it doesn't need too
> many spaces if we do checkpoint periodically.

A question related to checkpoints: both Primary and Secondary are active
(running) in COLO.  The Secondary will be slower since it performs extra
work; disk I/O on the Secondary has a COW overhead.

Does this force the Primary to wait for checkpoint commit so that the
Secondary can catch up?

I'm a little confused about that since the point of COLO is to avoid the
overheads of microcheckpointing, but there still seems to be a
checkpointing bottleneck for disk I/O-intensive applications.

> > 
> > The bound is important since large amounts of data become a bottleneck
> > for writeout/commit operations.  They could cause downtime if the guest
> > is blocked until the entire Disk Buffer has been written to the
> > Secondary Disk during failover, for example.
> 
> OK, I will test it. In my test, vm_stop() will take about 2-3 seconds if
> I run filebench in the guest. Is there anyway to speed it up?

Is it necessary to commit the active disk and hidden disk to the
Secondary Disk on failover?  Maybe the VM could continue executing
immediately and run a block-commit job.  The active disk and hidden disk
files can be dropped once block-commit finishes.

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-21 15:28       ` Paolo Bonzini
  2015-04-22  9:18         ` Stefan Hajnoczi
@ 2015-04-22  9:31         ` Kevin Wolf
  2015-04-22 10:12           ` [Qemu-devel] " Paolo Bonzini
  1 sibling, 1 reply; 74+ messages in thread
From: Kevin Wolf @ 2015-04-22  9:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Stefan Hajnoczi,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, Dr. David Alan Gilbert,
	zhanghailiang

Am 21.04.2015 um 17:28 hat Paolo Bonzini geschrieben:
> 
> 
> On 21/04/2015 03:25, Wen Congyang wrote:
> >> > Please do not introduce "<name>+colo" block drivers.  This approach is
> >> > invasive and makes block replication specific to only a few block
> >> > drivers, e.g. NBD or qcow2.
> > NBD is used to connect to secondary qemu, so it must be used. But the primary
> > qemu uses quorum, so the primary disk can be any format.
> > The secondary disk is nbd target, and it can also be any format. The cache
> > disk(active disk/hidden disk) is an empty disk, and it is created before run
> > COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
> > supports backing file. But the driver should be updated to support colo mode.
> > 
> > > A cleaner approach is a QMP command or -drive options that work for any
> > > BlockDriverState.
> > 
> > OK, I will add a new drive option to avoid use "<name>+colo".
> 
> Actually I liked the "foo+colo" names.
> 
> These are just internal details of the implementations and the
> primary/secondary disks actually can be any format.
> 
> Stefan, what was your worry with the +colo block drivers?

I haven't read the patches yet, so I may be misunderstanding, but
wouldn't a separate filter driver be more appropriate than modifying
qcow2 with logic that has nothing to do with the image format?

Kevin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-22  9:29       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2015-04-22  9:42         ` Wen Congyang
  0 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-22  9:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang, Dr. David Alan Gilbert, zhanghailiang

On 04/22/2015 05:29 PM, Stefan Hajnoczi wrote:
> On Tue, Apr 21, 2015 at 09:25:59AM +0800, Wen Congyang wrote:
>> On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
>>> On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
>>> One general question about the design: the Secondary host needs 3x
>>> storage space since it has the Secondary Disk, hidden-disk, and
>>> active-disk.  Each image requires a certain amount of space depending on
>>> writes or COW operations.  Is 3x the upper bound or is there a way to
>>> reduce the bound?
>>
>> active disk and hidden disk are temp file. It will be maked empty in
>> bdrv_do_checkpoint(). Their format is qcow2 now, so it doesn't need too
>> many spaces if we do checkpoint periodically.
> 
> A question related to checkpoints: both Primary and Secondary are active
> (running) in COLO.  The Secondary will be slower since it performs extra
> work; disk I/O on the Secondary has a COW overhead.
> 
> Does this force the Primary to wait for checkpoint commit so that the
> Secondary can catch up?
> 
> I'm a little confused about that since the point of COLO is to avoid the
> overheads of microcheckpointing, but there still seems to be a
> checkpointing bottleneck for disk I/O-intensive applications.
> 
>>>
>>> The bound is important since large amounts of data become a bottleneck
>>> for writeout/commit operations.  They could cause downtime if the guest
>>> is blocked until the entire Disk Buffer has been written to the
>>> Secondary Disk during failover, for example.
>>
>> OK, I will test it. In my test, vm_stop() will take about 2-3 seconds if
>> I run filebench in the guest. Is there anyway to speed it up?
> 
> Is it necessary to commit the active disk and hidden disk to the
> Secondary Disk on failover?  Maybe the VM could continue executing
> immediately and run a block-commit job.  The active disk and hidden disk
> files can be dropped once block-commit finishes.
> 

We need to stop the vm before doing checkpoint. So if vm_stop() takes
too much time, it will affect the performance.

On failover, we can commit the data while the vm is running. But the active
disk and hidden disk may be put in ramfs, The guest writes faster than
block-commit...

Thanks
Wen Congyang

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-22  9:31         ` Kevin Wolf
@ 2015-04-22 10:12           ` Paolo Bonzini
  2015-04-23  9:00             ` Kevin Wolf
  0 siblings, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-22 10:12 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang



On 22/04/2015 11:31, Kevin Wolf wrote:
>> Actually I liked the "foo+colo" names.
>>
>> These are just internal details of the implementations and the
>> primary/secondary disks actually can be any format.
>>
>> Stefan, what was your worry with the +colo block drivers?
> 
> I haven't read the patches yet, so I may be misunderstanding, but
> wouldn't a separate filter driver be more appropriate than modifying
> qcow2 with logic that has nothing to do with the image format?

Possibly; on the other hand, why multiply the size of the test matrix
with options that no one will use and that will bitrot?

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description Wen Congyang
  2015-04-20 15:30   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2015-04-22 10:39   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-22 10:39 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, zhanghailiang

* Wen Congyang (wency@cn.fujitsu.com) wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 153 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 0000000..4426ffc
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,153 @@
> +Block replication
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +Block replication is used for continuous checkpoints. It is designed
> +for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
> +where the Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content(it could be from either "Secondary Write Requests" or
> +       previous COW of "Primary Write Requests") in the Disk buffer.
> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from many basic
> +blocks that are already in QEMU.
> +
> +         virtio-blk       ||
> +             ^            ||                            .----------
> +             |            ||                            | Secondary
> +        1 Quorum          ||                            '----------
> +         /      \         ||
> +        /        \        ||
> +   Primary      2 NBD  ------->  2 NBD
> +     disk       client    ||     server                                         virtio-blk
> +                          ||        ^                                                ^
> +--------.                 ||        |                                                |
> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
> +--------'                 ||        |          backing        ^       backing
> +                          ||        |                         |
> +                          ||        |                         |
> +                          ||        '-------------------------'
> +                          ||           drive-backup sync=none
> +
> +1) The disk on the primary is represented by a block device with two
> +children, providing replication between a primary disk and the host that
> +runs the secondary VM. The read pattern for quorum can be extended to
> +make the primary always read from the local disk instead of going through
> +NBD.
> +
> +2) The secondary disk receives writes from the primary VM through QEMU's
> +embedded NBD server (speculative write-through).
> +
> +3) The disk on the secondary is represented by a custom block device
> +(called active-disk). It should be an empty disk, and the format should
> +be qcow2.

If active_disk is empty, how do you get an initial copy of the primary's
disk contents over to the secondary?

It would be interesting to consider how this would change to support
recovery back to the point where we have a pair of machines after the 
primary failed (continuous ft);  somehow the drive on the secondary would
have to flip over to being a quorum set and look like the current primary.

Dave

> +
> +4) The hidden-disk is created automatically. It buffers the original content
> +that is modified by the primary VM. It should also be an empty disk, and
> +the driver supports bdrv_make_empty().
> +
> +== New block driver interface ==
> +We add three block driver interfaces to control block replication:
> +a. bdrv_start_replication()
> +   Start block replication, called in migration/checkpoint thread.
> +   We must call bdrv_start_replication() in secondary QEMU before
> +   calling bdrv_start_replication() in primary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transferred to
> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
> +   thread.
> +c. bdrv_stop_replication()
> +   It is called on failover. We will flush the Disk buffer into
> +   Secondary Disk and stop block replication. The vm should be stopped
> +   before calling it. The caller must hold the I/O mutex lock if it is
> +   in migration/checkpoint thread.
> +
> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw,\
> +         children.1.file.driver=nbd+colo,\
> +         children.1.file.host=xxx,\
> +         children.1.file.port=xxx,\
> +         children.1.file.export=xxx,\
> +         children.1.driver=raw,\
> +         children.1.ignore-errors=on
> +  Note:
> +  1. NBD Client should not be the first child of quorum.
> +  2. There should be only one NBD Client.
> +  3. host is the secondary physical machine's hostname or IP
> +  4. Each disk must have its own export name.
> +  5. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.
> +
> +Secondary:
> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
> +         backing_reference.drive_id=nbd_target1,\
> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
> +         backing_reference.hidden-disk.driver=qcow2,\
> +         backing_reference.hidden-disk.allow-write-backing-file=on
> +  Then run qmp command:
> +    nbd_server_start host:port
> +  Note:
> +  1. The export name for the same disk must be the same in primary
> +     and secondary QEMU command line
> +  2. The qmp command nbd-server-start must be run before running the
> +     qmp command migrate on primary QEMU
> +  3. Don't use nbd-server-start's other options
> +  4. Active disk, hidden disk and nbd target's length should be the
> +     same.
> +  5. It is better to put active disk and hidden disk in ramdisk.
> +  6. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.
> -- 
> 2.1.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 04/14] Add new block driver interfaces to control block replication
  2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 04/14] Add new block driver interfaces to control block replication Wen Congyang
@ 2015-04-22 12:56   ` Eric Blake
  0 siblings, 0 replies; 74+ messages in thread
From: Eric Blake @ 2015-04-22 12:56 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini
  Cc: Kevin Wolf, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Luiz Capitulino, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, Michael Roth, zhanghailiang

On 04/03/2015 04:01 AM, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block.c                   | 40 ++++++++++++++++++++++++++++++++++++++++
>  include/block/block.h     |  5 +++++
>  include/block/block_int.h | 11 +++++++++++
>  qapi/block.json           | 16 ++++++++++++++++
>  4 files changed, 72 insertions(+)

> +++ b/qapi/block.json
> @@ -40,6 +40,22 @@
>    'data': ['auto', 'none', 'lba', 'large', 'rechs']}
>  
>  ##
> +# @COLOMode

Comment...

> +#
> +# An enumeration of replication modes.
> +#
> +# @unprotected: COLO is not started or after failover.
> +#
> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
> +#
> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
> +#
> +# Since: 2.4
> +##
> +{ 'enum' : 'ReplicationMode',

...doesn't match type name.

> +  'data' : ['unprotected', 'primary', 'secondary']}
> +


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-22 10:12           ` [Qemu-devel] " Paolo Bonzini
@ 2015-04-23  9:00             ` Kevin Wolf
  2015-04-23  9:14               ` Wen Congyang
  2015-04-23  9:26               ` Paolo Bonzini
  0 siblings, 2 replies; 74+ messages in thread
From: Kevin Wolf @ 2015-04-23  9:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

Am 22.04.2015 um 12:12 hat Paolo Bonzini geschrieben:
> On 22/04/2015 11:31, Kevin Wolf wrote:
> >> Actually I liked the "foo+colo" names.
> >>
> >> These are just internal details of the implementations and the
> >> primary/secondary disks actually can be any format.
> >>
> >> Stefan, what was your worry with the +colo block drivers?
> > 
> > I haven't read the patches yet, so I may be misunderstanding, but
> > wouldn't a separate filter driver be more appropriate than modifying
> > qcow2 with logic that has nothing to do with the image format?
> 
> Possibly; on the other hand, why multiply the size of the test matrix
> with options that no one will use and that will bitrot?

Because it may be the right design.

If you're really worried about the test matrix, put a check in the
filter block driver that its bs->file is qcow2. Of course, such an
artificial restriction looks a bit ugly, but using a bad design just
in order to get the same restriction is even worse.

Stefan originally wanted to put image streaming in the QED driver. I
think we'll agree today that it was right to reject that. It's simply
not functionality related to the format. Adding replication logic to
qcow2 looks similar to me in that respect.

Kevin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23  9:00             ` Kevin Wolf
@ 2015-04-23  9:14               ` Wen Congyang
  2015-04-23 10:05                 ` Paolo Bonzini
  2015-04-23  9:26               ` Paolo Bonzini
  1 sibling, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-04-23  9:14 UTC (permalink / raw)
  To: Kevin Wolf, Paolo Bonzini
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

On 04/23/2015 05:00 PM, Kevin Wolf wrote:
> Am 22.04.2015 um 12:12 hat Paolo Bonzini geschrieben:
>> On 22/04/2015 11:31, Kevin Wolf wrote:
>>>> Actually I liked the "foo+colo" names.
>>>>
>>>> These are just internal details of the implementations and the
>>>> primary/secondary disks actually can be any format.
>>>>
>>>> Stefan, what was your worry with the +colo block drivers?
>>>
>>> I haven't read the patches yet, so I may be misunderstanding, but
>>> wouldn't a separate filter driver be more appropriate than modifying
>>> qcow2 with logic that has nothing to do with the image format?
>>
>> Possibly; on the other hand, why multiply the size of the test matrix
>> with options that no one will use and that will bitrot?
> 
> Because it may be the right design.
> 
> If you're really worried about the test matrix, put a check in the
> filter block driver that its bs->file is qcow2. Of course, such an
> artificial restriction looks a bit ugly, but using a bad design just
> in order to get the same restriction is even worse.

The bs->file->driver should support backing file, and use backing reference
already.

What about the primary side? We should control when to connect to NBD server,
not in nbd_open().

Thanks
Wen Congyang

> 
> Stefan originally wanted to put image streaming in the QED driver. I
> think we'll agree today that it was right to reject that. It's simply
> not functionality related to the format. Adding replication logic to
> qcow2 looks similar to me in that respect.
> 
> Kevin
> .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23  9:00             ` Kevin Wolf
  2015-04-23  9:14               ` Wen Congyang
@ 2015-04-23  9:26               ` Paolo Bonzini
  2015-04-23  9:37                 ` Kevin Wolf
  2015-04-23  9:41                 ` Wen Congyang
  1 sibling, 2 replies; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-23  9:26 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

On 23/04/2015 11:00, Kevin Wolf wrote:
> Because it may be the right design.
> 
> If you're really worried about the test matrix, put a check in the
> filter block driver that its bs->file is qcow2. Of course, such an
> artificial restriction looks a bit ugly, but using a bad design just
> in order to get the same restriction is even worse.
> 
> Stefan originally wanted to put image streaming in the QED driver. I
> think we'll agree today that it was right to reject that. It's simply
> not functionality related to the format. Adding replication logic to
> qcow2 looks similar to me in that respect.

Yes, I can't deny it is similar.  Still, there is a very important
difference: limiting colo's internal workings to qcow2 or NBD doesn't
limit what the user can do (while streaming limited the user to image
files in QED format).

It may also depend on how the patches look like and how much the colo
code relies on other internal state.

For NBD the answer is almost nothing, and you don't even need a filter
driver.  You only need to separate sharply the "configure" and "open"
phases.  So it may indeed be possible to generalize the handling of the
secondary to non-NBD.

It may be the same for the primary; I admit I haven't even tried to read
the qcow2 patch, as I couldn't do a meaningful review.

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23  9:26               ` Paolo Bonzini
@ 2015-04-23  9:37                 ` Kevin Wolf
  2015-04-23  9:41                 ` Wen Congyang
  1 sibling, 0 replies; 74+ messages in thread
From: Kevin Wolf @ 2015-04-23  9:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

Am 23.04.2015 um 11:26 hat Paolo Bonzini geschrieben:
> 
> 
> On 23/04/2015 11:00, Kevin Wolf wrote:
> > Because it may be the right design.
> > 
> > If you're really worried about the test matrix, put a check in the
> > filter block driver that its bs->file is qcow2. Of course, such an
> > artificial restriction looks a bit ugly, but using a bad design just
> > in order to get the same restriction is even worse.
> > 
> > Stefan originally wanted to put image streaming in the QED driver. I
> > think we'll agree today that it was right to reject that. It's simply
> > not functionality related to the format. Adding replication logic to
> > qcow2 looks similar to me in that respect.
> 
> Yes, I can't deny it is similar.  Still, there is a very important
> difference: limiting colo's internal workings to qcow2 or NBD doesn't
> limit what the user can do (while streaming limited the user to image
> files in QED format).
> 
> It may also depend on how the patches look like and how much the colo
> code relies on other internal state.
> 
> For NBD the answer is almost nothing, and you don't even need a filter
> driver.  You only need to separate sharply the "configure" and "open"
> phases.  So it may indeed be possible to generalize the handling of the
> secondary to non-NBD.
> 
> It may be the same for the primary; I admit I haven't even tried to read
> the qcow2 patch, as I couldn't do a meaningful review.

The qcow2 patch only modifies two existing lines. The rest it adds is a
the qcow2+colo BlockDriver, which references some qcow2 functions
directly and has a wrapper for others. On a quick scan, it didn't seem
like it accesses any internal qcow2 variables or calls any private
functions.

In other words, it's the perfect example for a filter.

Kevin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23  9:26               ` Paolo Bonzini
  2015-04-23  9:37                 ` Kevin Wolf
@ 2015-04-23  9:41                 ` Wen Congyang
  1 sibling, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-23  9:41 UTC (permalink / raw)
  To: Paolo Bonzini, Kevin Wolf
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

On 04/23/2015 05:26 PM, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 11:00, Kevin Wolf wrote:
>> Because it may be the right design.
>>
>> If you're really worried about the test matrix, put a check in the
>> filter block driver that its bs->file is qcow2. Of course, such an
>> artificial restriction looks a bit ugly, but using a bad design just
>> in order to get the same restriction is even worse.
>>
>> Stefan originally wanted to put image streaming in the QED driver. I
>> think we'll agree today that it was right to reject that. It's simply
>> not functionality related to the format. Adding replication logic to
>> qcow2 looks similar to me in that respect.
> 
> Yes, I can't deny it is similar.  Still, there is a very important
> difference: limiting colo's internal workings to qcow2 or NBD doesn't
> limit what the user can do (while streaming limited the user to image
> files in QED format).
> 
> It may also depend on how the patches look like and how much the colo
> code relies on other internal state.
> 
> For NBD the answer is almost nothing, and you don't even need a filter
> driver.  You only need to separate sharply the "configure" and "open"
> phases.  So it may indeed be possible to generalize the handling of the
> secondary to non-NBD.
> 
> It may be the same for the primary; I admit I haven't even tried to read
> the qcow2 patch, as I couldn't do a meaningful review.

For qcow2, we need to read/write from NBD target directly after failover,
because the cache image(the format is qcow2) may be put in ramfs to get
better performance. The other thing is not changed.

For qcow2, if we use a filter driver, the bs->file->drv should support
backing file, and make_empty. So it can be the other format.

Thanks
Wen Congyang

> 
> Paolo
> .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-22  9:28           ` Wen Congyang
@ 2015-04-23  9:55             ` Stefan Hajnoczi
  2015-04-23 10:11               ` Wen Congyang
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2015-04-23  9:55 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Stefan Hajnoczi,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Paolo Bonzini, Yang Hongyang, Dr. David Alan Gilbert,
	zhanghailiang

[-- Attachment #1: Type: text/plain, Size: 2089 bytes --]

On Wed, Apr 22, 2015 at 05:28:01PM +0800, Wen Congyang wrote:
> On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
> > On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
> >> On 21/04/2015 03:25, Wen Congyang wrote:
> >>>>> Please do not introduce "<name>+colo" block drivers.  This approach is
> >>>>> invasive and makes block replication specific to only a few block
> >>>>> drivers, e.g. NBD or qcow2.
> >>> NBD is used to connect to secondary qemu, so it must be used. But the primary
> >>> qemu uses quorum, so the primary disk can be any format.
> >>> The secondary disk is nbd target, and it can also be any format. The cache
> >>> disk(active disk/hidden disk) is an empty disk, and it is created before run
> >>> COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
> >>> supports backing file. But the driver should be updated to support colo mode.
> >>>
> >>>> A cleaner approach is a QMP command or -drive options that work for any
> >>>> BlockDriverState.
> >>>
> >>> OK, I will add a new drive option to avoid use "<name>+colo".
> >>
> >> Actually I liked the "foo+colo" names.
> >>
> >> These are just internal details of the implementations and the
> >> primary/secondary disks actually can be any format.
> >>
> >> Stefan, what was your worry with the +colo block drivers?
> > 
> > Why does NBD need to know about COLO?  It should be possible to use
> > iSCSI or other protocols too.
> 
> Hmm, if you want to use iSCSI or other protocols, you should update the driver
> to implement block replication's control interface.
> 
> Currently, we only support nbd now.

I took a quick look at the NBD patches in this series, it looks like
they are a hacky way to make quorum dynamically reconfigurable.

In other words, what you really need is a way to enable/disable a quorum
child or even add/remove children at run-time.

NBD is not the right place to implement that.  Add APIs to quorum so
COLO code can use them.

Or maybe I'm misinterpreting the patches, I only took a quick look...

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23  9:14               ` Wen Congyang
@ 2015-04-23 10:05                 ` Paolo Bonzini
  2015-04-23 10:17                   ` Kevin Wolf
  0 siblings, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-23 10:05 UTC (permalink / raw)
  To: Wen Congyang, Kevin Wolf
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

On 23/04/2015 11:14, Wen Congyang wrote:
> The bs->file->driver should support backing file, and use backing reference
> already.
> 
> What about the primary side? We should control when to connect to NBD server,
> not in nbd_open().

My naive suggestion could be to add a BDRV_O_NO_CONNECT option to
bdrv_open and a separate bdrv_connect callback.  Open would fail if
BDRV_O_NO_CONNECT is specified and drv->bdrv_connect is NULL.

You would then need a way to have quorum pass BDRV_O_NO_CONNECT.

Perhaps quorum is not a great match after all, and it's better to add a
new "colo" driver similar to quorum but simpler and only using the read
policy that you need for colo.  The new driver would also know how to
use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
be too big.

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23  9:55             ` Stefan Hajnoczi
@ 2015-04-23 10:11               ` Wen Congyang
  0 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-23 10:11 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Stefan Hajnoczi,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Paolo Bonzini, Yang Hongyang, Dr. David Alan Gilbert,
	zhanghailiang

On 04/23/2015 05:55 PM, Stefan Hajnoczi wrote:
> On Wed, Apr 22, 2015 at 05:28:01PM +0800, Wen Congyang wrote:
>> On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
>>> On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
>>>> On 21/04/2015 03:25, Wen Congyang wrote:
>>>>>>> Please do not introduce "<name>+colo" block drivers.  This approach is
>>>>>>> invasive and makes block replication specific to only a few block
>>>>>>> drivers, e.g. NBD or qcow2.
>>>>> NBD is used to connect to secondary qemu, so it must be used. But the primary
>>>>> qemu uses quorum, so the primary disk can be any format.
>>>>> The secondary disk is nbd target, and it can also be any format. The cache
>>>>> disk(active disk/hidden disk) is an empty disk, and it is created before run
>>>>> COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
>>>>> supports backing file. But the driver should be updated to support colo mode.
>>>>>
>>>>>> A cleaner approach is a QMP command or -drive options that work for any
>>>>>> BlockDriverState.
>>>>>
>>>>> OK, I will add a new drive option to avoid use "<name>+colo".
>>>>
>>>> Actually I liked the "foo+colo" names.
>>>>
>>>> These are just internal details of the implementations and the
>>>> primary/secondary disks actually can be any format.
>>>>
>>>> Stefan, what was your worry with the +colo block drivers?
>>>
>>> Why does NBD need to know about COLO?  It should be possible to use
>>> iSCSI or other protocols too.
>>
>> Hmm, if you want to use iSCSI or other protocols, you should update the driver
>> to implement block replication's control interface.
>>
>> Currently, we only support nbd now.
> 
> I took a quick look at the NBD patches in this series, it looks like
> they are a hacky way to make quorum dynamically reconfigurable.
> 
> In other words, what you really need is a way to enable/disable a quorum
> child or even add/remove children at run-time.
> 
> NBD is not the right place to implement that.  Add APIs to quorum so
> COLO code can use them.
> 
> Or maybe I'm misinterpreting the patches, I only took a quick look...

Hmm, if we can enable/disable or add/remove a child at run-time, it is another
choice.

Thanks
Wen Congyang

> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 10:05                 ` Paolo Bonzini
@ 2015-04-23 10:17                   ` Kevin Wolf
  2015-04-23 10:33                     ` Paolo Bonzini
  0 siblings, 1 reply; 74+ messages in thread
From: Kevin Wolf @ 2015-04-23 10:17 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

Am 23.04.2015 um 12:05 hat Paolo Bonzini geschrieben:
> 
> 
> On 23/04/2015 11:14, Wen Congyang wrote:
> > The bs->file->driver should support backing file, and use backing reference
> > already.
> > 
> > What about the primary side? We should control when to connect to NBD server,
> > not in nbd_open().

Why do you need to create the block device before the connection should
be made?

> My naive suggestion could be to add a BDRV_O_NO_CONNECT option to
> bdrv_open and a separate bdrv_connect callback.  Open would fail if
> BDRV_O_NO_CONNECT is specified and drv->bdrv_connect is NULL.
> 
> You would then need a way to have quorum pass BDRV_O_NO_CONNECT.

Please don't add new flags. If we have to, we can introduce a new option
(in the QDict), but first let's check if it's really necessary.

> Perhaps quorum is not a great match after all, and it's better to add a
> new "colo" driver similar to quorum but simpler and only using the read
> policy that you need for colo.  The new driver would also know how to
> use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
> be too big.

I thought the same, but haven't looked at the details yet. But if I
understand correctly, the plan is to take quorum and add options to turn
off the functionality of using a quorum - that's a bit odd.

What I think is really needed here is essentially an active mirror
filter.

Kevin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 10:17                   ` Kevin Wolf
@ 2015-04-23 10:33                     ` Paolo Bonzini
  2015-04-23 10:40                       ` Kevin Wolf
  0 siblings, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-23 10:33 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang



On 23/04/2015 12:17, Kevin Wolf wrote:
> > Perhaps quorum is not a great match after all, and it's better to add a
> > new "colo" driver similar to quorum but simpler and only using the read
> > policy that you need for colo.  The new driver would also know how to
> > use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
> > be too big.
>
> I thought the same, but haven't looked at the details yet. But if I
> understand correctly, the plan is to take quorum and add options to turn
> off the functionality of using a quorum - that's a bit odd.

Yes, indeed.  Quorum was okay for experimenting, now it's better to "cp
quorum.c colo.c" and clean up the code instead of adding options to
quorum.  There's not going to be more duplication between quorum.c and
colo.c than, say, between colo.c and blkverify.c.

> What I think is really needed here is essentially an active mirror
> filter.

Yes, an active synchronous mirror.  It can be either a filter or a
device.  Has anyone ever come up with a design for filters?  Colo
doesn't need much more complexity than a "toy" blkverify filter.

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 10:33                     ` Paolo Bonzini
@ 2015-04-23 10:40                       ` Kevin Wolf
  2015-04-23 10:44                         ` Paolo Bonzini
  0 siblings, 1 reply; 74+ messages in thread
From: Kevin Wolf @ 2015-04-23 10:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

Am 23.04.2015 um 12:33 hat Paolo Bonzini geschrieben:
> On 23/04/2015 12:17, Kevin Wolf wrote:
> > > Perhaps quorum is not a great match after all, and it's better to add a
> > > new "colo" driver similar to quorum but simpler and only using the read
> > > policy that you need for colo.  The new driver would also know how to
> > > use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
> > > be too big.
> >
> > I thought the same, but haven't looked at the details yet. But if I
> > understand correctly, the plan is to take quorum and add options to turn
> > off the functionality of using a quorum - that's a bit odd.
> 
> Yes, indeed.  Quorum was okay for experimenting, now it's better to "cp
> quorum.c colo.c" and clean up the code instead of adding options to
> quorum.  There's not going to be more duplication between quorum.c and
> colo.c than, say, between colo.c and blkverify.c.

The question that is still open for me is whether it would be a colo.c
or an active-mirror.c, i.e. if this would be tied specifically to COLO
or if it could be kept generic enough that it could be used for other
use cases as well.

> > What I think is really needed here is essentially an active mirror
> > filter.
> 
> Yes, an active synchronous mirror.  It can be either a filter or a
> device.  Has anyone ever come up with a design for filters?  Colo
> doesn't need much more complexity than a "toy" blkverify filter.

I think what we're doing now for quorum/blkverify/blkdebug is okay.

The tricky and yet unsolved part is how to add/remove filter BDSes at
runtime (dynamic reconfiguration), but IIUC that isn't needed here.

Kevin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 10:40                       ` Kevin Wolf
@ 2015-04-23 10:44                         ` Paolo Bonzini
  2015-04-23 11:35                           ` Wen Congyang
  2015-04-23 11:36                           ` Kevin Wolf
  0 siblings, 2 replies; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-23 10:44 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang



On 23/04/2015 12:40, Kevin Wolf wrote:
> The question that is still open for me is whether it would be a colo.c
> or an active-mirror.c, i.e. if this would be tied specifically to COLO
> or if it could be kept generic enough that it could be used for other
> use cases as well.

Understood (now).

>>> What I think is really needed here is essentially an active mirror
>>> filter.
>>
>> Yes, an active synchronous mirror.  It can be either a filter or a
>> device.  Has anyone ever come up with a design for filters?  Colo
>> doesn't need much more complexity than a "toy" blkverify filter.
> 
> I think what we're doing now for quorum/blkverify/blkdebug is okay.
> 
> The tricky and yet unsolved part is how to add/remove filter BDSes at
> runtime (dynamic reconfiguration), but IIUC that isn't needed here.

Yes, it is.  The "defer connection to NBD when replication is started"
is effectively "add the COLO filter" (with the NBD connection as a
children) when replication is started.

Similarly "close the NBD device when replication is stopped" is
effectively "remove the COLO filter" (which brings the NBD connection
down with it).

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 10:44                         ` Paolo Bonzini
@ 2015-04-23 11:35                           ` Wen Congyang
  2015-04-23 11:36                           ` Kevin Wolf
  1 sibling, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-04-23 11:35 UTC (permalink / raw)
  To: Paolo Bonzini, Kevin Wolf
  Cc: Fam Zheng, Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	Dr. David Alan Gilbert, zhanghailiang

On 04/23/2015 06:44 PM, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 12:40, Kevin Wolf wrote:
>> The question that is still open for me is whether it would be a colo.c
>> or an active-mirror.c, i.e. if this would be tied specifically to COLO
>> or if it could be kept generic enough that it could be used for other
>> use cases as well.
> 
> Understood (now).
> 
>>>> What I think is really needed here is essentially an active mirror
>>>> filter.
>>>
>>> Yes, an active synchronous mirror.  It can be either a filter or a
>>> device.  Has anyone ever come up with a design for filters?  Colo
>>> doesn't need much more complexity than a "toy" blkverify filter.
>>
>> I think what we're doing now for quorum/blkverify/blkdebug is okay.
>>
>> The tricky and yet unsolved part is how to add/remove filter BDSes at
>> runtime (dynamic reconfiguration), but IIUC that isn't needed here.
> 
> Yes, it is.  The "defer connection to NBD when replication is started"
> is effectively "add the COLO filter" (with the NBD connection as a
> children) when replication is started.
> 
> Similarly "close the NBD device when replication is stopped" is
> effectively "remove the COLO filter" (which brings the NBD connection
> down with it).

Hmm, I don't understand it clearly. Do you mean:
1. COLO filter is quorum's child
2. We can add/remove quorum's child at run-time.

If I misunderstand something, please correct me.

Thanks
Wen Congyang

> 
> Paolo
> .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 10:44                         ` Paolo Bonzini
  2015-04-23 11:35                           ` Wen Congyang
@ 2015-04-23 11:36                           ` Kevin Wolf
  2015-04-23 11:53                             ` Paolo Bonzini
  1 sibling, 1 reply; 74+ messages in thread
From: Kevin Wolf @ 2015-04-23 11:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, Dr. David Alan Gilbert,
	zhanghailiang

Am 23.04.2015 um 12:44 hat Paolo Bonzini geschrieben:
> On 23/04/2015 12:40, Kevin Wolf wrote:
> > The question that is still open for me is whether it would be a colo.c
> > or an active-mirror.c, i.e. if this would be tied specifically to COLO
> > or if it could be kept generic enough that it could be used for other
> > use cases as well.
> 
> Understood (now).
> 
> >>> What I think is really needed here is essentially an active mirror
> >>> filter.
> >>
> >> Yes, an active synchronous mirror.  It can be either a filter or a
> >> device.  Has anyone ever come up with a design for filters?  Colo
> >> doesn't need much more complexity than a "toy" blkverify filter.
> > 
> > I think what we're doing now for quorum/blkverify/blkdebug is okay.
> > 
> > The tricky and yet unsolved part is how to add/remove filter BDSes at
> > runtime (dynamic reconfiguration), but IIUC that isn't needed here.
> 
> Yes, it is.  The "defer connection to NBD when replication is started"
> is effectively "add the COLO filter" (with the NBD connection as a
> children) when replication is started.
> 
> Similarly "close the NBD device when replication is stopped" is
> effectively "remove the COLO filter" (which brings the NBD connection
> down with it).

Crap. Then we need to figure out dynamic reconfiguration for filters
(CCed Markus and Jeff).

And this is really part of the fundamental operation mode and not just a
way to give users a way to change their mind at runtime? Because if it
were, we could go forward without that for the start and add dynamic
reconfiguration in a second step.

Anyway, even if we move it to a second step, it looks like we need to
design something rather soon now.

Kevin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 11:36                           ` Kevin Wolf
@ 2015-04-23 11:53                             ` Paolo Bonzini
  2015-04-23 12:05                               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-23 11:53 UTC (permalink / raw)
  To: Kevin Wolf, Wen Congyang, Dr. David Alan Gilbert
  Cc: Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, zhanghailiang



On 23/04/2015 13:36, Kevin Wolf wrote:
> Crap. Then we need to figure out dynamic reconfiguration for filters
> (CCed Markus and Jeff).
> 
> And this is really part of the fundamental operation mode and not just a
> way to give users a way to change their mind at runtime? Because if it
> were, we could go forward without that for the start and add dynamic
> reconfiguration in a second step.

I honestly don't know.  Wen, David?

Paolo

> Anyway, even if we move it to a second step, it looks like we need to
> design something rather soon now.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 11:53                             ` Paolo Bonzini
@ 2015-04-23 12:05                               ` Dr. David Alan Gilbert
  2015-04-23 12:11                                 ` Paolo Bonzini
  0 siblings, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-23 12:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, zhanghailiang

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 23/04/2015 13:36, Kevin Wolf wrote:
> > Crap. Then we need to figure out dynamic reconfiguration for filters
> > (CCed Markus and Jeff).
> > 
> > And this is really part of the fundamental operation mode and not just a
> > way to give users a way to change their mind at runtime? Because if it
> > were, we could go forward without that for the start and add dynamic
> > reconfiguration in a second step.
> 
> I honestly don't know.  Wen, David?

As presented at the moment, I don't see there's any dynamic reconfiguration
on the primary side at the moment - it starts up in the configuration with
the quorum(disk, NBD), and that's the way it stays throughout the fault-tolerant
setup; the primary doesn't start running until the secondary is connected.

Similarly the secondary startups in the configuration and stays that way;
the interesting question to me is what happens after a failure.

If the secondary fails, then your primary is still quorum(disk, NBD) but
the NBD side is dead - so I don't think you need to do anything there
immediately.

If the primary fails, and the secondary takes over, then a lot of the
stuff on the secondary now becomes redundent; does that stay the same
and just operate in some form of passthrough - or does it need to
change configuration?

The hard part to me is how to bring it back into fault-tolerance now;
after a primary failure, the secondary now needs to morph into something
like a primary, and somehow you need to bring up a new secondary
and get that new secondary an image of the primaries current disk.

Dave

> Paolo
> 
> > Anyway, even if we move it to a second step, it looks like we need to
> > design something rather soon now.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 12:05                               ` Dr. David Alan Gilbert
@ 2015-04-23 12:11                                 ` Paolo Bonzini
  2015-04-23 12:19                                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-23 12:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, zhanghailiang



On 23/04/2015 14:05, Dr. David Alan Gilbert wrote:
> As presented at the moment, I don't see there's any dynamic reconfiguration
> on the primary side at the moment

So that means the bdrv_start_replication and bdrv_stop_replication
callbacks are more or less redundant, at least on the primary?

In fact, who calls them?  Certainly nothing in this patch set...
:)

Paolo

 - it starts up in the configuration with
> the quorum(disk, NBD), and that's the way it stays throughout the fault-tolerant
> setup; the primary doesn't start running until the secondary is connected.
> 
> Similarly the secondary startups in the configuration and stays that way;
> the interesting question to me is what happens after a failure.
> 
> If the secondary fails, then your primary is still quorum(disk, NBD) but
> the NBD side is dead - so I don't think you need to do anything there
> immediately.
> 
> If the primary fails, and the secondary takes over, then a lot of the
> stuff on the secondary now becomes redundent; does that stay the same
> and just operate in some form of passthrough - or does it need to
> change configuration?
> 
> The hard part to me is how to bring it back into fault-tolerance now;
> after a primary failure, the secondary now needs to morph into something
> like a primary, and somehow you need to bring up a new secondary
> and get that new secondary an image of the primaries current disk.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 12:11                                 ` Paolo Bonzini
@ 2015-04-23 12:19                                   ` Dr. David Alan Gilbert
  2015-04-23 12:23                                     ` Paolo Bonzini
  0 siblings, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-23 12:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, zhanghailiang

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 23/04/2015 14:05, Dr. David Alan Gilbert wrote:
> > As presented at the moment, I don't see there's any dynamic reconfiguration
> > on the primary side at the moment
> 
> So that means the bdrv_start_replication and bdrv_stop_replication
> callbacks are more or less redundant, at least on the primary?
> 
> In fact, who calls them?  Certainly nothing in this patch set...
> :)

In the main colo set (I'm looking at the February version) there
are calls to them, the 'stop_replication' is called at failover time.

Here is I think the later version:
http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html

Dave

> 
> Paolo
> 
>  - it starts up in the configuration with
> > the quorum(disk, NBD), and that's the way it stays throughout the fault-tolerant
> > setup; the primary doesn't start running until the secondary is connected.
> > 
> > Similarly the secondary startups in the configuration and stays that way;
> > the interesting question to me is what happens after a failure.
> > 
> > If the secondary fails, then your primary is still quorum(disk, NBD) but
> > the NBD side is dead - so I don't think you need to do anything there
> > immediately.
> > 
> > If the primary fails, and the secondary takes over, then a lot of the
> > stuff on the secondary now becomes redundent; does that stay the same
> > and just operate in some form of passthrough - or does it need to
> > change configuration?
> > 
> > The hard part to me is how to bring it back into fault-tolerance now;
> > after a primary failure, the secondary now needs to morph into something
> > like a primary, and somehow you need to bring up a new secondary
> > and get that new secondary an image of the primaries current disk.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 12:19                                   ` Dr. David Alan Gilbert
@ 2015-04-23 12:23                                     ` Paolo Bonzini
  2015-04-24  2:01                                       ` Fam Zheng
  0 siblings, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-23 12:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, zhanghailiang



On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
>> > So that means the bdrv_start_replication and bdrv_stop_replication
>> > callbacks are more or less redundant, at least on the primary?
>> > 
>> > In fact, who calls them?  Certainly nothing in this patch set...
>> > :)
> In the main colo set (I'm looking at the February version) there
> are calls to them, the 'stop_replication' is called at failover time.
> 
> Here is I think the later version:
> http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html

I think the primary shouldn't do any I/O after failover (and the
secondary should close the NBD server) so it is probably okay to ignore
the removal for now.  Inserting the filter dynamically is probably
needed though.

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-23 12:23                                     ` Paolo Bonzini
@ 2015-04-24  2:01                                       ` Fam Zheng
  2015-04-24  2:16                                         ` Wen Congyang
  0 siblings, 1 reply; 74+ messages in thread
From: Fam Zheng @ 2015-04-24  2:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Yang Hongyang, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert, qemu devel,
	zhanghailiang, Gonglei, Stefan Hajnoczi, Max Reitz,
	Lai Jiangshan

On Thu, 04/23 14:23, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
> >> > So that means the bdrv_start_replication and bdrv_stop_replication
> >> > callbacks are more or less redundant, at least on the primary?
> >> > 
> >> > In fact, who calls them?  Certainly nothing in this patch set...
> >> > :)
> > In the main colo set (I'm looking at the February version) there
> > are calls to them, the 'stop_replication' is called at failover time.
> > 
> > Here is I think the later version:
> > http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html
> 
> I think the primary shouldn't do any I/O after failover (and the
> secondary should close the NBD server) so it is probably okay to ignore
> the removal for now.  Inserting the filter dynamically is probably
> needed though.

Or maybe just enabling/disabling?

Fam

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-24  2:01                                       ` Fam Zheng
@ 2015-04-24  2:16                                         ` Wen Congyang
  2015-04-24  7:47                                           ` Paolo Bonzini
  0 siblings, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-04-24  2:16 UTC (permalink / raw)
  To: Fam Zheng, Paolo Bonzini
  Cc: Kevin Wolf, Yang Hongyang, Lai Jiangshan, qemu block, armbru,
	jcody, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	qemu devel, Gonglei, Stefan Hajnoczi, Max Reitz, zhanghailiang

On 04/24/2015 10:01 AM, Fam Zheng wrote:
> On Thu, 04/23 14:23, Paolo Bonzini wrote:
>>
>>
>> On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
>>>>> So that means the bdrv_start_replication and bdrv_stop_replication
>>>>> callbacks are more or less redundant, at least on the primary?
>>>>>
>>>>> In fact, who calls them?  Certainly nothing in this patch set...
>>>>> :)
>>> In the main colo set (I'm looking at the February version) there
>>> are calls to them, the 'stop_replication' is called at failover time.
>>>
>>> Here is I think the later version:
>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html
>>
>> I think the primary shouldn't do any I/O after failover (and the
>> secondary should close the NBD server) so it is probably okay to ignore
>> the removal for now.  Inserting the filter dynamically is probably
>> needed though.
> 
> Or maybe just enabling/disabling?

Hmm, after failover, the secondary qemu should become primary qemu, but we don't
know the nbd server's IP/port when we execute the secondary qemu. So we need
to inserting nbd client dynamically after failover.

Thanks
Wen Congyang

> 
> Fam
> .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-24  2:16                                         ` Wen Congyang
@ 2015-04-24  7:47                                           ` Paolo Bonzini
  2015-04-24  7:55                                             ` Wen Congyang
  0 siblings, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-24  7:47 UTC (permalink / raw)
  To: Wen Congyang, Fam Zheng
  Cc: Kevin Wolf, Yang Hongyang, Lai Jiangshan, qemu block, armbru,
	jcody, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	qemu devel, Gonglei, Stefan Hajnoczi, Max Reitz, zhanghailiang



On 24/04/2015 04:16, Wen Congyang wrote:
>>> >> I think the primary shouldn't do any I/O after failover (and the
>>> >> secondary should close the NBD server) so it is probably okay to ignore
>>> >> the removal for now.  Inserting the filter dynamically is probably
>>> >> needed though.
>> > 
>> > Or maybe just enabling/disabling?
> Hmm, after failover, the secondary qemu should become primary qemu, but we don't
> know the nbd server's IP/port when we execute the secondary qemu. So we need
> to inserting nbd client dynamically after failover.

True, but secondary->primary switch is already not supported in v3.

Kevin/Stefan, is there a design document somewhere that covers at least
static filters?

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-24  7:47                                           ` Paolo Bonzini
@ 2015-04-24  7:55                                             ` Wen Congyang
  2015-04-24  8:58                                               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-04-24  7:55 UTC (permalink / raw)
  To: Paolo Bonzini, Fam Zheng
  Cc: Kevin Wolf, Yang Hongyang, Lai Jiangshan, qemu block, armbru,
	jcody, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	qemu devel, Gonglei, Stefan Hajnoczi, Max Reitz, zhanghailiang

On 04/24/2015 03:47 PM, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 04:16, Wen Congyang wrote:
>>>>>> I think the primary shouldn't do any I/O after failover (and the
>>>>>> secondary should close the NBD server) so it is probably okay to ignore
>>>>>> the removal for now.  Inserting the filter dynamically is probably
>>>>>> needed though.
>>>>
>>>> Or maybe just enabling/disabling?
>> Hmm, after failover, the secondary qemu should become primary qemu, but we don't
>> know the nbd server's IP/port when we execute the secondary qemu. So we need
>> to inserting nbd client dynamically after failover.
> 
> True, but secondary->primary switch is already not supported in v3.

Yes, we should consider it, and support it more easily later.

If we can add a filter dynamically, we can add a filter that's file is nbd
dynamically after secondary qemu's nbd server is ready. In this case, I think
there is no need to touch nbd client.

Thanks
Wen Congyang

> 
> Kevin/Stefan, is there a design document somewhere that covers at least
> static filters?
> 
> Paolo
> .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-24  7:55                                             ` Wen Congyang
@ 2015-04-24  8:58                                               ` Dr. David Alan Gilbert
  2015-04-24  9:04                                                 ` Paolo Bonzini
  0 siblings, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-24  8:58 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Paolo Bonzini, Yang Hongyang, zhanghailiang

* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 04/24/2015 03:47 PM, Paolo Bonzini wrote:
> > 
> > 
> > On 24/04/2015 04:16, Wen Congyang wrote:
> >>>>>> I think the primary shouldn't do any I/O after failover (and the
> >>>>>> secondary should close the NBD server) so it is probably okay to ignore
> >>>>>> the removal for now.  Inserting the filter dynamically is probably
> >>>>>> needed though.
> >>>>
> >>>> Or maybe just enabling/disabling?
> >> Hmm, after failover, the secondary qemu should become primary qemu, but we don't
> >> know the nbd server's IP/port when we execute the secondary qemu. So we need
> >> to inserting nbd client dynamically after failover.
> > 
> > True, but secondary->primary switch is already not supported in v3.
> 
> Yes, we should consider it, and support it more easily later.
> 
> If we can add a filter dynamically, we can add a filter that's file is nbd
> dynamically after secondary qemu's nbd server is ready. In this case, I think
> there is no need to touch nbd client.

Yes, I think maybe the harder part is getting a copy of the current disk
contents to the new secondary while the new primary is still running.

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Kevin/Stefan, is there a design document somewhere that covers at least
> > static filters?
> > 
> > Paolo
> > .
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-24  8:58                                               ` Dr. David Alan Gilbert
@ 2015-04-24  9:04                                                 ` Paolo Bonzini
  2015-04-24  9:38                                                   ` Wen Congyang
  0 siblings, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-24  9:04 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, zhanghailiang



On 24/04/2015 10:58, Dr. David Alan Gilbert wrote:
>> > If we can add a filter dynamically, we can add a filter that's file is nbd
>> > dynamically after secondary qemu's nbd server is ready. In this case, I think
>> > there is no need to touch nbd client.
> Yes, I think maybe the harder part is getting a copy of the current disk
> contents to the new secondary while the new primary is still running.

That can be done with drive-mirror.  But I think it's too early for that.

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-24  9:38                                                   ` Wen Congyang
@ 2015-04-24  9:36                                                     ` Paolo Bonzini
  2015-04-24  9:53                                                       ` Wen Congyang
  2015-04-27  9:37                                                       ` Stefan Hajnoczi
  0 siblings, 2 replies; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-24  9:36 UTC (permalink / raw)
  To: Wen Congyang, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, zhanghailiang



On 24/04/2015 11:38, Wen Congyang wrote:
>> > 
>> > That can be done with drive-mirror.  But I think it's too early for that.
> Do you mean use drive-mirror instead of quorum?

Only before starting up a new secondary.  Basically you do a migration
with non-shared storage, and then start the secondary in colo mode.

But it's only for the failover case.  Quorum (or a new block/colo.c
driver or filter) is fine for normal colo operation.

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-24  9:04                                                 ` Paolo Bonzini
@ 2015-04-24  9:38                                                   ` Wen Congyang
  2015-04-24  9:36                                                     ` Paolo Bonzini
  0 siblings, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-04-24  9:38 UTC (permalink / raw)
  To: Paolo Bonzini, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, zhanghailiang

On 04/24/2015 05:04 PM, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 10:58, Dr. David Alan Gilbert wrote:
>>>> If we can add a filter dynamically, we can add a filter that's file is nbd
>>>> dynamically after secondary qemu's nbd server is ready. In this case, I think
>>>> there is no need to touch nbd client.
>> Yes, I think maybe the harder part is getting a copy of the current disk
>> contents to the new secondary while the new primary is still running.
> 
> That can be done with drive-mirror.  But I think it's too early for that.

Do you mean use drive-mirror instead of quorum?

Hmm, I don't find the final design for primary QEMU...

Thanks
Wen Congyang

> 
> Paolo
> .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-24  9:36                                                     ` Paolo Bonzini
@ 2015-04-24  9:53                                                       ` Wen Congyang
  2015-04-24 10:03                                                         ` Paolo Bonzini
  2015-04-27  9:37                                                       ` Stefan Hajnoczi
  1 sibling, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-04-24  9:53 UTC (permalink / raw)
  To: Paolo Bonzini, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, zhanghailiang

On 04/24/2015 05:36 PM, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 11:38, Wen Congyang wrote:
>>>>
>>>> That can be done with drive-mirror.  But I think it's too early for that.
>> Do you mean use drive-mirror instead of quorum?
> 
> Only before starting up a new secondary.  Basically you do a migration
> with non-shared storage, and then start the secondary in colo mode.
> 
> But it's only for the failover case.  Quorum (or a new block/colo.c
> driver or filter) is fine for normal colo operation.

Is nbd+colo needed to connect the NBD server later?

Thanks
Wen Congyang

> 
> Paolo
> .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-24  9:53                                                       ` Wen Congyang
@ 2015-04-24 10:03                                                         ` Paolo Bonzini
  0 siblings, 0 replies; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-24 10:03 UTC (permalink / raw)
  To: Wen Congyang, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, Yang Hongyang, zhanghailiang



On 24/04/2015 11:53, Wen Congyang wrote:
>> > Only before starting up a new secondary.  Basically you do a migration
>> > with non-shared storage, and then start the secondary in colo mode.
>> > 
>> > But it's only for the failover case.  Quorum (or a new block/colo.c
>> > driver or filter) is fine for normal colo operation.
> Is nbd+colo needed to connect the NBD server later?

Elsewhere in the thread I proposed a new flag BDRV_O_NO_CONNECT and a
new BlockDriver function pointer bdrv_connect.

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-24  9:36                                                     ` Paolo Bonzini
  2015-04-24  9:53                                                       ` Wen Congyang
@ 2015-04-27  9:37                                                       ` Stefan Hajnoczi
  2015-04-29  8:29                                                         ` Paolo Bonzini
  2015-05-05 15:23                                                         ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 74+ messages in thread
From: Stefan Hajnoczi @ 2015-04-27  9:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Yang Hongyang, Fam Zheng, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert, qemu devel,
	zhanghailiang, Gonglei, Max Reitz, Lai Jiangshan

[-- Attachment #1: Type: text/plain, Size: 912 bytes --]

On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 11:38, Wen Congyang wrote:
> >> > 
> >> > That can be done with drive-mirror.  But I think it's too early for that.
> > Do you mean use drive-mirror instead of quorum?
> 
> Only before starting up a new secondary.  Basically you do a migration
> with non-shared storage, and then start the secondary in colo mode.
> 
> But it's only for the failover case.  Quorum (or a new block/colo.c
> driver or filter) is fine for normal colo operation.

Perhaps this patch series should mirror the Secondary's disk to a Backup
Secondary so that the system can be protected very quickly after
failover.

I think anyone serious about fault tolerance would deploy a Backup
Secondary, otherwise the system cannot survive two failures unless a
human administrator is lucky/fast enough to set up a new Secondary.

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-27  9:37                                                       ` Stefan Hajnoczi
@ 2015-04-29  8:29                                                         ` Paolo Bonzini
  2015-04-29  8:37                                                           ` Gonglei
  2015-05-05 15:23                                                         ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 74+ messages in thread
From: Paolo Bonzini @ 2015-04-29  8:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Yang Hongyang, Fam Zheng, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert, qemu devel,
	zhanghailiang, Gonglei, Max Reitz, Lai Jiangshan



On 27/04/2015 11:37, Stefan Hajnoczi wrote:
>>> But it's only for the failover case.  Quorum (or a new 
>>> block/colo.c driver or filter) is fine for normal colo 
>>> operation.
> Perhaps this patch series should mirror the Secondary's disk to a 
> Backup Secondary so that the system can be protected very quickly 
> after failover.
> 
> I think anyone serious about fault tolerance would deploy a Backup
>  Secondary, otherwise the system cannot survive two failures
> unless a human administrator is lucky/fast enough to set up a new 
> Secondary.

Let's do one thing at a time.  Otherwise nothing of this is going to
be ever completed...

Paolo

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-29  8:29                                                         ` Paolo Bonzini
@ 2015-04-29  8:37                                                           ` Gonglei
  2015-04-30 14:56                                                             ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: Gonglei @ 2015-04-29  8:37 UTC (permalink / raw)
  To: Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Yang Hongyang, Fam Zheng, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert, qemu devel,
	zhanghailiang, Max Reitz, Lai Jiangshan

On 2015/4/29 16:29, Paolo Bonzini wrote:
> 
> 
> On 27/04/2015 11:37, Stefan Hajnoczi wrote:
>>>> But it's only for the failover case.  Quorum (or a new 
>>>> block/colo.c driver or filter) is fine for normal colo 
>>>> operation.
>> Perhaps this patch series should mirror the Secondary's disk to a 
>> Backup Secondary so that the system can be protected very quickly 
>> after failover.
>>
>> I think anyone serious about fault tolerance would deploy a Backup
>>  Secondary, otherwise the system cannot survive two failures
>> unless a human administrator is lucky/fast enough to set up a new 
>> Secondary.
> 
> Let's do one thing at a time.  Otherwise nothing of this is going to
> be ever completed...
> 
Yes, and the continuous backup feature is on our TODO list. We hope
this series (including basic functions and  COLO framework) can be
upstream first.

Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-29  8:37                                                           ` Gonglei
@ 2015-04-30 14:56                                                             ` Stefan Hajnoczi
  0 siblings, 0 replies; 74+ messages in thread
From: Stefan Hajnoczi @ 2015-04-30 14:56 UTC (permalink / raw)
  To: Gonglei
  Cc: Kevin Wolf, Yang Hongyang, Fam Zheng, Lai Jiangshan, armbru,
	jcody, Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	qemu devel, Max Reitz, Paolo Bonzini, qemu block, zhanghailiang

[-- Attachment #1: Type: text/plain, Size: 1067 bytes --]

On Wed, Apr 29, 2015 at 04:37:49PM +0800, Gonglei wrote:
> On 2015/4/29 16:29, Paolo Bonzini wrote:
> > 
> > 
> > On 27/04/2015 11:37, Stefan Hajnoczi wrote:
> >>>> But it's only for the failover case.  Quorum (or a new 
> >>>> block/colo.c driver or filter) is fine for normal colo 
> >>>> operation.
> >> Perhaps this patch series should mirror the Secondary's disk to a 
> >> Backup Secondary so that the system can be protected very quickly 
> >> after failover.
> >>
> >> I think anyone serious about fault tolerance would deploy a Backup
> >>  Secondary, otherwise the system cannot survive two failures
> >> unless a human administrator is lucky/fast enough to set up a new 
> >> Secondary.
> > 
> > Let's do one thing at a time.  Otherwise nothing of this is going to
> > be ever completed...
> > 
> Yes, and the continuous backup feature is on our TODO list. We hope
> this series (including basic functions and  COLO framework) can be
> upstream first.

That's fine, I just wanted to make sure you have the issue in mind.

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap
  2015-04-03 11:05   ` Paolo Bonzini
@ 2015-05-01 16:47     ` John Snow
  2015-05-07  2:20       ` Wen Congyang
  0 siblings, 1 reply; 74+ messages in thread
From: John Snow @ 2015-05-01 16:47 UTC (permalink / raw)
  To: Paolo Bonzini, Wen Congyang, qemu devel, Fam Zheng, Max Reitz
  Cc: Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang



On 04/03/2015 07:05 AM, Paolo Bonzini wrote:
>
>
> On 03/04/2015 12:01, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>   include/qemu/hbitmap.h |  8 ++++++++
>>   tests/test-hbitmap.c   | 39 +++++++++++++++++++++++++++++++++++++++
>>   util/hbitmap.c         | 16 ++++++++++++++++
>>   3 files changed, 63 insertions(+)
>>
>> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
>> index 550d7ce..95a55e4 100644
>> --- a/include/qemu/hbitmap.h
>> +++ b/include/qemu/hbitmap.h
>> @@ -109,6 +109,14 @@ void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count);
>>   void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count);
>>
>>   /**
>> + * hbitmap_reset_all:
>> + * @hb: HBitmap to operate on.
>> + *
>> + * Reset all bits in an HBitmap.
>> + */
>> +void hbitmap_reset_all(HBitmap *hb);
>> +
>> +/**
>>    * hbitmap_get:
>>    * @hb: HBitmap to operate on.
>>    * @item: Bit to query (0-based).
>> diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
>> index 8c902f2..1f0078a 100644
>> --- a/tests/test-hbitmap.c
>> +++ b/tests/test-hbitmap.c
>> @@ -11,6 +11,7 @@
>>
>>   #include <glib.h>
>>   #include <stdarg.h>
>> +#include <string.h>
>>   #include "qemu/hbitmap.h"
>>
>>   #define LOG_BITS_PER_LONG          (BITS_PER_LONG == 32 ? 5 : 6)
>> @@ -143,6 +144,23 @@ static void hbitmap_test_reset(TestHBitmapData *data,
>>       }
>>   }
>>
>> +static void hbitmap_test_reset_all(TestHBitmapData *data)
>> +{
>> +    size_t n;
>> +
>> +    hbitmap_reset_all(data->hb);
>> +
>> +    n = (data->size + BITS_PER_LONG - 1) / BITS_PER_LONG;
>> +    if (n == 0) {
>> +        n = 1;
>> +    }
>> +    memset(data->bits, 0, n * sizeof(unsigned long));
>> +
>> +    if (data->granularity == 0) {
>> +        hbitmap_test_check(data, 0);
>> +    }
>> +}
>> +
>>   static void hbitmap_test_check_get(TestHBitmapData *data)
>>   {
>>       uint64_t count = 0;
>> @@ -323,6 +341,26 @@ static void test_hbitmap_reset(TestHBitmapData *data,
>>       hbitmap_test_set(data, L3 / 2, L3);
>>   }
>>
>> +static void test_hbitmap_reset_all(TestHBitmapData *data,
>> +                                   const void *unused)
>> +{
>> +    hbitmap_test_init(data, L3 * 2, 0);
>> +    hbitmap_test_set(data, L1 - 1, L1 + 2);
>> +    hbitmap_test_reset_all(data);
>> +    hbitmap_test_set(data, 0, L1 * 3);
>> +    hbitmap_test_reset_all(data);
>> +    hbitmap_test_set(data, L2, L1);
>> +    hbitmap_test_reset_all(data);
>> +    hbitmap_test_set(data, L2, L3 - L2 + 1);
>> +    hbitmap_test_reset_all(data);
>> +    hbitmap_test_set(data, L3 - 1, 3);
>> +    hbitmap_test_reset_all(data);
>> +    hbitmap_test_set(data, 0, L3 * 2);
>> +    hbitmap_test_reset_all(data);
>> +    hbitmap_test_set(data, L3 / 2, L3);
>> +    hbitmap_test_reset_all(data);
>> +}
>> +
>>   static void test_hbitmap_granularity(TestHBitmapData *data,
>>                                        const void *unused)
>>   {
>> @@ -394,6 +432,7 @@ int main(int argc, char **argv)
>>       hbitmap_test_add("/hbitmap/set/overlap", test_hbitmap_set_overlap);
>>       hbitmap_test_add("/hbitmap/reset/empty", test_hbitmap_reset_empty);
>>       hbitmap_test_add("/hbitmap/reset/general", test_hbitmap_reset);
>> +    hbitmap_test_add("/hbitmap/reset/all", test_hbitmap_reset_all);
>>       hbitmap_test_add("/hbitmap/granularity", test_hbitmap_granularity);
>>       g_test_run();
>>
>> diff --git a/util/hbitmap.c b/util/hbitmap.c
>> index ab13971..acce93c 100644
>> --- a/util/hbitmap.c
>> +++ b/util/hbitmap.c
>> @@ -353,6 +353,22 @@ void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count)
>>       hb_reset_between(hb, HBITMAP_LEVELS - 1, start, last);
>>   }
>>
>> +void hbitmap_reset_all(HBitmap *hb)
>> +{
>> +    uint64_t size = hb->size;
>> +    unsigned int i;
>> +
>> +    /* Same as hbitmap_alloc() except memset() */
>> +    for (i = HBITMAP_LEVELS; --i >= 1; ) {
>> +        size = MAX((size + BITS_PER_LONG - 1) >> BITS_PER_LEVEL, 1);
>> +        memset(hb->levels[i], 0, size * sizeof(unsigned long));
>> +    }
>> +

For what it's worth, I recently added in a hb->sizes[i] cache to store 
the size of each array so you don't have to recompute this all the time.

>> +    assert(size == 1);
>> +    hb->levels[0][0] = 1UL << (BITS_PER_LONG - 1);
>> +    hb->count = 0;
>> +}
>> +
>>   bool hbitmap_get(const HBitmap *hb, uint64_t item)
>>   {
>>       /* Compute position and bit in the last layer.  */
>>
>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-04-27  9:37                                                       ` Stefan Hajnoczi
  2015-04-29  8:29                                                         ` Paolo Bonzini
@ 2015-05-05 15:23                                                         ` Dr. David Alan Gilbert
  2015-05-06  2:26                                                           ` Dong, Eddie
  2015-05-08  8:42                                                           ` Stefan Hajnoczi
  1 sibling, 2 replies; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2015-05-05 15:23 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Paolo Bonzini, Yang Hongyang, zhanghailiang

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 24/04/2015 11:38, Wen Congyang wrote:
> > >> > 
> > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > Do you mean use drive-mirror instead of quorum?
> > 
> > Only before starting up a new secondary.  Basically you do a migration
> > with non-shared storage, and then start the secondary in colo mode.
> > 
> > But it's only for the failover case.  Quorum (or a new block/colo.c
> > driver or filter) is fine for normal colo operation.
> 
> Perhaps this patch series should mirror the Secondary's disk to a Backup
> Secondary so that the system can be protected very quickly after
> failover.
> 
> I think anyone serious about fault tolerance would deploy a Backup
> Secondary, otherwise the system cannot survive two failures unless a
> human administrator is lucky/fast enough to set up a new Secondary.

I'd assumed that a higher level management layer would do the allocation
of a new secondary after the first failover, so no human need be involved.

Dave

> Stefan


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-05-05 15:23                                                         ` Dr. David Alan Gilbert
@ 2015-05-06  2:26                                                           ` Dong, Eddie
  2015-05-06  2:49                                                             ` Fam Zheng
  2015-05-08  8:42                                                           ` Stefan Hajnoczi
  1 sibling, 1 reply; 74+ messages in thread
From: Dong, Eddie @ 2015-05-06  2:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang, Yunhong, Dong, Eddie, qemu devel, Max Reitz, Gonglei,
	Paolo Bonzini, Yang Hongyang, zhanghailiang



> -----Original Message-----
> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> Sent: Tuesday, May 05, 2015 11:24 PM
> To: Stefan Hajnoczi
> Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu
> block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang
> Hongyang; zhanghailiang; armbru@redhat.com; jcody@redhat.com
> Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description
> 
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > >
> > >
> > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > >> >
> > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > Do you mean use drive-mirror instead of quorum?
> > >
> > > Only before starting up a new secondary.  Basically you do a
> > > migration with non-shared storage, and then start the secondary in colo
> mode.
> > >
> > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > driver or filter) is fine for normal colo operation.
> >
> > Perhaps this patch series should mirror the Secondary's disk to a
> > Backup Secondary so that the system can be protected very quickly
> > after failover.
> >
> > I think anyone serious about fault tolerance would deploy a Backup
> > Secondary, otherwise the system cannot survive two failures unless a
> > human administrator is lucky/fast enough to set up a new Secondary.
> 
> I'd assumed that a higher level management layer would do the allocation of a
> new secondary after the first failover, so no human need be involved.
> 

I agree. The cloud OS, such as open stack, will have the capability to handle the case, together with certain API in VMM side for this (libvirt?). 

Thx Eddie

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-05-06  2:26                                                           ` Dong, Eddie
@ 2015-05-06  2:49                                                             ` Fam Zheng
  0 siblings, 0 replies; 74+ messages in thread
From: Fam Zheng @ 2015-05-06  2:49 UTC (permalink / raw)
  To: Dong, Eddie
  Cc: Kevin Wolf, Yang Hongyang, Lai Jiangshan, armbru, jcody, Jiang,
	Yunhong, Dr. David Alan Gilbert, qemu devel, Gonglei, Max Reitz,
	Stefan Hajnoczi, Paolo Bonzini, qemu block, zhanghailiang

On Wed, 05/06 02:26, Dong, Eddie wrote:
> 
> 
> > -----Original Message-----
> > From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> > Sent: Tuesday, May 05, 2015 11:24 PM
> > To: Stefan Hajnoczi
> > Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu
> > block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang
> > Hongyang; zhanghailiang; armbru@redhat.com; jcody@redhat.com
> > Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description
> > 
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > >
> > > >
> > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > >> >
> > > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > > Do you mean use drive-mirror instead of quorum?
> > > >
> > > > Only before starting up a new secondary.  Basically you do a
> > > > migration with non-shared storage, and then start the secondary in colo
> > mode.
> > > >
> > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > driver or filter) is fine for normal colo operation.
> > >
> > > Perhaps this patch series should mirror the Secondary's disk to a
> > > Backup Secondary so that the system can be protected very quickly
> > > after failover.
> > >
> > > I think anyone serious about fault tolerance would deploy a Backup
> > > Secondary, otherwise the system cannot survive two failures unless a
> > > human administrator is lucky/fast enough to set up a new Secondary.
> > 
> > I'd assumed that a higher level management layer would do the allocation of a
> > new secondary after the first failover, so no human need be involved.
> > 
> 
> I agree. The cloud OS, such as open stack, will have the capability to handle
> the case, together with certain API in VMM side for this (libvirt?). 

The question here is the QMP API to switch secondary mode to primary mode is
not mentioned in this series.  I think that interface matters for this series.

Fam

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap
  2015-05-01 16:47     ` [Qemu-devel] [Qemu-block] " John Snow
@ 2015-05-07  2:20       ` Wen Congyang
  2015-05-07 18:32         ` John Snow
  0 siblings, 1 reply; 74+ messages in thread
From: Wen Congyang @ 2015-05-07  2:20 UTC (permalink / raw)
  To: John Snow, Paolo Bonzini, qemu devel, Fam Zheng, Max Reitz
  Cc: Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

On 05/02/2015 12:47 AM, John Snow wrote:
> 
> 
> On 04/03/2015 07:05 AM, Paolo Bonzini wrote:
>>
>>
>> On 03/04/2015 12:01, Wen Congyang wrote:
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>>> ---
>>>   include/qemu/hbitmap.h |  8 ++++++++
>>>   tests/test-hbitmap.c   | 39 +++++++++++++++++++++++++++++++++++++++
>>>   util/hbitmap.c         | 16 ++++++++++++++++
>>>   3 files changed, 63 insertions(+)
>>>
>>> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
>>> index 550d7ce..95a55e4 100644
>>> --- a/include/qemu/hbitmap.h
>>> +++ b/include/qemu/hbitmap.h
>>> @@ -109,6 +109,14 @@ void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count);
>>>   void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count);
>>>
>>>   /**
>>> + * hbitmap_reset_all:
>>> + * @hb: HBitmap to operate on.
>>> + *
>>> + * Reset all bits in an HBitmap.
>>> + */
>>> +void hbitmap_reset_all(HBitmap *hb);
>>> +
>>> +/**
>>>    * hbitmap_get:
>>>    * @hb: HBitmap to operate on.
>>>    * @item: Bit to query (0-based).
>>> diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
>>> index 8c902f2..1f0078a 100644
>>> --- a/tests/test-hbitmap.c
>>> +++ b/tests/test-hbitmap.c
>>> @@ -11,6 +11,7 @@
>>>
>>>   #include <glib.h>
>>>   #include <stdarg.h>
>>> +#include <string.h>
>>>   #include "qemu/hbitmap.h"
>>>
>>>   #define LOG_BITS_PER_LONG          (BITS_PER_LONG == 32 ? 5 : 6)
>>> @@ -143,6 +144,23 @@ static void hbitmap_test_reset(TestHBitmapData *data,
>>>       }
>>>   }
>>>
>>> +static void hbitmap_test_reset_all(TestHBitmapData *data)
>>> +{
>>> +    size_t n;
>>> +
>>> +    hbitmap_reset_all(data->hb);
>>> +
>>> +    n = (data->size + BITS_PER_LONG - 1) / BITS_PER_LONG;
>>> +    if (n == 0) {
>>> +        n = 1;
>>> +    }
>>> +    memset(data->bits, 0, n * sizeof(unsigned long));
>>> +
>>> +    if (data->granularity == 0) {
>>> +        hbitmap_test_check(data, 0);
>>> +    }
>>> +}
>>> +
>>>   static void hbitmap_test_check_get(TestHBitmapData *data)
>>>   {
>>>       uint64_t count = 0;
>>> @@ -323,6 +341,26 @@ static void test_hbitmap_reset(TestHBitmapData *data,
>>>       hbitmap_test_set(data, L3 / 2, L3);
>>>   }
>>>
>>> +static void test_hbitmap_reset_all(TestHBitmapData *data,
>>> +                                   const void *unused)
>>> +{
>>> +    hbitmap_test_init(data, L3 * 2, 0);
>>> +    hbitmap_test_set(data, L1 - 1, L1 + 2);
>>> +    hbitmap_test_reset_all(data);
>>> +    hbitmap_test_set(data, 0, L1 * 3);
>>> +    hbitmap_test_reset_all(data);
>>> +    hbitmap_test_set(data, L2, L1);
>>> +    hbitmap_test_reset_all(data);
>>> +    hbitmap_test_set(data, L2, L3 - L2 + 1);
>>> +    hbitmap_test_reset_all(data);
>>> +    hbitmap_test_set(data, L3 - 1, 3);
>>> +    hbitmap_test_reset_all(data);
>>> +    hbitmap_test_set(data, 0, L3 * 2);
>>> +    hbitmap_test_reset_all(data);
>>> +    hbitmap_test_set(data, L3 / 2, L3);
>>> +    hbitmap_test_reset_all(data);
>>> +}
>>> +
>>>   static void test_hbitmap_granularity(TestHBitmapData *data,
>>>                                        const void *unused)
>>>   {
>>> @@ -394,6 +432,7 @@ int main(int argc, char **argv)
>>>       hbitmap_test_add("/hbitmap/set/overlap", test_hbitmap_set_overlap);
>>>       hbitmap_test_add("/hbitmap/reset/empty", test_hbitmap_reset_empty);
>>>       hbitmap_test_add("/hbitmap/reset/general", test_hbitmap_reset);
>>> +    hbitmap_test_add("/hbitmap/reset/all", test_hbitmap_reset_all);
>>>       hbitmap_test_add("/hbitmap/granularity", test_hbitmap_granularity);
>>>       g_test_run();
>>>
>>> diff --git a/util/hbitmap.c b/util/hbitmap.c
>>> index ab13971..acce93c 100644
>>> --- a/util/hbitmap.c
>>> +++ b/util/hbitmap.c
>>> @@ -353,6 +353,22 @@ void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count)
>>>       hb_reset_between(hb, HBITMAP_LEVELS - 1, start, last);
>>>   }
>>>
>>> +void hbitmap_reset_all(HBitmap *hb)
>>> +{
>>> +    uint64_t size = hb->size;
>>> +    unsigned int i;
>>> +
>>> +    /* Same as hbitmap_alloc() except memset() */
>>> +    for (i = HBITMAP_LEVELS; --i >= 1; ) {
>>> +        size = MAX((size + BITS_PER_LONG - 1) >> BITS_PER_LEVEL, 1);
>>> +        memset(hb->levels[i], 0, size * sizeof(unsigned long));
>>> +    }
>>> +
> 
> For what it's worth, I recently added in a hb->sizes[i] cache to store the size of each array so you don't have to recompute this all the time.

Yes, will fix it in the next version.

Thanks
Wen Congyang

> 
>>> +    assert(size == 1);
>>> +    hb->levels[0][0] = 1UL << (BITS_PER_LONG - 1);
>>> +    hb->count = 0;
>>> +}
>>> +
>>>   bool hbitmap_get(const HBitmap *hb, uint64_t item)
>>>   {
>>>       /* Compute position and bit in the last layer.  */
>>>
>>
>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>>
> .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap
  2015-05-07  2:20       ` Wen Congyang
@ 2015-05-07 18:32         ` John Snow
  2015-05-08  0:59           ` Wen Congyang
  0 siblings, 1 reply; 74+ messages in thread
From: John Snow @ 2015-05-07 18:32 UTC (permalink / raw)
  To: Wen Congyang, Paolo Bonzini, qemu devel, Fam Zheng, Max Reitz
  Cc: Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang



On 05/06/2015 10:20 PM, Wen Congyang wrote:
> On 05/02/2015 12:47 AM, John Snow wrote:
>>
>>
>> On 04/03/2015 07:05 AM, Paolo Bonzini wrote:
>>>
>>>
>>> On 03/04/2015 12:01, Wen Congyang wrote:
>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>>>> ---
>>>>   include/qemu/hbitmap.h |  8 ++++++++
>>>>   tests/test-hbitmap.c   | 39 +++++++++++++++++++++++++++++++++++++++
>>>>   util/hbitmap.c         | 16 ++++++++++++++++
>>>>   3 files changed, 63 insertions(+)
>>>>
>>>> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
>>>> index 550d7ce..95a55e4 100644
>>>> --- a/include/qemu/hbitmap.h
>>>> +++ b/include/qemu/hbitmap.h
>>>> @@ -109,6 +109,14 @@ void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count);
>>>>   void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count);
>>>>
>>>>   /**
>>>> + * hbitmap_reset_all:
>>>> + * @hb: HBitmap to operate on.
>>>> + *
>>>> + * Reset all bits in an HBitmap.
>>>> + */
>>>> +void hbitmap_reset_all(HBitmap *hb);
>>>> +
>>>> +/**
>>>>    * hbitmap_get:
>>>>    * @hb: HBitmap to operate on.
>>>>    * @item: Bit to query (0-based).
>>>> diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
>>>> index 8c902f2..1f0078a 100644
>>>> --- a/tests/test-hbitmap.c
>>>> +++ b/tests/test-hbitmap.c
>>>> @@ -11,6 +11,7 @@
>>>>
>>>>   #include <glib.h>
>>>>   #include <stdarg.h>
>>>> +#include <string.h>
>>>>   #include "qemu/hbitmap.h"
>>>>
>>>>   #define LOG_BITS_PER_LONG          (BITS_PER_LONG == 32 ? 5 : 6)
>>>> @@ -143,6 +144,23 @@ static void hbitmap_test_reset(TestHBitmapData *data,
>>>>       }
>>>>   }
>>>>
>>>> +static void hbitmap_test_reset_all(TestHBitmapData *data)
>>>> +{
>>>> +    size_t n;
>>>> +
>>>> +    hbitmap_reset_all(data->hb);
>>>> +
>>>> +    n = (data->size + BITS_PER_LONG - 1) / BITS_PER_LONG;
>>>> +    if (n == 0) {
>>>> +        n = 1;
>>>> +    }
>>>> +    memset(data->bits, 0, n * sizeof(unsigned long));
>>>> +
>>>> +    if (data->granularity == 0) {
>>>> +        hbitmap_test_check(data, 0);
>>>> +    }
>>>> +}
>>>> +
>>>>   static void hbitmap_test_check_get(TestHBitmapData *data)
>>>>   {
>>>>       uint64_t count = 0;
>>>> @@ -323,6 +341,26 @@ static void test_hbitmap_reset(TestHBitmapData *data,
>>>>       hbitmap_test_set(data, L3 / 2, L3);
>>>>   }
>>>>
>>>> +static void test_hbitmap_reset_all(TestHBitmapData *data,
>>>> +                                   const void *unused)
>>>> +{
>>>> +    hbitmap_test_init(data, L3 * 2, 0);
>>>> +    hbitmap_test_set(data, L1 - 1, L1 + 2);
>>>> +    hbitmap_test_reset_all(data);
>>>> +    hbitmap_test_set(data, 0, L1 * 3);
>>>> +    hbitmap_test_reset_all(data);
>>>> +    hbitmap_test_set(data, L2, L1);
>>>> +    hbitmap_test_reset_all(data);
>>>> +    hbitmap_test_set(data, L2, L3 - L2 + 1);
>>>> +    hbitmap_test_reset_all(data);
>>>> +    hbitmap_test_set(data, L3 - 1, 3);
>>>> +    hbitmap_test_reset_all(data);
>>>> +    hbitmap_test_set(data, 0, L3 * 2);
>>>> +    hbitmap_test_reset_all(data);
>>>> +    hbitmap_test_set(data, L3 / 2, L3);
>>>> +    hbitmap_test_reset_all(data);
>>>> +}
>>>> +
>>>>   static void test_hbitmap_granularity(TestHBitmapData *data,
>>>>                                        const void *unused)
>>>>   {
>>>> @@ -394,6 +432,7 @@ int main(int argc, char **argv)
>>>>       hbitmap_test_add("/hbitmap/set/overlap", test_hbitmap_set_overlap);
>>>>       hbitmap_test_add("/hbitmap/reset/empty", test_hbitmap_reset_empty);
>>>>       hbitmap_test_add("/hbitmap/reset/general", test_hbitmap_reset);
>>>> +    hbitmap_test_add("/hbitmap/reset/all", test_hbitmap_reset_all);
>>>>       hbitmap_test_add("/hbitmap/granularity", test_hbitmap_granularity);
>>>>       g_test_run();
>>>>
>>>> diff --git a/util/hbitmap.c b/util/hbitmap.c
>>>> index ab13971..acce93c 100644
>>>> --- a/util/hbitmap.c
>>>> +++ b/util/hbitmap.c
>>>> @@ -353,6 +353,22 @@ void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count)
>>>>       hb_reset_between(hb, HBITMAP_LEVELS - 1, start, last);
>>>>   }
>>>>
>>>> +void hbitmap_reset_all(HBitmap *hb)
>>>> +{
>>>> +    uint64_t size = hb->size;
>>>> +    unsigned int i;
>>>> +
>>>> +    /* Same as hbitmap_alloc() except memset() */
>>>> +    for (i = HBITMAP_LEVELS; --i >= 1; ) {
>>>> +        size = MAX((size + BITS_PER_LONG - 1) >> BITS_PER_LEVEL, 1);
>>>> +        memset(hb->levels[i], 0, size * sizeof(unsigned long));
>>>> +    }
>>>> +
>>
>> For what it's worth, I recently added in a hb->sizes[i] cache to store the size of each array so you don't have to recompute this all the time.
> 
> Yes, will fix it in the next version.
> 
> Thanks
> Wen Congyang
> 

Since the reset stuff is useful all by itself, you can send that patch
by itself, CC me, and I'll review it.

You can update the existing call in block.c:

bdrv_clear_dirty_bitmap() {
    hbitmap_reset(bitmap->bitmap, 0, bitmap->size);
}

to using your faster hbitmap_reset_all call.

Thanks,

--js

>>
>>>> +    assert(size == 1);
>>>> +    hb->levels[0][0] = 1UL << (BITS_PER_LONG - 1);
>>>> +    hb->count = 0;
>>>> +}
>>>> +
>>>>   bool hbitmap_get(const HBitmap *hb, uint64_t item)
>>>>   {
>>>>       /* Compute position and bit in the last layer.  */
>>>>
>>>
>>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>>>
>> .
>>
> 
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap
  2015-05-07 18:32         ` John Snow
@ 2015-05-08  0:59           ` Wen Congyang
  0 siblings, 0 replies; 74+ messages in thread
From: Wen Congyang @ 2015-05-08  0:59 UTC (permalink / raw)
  To: John Snow, Paolo Bonzini, qemu devel, Fam Zheng, Max Reitz
  Cc: Lai Jiangshan, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Gonglei, Stefan Hajnoczi, Yang Hongyang,
	zhanghailiang

On 05/08/2015 02:32 AM, John Snow wrote:
> 
> 
> On 05/06/2015 10:20 PM, Wen Congyang wrote:
>> On 05/02/2015 12:47 AM, John Snow wrote:
>>>
>>>
>>> On 04/03/2015 07:05 AM, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 03/04/2015 12:01, Wen Congyang wrote:
>>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>>>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>>>>> ---
>>>>>   include/qemu/hbitmap.h |  8 ++++++++
>>>>>   tests/test-hbitmap.c   | 39 +++++++++++++++++++++++++++++++++++++++
>>>>>   util/hbitmap.c         | 16 ++++++++++++++++
>>>>>   3 files changed, 63 insertions(+)
>>>>>
>>>>> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
>>>>> index 550d7ce..95a55e4 100644
>>>>> --- a/include/qemu/hbitmap.h
>>>>> +++ b/include/qemu/hbitmap.h
>>>>> @@ -109,6 +109,14 @@ void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count);
>>>>>   void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count);
>>>>>
>>>>>   /**
>>>>> + * hbitmap_reset_all:
>>>>> + * @hb: HBitmap to operate on.
>>>>> + *
>>>>> + * Reset all bits in an HBitmap.
>>>>> + */
>>>>> +void hbitmap_reset_all(HBitmap *hb);
>>>>> +
>>>>> +/**
>>>>>    * hbitmap_get:
>>>>>    * @hb: HBitmap to operate on.
>>>>>    * @item: Bit to query (0-based).
>>>>> diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
>>>>> index 8c902f2..1f0078a 100644
>>>>> --- a/tests/test-hbitmap.c
>>>>> +++ b/tests/test-hbitmap.c
>>>>> @@ -11,6 +11,7 @@
>>>>>
>>>>>   #include <glib.h>
>>>>>   #include <stdarg.h>
>>>>> +#include <string.h>
>>>>>   #include "qemu/hbitmap.h"
>>>>>
>>>>>   #define LOG_BITS_PER_LONG          (BITS_PER_LONG == 32 ? 5 : 6)
>>>>> @@ -143,6 +144,23 @@ static void hbitmap_test_reset(TestHBitmapData *data,
>>>>>       }
>>>>>   }
>>>>>
>>>>> +static void hbitmap_test_reset_all(TestHBitmapData *data)
>>>>> +{
>>>>> +    size_t n;
>>>>> +
>>>>> +    hbitmap_reset_all(data->hb);
>>>>> +
>>>>> +    n = (data->size + BITS_PER_LONG - 1) / BITS_PER_LONG;
>>>>> +    if (n == 0) {
>>>>> +        n = 1;
>>>>> +    }
>>>>> +    memset(data->bits, 0, n * sizeof(unsigned long));
>>>>> +
>>>>> +    if (data->granularity == 0) {
>>>>> +        hbitmap_test_check(data, 0);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>   static void hbitmap_test_check_get(TestHBitmapData *data)
>>>>>   {
>>>>>       uint64_t count = 0;
>>>>> @@ -323,6 +341,26 @@ static void test_hbitmap_reset(TestHBitmapData *data,
>>>>>       hbitmap_test_set(data, L3 / 2, L3);
>>>>>   }
>>>>>
>>>>> +static void test_hbitmap_reset_all(TestHBitmapData *data,
>>>>> +                                   const void *unused)
>>>>> +{
>>>>> +    hbitmap_test_init(data, L3 * 2, 0);
>>>>> +    hbitmap_test_set(data, L1 - 1, L1 + 2);
>>>>> +    hbitmap_test_reset_all(data);
>>>>> +    hbitmap_test_set(data, 0, L1 * 3);
>>>>> +    hbitmap_test_reset_all(data);
>>>>> +    hbitmap_test_set(data, L2, L1);
>>>>> +    hbitmap_test_reset_all(data);
>>>>> +    hbitmap_test_set(data, L2, L3 - L2 + 1);
>>>>> +    hbitmap_test_reset_all(data);
>>>>> +    hbitmap_test_set(data, L3 - 1, 3);
>>>>> +    hbitmap_test_reset_all(data);
>>>>> +    hbitmap_test_set(data, 0, L3 * 2);
>>>>> +    hbitmap_test_reset_all(data);
>>>>> +    hbitmap_test_set(data, L3 / 2, L3);
>>>>> +    hbitmap_test_reset_all(data);
>>>>> +}
>>>>> +
>>>>>   static void test_hbitmap_granularity(TestHBitmapData *data,
>>>>>                                        const void *unused)
>>>>>   {
>>>>> @@ -394,6 +432,7 @@ int main(int argc, char **argv)
>>>>>       hbitmap_test_add("/hbitmap/set/overlap", test_hbitmap_set_overlap);
>>>>>       hbitmap_test_add("/hbitmap/reset/empty", test_hbitmap_reset_empty);
>>>>>       hbitmap_test_add("/hbitmap/reset/general", test_hbitmap_reset);
>>>>> +    hbitmap_test_add("/hbitmap/reset/all", test_hbitmap_reset_all);
>>>>>       hbitmap_test_add("/hbitmap/granularity", test_hbitmap_granularity);
>>>>>       g_test_run();
>>>>>
>>>>> diff --git a/util/hbitmap.c b/util/hbitmap.c
>>>>> index ab13971..acce93c 100644
>>>>> --- a/util/hbitmap.c
>>>>> +++ b/util/hbitmap.c
>>>>> @@ -353,6 +353,22 @@ void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count)
>>>>>       hb_reset_between(hb, HBITMAP_LEVELS - 1, start, last);
>>>>>   }
>>>>>
>>>>> +void hbitmap_reset_all(HBitmap *hb)
>>>>> +{
>>>>> +    uint64_t size = hb->size;
>>>>> +    unsigned int i;
>>>>> +
>>>>> +    /* Same as hbitmap_alloc() except memset() */
>>>>> +    for (i = HBITMAP_LEVELS; --i >= 1; ) {
>>>>> +        size = MAX((size + BITS_PER_LONG - 1) >> BITS_PER_LEVEL, 1);
>>>>> +        memset(hb->levels[i], 0, size * sizeof(unsigned long));
>>>>> +    }
>>>>> +
>>>
>>> For what it's worth, I recently added in a hb->sizes[i] cache to store the size of each array so you don't have to recompute this all the time.
>>
>> Yes, will fix it in the next version.
>>
>> Thanks
>> Wen Congyang
>>
> 
> Since the reset stuff is useful all by itself, you can send that patch
> by itself, CC me, and I'll review it.
> 
> You can update the existing call in block.c:
> 
> bdrv_clear_dirty_bitmap() {
>     hbitmap_reset(bitmap->bitmap, 0, bitmap->size);
> }
> 
> to using your faster hbitmap_reset_all call.

OK, I will do it.

Thanks
Wen Congyang

> 
> Thanks,
> 
> --js
> 
>>>
>>>>> +    assert(size == 1);
>>>>> +    hb->levels[0][0] = 1UL << (BITS_PER_LONG - 1);
>>>>> +    hb->count = 0;
>>>>> +}
>>>>> +
>>>>>   bool hbitmap_get(const HBitmap *hb, uint64_t item)
>>>>>   {
>>>>>       /* Compute position and bit in the last layer.  */
>>>>>
>>>>
>>>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>
>>> .
>>>
>>
>>
> .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-05-05 15:23                                                         ` Dr. David Alan Gilbert
  2015-05-06  2:26                                                           ` Dong, Eddie
@ 2015-05-08  8:42                                                           ` Stefan Hajnoczi
  2015-05-08  9:34                                                             ` Dr. David Alan Gilbert
  2015-05-08  9:39                                                             ` Kevin Wolf
  1 sibling, 2 replies; 74+ messages in thread
From: Stefan Hajnoczi @ 2015-05-08  8:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Paolo Bonzini, Yang Hongyang, zhanghailiang

[-- Attachment #1: Type: text/plain, Size: 1635 bytes --]

On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > 
> > > 
> > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > >> > 
> > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > Do you mean use drive-mirror instead of quorum?
> > > 
> > > Only before starting up a new secondary.  Basically you do a migration
> > > with non-shared storage, and then start the secondary in colo mode.
> > > 
> > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > driver or filter) is fine for normal colo operation.
> > 
> > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > Secondary so that the system can be protected very quickly after
> > failover.
> > 
> > I think anyone serious about fault tolerance would deploy a Backup
> > Secondary, otherwise the system cannot survive two failures unless a
> > human administrator is lucky/fast enough to set up a new Secondary.
> 
> I'd assumed that a higher level management layer would do the allocation
> of a new secondary after the first failover, so no human need be involved.

That doesn't help, after the first failover is too late even if it's
done by a program.  There should be no window during which the VM is
unprotected.

People who want fault tolerance care about 9s of availability.  The VM
must be protected on the new Primary as soon as the failover occurs,
otherwise this isn't a serious fault tolerance solution.

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-05-08  8:42                                                           ` Stefan Hajnoczi
@ 2015-05-08  9:34                                                             ` Dr. David Alan Gilbert
  2015-05-08  9:39                                                             ` Kevin Wolf
  1 sibling, 0 replies; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2015-05-08  9:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Lai Jiangshan, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Paolo Bonzini, Yang Hongyang, zhanghailiang

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > >> > 
> > > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > > Do you mean use drive-mirror instead of quorum?
> > > > 
> > > > Only before starting up a new secondary.  Basically you do a migration
> > > > with non-shared storage, and then start the secondary in colo mode.
> > > > 
> > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > driver or filter) is fine for normal colo operation.
> > > 
> > > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > > Secondary so that the system can be protected very quickly after
> > > failover.
> > > 
> > > I think anyone serious about fault tolerance would deploy a Backup
> > > Secondary, otherwise the system cannot survive two failures unless a
> > > human administrator is lucky/fast enough to set up a new Secondary.
> > 
> > I'd assumed that a higher level management layer would do the allocation
> > of a new secondary after the first failover, so no human need be involved.
> 
> That doesn't help, after the first failover is too late even if it's
> done by a program.  There should be no window during which the VM is
> unprotected.
>
> People who want fault tolerance care about 9s of availability.  The VM
> must be protected on the new Primary as soon as the failover occurs,
> otherwise this isn't a serious fault tolerance solution.

I'm not aware of any other system that manages that, so I don't
think that's fair.

You gain a lot more availability going from a single
system to the 1+1 system that COLO (or any of the checkpointing systems)
propose, I can't say how many 9s it gets you.  It's true having multiple
secondaries would get you a bit more on top of that, but you're still
a lot better off just having the one secondary.

I had thought that having >1 secondary would be a nice addition, but it's
a big change everywhere else (e.g. having to maintain multiple migration
streams, dealing with miscompares from multiple hosts).

Dave


> 
> Stefan


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-05-08  8:42                                                           ` Stefan Hajnoczi
  2015-05-08  9:34                                                             ` Dr. David Alan Gilbert
@ 2015-05-08  9:39                                                             ` Kevin Wolf
  2015-05-08  9:55                                                               ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 74+ messages in thread
From: Kevin Wolf @ 2015-05-08  9:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Yang Hongyang, Fam Zheng, qemu block, armbru, jcody,
	Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert, qemu devel,
	zhanghailiang, Gonglei, Paolo Bonzini, Max Reitz, Lai Jiangshan

[-- Attachment #1: Type: text/plain, Size: 1957 bytes --]

Am 08.05.2015 um 10:42 hat Stefan Hajnoczi geschrieben:
> On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > >> > 
> > > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > > Do you mean use drive-mirror instead of quorum?
> > > > 
> > > > Only before starting up a new secondary.  Basically you do a migration
> > > > with non-shared storage, and then start the secondary in colo mode.
> > > > 
> > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > driver or filter) is fine for normal colo operation.
> > > 
> > > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > > Secondary so that the system can be protected very quickly after
> > > failover.
> > > 
> > > I think anyone serious about fault tolerance would deploy a Backup
> > > Secondary, otherwise the system cannot survive two failures unless a
> > > human administrator is lucky/fast enough to set up a new Secondary.
> > 
> > I'd assumed that a higher level management layer would do the allocation
> > of a new secondary after the first failover, so no human need be involved.
> 
> That doesn't help, after the first failover is too late even if it's
> done by a program.  There should be no window during which the VM is
> unprotected.
> 
> People who want fault tolerance care about 9s of availability.  The VM
> must be protected on the new Primary as soon as the failover occurs,
> otherwise this isn't a serious fault tolerance solution.

If you're worried about two failures in a row, why wouldn't you be
worried about three in a row? I think if you really want more than one
backup to be ready, you shouldn't go to two, but to n.

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
  2015-05-08  9:39                                                             ` Kevin Wolf
@ 2015-05-08  9:55                                                               ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2015-05-08  9:55 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, qemu block, armbru, jcody, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, zhanghailiang, Gonglei, Stefan Hajnoczi,
	Paolo Bonzini, Yang Hongyang, Lai Jiangshan

* Kevin Wolf (kwolf@redhat.com) wrote:
> Am 08.05.2015 um 10:42 hat Stefan Hajnoczi geschrieben:
> > On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > > > 
> > > > > 
> > > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > > >> > 
> > > > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > > > Do you mean use drive-mirror instead of quorum?
> > > > > 
> > > > > Only before starting up a new secondary.  Basically you do a migration
> > > > > with non-shared storage, and then start the secondary in colo mode.
> > > > > 
> > > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > > driver or filter) is fine for normal colo operation.
> > > > 
> > > > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > > > Secondary so that the system can be protected very quickly after
> > > > failover.
> > > > 
> > > > I think anyone serious about fault tolerance would deploy a Backup
> > > > Secondary, otherwise the system cannot survive two failures unless a
> > > > human administrator is lucky/fast enough to set up a new Secondary.
> > > 
> > > I'd assumed that a higher level management layer would do the allocation
> > > of a new secondary after the first failover, so no human need be involved.
> > 
> > That doesn't help, after the first failover is too late even if it's
> > done by a program.  There should be no window during which the VM is
> > unprotected.
> > 
> > People who want fault tolerance care about 9s of availability.  The VM
> > must be protected on the new Primary as soon as the failover occurs,
> > otherwise this isn't a serious fault tolerance solution.
> 
> If you're worried about two failures in a row, why wouldn't you be
> worried about three in a row? I think if you really want more than one
> backup to be ready, you shouldn't go to two, but to n.

Agreed, if you did multiple secondaries you'd do 'n'.

But 1+2 does satisfy all but the most paranoid; and in particular it does
mean that if you want to take a host down for some maintenance you can
do it without worrying.

But, as I said in my reply to Stefan, doing more than 1+1 gets really hairy;
the combinations of failovers are much more complicated.

Dave
  1) It means that 
  1) As Stefan mentions you get worried about the lack of protection after
the first failover; 
> Kevin


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2015-05-08  9:55 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-03 10:01 [Qemu-devel] [PATCH COLO v3 00/14] Block replication for continuous checkpoints Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description Wen Congyang
2015-04-20 15:30   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2015-04-21  1:25     ` Wen Congyang
2015-04-21 15:28       ` Paolo Bonzini
2015-04-22  9:18         ` Stefan Hajnoczi
2015-04-22  9:28           ` Wen Congyang
2015-04-23  9:55             ` Stefan Hajnoczi
2015-04-23 10:11               ` Wen Congyang
2015-04-22  9:31         ` Kevin Wolf
2015-04-22 10:12           ` [Qemu-devel] " Paolo Bonzini
2015-04-23  9:00             ` Kevin Wolf
2015-04-23  9:14               ` Wen Congyang
2015-04-23 10:05                 ` Paolo Bonzini
2015-04-23 10:17                   ` Kevin Wolf
2015-04-23 10:33                     ` Paolo Bonzini
2015-04-23 10:40                       ` Kevin Wolf
2015-04-23 10:44                         ` Paolo Bonzini
2015-04-23 11:35                           ` Wen Congyang
2015-04-23 11:36                           ` Kevin Wolf
2015-04-23 11:53                             ` Paolo Bonzini
2015-04-23 12:05                               ` Dr. David Alan Gilbert
2015-04-23 12:11                                 ` Paolo Bonzini
2015-04-23 12:19                                   ` Dr. David Alan Gilbert
2015-04-23 12:23                                     ` Paolo Bonzini
2015-04-24  2:01                                       ` Fam Zheng
2015-04-24  2:16                                         ` Wen Congyang
2015-04-24  7:47                                           ` Paolo Bonzini
2015-04-24  7:55                                             ` Wen Congyang
2015-04-24  8:58                                               ` Dr. David Alan Gilbert
2015-04-24  9:04                                                 ` Paolo Bonzini
2015-04-24  9:38                                                   ` Wen Congyang
2015-04-24  9:36                                                     ` Paolo Bonzini
2015-04-24  9:53                                                       ` Wen Congyang
2015-04-24 10:03                                                         ` Paolo Bonzini
2015-04-27  9:37                                                       ` Stefan Hajnoczi
2015-04-29  8:29                                                         ` Paolo Bonzini
2015-04-29  8:37                                                           ` Gonglei
2015-04-30 14:56                                                             ` Stefan Hajnoczi
2015-05-05 15:23                                                         ` Dr. David Alan Gilbert
2015-05-06  2:26                                                           ` Dong, Eddie
2015-05-06  2:49                                                             ` Fam Zheng
2015-05-08  8:42                                                           ` Stefan Hajnoczi
2015-05-08  9:34                                                             ` Dr. David Alan Gilbert
2015-05-08  9:39                                                             ` Kevin Wolf
2015-05-08  9:55                                                               ` Dr. David Alan Gilbert
2015-04-23  9:26               ` Paolo Bonzini
2015-04-23  9:37                 ` Kevin Wolf
2015-04-23  9:41                 ` Wen Congyang
2015-04-22  9:29       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2015-04-22  9:42         ` Wen Congyang
2015-04-22 10:39   ` [Qemu-devel] " Dr. David Alan Gilbert
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 02/14] quorum: allow ignoring child errors Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 03/14] NBD client: connect to nbd server later Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 04/14] Add new block driver interfaces to control block replication Wen Congyang
2015-04-22 12:56   ` Eric Blake
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 05/14] quorum: implement block driver interfaces for " Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 06/14] NBD client: " Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 07/14] allow writing to the backing file Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 08/14] Allow creating backup jobs when opening BDS Wen Congyang
2015-04-03 11:06   ` Paolo Bonzini
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 09/14] block: Parse "backing_reference" option to reference existing BDS Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap Wen Congyang
2015-04-03 11:05   ` Paolo Bonzini
2015-05-01 16:47     ` [Qemu-devel] [Qemu-block] " John Snow
2015-05-07  2:20       ` Wen Congyang
2015-05-07 18:32         ` John Snow
2015-05-08  0:59           ` Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 11/14] Backup: clear all bitmap when doing block checkpoint Wen Congyang
2015-04-03 11:09   ` Paolo Bonzini
2015-04-07  1:45     ` Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 12/14] qcow2: support colo Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 13/14] skip nbd_target when starting block replication Wen Congyang
2015-04-03 10:01 ` [Qemu-devel] [PATCH COLO v3 14/14] Don't allow a disk use backing reference target Wen Congyang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.