All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints
@ 2015-09-02  8:51 Wen Congyang
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 01/16] introduce a new API to enable/disable attach device model Wen Congyang
                   ` (15 more replies)
  0 siblings, 16 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Yang Hongyang

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

You can get the patch here:
https://github.com/coloft/qemu/tree/wency/block-replication-v9

You can get the patch with framework here:
https://github.com/coloft/qemu/commits/colo-v2.0-periodic-mode

TODO:
1. Continuous block replication. It will be started after basic functions
   are accepted.

Changs Log:
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
   if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Wen Congyang (16):
  introduce a new API to enable/disable attach device model
  introduce a new API to check if blk is attached
  allow writing to the backing file
  block: Allow references for backing files
  introduce a new API qemu_opts_absorb_qdict_by_index()
  quorum: allow ignoring child errors
  Backup: clear all bitmap when doing block checkpoint
  block: make bdrv_put_ref_bh_schedule() as a public API
  Allow creating backup jobs when opening BDS
  docs: block replication's description
  Add new block driver interfaces to control block replication
  skip nbd_target when starting block replication
  quorum: implement block driver interfaces for block replication
  Implement new driver for block replication
  support replication driver in blockdev-add
  Add a new API to start/stop replication, do checkpoint to all BDSes

 block.c                        | 230 +++++++++++++++++++-
 block/Makefile.objs            |   3 +-
 block/backup.c                 |  14 ++
 block/block-backend.c          |  33 +++
 block/quorum.c                 | 171 ++++++++++++++-
 block/replication.c            | 471 +++++++++++++++++++++++++++++++++++++++++
 blockdev.c                     |  37 +---
 blockjob.c                     |  11 +
 docs/block-replication.txt     | 183 ++++++++++++++++
 include/block/block.h          |  11 +
 include/block/block_int.h      |  14 ++
 include/block/blockjob.h       |  12 ++
 include/qemu/option.h          |   2 +
 include/sysemu/block-backend.h |   3 +
 qapi/block-core.json           |  43 +++-
 util/qemu-option.c             |  44 ++++
 16 files changed, 1234 insertions(+), 48 deletions(-)
 create mode 100644 block/replication.c
 create mode 100644 docs/block-replication.txt

-- 
2.4.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 01/16] introduce a new API to enable/disable attach device model
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 15:37   ` Eric Blake
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 02/16] introduce a new API to check if blk is attached Wen Congyang
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Yang Hongyang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 block/block-backend.c          | 24 ++++++++++++++++++++++++
 include/sysemu/block-backend.h |  2 ++
 2 files changed, 26 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index aee8a12..72d8b2c 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -344,6 +344,30 @@ void *blk_get_attached_dev(BlockBackend *blk)
 }
 
 /*
+ * Disable to attach a device mode to @blk.
+ * Return 0 on success, -EBUSY when a device model is attached already.
+ */
+int blk_disable_attach_dev(BlockBackend *blk)
+{
+    if (blk->dev) {
+        return blk->dev == (void *)-1 ? 0 : -EBUSY;
+    }
+
+    blk->dev = (void *)-1;
+    return 0;
+}
+
+/*
+ * Enable to attach a device mode to @blk.
+ */
+void blk_enable_attach_dev(BlockBackend *blk)
+{
+    if (blk->dev == (void *)-1) {
+        blk->dev = NULL;
+    }
+}
+
+/*
  * Set @blk's device model callbacks to @ops.
  * @opaque is the opaque argument to pass to the callbacks.
  * This is for use by device models.
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 8fc960f..7619a9f 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -80,6 +80,8 @@ int blk_attach_dev(BlockBackend *blk, void *dev);
 void blk_attach_dev_nofail(BlockBackend *blk, void *dev);
 void blk_detach_dev(BlockBackend *blk, void *dev);
 void *blk_get_attached_dev(BlockBackend *blk);
+int blk_disable_attach_dev(BlockBackend *blk);
+void blk_enable_attach_dev(BlockBackend *blk);
 void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops, void *opaque);
 int blk_read(BlockBackend *blk, int64_t sector_num, uint8_t *buf,
              int nb_sectors);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 02/16] introduce a new API to check if blk is attached
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 01/16] introduce a new API to enable/disable attach device model Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 15:40   ` Eric Blake
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 03/16] allow writing to the backing file Wen Congyang
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Yang Hongyang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 block.c                        | 4 ++--
 block/block-backend.c          | 9 +++++++++
 include/sysemu/block-backend.h | 1 +
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 0f9029b..aeb365b 100644
--- a/block.c
+++ b/block.c
@@ -2089,7 +2089,7 @@ void bdrv_swap(BlockDriverState *bs_new, BlockDriverState *bs_old)
     }
 
     /* bs_new must be unattached and shouldn't have anything fancy enabled */
-    assert(!bs_new->blk);
+    assert(!blk_is_attached(bs_new->blk));
     assert(QLIST_EMPTY(&bs_new->dirty_bitmaps));
     assert(bs_new->job == NULL);
     assert(bs_new->io_limits_enabled == false);
@@ -2106,7 +2106,7 @@ void bdrv_swap(BlockDriverState *bs_new, BlockDriverState *bs_old)
     bdrv_move_feature_fields(bs_new, &tmp);
 
     /* bs_new must remain unattached */
-    assert(!bs_new->blk);
+    assert(!blk_is_attached(bs_new->blk));
 
     /* Check a few fields that should remain attached to the device */
     assert(bs_new->job == NULL);
diff --git a/block/block-backend.c b/block/block-backend.c
index 72d8b2c..1463c37 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -368,6 +368,15 @@ void blk_enable_attach_dev(BlockBackend *blk)
 }
 
 /*
+ * Return true if a device model is attached to @blk already,
+ * otherwise, return false.
+ */
+bool blk_is_attached(BlockBackend *blk)
+{
+    return blk != NULL && blk->dev != NULL && blk->dev != (void *)-1;
+}
+
+/*
  * Set @blk's device model callbacks to @ops.
  * @opaque is the opaque argument to pass to the callbacks.
  * This is for use by device models.
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 7619a9f..a8c6fd2 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -82,6 +82,7 @@ void blk_detach_dev(BlockBackend *blk, void *dev);
 void *blk_get_attached_dev(BlockBackend *blk);
 int blk_disable_attach_dev(BlockBackend *blk);
 void blk_enable_attach_dev(BlockBackend *blk);
+bool blk_is_attached(BlockBackend *blk);
 void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops, void *opaque);
 int blk_read(BlockBackend *blk, int64_t sector_num, uint8_t *buf,
              int nb_sectors);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 03/16] allow writing to the backing file
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 01/16] introduce a new API to enable/disable attach device model Wen Congyang
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 02/16] introduce a new API to check if blk is attached Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 16:06   ` Eric Blake
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 04/16] block: Allow references for backing files Wen Congyang
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c              | 41 ++++++++++++++++++++++++++++++++++++++++-
 qapi/block-core.json |  7 ++++++-
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index aeb365b..d02d9e9 100644
--- a/block.c
+++ b/block.c
@@ -747,6 +747,15 @@ static const BdrvChildRole child_backing = {
     .inherit_flags = bdrv_backing_flags,
 };
 
+static int bdrv_backing_rw_flags(int flags)
+{
+    return bdrv_backing_flags(flags) | BDRV_O_RDWR;
+}
+
+static const BdrvChildRole child_backing_rw = {
+    .inherit_flags = bdrv_backing_rw_flags,
+};
+
 static int bdrv_open_flags(BlockDriverState *bs, int flags)
 {
     int open_flags = flags | BDRV_O_CACHE_WB;
@@ -1169,6 +1178,20 @@ out:
     bdrv_refresh_limits(bs, NULL);
 }
 
+#define ALLOW_WRITE_BACKING_FILE    "allow-write-backing-file"
+static QemuOptsList backing_file_opts = {
+    .name = "backing_file",
+    .head = QTAILQ_HEAD_INITIALIZER(backing_file_opts.head),
+    .desc = {
+        {
+            .name = ALLOW_WRITE_BACKING_FILE,
+            .type = QEMU_OPT_BOOL,
+            .help = "allow write to backing file",
+        },
+        { /* end of list */ }
+    },
+};
+
 /*
  * Opens the backing file for a BlockDriverState if not yet open
  *
@@ -1183,6 +1206,9 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
     int ret = 0;
     BlockDriverState *backing_hd;
     Error *local_err = NULL;
+    QemuOpts *opts = NULL;
+    bool child_rw = false;
+    const BdrvChildRole *child_role = NULL;
 
     if (bs->backing_hd != NULL) {
         QDECREF(options);
@@ -1195,6 +1221,18 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
     }
 
     bs->open_flags &= ~BDRV_O_NO_BACKING;
+
+    opts = qemu_opts_create(&backing_file_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        ret = -EINVAL;
+        error_propagate(errp, local_err);
+        QDECREF(options);
+        goto free_exit;
+    }
+    child_rw = qemu_opt_get_bool(opts, ALLOW_WRITE_BACKING_FILE, false);
+    child_role = child_rw ? &child_backing_rw : &child_backing;
+
     if (qdict_haskey(options, "file.filename")) {
         backing_filename[0] = '\0';
     } else if (bs->backing_file[0] == '\0' && qdict_size(options) == 0) {
@@ -1227,7 +1265,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
     assert(bs->backing_hd == NULL);
     ret = bdrv_open_inherit(&backing_hd,
                             *backing_filename ? backing_filename : NULL,
-                            NULL, options, 0, bs, &child_backing,
+                            NULL, options, 0, bs, child_role,
                             NULL, &local_err);
     if (ret < 0) {
         bdrv_unref(backing_hd);
@@ -1242,6 +1280,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
     bdrv_set_backing_hd(bs, backing_hd);
 
 free_exit:
+    qemu_opts_del(opts);
     g_free(backing_filename);
     return ret;
 }
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 50292b9..bf141a2 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1408,6 +1408,10 @@
 # @detect-zeroes: #optional detect and optimize zero writes (Since 2.1)
 #                 (default: off)
 #
+# @allow-write-backing-file: #optional whether the backing file is opened in
+#                            read-write mode. It is only for backing file
+#                            (Since 2.5 default: false)
+#
 # Since: 1.7
 ##
 { 'struct': 'BlockdevOptionsBase',
@@ -1420,7 +1424,8 @@
             '*rerror': 'BlockdevOnError',
             '*werror': 'BlockdevOnError',
             '*read-only': 'bool',
-            '*detect-zeroes': 'BlockdevDetectZeroesOptions' } }
+            '*detect-zeroes': 'BlockdevDetectZeroesOptions',
+            '*allow-write-backing-file': 'bool' } }
 
 ##
 # @BlockdevOptionsFile
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 04/16] block: Allow references for backing files
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (2 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 03/16] allow writing to the backing file Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 18:50   ` Eric Blake
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 05/16] introduce a new API qemu_opts_absorb_qdict_by_index() Wen Congyang
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang,
	zhanghailiang

Usage:
-drive file=xxx,id=Y, \
-drive file=xxxx,id=X,backing.backing_reference=Y

It will create such backing chain:
               {virtio-blk dev 'Y'}      {virtio-blk dev 'X'}
                         |                         |
                         |                         |
                         v                         v

    [base] <- [mid] <- ( Y )  <----------------- ( X )

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c               | 39 +++++++++++++++++++++++++++++++++++----
 include/block/block.h |  1 +
 2 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index d02d9e9..f93c50d 100644
--- a/block.c
+++ b/block.c
@@ -1179,6 +1179,7 @@ out:
 }
 
 #define ALLOW_WRITE_BACKING_FILE    "allow-write-backing-file"
+#define BACKING_REFERENCE           "backing_reference"
 static QemuOptsList backing_file_opts = {
     .name = "backing_file",
     .head = QTAILQ_HEAD_INITIALIZER(backing_file_opts.head),
@@ -1188,6 +1189,11 @@ static QemuOptsList backing_file_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "allow write to backing file",
         },
+        {
+            .name = BACKING_REFERENCE,
+            .type = QEMU_OPT_STRING,
+            .help = "reference to the exsiting BDS",
+        },
         { /* end of list */ }
     },
 };
@@ -1204,11 +1210,12 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
 {
     char *backing_filename = g_malloc0(PATH_MAX);
     int ret = 0;
-    BlockDriverState *backing_hd;
+    BlockDriverState *backing_hd = NULL;
     Error *local_err = NULL;
     QemuOpts *opts = NULL;
     bool child_rw = false;
     const BdrvChildRole *child_role = NULL;
+    const char *reference = NULL;
 
     if (bs->backing_hd != NULL) {
         QDECREF(options);
@@ -1231,9 +1238,10 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
         goto free_exit;
     }
     child_rw = qemu_opt_get_bool(opts, ALLOW_WRITE_BACKING_FILE, false);
+    reference = qemu_opt_get(opts, BACKING_REFERENCE);
     child_role = child_rw ? &child_backing_rw : &child_backing;
 
-    if (qdict_haskey(options, "file.filename")) {
+    if (qdict_haskey(options, "file.filename") || reference) {
         backing_filename[0] = '\0';
     } else if (bs->backing_file[0] == '\0' && qdict_size(options) == 0) {
         QDECREF(options);
@@ -1256,7 +1264,9 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
         goto free_exit;
     }
 
-    backing_hd = bdrv_new();
+    if (!reference) {
+        backing_hd = bdrv_new();
+    }
 
     if (bs->backing_format[0] != '\0' && !qdict_haskey(options, "driver")) {
         qdict_put(options, "driver", qstring_from_str(bs->backing_format));
@@ -1265,7 +1275,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
     assert(bs->backing_hd == NULL);
     ret = bdrv_open_inherit(&backing_hd,
                             *backing_filename ? backing_filename : NULL,
-                            NULL, options, 0, bs, child_role,
+                            reference, options, 0, bs, child_role,
                             NULL, &local_err);
     if (ret < 0) {
         bdrv_unref(backing_hd);
@@ -1276,6 +1286,19 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
         error_free(local_err);
         goto free_exit;
     }
+    if (reference) {
+        if (bdrv_op_is_blocked(backing_hd, BLOCK_OP_TYPE_BACKING_REFERENCE,
+                               errp)) {
+            ret = -EBUSY;
+            goto free_reference_exit;
+        }
+        if (backing_hd->blk && blk_disable_attach_dev(backing_hd->blk)) {
+            error_setg(errp, "backing_hd %s is used by the other device model",
+                       reference);
+            ret = -EBUSY;
+            goto free_reference_exit;
+        }
+    }
 
     bdrv_set_backing_hd(bs, backing_hd);
 
@@ -1283,6 +1306,11 @@ free_exit:
     qemu_opts_del(opts);
     g_free(backing_filename);
     return ret;
+
+free_reference_exit:
+    bdrv_unref(backing_hd);
+    bs->open_flags |= BDRV_O_NO_BACKING;
+    goto free_exit;
 }
 
 /*
@@ -1947,6 +1975,9 @@ void bdrv_close(BlockDriverState *bs)
         if (bs->backing_hd) {
             BlockDriverState *backing_hd = bs->backing_hd;
             bdrv_set_backing_hd(bs, NULL);
+            if (backing_hd->blk) {
+                blk_enable_attach_dev(backing_hd->blk);
+            }
             bdrv_unref(backing_hd);
         }
 
diff --git a/include/block/block.h b/include/block/block.h
index 612dcad..5812716 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -170,6 +170,7 @@ typedef enum BlockOpType {
     BLOCK_OP_TYPE_RESIZE,
     BLOCK_OP_TYPE_STREAM,
     BLOCK_OP_TYPE_REPLACE,
+    BLOCK_OP_TYPE_BACKING_REFERENCE,
     BLOCK_OP_TYPE_MAX,
 } BlockOpType;
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 05/16] introduce a new API qemu_opts_absorb_qdict_by_index()
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (3 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 04/16] block: Allow references for backing files Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 19:01   ` Eric Blake
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 06/16] quorum: allow ignoring child errors Wen Congyang
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 include/qemu/option.h |  2 ++
 util/qemu-option.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/include/qemu/option.h b/include/qemu/option.h
index 57e51c9..725a781 100644
--- a/include/qemu/option.h
+++ b/include/qemu/option.h
@@ -129,6 +129,8 @@ QemuOpts *qemu_opts_from_qdict(QemuOptsList *list, const QDict *qdict,
                                Error **errp);
 QDict *qemu_opts_to_qdict(QemuOpts *opts, QDict *qdict);
 void qemu_opts_absorb_qdict(QemuOpts *opts, QDict *qdict, Error **errp);
+void qemu_opts_absorb_qdict_by_index(QemuOpts *opts, QDict *qdict,
+                                     const char *index, Error **errp);
 
 typedef int (*qemu_opts_loopfunc)(void *opaque, QemuOpts *opts, Error **errp);
 int qemu_opts_foreach(QemuOptsList *list, qemu_opts_loopfunc func,
diff --git a/util/qemu-option.c b/util/qemu-option.c
index efe9d27..a93a269 100644
--- a/util/qemu-option.c
+++ b/util/qemu-option.c
@@ -1021,6 +1021,50 @@ void qemu_opts_absorb_qdict(QemuOpts *opts, QDict *qdict, Error **errp)
 }
 
 /*
+ * Adds all QDict entries to the QemuOpts that can be added and removes them
+ * from the QDict. The key starts with "%index." in the %qdict. When this
+ * function returns, the QDict contains only those entries that couldn't be
+ * added to the QemuOpts.
+ */
+void qemu_opts_absorb_qdict_by_index(QemuOpts *opts, QDict *qdict,
+                                     const char *index, Error **errp)
+{
+    const QDictEntry *entry, *next;
+    const char *key;
+    int len = strlen(index);
+
+    entry = qdict_first(qdict);
+
+    while (entry != NULL) {
+        Error *local_err = NULL;
+        OptsFromQDictState state = {
+            .errp = &local_err,
+            .opts = opts,
+        };
+
+        next = qdict_next(qdict, entry);
+        if (strncmp(entry->key, index, len) || *(entry->key + len) != '.') {
+            entry = next;
+            continue;
+        }
+
+        key = entry->key + len + 1;
+
+        if (find_desc_by_name(opts->list->desc, key)) {
+            qemu_opts_from_qdict_1(key, entry->value, &state);
+            if (local_err) {
+                error_propagate(errp, local_err);
+                return;
+            } else {
+                qdict_del(qdict, entry->key);
+            }
+        }
+
+        entry = next;
+    }
+}
+
+/*
  * Convert from QemuOpts to QDict.
  * The QDict values are of type QString.
  * TODO We'll want to use types appropriate for opt->desc->type, but
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 06/16] quorum: allow ignoring child errors
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (4 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 05/16] introduce a new API qemu_opts_absorb_qdict_by_index() Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 16:30   ` Eric Blake
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 07/16] Backup: clear all bitmap when doing block checkpoint Wen Congyang
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Alberto Garcia, qemu block, Jiang Yunhong,
	Dong Eddie, Dr. David Alan Gilbert, Michael R. Hines, Gonglei,
	Yang Hongyang, zhanghailiang

If the child is not ready, read/write/getlength/flush will
return -errno. It is not critical error, and can be ignored:
1. read/write:
   Just not report the error event.
2. getlength:
   just ignore it. If all children's getlength return -errno,
   and be ignored, return -EIO.
3. flush:
   Just ignore it. If all children's getlength return -errno,
   and be ignored, return 0.

Usage: children.x.ignore-errors=true

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Cc: Alberto Garcia <berto@igalia.com>
---
 block/quorum.c       | 94 ++++++++++++++++++++++++++++++++++++++++++++++++----
 qapi/block-core.json |  5 ++-
 2 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 8059861..f23dbb7 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -30,6 +30,7 @@
 #define QUORUM_OPT_BLKVERIFY      "blkverify"
 #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
 #define QUORUM_OPT_READ_PATTERN   "read-pattern"
+#define QUORUM_CHILDREN_OPT_IGNORE_ERRORS   "ignore-errors"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -65,6 +66,7 @@ typedef struct QuorumVotes {
 /* the following structure holds the state of one quorum instance */
 typedef struct BDRVQuorumState {
     BlockDriverState **bs; /* children BlockDriverStates */
+    bool *ignore_errors;   /* ignore children's error? */
     int num_children;      /* children count */
     int max_children;      /* The maximum children count, we need to reallocate
                             * bs if num_children will larger than maximum.
@@ -100,6 +102,7 @@ typedef struct QuorumChildRequest {
     uint8_t *buf;
     int ret;
     QuorumAIOCB *parent;
+    int index;
 } QuorumChildRequest;
 
 /* Quorum will use the following structure to track progress of each read/write
@@ -212,6 +215,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
         acb->qcrs[i].buf = NULL;
         acb->qcrs[i].ret = 0;
         acb->qcrs[i].parent = acb;
+        acb->qcrs[i].index = i;
     }
 
     return acb;
@@ -305,7 +309,7 @@ static void quorum_aio_cb(void *opaque, int ret)
     acb->count++;
     if (ret == 0) {
         acb->success_count++;
-    } else {
+    } else if (!s->ignore_errors[sacb->index]) {
         quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret);
     }
     assert(acb->count <= s->num_children);
@@ -716,19 +720,31 @@ static BlockAIOCB *quorum_aio_writev(BlockDriverState *bs,
 static int64_t quorum_getlength(BlockDriverState *bs)
 {
     BDRVQuorumState *s = bs->opaque;
-    int64_t result;
+    int64_t result = -EIO;
     int i;
 
     /* check that all file have the same length */
-    result = bdrv_getlength(s->bs[0]);
-    if (result < 0) {
-        return result;
-    }
-    for (i = 1; i < s->num_children; i++) {
+    for (i = 0; i < s->num_children; i++) {
         int64_t value = bdrv_getlength(s->bs[i]);
+
         if (value < 0) {
             return value;
         }
+
+        if (value == 0 && s->ignore_errors[i]) {
+            /*
+             * If the child is not ready, it cannot return -errno,
+             * otherwise refresh_total_sectors() will fail when
+             * we open the child.
+             */
+            continue;
+        }
+
+        if (result == -EIO) {
+            result = value;
+            continue;
+        }
+
         if (value != result) {
             return -EIO;
         }
@@ -766,6 +782,9 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs)
 
     for (i = 0; i < s->num_children; i++) {
         result = bdrv_co_flush(s->bs[i]);
+        if (result < 0 && s->ignore_errors[i]) {
+            result = 0;
+        }
         result_value.l = result;
         quorum_count_vote(&error_votes, &result_value, i);
     }
@@ -840,6 +859,19 @@ static QemuOptsList quorum_runtime_opts = {
     },
 };
 
+static QemuOptsList quorum_children_common_opts = {
+    .name = "quorum children",
+    .head = QTAILQ_HEAD_INITIALIZER(quorum_children_common_opts.head),
+    .desc = {
+        {
+            .name = QUORUM_CHILDREN_OPT_IGNORE_ERRORS,
+            .type = QEMU_OPT_BOOL,
+            .help = "ignore child I/O error",
+        },
+        { /* end of list */ }
+    },
+};
+
 static int parse_read_pattern(const char *opt)
 {
     int i;
@@ -858,6 +890,37 @@ static int parse_read_pattern(const char *opt)
     return -EINVAL;
 }
 
+static int parse_children_options(BDRVQuorumState *s, QDict *options,
+                                  const char *indexstr, int index,
+                                  Error **errp)
+{
+    QemuOpts *children_opts = NULL;
+    Error *local_err = NULL;
+    int ret = 0;
+    bool value;
+
+    children_opts = qemu_opts_create(&quorum_children_common_opts, NULL, 0,
+                                     &error_abort);
+    qemu_opts_absorb_qdict_by_index(children_opts, options, indexstr,
+                                    &local_err);
+    if (local_err) {
+        ret = -EINVAL;
+        goto out;
+    }
+
+    value = qemu_opt_get_bool(children_opts, QUORUM_CHILDREN_OPT_IGNORE_ERRORS,
+                              false);
+    s->ignore_errors[index] = value;
+
+out:
+    qemu_opts_del(children_opts);
+    /* propagate error */
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
+    return ret;
+}
+
 static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
                        Error **errp)
 {
@@ -929,12 +992,18 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     s->bs = g_new0(BlockDriverState *, s->num_children);
     opened = g_new0(bool, s->num_children);
     s->max_children = s->num_children;
+    s->ignore_errors = g_new0(bool, s->num_children);
 
     for (i = 0; i < s->num_children; i++) {
         char indexstr[32];
         ret = snprintf(indexstr, 32, "children.%d", i);
         assert(ret < 32);
 
+        ret = parse_children_options(s, options, indexstr, i, &local_err);
+        if (ret < 0) {
+            goto close_exit;
+        }
+
         ret = bdrv_open_image(&s->bs[i], NULL, options, indexstr, bs,
                               &child_format, false, &local_err);
         if (ret < 0) {
@@ -976,6 +1045,7 @@ static void quorum_close(BlockDriverState *bs)
     }
 
     g_free(s->bs);
+    g_free(s->ignore_errors);
 }
 
 static void quorum_detach_aio_context(BlockDriverState *bs)
@@ -1014,10 +1084,18 @@ static void quorum_add_child(BlockDriverState *bs, QDict *options, Error **errp)
         }
 
         s->bs = g_renew(BlockDriverState *, s->bs, s->max_children + 1);
+        s->ignore_errors = g_renew(bool, s->ignore_errors, s->max_children + 1);
         s->bs[s->num_children] = NULL;
         s->max_children += 1;
     }
 
+    ret = parse_children_options(s, options, "child", s->num_children,
+                                 &local_err);
+    if (ret < 0) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
     ret = bdrv_open_image(&s->bs[s->num_children], NULL, options, "child", bs,
                           &child_format, false, &local_err);
     if (ret < 0) {
@@ -1058,6 +1136,8 @@ static void quorum_del_child(BlockDriverState *bs, BlockDriverState *child_bs,
     bdrv_drain(bs);
     /* We can safely remove this child now */
     memmove(&s->bs[i], &s->bs[i+1], (s->num_children - i - 1) * sizeof(void *));
+    memmove(&s->ignore_errors[i], &s->ignore_errors[i+1],
+            (s->num_children - i - 1) * sizeof(bool));
     s->num_children--;
     s->bs[s->num_children] = NULL;
     bdrv_unref(child_bs);
diff --git a/qapi/block-core.json b/qapi/block-core.json
index bf141a2..24099ef 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1411,6 +1411,8 @@
 # @allow-write-backing-file: #optional whether the backing file is opened in
 #                            read-write mode. It is only for backing file
 #                            (Since 2.5 default: false)
+# @ignore-errors: #options whether the child's I/O error should be ignored.
+#                 it is only for quorum's child.(Since 2.5 default: false)
 #
 # Since: 1.7
 ##
@@ -1425,7 +1427,8 @@
             '*werror': 'BlockdevOnError',
             '*read-only': 'bool',
             '*detect-zeroes': 'BlockdevDetectZeroesOptions',
-            '*allow-write-backing-file': 'bool' } }
+            '*allow-write-backing-file': 'bool',
+            '*ignore-errors': 'bool' } }
 
 ##
 # @BlockdevOptionsFile
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 07/16] Backup: clear all bitmap when doing block checkpoint
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (5 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 06/16] quorum: allow ignoring child errors Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 14:10   ` Jeff Cody
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 08/16] block: make bdrv_put_ref_bh_schedule() as a public API Wen Congyang
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jeff Cody, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Cc: Jeff Cody <jcody@redhat.com>
---
 block/backup.c           | 14 ++++++++++++++
 blockjob.c               | 11 +++++++++++
 include/block/blockjob.h | 12 ++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index b729c4b..59bdb79 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -217,11 +217,25 @@ static void backup_iostatus_reset(BlockJob *job)
     bdrv_iostatus_reset(s->target);
 }
 
+static void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+    BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+    if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+        error_setg(errp, "The backup job only supports block checkpoint in"
+                   " sync=none mode");
+        return;
+    }
+
+    hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
     .instance_size  = sizeof(BackupBlockJob),
     .job_type       = BLOCK_JOB_TYPE_BACKUP,
     .set_speed      = backup_set_speed,
     .iostatus_reset = backup_iostatus_reset,
+    .do_checkpoint  = backup_do_checkpoint,
 };
 
 static BlockErrorAction backup_error_action(BackupBlockJob *job,
diff --git a/blockjob.c b/blockjob.c
index 62bb906..5528d92 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -404,3 +404,14 @@ void block_job_defer_to_main_loop(BlockJob *job,
 
     qemu_bh_schedule(data->bh);
 }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+    if (!job->driver->do_checkpoint) {
+        error_setg(errp, "The job %s doesn't support block checkpoint",
+                   BlockJobType_lookup[job->driver->job_type]);
+        return;
+    }
+
+    job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index dd9d5e6..0b4f386 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -50,6 +50,9 @@ typedef struct BlockJobDriver {
      * manually.
      */
     void (*complete)(BlockJob *job, Error **errp);
+
+    /** Optional callback for job types that support checkpoint. */
+    void (*do_checkpoint)(BlockJob *job, Error **errp);
 } BlockJobDriver;
 
 /**
@@ -356,4 +359,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
                                   BlockJobDeferToMainLoopFn *fn,
                                   void *opaque);
 
+/**
+ * block_job_do_checkpoint:
+ * @job: The job.
+ * @errp: Error object.
+ *
+ * Do block checkpoint on the specified job.
+ */
+void block_job_do_checkpoint(BlockJob *job, Error **errp);
+
 #endif
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 08/16] block: make bdrv_put_ref_bh_schedule() as a public API
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (6 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 07/16] Backup: clear all bitmap when doing block checkpoint Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 09/16] Allow creating backup jobs when opening BDS Wen Congyang
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Yang Hongyang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 block.c               | 25 +++++++++++++++++++++++++
 blockdev.c            | 37 ++++++-------------------------------
 include/block/block.h |  1 +
 3 files changed, 32 insertions(+), 31 deletions(-)

diff --git a/block.c b/block.c
index f93c50d..5119e1e 100644
--- a/block.c
+++ b/block.c
@@ -3688,6 +3688,31 @@ void bdrv_unref(BlockDriverState *bs)
     }
 }
 
+typedef struct {
+    QEMUBH *bh;
+    BlockDriverState *bs;
+} BDRVPutRefBH;
+
+static void bdrv_put_ref_bh(void *opaque)
+{
+    BDRVPutRefBH *s = opaque;
+
+    bdrv_unref(s->bs);
+    qemu_bh_delete(s->bh);
+    g_free(s);
+}
+
+/* Release a BDS reference in a BH */
+void bdrv_put_ref_bh_schedule(BlockDriverState *bs)
+{
+    BDRVPutRefBH *s;
+
+    s = g_new(BDRVPutRefBH, 1);
+    s->bh = qemu_bh_new(bdrv_put_ref_bh, s);
+    s->bs = bs;
+    qemu_bh_schedule(s->bh);
+}
+
 struct BdrvOpBlocker {
     Error *reason;
     QLIST_ENTRY(BdrvOpBlocker) list;
diff --git a/blockdev.c b/blockdev.c
index 09e9639..bb9ffae 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -276,37 +276,6 @@ static void bdrv_format_print(void *opaque, const char *name)
     error_printf(" %s", name);
 }
 
-typedef struct {
-    QEMUBH *bh;
-    BlockDriverState *bs;
-} BDRVPutRefBH;
-
-static void bdrv_put_ref_bh(void *opaque)
-{
-    BDRVPutRefBH *s = opaque;
-
-    bdrv_unref(s->bs);
-    qemu_bh_delete(s->bh);
-    g_free(s);
-}
-
-/*
- * Release a BDS reference in a BH
- *
- * It is not safe to use bdrv_unref() from a callback function when the callers
- * still need the BlockDriverState.  In such cases we schedule a BH to release
- * the reference.
- */
-static void bdrv_put_ref_bh_schedule(BlockDriverState *bs)
-{
-    BDRVPutRefBH *s;
-
-    s = g_new(BDRVPutRefBH, 1);
-    s->bh = qemu_bh_new(bdrv_put_ref_bh, s);
-    s->bs = bs;
-    qemu_bh_schedule(s->bh);
-}
-
 static int parse_block_error_action(const char *buf, bool is_read, Error **errp)
 {
     if (!strcmp(buf, "ignore")) {
@@ -2329,6 +2298,12 @@ static void block_job_cb(void *opaque, int ret)
         block_job_event_completed(bs->job, msg);
     }
 
+
+    /*
+     * It is not safe to use bdrv_unref() from a callback function when the
+     * callers still need the BlockDriverState. In such cases we schedule
+     * a BH to release the reference.
+     */
     bdrv_put_ref_bh_schedule(bs);
 }
 
diff --git a/include/block/block.h b/include/block/block.h
index 5812716..98c8509 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -519,6 +519,7 @@ void bdrv_disable_copy_on_read(BlockDriverState *bs);
 void bdrv_ref(BlockDriverState *bs);
 void bdrv_unref(BlockDriverState *bs);
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
+void bdrv_put_ref_bh_schedule(BlockDriverState *bs);
 
 bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp);
 void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 09/16] Allow creating backup jobs when opening BDS
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (7 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 08/16] block: make bdrv_put_ref_bh_schedule() as a public API Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 14:12   ` Jeff Cody
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 10/16] docs: block replication's description Wen Congyang
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jeff Cody, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang,
	zhanghailiang

When opening BDS, we need to create backup jobs for
image-fleecing.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Cc: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 58ef2ef..fa05f37 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += backup.o
 
 iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
 iscsi.o-libs       := $(LIBISCSI_LIBS)
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 10/16] docs: block replication's description
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (8 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 09/16] Allow creating backup jobs when opening BDS Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 20:41   ` Eric Blake
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 11/16] Add new block driver interfaces to control block replication Wen Congyang
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 docs/block-replication.txt | 183 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 183 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 0000000..b3cee8f
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,183 @@
+Block replication
+----------------------------------------
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content(it could be from either "Secondary Write Requests" or
+       previous COW of "Primary Write Requests") in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be buffered in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary    2 filter
+     disk         ^                                                             virtio-blk
+                  |                                                                  ^
+                3 NBD  ------->  3 NBD                                               |
+                client    ||     server                                          2 filter
+                          ||        ^                                                ^
+--------.                 ||        |                                                |
+Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
+--------'                 ||        |          backing        ^       backing
+                          ||        |                         |
+                          ||        |                         |
+                          ||        '-------------------------'
+                          ||           drive-backup sync=none
+
+1) The disk on the primary is represented by a block device with two
+children, providing replication between a primary disk and the host that
+runs the secondary VM. The read pattern for quorum can be extended to
+make the primary always read from the local disk instead of going through
+NBD.
+
+2) The new block filter(the name is replication) will control the block
+replication.
+
+3) The secondary disk receives writes from the primary VM through QEMU's
+embedded NBD server (speculative write-through).
+
+4) The disk on the secondary is represented by a custom block device
+(called active-disk). It should be an empty disk, and the format should
+support bdrv_make_empty() and backing file.
+
+5) The hidden-disk is created automatically. It buffers the original content
+that is modified by the primary VM. It should also be an empty disk, and
+the driver supports bdrv_make_empty() and backing file.
+
+== Failure Handling ==
+There are 6 internal errors when block replication is running:
+1. I/O error on primary disk
+2. Forwarding primary write requests failed
+3. Backup failed
+4. I/O error on secondary disk
+5. I/O error on active disk
+6. Making active disk or hidden disk empty failed
+In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
+4 and 6, we just report block replication's error to FT/HA manager(which
+decides when to do a new checkpoint, when to do failover).
+There is one internal error when doing failover:
+1. Commiting the data in active disk/hidden disk to secondary disk failed
+We just to report this error to FT/HA manager.
+
+== New block driver interface ==
+We add three block driver interfaces to control block replication:
+a. bdrv_start_replication()
+   Start block replication, called in migration/checkpoint thread.
+   We must call bdrv_start_replication() in secondary QEMU before
+   calling bdrv_start_replication() in primary QEMU. The caller
+   must hold the I/O mutex lock if it is in migration/checkpoint
+   thread.
+b. bdrv_do_checkpoint()
+   This interface is called after all VM state is transferred to
+   Secondary QEMU. The Disk buffer will be dropped in this interface.
+   The caller must hold the I/O mutex lock if it is in migration/checkpoint
+   thread.
+c. bdrv_stop_replication()
+   It is called on failover. We will flush the Disk buffer into
+   Secondary Disk and stop block replication. The vm should be stopped
+   before calling it if you use this API to shutdown the guest, or other
+   things except failover. The caller must hold the I/O mutex lock if it is
+   in migration/checkpoint thread.
+
+== Usage ==
+Primary:
+  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1\
+         children.0.file.filename=1.raw,\
+         children.0.driver=raw,\
+
+  Run qmp command in primary qemu:
+    child_add disk1 child.driver=replication,child.mode=primary,\
+              child.file.host=xxx,child.file.port=xxx,\
+              child.file.driver=nbd,child.ignore-errors=on
+  Note:
+  1. There should be only one NBD Client for each primary disk.
+  2. host is the secondary physical machine's hostname or IP
+  3. Each disk must have its own export name.
+  4. It is all a single argument to -drive and child_add, and you should
+     ignore the leading whitespace.
+  5. The qmp command line must be run after running qmp command line in
+     secondary qemu.
+
+Secondary:
+  -drive if=none,driver=raw,file=1.raw,id=colo1 \
+  -drive if=xxx,driver=replication,mode=secondary,\
+         file.file.filename=active_disk.qcow2,\
+         file.driver=qcow2,\
+         file.backing.file.filename=hidden_disk.qcow2,\
+         file.backing.driver=qcow2,\
+         file.backing.allow-write-backing-file=on,\
+         file.backing.backing.backing_reference=colo1\
+
+  Then run qmp command in secondary qemu:
+    nbd-server-start host:port
+    nbd-server-add -w colo1
+
+  Note:
+  1. The export name in secondary QEMU command line is the secondary
+     disk's id.
+  2. The export name for the same disk must be the same
+  3. The qmp command nbd-server-start and nbd-server-add must be run
+     before running the qmp command migrate on primary QEMU
+  4. Don't use nbd-server-start's other options
+  5. Active disk, hidden disk and nbd target's length should be the
+     same.
+  6. It is better to put active disk and hidden disk in ramdisk.
+  7. It is all a single argument to -drive, and you should ignore
+     the leading whitespace.
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 11/16] Add new block driver interfaces to control block replication
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (9 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 10/16] docs: block replication's description Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 16:33   ` Eric Blake
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 12/16] skip nbd_target when starting " Wen Congyang
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Luiz Capitulino,
	Gonglei, Yang Hongyang, Michael Roth, zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c                   | 43 +++++++++++++++++++++++++++++++++++++++++++
 include/block/block.h     |  5 +++++
 include/block/block_int.h | 14 ++++++++++++++
 qapi/block-core.json      | 15 +++++++++++++++
 4 files changed, 77 insertions(+)

diff --git a/block.c b/block.c
index 5119e1e..c7670d2 100644
--- a/block.c
+++ b/block.c
@@ -4390,3 +4390,46 @@ void bdrv_del_child(BlockDriverState *bs, BlockDriverState *child_bs,
 
     bs->drv->bdrv_del_child(bs, child_bs, errp);
 }
+
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                            Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_start_replication) {
+        drv->bdrv_start_replication(bs, mode, errp);
+    } else if (bs->file) {
+        bdrv_start_replication(bs->file, mode, errp);
+    } else {
+        error_setg(errp, "The BDS %s doesn't support starting block"
+                   " replication", bs->filename);
+    }
+}
+
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_do_checkpoint) {
+        drv->bdrv_do_checkpoint(bs, errp);
+    } else if (bs->file) {
+        bdrv_do_checkpoint(bs->file, errp);
+    } else {
+        error_setg(errp, "The BDS %s doesn't support block checkpoint",
+                   bs->filename);
+    }
+}
+
+void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_stop_replication) {
+        drv->bdrv_stop_replication(bs, failover, errp);
+    } else if (bs->file) {
+        bdrv_stop_replication(bs->file, failover, errp);
+    } else {
+        error_setg(errp, "The BDS %s doesn't support stopping block"
+                   " replication", bs->filename);
+    }
+}
diff --git a/include/block/block.h b/include/block/block.h
index 98c8509..192a066 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -626,4 +626,9 @@ void bdrv_add_child(BlockDriverState *bs, QDict *options, Error **errp);
 void bdrv_del_child(BlockDriverState *bs, BlockDriverState *child,
                     Error **errp);
 
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                            Error **errp);
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp);
+void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index b6f2905..1c2d310 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -293,6 +293,20 @@ struct BlockDriver {
     void (*bdrv_del_child)(BlockDriverState *bs, BlockDriverState *child,
                            Error **errp);
 
+    void (*bdrv_start_replication)(BlockDriverState *bs, ReplicationMode mode,
+                                   Error **errp);
+    /* Drop Disk buffer when doing checkpoint. */
+    void (*bdrv_do_checkpoint)(BlockDriverState *bs, Error **errp);
+    /*
+     * After failover, we should flush Disk buffer into secondary disk
+     * and stop block replication.
+     *
+     * If the guest is shutdown, we should drop Disk buffer and stop
+     * block representation.
+     */
+    void (*bdrv_stop_replication)(BlockDriverState *bs, bool failover,
+                                  Error **errp);
+
     QLIST_ENTRY(BlockDriver) list;
 };
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 24099ef..96f0530 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1810,6 +1810,21 @@
   'data': { '*export': 'str' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @unprotected: Replication is not started or after failover.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.4
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 12/16] skip nbd_target when starting block replication
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (10 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 11/16] Add new block driver interfaces to control block replication Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 13/16] quorum: implement block driver interfaces for " Wen Congyang
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/block.c b/block.c
index c7670d2..50f3532 100644
--- a/block.c
+++ b/block.c
@@ -4396,6 +4396,10 @@ void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
 {
     BlockDriver *drv = bs->drv;
 
+    if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKING_REFERENCE, NULL)) {
+        return;
+    }
+
     if (drv && drv->bdrv_start_replication) {
         drv->bdrv_start_replication(bs, mode, errp);
     } else if (bs->file) {
@@ -4410,6 +4414,10 @@ void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp)
 {
     BlockDriver *drv = bs->drv;
 
+    if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKING_REFERENCE, NULL)) {
+        return;
+    }
+
     if (drv && drv->bdrv_do_checkpoint) {
         drv->bdrv_do_checkpoint(bs, errp);
     } else if (bs->file) {
@@ -4424,6 +4432,10 @@ void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp)
 {
     BlockDriver *drv = bs->drv;
 
+    if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKING_REFERENCE, NULL)) {
+        return;
+    }
+
     if (drv && drv->bdrv_stop_replication) {
         drv->bdrv_stop_replication(bs, failover, errp);
     } else if (bs->file) {
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 13/16] quorum: implement block driver interfaces for block replication
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (11 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 12/16] skip nbd_target when starting " Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 14/16] Implement new driver " Wen Congyang
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
---
 block/quorum.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index f23dbb7..5c00f0c 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -87,6 +87,8 @@ typedef struct BDRVQuorumState {
                             */
 
     QuorumReadPattern read_pattern;
+
+    int replication_index; /* store which child supports block replication */
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -1014,6 +1016,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     g_free(opened);
+    s->replication_index = -1;
     goto exit;
 
 close_exit:
@@ -1176,6 +1179,76 @@ static void quorum_refresh_filename(BlockDriverState *bs)
     bs->full_open_options = opts;
 }
 
+static void quorum_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                                     Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int count = 0, i, index;
+    Error *local_err = NULL;
+
+    /*
+     * TODO: support REPLICATION_MODE_SECONDARY if we allow secondary
+     * QEMU becoming primary QEMU.
+     */
+    if (mode != REPLICATION_MODE_PRIMARY) {
+        error_setg(errp, "The replication mode for quorum should be 'primary'");
+        return;
+    }
+
+    if (s->read_pattern != QUORUM_READ_PATTERN_FIFO) {
+        error_setg(errp, "Block replication needs read pattern 'fifo'");
+        return;
+    }
+
+    for (i = 0; i < s->num_children; i++) {
+        bdrv_start_replication(s->bs[i], mode, &local_err);
+        if (local_err) {
+            error_free(local_err);
+            local_err = NULL;
+        } else {
+            count++;
+            index = i;
+        }
+    }
+
+    if (count == 0) {
+        error_setg(errp, "No child supports block replication");
+    } else if (count > 1) {
+        for (i = 0; i < s->num_children; i++) {
+            bdrv_stop_replication(s->bs[i], false, NULL);
+        }
+        error_setg(errp, "Too many children support block replication");
+    } else {
+        s->replication_index = index;
+    }
+}
+
+static void quorum_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+
+    if (s->replication_index < 0) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    bdrv_do_checkpoint(s->bs[s->replication_index], errp);
+}
+
+static void quorum_stop_replication(BlockDriverState *bs, bool failover,
+                                    Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+
+    if (s->replication_index < 0) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    bdrv_stop_replication(s->bs[s->replication_index], failover, errp);
+    s->replication_index = -1;
+}
+
 static BlockDriver bdrv_quorum = {
     .format_name                        = "quorum",
     .protocol_name                      = "quorum",
@@ -1202,6 +1275,10 @@ static BlockDriver bdrv_quorum = {
 
     .is_filter                          = true,
     .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
+
+    .bdrv_start_replication             = quorum_start_replication,
+    .bdrv_do_checkpoint                 = quorum_do_checkpoint,
+    .bdrv_stop_replication              = quorum_stop_replication,
 };
 
 static void bdrv_quorum_init(void)
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 14/16] Implement new driver for block replication
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (12 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 13/16] quorum: implement block driver interfaces for " Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 15/16] support replication driver in blockdev-add Wen Congyang
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 16/16] Add a new API to start/stop replication, do checkpoint to all BDSes Wen Congyang
  15 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block/Makefile.objs |   1 +
 block/replication.c | 471 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 472 insertions(+)
 create mode 100644 block/replication.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fa05f37..94c1d03 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 0000000..813f610
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,471 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "block/block_int.h"
+#include "block/blockjob.h"
+#include "block/nbd.h"
+
+typedef struct BDRVReplicationState {
+    ReplicationMode mode;
+    int replication_state;
+    BlockDriverState *active_disk;
+    BlockDriverState *hidden_disk;
+    BlockDriverState *secondary_disk;
+    int error;
+} BDRVReplicationState;
+
+enum {
+    BLOCK_REPLICATION_NONE,     /* block replication is not started */
+    BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+    BLOCK_REPLICATION_DONE,     /* block replication is done(failover) */
+};
+
+#define COMMIT_CLUSTER_BITS 16
+#define COMMIT_CLUSTER_SIZE (1 << COMMIT_CLUSTER_BITS)
+#define COMMIT_SECTORS_PER_CLUSTER (COMMIT_CLUSTER_SIZE / BDRV_SECTOR_SIZE)
+
+static void replication_stop(BlockDriverState *bs, bool failover, Error **errp);
+
+#define REPLICATION_MODE        "mode"
+static QemuOptsList replication_runtime_opts = {
+    .name = "replication",
+    .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+    .desc = {
+        {
+            .name = REPLICATION_MODE,
+            .type = QEMU_OPT_STRING,
+        },
+        { /* end of list */ }
+    },
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+                            int flags, Error **errp)
+{
+    int ret;
+    BDRVReplicationState *s = bs->opaque;;
+    Error *local_err = NULL;
+    QemuOpts *opts = NULL;
+    const char *mode;
+
+    ret = -EINVAL;
+    opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        goto fail;
+    }
+
+    mode = qemu_opt_get(opts, REPLICATION_MODE);
+    if (!mode) {
+        error_setg(&local_err, "Missing the option mode");
+        goto fail;
+    }
+
+    if (!strcmp(mode, "primary")) {
+        s->mode = REPLICATION_MODE_PRIMARY;
+    } else if (!strcmp(mode, "secondary")) {
+        s->mode = REPLICATION_MODE_SECONDARY;
+    } else {
+        error_setg(&local_err,
+                   "The option mode's value should be primary or secondary");
+        goto fail;
+    }
+
+    ret = 0;
+
+fail:
+    qemu_opts_del(opts);
+    /* propagate error */
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
+    return ret;
+}
+
+static void replication_close(BlockDriverState *bs)
+{
+    BDRVReplicationState *s = bs->opaque;
+
+    if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
+        replication_stop(bs, false, NULL);
+    }
+}
+
+static int64_t replication_getlength(BlockDriverState *bs)
+{
+    return bdrv_getlength(bs->file);
+}
+
+static int replication_get_io_status(BDRVReplicationState *s)
+{
+    switch (s->replication_state) {
+    case BLOCK_REPLICATION_NONE:
+        return -EIO;
+    case BLOCK_REPLICATION_RUNNING:
+        return 0;
+    case BLOCK_REPLICATION_DONE:
+        return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1;
+    default:
+        abort();
+    }
+}
+
+static int replication_return_value(BDRVReplicationState *s, int ret)
+{
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        return ret;
+    }
+
+    if (ret < 0) {
+        s->error = ret;
+        ret = 0;
+    }
+
+    return ret;
+}
+
+static coroutine_fn int replication_co_readv(BlockDriverState *bs,
+                                             int64_t sector_num,
+                                             int remaining_sectors,
+                                             QEMUIOVector *qiov)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int ret;
+
+    if (s->mode == REPLICATION_MODE_PRIMARY) {
+        /* We only use it to forward primary write requests */
+        return -EIO;
+    }
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /*
+     * After failover, because we don't commit active disk/hidden disk
+     * to secondary disk, so we should read from active disk directly.
+     */
+    ret = bdrv_co_readv(bs->file, sector_num, remaining_sectors, qiov);
+    return replication_return_value(s, ret);
+}
+
+static coroutine_fn int replication_co_writev(BlockDriverState *bs,
+                                              int64_t sector_num,
+                                              int remaining_sectors,
+                                              QEMUIOVector *qiov)
+{
+    BDRVReplicationState *s = bs->opaque;
+    QEMUIOVector hd_qiov;
+    uint64_t bytes_done = 0;
+    BlockDriverState *top = bs->file;
+    BlockDriverState *base = s->secondary_disk;
+    BlockDriverState *target;
+    int ret, n;
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (ret == 0) {
+        ret = bdrv_co_writev(bs->file, sector_num, remaining_sectors, qiov);
+        return replication_return_value(s, ret);
+    }
+
+    /*
+     * Only write to active disk if the sectors have
+     * already been allocated in active disk/hidden disk.
+     */
+    qemu_iovec_init(&hd_qiov, qiov->niov);
+    while (remaining_sectors > 0) {
+        ret = bdrv_is_allocated_above(top, base, sector_num,
+                                      remaining_sectors, &n);
+        if (ret < 0) {
+            return ret;
+        }
+
+        qemu_iovec_reset(&hd_qiov);
+        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * 512);
+
+        target = ret ? top: base;
+        ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
+        if (ret < 0) {
+            return ret;
+        }
+
+        remaining_sectors -= n;
+        sector_num += n;
+        bytes_done += n * BDRV_SECTOR_SIZE;
+    }
+
+    return 0;
+}
+
+static coroutine_fn int replication_co_discard(BlockDriverState *bs,
+                                               int64_t sector_num,
+                                               int nb_sectors)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int ret;
+
+    ret = replication_get_io_status(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (ret == 1) {
+        /* It is secondary qemu and we are after failover */
+        ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    ret = bdrv_co_discard(bs->file, sector_num, nb_sectors);
+    return replication_return_value(s, ret);
+}
+
+static bool replication_recurse_is_first_non_filter(BlockDriverState *bs,
+                                                    BlockDriverState *candidate)
+{
+    return bdrv_recurse_is_first_non_filter(bs->file, candidate);
+}
+
+static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
+{
+    Error *local_err = NULL;
+    int ret;
+
+    if (!s->secondary_disk->job) {
+        error_setg(errp, "Backup job is cancelled unexpectedly");
+        return;
+    }
+
+    block_job_do_checkpoint(s->secondary_disk->job, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    ret = s->active_disk->drv->bdrv_make_empty(s->active_disk);
+    if (ret < 0) {
+        error_setg(errp, "Cannot make active disk empty");
+        return;
+    }
+
+    ret = s->hidden_disk->drv->bdrv_make_empty(s->hidden_disk);
+    if (ret < 0) {
+        error_setg(errp, "Cannot make hidden disk empty");
+        return;
+    }
+}
+
+static void backup_job_completed(void *opaque, int ret)
+{
+    BDRVReplicationState *s = opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_DONE) {
+        /* The backup job is cancelled unexpectedly */
+        s->error = -EIO;
+    }
+
+    bdrv_op_block(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
+                  s->active_disk->backing_blocker);
+    bdrv_op_block(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
+                  s->hidden_disk->backing_blocker);
+
+    bdrv_put_ref_bh_schedule(s->secondary_disk);
+}
+
+static void replication_start(BlockDriverState *bs, ReplicationMode mode,
+                              Error **errp)
+{
+    BDRVReplicationState *s = bs->opaque;
+    int64_t active_length, hidden_length, disk_length;
+    AioContext *aio_context;
+    Error *local_err = NULL;
+
+    if (s->replication_state != BLOCK_REPLICATION_NONE) {
+        error_setg(errp, "Block replication is running or done");
+        return;
+    }
+
+    if (s->mode != mode) {
+        error_setg(errp, "The parameter mode's value is invalid, needs %d,"
+                   " but receives %d", s->mode, mode);
+        return;
+    }
+
+    switch (s->mode) {
+    case REPLICATION_MODE_PRIMARY:
+        break;
+    case REPLICATION_MODE_SECONDARY:
+        s->active_disk = bs->file;
+        if (!bs->file->backing_hd) {
+            error_setg(errp, "Active disk doesn't have backing file");
+            return;
+        }
+
+        s->hidden_disk = s->active_disk->backing_hd;
+        if (!s->hidden_disk->backing_hd) {
+            error_setg(errp, "Hidden disk doesn't have backing file");
+            return;
+        }
+
+        s->secondary_disk = s->hidden_disk->backing_hd;
+        if (!s->secondary_disk->blk) {
+            error_setg(errp, "The secondary disk doesn't have block backend");
+            return;
+        }
+
+        /* verify the length */
+        active_length = bdrv_getlength(s->active_disk);
+        hidden_length = bdrv_getlength(s->hidden_disk);
+        disk_length = bdrv_getlength(s->secondary_disk);
+        if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
+            active_length != hidden_length || hidden_length != disk_length) {
+            error_setg(errp, "active disk, hidden disk, secondary disk's length"
+                       " are not the same");
+            return;
+        }
+
+        if (!s->active_disk->drv->bdrv_make_empty ||
+            !s->hidden_disk->drv->bdrv_make_empty) {
+            error_setg(errp,
+                       "active disk or hidden disk doesn't support make_empty");
+            return;
+        }
+
+        /* start backup job now */
+        bdrv_op_unblock(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
+                        s->active_disk->backing_blocker);
+        bdrv_op_unblock(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
+                        s->hidden_disk->backing_blocker);
+        bdrv_ref(s->hidden_disk);
+
+        aio_context = bdrv_get_aio_context(bs);
+        aio_context_acquire(aio_context);
+        bdrv_set_aio_context(s->secondary_disk, aio_context);
+        backup_start(s->secondary_disk, s->hidden_disk, 0,
+                     MIRROR_SYNC_MODE_NONE, NULL, BLOCKDEV_ON_ERROR_REPORT,
+                     BLOCKDEV_ON_ERROR_REPORT, backup_job_completed,
+                     s, &local_err);
+        aio_context_release(aio_context);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            bdrv_op_block(s->hidden_disk, BLOCK_OP_TYPE_BACKUP_TARGET,
+                          s->active_disk->backing_blocker);
+            bdrv_op_block(s->secondary_disk, BLOCK_OP_TYPE_BACKUP_SOURCE,
+                          s->hidden_disk->backing_blocker);
+            bdrv_unref(s->hidden_disk);
+            return;
+        }
+        break;
+    default:
+        abort();
+    }
+
+    s->replication_state = BLOCK_REPLICATION_RUNNING;
+
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        secondary_do_checkpoint(s, errp);
+    }
+
+    s->error = 0;
+}
+
+static void replication_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BDRVReplicationState *s = bs->opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    if (s->error) {
+        error_setg(errp, "I/O error occurs");
+        return;
+    }
+
+    if (s->mode == REPLICATION_MODE_SECONDARY) {
+        secondary_do_checkpoint(s, errp);
+    }
+}
+
+static void replication_stop(BlockDriverState *bs, bool failover, Error **errp)
+{
+    BDRVReplicationState *s = bs->opaque;
+
+    if (s->replication_state != BLOCK_REPLICATION_RUNNING) {
+        error_setg(errp, "Block replication is not running");
+        return;
+    }
+
+    s->replication_state = BLOCK_REPLICATION_DONE;
+
+    switch (s->mode) {
+    case REPLICATION_MODE_PRIMARY:
+        break;
+    case REPLICATION_MODE_SECONDARY:
+        if (!failover) {
+            /*
+             * The guest will be shutdown, and secondary disk is the
+             * same as the primary disk. Just make active disk and
+             * hidden disk empty.
+             */
+            secondary_do_checkpoint(s, errp);
+            return;
+        }
+
+        if (s->secondary_disk->job) {
+            block_job_cancel(s->secondary_disk->job);
+        }
+        break;
+    default:
+        abort();
+    }
+}
+
+BlockDriver bdrv_replication = {
+    .format_name                = "replication",
+    .protocol_name              = "replication",
+    .instance_size              = sizeof(BDRVReplicationState),
+
+    .bdrv_open                  = replication_open,
+    .bdrv_close                 = replication_close,
+
+    .bdrv_getlength             = replication_getlength,
+    .bdrv_co_readv              = replication_co_readv,
+    .bdrv_co_writev             = replication_co_writev,
+    .bdrv_co_discard            = replication_co_discard,
+
+    .is_filter                  = true,
+    .bdrv_recurse_is_first_non_filter = replication_recurse_is_first_non_filter,
+
+    .bdrv_start_replication     = replication_start,
+    .bdrv_do_checkpoint         = replication_do_checkpoint,
+    .bdrv_stop_replication      = replication_stop,
+
+    .has_variable_length        = true,
+};
+
+static void bdrv_replication_init(void)
+{
+    bdrv_register(&bdrv_replication);
+}
+
+block_init(bdrv_replication_init);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 15/16] support replication driver in blockdev-add
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (13 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 14/16] Implement new driver " Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  2015-09-02 16:36   ` Eric Blake
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 16/16] Add a new API to start/stop replication, do checkpoint to all BDSes Wen Congyang
  15 siblings, 1 reply; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Yang Hongyang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 qapi/block-core.json | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 96f0530..86275e3 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1383,7 +1383,7 @@
             'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
             'host_floppy', 'http', 'https', 'null-aio', 'null-co', 'parallels',
             'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-            'vmdk', 'vpc', 'vvfat', 'nbd' ] }
+            'vmdk', 'vpc', 'vvfat', 'nbd', 'replication' ] }
 
 ##
 # @BlockdevOptionsBase
@@ -1825,6 +1825,19 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# Since: 2.5
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
@@ -1869,7 +1882,8 @@
       'vhdx':       'BlockdevOptionsGenericFormat',
       'vmdk':       'BlockdevOptionsGenericCOWFormat',
       'vpc':        'BlockdevOptionsGenericFormat',
-      'vvfat':      'BlockdevOptionsVVFAT'
+      'vvfat':      'BlockdevOptionsVVFAT',
+      'replication':'BlockdevOptionsReplication'
   } }
 
 ##
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH 16/16] Add a new API to start/stop replication, do checkpoint to all BDSes
  2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
                   ` (14 preceding siblings ...)
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 15/16] support replication driver in blockdev-add Wen Congyang
@ 2015-09-02  8:51 ` Wen Congyang
  15 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-02  8:51 UTC (permalink / raw)
  To: qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang,
	zhanghailiang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 block.c               | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/block/block.h |  4 +++
 2 files changed, 72 insertions(+)

diff --git a/block.c b/block.c
index 50f3532..1e019bf 100644
--- a/block.c
+++ b/block.c
@@ -4445,3 +4445,71 @@ void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp)
                    " replication", bs->filename);
     }
 }
+
+void bdrv_start_replication_all(ReplicationMode mode, Error **errp)
+{
+    BlockDriverState *bs = NULL, *temp = NULL;
+    Error *local_err = NULL;
+
+    while ((bs = bdrv_next(bs))) {
+        if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+            continue;
+        }
+
+        bdrv_start_replication(bs, mode, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            goto fail;
+        }
+    }
+
+    return;
+
+fail:
+    while ((temp = bdrv_next(temp)) && bs != temp) {
+        bdrv_stop_replication(temp, false, NULL);
+    }
+}
+
+void bdrv_do_checkpoint_all(Error **errp)
+{
+    BlockDriverState *bs = NULL;
+    Error *local_err = NULL;
+
+    while ((bs = bdrv_next(bs))) {
+        if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+            continue;
+        }
+
+        bdrv_do_checkpoint(bs, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+}
+
+void bdrv_stop_replication_all(bool failover, Error **errp)
+{
+    BlockDriverState *bs = NULL;
+    Error *local_err = NULL;
+
+    while ((bs = bdrv_next(bs))) {
+        if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+            continue;
+        }
+
+        bdrv_stop_replication(bs, failover, &local_err);
+        if (!errp) {
+            /*
+             * The caller doesn't care the result, they just
+             * want to stop all block's replication.
+             */
+            continue;
+        }
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+}
diff --git a/include/block/block.h b/include/block/block.h
index 192a066..2aa4376 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -631,4 +631,8 @@ void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
 void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp);
 void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp);
 
+void bdrv_start_replication_all(ReplicationMode mode, Error **errp);
+void bdrv_do_checkpoint_all(Error **errp);
+void bdrv_stop_replication_all(bool failover, Error **errp);
+
 #endif
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 07/16] Backup: clear all bitmap when doing block checkpoint
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 07/16] Backup: clear all bitmap when doing block checkpoint Wen Congyang
@ 2015-09-02 14:10   ` Jeff Cody
  0 siblings, 0 replies; 39+ messages in thread
From: Jeff Cody @ 2015-09-02 14:10 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Max Reitz, Gonglei,
	Stefan Hajnoczi, Paolo Bonzini, Yang Hongyang,
	Dr. David Alan Gilbert

On Wed, Sep 02, 2015 at 04:51:11PM +0800, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Cc: Jeff Cody <jcody@redhat.com>
> ---
>  block/backup.c           | 14 ++++++++++++++
>  blockjob.c               | 11 +++++++++++
>  include/block/blockjob.h | 12 ++++++++++++
>  3 files changed, 37 insertions(+)
> 
> diff --git a/block/backup.c b/block/backup.c
> index b729c4b..59bdb79 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -217,11 +217,25 @@ static void backup_iostatus_reset(BlockJob *job)
>      bdrv_iostatus_reset(s->target);
>  }
>  
> +static void backup_do_checkpoint(BlockJob *job, Error **errp)
> +{
> +    BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
> +
> +    if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
> +        error_setg(errp, "The backup job only supports block checkpoint in"
> +                   " sync=none mode");
> +        return;
> +    }
> +
> +    hbitmap_reset_all(backup_job->bitmap);
> +}
> +
>  static const BlockJobDriver backup_job_driver = {
>      .instance_size  = sizeof(BackupBlockJob),
>      .job_type       = BLOCK_JOB_TYPE_BACKUP,
>      .set_speed      = backup_set_speed,
>      .iostatus_reset = backup_iostatus_reset,
> +    .do_checkpoint  = backup_do_checkpoint,
>  };
>  
>  static BlockErrorAction backup_error_action(BackupBlockJob *job,
> diff --git a/blockjob.c b/blockjob.c
> index 62bb906..5528d92 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -404,3 +404,14 @@ void block_job_defer_to_main_loop(BlockJob *job,
>  
>      qemu_bh_schedule(data->bh);
>  }
> +
> +void block_job_do_checkpoint(BlockJob *job, Error **errp)
> +{
> +    if (!job->driver->do_checkpoint) {
> +        error_setg(errp, "The job %s doesn't support block checkpoint",
> +                   BlockJobType_lookup[job->driver->job_type]);
> +        return;
> +    }
> +
> +    job->driver->do_checkpoint(job, errp);
> +}
> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
> index dd9d5e6..0b4f386 100644
> --- a/include/block/blockjob.h
> +++ b/include/block/blockjob.h
> @@ -50,6 +50,9 @@ typedef struct BlockJobDriver {
>       * manually.
>       */
>      void (*complete)(BlockJob *job, Error **errp);
> +
> +    /** Optional callback for job types that support checkpoint. */
> +    void (*do_checkpoint)(BlockJob *job, Error **errp);
>  } BlockJobDriver;
>  
>  /**
> @@ -356,4 +359,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
>                                    BlockJobDeferToMainLoopFn *fn,
>                                    void *opaque);
>  
> +/**
> + * block_job_do_checkpoint:
> + * @job: The job.
> + * @errp: Error object.
> + *
> + * Do block checkpoint on the specified job.
> + */
> +void block_job_do_checkpoint(BlockJob *job, Error **errp);
> +
>  #endif
> -- 
> 2.4.3
> 
> 

Reviewed-by: Jeff Cody <jcody@redhat.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 09/16] Allow creating backup jobs when opening BDS
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 09/16] Allow creating backup jobs when opening BDS Wen Congyang
@ 2015-09-02 14:12   ` Jeff Cody
  0 siblings, 0 replies; 39+ messages in thread
From: Jeff Cody @ 2015-09-02 14:12 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Max Reitz, Gonglei,
	Stefan Hajnoczi, Paolo Bonzini, Yang Hongyang,
	Dr. David Alan Gilbert

On Wed, Sep 02, 2015 at 04:51:13PM +0800, Wen Congyang wrote:
> When opening BDS, we need to create backup jobs for
> image-fleecing.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Cc: Jeff Cody <jcody@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  block/Makefile.objs | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 58ef2ef..fa05f37 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
>  block-obj-$(CONFIG_LIBSSH2) += ssh.o
>  block-obj-y += accounting.o
>  block-obj-y += write-threshold.o
> +block-obj-y += backup.o
>  
>  common-obj-y += stream.o
>  common-obj-y += commit.o
> -common-obj-y += backup.o
>  
>  iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
>  iscsi.o-libs       := $(LIBISCSI_LIBS)
> -- 
> 2.4.3
> 
>

Reviewed-by: Jeff Cody <jcody@redhat.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 01/16] introduce a new API to enable/disable attach device model
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 01/16] introduce a new API to enable/disable attach device model Wen Congyang
@ 2015-09-02 15:37   ` Eric Blake
  2015-09-07  1:27     ` Wen Congyang
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Blake @ 2015-09-02 15:37 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Yang Hongyang

[-- Attachment #1: Type: text/plain, Size: 1681 bytes --]

On 09/02/2015 02:51 AM, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  block/block-backend.c          | 24 ++++++++++++++++++++++++
>  include/sysemu/block-backend.h |  2 ++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index aee8a12..72d8b2c 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -344,6 +344,30 @@ void *blk_get_attached_dev(BlockBackend *blk)
>  }
>  
>  /*
> + * Disable to attach a device mode to @blk.

s/to attach/attaching/
s/mode/model/

But I'm not even sure this patch makes sense.  I'd rather see Max's work
go in that allows for a BB without a BDS as representing a drive without
media, because then it is clear - either a BB has an associated BDS (and
cannot attach another one), or it does not (and attaching is fine).

> + * Return 0 on success, -EBUSY when a device model is attached already.
> + */
> +int blk_disable_attach_dev(BlockBackend *blk)
> +{
> +    if (blk->dev) {
> +        return blk->dev == (void *)-1 ? 0 : -EBUSY;
> +    }
> +
> +    blk->dev = (void *)-1;
> +    return 0;
> +}
> +
> +/*
> + * Enable to attach a device mode to @blk.
> + */
> +void blk_enable_attach_dev(BlockBackend *blk)
> +{
> +    if (blk->dev == (void *)-1) {

At the very least, if we allow a special sentinel to represent a BB
without a BDS (other than NULL, the way Max's series does it), it should
at least be wrapped by a macro, rather than using '(void *)-1' at
multiple call sites.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 02/16] introduce a new API to check if blk is attached
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 02/16] introduce a new API to check if blk is attached Wen Congyang
@ 2015-09-02 15:40   ` Eric Blake
  0 siblings, 0 replies; 39+ messages in thread
From: Eric Blake @ 2015-09-02 15:40 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Yang Hongyang

[-- Attachment #1: Type: text/plain, Size: 1048 bytes --]

On 09/02/2015 02:51 AM, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  block.c                        | 4 ++--
>  block/block-backend.c          | 9 +++++++++
>  include/sysemu/block-backend.h | 1 +
>  3 files changed, 12 insertions(+), 2 deletions(-)
> 

>  /*
> + * Return true if a device model is attached to @blk already,
> + * otherwise, return false.
> + */
> +bool blk_is_attached(BlockBackend *blk)
> +{
> +    return blk != NULL && blk->dev != NULL && blk->dev != (void *)-1;

Again, I don't like the raw magic constant, even if we go with this
patch.  And it is shorter to write:

   return blk && blk->dev && blk->dev != MAGIC;

But I think it is better to just allow for a NULL BDS to represent an
unattached media (in which case Max's patches may already cover what you
are trying to do here), rather than trying to overload a special value
different from NULL.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 03/16] allow writing to the backing file
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 03/16] allow writing to the backing file Wen Congyang
@ 2015-09-02 16:06   ` Eric Blake
  2015-09-09  9:19     ` Wen Congyang
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Blake @ 2015-09-02 16:06 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang

[-- Attachment #1: Type: text/plain, Size: 2216 bytes --]

On 09/02/2015 02:51 AM, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>

Not much description in the commit message.  I really want an
explanation of why this patch is necessary.  After all, with
'block-commit', we were able to turn on read-write mode of backing files
on an as-needed basis, without having to expose that to the end user.
Giving the end user a knob that they must tune feels a bit awkward, and
probably means we don't have the design right.

> ---
>  block.c              | 41 ++++++++++++++++++++++++++++++++++++++++-
>  qapi/block-core.json |  7 ++++++-
>  2 files changed, 46 insertions(+), 2 deletions(-)
> 

> +#define ALLOW_WRITE_BACKING_FILE    "allow-write-backing-file"
> +static QemuOptsList backing_file_opts = {
> +    .name = "backing_file",
> +    .head = QTAILQ_HEAD_INITIALIZER(backing_file_opts.head),
> +    .desc = {
> +        {
> +            .name = ALLOW_WRITE_BACKING_FILE,
> +            .type = QEMU_OPT_BOOL,
> +            .help = "allow write to backing file",

If you do add more justification for why this patch is necessary, then,

s/write/writes/

> +++ b/qapi/block-core.json
> @@ -1408,6 +1408,10 @@
>  # @detect-zeroes: #optional detect and optimize zero writes (Since 2.1)
>  #                 (default: off)
>  #
> +# @allow-write-backing-file: #optional whether the backing file is opened in
> +#                            read-write mode. It is only for backing file
> +#                            (Since 2.5 default: false)
> +#

The name feels a bit long.

It sounds like it is an error to pass allow-write-backing-file for a
top-level BDS (that is, the BDS associated with a BB).  Meanwhile, the
default for any backing chain BDS is to open it read-only, regardless of
the 'read-only' setting of the parent.  But can we just allow
'read-only':false on a backing BDS to mean that the BDS starts life as
read-write, without having to add a new parameter?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 06/16] quorum: allow ignoring child errors
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 06/16] quorum: allow ignoring child errors Wen Congyang
@ 2015-09-02 16:30   ` Eric Blake
  2015-09-07  3:40     ` Wen Congyang
  2015-09-07 16:56     ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 39+ messages in thread
From: Eric Blake @ 2015-09-02 16:30 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, Alberto Garcia, zhanghailiang, qemu block,
	Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Gonglei, Yang Hongyang

[-- Attachment #1: Type: text/plain, Size: 4191 bytes --]

On 09/02/2015 02:51 AM, Wen Congyang wrote:
> If the child is not ready, read/write/getlength/flush will
> return -errno. It is not critical error, and can be ignored:
> 1. read/write:
>    Just not report the error event.

What happens if all the children report an error?  Or is the threshold
at play here?

For example, if you have a threshold of 3/5, then I'm assuming that if
up to two children return an errno, then it is okay to ignore; but if
three or more return an errno, you haven't met threshold, so the I/O
must fail.

Are you ignoring ALL errors (including things like EACCES), or just EIO
errors?


> 2. getlength:
>    just ignore it. If all children's getlength return -errno,
>    and be ignored, return -EIO.
> 3. flush:
>    Just ignore it. If all children's getlength return -errno,

s/getlength/flush/

>    and be ignored, return 0.

Yuck - claiming success when all of the children fail feels dangerous.

> 
> Usage: children.x.ignore-errors=true
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Cc: Alberto Garcia <berto@igalia.com>
> ---
>  block/quorum.c       | 94 ++++++++++++++++++++++++++++++++++++++++++++++++----

Interface review only:

> +++ b/qapi/block-core.json
> @@ -1411,6 +1411,8 @@
>  # @allow-write-backing-file: #optional whether the backing file is opened in
>  #                            read-write mode. It is only for backing file
>  #                            (Since 2.5 default: false)
> +# @ignore-errors: #options whether the child's I/O error should be ignored.

s/options/optional/
s/error/errors/

> +#                 it is only for quorum's child.(Since 2.5 default: false)

Space after '.' in English sentences.

The fact that you are documenting that this option can only be specified
for quorum children makes me wonder if it belongs better as an option in
BlockdevOptionsQuorum rather than BlockdevOptionsBase.

Semantically, it sounds like you are trying to allow for a per-child
decision of whether this particular child's errors matter to the overall
quorum.  So, if we have a 3/5 quorum, we can decide that for children A,
B, C, and D, errors cannot be ignored, but for child E, errors are not a
problem.

As written, you are tying the semantics to each child BDS, and requiring
special code to double-check that the property is only ever set if the
BDS is used as the child of a quorum.  Furthermore, if the property is
set, you are then changing what the child does in response to various
operations.

What if you instead create a list property in the quorum parent?  Maybe
along the lines of:

# @child-errors-okay: #optional an array of named-node children where
errors will be ignored (Since 2.5, default empty)

{ 'struct': 'BlockdevOptionsQuorum',
  'data': { '*blkverify': 'bool',
            'children': [ 'BlockdevRef' ],
            'vote-threshold': 'int',
            '*rewrite-corrupted': 'bool',
            '*read-pattern': 'QuorumReadPattern',
            '*child-errors-okay': ['str'] } }

The above example of a 3/5 quorum, where only child E can ignore errors,
would then be represented as:

{ "children": [ "A", "B", "C", "D", "E" ], 'vote-threshold':3,
'child-errors-okay': [ "E" ] }

The code to ignore the errors is then done in the quorum itself (the BDS
for E does not have to care about a special ignore-errors property, but
just always returns the error as usual; and then the quorum is deciding
how to handle the error), and you are not polluting the BDS state for
something that is quorum-specific, because it is now the quorum itself
that tracks the special casing.

Finally, why can't hot-plug/unplug of quorum members work?  If you are
going to always ignore errors from a particular child, then why is that
child even part of the quorum?  Isn't a better design to just not add
the child to the quorum until it is ready and won't be reporting errors?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 11/16] Add new block driver interfaces to control block replication
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 11/16] Add new block driver interfaces to control block replication Wen Congyang
@ 2015-09-02 16:33   ` Eric Blake
  2015-09-09  9:24     ` Wen Congyang
  2015-09-25  6:14     ` Wen Congyang
  0 siblings, 2 replies; 39+ messages in thread
From: Eric Blake @ 2015-09-02 16:33 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, Michael Roth, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Luiz Capitulino,
	Gonglei, Yang Hongyang, zhanghailiang

[-- Attachment #1: Type: text/plain, Size: 1463 bytes --]

On 09/02/2015 02:51 AM, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block.c                   | 43 +++++++++++++++++++++++++++++++++++++++++++
>  include/block/block.h     |  5 +++++
>  include/block/block_int.h | 14 ++++++++++++++
>  qapi/block-core.json      | 15 +++++++++++++++
>  4 files changed, 77 insertions(+)
> 

Just an interface review for now:

> +++ b/qapi/block-core.json
> @@ -1810,6 +1810,21 @@
>    'data': { '*export': 'str' } }
>  
>  ##
> +# @ReplicationMode
> +#
> +# An enumeration of replication modes.
> +#
> +# @unprotected: Replication is not started or after failover.

Maybe:

Replication is either not started, or has experienced failover.

> +#
> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
> +#
> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
> +#
> +# Since: 2.4

You've missed 2.4; this should be 2.5.

> +##
> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }

Where is 'unprotected' in this list?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 15/16] support replication driver in blockdev-add
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 15/16] support replication driver in blockdev-add Wen Congyang
@ 2015-09-02 16:36   ` Eric Blake
  2015-09-09  8:27     ` Wen Congyang
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Blake @ 2015-09-02 16:36 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Yang Hongyang

[-- Attachment #1: Type: text/plain, Size: 2213 bytes --]

On 09/02/2015 02:51 AM, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  qapi/block-core.json | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 96f0530..86275e3 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1383,7 +1383,7 @@
>              'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
>              'host_floppy', 'http', 'https', 'null-aio', 'null-co', 'parallels',
>              'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
> -            'vmdk', 'vpc', 'vvfat', 'nbd' ] }
> +            'vmdk', 'vpc', 'vvfat', 'nbd', 'replication' ] }

'nbd' is not in the current qemu.git; which means your patch series
depends on a prerequisite series.  Please mention that sort of
information in your cover letter.

Please keep this enum in alphabetical order.

Missing documentation under @drv of BlockDeviceInfo that this was
introduced in 2.5.

>  
>  ##
>  # @BlockdevOptionsBase
> @@ -1825,6 +1825,19 @@
>  { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
>  
>  ##
> +# @BlockdevOptionsReplication
> +#
> +# Driver specific block device options for replication
> +#
> +# @mode: the replication mode

Can the mode be 'unprotected', or must it be 'primary' or 'secondary'
when first creating a replication BDS?

> +#
> +# Since: 2.5
> +##
> +{ 'struct': 'BlockdevOptionsReplication',
> +  'base': 'BlockdevOptionsGenericFormat',
> +  'data': { 'mode': 'ReplicationMode'  } }
> +
> +##
>  # @BlockdevOptions
>  #
>  # Options for creating a block device.
> @@ -1869,7 +1882,8 @@
>        'vhdx':       'BlockdevOptionsGenericFormat',
>        'vmdk':       'BlockdevOptionsGenericCOWFormat',
>        'vpc':        'BlockdevOptionsGenericFormat',
> -      'vvfat':      'BlockdevOptionsVVFAT'
> +      'vvfat':      'BlockdevOptionsVVFAT',
> +      'replication':'BlockdevOptionsReplication'
>    } }

It helps to keep this alphabetical.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 04/16] block: Allow references for backing files
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 04/16] block: Allow references for backing files Wen Congyang
@ 2015-09-02 18:50   ` Eric Blake
  2015-09-09  8:51     ` Wen Congyang
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Blake @ 2015-09-02 18:50 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang

[-- Attachment #1: Type: text/plain, Size: 2732 bytes --]

On 09/02/2015 02:51 AM, Wen Congyang wrote:
> Usage:
> -drive file=xxx,id=Y, \
> -drive file=xxxx,id=X,backing.backing_reference=Y
> 
> It will create such backing chain:
>                {virtio-blk dev 'Y'}      {virtio-blk dev 'X'}
>                          |                         |
>                          |                         |
>                          v                         v
> 
>     [base] <- [mid] <- ( Y )  <----------------- ( X )

This makes any changes to 'Y' have unspecified effects on 'X'.  While we
may have a valid reason to use a backing BDS in more than one chain, I
seriously doubt anyone will ever want to have two guest-visible -drive's
that are both read-write where one can corrupt the other.  I can totally
see the point of having BDS 'Y' exist for checkpoints or some other
non-guest-visible action, so I'm not saying this patch is wrong, just
that the commit message is picking a poor example of how it would be used.

> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  block.c               | 39 +++++++++++++++++++++++++++++++++++----
>  include/block/block.h |  1 +
>  2 files changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/block.c b/block.c
> index d02d9e9..f93c50d 100644
> --- a/block.c
> +++ b/block.c
> @@ -1179,6 +1179,7 @@ out:
>  }
>  
>  #define ALLOW_WRITE_BACKING_FILE    "allow-write-backing-file"
> +#define BACKING_REFERENCE           "backing_reference"

Why the inconsistency in '-' vs. '_'?  I'd stick with dash here.

>  static QemuOptsList backing_file_opts = {
>      .name = "backing_file",
>      .head = QTAILQ_HEAD_INITIALIZER(backing_file_opts.head),
> @@ -1188,6 +1189,11 @@ static QemuOptsList backing_file_opts = {
>              .type = QEMU_OPT_BOOL,
>              .help = "allow write to backing file",
>          },
> +        {
> +            .name = BACKING_REFERENCE,
> +            .type = QEMU_OPT_STRING,
> +            .help = "reference to the exsiting BDS",

s/exsiting/existing/

But why do we need this? In qapi, BlockdevOptionsGenericCOWFormat
already has '*backing':'BlockdevRef', and BlockdevRef already has a
choice between 'definition' (object) and 'reference' (string).  Or is
this just a matter of teaching the command line to do what QMP can
already do?  In which case, wouldn't:

-drive file=xxx,id=Y, -drive file=xxxx,id=X,backing=Y

be the natural mapping of 'backing' being a string rather than a dictionary?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 05/16] introduce a new API qemu_opts_absorb_qdict_by_index()
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 05/16] introduce a new API qemu_opts_absorb_qdict_by_index() Wen Congyang
@ 2015-09-02 19:01   ` Eric Blake
  2015-09-07  2:18     ` Wen Congyang
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Blake @ 2015-09-02 19:01 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang

[-- Attachment #1: Type: text/plain, Size: 1343 bytes --]

On 09/02/2015 02:51 AM, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>

Commit message is a bit sparse.

> ---
>  include/qemu/option.h |  2 ++
>  util/qemu-option.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)
> 

Missing testsuite exposure of the new function.

>  /*
> + * Adds all QDict entries to the QemuOpts that can be added and removes them
> + * from the QDict. The key starts with "%index." in the %qdict. When this

"%index." reads awkwardly (I thought it was a printf-style format). But
I'm not sure if "starts with %index followed by '.'" is any better.

> + * function returns, the QDict contains only those entries that couldn't be
> + * added to the QemuOpts.
> + */
> +void qemu_opts_absorb_qdict_by_index(QemuOpts *opts, QDict *qdict,
> +                                     const char *index, Error **errp)
> +{

I didn't review the algorithm closely, but here's a superficial comment:

> +    const QDictEntry *entry, *next;
> +    const char *key;
> +    int len = strlen(index);

size_t

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 10/16] docs: block replication's description
  2015-09-02  8:51 ` [Qemu-devel] [PATCH 10/16] docs: block replication's description Wen Congyang
@ 2015-09-02 20:41   ` Eric Blake
  2015-09-09  8:22     ` Wen Congyang
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Blake @ 2015-09-02 20:41 UTC (permalink / raw)
  To: Wen Congyang, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang

[-- Attachment #1: Type: text/plain, Size: 6456 bytes --]

On 09/02/2015 02:51 AM, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  docs/block-replication.txt | 183 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 183 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 


> +
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content(it could be from either "Secondary Write Requests" or

space before '(' in English sentences.

> +       previous COW of "Primary Write Requests") in the Disk buffer.
> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==

> +                3 NBD  ------->  3 NBD                                               |
> +                client    ||     server                                          2 filter
> +                          ||        ^                                                ^
> +--------.                 ||        |                                                |
> +Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
> +--------'                 ||        |          backing        ^       backing
> +                          ||        |                         |
> +                          ||        |                         |
> +                          ||        '-------------------------'
> +                          ||           drive-backup sync=none
> +

> +
> +4) The disk on the secondary is represented by a custom block device
> +(called active-disk). It should be an empty disk, and the format should
> +support bdrv_make_empty() and backing file.

s/be an empty disk/start as an empty disk/

> +
> +5) The hidden-disk is created automatically. It buffers the original content
> +that is modified by the primary VM. It should also be an empty disk, and

s/be/start as/

> +the driver supports bdrv_make_empty() and backing file.

Missing mention that a drive-backup job is run to allow hidden-disk to
buffer any state that would otherwise be lost by the speculative
write-through of the NBD server into the secondary disk.

> +
> +== Failure Handling ==
> +There are 6 internal errors when block replication is running:
> +1. I/O error on primary disk
> +2. Forwarding primary write requests failed
> +3. Backup failed
> +4. I/O error on secondary disk
> +5. I/O error on active disk
> +6. Making active disk or hidden disk empty failed
> +In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
> +4 and 6, we just report block replication's error to FT/HA manager(which

space before '('

> +decides when to do a new checkpoint, when to do failover).
> +There is one internal error when doing failover:
> +1. Commiting the data in active disk/hidden disk to secondary disk failed

s/Commiting/Committing/

> +We just to report this error to FT/HA manager.
> +
> +== New block driver interface ==

> +
> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw,\
> +
> +  Run qmp command in primary qemu:
> +    child_add disk1 child.driver=replication,child.mode=primary,\
> +              child.file.host=xxx,child.file.port=xxx,\
> +              child.file.driver=nbd,child.ignore-errors=on

My comments earlier in this series mean this step should be two QMP
commands: the first is blockdev-add to create an unassociated BDS, the
second to then add that BDS into the quorum.

> +  Note:
> +  1. There should be only one NBD Client for each primary disk.
> +  2. host is the secondary physical machine's hostname or IP
> +  3. Each disk must have its own export name.
> +  4. It is all a single argument to -drive and child_add, and you should
> +     ignore the leading whitespace.
> +  5. The qmp command line must be run after running qmp command line in
> +     secondary qemu.
> +
> +Secondary:
> +  -drive if=none,driver=raw,file=1.raw,id=colo1 \
> +  -drive if=xxx,driver=replication,mode=secondary,\
> +         file.file.filename=active_disk.qcow2,\
> +         file.driver=qcow2,\
> +         file.backing.file.filename=hidden_disk.qcow2,\
> +         file.backing.driver=qcow2,\
> +         file.backing.allow-write-backing-file=on,\
> +         file.backing.backing.backing_reference=colo1\
> +
> +  Then run qmp command in secondary qemu:
> +    nbd-server-start host:port
> +    nbd-server-add -w colo1
> +
> +  Note:
> +  1. The export name in secondary QEMU command line is the secondary
> +     disk's id.
> +  2. The export name for the same disk must be the same
> +  3. The qmp command nbd-server-start and nbd-server-add must be run
> +     before running the qmp command migrate on primary QEMU
> +  4. Don't use nbd-server-start's other options
> +  5. Active disk, hidden disk and nbd target's length should be the
> +     same.
> +  6. It is better to put active disk and hidden disk in ramdisk.
> +  7. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.

Missing: document the steps taken during failover (that is, how do I
promote a Secondary into a new Primary, and then attach a new Secondary
to that point).  In particular, I suspect there may be differences
between whether you want to roll back to the state of the last
checkpoint (in hidden_disk) or just go with the current state of the
Secondary (in Active); either way, it probably involves doing an active
commit of the state you want into Secondary, then the formation of a new
quorum to start handing replication data off through a new NBD client
connection.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 01/16] introduce a new API to enable/disable attach device model
  2015-09-02 15:37   ` Eric Blake
@ 2015-09-07  1:27     ` Wen Congyang
  0 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-07  1:27 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Yang Hongyang

On 09/02/2015 11:37 PM, Eric Blake wrote:
> On 09/02/2015 02:51 AM, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> ---
>>  block/block-backend.c          | 24 ++++++++++++++++++++++++
>>  include/sysemu/block-backend.h |  2 ++
>>  2 files changed, 26 insertions(+)
>>
>> diff --git a/block/block-backend.c b/block/block-backend.c
>> index aee8a12..72d8b2c 100644
>> --- a/block/block-backend.c
>> +++ b/block/block-backend.c
>> @@ -344,6 +344,30 @@ void *blk_get_attached_dev(BlockBackend *blk)
>>  }
>>  
>>  /*
>> + * Disable to attach a device mode to @blk.
> 
> s/to attach/attaching/
> s/mode/model/
> 
> But I'm not even sure this patch makes sense.  I'd rather see Max's work
> go in that allows for a BB without a BDS as representing a drive without
> media, because then it is clear - either a BB has an associated BDS (and
> cannot attach another one), or it does not (and attaching is fine).

Hmm, I think the API's name cause a misunderstanding.

These APIs are used to check if the BB is used by some block device. It is very
useful if we allow references for backing files. If the backing file referes an
existing BDS that with BB, the block device can't use this BB at the same time.

Thanks
Wen Congyang

> 
>> + * Return 0 on success, -EBUSY when a device model is attached already.
>> + */
>> +int blk_disable_attach_dev(BlockBackend *blk)
>> +{
>> +    if (blk->dev) {
>> +        return blk->dev == (void *)-1 ? 0 : -EBUSY;
>> +    }
>> +
>> +    blk->dev = (void *)-1;
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Enable to attach a device mode to @blk.
>> + */
>> +void blk_enable_attach_dev(BlockBackend *blk)
>> +{
>> +    if (blk->dev == (void *)-1) {
> 
> At the very least, if we allow a special sentinel to represent a BB
> without a BDS (other than NULL, the way Max's series does it), it should
> at least be wrapped by a macro, rather than using '(void *)-1' at
> multiple call sites.
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 05/16] introduce a new API qemu_opts_absorb_qdict_by_index()
  2015-09-02 19:01   ` Eric Blake
@ 2015-09-07  2:18     ` Wen Congyang
  0 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-07  2:18 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang

On 09/03/2015 03:01 AM, Eric Blake wrote:
> On 09/02/2015 02:51 AM, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> 
> Commit message is a bit sparse.
> 
>> ---
>>  include/qemu/option.h |  2 ++
>>  util/qemu-option.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 46 insertions(+)
>>
> 
> Missing testsuite exposure of the new function.

Do you mean: update tests/test-qemu-opts.c?

> 
>>  /*
>> + * Adds all QDict entries to the QemuOpts that can be added and removes them
>> + * from the QDict. The key starts with "%index." in the %qdict. When this
> 
> "%index." reads awkwardly (I thought it was a printf-style format). But
> I'm not sure if "starts with %index followed by '.'" is any better.

The other comments use '@'. I will update it in the next version.

> 
>> + * function returns, the QDict contains only those entries that couldn't be
>> + * added to the QemuOpts.
>> + */
>> +void qemu_opts_absorb_qdict_by_index(QemuOpts *opts, QDict *qdict,
>> +                                     const char *index, Error **errp)
>> +{
> 
> I didn't review the algorithm closely, but here's a superficial comment:
> 
>> +    const QDictEntry *entry, *next;
>> +    const char *key;
>> +    int len = strlen(index);
> 
> size_t
OK, will fix it in the next version.

Thanks
Wen Congyang

> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 06/16] quorum: allow ignoring child errors
  2015-09-02 16:30   ` Eric Blake
@ 2015-09-07  3:40     ` Wen Congyang
  2015-09-07 16:56     ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-07  3:40 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, Alberto Garcia, zhanghailiang, qemu block,
	Jiang Yunhong, Dong Eddie, Dr. David Alan Gilbert,
	Michael R. Hines, Gonglei, Yang Hongyang

On 09/03/2015 12:30 AM, Eric Blake wrote:
> On 09/02/2015 02:51 AM, Wen Congyang wrote:
>> If the child is not ready, read/write/getlength/flush will
>> return -errno. It is not critical error, and can be ignored:
>> 1. read/write:
>>    Just not report the error event.
> 
> What happens if all the children report an error?  Or is the threshold
> at play here?

Good question. What about this: if we have 5 children, only 4 children
can ignore errors.

> 
> For example, if you have a threshold of 3/5, then I'm assuming that if
> up to two children return an errno, then it is okay to ignore; but if
> three or more return an errno, you haven't met threshold, so the I/O
> must fail.

Do more check here, at least one child sucesses.

> 
> Are you ignoring ALL errors (including things like EACCES), or just EIO
> errors?

Does bdrv_xxx() always return -errno on error?

> 
> 
>> 2. getlength:
>>    just ignore it. If all children's getlength return -errno,
>>    and be ignored, return -EIO.
>> 3. flush:
>>    Just ignore it. If all children's getlength return -errno,
> 
> s/getlength/flush/
> 
>>    and be ignored, return 0.
> 
> Yuck - claiming success when all of the children fail feels dangerous.

Yes.

> 
>>
>> Usage: children.x.ignore-errors=true
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Cc: Alberto Garcia <berto@igalia.com>
>> ---
>>  block/quorum.c       | 94 ++++++++++++++++++++++++++++++++++++++++++++++++----
> 
> Interface review only:
> 
>> +++ b/qapi/block-core.json
>> @@ -1411,6 +1411,8 @@
>>  # @allow-write-backing-file: #optional whether the backing file is opened in
>>  #                            read-write mode. It is only for backing file
>>  #                            (Since 2.5 default: false)
>> +# @ignore-errors: #options whether the child's I/O error should be ignored.
> 
> s/options/optional/
> s/error/errors/
> 
>> +#                 it is only for quorum's child.(Since 2.5 default: false)
> 
> Space after '.' in English sentences.
> 
> The fact that you are documenting that this option can only be specified
> for quorum children makes me wonder if it belongs better as an option in
> BlockdevOptionsQuorum rather than BlockdevOptionsBase.

Hmm, how to define it? It is quorum's child's option, not quorum's option.

> 
> Semantically, it sounds like you are trying to allow for a per-child
> decision of whether this particular child's errors matter to the overall
> quorum.  So, if we have a 3/5 quorum, we can decide that for children A,
> B, C, and D, errors cannot be ignored, but for child E, errors are not a
> problem.
> 
> As written, you are tying the semantics to each child BDS, and requiring
> special code to double-check that the property is only ever set if the
> BDS is used as the child of a quorum.  Furthermore, if the property is
> set, you are then changing what the child does in response to various
> operations.
> 
> What if you instead create a list property in the quorum parent?  Maybe
> along the lines of:
> 
> # @child-errors-okay: #optional an array of named-node children where
> errors will be ignored (Since 2.5, default empty)
> 
> { 'struct': 'BlockdevOptionsQuorum',
>   'data': { '*blkverify': 'bool',
>             'children': [ 'BlockdevRef' ],
>             'vote-threshold': 'int',
>             '*rewrite-corrupted': 'bool',
>             '*read-pattern': 'QuorumReadPattern',
>             '*child-errors-okay': ['str'] } }
> 
> The above example of a 3/5 quorum, where only child E can ignore errors,
> would then be represented as:
> 
> { "children": [ "A", "B", "C", "D", "E" ], 'vote-threshold':3,
> 'child-errors-okay': [ "E" ] }

OK, I will try it.

> 
> The code to ignore the errors is then done in the quorum itself (the BDS
> for E does not have to care about a special ignore-errors property, but
> just always returns the error as usual; and then the quorum is deciding
> how to handle the error), and you are not polluting the BDS state for
> something that is quorum-specific, because it is now the quorum itself
> that tracks the special casing.
> 
> Finally, why can't hot-plug/unplug of quorum members work?  If you are
> going to always ignore errors from a particular child, then why is that
> child even part of the quorum?  Isn't a better design to just not add
> the child to the quorum until it is ready and won't be reporting errors?
> 

Yes. In the early version, I don't use hot-plug/unplug of quorum members, so
the quorum member may be not ready. Now I use hot-plug/unplug of quorum members,
so the quorum's member is ready when it is hot added.  In such case, the quorum
member is not ready after failover. This error is expected, but I want this error
can be ignored, otherwise, there may be too error events...

Thanks
Wen Congyang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 06/16] quorum: allow ignoring child errors
  2015-09-02 16:30   ` Eric Blake
  2015-09-07  3:40     ` Wen Congyang
@ 2015-09-07 16:56     ` Dr. David Alan Gilbert
  2015-09-08  0:46       ` Wen Congyang
  1 sibling, 1 reply; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2015-09-07 16:56 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Fam Zheng, qemu block, Jiang Yunhong, Dong Eddie,
	qemu devel, Michael R. Hines, Max Reitz, Alberto Garcia, Gonglei,
	Stefan Hajnoczi, Paolo Bonzini, Yang Hongyang, zhanghailiang

* Eric Blake (eblake@redhat.com) wrote:
> On 09/02/2015 02:51 AM, Wen Congyang wrote:
> > If the child is not ready, read/write/getlength/flush will
> > return -errno. It is not critical error, and can be ignored:
> > 1. read/write:
> >    Just not report the error event.
> 
> What happens if all the children report an error?  Or is the threshold
> at play here?

I think it's interesting because in the COLO case the intention isn't
really about a threshold (in the way you might use for RAID or mirroring),
it's that one of the stores is local (and not expected to error) and one
is somewhere over a network, so if it fails you don't want to stop
the local VM working.

However, if it fails we do need to know about it; if any write to
the secondary stops then the fault-tolerance has failed (at least for
that drive); so we should do *something* - I'm not sure what though.

Dave


> For example, if you have a threshold of 3/5, then I'm assuming that if
> up to two children return an errno, then it is okay to ignore; but if
> three or more return an errno, you haven't met threshold, so the I/O
> must fail.
> 
> Are you ignoring ALL errors (including things like EACCES), or just EIO
> errors?
> 
> 
> > 2. getlength:
> >    just ignore it. If all children's getlength return -errno,
> >    and be ignored, return -EIO.
> > 3. flush:
> >    Just ignore it. If all children's getlength return -errno,
> 
> s/getlength/flush/
> 
> >    and be ignored, return 0.
> 
> Yuck - claiming success when all of the children fail feels dangerous.
> 
> > 
> > Usage: children.x.ignore-errors=true
> > 
> > Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> > Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> > Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> > Cc: Alberto Garcia <berto@igalia.com>
> > ---
> >  block/quorum.c       | 94 ++++++++++++++++++++++++++++++++++++++++++++++++----
> 
> Interface review only:
> 
> > +++ b/qapi/block-core.json
> > @@ -1411,6 +1411,8 @@
> >  # @allow-write-backing-file: #optional whether the backing file is opened in
> >  #                            read-write mode. It is only for backing file
> >  #                            (Since 2.5 default: false)
> > +# @ignore-errors: #options whether the child's I/O error should be ignored.
> 
> s/options/optional/
> s/error/errors/
> 
> > +#                 it is only for quorum's child.(Since 2.5 default: false)
> 
> Space after '.' in English sentences.
> 
> The fact that you are documenting that this option can only be specified
> for quorum children makes me wonder if it belongs better as an option in
> BlockdevOptionsQuorum rather than BlockdevOptionsBase.
> 
> Semantically, it sounds like you are trying to allow for a per-child
> decision of whether this particular child's errors matter to the overall
> quorum.  So, if we have a 3/5 quorum, we can decide that for children A,
> B, C, and D, errors cannot be ignored, but for child E, errors are not a
> problem.
> 
> As written, you are tying the semantics to each child BDS, and requiring
> special code to double-check that the property is only ever set if the
> BDS is used as the child of a quorum.  Furthermore, if the property is
> set, you are then changing what the child does in response to various
> operations.
> 
> What if you instead create a list property in the quorum parent?  Maybe
> along the lines of:
> 
> # @child-errors-okay: #optional an array of named-node children where
> errors will be ignored (Since 2.5, default empty)
> 
> { 'struct': 'BlockdevOptionsQuorum',
>   'data': { '*blkverify': 'bool',
>             'children': [ 'BlockdevRef' ],
>             'vote-threshold': 'int',
>             '*rewrite-corrupted': 'bool',
>             '*read-pattern': 'QuorumReadPattern',
>             '*child-errors-okay': ['str'] } }
> 
> The above example of a 3/5 quorum, where only child E can ignore errors,
> would then be represented as:
> 
> { "children": [ "A", "B", "C", "D", "E" ], 'vote-threshold':3,
> 'child-errors-okay': [ "E" ] }
> 
> The code to ignore the errors is then done in the quorum itself (the BDS
> for E does not have to care about a special ignore-errors property, but
> just always returns the error as usual; and then the quorum is deciding
> how to handle the error), and you are not polluting the BDS state for
> something that is quorum-specific, because it is now the quorum itself
> that tracks the special casing.
> 
> Finally, why can't hot-plug/unplug of quorum members work?  If you are
> going to always ignore errors from a particular child, then why is that
> child even part of the quorum?  Isn't a better design to just not add
> the child to the quorum until it is ready and won't be reporting errors?
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 06/16] quorum: allow ignoring child errors
  2015-09-07 16:56     ` Dr. David Alan Gilbert
@ 2015-09-08  0:46       ` Wen Congyang
  0 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-08  0:46 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Eric Blake
  Cc: Kevin Wolf, Fam Zheng, zhanghailiang, qemu block, Jiang Yunhong,
	Dong Eddie, qemu devel, Michael R. Hines, Max Reitz,
	Alberto Garcia, Gonglei, Stefan Hajnoczi, Paolo Bonzini,
	Yang Hongyang

On 09/08/2015 12:56 AM, Dr. David Alan Gilbert wrote:
> * Eric Blake (eblake@redhat.com) wrote:
>> On 09/02/2015 02:51 AM, Wen Congyang wrote:
>>> If the child is not ready, read/write/getlength/flush will
>>> return -errno. It is not critical error, and can be ignored:
>>> 1. read/write:
>>>    Just not report the error event.
>>
>> What happens if all the children report an error?  Or is the threshold
>> at play here?
> 
> I think it's interesting because in the COLO case the intention isn't
> really about a threshold (in the way you might use for RAID or mirroring),
> it's that one of the stores is local (and not expected to error) and one
> is somewhere over a network, so if it fails you don't want to stop
> the local VM working.
> 
> However, if it fails we do need to know about it; if any write to
> the secondary stops then the fault-tolerance has failed (at least for
> that drive); so we should do *something* - I'm not sure what though.

The filter driver replication catches this error, and the COLO framework
will get this error when doing checkpoint.

Thanks
Wen Congyang

> 
> Dave
> 
> 
>> For example, if you have a threshold of 3/5, then I'm assuming that if
>> up to two children return an errno, then it is okay to ignore; but if
>> three or more return an errno, you haven't met threshold, so the I/O
>> must fail.
>>
>> Are you ignoring ALL errors (including things like EACCES), or just EIO
>> errors?
>>
>>
>>> 2. getlength:
>>>    just ignore it. If all children's getlength return -errno,
>>>    and be ignored, return -EIO.
>>> 3. flush:
>>>    Just ignore it. If all children's getlength return -errno,
>>
>> s/getlength/flush/
>>
>>>    and be ignored, return 0.
>>
>> Yuck - claiming success when all of the children fail feels dangerous.
>>
>>>
>>> Usage: children.x.ignore-errors=true
>>>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>>> Cc: Alberto Garcia <berto@igalia.com>
>>> ---
>>>  block/quorum.c       | 94 ++++++++++++++++++++++++++++++++++++++++++++++++----
>>
>> Interface review only:
>>
>>> +++ b/qapi/block-core.json
>>> @@ -1411,6 +1411,8 @@
>>>  # @allow-write-backing-file: #optional whether the backing file is opened in
>>>  #                            read-write mode. It is only for backing file
>>>  #                            (Since 2.5 default: false)
>>> +# @ignore-errors: #options whether the child's I/O error should be ignored.
>>
>> s/options/optional/
>> s/error/errors/
>>
>>> +#                 it is only for quorum's child.(Since 2.5 default: false)
>>
>> Space after '.' in English sentences.
>>
>> The fact that you are documenting that this option can only be specified
>> for quorum children makes me wonder if it belongs better as an option in
>> BlockdevOptionsQuorum rather than BlockdevOptionsBase.
>>
>> Semantically, it sounds like you are trying to allow for a per-child
>> decision of whether this particular child's errors matter to the overall
>> quorum.  So, if we have a 3/5 quorum, we can decide that for children A,
>> B, C, and D, errors cannot be ignored, but for child E, errors are not a
>> problem.
>>
>> As written, you are tying the semantics to each child BDS, and requiring
>> special code to double-check that the property is only ever set if the
>> BDS is used as the child of a quorum.  Furthermore, if the property is
>> set, you are then changing what the child does in response to various
>> operations.
>>
>> What if you instead create a list property in the quorum parent?  Maybe
>> along the lines of:
>>
>> # @child-errors-okay: #optional an array of named-node children where
>> errors will be ignored (Since 2.5, default empty)
>>
>> { 'struct': 'BlockdevOptionsQuorum',
>>   'data': { '*blkverify': 'bool',
>>             'children': [ 'BlockdevRef' ],
>>             'vote-threshold': 'int',
>>             '*rewrite-corrupted': 'bool',
>>             '*read-pattern': 'QuorumReadPattern',
>>             '*child-errors-okay': ['str'] } }
>>
>> The above example of a 3/5 quorum, where only child E can ignore errors,
>> would then be represented as:
>>
>> { "children": [ "A", "B", "C", "D", "E" ], 'vote-threshold':3,
>> 'child-errors-okay': [ "E" ] }
>>
>> The code to ignore the errors is then done in the quorum itself (the BDS
>> for E does not have to care about a special ignore-errors property, but
>> just always returns the error as usual; and then the quorum is deciding
>> how to handle the error), and you are not polluting the BDS state for
>> something that is quorum-specific, because it is now the quorum itself
>> that tracks the special casing.
>>
>> Finally, why can't hot-plug/unplug of quorum members work?  If you are
>> going to always ignore errors from a particular child, then why is that
>> child even part of the quorum?  Isn't a better design to just not add
>> the child to the quorum until it is ready and won't be reporting errors?
>>
>> -- 
>> Eric Blake   eblake redhat com    +1-919-301-3266
>> Libvirt virtualization library http://libvirt.org
>>
> 
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> .
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 10/16] docs: block replication's description
  2015-09-02 20:41   ` Eric Blake
@ 2015-09-09  8:22     ` Wen Congyang
  0 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-09  8:22 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang

On 09/03/2015 04:41 AM, Eric Blake wrote:
> On 09/02/2015 02:51 AM, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>  docs/block-replication.txt | 183 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 183 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
> 
> 
>> +
>> +    1) Primary write requests will be copied and forwarded to Secondary
>> +       QEMU.
>> +    2) Before Primary write requests are written to Secondary disk, the
>> +       original sector content will be read from Secondary disk and
>> +       buffered in the Disk buffer, but it will not overwrite the existing
>> +       sector content(it could be from either "Secondary Write Requests" or
> 
> space before '(' in English sentences.
> 
>> +       previous COW of "Primary Write Requests") in the Disk buffer.
>> +    3) Primary write requests will be written to Secondary disk.
>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>> +       will overwrite the existing sector content in the buffer.
>> +
>> +== Architecture ==
> 
>> +                3 NBD  ------->  3 NBD                                               |
>> +                client    ||     server                                          2 filter
>> +                          ||        ^                                                ^
>> +--------.                 ||        |                                                |
>> +Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
>> +--------'                 ||        |          backing        ^       backing
>> +                          ||        |                         |
>> +                          ||        |                         |
>> +                          ||        '-------------------------'
>> +                          ||           drive-backup sync=none
>> +
> 
>> +
>> +4) The disk on the secondary is represented by a custom block device
>> +(called active-disk). It should be an empty disk, and the format should
>> +support bdrv_make_empty() and backing file.
> 
> s/be an empty disk/start as an empty disk/
> 
>> +
>> +5) The hidden-disk is created automatically. It buffers the original content
>> +that is modified by the primary VM. It should also be an empty disk, and
> 
> s/be/start as/
> 
>> +the driver supports bdrv_make_empty() and backing file.
> 
> Missing mention that a drive-backup job is run to allow hidden-disk to
> buffer any state that would otherwise be lost by the speculative
> write-through of the NBD server into the secondary disk.
> 
>> +
>> +== Failure Handling ==
>> +There are 6 internal errors when block replication is running:
>> +1. I/O error on primary disk
>> +2. Forwarding primary write requests failed
>> +3. Backup failed
>> +4. I/O error on secondary disk
>> +5. I/O error on active disk
>> +6. Making active disk or hidden disk empty failed
>> +In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
>> +4 and 6, we just report block replication's error to FT/HA manager(which
> 
> space before '('
> 
>> +decides when to do a new checkpoint, when to do failover).
>> +There is one internal error when doing failover:
>> +1. Commiting the data in active disk/hidden disk to secondary disk failed
> 
> s/Commiting/Committing/
> 
>> +We just to report this error to FT/HA manager.
>> +
>> +== New block driver interface ==
> 
>> +
>> +== Usage ==
>> +Primary:
>> +  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1\
>> +         children.0.file.filename=1.raw,\
>> +         children.0.driver=raw,\
>> +
>> +  Run qmp command in primary qemu:
>> +    child_add disk1 child.driver=replication,child.mode=primary,\
>> +              child.file.host=xxx,child.file.port=xxx,\
>> +              child.file.driver=nbd,child.ignore-errors=on
> 
> My comments earlier in this series mean this step should be two QMP
> commands: the first is blockdev-add to create an unassociated BDS, the
> second to then add that BDS into the quorum.
> 
>> +  Note:
>> +  1. There should be only one NBD Client for each primary disk.
>> +  2. host is the secondary physical machine's hostname or IP
>> +  3. Each disk must have its own export name.
>> +  4. It is all a single argument to -drive and child_add, and you should
>> +     ignore the leading whitespace.
>> +  5. The qmp command line must be run after running qmp command line in
>> +     secondary qemu.
>> +
>> +Secondary:
>> +  -drive if=none,driver=raw,file=1.raw,id=colo1 \
>> +  -drive if=xxx,driver=replication,mode=secondary,\
>> +         file.file.filename=active_disk.qcow2,\
>> +         file.driver=qcow2,\
>> +         file.backing.file.filename=hidden_disk.qcow2,\
>> +         file.backing.driver=qcow2,\
>> +         file.backing.allow-write-backing-file=on,\
>> +         file.backing.backing.backing_reference=colo1\
>> +
>> +  Then run qmp command in secondary qemu:
>> +    nbd-server-start host:port
>> +    nbd-server-add -w colo1
>> +
>> +  Note:
>> +  1. The export name in secondary QEMU command line is the secondary
>> +     disk's id.
>> +  2. The export name for the same disk must be the same
>> +  3. The qmp command nbd-server-start and nbd-server-add must be run
>> +     before running the qmp command migrate on primary QEMU
>> +  4. Don't use nbd-server-start's other options
>> +  5. Active disk, hidden disk and nbd target's length should be the
>> +     same.
>> +  6. It is better to put active disk and hidden disk in ramdisk.
>> +  7. It is all a single argument to -drive, and you should ignore
>> +     the leading whitespace.
> 
> Missing: document the steps taken during failover (that is, how do I
> promote a Secondary into a new Primary, and then attach a new Secondary
> to that point).  In particular, I suspect there may be differences

Continuous block replication is in the TODO list. But I think it is very
easy to implement it if the quorum's child can be hot-added/removed.

> between whether you want to roll back to the state of the last
> checkpoint (in hidden_disk) or just go with the current state of the

For periodic checkpoint, the secondary vm is not running, so just commit
hidden_disk to secondary disk.
For COLO, the secondary vm is running, and we need this state, so just commit
active disk to secondary disk(hidden_disk is also committed).

In which case, do we need to drop secondary disk and commit hidden disk?

Thanks
Wen Congyang

> Secondary (in Active); either way, it probably involves doing an active
> commit of the state you want into Secondary, then the formation of a new
> quorum to start handing replication data off through a new NBD client
> connection.
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 15/16] support replication driver in blockdev-add
  2015-09-02 16:36   ` Eric Blake
@ 2015-09-09  8:27     ` Wen Congyang
  0 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-09  8:27 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Yang Hongyang

On 09/03/2015 12:36 AM, Eric Blake wrote:
> On 09/02/2015 02:51 AM, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> ---
>>  qapi/block-core.json | 18 ++++++++++++++++--
>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 96f0530..86275e3 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -1383,7 +1383,7 @@
>>              'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
>>              'host_floppy', 'http', 'https', 'null-aio', 'null-co', 'parallels',
>>              'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
>> -            'vmdk', 'vpc', 'vvfat', 'nbd' ] }
>> +            'vmdk', 'vpc', 'vvfat', 'nbd', 'replication' ] }
> 
> 'nbd' is not in the current qemu.git; which means your patch series
> depends on a prerequisite series.  Please mention that sort of
> information in your cover letter.
> 
> Please keep this enum in alphabetical order.
> 
> Missing documentation under @drv of BlockDeviceInfo that this was
> introduced in 2.5.

OK

> 
>>  
>>  ##
>>  # @BlockdevOptionsBase
>> @@ -1825,6 +1825,19 @@
>>  { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
>>  
>>  ##
>> +# @BlockdevOptionsReplication
>> +#
>> +# Driver specific block device options for replication
>> +#
>> +# @mode: the replication mode
> 
> Can the mode be 'unprotected', or must it be 'primary' or 'secondary'
> when first creating a replication BDS?

I will check it, and fix it in the next version.

> 
>> +#
>> +# Since: 2.5
>> +##
>> +{ 'struct': 'BlockdevOptionsReplication',
>> +  'base': 'BlockdevOptionsGenericFormat',
>> +  'data': { 'mode': 'ReplicationMode'  } }
>> +
>> +##
>>  # @BlockdevOptions
>>  #
>>  # Options for creating a block device.
>> @@ -1869,7 +1882,8 @@
>>        'vhdx':       'BlockdevOptionsGenericFormat',
>>        'vmdk':       'BlockdevOptionsGenericCOWFormat',
>>        'vpc':        'BlockdevOptionsGenericFormat',
>> -      'vvfat':      'BlockdevOptionsVVFAT'
>> +      'vvfat':      'BlockdevOptionsVVFAT',
>> +      'replication':'BlockdevOptionsReplication'
>>    } }
> 
> It helps to keep this alphabetical.
> 

OK

Thanks
Wen Congyang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 04/16] block: Allow references for backing files
  2015-09-02 18:50   ` Eric Blake
@ 2015-09-09  8:51     ` Wen Congyang
  0 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-09  8:51 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang

On 09/03/2015 02:50 AM, Eric Blake wrote:
> On 09/02/2015 02:51 AM, Wen Congyang wrote:
>> Usage:
>> -drive file=xxx,id=Y, \
>> -drive file=xxxx,id=X,backing.backing_reference=Y
>>
>> It will create such backing chain:
>>                {virtio-blk dev 'Y'}      {virtio-blk dev 'X'}
>>                          |                         |
>>                          |                         |
>>                          v                         v
>>
>>     [base] <- [mid] <- ( Y )  <----------------- ( X )
> 
> This makes any changes to 'Y' have unspecified effects on 'X'.  While we
> may have a valid reason to use a backing BDS in more than one chain, I
> seriously doubt anyone will ever want to have two guest-visible -drive's
> that are both read-write where one can corrupt the other.  I can totally
> see the point of having BDS 'Y' exist for checkpoints or some other
> non-guest-visible action, so I'm not saying this patch is wrong, just
> that the commit message is picking a poor example of how it would be used.

Sorry, this graph is wrong. virtio-blk dev 'Y' can not use the BDS Y if it
is referenced by X. That is why I need patch 1 and 2.

Thanks
Wen Congyang

> 
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>  block.c               | 39 +++++++++++++++++++++++++++++++++++----
>>  include/block/block.h |  1 +
>>  2 files changed, 36 insertions(+), 4 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index d02d9e9..f93c50d 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -1179,6 +1179,7 @@ out:
>>  }
>>  
>>  #define ALLOW_WRITE_BACKING_FILE    "allow-write-backing-file"
>> +#define BACKING_REFERENCE           "backing_reference"
> 
> Why the inconsistency in '-' vs. '_'?  I'd stick with dash here.
> 
>>  static QemuOptsList backing_file_opts = {
>>      .name = "backing_file",
>>      .head = QTAILQ_HEAD_INITIALIZER(backing_file_opts.head),
>> @@ -1188,6 +1189,11 @@ static QemuOptsList backing_file_opts = {
>>              .type = QEMU_OPT_BOOL,
>>              .help = "allow write to backing file",
>>          },
>> +        {
>> +            .name = BACKING_REFERENCE,
>> +            .type = QEMU_OPT_STRING,
>> +            .help = "reference to the exsiting BDS",
> 
> s/exsiting/existing/
> 
> But why do we need this? In qapi, BlockdevOptionsGenericCOWFormat
> already has '*backing':'BlockdevRef', and BlockdevRef already has a
> choice between 'definition' (object) and 'reference' (string).  Or is
> this just a matter of teaching the command line to do what QMP can
> already do?  In which case, wouldn't:
> 
> -drive file=xxx,id=Y, -drive file=xxxx,id=X,backing=Y
> 
> be the natural mapping of 'backing' being a string rather than a dictionary?
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 03/16] allow writing to the backing file
  2015-09-02 16:06   ` Eric Blake
@ 2015-09-09  9:19     ` Wen Congyang
  0 siblings, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-09  9:19 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Gonglei, Yang Hongyang

On 09/03/2015 12:06 AM, Eric Blake wrote:
> On 09/02/2015 02:51 AM, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> 
> Not much description in the commit message.  I really want an
> explanation of why this patch is necessary.  After all, with

For COLO, we have such backing chain:
secondary disk <-- hidden disk <-- active disk
secondary disk is top BDS(use bacing reference), so it can be opened in read-write mode.
But hidden disk is read only, and we need to write to hidden disk(backup job
will write data to it).

> 'block-commit', we were able to turn on read-write mode of backing files
> on an as-needed basis, without having to expose that to the end user.
> Giving the end user a knob that they must tune feels a bit awkward, and
> probably means we don't have the design right.
> 
>> ---
>>  block.c              | 41 ++++++++++++++++++++++++++++++++++++++++-
>>  qapi/block-core.json |  7 ++++++-
>>  2 files changed, 46 insertions(+), 2 deletions(-)
>>
> 
>> +#define ALLOW_WRITE_BACKING_FILE    "allow-write-backing-file"
>> +static QemuOptsList backing_file_opts = {
>> +    .name = "backing_file",
>> +    .head = QTAILQ_HEAD_INITIALIZER(backing_file_opts.head),
>> +    .desc = {
>> +        {
>> +            .name = ALLOW_WRITE_BACKING_FILE,
>> +            .type = QEMU_OPT_BOOL,
>> +            .help = "allow write to backing file",
> 
> If you do add more justification for why this patch is necessary, then,
> 
> s/write/writes/
> 
>> +++ b/qapi/block-core.json
>> @@ -1408,6 +1408,10 @@
>>  # @detect-zeroes: #optional detect and optimize zero writes (Since 2.1)
>>  #                 (default: off)
>>  #
>> +# @allow-write-backing-file: #optional whether the backing file is opened in
>> +#                            read-write mode. It is only for backing file
>> +#                            (Since 2.5 default: false)
>> +#
> 
> The name feels a bit long.
> 
> It sounds like it is an error to pass allow-write-backing-file for a
> top-level BDS (that is, the BDS associated with a BB).  Meanwhile, the
> default for any backing chain BDS is to open it read-only, regardless of
> the 'read-only' setting of the parent.  But can we just allow
> 'read-only':false on a backing BDS to mean that the BDS starts life as
> read-write, without having to add a new parameter?
> 

We have discussed it before:
http://lists.nongnu.org/archive/html/qemu-devel/2015-02/msg04468.html

Thanks
Wen Congyang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 11/16] Add new block driver interfaces to control block replication
  2015-09-02 16:33   ` Eric Blake
@ 2015-09-09  9:24     ` Wen Congyang
  2015-09-25  6:14     ` Wen Congyang
  1 sibling, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-09  9:24 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, Michael Roth, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Luiz Capitulino,
	Gonglei, Yang Hongyang, zhanghailiang

On 09/03/2015 12:33 AM, Eric Blake wrote:
> On 09/02/2015 02:51 AM, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Cc: Luiz Capitulino <lcapitulino@redhat.com>
>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>  block.c                   | 43 +++++++++++++++++++++++++++++++++++++++++++
>>  include/block/block.h     |  5 +++++
>>  include/block/block_int.h | 14 ++++++++++++++
>>  qapi/block-core.json      | 15 +++++++++++++++
>>  4 files changed, 77 insertions(+)
>>
> 
> Just an interface review for now:
> 
>> +++ b/qapi/block-core.json
>> @@ -1810,6 +1810,21 @@
>>    'data': { '*export': 'str' } }
>>  
>>  ##
>> +# @ReplicationMode
>> +#
>> +# An enumeration of replication modes.
>> +#
>> +# @unprotected: Replication is not started or after failover.
> 
> Maybe:
> 
> Replication is either not started, or has experienced failover.

OK

> 
>> +#
>> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
>> +#
>> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
>> +#
>> +# Since: 2.4
> 
> You've missed 2.4; this should be 2.5.

Yes, I forgot to update it. I will check all patches.

> 
>> +##
>> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
> 
> Where is 'unprotected' in this list?
> 

I don't know when it is removed. I will check it and fix it in the next version.

Thanks
Wen Congyang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH 11/16] Add new block driver interfaces to control block replication
  2015-09-02 16:33   ` Eric Blake
  2015-09-09  9:24     ` Wen Congyang
@ 2015-09-25  6:14     ` Wen Congyang
  1 sibling, 0 replies; 39+ messages in thread
From: Wen Congyang @ 2015-09-25  6:14 UTC (permalink / raw)
  To: Eric Blake, qemu devel, Fam Zheng, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi
  Cc: Kevin Wolf, Michael Roth, qemu block, Jiang Yunhong, Dong Eddie,
	Dr. David Alan Gilbert, Michael R. Hines, Luiz Capitulino,
	Gonglei, Yang Hongyang, zhanghailiang

On 09/03/2015 12:33 AM, Eric Blake wrote:
> On 09/02/2015 02:51 AM, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Cc: Luiz Capitulino <lcapitulino@redhat.com>
>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>  block.c                   | 43 +++++++++++++++++++++++++++++++++++++++++++
>>  include/block/block.h     |  5 +++++
>>  include/block/block_int.h | 14 ++++++++++++++
>>  qapi/block-core.json      | 15 +++++++++++++++
>>  4 files changed, 77 insertions(+)
>>
> 
> Just an interface review for now:
> 
>> +++ b/qapi/block-core.json
>> @@ -1810,6 +1810,21 @@
>>    'data': { '*export': 'str' } }
>>  
>>  ##
>> +# @ReplicationMode
>> +#
>> +# An enumeration of replication modes.
>> +#
>> +# @unprotected: Replication is not started or after failover.
> 
> Maybe:
> 
> Replication is either not started, or has experienced failover.

This is internal state, and this mode is used to tell qemu that
it is on which side.

Thanks
Wen Congyang

> 
>> +#
>> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
>> +#
>> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
>> +#
>> +# Since: 2.4
> 
> You've missed 2.4; this should be 2.5.
> 
>> +##
>> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
> 
> Where is 'unprotected' in this list?
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2015-09-25  6:15 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-02  8:51 [Qemu-devel] [PATCH 00/16] Block replication for continuous checkpoints Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 01/16] introduce a new API to enable/disable attach device model Wen Congyang
2015-09-02 15:37   ` Eric Blake
2015-09-07  1:27     ` Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 02/16] introduce a new API to check if blk is attached Wen Congyang
2015-09-02 15:40   ` Eric Blake
2015-09-02  8:51 ` [Qemu-devel] [PATCH 03/16] allow writing to the backing file Wen Congyang
2015-09-02 16:06   ` Eric Blake
2015-09-09  9:19     ` Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 04/16] block: Allow references for backing files Wen Congyang
2015-09-02 18:50   ` Eric Blake
2015-09-09  8:51     ` Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 05/16] introduce a new API qemu_opts_absorb_qdict_by_index() Wen Congyang
2015-09-02 19:01   ` Eric Blake
2015-09-07  2:18     ` Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 06/16] quorum: allow ignoring child errors Wen Congyang
2015-09-02 16:30   ` Eric Blake
2015-09-07  3:40     ` Wen Congyang
2015-09-07 16:56     ` Dr. David Alan Gilbert
2015-09-08  0:46       ` Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 07/16] Backup: clear all bitmap when doing block checkpoint Wen Congyang
2015-09-02 14:10   ` Jeff Cody
2015-09-02  8:51 ` [Qemu-devel] [PATCH 08/16] block: make bdrv_put_ref_bh_schedule() as a public API Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 09/16] Allow creating backup jobs when opening BDS Wen Congyang
2015-09-02 14:12   ` Jeff Cody
2015-09-02  8:51 ` [Qemu-devel] [PATCH 10/16] docs: block replication's description Wen Congyang
2015-09-02 20:41   ` Eric Blake
2015-09-09  8:22     ` Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 11/16] Add new block driver interfaces to control block replication Wen Congyang
2015-09-02 16:33   ` Eric Blake
2015-09-09  9:24     ` Wen Congyang
2015-09-25  6:14     ` Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 12/16] skip nbd_target when starting " Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 13/16] quorum: implement block driver interfaces for " Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 14/16] Implement new driver " Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 15/16] support replication driver in blockdev-add Wen Congyang
2015-09-02 16:36   ` Eric Blake
2015-09-09  8:27     ` Wen Congyang
2015-09-02  8:51 ` [Qemu-devel] [PATCH 16/16] Add a new API to start/stop replication, do checkpoint to all BDSes Wen Congyang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.