All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v12 0/3] qapi: child add/delete support
@ 2016-03-10  2:49 Changlong Xie
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 1/3] Add new block driver interface to add/delete a BDS's child Changlong Xie
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Changlong Xie @ 2016-03-10  2:49 UTC (permalink / raw)
  To: qemu devel, Eric Blake, Alberto Garcia, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert

ChangLog:
v11~v12:
1. Address comments from Max
p1. Add R-B
p2. Add R-B, remove unnecessary "endptr" "value"
p3. Add R-B

v10~v11:
1. Rebase to the newest codes
2. Address comment from Max
Don't use contractions in error messages,
p1: Remove R-Bs, and use "BdrvChild *child" in bdrv_del_child
p2: Fix error logic in get_new_child_index/remove_child_index, and prefect
child->name parsing
p3: Make bdrv_find_child return BdrvChild *, and add missing explanation

v9~v10:
1. Rebase to the newest codes
2. Address comments from Berto and Max, update the documents in block-core.json 
   and qmp-commands.hx 
3. Remove redundant codes in quorum_add_child() and quorum_del_child()
v8:
1. Rebase to the newest codes
2. Address the comments from Eric Blake
v7:
1. Remove the qmp command x-blockdev-change's parameter operation according
   to Kevin's comments.
2. Remove the hmp command.
v6:
1. Use a single qmp command x-blockdev-change to replace x-blockdev-child-add
   and x-blockdev-child-delete
v5:
1. Address Eric Blake's comments
v4:
1. drop nbd driver's implementation. We can use human-monitor-command
   to do it.
2. Rename the command name.
v3:
1. Don't open BDS in bdrv_add_child(). Use the existing BDS which is
   created by the QMP command blockdev-add.
2. The driver NBD can support filename, path, host:port now.
v2:
1. Use bdrv_get_device_or_node_name() instead of new function
   bdrv_get_id_or_node_name()
2. Update the error message
3. Update the documents in block-core.json

Wen Congyang (3):
  Add new block driver interface to add/delete a BDS's child
  quorum: implement bdrv_add_child() and bdrv_del_child()
  qmp: add monitor command to add/remove a child

 block.c                   |  57 ++++++++++++++++++++--
 block/quorum.c            | 121 +++++++++++++++++++++++++++++++++++++++++++++-
 blockdev.c                |  55 +++++++++++++++++++++
 include/block/block.h     |   8 +++
 include/block/block_int.h |   5 ++
 qapi/block-core.json      |  32 ++++++++++++
 qmp-commands.hx           |  54 +++++++++++++++++++++
 7 files changed, 326 insertions(+), 6 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v12 1/3] Add new block driver interface to add/delete a BDS's child
  2016-03-10  2:49 [Qemu-devel] [PATCH v12 0/3] qapi: child add/delete support Changlong Xie
@ 2016-03-10  2:49 ` Changlong Xie
  2016-03-10 14:57   ` Alberto Garcia
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child() Changlong Xie
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 3/3] qmp: add monitor command to add/remove a child Changlong Xie
  2 siblings, 1 reply; 42+ messages in thread
From: Changlong Xie @ 2016-03-10  2:49 UTC (permalink / raw)
  To: qemu devel, Eric Blake, Alberto Garcia, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert, Gonglei, zhanghailiang

From: Wen Congyang <wency@cn.fujitsu.com>

In some cases, we want to take a quorum child offline, and take
another child online.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c                   | 49 +++++++++++++++++++++++++++++++++++++++++++++++
 include/block/block.h     |  4 ++++
 include/block/block_int.h |  5 +++++
 3 files changed, 58 insertions(+)

diff --git a/block.c b/block.c
index ba24b8e..d48f441 100644
--- a/block.c
+++ b/block.c
@@ -4395,3 +4395,52 @@ void bdrv_refresh_filename(BlockDriverState *bs)
         QDECREF(json);
     }
 }
+
+/*
+ * Hot add/remove a BDS's child. So the user can take a child offline when
+ * it is broken and take a new child online
+ */
+void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
+                    Error **errp)
+{
+
+    if (!parent_bs->drv || !parent_bs->drv->bdrv_add_child) {
+        error_setg(errp, "The node %s does not support adding a child",
+                   bdrv_get_device_or_node_name(parent_bs));
+        return;
+    }
+
+    if (!QLIST_EMPTY(&child_bs->parents)) {
+        error_setg(errp, "The node %s already has a parent",
+                   child_bs->node_name);
+        return;
+    }
+
+    parent_bs->drv->bdrv_add_child(parent_bs, child_bs, errp);
+}
+
+void bdrv_del_child(BlockDriverState *parent_bs, BdrvChild *child, Error **errp)
+{
+    BdrvChild *tmp;
+
+    if (!parent_bs->drv || !parent_bs->drv->bdrv_del_child) {
+        error_setg(errp, "The node %s does not support removing a child",
+                   bdrv_get_device_or_node_name(parent_bs));
+        return;
+    }
+
+    QLIST_FOREACH(tmp, &parent_bs->children, next) {
+        if (tmp == child) {
+            break;
+        }
+    }
+
+    if (!tmp) {
+        error_setg(errp, "The node %s does not have child named %s",
+                   bdrv_get_device_or_node_name(parent_bs),
+                   bdrv_get_device_or_node_name(child->bs));
+        return;
+    }
+
+    parent_bs->drv->bdrv_del_child(parent_bs, child, errp);
+}
diff --git a/include/block/block.h b/include/block/block.h
index 1c4f4d8..7378e74 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -582,4 +582,8 @@ void bdrv_drained_begin(BlockDriverState *bs);
  */
 void bdrv_drained_end(BlockDriverState *bs);
 
+void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child,
+                    Error **errp);
+void bdrv_del_child(BlockDriverState *parent, BdrvChild *child, Error **errp);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 9ef823a..704efe5 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -305,6 +305,11 @@ struct BlockDriver {
      */
     void (*bdrv_drain)(BlockDriverState *bs);
 
+    void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
+                           Error **errp);
+    void (*bdrv_del_child)(BlockDriverState *parent, BdrvChild *child,
+                           Error **errp);
+
     QLIST_ENTRY(BlockDriver) list;
 };
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-10  2:49 [Qemu-devel] [PATCH v12 0/3] qapi: child add/delete support Changlong Xie
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 1/3] Add new block driver interface to add/delete a BDS's child Changlong Xie
@ 2016-03-10  2:49 ` Changlong Xie
  2016-03-11 12:21   ` Alberto Garcia
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 3/3] qmp: add monitor command to add/remove a child Changlong Xie
  2 siblings, 1 reply; 42+ messages in thread
From: Changlong Xie @ 2016-03-10  2:49 UTC (permalink / raw)
  To: qemu devel, Eric Blake, Alberto Garcia, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert, Gonglei, zhanghailiang

From: Wen Congyang <wency@cn.fujitsu.com>

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c               |   8 ++--
 block/quorum.c        | 121 +++++++++++++++++++++++++++++++++++++++++++++++++-
 include/block/block.h |   4 ++
 3 files changed, 127 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index d48f441..66d21af 100644
--- a/block.c
+++ b/block.c
@@ -1194,10 +1194,10 @@ static int bdrv_fill_options(QDict **options, const char *filename,
     return 0;
 }
 
-static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
-                                    BlockDriverState *child_bs,
-                                    const char *child_name,
-                                    const BdrvChildRole *child_role)
+BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+                             BlockDriverState *child_bs,
+                             const char *child_name,
+                             const BdrvChildRole *child_role)
 {
     BdrvChild *child = g_new(BdrvChild, 1);
     *child = (BdrvChild) {
diff --git a/block/quorum.c b/block/quorum.c
index 11cc60b..97f030b 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #include "qapi/qmp/qstring.h"
 #include "qapi-event.h"
 #include "crypto/hash.h"
+#include "qemu/bitmap.h"
 
 #define HASH_LENGTH 32
 
@@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
     bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
                             * block if Quorum is reached.
                             */
+    unsigned long *index_bitmap;
+    int bsize;
 
     QuorumReadPattern read_pattern;
 } BDRVQuorumState;
@@ -877,9 +880,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
         ret = -EINVAL;
         goto exit;
     }
-    if (s->num_children < 2) {
+    if (s->num_children < 1) {
         error_setg(&local_err,
-                   "Number of provided children must be greater than 1");
+                   "Number of provided children must be 1 or more");
         ret = -EINVAL;
         goto exit;
     }
@@ -928,6 +931,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     /* allocate the children array */
     s->children = g_new0(BdrvChild *, s->num_children);
     opened = g_new0(bool, s->num_children);
+    s->index_bitmap = bitmap_new(s->num_children);
 
     for (i = 0; i < s->num_children; i++) {
         char indexstr[32];
@@ -943,6 +947,8 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
 
         opened[i] = true;
     }
+    bitmap_set(s->index_bitmap, 0, s->num_children);
+    s->bsize = s->num_children;
 
     g_free(opened);
     goto exit;
@@ -999,6 +1005,114 @@ static void quorum_attach_aio_context(BlockDriverState *bs,
     }
 }
 
+static int get_new_child_index(BDRVQuorumState *s)
+{
+    int index;
+
+    index = find_next_zero_bit(s->index_bitmap, s->bsize, 0);
+    if (index < s->bsize) {
+        return index;
+    }
+
+    s->index_bitmap = bitmap_zero_extend(s->index_bitmap, s->bsize,
+                                         s->bsize + 1);
+    return s->bsize++;
+}
+
+static void remove_child_index(BDRVQuorumState *s, int index)
+{
+    int last_index, old_bsize;
+    size_t new_len;
+
+    assert(index < s->bsize);
+
+    clear_bit(index, s->index_bitmap);
+    if (index < s->bsize - 1) {
+        /* The last bit is always set */
+        return;
+    }
+
+    /* Clear last bit */
+    old_bsize = s->bsize;
+    last_index = find_last_bit(s->index_bitmap, s->bsize);
+    assert(last_index < old_bsize);
+    s->bsize = last_index + 1;
+
+    if (BITS_TO_LONGS(old_bsize) == BITS_TO_LONGS(s->bsize)) {
+        return;
+    }
+
+    new_len = BITS_TO_LONGS(s->bsize) * sizeof(unsigned long);
+    s->index_bitmap = g_realloc(s->index_bitmap, new_len);
+}
+
+static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
+                             Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    BdrvChild *child;
+    char indexstr[32];
+    int index, ret;
+
+    index = get_new_child_index(s);
+    ret = snprintf(indexstr, 32, "children.%d", index);
+    if (ret < 0 || ret >= 32) {
+        error_setg(errp, "cannot generate child name");
+        return;
+    }
+
+    bdrv_drain(bs);
+
+    assert(s->num_children <= INT_MAX / sizeof(BdrvChild *));
+    if (s->num_children == INT_MAX / sizeof(BdrvChild *)) {
+        error_setg(errp, "Too many children");
+        return;
+    }
+    s->children = g_renew(BdrvChild *, s->children, s->num_children + 1);
+
+    bdrv_ref(child_bs);
+    child = bdrv_attach_child(bs, child_bs, indexstr, &child_format);
+    s->children[s->num_children++] = child;
+    set_bit(index, s->index_bitmap);
+}
+
+static void quorum_del_child(BlockDriverState *bs, BdrvChild *child,
+                             Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int i, rc;
+    unsigned long index;
+
+    for (i = 0; i < s->num_children; i++) {
+        if (s->children[i] == child) {
+            break;
+        }
+    }
+
+    /* we have checked it in bdrv_del_child() */
+    assert(i < s->num_children);
+
+    if (s->num_children <= s->threshold) {
+        error_setg(errp,
+            "The number of children cannot be lower than the vote threshold %d",
+            s->threshold);
+        return;
+    }
+
+    /* child->name is "children.%d" */
+    assert(!strncmp(child->name, "children.", 9));
+    rc = qemu_strtoul(child->name + 9, NULL, 10, &index);
+    assert(!rc && index < INT_MAX / sizeof(BdrvChild *));
+
+    bdrv_drain(bs);
+    /* We can safely remove this child now */
+    memmove(&s->children[i], &s->children[i + 1],
+            (s->num_children - i - 1) * sizeof(void *));
+    s->children = g_renew(BdrvChild *, s->children, --s->num_children);
+    remove_child_index(s, index);
+    bdrv_unref_child(bs, child);
+}
+
 static void quorum_refresh_filename(BlockDriverState *bs, QDict *options)
 {
     BDRVQuorumState *s = bs->opaque;
@@ -1054,6 +1168,9 @@ static BlockDriver bdrv_quorum = {
     .bdrv_detach_aio_context            = quorum_detach_aio_context,
     .bdrv_attach_aio_context            = quorum_attach_aio_context,
 
+    .bdrv_add_child                     = quorum_add_child,
+    .bdrv_del_child                     = quorum_del_child,
+
     .is_filter                          = true,
     .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
 };
diff --git a/include/block/block.h b/include/block/block.h
index 7378e74..8a3966d 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -517,6 +517,10 @@ void bdrv_disable_copy_on_read(BlockDriverState *bs);
 void bdrv_ref(BlockDriverState *bs);
 void bdrv_unref(BlockDriverState *bs);
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
+BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+                             BlockDriverState *child_bs,
+                             const char *child_name,
+                             const BdrvChildRole *child_role);
 
 bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp);
 void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v12 3/3] qmp: add monitor command to add/remove a child
  2016-03-10  2:49 [Qemu-devel] [PATCH v12 0/3] qapi: child add/delete support Changlong Xie
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 1/3] Add new block driver interface to add/delete a BDS's child Changlong Xie
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child() Changlong Xie
@ 2016-03-10  2:49 ` Changlong Xie
  2016-03-11 12:48   ` Alberto Garcia
  2016-03-28  6:09   ` Changlong Xie
  2 siblings, 2 replies; 42+ messages in thread
From: Changlong Xie @ 2016-03-10  2:49 UTC (permalink / raw)
  To: qemu devel, Eric Blake, Alberto Garcia, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert, Gonglei, zhanghailiang

From: Wen Congyang <wency@cn.fujitsu.com>

The new QMP command name is x-blockdev-change. It's just for adding/removing
quorum's child now, and doesn't support all kinds of children, all kinds of
operations, nor all block drivers. So it is experimental now.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 blockdev.c           | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 qapi/block-core.json | 32 ++++++++++++++++++++++++++++++
 qmp-commands.hx      | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 141 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index 0f20c65..435631e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -4017,6 +4017,61 @@ out:
     aio_context_release(aio_context);
 }
 
+static BdrvChild *bdrv_find_child(BlockDriverState *parent_bs,
+                                  const char *child_name)
+{
+    BdrvChild *child;
+
+    QLIST_FOREACH(child, &parent_bs->children, next) {
+        if (strcmp(child->name, child_name) == 0) {
+            return child;
+        }
+    }
+
+    return NULL;
+}
+
+void qmp_x_blockdev_change(const char *parent, bool has_child,
+                           const char *child, bool has_node,
+                           const char *node, Error **errp)
+{
+    BlockDriverState *parent_bs, *new_bs = NULL;
+    BdrvChild *p_child;
+
+    parent_bs = bdrv_lookup_bs(parent, parent, errp);
+    if (!parent_bs) {
+        return;
+    }
+
+    if (has_child == has_node) {
+        if (has_child) {
+            error_setg(errp, "The parameters child and node are in conflict");
+        } else {
+            error_setg(errp, "Either child or node must be specified");
+        }
+        return;
+    }
+
+    if (has_child) {
+        p_child = bdrv_find_child(parent_bs, child);
+        if (!p_child) {
+            error_setg(errp, "Node '%s' does not have child '%s'",
+                       parent, child);
+            return;
+        }
+        bdrv_del_child(parent_bs, p_child, errp);
+    }
+
+    if (has_node) {
+        new_bs = bdrv_find_node(node);
+        if (!new_bs) {
+            error_setg(errp, "Node '%s' not found", node);
+            return;
+        }
+        bdrv_add_child(parent_bs, new_bs, errp);
+    }
+}
+
 BlockJobInfoList *qmp_query_block_jobs(Error **errp)
 {
     BlockJobInfoList *head = NULL, **p_next = &head;
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 9bf1b22..bc3fd0b 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2548,3 +2548,35 @@
 ##
 { 'command': 'block-set-write-threshold',
   'data': { 'node-name': 'str', 'write-threshold': 'uint64' } }
+
+##
+# @x-blockdev-change
+#
+# Dynamically reconfigure the block driver state graph. It can be used
+# to add, remove, insert or replace a graph node. Currently only the
+# Quorum driver implements this feature to add or remove its child. This
+# is useful to fix a broken quorum child.
+#
+# If @node is specified, it will be inserted under @parent. @child
+# may not be specified in this case. If both @parent and @child are
+# specified but @node is not, @child will be detached from @parent.
+#
+# @parent: the id or name of the parent node.
+#
+# @child: #optional the name of a child under the given parent node.
+#
+# @node: #optional the name of the node that will be added.
+#
+# Note: this command is experimental, and its API is not stable. It
+# does not support all kinds of operations, all kinds of children, nor
+# all block drivers.
+#
+# Warning: The data in a new quorum child MUST be consistent with that of
+# the rest of the array.
+#
+# Since: 2.6
+##
+{ 'command': 'x-blockdev-change',
+  'data' : { 'parent': 'str',
+             '*child': 'str',
+             '*node': 'str' } }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index b629673..2a55135 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -4398,6 +4398,60 @@ Example:
 EQMP
 
     {
+        .name       = "x-blockdev-change",
+        .args_type  = "parent:B,child:B?,node:B?",
+        .mhandler.cmd_new = qmp_marshal_x_blockdev_change,
+    },
+
+SQMP
+x-blockdev-change
+-----------------
+
+Dynamically reconfigure the block driver state graph. It can be used
+to add, remove, insert or replace a graph node. Currently only the
+Quorum driver implements this feature to add or remove its child. This
+is useful to fix a broken quorum child.
+
+If @node is specified, it will be inserted under @parent. @child
+may not be specified in this case. If both @parent and @child are
+specified but @node is not, @child will be detached from @parent.
+
+Arguments:
+- "parent": the id or name of the parent node (json-string)
+- "child": the name of a child under the given parent node (json-string, optional)
+- "node": the name of the node that will be added (json-string, optional)
+
+Note: this command is experimental, and not a stable API. It doesn't
+support all kinds of operations, all kinds of children, nor all block
+drivers.
+
+Warning: The data in a new quorum child MUST be consistent with that of
+the rest of the array.
+
+Example:
+
+Add a new node to a quorum
+-> { "execute": "blockdev-add",
+    "arguments": { "options": { "driver": "raw",
+                                "node-name": "new_node",
+                                "id": "test_new_node",
+                                "file": { "driver": "file",
+                                          "filename": "test.raw" } } } }
+<- { "return": {} }
+-> { "execute": "x-blockdev-change",
+    "arguments": { "parent": "disk1",
+                   "node": "new_node" } }
+<- { "return": {} }
+
+Delete a quorum's node
+-> { "execute": "x-blockdev-change",
+    "arguments": { "parent": "disk1",
+                   "child": "children.1" } }
+<- { "return": {} }
+
+EQMP
+
+    {
         .name       = "query-named-block-nodes",
         .args_type  = "",
         .mhandler.cmd_new = qmp_marshal_query_named_block_nodes,
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 1/3] Add new block driver interface to add/delete a BDS's child
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 1/3] Add new block driver interface to add/delete a BDS's child Changlong Xie
@ 2016-03-10 14:57   ` Alberto Garcia
  2016-03-11  1:17     ` Changlong Xie
  0 siblings, 1 reply; 42+ messages in thread
From: Alberto Garcia @ 2016-03-10 14:57 UTC (permalink / raw)
  To: Changlong Xie, qemu devel, Eric Blake, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert, Gonglei, zhanghailiang

On Thu 10 Mar 2016 03:49:39 AM CET, Changlong Xie wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
>
> In some cases, we want to take a quorum child offline, and take
> another child online.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>

> +    if (!tmp) {
> +        error_setg(errp, "The node %s does not have child named %s",
> +                   bdrv_get_device_or_node_name(parent_bs),
> +                   bdrv_get_device_or_node_name(child->bs));
> +        return;

I think it should be "does not have a child" or "does not have any
child".

With that corrected,

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 1/3] Add new block driver interface to add/delete a BDS's child
  2016-03-10 14:57   ` Alberto Garcia
@ 2016-03-11  1:17     ` Changlong Xie
  0 siblings, 0 replies; 42+ messages in thread
From: Changlong Xie @ 2016-03-11  1:17 UTC (permalink / raw)
  To: Alberto Garcia, qemu devel, Eric Blake, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert, Gonglei, zhanghailiang

On 03/10/2016 10:57 PM, Alberto Garcia wrote:
> On Thu 10 Mar 2016 03:49:39 AM CET, Changlong Xie wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> In some cases, we want to take a quorum child offline, and take
>> another child online.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
>> Reviewed-by: Max Reitz <mreitz@redhat.com>
>
>> +    if (!tmp) {
>> +        error_setg(errp, "The node %s does not have child named %s",
>> +                   bdrv_get_device_or_node_name(parent_bs),
>> +                   bdrv_get_device_or_node_name(child->bs));
>> +        return;
>
> I think it should be "does not have a child" or "does not have any
> child".
>
> With that corrected,

Ok, will fix in next version.

Thanks
	-Xie

>
> Reviewed-by: Alberto Garcia <berto@igalia.com>
>
> Berto
>
>
> .
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child() Changlong Xie
@ 2016-03-11 12:21   ` Alberto Garcia
  2016-03-14  1:33     ` Changlong Xie
                       ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Alberto Garcia @ 2016-03-11 12:21 UTC (permalink / raw)
  To: Changlong Xie, qemu devel, Eric Blake, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert, Gonglei, zhanghailiang

On Thu 10 Mar 2016 03:49:40 AM CET, Changlong Xie wrote:
> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>      bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>                              * block if Quorum is reached.
>                              */
> +    unsigned long *index_bitmap;
> +    int bsize;
  [...]
> +static int get_new_child_index(BDRVQuorumState *s)
  [...]
> +static void remove_child_index(BDRVQuorumState *s, int index)
  [...]

Sorry if I missed a previous discussion, but why is this necessary?

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 3/3] qmp: add monitor command to add/remove a child
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 3/3] qmp: add monitor command to add/remove a child Changlong Xie
@ 2016-03-11 12:48   ` Alberto Garcia
  2016-03-28  6:09   ` Changlong Xie
  1 sibling, 0 replies; 42+ messages in thread
From: Alberto Garcia @ 2016-03-11 12:48 UTC (permalink / raw)
  To: Changlong Xie, qemu devel, Eric Blake, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert, Gonglei, zhanghailiang

On Thu 10 Mar 2016 03:49:41 AM CET, Changlong Xie wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
>
> The new QMP command name is x-blockdev-change. It's just for adding/removing
> quorum's child now, and doesn't support all kinds of children, all kinds of
> operations, nor all block drivers. So it is experimental now.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-11 12:21   ` Alberto Garcia
@ 2016-03-14  1:33     ` Changlong Xie
  2016-03-14  6:02     ` Changlong Xie
  2016-03-16  2:10     ` Wen Congyang
  2 siblings, 0 replies; 42+ messages in thread
From: Changlong Xie @ 2016-03-14  1:33 UTC (permalink / raw)
  To: Alberto Garcia, qemu devel, Eric Blake, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert, Gonglei, zhanghailiang

On 03/11/2016 08:21 PM, Alberto Garcia wrote:
> On Thu 10 Mar 2016 03:49:40 AM CET, Changlong Xie wrote:
>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>                               * block if Quorum is reached.
>>                               */
>> +    unsigned long *index_bitmap;
>> +    int bsize;
>    [...]
>> +static int get_new_child_index(BDRVQuorumState *s)
>    [...]
>> +static void remove_child_index(BDRVQuorumState *s, int index)
>    [...]
>
> Sorry if I missed a previous discussion, but why is this necessary?

Hi Betro

Currently we implement this for COLO, we need the capability to hotplug 
NBD child in COLO mode.

More detail please reference 
http://wiki.qemu.org/Features/BlockReplication.

Thanks
	-Xie

>
> Berto
>
>
> .
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-11 12:21   ` Alberto Garcia
  2016-03-14  1:33     ` Changlong Xie
@ 2016-03-14  6:02     ` Changlong Xie
  2016-03-16 12:38       ` Alberto Garcia
  2016-03-16  2:10     ` Wen Congyang
  2 siblings, 1 reply; 42+ messages in thread
From: Changlong Xie @ 2016-03-14  6:02 UTC (permalink / raw)
  To: Alberto Garcia, qemu devel, Eric Blake, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert, Gonglei, zhanghailiang

On 03/11/2016 08:21 PM, Alberto Garcia wrote:
> On Thu 10 Mar 2016 03:49:40 AM CET, Changlong Xie wrote:
>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>                               * block if Quorum is reached.
>>                               */
>> +    unsigned long *index_bitmap;

Hi Berto

*NOTE*, In the old version, we just used "bs->node_name", but in the 
lastest one, as Kevin suggested we introduce "child->child_name"(formart 
as "children.xxx"), this is the key cause why we need this two functions 
here.

Thanks
	-Xie
>> +    int bsize;
>    [...]
>> +static int get_new_child_index(BDRVQuorumState *s)
>    [...]
>> +static void remove_child_index(BDRVQuorumState *s, int index)
>    [...]
>
> Sorry if I missed a previous discussion, but why is this necessary?
>
> Berto
>
>
> .
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-11 12:21   ` Alberto Garcia
  2016-03-14  1:33     ` Changlong Xie
  2016-03-14  6:02     ` Changlong Xie
@ 2016-03-16  2:10     ` Wen Congyang
  2 siblings, 0 replies; 42+ messages in thread
From: Wen Congyang @ 2016-03-16  2:10 UTC (permalink / raw)
  To: Alberto Garcia, Changlong Xie, qemu devel, Eric Blake,
	Kevin Wolf, Max Reitz, Stefan Hajnoczi
  Cc: zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Markus Armbruster, Dr. David Alan Gilbert, Gonglei

On 03/11/2016 08:21 PM, Alberto Garcia wrote:
> On Thu 10 Mar 2016 03:49:40 AM CET, Changlong Xie wrote:
>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>      bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>                              * block if Quorum is reached.
>>                              */
>> +    unsigned long *index_bitmap;
>> +    int bsize;
>   [...]
>> +static int get_new_child_index(BDRVQuorumState *s)
>   [...]
>> +static void remove_child_index(BDRVQuorumState *s, int index)
>   [...]
> 
> Sorry if I missed a previous discussion, but why is this necessary?

Hi, Alberto Garcia

Do you have any comments about this patch or give a R-B?

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-14  6:02     ` Changlong Xie
@ 2016-03-16 12:38       ` Alberto Garcia
  2016-03-17  1:22         ` Wen Congyang
  0 siblings, 1 reply; 42+ messages in thread
From: Alberto Garcia @ 2016-03-16 12:38 UTC (permalink / raw)
  To: Changlong Xie, qemu devel, Eric Blake, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: qemu block, Jiang Yunhong, Dong Eddie, Markus Armbruster,
	Dr. David Alan Gilbert, Gonglei, zhanghailiang

On Mon 14 Mar 2016 07:02:08 AM CET, Changlong Xie <xiecl.fnst@cn.fujitsu.com> wrote:

>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>>                               * block if Quorum is reached.
>>>                               */
>>> +    unsigned long *index_bitmap;
>
> Hi Berto
>
> *NOTE*, In the old version, we just used "bs->node_name", but in the
> lastest one, as Kevin suggested we introduce
> "child->child_name"(formart as "children.xxx"), this is the key cause
> why we need this two functions here.

I'm sorry I missed this discussion earlier. Your code seems technically
correct but I have several questions:

- I read that one of the reasons for this change is that "In theory, the
  same node could be attached twice to the same parent in different
  roles.". Is there any example of that? What's the use case?

- How do you obtain the child name?

- I see that if you have children.0 and children.1 (let's say hd0.qcow2
  and hd1.qcow2), then you remove children.0 and add it again, it will
  keep the 'children.0' name (that's what the bitmap is for if I'm
  understanding it correctly). However the position in the s->children
  array will change because you do memmove() when you remove children.0
  and then add it again to the end of the array.

  Initial status:

    s->children[0] <--> "children.0" (hd0.qcow2)
    s->children[1] <--> "children.1" (hd1.qcow2)

  children.0 (hd0.qcow2) is removed:

    s->children[0] <--> "children.1" (hd1.qcow2)

  children.0 (hd0.qcow2) is added again:

    s->children[0] <--> "children.1" (hd1.qcow2)
    s->children[1] <--> "children.0" (hd0.qcow2)

  Is this correct? Is this the indented behavior? Since you are reading
  in FIFO mode, now hd1.qcow2 will always be read first, so if
  children.1 was the secondary disk, it has just become the primary.

I also think that it would be great to have tests for this
functionality, but they can be added later.

Thanks,

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-16 12:38       ` Alberto Garcia
@ 2016-03-17  1:22         ` Wen Congyang
  2016-03-17  9:10           ` Alberto Garcia
  0 siblings, 1 reply; 42+ messages in thread
From: Wen Congyang @ 2016-03-17  1:22 UTC (permalink / raw)
  To: Alberto Garcia, Changlong Xie, qemu devel, Eric Blake,
	Kevin Wolf, Max Reitz, Stefan Hajnoczi
  Cc: zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Markus Armbruster, Dr. David Alan Gilbert, Gonglei

On 03/16/2016 08:38 PM, Alberto Garcia wrote:
> On Mon 14 Mar 2016 07:02:08 AM CET, Changlong Xie <xiecl.fnst@cn.fujitsu.com> wrote:
> 
>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>>>                               * block if Quorum is reached.
>>>>                               */
>>>> +    unsigned long *index_bitmap;
>>
>> Hi Berto
>>
>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>> lastest one, as Kevin suggested we introduce
>> "child->child_name"(formart as "children.xxx"), this is the key cause
>> why we need this two functions here.
> 
> I'm sorry I missed this discussion earlier. Your code seems technically
> correct but I have several questions:
> 
> - I read that one of the reasons for this change is that "In theory, the
>   same node could be attached twice to the same parent in different
>   roles.". Is there any example of that? What's the use case?

Kevin may know the case.

> 
> - How do you obtain the child name?

IIRC, the answer is no now. I think we can improve 'info block' output

> 
> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>   and hd1.qcow2), then you remove children.0 and add it again, it will
>   keep the 'children.0' name (that's what the bitmap is for if I'm
>   understanding it correctly). However the position in the s->children
>   array will change because you do memmove() when you remove children.0
>   and then add it again to the end of the array.
> 
>   Initial status:
> 
>     s->children[0] <--> "children.0" (hd0.qcow2)
>     s->children[1] <--> "children.1" (hd1.qcow2)
> 
>   children.0 (hd0.qcow2) is removed:
> 
>     s->children[0] <--> "children.1" (hd1.qcow2)
> 
>   children.0 (hd0.qcow2) is added again:
> 
>     s->children[0] <--> "children.1" (hd1.qcow2)
>     s->children[1] <--> "children.0" (hd0.qcow2)

Yes, it is correct.

> 
>   Is this correct? Is this the indented behavior? Since you are reading
>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>   children.1 was the secondary disk, it has just become the primary.

Yes.

Thanks
Wen Congyang

> 
> I also think that it would be great to have tests for this
> functionality, but they can be added later.
> 
> Thanks,
> 
> Berto
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-17  1:22         ` Wen Congyang
@ 2016-03-17  9:10           ` Alberto Garcia
  2016-03-17  9:44             ` Wen Congyang
  0 siblings, 1 reply; 42+ messages in thread
From: Alberto Garcia @ 2016-03-17  9:10 UTC (permalink / raw)
  To: Wen Congyang, Changlong Xie, qemu devel, Eric Blake, Kevin Wolf,
	Max Reitz, Stefan Hajnoczi
  Cc: zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Markus Armbruster, Dr. David Alan Gilbert, Gonglei

On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>>>>                               * block if Quorum is reached.
>>>>>                               */
>>>>> +    unsigned long *index_bitmap;
>>>
>>> Hi Berto
>>>
>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>>> lastest one, as Kevin suggested we introduce
>>> "child->child_name"(formart as "children.xxx"), this is the key cause
>>> why we need this two functions here.
>> 
>> I'm sorry I missed this discussion earlier. Your code seems technically
>> correct but I have several questions:
>> 
>> - I read that one of the reasons for this change is that "In theory, the
>>   same node could be attached twice to the same parent in different
>>   roles.". Is there any example of that? What's the use case?
>
> Kevin may know the case.

Kevin, do you have an example?

>> - How do you obtain the child name?
>
> IIRC, the answer is no now. I think we can improve 'info block' output

Okay, but then we should extend that first, otherwise this API cannot be
used.

>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>>   and hd1.qcow2), then you remove children.0 and add it again, it will
>>   keep the 'children.0' name (that's what the bitmap is for if I'm
>>   understanding it correctly). However the position in the s->children
>>   array will change because you do memmove() when you remove children.0
>>   and then add it again to the end of the array.
>> 
>>   Initial status:
>> 
>>     s->children[0] <--> "children.0" (hd0.qcow2)
>>     s->children[1] <--> "children.1" (hd1.qcow2)
>> 
>>   children.0 (hd0.qcow2) is removed:
>> 
>>     s->children[0] <--> "children.1" (hd1.qcow2)
>> 
>>   children.0 (hd0.qcow2) is added again:
>> 
>>     s->children[0] <--> "children.1" (hd1.qcow2)
>>     s->children[1] <--> "children.0" (hd0.qcow2)
>
> Yes, it is correct.
>
>> 
>>   Is this correct? Is this the indented behavior? Since you are reading
>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>>   children.1 was the secondary disk, it has just become the primary.
>
> Yes.

And don't you need a way to control the order in which the disks must be
read for COLO?

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-17  9:10           ` Alberto Garcia
@ 2016-03-17  9:44             ` Wen Congyang
  2016-03-17  9:48               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 42+ messages in thread
From: Wen Congyang @ 2016-03-17  9:44 UTC (permalink / raw)
  To: Alberto Garcia, Changlong Xie, qemu devel, Eric Blake,
	Kevin Wolf, Max Reitz, Stefan Hajnoczi
  Cc: zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Markus Armbruster, Dr. David Alan Gilbert, Gonglei

On 03/17/2016 05:10 PM, Alberto Garcia wrote:
> On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>>>>>                               * block if Quorum is reached.
>>>>>>                               */
>>>>>> +    unsigned long *index_bitmap;
>>>>
>>>> Hi Berto
>>>>
>>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>>>> lastest one, as Kevin suggested we introduce
>>>> "child->child_name"(formart as "children.xxx"), this is the key cause
>>>> why we need this two functions here.
>>>
>>> I'm sorry I missed this discussion earlier. Your code seems technically
>>> correct but I have several questions:
>>>
>>> - I read that one of the reasons for this change is that "In theory, the
>>>   same node could be attached twice to the same parent in different
>>>   roles.". Is there any example of that? What's the use case?
>>
>> Kevin may know the case.
> 
> Kevin, do you have an example?
> 
>>> - How do you obtain the child name?
>>
>> IIRC, the answer is no now. I think we can improve 'info block' output
> 
> Okay, but then we should extend that first, otherwise this API cannot be
> used.
> 
>>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>>>   and hd1.qcow2), then you remove children.0 and add it again, it will
>>>   keep the 'children.0' name (that's what the bitmap is for if I'm
>>>   understanding it correctly). However the position in the s->children
>>>   array will change because you do memmove() when you remove children.0
>>>   and then add it again to the end of the array.
>>>
>>>   Initial status:
>>>
>>>     s->children[0] <--> "children.0" (hd0.qcow2)
>>>     s->children[1] <--> "children.1" (hd1.qcow2)
>>>
>>>   children.0 (hd0.qcow2) is removed:
>>>
>>>     s->children[0] <--> "children.1" (hd1.qcow2)
>>>
>>>   children.0 (hd0.qcow2) is added again:
>>>
>>>     s->children[0] <--> "children.1" (hd1.qcow2)
>>>     s->children[1] <--> "children.0" (hd0.qcow2)
>>
>> Yes, it is correct.
>>
>>>
>>>   Is this correct? Is this the indented behavior? Since you are reading
>>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>>>   children.1 was the secondary disk, it has just become the primary.
>>
>> Yes.
> 
> And don't you need a way to control the order in which the disks must be
> read for COLO?

I think in fifo mode, we should read the disk first that is added earlier.

We don't need a way to control the order now.

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-17  9:44             ` Wen Congyang
@ 2016-03-17  9:48               ` Dr. David Alan Gilbert
  2016-03-17  9:56                 ` Wen Congyang
  0 siblings, 1 reply; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2016-03-17  9:48 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, zhanghailiang,
	qemu block, Markus Armbruster, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi

* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/17/2016 05:10 PM, Alberto Garcia wrote:
> > On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
> >>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
> >>>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
> >>>>>>                               * block if Quorum is reached.
> >>>>>>                               */
> >>>>>> +    unsigned long *index_bitmap;
> >>>>
> >>>> Hi Berto
> >>>>
> >>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
> >>>> lastest one, as Kevin suggested we introduce
> >>>> "child->child_name"(formart as "children.xxx"), this is the key cause
> >>>> why we need this two functions here.
> >>>
> >>> I'm sorry I missed this discussion earlier. Your code seems technically
> >>> correct but I have several questions:
> >>>
> >>> - I read that one of the reasons for this change is that "In theory, the
> >>>   same node could be attached twice to the same parent in different
> >>>   roles.". Is there any example of that? What's the use case?
> >>
> >> Kevin may know the case.
> > 
> > Kevin, do you have an example?
> > 
> >>> - How do you obtain the child name?
> >>
> >> IIRC, the answer is no now. I think we can improve 'info block' output
> > 
> > Okay, but then we should extend that first, otherwise this API cannot be
> > used.
> > 
> >>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
> >>>   and hd1.qcow2), then you remove children.0 and add it again, it will
> >>>   keep the 'children.0' name (that's what the bitmap is for if I'm
> >>>   understanding it correctly). However the position in the s->children
> >>>   array will change because you do memmove() when you remove children.0
> >>>   and then add it again to the end of the array.
> >>>
> >>>   Initial status:
> >>>
> >>>     s->children[0] <--> "children.0" (hd0.qcow2)
> >>>     s->children[1] <--> "children.1" (hd1.qcow2)
> >>>
> >>>   children.0 (hd0.qcow2) is removed:
> >>>
> >>>     s->children[0] <--> "children.1" (hd1.qcow2)
> >>>
> >>>   children.0 (hd0.qcow2) is added again:
> >>>
> >>>     s->children[0] <--> "children.1" (hd1.qcow2)
> >>>     s->children[1] <--> "children.0" (hd0.qcow2)
> >>
> >> Yes, it is correct.
> >>
> >>>
> >>>   Is this correct? Is this the indented behavior? Since you are reading
> >>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
> >>>   children.1 was the secondary disk, it has just become the primary.
> >>
> >> Yes.
> > 
> > And don't you need a way to control the order in which the disks must be
> > read for COLO?
> 
> I think in fifo mode, we should read the disk first that is added earlier.
> 
> We don't need a way to control the order now.

Can you document fully how it's used in COLO then?
We should have the failure modes documented, and how you'll use
it after failover etc   Without that it's really difficult to tell
if this naming is right.
The children.0 notation is really confusing in the way that Berto
describes; I hit this a couple of months ago and it really doesn't
make sense.

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Berto
> > 
> > 
> > .
> > 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-17  9:48               ` Dr. David Alan Gilbert
@ 2016-03-17  9:56                 ` Wen Congyang
  2016-03-17  9:59                   ` Dr. David Alan Gilbert
                                     ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Wen Congyang @ 2016-03-17  9:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, zhanghailiang,
	qemu block, Markus Armbruster, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi

On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (wency@cn.fujitsu.com) wrote:
>> On 03/17/2016 05:10 PM, Alberto Garcia wrote:
>>> On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>>>>>>>                               * block if Quorum is reached.
>>>>>>>>                               */
>>>>>>>> +    unsigned long *index_bitmap;
>>>>>>
>>>>>> Hi Berto
>>>>>>
>>>>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>>>>>> lastest one, as Kevin suggested we introduce
>>>>>> "child->child_name"(formart as "children.xxx"), this is the key cause
>>>>>> why we need this two functions here.
>>>>>
>>>>> I'm sorry I missed this discussion earlier. Your code seems technically
>>>>> correct but I have several questions:
>>>>>
>>>>> - I read that one of the reasons for this change is that "In theory, the
>>>>>   same node could be attached twice to the same parent in different
>>>>>   roles.". Is there any example of that? What's the use case?
>>>>
>>>> Kevin may know the case.
>>>
>>> Kevin, do you have an example?
>>>
>>>>> - How do you obtain the child name?
>>>>
>>>> IIRC, the answer is no now. I think we can improve 'info block' output
>>>
>>> Okay, but then we should extend that first, otherwise this API cannot be
>>> used.
>>>
>>>>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>>>>>   and hd1.qcow2), then you remove children.0 and add it again, it will
>>>>>   keep the 'children.0' name (that's what the bitmap is for if I'm
>>>>>   understanding it correctly). However the position in the s->children
>>>>>   array will change because you do memmove() when you remove children.0
>>>>>   and then add it again to the end of the array.
>>>>>
>>>>>   Initial status:
>>>>>
>>>>>     s->children[0] <--> "children.0" (hd0.qcow2)
>>>>>     s->children[1] <--> "children.1" (hd1.qcow2)
>>>>>
>>>>>   children.0 (hd0.qcow2) is removed:
>>>>>
>>>>>     s->children[0] <--> "children.1" (hd1.qcow2)
>>>>>
>>>>>   children.0 (hd0.qcow2) is added again:
>>>>>
>>>>>     s->children[0] <--> "children.1" (hd1.qcow2)
>>>>>     s->children[1] <--> "children.0" (hd0.qcow2)
>>>>
>>>> Yes, it is correct.
>>>>
>>>>>
>>>>>   Is this correct? Is this the indented behavior? Since you are reading
>>>>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>>>>>   children.1 was the secondary disk, it has just become the primary.
>>>>
>>>> Yes.
>>>
>>> And don't you need a way to control the order in which the disks must be
>>> read for COLO?
>>
>> I think in fifo mode, we should read the disk first that is added earlier.
>>
>> We don't need a way to control the order now.
> 
> Can you document fully how it's used in COLO then?

Do you mean document it in docs/block-replication.txt?

> We should have the failure modes documented, and how you'll use
> it after failover etc   Without that it's really difficult to tell
> if this naming is right.

For COLO, children.0 is the real disk, children.1 is replication driver.
After failure, children.1 will be removed by the user. If we want to
continue do COLO, we need add a new children.1 again.

> The children.0 notation is really confusing in the way that Berto
> describes; I hit this a couple of months ago and it really doesn't
> make sense.

Do you mean: read from children.1 first, and then read from children.0 in
fifo mode? Yes, the behavior is very strange.

Thanks
Wen Congyang

> 
> Dave
> 
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> Berto
>>>
>>>
>>> .
>>>
>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-17  9:56                 ` Wen Congyang
@ 2016-03-17  9:59                   ` Dr. David Alan Gilbert
  2016-03-17 10:07                   ` Alberto Garcia
  2016-03-29 15:38                   ` Max Reitz
  2 siblings, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2016-03-17  9:59 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, zhanghailiang,
	qemu block, Markus Armbruster, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi

* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> > * Wen Congyang (wency@cn.fujitsu.com) wrote:
> >> On 03/17/2016 05:10 PM, Alberto Garcia wrote:
> >>> On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
> >>>>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
> >>>>>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
> >>>>>>>>                               * block if Quorum is reached.
> >>>>>>>>                               */
> >>>>>>>> +    unsigned long *index_bitmap;
> >>>>>>
> >>>>>> Hi Berto
> >>>>>>
> >>>>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
> >>>>>> lastest one, as Kevin suggested we introduce
> >>>>>> "child->child_name"(formart as "children.xxx"), this is the key cause
> >>>>>> why we need this two functions here.
> >>>>>
> >>>>> I'm sorry I missed this discussion earlier. Your code seems technically
> >>>>> correct but I have several questions:
> >>>>>
> >>>>> - I read that one of the reasons for this change is that "In theory, the
> >>>>>   same node could be attached twice to the same parent in different
> >>>>>   roles.". Is there any example of that? What's the use case?
> >>>>
> >>>> Kevin may know the case.
> >>>
> >>> Kevin, do you have an example?
> >>>
> >>>>> - How do you obtain the child name?
> >>>>
> >>>> IIRC, the answer is no now. I think we can improve 'info block' output
> >>>
> >>> Okay, but then we should extend that first, otherwise this API cannot be
> >>> used.
> >>>
> >>>>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
> >>>>>   and hd1.qcow2), then you remove children.0 and add it again, it will
> >>>>>   keep the 'children.0' name (that's what the bitmap is for if I'm
> >>>>>   understanding it correctly). However the position in the s->children
> >>>>>   array will change because you do memmove() when you remove children.0
> >>>>>   and then add it again to the end of the array.
> >>>>>
> >>>>>   Initial status:
> >>>>>
> >>>>>     s->children[0] <--> "children.0" (hd0.qcow2)
> >>>>>     s->children[1] <--> "children.1" (hd1.qcow2)
> >>>>>
> >>>>>   children.0 (hd0.qcow2) is removed:
> >>>>>
> >>>>>     s->children[0] <--> "children.1" (hd1.qcow2)
> >>>>>
> >>>>>   children.0 (hd0.qcow2) is added again:
> >>>>>
> >>>>>     s->children[0] <--> "children.1" (hd1.qcow2)
> >>>>>     s->children[1] <--> "children.0" (hd0.qcow2)
> >>>>
> >>>> Yes, it is correct.
> >>>>
> >>>>>
> >>>>>   Is this correct? Is this the indented behavior? Since you are reading
> >>>>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
> >>>>>   children.1 was the secondary disk, it has just become the primary.
> >>>>
> >>>> Yes.
> >>>
> >>> And don't you need a way to control the order in which the disks must be
> >>> read for COLO?
> >>
> >> I think in fifo mode, we should read the disk first that is added earlier.
> >>
> >> We don't need a way to control the order now.
> > 
> > Can you document fully how it's used in COLO then?
> 
> Do you mean document it in docs/block-replication.txt?

That would be OK.

> > We should have the failure modes documented, and how you'll use
> > it after failover etc   Without that it's really difficult to tell
> > if this naming is right.
> 
> For COLO, children.0 is the real disk, children.1 is replication driver.
> After failure, children.1 will be removed by the user. If we want to
> continue do COLO, we need add a new children.1 again.

So you need to document how to do that.

> > The children.0 notation is really confusing in the way that Berto
> > describes; I hit this a couple of months ago and it really doesn't
> > make sense.
> 
> Do you mean: read from children.1 first, and then read from children.0 in
> fifo mode? Yes, the behavior is very strange.

I mean the 'children.0' 'children.1' naming is just very confusing.
Also because the order in the array is important it's even more confusing
since the 'children.1' isn't necessarily the children[1].

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Dave
> > 
> >>
> >> Thanks
> >> Wen Congyang
> >>
> >>>
> >>> Berto
> >>>
> >>>
> >>> .
> >>>
> >>
> >>
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> > 
> > .
> > 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-17  9:56                 ` Wen Congyang
  2016-03-17  9:59                   ` Dr. David Alan Gilbert
@ 2016-03-17 10:07                   ` Alberto Garcia
  2016-03-17 10:23                     ` Wen Congyang
  2016-03-29 15:38                   ` Max Reitz
  2 siblings, 1 reply; 42+ messages in thread
From: Alberto Garcia @ 2016-03-17 10:07 UTC (permalink / raw)
  To: Wen Congyang, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, zhanghailiang, qemu block,
	Markus Armbruster, Jiang Yunhong, Dong Eddie, qemu devel,
	Max Reitz, Gonglei, Stefan Hajnoczi

On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
>> We should have the failure modes documented, and how you'll use it
>> after failover etc Without that it's really difficult to tell if this
>> naming is right.
>
> For COLO, children.0 is the real disk, children.1 is replication
> driver.  After failure, children.1 will be removed by the user. If we
> want to continue do COLO, we need add a new children.1 again.

What if children.0 fails ?

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-17 10:07                   ` Alberto Garcia
@ 2016-03-17 10:23                     ` Wen Congyang
  2016-03-17 11:25                       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 42+ messages in thread
From: Wen Congyang @ 2016-03-17 10:23 UTC (permalink / raw)
  To: Alberto Garcia, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, zhanghailiang, qemu block,
	Markus Armbruster, Jiang Yunhong, Dong Eddie, qemu devel,
	Max Reitz, Gonglei, Stefan Hajnoczi

On 03/17/2016 06:07 PM, Alberto Garcia wrote:
> On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
>>> We should have the failure modes documented, and how you'll use it
>>> after failover etc Without that it's really difficult to tell if this
>>> naming is right.
>>
>> For COLO, children.0 is the real disk, children.1 is replication
>> driver.  After failure, children.1 will be removed by the user. If we
>> want to continue do COLO, we need add a new children.1 again.
> 
> What if children.0 fails ?

For COLO, reading from children.1 always fails. if children.0 fails, it
means that reading from the disk fails. The guest vm will see the I/O error.

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-17 10:23                     ` Wen Congyang
@ 2016-03-17 11:25                       ` Dr. David Alan Gilbert
  2016-03-18  2:56                         ` Wen Congyang
  0 siblings, 1 reply; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2016-03-17 11:25 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, zhanghailiang,
	qemu block, Markus Armbruster, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi

* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/17/2016 06:07 PM, Alberto Garcia wrote:
> > On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
> >>> We should have the failure modes documented, and how you'll use it
> >>> after failover etc Without that it's really difficult to tell if this
> >>> naming is right.
> >>
> >> For COLO, children.0 is the real disk, children.1 is replication
> >> driver.  After failure, children.1 will be removed by the user. If we
> >> want to continue do COLO, we need add a new children.1 again.
> > 
> > What if children.0 fails ?
> 
> For COLO, reading from children.1 always fails. if children.0 fails, it
> means that reading from the disk fails. The guest vm will see the I/O error.

How do we get that to cause a fail over before the guest detects it?
If the primary's local disk (children.0) fails then if we can failover
at that point then the guest carries running on the secondary without
ever knowing about the failure.

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Berto
> > 
> > 
> > .
> > 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-17 11:25                       ` Dr. David Alan Gilbert
@ 2016-03-18  2:56                         ` Wen Congyang
  2016-03-18 10:48                           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 42+ messages in thread
From: Wen Congyang @ 2016-03-18  2:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, zhanghailiang,
	qemu block, Markus Armbruster, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi

On 03/17/2016 07:25 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (wency@cn.fujitsu.com) wrote:
>> On 03/17/2016 06:07 PM, Alberto Garcia wrote:
>>> On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
>>>>> We should have the failure modes documented, and how you'll use it
>>>>> after failover etc Without that it's really difficult to tell if this
>>>>> naming is right.
>>>>
>>>> For COLO, children.0 is the real disk, children.1 is replication
>>>> driver.  After failure, children.1 will be removed by the user. If we
>>>> want to continue do COLO, we need add a new children.1 again.
>>>
>>> What if children.0 fails ?
>>
>> For COLO, reading from children.1 always fails. if children.0 fails, it
>> means that reading from the disk fails. The guest vm will see the I/O error.
> 
> How do we get that to cause a fail over before the guest detects it?
> If the primary's local disk (children.0) fails then if we can failover
> at that point then the guest carries running on the secondary without
> ever knowing about the failure.

COLO is not designed for such case. The children.0 can also be quorum, so
you can add more than one real disk, and get more reliability. Another
choice is that, the real disk is an external storage, and it has
its own replication solution.

COLO is designed for such case: the host is crashed, and the guest is still
alive after failover, the client doesn't know this event.

Thanks
Wen Congyang

> 
> Dave
> 
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> Berto
>>>
>>>
>>> .
>>>
>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-18  2:56                         ` Wen Congyang
@ 2016-03-18 10:48                           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2016-03-18 10:48 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, zhanghailiang,
	qemu block, Markus Armbruster, Jiang Yunhong, Dong Eddie,
	qemu devel, Max Reitz, Gonglei, Stefan Hajnoczi

* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/17/2016 07:25 PM, Dr. David Alan Gilbert wrote:
> > * Wen Congyang (wency@cn.fujitsu.com) wrote:
> >> On 03/17/2016 06:07 PM, Alberto Garcia wrote:
> >>> On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
> >>>>> We should have the failure modes documented, and how you'll use it
> >>>>> after failover etc Without that it's really difficult to tell if this
> >>>>> naming is right.
> >>>>
> >>>> For COLO, children.0 is the real disk, children.1 is replication
> >>>> driver.  After failure, children.1 will be removed by the user. If we
> >>>> want to continue do COLO, we need add a new children.1 again.
> >>>
> >>> What if children.0 fails ?
> >>
> >> For COLO, reading from children.1 always fails. if children.0 fails, it
> >> means that reading from the disk fails. The guest vm will see the I/O error.
> > 
> > How do we get that to cause a fail over before the guest detects it?
> > If the primary's local disk (children.0) fails then if we can failover
> > at that point then the guest carries running on the secondary without
> > ever knowing about the failure.
> 
> COLO is not designed for such case. The children.0 can also be quorum, so
> you can add more than one real disk, and get more reliability. Another
> choice is that, the real disk is an external storage, and it has
> its own replication solution.
> 
> COLO is designed for such case: the host is crashed, and the guest is still
> alive after failover, the client doesn't know this event.

That seems an odd limitation; the only thing needed for COLO to survive a disk
failure on the primary would be to ensure that the primary fails/triggers failover
if access to the local disk fails and to do it before the IO result is returned to
the guest.

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Dave
> > 
> >>
> >> Thanks
> >> Wen Congyang
> >>
> >>>
> >>> Berto
> >>>
> >>>
> >>> .
> >>>
> >>
> >>
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> > 
> > .
> > 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 3/3] qmp: add monitor command to add/remove a child
  2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 3/3] qmp: add monitor command to add/remove a child Changlong Xie
  2016-03-11 12:48   ` Alberto Garcia
@ 2016-03-28  6:09   ` Changlong Xie
  1 sibling, 0 replies; 42+ messages in thread
From: Changlong Xie @ 2016-03-28  6:09 UTC (permalink / raw)
  To: qemu devel, Eric Blake, Alberto Garcia, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi
  Cc: zhanghailiang, qemu block, Jiang Yunhong, Dong Eddie,
	Markus Armbruster, Dr. David Alan Gilbert, Gonglei

On 03/10/2016 10:49 AM, Changlong Xie wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
>
> The new QMP command name is x-blockdev-change. It's just for adding/removing
> quorum's child now, and doesn't support all kinds of children, all kinds of
> operations, nor all block drivers. So it is experimental now.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>   blockdev.c           | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   qapi/block-core.json | 32 ++++++++++++++++++++++++++++++
>   qmp-commands.hx      | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 141 insertions(+)
>
> diff --git a/blockdev.c b/blockdev.c
> index 0f20c65..435631e 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -4017,6 +4017,61 @@ out:
>       aio_context_release(aio_context);
>   }
>
> +static BdrvChild *bdrv_find_child(BlockDriverState *parent_bs,
> +                                  const char *child_name)
> +{
> +    BdrvChild *child;
> +
> +    QLIST_FOREACH(child, &parent_bs->children, next) {
> +        if (strcmp(child->name, child_name) == 0) {
> +            return child;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +void qmp_x_blockdev_change(const char *parent, bool has_child,
> +                           const char *child, bool has_node,
> +                           const char *node, Error **errp)
> +{
> +    BlockDriverState *parent_bs, *new_bs = NULL;
> +    BdrvChild *p_child;
> +
> +    parent_bs = bdrv_lookup_bs(parent, parent, errp);
> +    if (!parent_bs) {
> +        return;
> +    }
> +
> +    if (has_child == has_node) {
> +        if (has_child) {
> +            error_setg(errp, "The parameters child and node are in conflict");
> +        } else {
> +            error_setg(errp, "Either child or node must be specified");
> +        }
> +        return;
> +    }
> +
> +    if (has_child) {
> +        p_child = bdrv_find_child(parent_bs, child);
> +        if (!p_child) {
> +            error_setg(errp, "Node '%s' does not have child '%s'",
> +                       parent, child);
> +            return;
> +        }
> +        bdrv_del_child(parent_bs, p_child, errp);
> +    }
> +
> +    if (has_node) {
> +        new_bs = bdrv_find_node(node);
> +        if (!new_bs) {
> +            error_setg(errp, "Node '%s' not found", node);
> +            return;
> +        }
> +        bdrv_add_child(parent_bs, new_bs, errp);
> +    }
> +}
> +
>   BlockJobInfoList *qmp_query_block_jobs(Error **errp)
>   {
>       BlockJobInfoList *head = NULL, **p_next = &head;
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 9bf1b22..bc3fd0b 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2548,3 +2548,35 @@
>   ##
>   { 'command': 'block-set-write-threshold',
>     'data': { 'node-name': 'str', 'write-threshold': 'uint64' } }
> +
> +##
> +# @x-blockdev-change
> +#
> +# Dynamically reconfigure the block driver state graph. It can be used
> +# to add, remove, insert or replace a graph node. Currently only the
> +# Quorum driver implements this feature to add or remove its child. This
> +# is useful to fix a broken quorum child.
> +#
> +# If @node is specified, it will be inserted under @parent. @child
> +# may not be specified in this case. If both @parent and @child are
> +# specified but @node is not, @child will be detached from @parent.
> +#
> +# @parent: the id or name of the parent node.
> +#
> +# @child: #optional the name of a child under the given parent node.
> +#
> +# @node: #optional the name of the node that will be added.
> +#
> +# Note: this command is experimental, and its API is not stable. It
> +# does not support all kinds of operations, all kinds of children, nor
> +# all block drivers.
> +#
> +# Warning: The data in a new quorum child MUST be consistent with that of
> +# the rest of the array.
> +#
> +# Since: 2.6
> +##
> +{ 'command': 'x-blockdev-change',
> +  'data' : { 'parent': 'str',
> +             '*child': 'str',
> +             '*node': 'str' } }
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index b629673..2a55135 100644
> --- a/qmp-commands.hx
> +++ b/qmp-commands.hx
> @@ -4398,6 +4398,60 @@ Example:
>   EQMP
>
>       {
> +        .name       = "x-blockdev-change",
> +        .args_type  = "parent:B,child:B?,node:B?",
> +        .mhandler.cmd_new = qmp_marshal_x_blockdev_change,
> +    },
> +
> +SQMP
> +x-blockdev-change
> +-----------------
> +
> +Dynamically reconfigure the block driver state graph. It can be used
> +to add, remove, insert or replace a graph node. Currently only the
> +Quorum driver implements this feature to add or remove its child. This
> +is useful to fix a broken quorum child.
> +
> +If @node is specified, it will be inserted under @parent. @child
> +may not be specified in this case. If both @parent and @child are
> +specified but @node is not, @child will be detached from @parent.
> +
> +Arguments:
> +- "parent": the id or name of the parent node (json-string)
> +- "child": the name of a child under the given parent node (json-string, optional)
> +- "node": the name of the node that will be added (json-string, optional)
> +
> +Note: this command is experimental, and not a stable API. It doesn't
> +support all kinds of operations, all kinds of children, nor all block
> +drivers.
> +
> +Warning: The data in a new quorum child MUST be consistent with that of
> +the rest of the array.
> +
> +Example:
> +
> +Add a new node to a quorum
> +-> { "execute": "blockdev-add",
> +    "arguments": { "options": { "driver": "raw",
> +                                "node-name": "new_node",
> +                                "id": "test_new_node",

Since f21d96d04 block: Use BdrvChild in BlockBackend, if we want to 
create a node without BB, we should not pass " "id": "test_new_node" ",
will fix in next version.

Thanks
	-Xie
> +                                "file": { "driver": "file",
> +                                          "filename": "test.raw" } } } }
> +<- { "return": {} }
> +-> { "execute": "x-blockdev-change",
> +    "arguments": { "parent": "disk1",
> +                   "node": "new_node" } }
> +<- { "return": {} }
> +
> +Delete a quorum's node
> +-> { "execute": "x-blockdev-change",
> +    "arguments": { "parent": "disk1",
> +                   "child": "children.1" } }
> +<- { "return": {} }
> +
> +EQMP
> +
> +    {
>           .name       = "query-named-block-nodes",
>           .args_type  = "",
>           .mhandler.cmd_new = qmp_marshal_query_named_block_nodes,
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-17  9:56                 ` Wen Congyang
  2016-03-17  9:59                   ` Dr. David Alan Gilbert
  2016-03-17 10:07                   ` Alberto Garcia
@ 2016-03-29 15:38                   ` Max Reitz
  2016-03-29 15:44                     ` Eric Blake
  2 siblings, 1 reply; 42+ messages in thread
From: Max Reitz @ 2016-03-29 15:38 UTC (permalink / raw)
  To: Wen Congyang, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, zhanghailiang,
	qemu block, Jiang Yunhong, Dong Eddie, qemu devel,
	Markus Armbruster, Gonglei, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 800 bytes --]

On 17.03.2016 10:56, Wen Congyang wrote:
> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:

[...]

>> The children.0 notation is really confusing in the way that Berto
>> describes; I hit this a couple of months ago and it really doesn't
>> make sense.
> 
> Do you mean: read from children.1 first, and then read from children.0 in
> fifo mode? Yes, the behavior is very strange.

So is this intended or is it not? In
http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
you said that it is.

I myself would indeed say it is very strange. If I were a user, I would
not expect this behavior. And as I developer, I think that how a BDS's
child is used by its parent should solely depend on its role (e.g.
whether it is "children.0" or "children.1").

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-29 15:38                   ` Max Reitz
@ 2016-03-29 15:44                     ` Eric Blake
  2016-03-29 15:50                       ` Dr. David Alan Gilbert
  2016-03-29 15:51                       ` Max Reitz
  0 siblings, 2 replies; 42+ messages in thread
From: Eric Blake @ 2016-03-29 15:44 UTC (permalink / raw)
  To: Max Reitz, Wen Congyang, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, zhanghailiang,
	qemu block, Jiang Yunhong, Dong Eddie, qemu devel,
	Markus Armbruster, Gonglei, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 1486 bytes --]

On 03/29/2016 09:38 AM, Max Reitz wrote:
> On 17.03.2016 10:56, Wen Congyang wrote:
>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> 
> [...]
> 
>>> The children.0 notation is really confusing in the way that Berto
>>> describes; I hit this a couple of months ago and it really doesn't
>>> make sense.
>>
>> Do you mean: read from children.1 first, and then read from children.0 in
>> fifo mode? Yes, the behavior is very strange.
> 
> So is this intended or is it not? In
> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
> you said that it is.
> 
> I myself would indeed say it is very strange. If I were a user, I would
> not expect this behavior. And as I developer, I think that how a BDS's
> child is used by its parent should solely depend on its role (e.g.
> whether it is "children.0" or "children.1").

It sounds like the argument here, and in Max's thread on
query-block-node-tree, is that we DO have cases where order matters, and
so we need a way for the hot-add operation to explicitly specify where
in the list a child is inserted (whether it is being inserted as the new
primary image, or explicitly as the last resort, or somewhere in the
middle).  An optional parameter, that defaults to appending, may be ok,
but we definitely need to consider how the order of children is affected
by hot-add.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-29 15:44                     ` Eric Blake
@ 2016-03-29 15:50                       ` Dr. David Alan Gilbert
  2016-03-29 15:52                         ` Max Reitz
  2016-03-29 15:51                       ` Max Reitz
  1 sibling, 1 reply; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2016-03-29 15:50 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, qemu block,
	Markus Armbruster, Jiang Yunhong, Dong Eddie, qemu devel,
	Max Reitz, Gonglei, Stefan Hajnoczi, zhanghailiang

* Eric Blake (eblake@redhat.com) wrote:
> On 03/29/2016 09:38 AM, Max Reitz wrote:
> > On 17.03.2016 10:56, Wen Congyang wrote:
> >> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> > 
> > [...]
> > 
> >>> The children.0 notation is really confusing in the way that Berto
> >>> describes; I hit this a couple of months ago and it really doesn't
> >>> make sense.
> >>
> >> Do you mean: read from children.1 first, and then read from children.0 in
> >> fifo mode? Yes, the behavior is very strange.
> > 
> > So is this intended or is it not? In
> > http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
> > you said that it is.
> > 
> > I myself would indeed say it is very strange. If I were a user, I would
> > not expect this behavior. And as I developer, I think that how a BDS's
> > child is used by its parent should solely depend on its role (e.g.
> > whether it is "children.0" or "children.1").
> 
> It sounds like the argument here, and in Max's thread on
> query-block-node-tree, is that we DO have cases where order matters, and
> so we need a way for the hot-add operation to explicitly specify where
> in the list a child is inserted (whether it is being inserted as the new
> primary image, or explicitly as the last resort, or somewhere in the
> middle).  An optional parameter, that defaults to appending, may be ok,
> but we definitely need to consider how the order of children is affected
> by hot-add.

Certainly in the COLO case the two children are not identical; and IMHO we need
to get away from thinking about ordering and start thinking about functional
namingd - children.0/children.1 doesn't suggest the fact they behave
differently.

Dave

> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-29 15:44                     ` Eric Blake
  2016-03-29 15:50                       ` Dr. David Alan Gilbert
@ 2016-03-29 15:51                       ` Max Reitz
  2016-03-30 11:39                         ` Alberto Garcia
  1 sibling, 1 reply; 42+ messages in thread
From: Max Reitz @ 2016-03-29 15:51 UTC (permalink / raw)
  To: Eric Blake, Wen Congyang, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, zhanghailiang,
	qemu block, Jiang Yunhong, Dong Eddie, qemu devel,
	Markus Armbruster, Gonglei, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 2550 bytes --]

On 29.03.2016 17:44, Eric Blake wrote:
> On 03/29/2016 09:38 AM, Max Reitz wrote:
>> On 17.03.2016 10:56, Wen Congyang wrote:
>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
>>
>> [...]
>>
>>>> The children.0 notation is really confusing in the way that Berto
>>>> describes; I hit this a couple of months ago and it really doesn't
>>>> make sense.
>>>
>>> Do you mean: read from children.1 first, and then read from children.0 in
>>> fifo mode? Yes, the behavior is very strange.
>>
>> So is this intended or is it not? In
>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
>> you said that it is.
>>
>> I myself would indeed say it is very strange. If I were a user, I would
>> not expect this behavior. And as I developer, I think that how a BDS's
>> child is used by its parent should solely depend on its role (e.g.
>> whether it is "children.0" or "children.1").
> 
> It sounds like the argument here, and in Max's thread on
> query-block-node-tree, is that we DO have cases where order matters, and
> so we need a way for the hot-add operation to explicitly specify where
> in the list a child is inserted (whether it is being inserted as the new
> primary image, or explicitly as the last resort, or somewhere in the
> middle).  An optional parameter, that defaults to appending, may be ok,
> but we definitely need to consider how the order of children is affected
> by hot-add.

However, the order should be queriable after the fact, and there are
three ways I see to accomplish this:

(1) Make this information queriable as driver-specific BDS information.
    I personally don't like it very much, but it would be fine.
(2) Implement query-block-node-tree, make the order of child nodes
    significant and thus represent the FIFO order there. I don't like
    this because it would mean returning two orders through that child
    node list: One is the numeric order (children.0, children.1, ...)
    and another is the FIFO order, which are not necessarily equal.
(3) Fix FIFO order to the child name (its role). I'm very much in favor
    of this.

While I don't have good arguments against (1), I think I have good
arguments for (3) instead: It just doesn't make sense to have a numeric
order of children if this order doesn't mean anything; especially if you
suddenly do need the list of child nodes to be ordered. To me, it
doesn't make any sense to introduce a new hidden order which takes
precedence over this obvious user-visible order.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-29 15:50                       ` Dr. David Alan Gilbert
@ 2016-03-29 15:52                         ` Max Reitz
  2016-03-29 15:54                           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 42+ messages in thread
From: Max Reitz @ 2016-03-29 15:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Eric Blake
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi, zhanghailiang


[-- Attachment #1.1: Type: text/plain, Size: 2004 bytes --]

On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
> * Eric Blake (eblake@redhat.com) wrote:
>> On 03/29/2016 09:38 AM, Max Reitz wrote:
>>> On 17.03.2016 10:56, Wen Congyang wrote:
>>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
>>>
>>> [...]
>>>
>>>>> The children.0 notation is really confusing in the way that Berto
>>>>> describes; I hit this a couple of months ago and it really doesn't
>>>>> make sense.
>>>>
>>>> Do you mean: read from children.1 first, and then read from children.0 in
>>>> fifo mode? Yes, the behavior is very strange.
>>>
>>> So is this intended or is it not? In
>>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
>>> you said that it is.
>>>
>>> I myself would indeed say it is very strange. If I were a user, I would
>>> not expect this behavior. And as I developer, I think that how a BDS's
>>> child is used by its parent should solely depend on its role (e.g.
>>> whether it is "children.0" or "children.1").
>>
>> It sounds like the argument here, and in Max's thread on
>> query-block-node-tree, is that we DO have cases where order matters, and
>> so we need a way for the hot-add operation to explicitly specify where
>> in the list a child is inserted (whether it is being inserted as the new
>> primary image, or explicitly as the last resort, or somewhere in the
>> middle).  An optional parameter, that defaults to appending, may be ok,
>> but we definitely need to consider how the order of children is affected
>> by hot-add.
> 
> Certainly in the COLO case the two children are not identical; and IMHO we need
> to get away from thinking about ordering and start thinking about functional
> namingd - children.0/children.1 doesn't suggest the fact they behave
> differently.

To me it does. If quorum is operating in a mode call "FIFO" I would
expect some order on the child nodes, and if the child nodes are
actually numbered in an ascending order, that is an obvious order.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-29 15:52                         ` Max Reitz
@ 2016-03-29 15:54                           ` Dr. David Alan Gilbert
  2016-03-29 15:59                             ` Max Reitz
  0 siblings, 1 reply; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2016-03-29 15:54 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi, zhanghailiang

* Max Reitz (mreitz@redhat.com) wrote:
> On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
> > * Eric Blake (eblake@redhat.com) wrote:
> >> On 03/29/2016 09:38 AM, Max Reitz wrote:
> >>> On 17.03.2016 10:56, Wen Congyang wrote:
> >>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> >>>
> >>> [...]
> >>>
> >>>>> The children.0 notation is really confusing in the way that Berto
> >>>>> describes; I hit this a couple of months ago and it really doesn't
> >>>>> make sense.
> >>>>
> >>>> Do you mean: read from children.1 first, and then read from children.0 in
> >>>> fifo mode? Yes, the behavior is very strange.
> >>>
> >>> So is this intended or is it not? In
> >>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
> >>> you said that it is.
> >>>
> >>> I myself would indeed say it is very strange. If I were a user, I would
> >>> not expect this behavior. And as I developer, I think that how a BDS's
> >>> child is used by its parent should solely depend on its role (e.g.
> >>> whether it is "children.0" or "children.1").
> >>
> >> It sounds like the argument here, and in Max's thread on
> >> query-block-node-tree, is that we DO have cases where order matters, and
> >> so we need a way for the hot-add operation to explicitly specify where
> >> in the list a child is inserted (whether it is being inserted as the new
> >> primary image, or explicitly as the last resort, or somewhere in the
> >> middle).  An optional parameter, that defaults to appending, may be ok,
> >> but we definitely need to consider how the order of children is affected
> >> by hot-add.
> > 
> > Certainly in the COLO case the two children are not identical; and IMHO we need
> > to get away from thinking about ordering and start thinking about functional
> > namingd - children.0/children.1 doesn't suggest the fact they behave
> > differently.
> 
> To me it does. If quorum is operating in a mode call "FIFO" I would
> expect some order on the child nodes, and if the child nodes are
> actually numbered in an ascending order, that is an obvious order.

I don't understand why it's called 'FIFO'.

Dave

> Max
> 



--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-29 15:54                           ` Dr. David Alan Gilbert
@ 2016-03-29 15:59                             ` Max Reitz
  2016-03-29 16:03                               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 42+ messages in thread
From: Max Reitz @ 2016-03-29 15:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi, zhanghailiang


[-- Attachment #1.1: Type: text/plain, Size: 2597 bytes --]

On 29.03.2016 17:54, Dr. David Alan Gilbert wrote:
> * Max Reitz (mreitz@redhat.com) wrote:
>> On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
>>> * Eric Blake (eblake@redhat.com) wrote:
>>>> On 03/29/2016 09:38 AM, Max Reitz wrote:
>>>>> On 17.03.2016 10:56, Wen Congyang wrote:
>>>>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>> The children.0 notation is really confusing in the way that Berto
>>>>>>> describes; I hit this a couple of months ago and it really doesn't
>>>>>>> make sense.
>>>>>>
>>>>>> Do you mean: read from children.1 first, and then read from children.0 in
>>>>>> fifo mode? Yes, the behavior is very strange.
>>>>>
>>>>> So is this intended or is it not? In
>>>>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
>>>>> you said that it is.
>>>>>
>>>>> I myself would indeed say it is very strange. If I were a user, I would
>>>>> not expect this behavior. And as I developer, I think that how a BDS's
>>>>> child is used by its parent should solely depend on its role (e.g.
>>>>> whether it is "children.0" or "children.1").
>>>>
>>>> It sounds like the argument here, and in Max's thread on
>>>> query-block-node-tree, is that we DO have cases where order matters, and
>>>> so we need a way for the hot-add operation to explicitly specify where
>>>> in the list a child is inserted (whether it is being inserted as the new
>>>> primary image, or explicitly as the last resort, or somewhere in the
>>>> middle).  An optional parameter, that defaults to appending, may be ok,
>>>> but we definitely need to consider how the order of children is affected
>>>> by hot-add.
>>>
>>> Certainly in the COLO case the two children are not identical; and IMHO we need
>>> to get away from thinking about ordering and start thinking about functional
>>> namingd - children.0/children.1 doesn't suggest the fact they behave
>>> differently.
>>
>> To me it does. If quorum is operating in a mode call "FIFO" I would
>> expect some order on the child nodes, and if the child nodes are
>> actually numbered in an ascending order, that is an obvious order.
> 
> I don't understand why it's called 'FIFO'.

Because in that mode quorum successively reads from all of its children
and returns the first successful result. So the First successful Input
is the one that becomes quorum's Output (there isn't much of a
successive output, so it doesn't make much sense to call that the First
Output, though...).

I didn't name it, though. *waves hands defensively* :-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-29 15:59                             ` Max Reitz
@ 2016-03-29 16:03                               ` Dr. David Alan Gilbert
  2016-03-29 16:09                                 ` Max Reitz
  0 siblings, 1 reply; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2016-03-29 16:03 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi, zhanghailiang

* Max Reitz (mreitz@redhat.com) wrote:
> On 29.03.2016 17:54, Dr. David Alan Gilbert wrote:
> > * Max Reitz (mreitz@redhat.com) wrote:
> >> On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
> >>> * Eric Blake (eblake@redhat.com) wrote:
> >>>> On 03/29/2016 09:38 AM, Max Reitz wrote:
> >>>>> On 17.03.2016 10:56, Wen Congyang wrote:
> >>>>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>>>> The children.0 notation is really confusing in the way that Berto
> >>>>>>> describes; I hit this a couple of months ago and it really doesn't
> >>>>>>> make sense.
> >>>>>>
> >>>>>> Do you mean: read from children.1 first, and then read from children.0 in
> >>>>>> fifo mode? Yes, the behavior is very strange.
> >>>>>
> >>>>> So is this intended or is it not? In
> >>>>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
> >>>>> you said that it is.
> >>>>>
> >>>>> I myself would indeed say it is very strange. If I were a user, I would
> >>>>> not expect this behavior. And as I developer, I think that how a BDS's
> >>>>> child is used by its parent should solely depend on its role (e.g.
> >>>>> whether it is "children.0" or "children.1").
> >>>>
> >>>> It sounds like the argument here, and in Max's thread on
> >>>> query-block-node-tree, is that we DO have cases where order matters, and
> >>>> so we need a way for the hot-add operation to explicitly specify where
> >>>> in the list a child is inserted (whether it is being inserted as the new
> >>>> primary image, or explicitly as the last resort, or somewhere in the
> >>>> middle).  An optional parameter, that defaults to appending, may be ok,
> >>>> but we definitely need to consider how the order of children is affected
> >>>> by hot-add.
> >>>
> >>> Certainly in the COLO case the two children are not identical; and IMHO we need
> >>> to get away from thinking about ordering and start thinking about functional
> >>> namingd - children.0/children.1 doesn't suggest the fact they behave
> >>> differently.
> >>
> >> To me it does. If quorum is operating in a mode call "FIFO" I would
> >> expect some order on the child nodes, and if the child nodes are
> >> actually numbered in an ascending order, that is an obvious order.
> > 
> > I don't understand why it's called 'FIFO'.
> 
> Because in that mode quorum successively reads from all of its children
> and returns the first successful result. So the First successful Input
> is the one that becomes quorum's Output (there isn't much of a
> successive output, so it doesn't make much sense to call that the First
> Output, though...).
> 
> I didn't name it, though. *waves hands defensively* :-)

But that description doesn't make sense for what COLO uses it for.

They have, on the primary host:
   0) Local disk
   1) an NBD connection to the secondary

So in theory a read should always happen from (0) and writes should
go to both.

Dave

> Max
> 



--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-29 16:03                               ` Dr. David Alan Gilbert
@ 2016-03-29 16:09                                 ` Max Reitz
  2016-03-29 17:33                                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 42+ messages in thread
From: Max Reitz @ 2016-03-29 16:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi, zhanghailiang


[-- Attachment #1.1: Type: text/plain, Size: 3419 bytes --]

On 29.03.2016 18:03, Dr. David Alan Gilbert wrote:
> * Max Reitz (mreitz@redhat.com) wrote:
>> On 29.03.2016 17:54, Dr. David Alan Gilbert wrote:
>>> * Max Reitz (mreitz@redhat.com) wrote:
>>>> On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
>>>>> * Eric Blake (eblake@redhat.com) wrote:
>>>>>> On 03/29/2016 09:38 AM, Max Reitz wrote:
>>>>>>> On 17.03.2016 10:56, Wen Congyang wrote:
>>>>>>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>>> The children.0 notation is really confusing in the way that Berto
>>>>>>>>> describes; I hit this a couple of months ago and it really doesn't
>>>>>>>>> make sense.
>>>>>>>>
>>>>>>>> Do you mean: read from children.1 first, and then read from children.0 in
>>>>>>>> fifo mode? Yes, the behavior is very strange.
>>>>>>>
>>>>>>> So is this intended or is it not? In
>>>>>>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
>>>>>>> you said that it is.
>>>>>>>
>>>>>>> I myself would indeed say it is very strange. If I were a user, I would
>>>>>>> not expect this behavior. And as I developer, I think that how a BDS's
>>>>>>> child is used by its parent should solely depend on its role (e.g.
>>>>>>> whether it is "children.0" or "children.1").
>>>>>>
>>>>>> It sounds like the argument here, and in Max's thread on
>>>>>> query-block-node-tree, is that we DO have cases where order matters, and
>>>>>> so we need a way for the hot-add operation to explicitly specify where
>>>>>> in the list a child is inserted (whether it is being inserted as the new
>>>>>> primary image, or explicitly as the last resort, or somewhere in the
>>>>>> middle).  An optional parameter, that defaults to appending, may be ok,
>>>>>> but we definitely need to consider how the order of children is affected
>>>>>> by hot-add.
>>>>>
>>>>> Certainly in the COLO case the two children are not identical; and IMHO we need
>>>>> to get away from thinking about ordering and start thinking about functional
>>>>> namingd - children.0/children.1 doesn't suggest the fact they behave
>>>>> differently.
>>>>
>>>> To me it does. If quorum is operating in a mode call "FIFO" I would
>>>> expect some order on the child nodes, and if the child nodes are
>>>> actually numbered in an ascending order, that is an obvious order.
>>>
>>> I don't understand why it's called 'FIFO'.
>>
>> Because in that mode quorum successively reads from all of its children
>> and returns the first successful result. So the First successful Input
>> is the one that becomes quorum's Output (there isn't much of a
>> successive output, so it doesn't make much sense to call that the First
>> Output, though...).
>>
>> I didn't name it, though. *waves hands defensively* :-)
> 
> But that description doesn't make sense for what COLO uses it for.
> 
> They have, on the primary host:
>    0) Local disk
>    1) an NBD connection to the secondary
> 
> So in theory a read should always happen from (0) and writes should
> go to both.

Well that's the way it works, isn't it?

I didn't mention what happens with writes, but those are indeed
distributed to all of quorum's children. And as long as the local disk
doesn't fail, data is always read from it alone.

All you need to do is make sure that the local disk is the first node in
whatever order FIFO is supposed to use.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-29 16:09                                 ` Max Reitz
@ 2016-03-29 17:33                                   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2016-03-29 17:33 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Changlong Xie, Alberto Garcia, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi, zhanghailiang

* Max Reitz (mreitz@redhat.com) wrote:
> On 29.03.2016 18:03, Dr. David Alan Gilbert wrote:
> > * Max Reitz (mreitz@redhat.com) wrote:
> >> On 29.03.2016 17:54, Dr. David Alan Gilbert wrote:
> >>> * Max Reitz (mreitz@redhat.com) wrote:
> >>>> On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
> >>>>> * Eric Blake (eblake@redhat.com) wrote:
> >>>>>> On 03/29/2016 09:38 AM, Max Reitz wrote:
> >>>>>>> On 17.03.2016 10:56, Wen Congyang wrote:
> >>>>>>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> >>>>>>>
> >>>>>>> [...]
> >>>>>>>
> >>>>>>>>> The children.0 notation is really confusing in the way that Berto
> >>>>>>>>> describes; I hit this a couple of months ago and it really doesn't
> >>>>>>>>> make sense.
> >>>>>>>>
> >>>>>>>> Do you mean: read from children.1 first, and then read from children.0 in
> >>>>>>>> fifo mode? Yes, the behavior is very strange.
> >>>>>>>
> >>>>>>> So is this intended or is it not? In
> >>>>>>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
> >>>>>>> you said that it is.
> >>>>>>>
> >>>>>>> I myself would indeed say it is very strange. If I were a user, I would
> >>>>>>> not expect this behavior. And as I developer, I think that how a BDS's
> >>>>>>> child is used by its parent should solely depend on its role (e.g.
> >>>>>>> whether it is "children.0" or "children.1").
> >>>>>>
> >>>>>> It sounds like the argument here, and in Max's thread on
> >>>>>> query-block-node-tree, is that we DO have cases where order matters, and
> >>>>>> so we need a way for the hot-add operation to explicitly specify where
> >>>>>> in the list a child is inserted (whether it is being inserted as the new
> >>>>>> primary image, or explicitly as the last resort, or somewhere in the
> >>>>>> middle).  An optional parameter, that defaults to appending, may be ok,
> >>>>>> but we definitely need to consider how the order of children is affected
> >>>>>> by hot-add.
> >>>>>
> >>>>> Certainly in the COLO case the two children are not identical; and IMHO we need
> >>>>> to get away from thinking about ordering and start thinking about functional
> >>>>> namingd - children.0/children.1 doesn't suggest the fact they behave
> >>>>> differently.
> >>>>
> >>>> To me it does. If quorum is operating in a mode call "FIFO" I would
> >>>> expect some order on the child nodes, and if the child nodes are
> >>>> actually numbered in an ascending order, that is an obvious order.
> >>>
> >>> I don't understand why it's called 'FIFO'.
> >>
> >> Because in that mode quorum successively reads from all of its children
> >> and returns the first successful result. So the First successful Input
> >> is the one that becomes quorum's Output (there isn't much of a
> >> successive output, so it doesn't make much sense to call that the First
> >> Output, though...).
> >>
> >> I didn't name it, though. *waves hands defensively* :-)
> > 
> > But that description doesn't make sense for what COLO uses it for.
> > 
> > They have, on the primary host:
> >    0) Local disk
> >    1) an NBD connection to the secondary
> > 
> > So in theory a read should always happen from (0) and writes should
> > go to both.
> 
> Well that's the way it works, isn't it?
> 
> I didn't mention what happens with writes, but those are indeed
> distributed to all of quorum's children. And as long as the local disk
> doesn't fail, data is always read from it alone.

I guess so, but it seems to be odd to name something after an ordering
when you never expect it to actually perform the read from anything other
than the first;  and certainly for fault tolerance stuff I think it's
important to define the failure modes.

Dave

> All you need to do is make sure that the local disk is the first node in
> whatever order FIFO is supposed to use.
> 
> Max
> 



--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-29 15:51                       ` Max Reitz
@ 2016-03-30 11:39                         ` Alberto Garcia
  2016-03-30 15:07                           ` Max Reitz
  0 siblings, 1 reply; 42+ messages in thread
From: Alberto Garcia @ 2016-03-30 11:39 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, Wen Congyang, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, zhanghailiang, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi

On Tue 29 Mar 2016 05:51:22 PM CEST, Max Reitz wrote:
>> It sounds like the argument here, and in Max's thread on
>> query-block-node-tree, is that we DO have cases where order matters, and
>> so we need a way for the hot-add operation to explicitly specify where
>> in the list a child is inserted (whether it is being inserted as the new
>> primary image, or explicitly as the last resort, or somewhere in the
>> middle).  An optional parameter, that defaults to appending, may be ok,
>> but we definitely need to consider how the order of children is affected
>> by hot-add.
>
> However, the order should be queriable after the fact, and there are
> three ways I see to accomplish this:
>
> (1) Make this information queriable as driver-specific BDS information.
>     I personally don't like it very much, but it would be fine.
> (2) Implement query-block-node-tree, make the order of child nodes
>     significant and thus represent the FIFO order there. I don't like
>     this because it would mean returning two orders through that child
>     node list: One is the numeric order (children.0, children.1, ...)
>     and another is the FIFO order, which are not necessarily equal.
> (3) Fix FIFO order to the child name (its role). I'm very much in favor
>     of this.
>
> While I don't have good arguments against (1), I think I have good
> arguments for (3) instead: It just doesn't make sense to have a numeric
> order of children if this order doesn't mean anything; especially if you
> suddenly do need the list of child nodes to be ordered. To me, it
> doesn't make any sense to introduce a new hidden order which takes
> precedence over this obvious user-visible order.

I'm not sure if I understand correctly what you mean in (3). The
user-visible FIFO order is the one specified when the Quorum is created:

children.0.file.filename=hd0.qcow2,
children.1.file.filename=hd1.qcow2,
...

Would you then call those BDS children.0, children.1, etc and make those
names be the ones that actually define how they are ordered internally?

I also have another (not directly related) question: why not simply use
the node name when removing children? I understood that the idea was
that it's possible to have the same node attached twice to the same
Quorum, but can you actually do that? And what's the use case?

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-30 11:39                         ` Alberto Garcia
@ 2016-03-30 15:07                           ` Max Reitz
  2016-03-31 11:42                             ` Alberto Garcia
  2016-04-11  5:18                             ` Changlong Xie
  0 siblings, 2 replies; 42+ messages in thread
From: Max Reitz @ 2016-03-30 15:07 UTC (permalink / raw)
  To: Alberto Garcia, Eric Blake, Wen Congyang, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, zhanghailiang, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 3554 bytes --]

On 30.03.2016 13:39, Alberto Garcia wrote:
> On Tue 29 Mar 2016 05:51:22 PM CEST, Max Reitz wrote:
>>> It sounds like the argument here, and in Max's thread on
>>> query-block-node-tree, is that we DO have cases where order matters, and
>>> so we need a way for the hot-add operation to explicitly specify where
>>> in the list a child is inserted (whether it is being inserted as the new
>>> primary image, or explicitly as the last resort, or somewhere in the
>>> middle).  An optional parameter, that defaults to appending, may be ok,
>>> but we definitely need to consider how the order of children is affected
>>> by hot-add.
>>
>> However, the order should be queriable after the fact, and there are
>> three ways I see to accomplish this:
>>
>> (1) Make this information queriable as driver-specific BDS information.
>>     I personally don't like it very much, but it would be fine.
>> (2) Implement query-block-node-tree, make the order of child nodes
>>     significant and thus represent the FIFO order there. I don't like
>>     this because it would mean returning two orders through that child
>>     node list: One is the numeric order (children.0, children.1, ...)
>>     and another is the FIFO order, which are not necessarily equal.
>> (3) Fix FIFO order to the child name (its role). I'm very much in favor
>>     of this.
>>
>> While I don't have good arguments against (1), I think I have good
>> arguments for (3) instead: It just doesn't make sense to have a numeric
>> order of children if this order doesn't mean anything; especially if you
>> suddenly do need the list of child nodes to be ordered. To me, it
>> doesn't make any sense to introduce a new hidden order which takes
>> precedence over this obvious user-visible order.
> 
> I'm not sure if I understand correctly what you mean in (3). The
> user-visible FIFO order is the one specified when the Quorum is created:
> 
> children.0.file.filename=hd0.qcow2,
> children.1.file.filename=hd1.qcow2,
> ...
> 
> Would you then call those BDS children.0, children.1, etc

They are already called that way; it's not their node name but their
"child role" name.

>                                                           and make those
> names be the ones that actually define how they are ordered internally?

Yes, that's what I meant.

> I also have another (not directly related) question: why not simply use
> the node name when removing children? I understood that the idea was
> that it's possible to have the same node attached twice to the same
> Quorum, but can you actually do that? And what's the use case?

What I like about using the child role name is that it automatically
prevents you from specifying a node that is not a child of the given parent.

Which makes me notice that it might be a good idea to require the user
to specify the child's role when adding a new child. In this version of
this series (where only quorum is supported), the children are just
inserted in numerical order (first free slot is taken first), but maybe
the user wants to insert them in a different order.

For quorum, this is basically irrelevant if the order doesn't mean
anything (which I don't like), but it may be relevant for other block
drivers.

And the "filling up quorum's children" topic then makes me notice that
(x-)blockdev-change should probably be transaction-able (so you can
restructure the whole BDS graph in a single transaction), but that's
something we can care about later on.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-30 15:07                           ` Max Reitz
@ 2016-03-31 11:42                             ` Alberto Garcia
  2016-03-31 12:31                               ` Dr. David Alan Gilbert
  2016-04-01 15:20                               ` Max Reitz
  2016-04-11  5:18                             ` Changlong Xie
  1 sibling, 2 replies; 42+ messages in thread
From: Alberto Garcia @ 2016-03-31 11:42 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, Wen Congyang, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, zhanghailiang, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi

On Wed 30 Mar 2016 05:07:15 PM CEST, Max Reitz wrote:
>> I also have another (not directly related) question: why not simply
>> use the node name when removing children? I understood that the idea
>> was that it's possible to have the same node attached twice to the
>> same Quorum, but can you actually do that? And what's the use case?
>
> What I like about using the child role name is that it automatically
> prevents you from specifying a node that is not a child of the given
> parent.

Right, but checking if a node is not a child and returning an error is
very simple. And it doesn't require the user to keep track of the node
name *and* the child role name.

Unless I'm forgetting something this would be the first time we expose
the child role name in the API, that's why I'm wondering if it's
something worth doing.

> Which makes me notice that it might be a good idea to require the user
> to specify the child's role when adding a new child. In this version
> of this series (where only quorum is supported), the children are just
> inserted in numerical order (first free slot is taken first), but
> maybe the user wants to insert them in a different order.

For the Quorum case it totally makes sense to let the user choose the
position of the new child.

But for creating a Quorum array in the first place we don't require
that, the order is the one that the user provides, and the user does not
need to know about the child role names at that point.

> And the "filling up quorum's children" topic then makes me notice that
> (x-)blockdev-change should probably be transaction-able (so you can
> restructure the whole BDS graph in a single transaction), but that's
> something we can care about later on.

I agree.

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-31 11:42                             ` Alberto Garcia
@ 2016-03-31 12:31                               ` Dr. David Alan Gilbert
  2016-04-01 15:20                               ` Max Reitz
  1 sibling, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2016-03-31 12:31 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Kevin Wolf, Changlong Xie, qemu block, Markus Armbruster,
	Jiang Yunhong, Dong Eddie, qemu devel, Max Reitz, Gonglei,
	Stefan Hajnoczi, zhanghailiang

* Alberto Garcia (berto@igalia.com) wrote:
> On Wed 30 Mar 2016 05:07:15 PM CEST, Max Reitz wrote:
> >> I also have another (not directly related) question: why not simply
> >> use the node name when removing children? I understood that the idea
> >> was that it's possible to have the same node attached twice to the
> >> same Quorum, but can you actually do that? And what's the use case?
> >
> > What I like about using the child role name is that it automatically
> > prevents you from specifying a node that is not a child of the given
> > parent.
> 
> Right, but checking if a node is not a child and returning an error is
> very simple. And it doesn't require the user to keep track of the node
> name *and* the child role name.
> 
> Unless I'm forgetting something this would be the first time we expose
> the child role name in the API, that's why I'm wondering if it's
> something worth doing.
> 
> > Which makes me notice that it might be a good idea to require the user
> > to specify the child's role when adding a new child. In this version
> > of this series (where only quorum is supported), the children are just
> > inserted in numerical order (first free slot is taken first), but
> > maybe the user wants to insert them in a different order.
> 
> For the Quorum case it totally makes sense to let the user choose the
> position of the new child.
> 
> But for creating a Quorum array in the first place we don't require
> that, the order is the one that the user provides, and the user does not
> need to know about the child role names at that point.

Actually in my setup I only add the local block device using the normal command
line, and use drive_add even at startup.  It solves a lot of the ordering
problems with getting COLO booted.

Dave

> 
> > And the "filling up quorum's children" topic then makes me notice that
> > (x-)blockdev-change should probably be transaction-able (so you can
> > restructure the whole BDS graph in a single transaction), but that's
> > something we can care about later on.
> 
> I agree.
> 
> Berto
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-31 11:42                             ` Alberto Garcia
  2016-03-31 12:31                               ` Dr. David Alan Gilbert
@ 2016-04-01 15:20                               ` Max Reitz
  2016-04-06  7:48                                 ` Wen Congyang
  1 sibling, 1 reply; 42+ messages in thread
From: Max Reitz @ 2016-04-01 15:20 UTC (permalink / raw)
  To: Alberto Garcia, Eric Blake, Wen Congyang, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, zhanghailiang, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 2386 bytes --]

On 31.03.2016 13:42, Alberto Garcia wrote:
> On Wed 30 Mar 2016 05:07:15 PM CEST, Max Reitz wrote:
>>> I also have another (not directly related) question: why not simply
>>> use the node name when removing children? I understood that the idea
>>> was that it's possible to have the same node attached twice to the
>>> same Quorum, but can you actually do that? And what's the use case?
>>
>> What I like about using the child role name is that it automatically
>> prevents you from specifying a node that is not a child of the given
>> parent.
> 
> Right, but checking if a node is not a child and returning an error is
> very simple. And it doesn't require the user to keep track of the node
> name *and* the child role name.

Yes. But I think that you need to know parent and child anyway if you
want to modify (delete) an edge in the graph.

Also, it may be possible to have multiple parents per node. Actually, it
is already possible because the BB-BDS relationship is modeled as a
parent-child relationship. Thus, I'm not sure whether it would be
sufficient to specify a single node if you want to delete a single edge.

> Unless I'm forgetting something this would be the first time we expose
> the child role name in the API, that's why I'm wondering if it's
> something worth doing.

Well, the roles are kind of exposed already. It's exactly what you
specify in -drive or blockdev-add.

>> Which makes me notice that it might be a good idea to require the user
>> to specify the child's role when adding a new child. In this version
>> of this series (where only quorum is supported), the children are just
>> inserted in numerical order (first free slot is taken first), but
>> maybe the user wants to insert them in a different order.
> 
> For the Quorum case it totally makes sense to let the user choose the
> position of the new child.
> 
> But for creating a Quorum array in the first place we don't require
> that, the order is the one that the user provides, and the user does not
> need to know about the child role names at that point.

Depends. If you create an empty quorum BDS and then add the children
using the QAPI command introduced in this series, you are right. But if
you add children along with creating the quorum BDS (be it via -drive or
via blockdev-add), one has to specify the child role names.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-04-01 15:20                               ` Max Reitz
@ 2016-04-06  7:48                                 ` Wen Congyang
  0 siblings, 0 replies; 42+ messages in thread
From: Wen Congyang @ 2016-04-06  7:48 UTC (permalink / raw)
  To: Max Reitz, Alberto Garcia, Eric Blake, Dr. David Alan Gilbert
  Cc: Kevin Wolf, Changlong Xie, zhanghailiang, qemu block,
	Jiang Yunhong, Dong Eddie, qemu devel, Markus Armbruster,
	Gonglei, Stefan Hajnoczi

On 04/01/2016 11:20 PM, Max Reitz wrote:
> On 31.03.2016 13:42, Alberto Garcia wrote:
>> On Wed 30 Mar 2016 05:07:15 PM CEST, Max Reitz wrote:
>>>> I also have another (not directly related) question: why not simply
>>>> use the node name when removing children? I understood that the idea
>>>> was that it's possible to have the same node attached twice to the
>>>> same Quorum, but can you actually do that? And what's the use case?
>>>
>>> What I like about using the child role name is that it automatically
>>> prevents you from specifying a node that is not a child of the given
>>> parent.
>>
>> Right, but checking if a node is not a child and returning an error is
>> very simple. And it doesn't require the user to keep track of the node
>> name *and* the child role name.
> 
> Yes. But I think that you need to know parent and child anyway if you
> want to modify (delete) an edge in the graph.
> 
> Also, it may be possible to have multiple parents per node. Actually, it
> is already possible because the BB-BDS relationship is modeled as a
> parent-child relationship. Thus, I'm not sure whether it would be
> sufficient to specify a single node if you want to delete a single edge.
> 
>> Unless I'm forgetting something this would be the first time we expose
>> the child role name in the API, that's why I'm wondering if it's
>> something worth doing.
> 
> Well, the roles are kind of exposed already. It's exactly what you
> specify in -drive or blockdev-add.
> 
>>> Which makes me notice that it might be a good idea to require the user
>>> to specify the child's role when adding a new child. In this version
>>> of this series (where only quorum is supported), the children are just
>>> inserted in numerical order (first free slot is taken first), but
>>> maybe the user wants to insert them in a different order.
>>
>> For the Quorum case it totally makes sense to let the user choose the
>> position of the new child.
>>
>> But for creating a Quorum array in the first place we don't require
>> that, the order is the one that the user provides, and the user does not
>> need to know about the child role names at that point.
> 
> Depends. If you create an empty quorum BDS and then add the children
> using the QAPI command introduced in this series, you are right. But if
> you add children along with creating the quorum BDS (be it via -drive or
> via blockdev-add), one has to specify the child role names.

I think the problem is that: the child role name is wrong.

If we always attach the child in the tail, we can do it like this:
the child role name is children.XXX, and the XXX's value is larger than
any child role name's XXX.

For example:
Quorum has one child: children.1(children.0 is removed)
We add a new child, its role name is children.2, not children.0.

If we want to attach the child not in the tail, for example:
Quorum has two children: children.0, children.1. And the new child should
be before children.1. In this case, we should rename children.1 to children.2
and the new child role name can be children.1. If we allow such usage, we
should rename the other child role name when add/deleting a child. It means
that we should query the role name again after add/deleting a child.

Thanks
Wen Congyang

> 
> Max
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-03-30 15:07                           ` Max Reitz
  2016-03-31 11:42                             ` Alberto Garcia
@ 2016-04-11  5:18                             ` Changlong Xie
  2016-04-12 16:21                               ` Max Reitz
  1 sibling, 1 reply; 42+ messages in thread
From: Changlong Xie @ 2016-04-11  5:18 UTC (permalink / raw)
  To: Max Reitz, Alberto Garcia, Eric Blake, Wen Congyang,
	Dr. David Alan Gilbert
  Cc: qemu devel, Kevin Wolf, Stefan Hajnoczi, Markus Armbruster,
	Dong Eddie, Jiang Yunhong, qemu block, zhanghailiang, Gonglei

On 03/30/2016 11:07 PM, Max Reitz wrote:
> On 30.03.2016 13:39, Alberto Garcia wrote:
>> On Tue 29 Mar 2016 05:51:22 PM CEST, Max Reitz wrote:
>>>> It sounds like the argument here, and in Max's thread on
>>>> query-block-node-tree, is that we DO have cases where order matters, and
>>>> so we need a way for the hot-add operation to explicitly specify where
>>>> in the list a child is inserted (whether it is being inserted as the new
>>>> primary image, or explicitly as the last resort, or somewhere in the
>>>> middle).  An optional parameter, that defaults to appending, may be ok,
>>>> but we definitely need to consider how the order of children is affected
>>>> by hot-add.
>>>
>>> However, the order should be queriable after the fact, and there are
>>> three ways I see to accomplish this:
>>>
>>> (1) Make this information queriable as driver-specific BDS information.
>>>      I personally don't like it very much, but it would be fine.
>>> (2) Implement query-block-node-tree, make the order of child nodes
>>>      significant and thus represent the FIFO order there. I don't like
>>>      this because it would mean returning two orders through that child
>>>      node list: One is the numeric order (children.0, children.1, ...)
>>>      and another is the FIFO order, which are not necessarily equal.
>>> (3) Fix FIFO order to the child name (its role). I'm very much in favor
>>>      of this.
>>>
>>> While I don't have good arguments against (1), I think I have good
>>> arguments for (3) instead: It just doesn't make sense to have a numeric
>>> order of children if this order doesn't mean anything; especially if you
>>> suddenly do need the list of child nodes to be ordered. To me, it
>>> doesn't make any sense to introduce a new hidden order which takes
>>> precedence over this obvious user-visible order.
>>
>> I'm not sure if I understand correctly what you mean in (3). The
>> user-visible FIFO order is the one specified when the Quorum is created:
>>
>> children.0.file.filename=hd0.qcow2,
>> children.1.file.filename=hd1.qcow2,
>> ...
>>
>> Would you then call those BDS children.0, children.1, etc
>
> They are already called that way; it's not their node name but their
> "child role" name.
>
>>                                                            and make those
>> names be the ones that actually define how they are ordered internally?
>
> Yes, that's what I meant.
>
Hi Max

I think you just mean what i draw in below chart:

1) Insert 4 children.
0           1          2          3
+----------------------------------------------------
|children.0|children.1|children.2|children.3|
+----------------------------------------------------

2) Remove the 2th child (s->children[1])

{ "execute": "x-blockdev-change", 
 

   "arguments": { "parent": "xxx", 
 

                  "child": "children.1" } }

0           1          2          3
+----------------------------------------------------
|children.0|children.1|children.2|children.3|
+------------------------+------------+--------------
                          |            | 	
                   +------+   +--------+
0           1     |    2     |
+----------------v----------v------------------------
|children.0|children.1|children.2|
+----------------------------------------------------

Remove children.1 and shorten the array, then rename children.{2,3} to 
children.{1.2}

3) Insert a new child

0           1          2         3
+----------------------------------------------------
|children.0|children.1|children.2|children.3|
+----------------------------------------------------

But as Wen said: 
http://lists.nongnu.org/archive/html/qemu-devel/2016-04/msg00898.html

Everytime we try to remove a children.i (i < n-1, so it's not the last 
element of the array[n]), we have to rename children.{i+1, n-1} to 
children.{i, n-2}.

Thanks
	-Xie

>> I also have another (not directly related) question: why not simply use
>> the node name when removing children? I understood that the idea was
>> that it's possible to have the same node attached twice to the same
>> Quorum, but can you actually do that? And what's the use case?
>
> What I like about using the child role name is that it automatically
> prevents you from specifying a node that is not a child of the given parent.
>
> Which makes me notice that it might be a good idea to require the user
> to specify the child's role when adding a new child. In this version of
> this series (where only quorum is supported), the children are just
> inserted in numerical order (first free slot is taken first), but maybe
> the user wants to insert them in a different order.
>
> For quorum, this is basically irrelevant if the order doesn't mean
> anything (which I don't like), but it may be relevant for other block
> drivers.
>
> And the "filling up quorum's children" topic then makes me notice that
> (x-)blockdev-change should probably be transaction-able (so you can
> restructure the whole BDS graph in a single transaction), but that's
> something we can care about later on.
>
> Max
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()
  2016-04-11  5:18                             ` Changlong Xie
@ 2016-04-12 16:21                               ` Max Reitz
  0 siblings, 0 replies; 42+ messages in thread
From: Max Reitz @ 2016-04-12 16:21 UTC (permalink / raw)
  To: Changlong Xie, Alberto Garcia, Eric Blake, Wen Congyang,
	Dr. David Alan Gilbert
  Cc: qemu devel, Kevin Wolf, Stefan Hajnoczi, Markus Armbruster,
	Dong Eddie, Jiang Yunhong, qemu block, zhanghailiang, Gonglei


[-- Attachment #1.1: Type: text/plain, Size: 4623 bytes --]

On 11.04.2016 07:18, Changlong Xie wrote:
> On 03/30/2016 11:07 PM, Max Reitz wrote:
>> On 30.03.2016 13:39, Alberto Garcia wrote:
>>> On Tue 29 Mar 2016 05:51:22 PM CEST, Max Reitz wrote:
>>>>> It sounds like the argument here, and in Max's thread on
>>>>> query-block-node-tree, is that we DO have cases where order
>>>>> matters, and
>>>>> so we need a way for the hot-add operation to explicitly specify where
>>>>> in the list a child is inserted (whether it is being inserted as
>>>>> the new
>>>>> primary image, or explicitly as the last resort, or somewhere in the
>>>>> middle).  An optional parameter, that defaults to appending, may be
>>>>> ok,
>>>>> but we definitely need to consider how the order of children is
>>>>> affected
>>>>> by hot-add.
>>>>
>>>> However, the order should be queriable after the fact, and there are
>>>> three ways I see to accomplish this:
>>>>
>>>> (1) Make this information queriable as driver-specific BDS information.
>>>>      I personally don't like it very much, but it would be fine.
>>>> (2) Implement query-block-node-tree, make the order of child nodes
>>>>      significant and thus represent the FIFO order there. I don't like
>>>>      this because it would mean returning two orders through that child
>>>>      node list: One is the numeric order (children.0, children.1, ...)
>>>>      and another is the FIFO order, which are not necessarily equal.
>>>> (3) Fix FIFO order to the child name (its role). I'm very much in favor
>>>>      of this.
>>>>
>>>> While I don't have good arguments against (1), I think I have good
>>>> arguments for (3) instead: It just doesn't make sense to have a numeric
>>>> order of children if this order doesn't mean anything; especially if
>>>> you
>>>> suddenly do need the list of child nodes to be ordered. To me, it
>>>> doesn't make any sense to introduce a new hidden order which takes
>>>> precedence over this obvious user-visible order.
>>>
>>> I'm not sure if I understand correctly what you mean in (3). The
>>> user-visible FIFO order is the one specified when the Quorum is created:
>>>
>>> children.0.file.filename=hd0.qcow2,
>>> children.1.file.filename=hd1.qcow2,
>>> ...
>>>
>>> Would you then call those BDS children.0, children.1, etc
>>
>> They are already called that way; it's not their node name but their
>> "child role" name.
>>
>>>                                                            and make
>>> those
>>> names be the ones that actually define how they are ordered internally?
>>
>> Yes, that's what I meant.
>>
> Hi Max
> 
> I think you just mean what i draw in below chart:
> 
> 1) Insert 4 children.
> 0           1          2          3
> +----------------------------------------------------
> |children.0|children.1|children.2|children.3|
> +----------------------------------------------------
> 
> 2) Remove the 2th child (s->children[1])
> 
> { "execute": "x-blockdev-change",
> 
>   "arguments": { "parent": "xxx",
> 
>                  "child": "children.1" } }
> 
> 0           1          2          3
> +----------------------------------------------------
> |children.0|children.1|children.2|children.3|
> +------------------------+------------+--------------
>                          |            |    
>                   +------+   +--------+
> 0           1     |    2     |
> +----------------v----------v------------------------
> |children.0|children.1|children.2|
> +----------------------------------------------------

No, what I meant, is:

 0          1          2          3
+----------+----------+----------+----------+
|children.0|children.1|children.2|children.3|
+----------+----------+----------+----------+

|
v

 0          1          2          3
+----------+----------+----------+----------+
|children.0|          |children.2|children.3|
+----------+----------+----------+----------+

I.e., children.1 simply ceases to exist.

Max

> Remove children.1 and shorten the array, then rename children.{2,3} to
> children.{1.2}
> 
> 3) Insert a new child
> 
> 0           1          2         3
> +----------------------------------------------------
> |children.0|children.1|children.2|children.3|
> +----------------------------------------------------
> 
> But as Wen said:
> http://lists.nongnu.org/archive/html/qemu-devel/2016-04/msg00898.html
> 
> Everytime we try to remove a children.i (i < n-1, so it's not the last
> element of the array[n]), we have to rename children.{i+1, n-1} to
> children.{i, n-2}.
> 
> Thanks
>     -Xie


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2016-04-12 16:21 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-10  2:49 [Qemu-devel] [PATCH v12 0/3] qapi: child add/delete support Changlong Xie
2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 1/3] Add new block driver interface to add/delete a BDS's child Changlong Xie
2016-03-10 14:57   ` Alberto Garcia
2016-03-11  1:17     ` Changlong Xie
2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child() Changlong Xie
2016-03-11 12:21   ` Alberto Garcia
2016-03-14  1:33     ` Changlong Xie
2016-03-14  6:02     ` Changlong Xie
2016-03-16 12:38       ` Alberto Garcia
2016-03-17  1:22         ` Wen Congyang
2016-03-17  9:10           ` Alberto Garcia
2016-03-17  9:44             ` Wen Congyang
2016-03-17  9:48               ` Dr. David Alan Gilbert
2016-03-17  9:56                 ` Wen Congyang
2016-03-17  9:59                   ` Dr. David Alan Gilbert
2016-03-17 10:07                   ` Alberto Garcia
2016-03-17 10:23                     ` Wen Congyang
2016-03-17 11:25                       ` Dr. David Alan Gilbert
2016-03-18  2:56                         ` Wen Congyang
2016-03-18 10:48                           ` Dr. David Alan Gilbert
2016-03-29 15:38                   ` Max Reitz
2016-03-29 15:44                     ` Eric Blake
2016-03-29 15:50                       ` Dr. David Alan Gilbert
2016-03-29 15:52                         ` Max Reitz
2016-03-29 15:54                           ` Dr. David Alan Gilbert
2016-03-29 15:59                             ` Max Reitz
2016-03-29 16:03                               ` Dr. David Alan Gilbert
2016-03-29 16:09                                 ` Max Reitz
2016-03-29 17:33                                   ` Dr. David Alan Gilbert
2016-03-29 15:51                       ` Max Reitz
2016-03-30 11:39                         ` Alberto Garcia
2016-03-30 15:07                           ` Max Reitz
2016-03-31 11:42                             ` Alberto Garcia
2016-03-31 12:31                               ` Dr. David Alan Gilbert
2016-04-01 15:20                               ` Max Reitz
2016-04-06  7:48                                 ` Wen Congyang
2016-04-11  5:18                             ` Changlong Xie
2016-04-12 16:21                               ` Max Reitz
2016-03-16  2:10     ` Wen Congyang
2016-03-10  2:49 ` [Qemu-devel] [PATCH v12 3/3] qmp: add monitor command to add/remove a child Changlong Xie
2016-03-11 12:48   ` Alberto Garcia
2016-03-28  6:09   ` Changlong Xie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.