All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH V9 00/11] Quorum block driver
@ 2013-10-02 12:39 Benoît Canet
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 01/11] quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB Benoît Canet
                   ` (11 more replies)
  0 siblings, 12 replies; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

It must be applied on top of "block: Add BlockDriver.bdrv_check_ext_snapshot."

This patchset create a block driver implementing a quorum using total qemu disk
images. Writes are mirrored on the $total files.
For the reading part the $total files are read at the same time and a vote is
done to determine if a qiov version is present $threshold or more times. It then
return this majority version to the upper layers.
When i < $threshold versions of the data are returned by the lower layer the
quorum is broken and the read return -EIO.

The goal of this patchset is to be turned in a QEMU block filter living just
above raw-*.c and below qcow2/qed when the required infrastructure will be done.

Main use of this feature will be people using NFS appliances which can be
subjected to bitflip errors.

This patchset can be used to replace blkverify and the out of tree blkmirror.

v9: integrate latests comments from kevin

Benoît Canet (11):
  quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB.
  quorum: Create BDRVQuorumState and BlkDriver and do init.
  quorum: Add quorum_aio_writev and its dependencies.
  blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from
    blkverify.
  quorum: Add quorum_aio_readv.
  quorum: Add quorum mechanism.
  quorum: Add quorum_getlength().
  quorum: Add quorum_invalidate_cache().
  quorum: Add quorum_co_get_block_status.
  quorum: Add quorum_co_flush().
  quorum: Add quorum_open() and quorum_close().

 block/Makefile.objs       |    1 +
 block/blkverify.c         |  108 +----
 block/quorum.c            | 1035 +++++++++++++++++++++++++++++++++++++++++++++
 configure                 |   36 ++
 include/monitor/monitor.h |    2 +
 include/qemu-common.h     |    2 +
 monitor.c                 |    2 +
 util/iov.c                |  103 +++++
 8 files changed, 1183 insertions(+), 106 deletions(-)
 create mode 100644 block/quorum.c

-- 
1.8.1.2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 01/11] quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB.
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-04 14:33   ` Max Reitz
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 02/11] quorum: Create BDRVQuorumState and BlkDriver and do init Benoît Canet
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/Makefile.objs |  1 +
 block/quorum.c      | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)
 create mode 100644 block/quorum.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 3bb85b5..05a65c2 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -3,6 +3,7 @@ block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-c
 block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o
+block-obj-y += quorum.o
 block-obj-y += parallels.o blkdebug.o blkverify.o
 block-obj-y += snapshot.o qapi.o
 block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
diff --git a/block/quorum.c b/block/quorum.c
new file mode 100644
index 0000000..76a1fbb
--- /dev/null
+++ b/block/quorum.c
@@ -0,0 +1,54 @@
+/*
+ * Quorum Block filter
+ *
+ * Copyright (C) 2012-2013 Nodalink, SARL.
+ *
+ * Author:
+ *   Benoît Canet <benoit.canet@irqsave.net>
+ *
+ * Based on the design and code of blkverify.c (Copyright (C) 2010 IBM, Corp)
+ * and blkmirror.c (Copyright (C) 2011 Red Hat, Inc).
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "block/block_int.h"
+
+typedef struct QuorumAIOCB QuorumAIOCB;
+
+/* Quorum will create one instance of the following structure per read/write
+ * operation it do on it's children.
+ * So for each read/write operation coming from the upper layer there will be
+ * $children_count QuorumSingleAIOCB.
+ */
+typedef struct QuorumSingleAIOCB {
+    BlockDriverAIOCB *aiocb;
+    QEMUIOVector qiov;
+    uint8_t *buf;
+    int ret;
+    QuorumAIOCB *parent;
+} QuorumSingleAIOCB;
+
+/* Quorum will use the following structure to track progress of each read/write
+ * operation received by the upper layer.
+ * This structure hold pointers to the QuorumSingleAIOCB structures instances
+ * used to do operations on each children and track overall progress.
+ */
+struct QuorumAIOCB {
+    BlockDriverAIOCB common;
+
+    /* Request metadata */
+    uint64_t sector_num;
+    int nb_sectors;
+
+    QEMUIOVector *qiov;         /* calling IOV */
+
+    QuorumSingleAIOCB *aios;    /* individual AIOs */
+    int count;                  /* number of completed AIOCB */
+    int success_count;          /* number of successfully completed AIOCB */
+    bool *finished;             /* completion signal for cancel */
+
+    bool is_read;
+    int vote_ret;
+};
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 02/11] quorum: Create BDRVQuorumState and BlkDriver and do init.
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 01/11] quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-04 14:34   ` Max Reitz
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 03/11] quorum: Add quorum_aio_writev and its dependencies Benoît Canet
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/quorum.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index 76a1fbb..9557e61 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -15,6 +15,16 @@
 
 #include "block/block_int.h"
 
+/* the following structure hold the state of one quorum instance */
+typedef struct {
+    BlockDriverState *bs; /* children BlockDriverStates */
+    int total;            /* children count */
+    int threshold;        /* if less than threshold children reads gave the same
+                           * result a quorum error occur.
+                           */
+    bool is_blkverify;    /* true if the driver is in blkverify mode */
+} BDRVQuorumState;
+
 typedef struct QuorumAIOCB QuorumAIOCB;
 
 /* Quorum will create one instance of the following structure per read/write
@@ -37,6 +47,7 @@ typedef struct QuorumSingleAIOCB {
  */
 struct QuorumAIOCB {
     BlockDriverAIOCB common;
+    BDRVQuorumState *bqs;
 
     /* Request metadata */
     uint64_t sector_num;
@@ -52,3 +63,17 @@ struct QuorumAIOCB {
     bool is_read;
     int vote_ret;
 };
+
+static BlockDriver bdrv_quorum = {
+    .format_name        = "quorum",
+    .protocol_name      = "quorum",
+
+    .instance_size      = sizeof(BDRVQuorumState),
+};
+
+static void bdrv_quorum_init(void)
+{
+    bdrv_register(&bdrv_quorum);
+}
+
+block_init(bdrv_quorum_init);
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 03/11] quorum: Add quorum_aio_writev and its dependencies.
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 01/11] quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB Benoît Canet
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 02/11] quorum: Create BDRVQuorumState and BlkDriver and do init Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-04 14:35   ` Max Reitz
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 04/11] blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from blkverify Benoît Canet
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/quorum.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index 9557e61..b49e3c6 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -64,11 +64,134 @@ struct QuorumAIOCB {
     int vote_ret;
 };
 
+static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
+{
+    QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
+    bool finished = false;
+
+    /* Wait for the request to finish */
+    acb->finished = &finished;
+    while (!finished) {
+        qemu_aio_wait();
+    }
+}
+
+static AIOCBInfo quorum_aiocb_info = {
+    .aiocb_size         = sizeof(QuorumAIOCB),
+    .cancel             = quorum_aio_cancel,
+};
+
+/* return the first error code get by each individual callbacks */
+static int quorum_get_first_error(QuorumAIOCB *acb)
+{
+    BDRVQuorumState *s = acb->bqs;
+    int i, ret = 0;
+
+    for (i = 0; i < s->total; i++) {
+        ret = acb->aios[i].ret;
+        if (ret) {
+            return ret;
+        }
+    }
+
+    /* should not pass here */
+    assert(false);
+}
+
+static void quorum_aio_finalize(QuorumAIOCB *acb)
+{
+    BDRVQuorumState *s = acb->bqs;
+    int ret;
+
+    ret = s->threshold <= acb->success_count ? 0 : quorum_get_first_error(acb);
+
+    acb->common.cb(acb->common.opaque, ret);
+    if (acb->finished) {
+        *acb->finished = true;
+    }
+    g_free(acb->aios);
+    qemu_aio_release(acb);
+}
+
+static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
+                                   BlockDriverState *bs,
+                                   QEMUIOVector *qiov,
+                                   uint64_t sector_num,
+                                   int nb_sectors,
+                                   BlockDriverCompletionFunc *cb,
+                                   void *opaque)
+{
+    QuorumAIOCB *acb = qemu_aio_get(&quorum_aiocb_info, bs, cb, opaque);
+    int i;
+
+    acb->bqs = s;
+    acb->sector_num = sector_num;
+    acb->nb_sectors = nb_sectors;
+    acb->qiov = qiov;
+    acb->aios = g_new0(QuorumSingleAIOCB, s->total);
+    acb->count = 0;
+    acb->success_count = 0;
+    acb->finished = NULL;
+    acb->is_read = false;
+    acb->vote_ret = 0;
+
+    for (i = 0; i < s->total; i++) {
+        acb->aios[i].buf = NULL;
+        acb->aios[i].ret = 0;
+        acb->aios[i].parent = acb;
+    }
+
+    return acb;
+}
+
+static void quorum_aio_cb(void *opaque, int ret)
+{
+    QuorumSingleAIOCB *sacb = opaque;
+    QuorumAIOCB *acb = sacb->parent;
+    BDRVQuorumState *s = acb->bqs;
+
+    sacb->ret = ret;
+    acb->count++;
+    if (ret == 0) {
+        acb->success_count++;
+    }
+    assert(acb->count <= s->total);
+    assert(acb->success_count <= s->total);
+    if (acb->count < s->total) {
+        return;
+    }
+
+    quorum_aio_finalize(acb);
+}
+
+static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
+                                          int64_t sector_num,
+                                          QEMUIOVector *qiov,
+                                          int nb_sectors,
+                                          BlockDriverCompletionFunc *cb,
+                                          void *opaque)
+{
+    BDRVQuorumState *s = bs->opaque;
+    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num, nb_sectors,
+                                      cb, opaque);
+    int i;
+
+    for (i = 0; i < s->total; i++) {
+        acb->aios[i].aiocb = bdrv_aio_writev(&s->bs[i], sector_num, qiov,
+                                             nb_sectors, &quorum_aio_cb,
+                                             &acb->aios[i]);
+    }
+
+    return &acb->common;
+}
+
 static BlockDriver bdrv_quorum = {
     .format_name        = "quorum",
     .protocol_name      = "quorum",
 
     .instance_size      = sizeof(BDRVQuorumState),
+
+    .bdrv_aio_writev    = quorum_aio_writev,
 };
 
 static void bdrv_quorum_init(void)
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 04/11] blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from blkverify.
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
                   ` (2 preceding siblings ...)
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 03/11] quorum: Add quorum_aio_writev and its dependencies Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 05/11] quorum: Add quorum_aio_readv Benoît Canet
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/blkverify.c     | 108 +-------------------------------------------------
 include/qemu-common.h |   2 +
 util/iov.c            | 103 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 107 insertions(+), 106 deletions(-)

diff --git a/block/blkverify.c b/block/blkverify.c
index 1e8e6d9..b64c638 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -185,110 +185,6 @@ static int64_t blkverify_getlength(BlockDriverState *bs)
     return bdrv_getlength(s->test_file);
 }
 
-/**
- * Check that I/O vector contents are identical
- *
- * @a:          I/O vector
- * @b:          I/O vector
- * @ret:        Offset to first mismatching byte or -1 if match
- */
-static ssize_t blkverify_iovec_compare(QEMUIOVector *a, QEMUIOVector *b)
-{
-    int i;
-    ssize_t offset = 0;
-
-    assert(a->niov == b->niov);
-    for (i = 0; i < a->niov; i++) {
-        size_t len = 0;
-        uint8_t *p = (uint8_t *)a->iov[i].iov_base;
-        uint8_t *q = (uint8_t *)b->iov[i].iov_base;
-
-        assert(a->iov[i].iov_len == b->iov[i].iov_len);
-        while (len < a->iov[i].iov_len && *p++ == *q++) {
-            len++;
-        }
-
-        offset += len;
-
-        if (len != a->iov[i].iov_len) {
-            return offset;
-        }
-    }
-    return -1;
-}
-
-typedef struct {
-    int src_index;
-    struct iovec *src_iov;
-    void *dest_base;
-} IOVectorSortElem;
-
-static int sortelem_cmp_src_base(const void *a, const void *b)
-{
-    const IOVectorSortElem *elem_a = a;
-    const IOVectorSortElem *elem_b = b;
-
-    /* Don't overflow */
-    if (elem_a->src_iov->iov_base < elem_b->src_iov->iov_base) {
-        return -1;
-    } else if (elem_a->src_iov->iov_base > elem_b->src_iov->iov_base) {
-        return 1;
-    } else {
-        return 0;
-    }
-}
-
-static int sortelem_cmp_src_index(const void *a, const void *b)
-{
-    const IOVectorSortElem *elem_a = a;
-    const IOVectorSortElem *elem_b = b;
-
-    return elem_a->src_index - elem_b->src_index;
-}
-
-/**
- * Copy contents of I/O vector
- *
- * The relative relationships of overlapping iovecs are preserved.  This is
- * necessary to ensure identical semantics in the cloned I/O vector.
- */
-static void blkverify_iovec_clone(QEMUIOVector *dest, const QEMUIOVector *src,
-                                  void *buf)
-{
-    IOVectorSortElem sortelems[src->niov];
-    void *last_end;
-    int i;
-
-    /* Sort by source iovecs by base address */
-    for (i = 0; i < src->niov; i++) {
-        sortelems[i].src_index = i;
-        sortelems[i].src_iov = &src->iov[i];
-    }
-    qsort(sortelems, src->niov, sizeof(sortelems[0]), sortelem_cmp_src_base);
-
-    /* Allocate buffer space taking into account overlapping iovecs */
-    last_end = NULL;
-    for (i = 0; i < src->niov; i++) {
-        struct iovec *cur = sortelems[i].src_iov;
-        ptrdiff_t rewind = 0;
-
-        /* Detect overlap */
-        if (last_end && last_end > cur->iov_base) {
-            rewind = last_end - cur->iov_base;
-        }
-
-        sortelems[i].dest_base = buf - rewind;
-        buf += cur->iov_len - MIN(rewind, cur->iov_len);
-        last_end = MAX(cur->iov_base + cur->iov_len, last_end);
-    }
-
-    /* Sort by source iovec index and build destination iovec */
-    qsort(sortelems, src->niov, sizeof(sortelems[0]), sortelem_cmp_src_index);
-    for (i = 0; i < src->niov; i++) {
-        qemu_iovec_add(dest, sortelems[i].dest_base, src->iov[i].iov_len);
-    }
-}
-
 static BlkverifyAIOCB *blkverify_aio_get(BlockDriverState *bs, bool is_write,
                                          int64_t sector_num, QEMUIOVector *qiov,
                                          int nb_sectors,
@@ -352,7 +248,7 @@ static void blkverify_aio_cb(void *opaque, int ret)
 
 static void blkverify_verify_readv(BlkverifyAIOCB *acb)
 {
-    ssize_t offset = blkverify_iovec_compare(acb->qiov, &acb->raw_qiov);
+    ssize_t offset = qemu_iovec_compare(acb->qiov, &acb->raw_qiov);
     if (offset != -1) {
         blkverify_err(acb, "contents mismatch in sector %" PRId64,
                       acb->sector_num + (int64_t)(offset / BDRV_SECTOR_SIZE));
@@ -370,7 +266,7 @@ static BlockDriverAIOCB *blkverify_aio_readv(BlockDriverState *bs,
     acb->verify = blkverify_verify_readv;
     acb->buf = qemu_blockalign(bs->file, qiov->size);
     qemu_iovec_init(&acb->raw_qiov, acb->qiov->niov);
-    blkverify_iovec_clone(&acb->raw_qiov, qiov, acb->buf);
+    qemu_iovec_clone(&acb->raw_qiov, qiov, acb->buf);
 
     bdrv_aio_readv(s->test_file, sector_num, qiov, nb_sectors,
                    blkverify_aio_cb, acb);
diff --git a/include/qemu-common.h b/include/qemu-common.h
index 5054836..d70783e 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -346,6 +346,8 @@ size_t qemu_iovec_from_buf(QEMUIOVector *qiov, size_t offset,
                            const void *buf, size_t bytes);
 size_t qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
                          int fillc, size_t bytes);
+ssize_t qemu_iovec_compare(QEMUIOVector *a, QEMUIOVector *b);
+void qemu_iovec_clone(QEMUIOVector *dest, const QEMUIOVector *src, void *buf);
 
 bool buffer_is_zero(const void *buf, size_t len);
 
diff --git a/util/iov.c b/util/iov.c
index bb46c04..f869fb9 100644
--- a/util/iov.c
+++ b/util/iov.c
@@ -378,6 +378,109 @@ size_t qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
     return iov_memset(qiov->iov, qiov->niov, offset, fillc, bytes);
 }
 
+/**
+ * Check that I/O vector contents are identical
+ *
+ * @a:          I/O vector
+ * @b:          I/O vector
+ * @ret:        Offset to first mismatching byte or -1 if match
+ */
+ssize_t qemu_iovec_compare(QEMUIOVector *a, QEMUIOVector *b)
+{
+    int i;
+    ssize_t offset = 0;
+
+    assert(a->niov == b->niov);
+    for (i = 0; i < a->niov; i++) {
+        size_t len = 0;
+        uint8_t *p = (uint8_t *)a->iov[i].iov_base;
+        uint8_t *q = (uint8_t *)b->iov[i].iov_base;
+
+        assert(a->iov[i].iov_len == b->iov[i].iov_len);
+        while (len < a->iov[i].iov_len && *p++ == *q++) {
+            len++;
+        }
+
+        offset += len;
+
+        if (len != a->iov[i].iov_len) {
+            return offset;
+        }
+    }
+    return -1;
+}
+
+typedef struct {
+    int src_index;
+    struct iovec *src_iov;
+    void *dest_base;
+} IOVectorSortElem;
+
+static int sortelem_cmp_src_base(const void *a, const void *b)
+{
+    const IOVectorSortElem *elem_a = a;
+    const IOVectorSortElem *elem_b = b;
+
+    /* Don't overflow */
+    if (elem_a->src_iov->iov_base < elem_b->src_iov->iov_base) {
+        return -1;
+    } else if (elem_a->src_iov->iov_base > elem_b->src_iov->iov_base) {
+        return 1;
+    } else {
+        return 0;
+    }
+}
+
+static int sortelem_cmp_src_index(const void *a, const void *b)
+{
+    const IOVectorSortElem *elem_a = a;
+    const IOVectorSortElem *elem_b = b;
+
+    return elem_a->src_index - elem_b->src_index;
+}
+
+/**
+ * Copy contents of I/O vector
+ *
+ * The relative relationships of overlapping iovecs are preserved.  This is
+ * necessary to ensure identical semantics in the cloned I/O vector.
+ */
+void qemu_iovec_clone(QEMUIOVector *dest, const QEMUIOVector *src, void *buf)
+{
+    IOVectorSortElem sortelems[src->niov];
+    void *last_end;
+    int i;
+
+    /* Sort by source iovecs by base address */
+    for (i = 0; i < src->niov; i++) {
+        sortelems[i].src_index = i;
+        sortelems[i].src_iov = &src->iov[i];
+    }
+    qsort(sortelems, src->niov, sizeof(sortelems[0]), sortelem_cmp_src_base);
+
+    /* Allocate buffer space taking into account overlapping iovecs */
+    last_end = NULL;
+    for (i = 0; i < src->niov; i++) {
+        struct iovec *cur = sortelems[i].src_iov;
+        ptrdiff_t rewind = 0;
+
+        /* Detect overlap */
+        if (last_end && last_end > cur->iov_base) {
+            rewind = last_end - cur->iov_base;
+        }
+
+        sortelems[i].dest_base = buf - rewind;
+        buf += cur->iov_len - MIN(rewind, cur->iov_len);
+        last_end = MAX(cur->iov_base + cur->iov_len, last_end);
+    }
+
+    /* Sort by source iovec index and build destination iovec */
+    qsort(sortelems, src->niov, sizeof(sortelems[0]), sortelem_cmp_src_index);
+    for (i = 0; i < src->niov; i++) {
+        qemu_iovec_add(dest, sortelems[i].dest_base, src->iov[i].iov_len);
+    }
+}
+
 size_t iov_discard_front(struct iovec **iov, unsigned int *iov_cnt,
                          size_t bytes)
 {
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 05/11] quorum: Add quorum_aio_readv.
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
                   ` (3 preceding siblings ...)
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 04/11] blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from blkverify Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-04 14:38   ` Max Reitz
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 06/11] quorum: Add quorum mechanism Benoît Canet
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/quorum.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index b49e3c6..f0fc0e9 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -101,14 +101,23 @@ static int quorum_get_first_error(QuorumAIOCB *acb)
 static void quorum_aio_finalize(QuorumAIOCB *acb)
 {
     BDRVQuorumState *s = acb->bqs;
-    int ret;
+    int i, ret;
 
     ret = s->threshold <= acb->success_count ? 0 : quorum_get_first_error(acb);
 
+    for (i = 0; i < s->total; i++) {
+        qemu_vfree(acb->aios[i].buf);
+        acb->aios[i].buf = NULL;
+        acb->aios[i].ret = 0;
+    }
+
     acb->common.cb(acb->common.opaque, ret);
     if (acb->finished) {
         *acb->finished = true;
     }
+    for (i = 0; i < s->total; i++) {
+        qemu_iovec_destroy(&acb->aios[i].qiov);
+    }
     g_free(acb->aios);
     qemu_aio_release(acb);
 }
@@ -164,6 +173,34 @@ static void quorum_aio_cb(void *opaque, int ret)
     quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
+                                         int64_t sector_num,
+                                         QEMUIOVector *qiov,
+                                         int nb_sectors,
+                                         BlockDriverCompletionFunc *cb,
+                                         void *opaque)
+{
+    BDRVQuorumState *s = bs->opaque;
+    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
+                                      nb_sectors, cb, opaque);
+    int i;
+
+    acb->is_read = true;
+
+    for (i = 0; i < s->total; i++) {
+        acb->aios[i].buf = qemu_blockalign(bs->file, qiov->size);
+        qemu_iovec_init(&acb->aios[i].qiov, qiov->niov);
+        qemu_iovec_clone(&acb->aios[i].qiov, qiov, acb->aios[i].buf);
+    }
+
+    for (i = 0; i < s->total; i++) {
+        bdrv_aio_readv(&s->bs[i], sector_num, qiov, nb_sectors,
+                       quorum_aio_cb, &acb->aios[i]);
+    }
+
+    return &acb->common;
+}
+
 static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
                                           int64_t sector_num,
                                           QEMUIOVector *qiov,
@@ -191,6 +228,7 @@ static BlockDriver bdrv_quorum = {
 
     .instance_size      = sizeof(BDRVQuorumState),
 
+    .bdrv_aio_readv     = quorum_aio_readv,
     .bdrv_aio_writev    = quorum_aio_writev,
 };
 
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 06/11] quorum: Add quorum mechanism.
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
                   ` (4 preceding siblings ...)
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 05/11] quorum: Add quorum_aio_readv Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-04 14:48   ` Max Reitz
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 07/11] quorum: Add quorum_getlength() Benoît Canet
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Use gnutls's SHA-256 to compare versions.

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/Makefile.objs       |   2 +-
 block/quorum.c            | 321 +++++++++++++++++++++++++++++++++++++++++++++-
 configure                 |  36 ++++++
 include/monitor/monitor.h |   2 +
 monitor.c                 |   2 +
 5 files changed, 361 insertions(+), 2 deletions(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 05a65c2..adcdc21 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -3,7 +3,7 @@ block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-c
 block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o
-block-obj-y += quorum.o
+block-obj-$(CONFIG_QUORUM) += quorum.o
 block-obj-y += parallels.o blkdebug.o blkverify.o
 block-obj-y += snapshot.o qapi.o
 block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
diff --git a/block/quorum.c b/block/quorum.c
index f0fc0e9..e235ac1 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -13,7 +13,43 @@
  * See the COPYING file in the top-level directory.
  */
 
+#include <gnutls/gnutls.h>
+#include <gnutls/crypto.h>
 #include "block/block_int.h"
+#include "qapi/qmp/qjson.h"
+
+#define HASH_LENGTH 32
+
+/* This union hold a vote hash value */
+typedef union QuorumVoteValue {
+    char h[HASH_LENGTH];       /* SHA-256 hash */
+    int64_t l;                 /* simpler 64 bits hash */
+} QuorumVoteValue;
+
+/* A vote item */
+typedef struct QuorumVoteItem {
+    int index;
+    QLIST_ENTRY(QuorumVoteItem) next;
+} QuorumVoteItem;
+
+/* this structure is a vote version. A version is the set of vote sharing the
+ * same vote value.
+ * The set of vote will be tracked with the items field and it's count is
+ * vote_count.
+ */
+typedef struct QuorumVoteVersion {
+    QuorumVoteValue value;
+    int index;
+    int vote_count;
+    QLIST_HEAD(, QuorumVoteItem) items;
+    QLIST_ENTRY(QuorumVoteVersion) next;
+} QuorumVoteVersion;
+
+/* this structure hold a group of vote versions together */
+typedef struct QuorumVotes {
+    QLIST_HEAD(, QuorumVoteVersion) vote_list;
+    int (*compare)(QuorumVoteValue *a, QuorumVoteValue *b);
+} QuorumVotes;
 
 /* the following structure hold the state of one quorum instance */
 typedef struct {
@@ -60,10 +96,14 @@ struct QuorumAIOCB {
     int success_count;          /* number of successfully completed AIOCB */
     bool *finished;             /* completion signal for cancel */
 
+    QuorumVotes votes;
+
     bool is_read;
     int vote_ret;
 };
 
+static void quorum_vote(QuorumAIOCB *acb);
+
 static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
@@ -111,6 +151,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
         acb->aios[i].ret = 0;
     }
 
+    if (acb->vote_ret) {
+        ret = acb->vote_ret;
+    }
+
     acb->common.cb(acb->common.opaque, ret);
     if (acb->finished) {
         *acb->finished = true;
@@ -122,6 +166,11 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
     qemu_aio_release(acb);
 }
 
+static int quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
+{
+    return memcmp(a->h, b->h, HASH_LENGTH);
+}
+
 static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
                                    BlockDriverState *bs,
                                    QEMUIOVector *qiov,
@@ -141,6 +190,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
     acb->count = 0;
     acb->success_count = 0;
     acb->finished = NULL;
+    acb->votes.compare = quorum_sha256_compare;
     acb->is_read = false;
     acb->vote_ret = 0;
 
@@ -170,9 +220,278 @@ static void quorum_aio_cb(void *opaque, int ret)
         return;
     }
 
+    /* Do the vote on read */
+    if (acb->is_read) {
+        quorum_vote(acb);
+    }
+
     quorum_aio_finalize(acb);
 }
 
+static void quorum_report_bad(QuorumAIOCB *acb, int index)
+{
+    QObject *data;
+    data = qobject_from_jsonf("{ 'children-index': %i"
+                              ", 'sector-num': %" PRId64
+                              ", 'sectors-count': %i }",
+                              index,
+                              acb->sector_num,
+                              acb->nb_sectors);
+    monitor_protocol_event(QEVENT_QUORUM_REPORT_BAD, data);
+}
+
+static void quorum_report_failure(QuorumAIOCB *acb)
+{
+    QObject *data;
+    data = qobject_from_jsonf("{ 'sector-num': %" PRId64
+                              ", 'sectors-count': %i }",
+                              acb->sector_num,
+                              acb->nb_sectors);
+    monitor_protocol_event(QEVENT_QUORUM_FAILURE, data);
+}
+
+static void quorum_report_bad_versions(QuorumAIOCB *acb,
+                                      QuorumVoteValue *value)
+{
+    QuorumVoteVersion *version;
+    QuorumVoteItem *item;
+
+    QLIST_FOREACH(version, &acb->votes.vote_list, next) {
+        if (!acb->votes.compare(&version->value, value)) {
+            continue;
+        }
+        QLIST_FOREACH(item, &version->items, next) {
+            quorum_report_bad(acb, item->index);
+        }
+    }
+}
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+    int i;
+    assert(dest->niov == source->niov);
+    assert(dest->size == source->size);
+    for (i = 0; i < source->niov; i++) {
+        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+        memcpy(dest->iov[i].iov_base,
+               source->iov[i].iov_base,
+               source->iov[i].iov_len);
+    }
+}
+
+static void quorum_count_vote(QuorumVotes *votes,
+                              QuorumVoteValue *value,
+                              int index)
+{
+    QuorumVoteVersion *v = NULL, *version = NULL;
+    QuorumVoteItem *item;
+
+    /* look if we have something with this hash */
+    QLIST_FOREACH(v, &votes->vote_list, next) {
+        if (!votes->compare(&v->value, value)) {
+            version = v;
+            break;
+        }
+    }
+
+    /* It's a version not yet in the list add it */
+    if (!version) {
+        version = g_new0(QuorumVoteVersion, 1);
+        QLIST_INIT(&version->items);
+        memcpy(&version->value, value, sizeof(version->value));
+        version->index = index;
+        version->vote_count = 0;
+        QLIST_INSERT_HEAD(&votes->vote_list, version, next);
+    }
+
+    version->vote_count++;
+
+    item = g_new0(QuorumVoteItem, 1);
+    item->index = index;
+    QLIST_INSERT_HEAD(&version->items, item, next);
+}
+
+static void quorum_free_vote_list(QuorumVotes *votes)
+{
+    QuorumVoteVersion *version, *next_version;
+    QuorumVoteItem *item, *next_item;
+
+    QLIST_FOREACH_SAFE(version, &votes->vote_list, next, next_version) {
+        QLIST_REMOVE(version, next);
+        QLIST_FOREACH_SAFE(item, &version->items, next, next_item) {
+            QLIST_REMOVE(item, next);
+            g_free(item);
+        }
+        g_free(version);
+    }
+}
+
+static int quorum_compute_hash(QuorumAIOCB *acb, int i, QuorumVoteValue *hash)
+{
+    int j, ret;
+    gnutls_hash_hd_t dig;
+    QEMUIOVector *qiov = &acb->aios[i].qiov;
+
+    ret = gnutls_hash_init(&dig, GNUTLS_DIG_SHA256);
+
+    if (ret < 0) {
+        return ret;
+    }
+
+    for (j = 0; j < qiov->niov; j++) {
+        ret = gnutls_hash(dig, qiov->iov[j].iov_base, qiov->iov[j].iov_len);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    gnutls_hash_deinit(dig, (void *) hash);
+
+    return 0;
+}
+
+static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
+{
+    int i = 0;
+    QuorumVoteVersion *candidate, *winner = NULL;
+
+    QLIST_FOREACH(candidate, &votes->vote_list, next) {
+        if (candidate->vote_count > i) {
+            i = candidate->vote_count;
+            winner = candidate;
+        }
+    }
+
+    return winner;
+}
+
+static bool quorum_iovec_compare(QEMUIOVector *a, QEMUIOVector *b)
+{
+    int i;
+    int result;
+
+    assert(a->niov == b->niov);
+    for (i = 0; i < a->niov; i++) {
+        assert(a->iov[i].iov_len == b->iov[i].iov_len);
+        result = memcmp(a->iov[i].iov_base,
+                        b->iov[i].iov_base,
+                        a->iov[i].iov_len);
+        if (result) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
+static void GCC_FMT_ATTR(2, 3) quorum_err(QuorumAIOCB *acb,
+                                          const char *fmt, ...)
+{
+    va_list ap;
+
+    va_start(ap, fmt);
+    fprintf(stderr, "quorum: sector_num=%" PRId64 " nb_sectors=%d ",
+            acb->sector_num, acb->nb_sectors);
+    vfprintf(stderr, fmt, ap);
+    fprintf(stderr, "\n");
+    va_end(ap);
+    exit(1);
+}
+
+static bool quorum_compare(QuorumAIOCB *acb,
+                           QEMUIOVector *a,
+                           QEMUIOVector *b)
+{
+    BDRVQuorumState *s = acb->bqs;
+    ssize_t offset;
+
+    /* This driver will replace blkverify in this particular case */
+    if (s->is_blkverify) {
+        offset = qemu_iovec_compare(a, b);
+        if (offset != -1) {
+            quorum_err(acb, "contents mismatch in sector %" PRId64,
+                       acb->sector_num +
+                       (uint64_t)(offset / BDRV_SECTOR_SIZE));
+        }
+        return true;
+    }
+
+    return quorum_iovec_compare(a, b);
+}
+
+static void quorum_vote(QuorumAIOCB *acb)
+{
+    bool quorum = true;
+    int i, j, ret;
+    QuorumVoteValue hash;
+    BDRVQuorumState *s = acb->bqs;
+    QuorumVoteVersion *winner;
+
+    /* get the index of the first successful read */
+    for (i = 0; i < s->total; i++) {
+        if (!acb->aios[i].ret) {
+            break;
+        }
+    }
+
+    /* compare this read with all other successful read looking for quorum */
+    for (j = i + 1; j < s->total; j++) {
+        if (acb->aios[j].ret) {
+            continue;
+        }
+        quorum = quorum_compare(acb, &acb->aios[i].qiov, &acb->aios[j].qiov);
+        if (!quorum) {
+            break;
+       }
+    }
+
+    /* Every successful read agrees -> Quorum */
+    if (quorum) {
+        quorum_copy_qiov(acb->qiov, &acb->aios[i].qiov);
+        return;
+    }
+
+    /* compute hashs for each successful read, also store indexes */
+    for (i = 0; i < s->total; i++) {
+        if (acb->aios[i].ret) {
+            continue;
+        }
+        ret = quorum_compute_hash(acb, i, &hash);
+        /* if ever the hash computation failed */
+        if (ret < 0) {
+            acb->vote_ret = ret;
+            goto free_exit;
+        }
+        quorum_count_vote(&acb->votes, &hash, i);
+    }
+
+    /* vote to select the most represented version */
+    winner = quorum_get_vote_winner(&acb->votes);
+    /* every vote version are differents -> error */
+    if (!winner) {
+        quorum_report_failure(acb);
+        acb->vote_ret = -EIO;
+        goto free_exit;
+    }
+
+    /* if the winner count is smaller then threshold read fail */
+    if (winner->vote_count < s->threshold) {
+        quorum_report_failure(acb);
+        acb->vote_ret = -EIO;
+        goto free_exit;
+    }
+
+    /* we have a winner: copy it */
+    quorum_copy_qiov(acb->qiov, &acb->aios[winner->index].qiov);
+
+    /* some versions are bad print them */
+    quorum_report_bad_versions(acb, &winner->value);
+
+free_exit:
+    /* free lists */
+    quorum_free_vote_list(&acb->votes);
+}
+
 static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
                                          int64_t sector_num,
                                          QEMUIOVector *qiov,
@@ -194,7 +513,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
     }
 
     for (i = 0; i < s->total; i++) {
-        bdrv_aio_readv(&s->bs[i], sector_num, qiov, nb_sectors,
+        bdrv_aio_readv(&s->bs[i], sector_num, &acb->aios[i].qiov, nb_sectors,
                        quorum_aio_cb, &acb->aios[i]);
     }
 
diff --git a/configure b/configure
index 23dbaaf..efbe7d9 100755
--- a/configure
+++ b/configure
@@ -246,6 +246,7 @@ gtk=""
 gtkabi="2.0"
 tpm="no"
 libssh2=""
+quorum="no"
 
 # parse CC options first
 for opt do
@@ -972,6 +973,10 @@ for opt do
   ;;
   --enable-libssh2) libssh2="yes"
   ;;
+  --disable-quorum) quorum="no"
+  ;;
+  --enable-quorum) quorum="yes"
+  ;;
   *) echo "ERROR: unknown option $opt"; show_help="yes"
   ;;
   esac
@@ -1203,6 +1208,8 @@ echo "  --gcov=GCOV              use specified gcov [$gcov_tool]"
 echo "  --enable-tpm             enable TPM support"
 echo "  --disable-libssh2        disable ssh block device support"
 echo "  --enable-libssh2         enable ssh block device support"
+echo "  --disable-quorum         disable quorum block filter support"
+echo "  --enable-quorum          enable quorum block filter support"
 echo ""
 echo "NOTE: The object files are built at the place where configure is launched"
 exit 1
@@ -1895,6 +1902,30 @@ EOF
 fi
 
 ##########################################
+# Quorum probe (check for gnutls)
+if test "$quorum" != "no" ; then
+cat > $TMPC <<EOF
+#include <gnutls/gnutls.h>
+#include <gnutls/crypto.h>
+int main(void) {char data[4096], digest[32];
+gnutls_hash_fast(GNUTLS_DIG_SHA256, data, 4096, digest);
+return 0;
+}
+EOF
+quorum_tls_cflags=`$pkg_config --cflags gnutls 2> /dev/null`
+quorum_tls_libs=`$pkg_config --libs gnutls 2> /dev/null`
+if compile_prog "$quorum_tls_cflags" "$quorum_tls_libs" ; then
+  qcow_tls=yes
+  libs_softmmu="$quorum_tls_libs $libs_softmmu"
+  libs_tools="$quorum_tls_libs $libs_softmmu"
+  QEMU_CFLAGS="$QEMU_CFLAGS $quorum_tls_cflags"
+else
+  echo "gnutls > 2.10.0 required to compile Quorum"
+  exit 1
+fi
+fi
+
+##########################################
 # VNC SASL detection
 if test "$vnc" = "yes" -a "$vnc_sasl" != "no" ; then
   cat > $TMPC <<EOF
@@ -3758,6 +3789,7 @@ echo "TPM support       $tpm"
 echo "libssh2 support   $libssh2"
 echo "TPM passthrough   $tpm_passthrough"
 echo "QOM debugging     $qom_cast_debug"
+echo "Quorum            $quorum"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -4156,6 +4188,10 @@ if test "$libssh2" = "yes" ; then
   echo "CONFIG_LIBSSH2=y" >> $config_host_mak
 fi
 
+if test "$quorum" = "yes" ; then
+  echo "CONFIG_QUORUM=y" >> $config_host_mak
+fi
+
 if test "$virtio_blk_data_plane" = "yes" ; then
   echo 'CONFIG_VIRTIO_BLK_DATA_PLANE=$(CONFIG_VIRTIO)' >> $config_host_mak
 fi
diff --git a/include/monitor/monitor.h b/include/monitor/monitor.h
index 10fa0e3..51902ed 100644
--- a/include/monitor/monitor.h
+++ b/include/monitor/monitor.h
@@ -49,6 +49,8 @@ typedef enum MonitorEvent {
     QEVENT_SPICE_MIGRATE_COMPLETED,
     QEVENT_GUEST_PANICKED,
     QEVENT_BLOCK_IMAGE_CORRUPTED,
+    QEVENT_QUORUM_FAILURE,
+    QEVENT_QUORUM_REPORT_BAD,
 
     /* Add to 'monitor_event_names' array in monitor.c when
      * defining new events here */
diff --git a/monitor.c b/monitor.c
index 74f3f1b..89d139f 100644
--- a/monitor.c
+++ b/monitor.c
@@ -507,6 +507,8 @@ static const char *monitor_event_names[] = {
     [QEVENT_SPICE_MIGRATE_COMPLETED] = "SPICE_MIGRATE_COMPLETED",
     [QEVENT_GUEST_PANICKED] = "GUEST_PANICKED",
     [QEVENT_BLOCK_IMAGE_CORRUPTED] = "BLOCK_IMAGE_CORRUPTED",
+    [QEVENT_QUORUM_FAILURE] = "QUORUM_FAILURE",
+    [QEVENT_QUORUM_REPORT_BAD] = "QUORUM_REPORT_BAD",
 };
 QEMU_BUILD_BUG_ON(ARRAY_SIZE(monitor_event_names) != QEVENT_MAX)
 
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 07/11] quorum: Add quorum_getlength().
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
                   ` (5 preceding siblings ...)
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 06/11] quorum: Add quorum mechanism Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-04 14:49   ` Max Reitz
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 08/11] quorum: Add quorum_invalidate_cache() Benoît Canet
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Check that every bs file return the same length.
If not return -EIO to disable the quorum and
avoid length discrepancy.

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/quorum.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index e235ac1..b3649a4 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -541,12 +541,38 @@ static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
     return &acb->common;
 }
 
+static int64_t quorum_getlength(BlockDriverState *bs)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int64_t result;
+    int i;
+
+    /* check that all file have the same length */
+    result = bdrv_getlength(&s->bs[0]);
+    if (result < 0) {
+        return result;
+    }
+    for (i = 1; i < s->total; i++) {
+        int64_t value = bdrv_getlength(&s->bs[i]);
+        if (value < 0) {
+            return value;
+        }
+        if (value != result) {
+            return -EIO;
+        }
+    }
+
+    return result;
+}
+
 static BlockDriver bdrv_quorum = {
     .format_name        = "quorum",
     .protocol_name      = "quorum",
 
     .instance_size      = sizeof(BDRVQuorumState),
 
+    .bdrv_getlength     = quorum_getlength,
+
     .bdrv_aio_readv     = quorum_aio_readv,
     .bdrv_aio_writev    = quorum_aio_writev,
 };
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 08/11] quorum: Add quorum_invalidate_cache().
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
                   ` (6 preceding siblings ...)
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 07/11] quorum: Add quorum_getlength() Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 09/11] quorum: Add quorum_co_get_block_status Benoît Canet
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/quorum.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index b3649a4..223ad22 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -565,6 +565,16 @@ static int64_t quorum_getlength(BlockDriverState *bs)
     return result;
 }
 
+static void quorum_invalidate_cache(BlockDriverState *bs)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int i;
+
+    for (i = 0; i < s->total; i++) {
+        bdrv_invalidate_cache(&s->bs[i]);
+    }
+}
+
 static BlockDriver bdrv_quorum = {
     .format_name        = "quorum",
     .protocol_name      = "quorum",
@@ -575,6 +585,7 @@ static BlockDriver bdrv_quorum = {
 
     .bdrv_aio_readv     = quorum_aio_readv,
     .bdrv_aio_writev    = quorum_aio_writev,
+    .bdrv_invalidate_cache = quorum_invalidate_cache,
 };
 
 static void bdrv_quorum_init(void)
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 09/11] quorum: Add quorum_co_get_block_status.
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
                   ` (7 preceding siblings ...)
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 08/11] quorum: Add quorum_invalidate_cache() Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-04 14:51   ` Max Reitz
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 10/11] quorum: Add quorum_co_flush() Benoît Canet
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/quorum.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index 223ad22..084d030 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -171,6 +171,22 @@ static int quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
     return memcmp(a->h, b->h, HASH_LENGTH);
 }
 
+static int quorum_64bits_compare(QuorumVoteValue *a, QuorumVoteValue *b)
+{
+    int64_t i = a->l;
+    int64_t j = b->l;
+
+    if (i < j) {
+        return -1;
+    }
+
+    if (i > j) {
+        return 1;
+    }
+
+    return 0;
+}
+
 static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
                                    BlockDriverState *bs,
                                    QEMUIOVector *qiov,
@@ -575,6 +591,56 @@ static void quorum_invalidate_cache(BlockDriverState *bs)
     }
 }
 
+static int64_t coroutine_fn quorum_co_get_block_status(BlockDriverState *bs,
+                                                       int64_t sector_num,
+                                                       int nb_sectors,
+                                                       int *pnum)
+{
+    BDRVQuorumState *s = bs->opaque;
+    QuorumVoteVersion *winner = NULL;
+    QuorumVotes result_votes, num_votes;
+    QuorumVoteValue result_value, num_value;
+    int i, num;
+    int64_t result = 0;
+
+    QLIST_INIT(&result_votes.vote_list);
+    QLIST_INIT(&num_votes.vote_list);
+    result_votes.compare = quorum_64bits_compare;
+    num_votes.compare = quorum_64bits_compare;
+
+    for (i = 0; i < s->total; i++) {
+        result = bdrv_get_block_status(&s->bs[i], sector_num, nb_sectors, &num);
+        /* skip failed requests */
+        if (result < 0) {
+            continue;
+        }
+        result_value.l = result & BDRV_BLOCK_DATA;
+        num_value.l = num;
+        quorum_count_vote(&result_votes, &result_value, i);
+        quorum_count_vote(&num_votes, &num_value, i);
+    }
+
+    winner = quorum_get_vote_winner(&result_votes);
+    result = winner->value.l;
+    if (winner->vote_count < s->threshold) {
+        result = -ERANGE;
+        goto free_exit;
+    }
+
+    winner = quorum_get_vote_winner(&num_votes);
+    if (winner->vote_count < s->threshold) {
+        result = -ERANGE;
+        goto free_exit;
+    }
+    *pnum = winner->value.l;
+
+free_exit:
+    quorum_free_vote_list(&result_votes);
+    quorum_free_vote_list(&num_votes);
+
+    return result;
+}
+
 static BlockDriver bdrv_quorum = {
     .format_name        = "quorum",
     .protocol_name      = "quorum",
@@ -586,6 +652,7 @@ static BlockDriver bdrv_quorum = {
     .bdrv_aio_readv     = quorum_aio_readv,
     .bdrv_aio_writev    = quorum_aio_writev,
     .bdrv_invalidate_cache = quorum_invalidate_cache,
+    .bdrv_co_get_block_status = quorum_co_get_block_status,
 };
 
 static void bdrv_quorum_init(void)
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 10/11] quorum: Add quorum_co_flush().
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
                   ` (8 preceding siblings ...)
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 09/11] quorum: Add quorum_co_get_block_status Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-04 14:53   ` Max Reitz
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 11/11] quorum: Add quorum_open() and quorum_close() Benoît Canet
  2013-10-04 14:31 ` [Qemu-devel] [PATCH V9 00/11] Quorum block driver Max Reitz
  11 siblings, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Makes a vote to select error if any.

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/quorum.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index 084d030..0a28ab1 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -641,12 +641,46 @@ free_exit:
     return result;
 }
 
+static coroutine_fn int quorum_co_flush(BlockDriverState *bs)
+{
+    BDRVQuorumState *s = bs->opaque;
+    QuorumVoteVersion *winner = NULL;
+    QuorumVotes error_votes;
+    QuorumVoteValue result_value;
+    int i;
+    int result = 0;
+    bool error = false;
+
+    QLIST_INIT(&error_votes.vote_list);
+    error_votes.compare = quorum_64bits_compare;
+
+    for (i = 0; i < s->total; i++) {
+        result = bdrv_co_flush(&s->bs[i]);
+        if (result) {
+            error = true;
+            result_value.l = result;
+            quorum_count_vote(&error_votes, &result_value, i);
+        }
+    }
+
+    if (error) {
+        winner = quorum_get_vote_winner(&error_votes);
+        result = winner->value.l;
+    }
+
+    quorum_free_vote_list(&error_votes);
+
+    return result;
+}
+
 static BlockDriver bdrv_quorum = {
     .format_name        = "quorum",
     .protocol_name      = "quorum",
 
     .instance_size      = sizeof(BDRVQuorumState),
 
+    .bdrv_co_flush_to_disk = quorum_co_flush,
+
     .bdrv_getlength     = quorum_getlength,
 
     .bdrv_aio_readv     = quorum_aio_readv,
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH V9 11/11] quorum: Add quorum_open() and quorum_close().
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
                   ` (9 preceding siblings ...)
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 10/11] quorum: Add quorum_co_flush() Benoît Canet
@ 2013-10-02 12:39 ` Benoît Canet
  2013-10-04 15:14   ` Max Reitz
  2013-10-04 14:31 ` [Qemu-devel] [PATCH V9 00/11] Quorum block driver Max Reitz
  11 siblings, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-02 12:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Benoît Canet, stefanha

Example of command line:
-drive if=virtio,file.driver=quorum,\
file.children.0.file.filename=1.qcow2,\
file.children.1.file.filename=2.qcow2,\
file.children.2.file.filename=3.qcow2,\
file.vote_threshold=3

file.blkverify=on with file.vote_threshold=2 and two file can be passed to
emulated blkverify.

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block/quorum.c | 338 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 338 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index 0a28ab1..3cfc67e 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -17,8 +17,13 @@
 #include <gnutls/crypto.h>
 #include "block/block_int.h"
 #include "qapi/qmp/qjson.h"
+#include "qapi/qmp/types.h"
+#include "qemu-common.h"
 
 #define HASH_LENGTH 32
+#define KEY_PREFIX "children."
+#define KEY_SUFFIX "."
+#define KEY_FILENAME_SUFFIX ".file.filename"
 
 /* This union hold a vote hash value */
 typedef union QuorumVoteValue {
@@ -673,12 +678,342 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs)
     return result;
 }
 
+static int quorum_match_key(const char *key,
+                            char **key_prefix,
+                            bool clone_key_prefix)
+{
+    const char *start;
+    char *next;
+    unsigned long long idx;
+    int ret;
+
+    *key_prefix = NULL;
+
+    /* the following code is a translation of the following pseudo code:
+     *       match = key.match('(^children\.(\d+)\.)file\.filename)
+     *       if not match:
+     *           return -1;
+     *       key_prefix = match.outer_match()
+     *       idx = match.inner_match()
+     *
+     * note: we also match the .file suffix to avoid futher checkings
+     */
+
+    /* this key is not a children */
+    if (strncmp(key, KEY_PREFIX, strlen(KEY_PREFIX))) {
+        return -1;
+    }
+
+    /* take first char after prefix */
+    start = key + strlen(KEY_PREFIX);
+
+    /* if the string end here -> scan fail */
+    if (start[0] == '\0') {
+        return -1;
+    }
+
+    /* try to retrieve the index */
+    ret = parse_uint(start, &idx, &next, 10);
+
+    /* no int found -> scan fail */
+    if (ret < 0) {
+        return -1;
+    }
+
+    /* match the ".file.filename" suffix to avoid matching the same idx
+     * multiple times and be required to do more checks later
+     */
+    if (strncmp(next, KEY_FILENAME_SUFFIX, strlen(KEY_FILENAME_SUFFIX))) {
+        return -1;
+    }
+
+    if (clone_key_prefix) {
+        /* include '.' */
+        int len = next + strlen(KEY_SUFFIX) - key;
+        *key_prefix = g_strndup(key, len);
+    }
+
+    return idx;
+}
+
+static QDict *quorum_get_children_idx(const QDict *options)
+{
+    const QDictEntry *ent;
+    QDict *result;
+    char *key_prefix;
+    int idx;
+
+    result = qdict_new();
+
+    for (ent = qdict_first(options); ent; ent = qdict_next(options, ent)) {
+        const char *key = qdict_entry_key(ent);
+        idx = quorum_match_key(key, &key_prefix, true);
+
+        /* if the result is -1 or any negative index value skip this key */
+        if (idx < 0) {
+            continue;
+        }
+
+        qdict_put(result, key_prefix, qint_from_int(idx));
+    }
+
+    return result;
+}
+
+static int quorum_fill_validation_array(bool *array,
+                                        const QDict *dict,
+                                        int total,
+                                        Error **errp)
+{
+    const QDictEntry *ent;
+
+    /* fill the checking array with children indexes */
+    for (ent = qdict_first(dict); ent; ent = qdict_next(dict, ent)) {
+        const char *key = qdict_entry_key(ent);
+        int idx = qdict_get_int(dict, key);
+
+        if (idx < 0 || idx >= total) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "Children index must be between 0 and children count -1");
+            return -ERANGE;
+        }
+
+        array[idx] = true;
+    }
+
+    return 0;
+}
+
+static int quorum_valid_indexes(const bool *array, int total, Error **errp)
+{
+    int i;
+
+    for (i = 0; i < total; i++) {
+        if (array[i] == true) {
+            continue;
+        }
+
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "All children indexes between 0 and children count -1  must "
+                  " be used");
+        return -ERANGE;
+    }
+
+    return 0;
+}
+
+static int quorum_valid_children_indexes(const QDict *dict,
+                                         int total,
+                                         Error **errp)
+{
+    bool *array;
+    int ret = 0;;
+
+    /* allocate indexes checking array and put false in it */
+    array = g_new0(bool, total);
+
+    ret = quorum_fill_validation_array(array, dict, total, errp);
+    if (ret < 0) {
+        goto free_exit;
+    }
+
+    ret = quorum_valid_indexes(array, total, errp);
+free_exit:
+    g_free(array);
+    return ret;
+}
+
+static int quorum_valid_threshold(int threshold,
+                                  int total,
+                                  Error **errp)
+{
+
+    if (threshold < 1) {
+        error_set(errp, QERR_INVALID_PARAMETER_VALUE,
+                  "vote_threshold", "value >= 1");
+        return -ERANGE;
+    }
+
+    if (threshold > total) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "threshold <= children count must be true");
+        return -ERANGE;
+    }
+
+    return 0;
+}
+
+static int quorum_bdrv_open_sub(BlockDriverState *bs,
+                                QDict *bs_dict,
+                                int flags,
+                                Error **errp)
+{
+    const char *format_name = NULL;
+    const char *filename = NULL;
+    BlockDriver *drv = NULL;
+    int ret = 0;
+
+    format_name = qdict_get_try_str(bs_dict, "file.driver");
+
+    if (!format_name) {
+        filename = qdict_get_try_str(bs_dict, "file.filename");
+        qdict_del(bs_dict, "file.filename");
+    } else {
+        drv = bdrv_find_format(format_name);
+        if (!drv) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "invalid format name");
+            return -EINVAL;
+        }
+    }
+
+    QINCREF(bs_dict);
+    ret = bdrv_open(bs,
+                    filename,
+                    bs_dict,
+                    flags,
+                    drv,
+                    errp);
+    QDECREF(bs_dict);
+    return ret;
+}
+
+static int quorum_open(BlockDriverState *bs,
+                       QDict *options,
+                       int flags,
+                       Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    Error *local_err = NULL;
+    const QDictEntry *ent;
+    QDict *idx_dict;
+    bool *opened;
+    const char *value;
+    char *next;
+    int i;
+    int ret = 0;
+
+    /* get a dict of children indexes for validation */
+    idx_dict = quorum_get_children_idx(options);
+
+    /* count how many different children indexes are present and validate */
+    s->total = qdict_size(idx_dict);
+    if (s->total < 2) {
+        error_set(&local_err, ERROR_CLASS_GENERIC_ERROR,
+                  "Number of provided children must be greater than 1");
+        ret = -EINVAL;
+        goto exit;
+    }
+
+    /* validate that the set of index is coherent */
+    ret = quorum_valid_children_indexes(idx_dict, s->total, &local_err);
+    if (ret < 0) {
+        goto exit;
+    }
+
+    /* retrieve the threshold option from the command line */
+    value = qdict_get_try_str(options, "vote_threshold");
+    if (!value) {
+        error_set(&local_err, ERROR_CLASS_GENERIC_ERROR,
+                  "vote_threshold must be provided");
+        ret = -EINVAL;
+        goto exit;
+    }
+    qdict_del(options, "vote_threshold");
+
+    ret = parse_uint(value, (unsigned long long *) &s->threshold, &next, 10);
+
+    /* no int found -> scan fail */
+    if (ret < 0) {
+        error_set(&local_err, ERROR_CLASS_GENERIC_ERROR,
+                  "invalid vote_threshold specified");
+        ret = -EINVAL;
+        goto exit;
+    }
+
+    /* and validate it againts s->total */
+    ret = quorum_valid_threshold(s->threshold, s->total, &local_err);
+    if (ret < 0) {
+        goto exit;
+    }
+
+    /* is the driver in blkverify mode */
+    value = qdict_get_try_str(options, "blkverify");
+    if (value && !strcmp(value, "on")  &&
+        s->total == 2 && s->threshold == 2) {
+        s->is_blkverify = true;
+    }
+    qdict_del(options, "blkverify");
+
+    /* allocate the children BlockDriverState array */
+    s->bs = g_new0(BlockDriverState, s->total);
+    opened = g_new0(bool, s->total);
+
+    /* open children bs */
+    for (ent = qdict_first(idx_dict);
+         ent; ent = qdict_next(idx_dict, ent)) {
+        const char *key = qdict_entry_key(ent);
+        int idx = qdict_get_int(idx_dict, key);
+        int ret = 0;
+        QDict *bs_dict; /* dict used to open child */
+        qdict_extract_subqdict(options, &bs_dict, key);
+        ret = quorum_bdrv_open_sub(&s->bs[idx],
+                                   bs_dict,
+                                   flags,
+                                   &local_err);
+        if (ret < 0) {
+            QDECREF(bs_dict);
+            goto close_exit;
+        }
+        opened[idx] = true;
+    }
+
+    g_free(opened);
+    goto exit;
+
+close_exit:
+    /* cleanup on error */
+    for (i = 0; i < s->total; i++) {
+        if (!opened[i]) {
+            continue;
+        }
+        bdrv_close(&s->bs[i]);
+    }
+    g_free(s->bs);
+    g_free(opened);
+exit:
+    /* propagate error */
+    QDECREF(idx_dict);
+    if (error_is_set(&local_err)) {
+        error_propagate(errp, local_err);
+    }
+    return ret;
+}
+
+static void quorum_close(BlockDriverState *bs)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int i;
+
+    for (i = 0; i < s->total; i++) {
+        /* Ensure writes reach stable storage */
+        bdrv_flush(&s->bs[i]);
+        /* close manually because we'll free s->bs */
+        bdrv_close(&s->bs[i]);
+    }
+
+    g_free(s->bs);
+}
+
 static BlockDriver bdrv_quorum = {
     .format_name        = "quorum",
     .protocol_name      = "quorum",
 
     .instance_size      = sizeof(BDRVQuorumState),
 
+    .bdrv_file_open     = quorum_open,
+    .bdrv_close         = quorum_close,
+
     .bdrv_co_flush_to_disk = quorum_co_flush,
 
     .bdrv_getlength     = quorum_getlength,
@@ -687,6 +1022,9 @@ static BlockDriver bdrv_quorum = {
     .bdrv_aio_writev    = quorum_aio_writev,
     .bdrv_invalidate_cache = quorum_invalidate_cache,
     .bdrv_co_get_block_status = quorum_co_get_block_status,
+
+    /* disable external snapshots while waiting for block filters */
+    .bdrv_check_ext_snapshot = bdrv_check_ext_snapshot_forbidden,
 };
 
 static void bdrv_quorum_init(void)
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 00/11] Quorum block driver
  2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
                   ` (10 preceding siblings ...)
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 11/11] quorum: Add quorum_open() and quorum_close() Benoît Canet
@ 2013-10-04 14:31 ` Max Reitz
  2013-10-04 22:25   ` Benoît Canet
  2013-10-29  7:59   ` Benoît Canet
  11 siblings, 2 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-04 14:31 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

On 2013-10-02 14:39, Benoît Canet wrote:
> It must be applied on top of "block: Add BlockDriver.bdrv_check_ext_snapshot."
>
> This patchset create a block driver implementing a quorum using total qemu disk
> images. Writes are mirrored on the $total files.
> For the reading part the $total files are read at the same time and a vote is
> done to determine if a qiov version is present $threshold or more times. It then
> return this majority version to the upper layers.
> When i < $threshold versions of the data are returned by the lower layer the
> quorum is broken and the read return -EIO.
>
> The goal of this patchset is to be turned in a QEMU block filter living just
> above raw-*.c and below qcow2/qed when the required infrastructure will be done.
>
> Main use of this feature will be people using NFS appliances which can be
> subjected to bitflip errors.
>
> This patchset can be used to replace blkverify and the out of tree blkmirror.
>
> v9: integrate latests comments from kevin
>
> Benoît Canet (11):
>    quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB.
>    quorum: Create BDRVQuorumState and BlkDriver and do init.
>    quorum: Add quorum_aio_writev and its dependencies.
>    blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from
>      blkverify.
>    quorum: Add quorum_aio_readv.
>    quorum: Add quorum mechanism.
>    quorum: Add quorum_getlength().
>    quorum: Add quorum_invalidate_cache().
>    quorum: Add quorum_co_get_block_status.
>    quorum: Add quorum_co_flush().
>    quorum: Add quorum_open() and quorum_close().
>
>   block/Makefile.objs       |    1 +
>   block/blkverify.c         |  108 +----
>   block/quorum.c            | 1035 +++++++++++++++++++++++++++++++++++++++++++++
>   configure                 |   36 ++
>   include/monitor/monitor.h |    2 +
>   include/qemu-common.h     |    2 +
>   monitor.c                 |    2 +
>   util/iov.c                |  103 +++++
>   8 files changed, 1183 insertions(+), 106 deletions(-)
>   create mode 100644 block/quorum.c
>

Could you add a qemu-iotest if possible?

Max

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 01/11] quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB.
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 01/11] quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB Benoît Canet
@ 2013-10-04 14:33   ` Max Reitz
  0 siblings, 0 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-04 14:33 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

On 2013-10-02 14:39, Benoît Canet wrote:
> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
>   block/Makefile.objs |  1 +
>   block/quorum.c      | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 55 insertions(+)
>   create mode 100644 block/quorum.c
>
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 3bb85b5..05a65c2 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -3,6 +3,7 @@ block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-c
>   block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>   block-obj-y += qed-check.o
>   block-obj-y += vhdx.o
> +block-obj-y += quorum.o
>   block-obj-y += parallels.o blkdebug.o blkverify.o
>   block-obj-y += snapshot.o qapi.o
>   block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
> diff --git a/block/quorum.c b/block/quorum.c
> new file mode 100644
> index 0000000..76a1fbb
> --- /dev/null
> +++ b/block/quorum.c
> @@ -0,0 +1,54 @@
> +/*
> + * Quorum Block filter
> + *
> + * Copyright (C) 2012-2013 Nodalink, SARL.
> + *
> + * Author:
> + *   Benoît Canet <benoit.canet@irqsave.net>
> + *
> + * Based on the design and code of blkverify.c (Copyright (C) 2010 IBM, Corp)
> + * and blkmirror.c (Copyright (C) 2011 Red Hat, Inc).
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "block/block_int.h"
> +
> +typedef struct QuorumAIOCB QuorumAIOCB;
> +
> +/* Quorum will create one instance of the following structure per read/write
> + * operation it do on it's children.
I'd vote for "operation it performs on its children" or something like that.

Max

> + * So for each read/write operation coming from the upper layer there will be
> + * $children_count QuorumSingleAIOCB.
> + */
> +typedef struct QuorumSingleAIOCB {
> +    BlockDriverAIOCB *aiocb;
> +    QEMUIOVector qiov;
> +    uint8_t *buf;
> +    int ret;
> +    QuorumAIOCB *parent;
> +} QuorumSingleAIOCB;
> +
> +/* Quorum will use the following structure to track progress of each read/write
> + * operation received by the upper layer.
> + * This structure hold pointers to the QuorumSingleAIOCB structures instances
> + * used to do operations on each children and track overall progress.
> + */
> +struct QuorumAIOCB {
> +    BlockDriverAIOCB common;
> +
> +    /* Request metadata */
> +    uint64_t sector_num;
> +    int nb_sectors;
> +
> +    QEMUIOVector *qiov;         /* calling IOV */
> +
> +    QuorumSingleAIOCB *aios;    /* individual AIOs */
> +    int count;                  /* number of completed AIOCB */
> +    int success_count;          /* number of successfully completed AIOCB */
> +    bool *finished;             /* completion signal for cancel */
> +
> +    bool is_read;
> +    int vote_ret;
> +};

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 02/11] quorum: Create BDRVQuorumState and BlkDriver and do init.
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 02/11] quorum: Create BDRVQuorumState and BlkDriver and do init Benoît Canet
@ 2013-10-04 14:34   ` Max Reitz
  0 siblings, 0 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-04 14:34 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

On 2013-10-02 14:39, Benoît Canet wrote:
> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
>   block/quorum.c | 25 +++++++++++++++++++++++++
>   1 file changed, 25 insertions(+)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index 76a1fbb..9557e61 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -15,6 +15,16 @@
>   
>   #include "block/block_int.h"
>   
> +/* the following structure hold the state of one quorum instance */
s/hold/holds/

> +typedef struct {
> +    BlockDriverState *bs; /* children BlockDriverStates */
> +    int total;            /* children count */
> +    int threshold;        /* if less than threshold children reads gave the same
> +                           * result a quorum error occur.
s/occur/occurs/

Max

> +                           */
> +    bool is_blkverify;    /* true if the driver is in blkverify mode */
> +} BDRVQuorumState;
> +
>   typedef struct QuorumAIOCB QuorumAIOCB;
>   
>   /* Quorum will create one instance of the following structure per read/write
> @@ -37,6 +47,7 @@ typedef struct QuorumSingleAIOCB {
>    */
>   struct QuorumAIOCB {
>       BlockDriverAIOCB common;
> +    BDRVQuorumState *bqs;
>   
>       /* Request metadata */
>       uint64_t sector_num;
> @@ -52,3 +63,17 @@ struct QuorumAIOCB {
>       bool is_read;
>       int vote_ret;
>   };
> +
> +static BlockDriver bdrv_quorum = {
> +    .format_name        = "quorum",
> +    .protocol_name      = "quorum",
> +
> +    .instance_size      = sizeof(BDRVQuorumState),
> +};
> +
> +static void bdrv_quorum_init(void)
> +{
> +    bdrv_register(&bdrv_quorum);
> +}
> +
> +block_init(bdrv_quorum_init);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 03/11] quorum: Add quorum_aio_writev and its dependencies.
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 03/11] quorum: Add quorum_aio_writev and its dependencies Benoît Canet
@ 2013-10-04 14:35   ` Max Reitz
  2013-10-28 12:21     ` Benoît Canet
  0 siblings, 1 reply; 36+ messages in thread
From: Max Reitz @ 2013-10-04 14:35 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

On 2013-10-02 14:39, Benoît Canet wrote:
> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
>   block/quorum.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 123 insertions(+)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index 9557e61..b49e3c6 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -64,11 +64,134 @@ struct QuorumAIOCB {
>       int vote_ret;
>   };
>   
> +static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
> +{
> +    QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
> +    bool finished = false;
> +
> +    /* Wait for the request to finish */
> +    acb->finished = &finished;
> +    while (!finished) {
> +        qemu_aio_wait();
> +    }
Hm, wouldn't it be better to pass the cancel to the children?

Max

> +}
> +
> +static AIOCBInfo quorum_aiocb_info = {
> +    .aiocb_size         = sizeof(QuorumAIOCB),
> +    .cancel             = quorum_aio_cancel,
> +};
> +
> +/* return the first error code get by each individual callbacks */
> +static int quorum_get_first_error(QuorumAIOCB *acb)
> +{
> +    BDRVQuorumState *s = acb->bqs;
> +    int i, ret = 0;
> +
> +    for (i = 0; i < s->total; i++) {
> +        ret = acb->aios[i].ret;
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    /* should not pass here */
> +    assert(false);
> +}
> +
> +static void quorum_aio_finalize(QuorumAIOCB *acb)
> +{
> +    BDRVQuorumState *s = acb->bqs;
> +    int ret;
> +
> +    ret = s->threshold <= acb->success_count ? 0 : quorum_get_first_error(acb);
> +
> +    acb->common.cb(acb->common.opaque, ret);
> +    if (acb->finished) {
> +        *acb->finished = true;
> +    }
> +    g_free(acb->aios);
> +    qemu_aio_release(acb);
> +}
> +
> +static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
> +                                   BlockDriverState *bs,
> +                                   QEMUIOVector *qiov,
> +                                   uint64_t sector_num,
> +                                   int nb_sectors,
> +                                   BlockDriverCompletionFunc *cb,
> +                                   void *opaque)
> +{
> +    QuorumAIOCB *acb = qemu_aio_get(&quorum_aiocb_info, bs, cb, opaque);
> +    int i;
> +
> +    acb->bqs = s;
> +    acb->sector_num = sector_num;
> +    acb->nb_sectors = nb_sectors;
> +    acb->qiov = qiov;
> +    acb->aios = g_new0(QuorumSingleAIOCB, s->total);
> +    acb->count = 0;
> +    acb->success_count = 0;
> +    acb->finished = NULL;
> +    acb->is_read = false;
> +    acb->vote_ret = 0;
> +
> +    for (i = 0; i < s->total; i++) {
> +        acb->aios[i].buf = NULL;
> +        acb->aios[i].ret = 0;
> +        acb->aios[i].parent = acb;
> +    }
> +
> +    return acb;
> +}
> +
> +static void quorum_aio_cb(void *opaque, int ret)
> +{
> +    QuorumSingleAIOCB *sacb = opaque;
> +    QuorumAIOCB *acb = sacb->parent;
> +    BDRVQuorumState *s = acb->bqs;
> +
> +    sacb->ret = ret;
> +    acb->count++;
> +    if (ret == 0) {
> +        acb->success_count++;
> +    }
> +    assert(acb->count <= s->total);
> +    assert(acb->success_count <= s->total);
> +    if (acb->count < s->total) {
> +        return;
> +    }
> +
> +    quorum_aio_finalize(acb);
> +}
> +
> +static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
> +                                          int64_t sector_num,
> +                                          QEMUIOVector *qiov,
> +                                          int nb_sectors,
> +                                          BlockDriverCompletionFunc *cb,
> +                                          void *opaque)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num, nb_sectors,
> +                                      cb, opaque);
> +    int i;
> +
> +    for (i = 0; i < s->total; i++) {
> +        acb->aios[i].aiocb = bdrv_aio_writev(&s->bs[i], sector_num, qiov,
> +                                             nb_sectors, &quorum_aio_cb,
> +                                             &acb->aios[i]);
> +    }
> +
> +    return &acb->common;
> +}
> +
>   static BlockDriver bdrv_quorum = {
>       .format_name        = "quorum",
>       .protocol_name      = "quorum",
>   
>       .instance_size      = sizeof(BDRVQuorumState),
> +
> +    .bdrv_aio_writev    = quorum_aio_writev,
>   };
>   
>   static void bdrv_quorum_init(void)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 05/11] quorum: Add quorum_aio_readv.
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 05/11] quorum: Add quorum_aio_readv Benoît Canet
@ 2013-10-04 14:38   ` Max Reitz
  0 siblings, 0 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-04 14:38 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

On 2013-10-02 14:39, Benoît Canet wrote:
> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
>   block/quorum.c | 40 +++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index b49e3c6..f0fc0e9 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -101,14 +101,23 @@ static int quorum_get_first_error(QuorumAIOCB *acb)
>   static void quorum_aio_finalize(QuorumAIOCB *acb)
>   {
>       BDRVQuorumState *s = acb->bqs;
> -    int ret;
> +    int i, ret;
>   
>       ret = s->threshold <= acb->success_count ? 0 : quorum_get_first_error(acb);
>   
> +    for (i = 0; i < s->total; i++) {
> +        qemu_vfree(acb->aios[i].buf);
> +        acb->aios[i].buf = NULL;
> +        acb->aios[i].ret = 0;
> +    }
> +
>       acb->common.cb(acb->common.opaque, ret);
>       if (acb->finished) {
>           *acb->finished = true;
>       }
> +    for (i = 0; i < s->total; i++) {
> +        qemu_iovec_destroy(&acb->aios[i].qiov);
As far as I see it, these fields are not really initialized for quorum 
writes. Should we check for is_read before destroying them? (Although it 
doesn't seem to hurt this way either)

Max

> +    }
>       g_free(acb->aios);
>       qemu_aio_release(acb);
>   }
> @@ -164,6 +173,34 @@ static void quorum_aio_cb(void *opaque, int ret)
>       quorum_aio_finalize(acb);
>   }
>   
> +static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> +                                         int64_t sector_num,
> +                                         QEMUIOVector *qiov,
> +                                         int nb_sectors,
> +                                         BlockDriverCompletionFunc *cb,
> +                                         void *opaque)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> +                                      nb_sectors, cb, opaque);
> +    int i;
> +
> +    acb->is_read = true;
> +
> +    for (i = 0; i < s->total; i++) {
> +        acb->aios[i].buf = qemu_blockalign(bs->file, qiov->size);
> +        qemu_iovec_init(&acb->aios[i].qiov, qiov->niov);
> +        qemu_iovec_clone(&acb->aios[i].qiov, qiov, acb->aios[i].buf);
> +    }
> +
> +    for (i = 0; i < s->total; i++) {
> +        bdrv_aio_readv(&s->bs[i], sector_num, qiov, nb_sectors,
> +                       quorum_aio_cb, &acb->aios[i]);
> +    }
> +
> +    return &acb->common;
> +}
> +
>   static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
>                                             int64_t sector_num,
>                                             QEMUIOVector *qiov,
> @@ -191,6 +228,7 @@ static BlockDriver bdrv_quorum = {
>   
>       .instance_size      = sizeof(BDRVQuorumState),
>   
> +    .bdrv_aio_readv     = quorum_aio_readv,
>       .bdrv_aio_writev    = quorum_aio_writev,
>   };
>   

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 06/11] quorum: Add quorum mechanism.
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 06/11] quorum: Add quorum mechanism Benoît Canet
@ 2013-10-04 14:48   ` Max Reitz
  2013-10-28 12:31     ` Benoît Canet
  2013-10-28 13:04     ` Benoît Canet
  0 siblings, 2 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-04 14:48 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

On 2013-10-02 14:39, Benoît Canet wrote:
> Use gnutls's SHA-256 to compare versions.
Wouldn't CRC32 suffice? (I don't really oppose using SHA, but taking in 
gnutls as a dependency just for comparing several memory areas seems a 
bit much to me)

> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
>   block/Makefile.objs       |   2 +-
>   block/quorum.c            | 321 +++++++++++++++++++++++++++++++++++++++++++++-
>   configure                 |  36 ++++++
>   include/monitor/monitor.h |   2 +
>   monitor.c                 |   2 +
>   5 files changed, 361 insertions(+), 2 deletions(-)
>
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 05a65c2..adcdc21 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -3,7 +3,7 @@ block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-c
>   block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>   block-obj-y += qed-check.o
>   block-obj-y += vhdx.o
> -block-obj-y += quorum.o
> +block-obj-$(CONFIG_QUORUM) += quorum.o
>   block-obj-y += parallels.o blkdebug.o blkverify.o
>   block-obj-y += snapshot.o qapi.o
>   block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
> diff --git a/block/quorum.c b/block/quorum.c
> index f0fc0e9..e235ac1 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -13,7 +13,43 @@
>    * See the COPYING file in the top-level directory.
>    */
>   
> +#include <gnutls/gnutls.h>
> +#include <gnutls/crypto.h>
>   #include "block/block_int.h"
> +#include "qapi/qmp/qjson.h"
> +
> +#define HASH_LENGTH 32
> +
> +/* This union hold a vote hash value */
*holds

> +typedef union QuorumVoteValue {
> +    char h[HASH_LENGTH];       /* SHA-256 hash */
> +    int64_t l;                 /* simpler 64 bits hash */
> +} QuorumVoteValue;
> +
> +/* A vote item */
> +typedef struct QuorumVoteItem {
> +    int index;
> +    QLIST_ENTRY(QuorumVoteItem) next;
> +} QuorumVoteItem;
> +
> +/* this structure is a vote version. A version is the set of vote sharing the
*set of votes

> + * same vote value.
> + * The set of vote will be tracked with the items field and it's count is
*set of votes or *vote set; also s/it's count/its cardinality/ or 
something like that

> + * vote_count.
> + */
> +typedef struct QuorumVoteVersion {
> +    QuorumVoteValue value;
> +    int index;
> +    int vote_count;
> +    QLIST_HEAD(, QuorumVoteItem) items;
> +    QLIST_ENTRY(QuorumVoteVersion) next;
> +} QuorumVoteVersion;
> +
> +/* this structure hold a group of vote versions together */
*holds

> +typedef struct QuorumVotes {
> +    QLIST_HEAD(, QuorumVoteVersion) vote_list;
> +    int (*compare)(QuorumVoteValue *a, QuorumVoteValue *b);
> +} QuorumVotes;
>   
>   /* the following structure hold the state of one quorum instance */
>   typedef struct {
> @@ -60,10 +96,14 @@ struct QuorumAIOCB {
>       int success_count;          /* number of successfully completed AIOCB */
>       bool *finished;             /* completion signal for cancel */
>   
> +    QuorumVotes votes;
> +
>       bool is_read;
>       int vote_ret;
>   };
>   
> +static void quorum_vote(QuorumAIOCB *acb);
> +
>   static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
>   {
>       QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
> @@ -111,6 +151,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
>           acb->aios[i].ret = 0;
>       }
>   
> +    if (acb->vote_ret) {
> +        ret = acb->vote_ret;
> +    }
> +
>       acb->common.cb(acb->common.opaque, ret);
>       if (acb->finished) {
>           *acb->finished = true;
> @@ -122,6 +166,11 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
>       qemu_aio_release(acb);
>   }
>   
> +static int quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
> +{
> +    return memcmp(a->h, b->h, HASH_LENGTH);
> +}
> +
>   static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
>                                      BlockDriverState *bs,
>                                      QEMUIOVector *qiov,
> @@ -141,6 +190,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
>       acb->count = 0;
>       acb->success_count = 0;
>       acb->finished = NULL;
> +    acb->votes.compare = quorum_sha256_compare;
>       acb->is_read = false;
>       acb->vote_ret = 0;
>   
> @@ -170,9 +220,278 @@ static void quorum_aio_cb(void *opaque, int ret)
>           return;
>       }
>   
> +    /* Do the vote on read */
> +    if (acb->is_read) {
> +        quorum_vote(acb);
> +    }
> +
>       quorum_aio_finalize(acb);
>   }
>   
> +static void quorum_report_bad(QuorumAIOCB *acb, int index)
> +{
> +    QObject *data;
> +    data = qobject_from_jsonf("{ 'children-index': %i"
I'd prefer child-index. Generally, remember that the singular of 
"children" is "child".

> +                              ", 'sector-num': %" PRId64
> +                              ", 'sectors-count': %i }",
> +                              index,
> +                              acb->sector_num,
> +                              acb->nb_sectors);
> +    monitor_protocol_event(QEVENT_QUORUM_REPORT_BAD, data);
How about decrementing the refcount of data?

> +}
> +
> +static void quorum_report_failure(QuorumAIOCB *acb)
> +{
> +    QObject *data;
> +    data = qobject_from_jsonf("{ 'sector-num': %" PRId64
> +                              ", 'sectors-count': %i }",
> +                              acb->sector_num,
> +                              acb->nb_sectors);
> +    monitor_protocol_event(QEVENT_QUORUM_FAILURE, data);
Same here.

> +}
> +
> +static void quorum_report_bad_versions(QuorumAIOCB *acb,
> +                                      QuorumVoteValue *value)
> +{
> +    QuorumVoteVersion *version;
> +    QuorumVoteItem *item;
> +
> +    QLIST_FOREACH(version, &acb->votes.vote_list, next) {
> +        if (!acb->votes.compare(&version->value, value)) {
> +            continue;
> +        }
> +        QLIST_FOREACH(item, &version->items, next) {
> +            quorum_report_bad(acb, item->index);
> +        }
> +    }
> +}
> +
> +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> +{
> +    int i;
> +    assert(dest->niov == source->niov);
> +    assert(dest->size == source->size);
> +    for (i = 0; i < source->niov; i++) {
> +        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> +        memcpy(dest->iov[i].iov_base,
> +               source->iov[i].iov_base,
> +               source->iov[i].iov_len);
> +    }
> +}
> +
> +static void quorum_count_vote(QuorumVotes *votes,
> +                              QuorumVoteValue *value,
> +                              int index)
> +{
> +    QuorumVoteVersion *v = NULL, *version = NULL;
> +    QuorumVoteItem *item;
> +
> +    /* look if we have something with this hash */
> +    QLIST_FOREACH(v, &votes->vote_list, next) {
> +        if (!votes->compare(&v->value, value)) {
> +            version = v;
> +            break;
> +        }
> +    }
> +
> +    /* It's a version not yet in the list add it */
> +    if (!version) {
> +        version = g_new0(QuorumVoteVersion, 1);
> +        QLIST_INIT(&version->items);
> +        memcpy(&version->value, value, sizeof(version->value));
> +        version->index = index;
> +        version->vote_count = 0;
> +        QLIST_INSERT_HEAD(&votes->vote_list, version, next);
> +    }
> +
> +    version->vote_count++;
> +
> +    item = g_new0(QuorumVoteItem, 1);
> +    item->index = index;
> +    QLIST_INSERT_HEAD(&version->items, item, next);
> +}
> +
> +static void quorum_free_vote_list(QuorumVotes *votes)
> +{
> +    QuorumVoteVersion *version, *next_version;
> +    QuorumVoteItem *item, *next_item;
> +
> +    QLIST_FOREACH_SAFE(version, &votes->vote_list, next, next_version) {
> +        QLIST_REMOVE(version, next);
> +        QLIST_FOREACH_SAFE(item, &version->items, next, next_item) {
> +            QLIST_REMOVE(item, next);
> +            g_free(item);
> +        }
> +        g_free(version);
> +    }
> +}
> +
> +static int quorum_compute_hash(QuorumAIOCB *acb, int i, QuorumVoteValue *hash)
> +{
> +    int j, ret;
> +    gnutls_hash_hd_t dig;
> +    QEMUIOVector *qiov = &acb->aios[i].qiov;
> +
> +    ret = gnutls_hash_init(&dig, GNUTLS_DIG_SHA256);
> +
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    for (j = 0; j < qiov->niov; j++) {
> +        ret = gnutls_hash(dig, qiov->iov[j].iov_base, qiov->iov[j].iov_len);
> +        if (ret < 0) {
> +            return ret;
gnutls_hash_deinit?

> +        }
> +    }
> +
> +    gnutls_hash_deinit(dig, (void *) hash);
> +
> +    return 0;
> +}
> +
> +static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
> +{
> +    int i = 0;
> +    QuorumVoteVersion *candidate, *winner = NULL;
> +
> +    QLIST_FOREACH(candidate, &votes->vote_list, next) {
> +        if (candidate->vote_count > i) {
> +            i = candidate->vote_count;
> +            winner = candidate;
> +        }
> +    }
> +
> +    return winner;
> +}
> +
> +static bool quorum_iovec_compare(QEMUIOVector *a, QEMUIOVector *b)
> +{
> +    int i;
> +    int result;
> +
> +    assert(a->niov == b->niov);
> +    for (i = 0; i < a->niov; i++) {
> +        assert(a->iov[i].iov_len == b->iov[i].iov_len);
> +        result = memcmp(a->iov[i].iov_base,
> +                        b->iov[i].iov_base,
> +                        a->iov[i].iov_len);
> +        if (result) {
> +            return false;
> +        }
> +    }
> +
> +    return true;
> +}
> +
> +static void GCC_FMT_ATTR(2, 3) quorum_err(QuorumAIOCB *acb,
> +                                          const char *fmt, ...)
> +{
> +    va_list ap;
> +
> +    va_start(ap, fmt);
> +    fprintf(stderr, "quorum: sector_num=%" PRId64 " nb_sectors=%d ",
> +            acb->sector_num, acb->nb_sectors);
> +    vfprintf(stderr, fmt, ap);
> +    fprintf(stderr, "\n");
> +    va_end(ap);
> +    exit(1);
> +}
> +
> +static bool quorum_compare(QuorumAIOCB *acb,
> +                           QEMUIOVector *a,
> +                           QEMUIOVector *b)
> +{
> +    BDRVQuorumState *s = acb->bqs;
> +    ssize_t offset;
> +
> +    /* This driver will replace blkverify in this particular case */
> +    if (s->is_blkverify) {
> +        offset = qemu_iovec_compare(a, b);
> +        if (offset != -1) {
> +            quorum_err(acb, "contents mismatch in sector %" PRId64,
> +                       acb->sector_num +
> +                       (uint64_t)(offset / BDRV_SECTOR_SIZE));
> +        }
> +        return true;
> +    }
> +
> +    return quorum_iovec_compare(a, b);
> +}
> +
> +static void quorum_vote(QuorumAIOCB *acb)
> +{
> +    bool quorum = true;
> +    int i, j, ret;
> +    QuorumVoteValue hash;
> +    BDRVQuorumState *s = acb->bqs;
> +    QuorumVoteVersion *winner;
> +
> +    /* get the index of the first successful read */
> +    for (i = 0; i < s->total; i++) {
> +        if (!acb->aios[i].ret) {
> +            break;
> +        }
> +    }
> +
> +    /* compare this read with all other successful read looking for quorum */
> +    for (j = i + 1; j < s->total; j++) {
> +        if (acb->aios[j].ret) {
> +            continue;
> +        }
> +        quorum = quorum_compare(acb, &acb->aios[i].qiov, &acb->aios[j].qiov);
> +        if (!quorum) {
> +            break;
> +       }
> +    }
> +
> +    /* Every successful read agrees -> Quorum */
> +    if (quorum) {
> +        quorum_copy_qiov(acb->qiov, &acb->aios[i].qiov);
If no reads were successful at all, quorum will be true and i will be 
s->total. Therefore, this will result in an array access with an index 
out of bounds here.

Generally, why is this function executed at all if any read failed? If a 
read fails, the quorum read function will return an error, thus the 
caller will ignore the result anyway.

> +        return;
> +    }
> +
> +    /* compute hashs for each successful read, also store indexes */
> +    for (i = 0; i < s->total; i++) {
> +        if (acb->aios[i].ret) {
> +            continue;
> +        }
> +        ret = quorum_compute_hash(acb, i, &hash);
> +        /* if ever the hash computation failed */
> +        if (ret < 0) {
> +            acb->vote_ret = ret;
> +            goto free_exit;
> +        }
> +        quorum_count_vote(&acb->votes, &hash, i);
> +    }
> +
> +    /* vote to select the most represented version */
> +    winner = quorum_get_vote_winner(&acb->votes);
> +    /* every vote version are differents -> error */
> +    if (!winner) {
> +        quorum_report_failure(acb);
> +        acb->vote_ret = -EIO;
> +        goto free_exit;
> +    }
> +
> +    /* if the winner count is smaller then threshold read fail */
s/then/than/, s/ read fail/, the read fails/

> +    if (winner->vote_count < s->threshold) {
> +        quorum_report_failure(acb);
> +        acb->vote_ret = -EIO;
> +        goto free_exit;
> +    }
> +
> +    /* we have a winner: copy it */
> +    quorum_copy_qiov(acb->qiov, &acb->aios[winner->index].qiov);
> +
> +    /* some versions are bad print them */
> +    quorum_report_bad_versions(acb, &winner->value);
> +
> +free_exit:
> +    /* free lists */
> +    quorum_free_vote_list(&acb->votes);
> +}
> +
>   static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
>                                            int64_t sector_num,
>                                            QEMUIOVector *qiov,
> @@ -194,7 +513,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
>       }
>   
>       for (i = 0; i < s->total; i++) {
> -        bdrv_aio_readv(&s->bs[i], sector_num, qiov, nb_sectors,
> +        bdrv_aio_readv(&s->bs[i], sector_num, &acb->aios[i].qiov, nb_sectors,
>                          quorum_aio_cb, &acb->aios[i]);
>       }
>   
> diff --git a/configure b/configure
> index 23dbaaf..efbe7d9 100755
> --- a/configure
> +++ b/configure
> @@ -246,6 +246,7 @@ gtk=""
>   gtkabi="2.0"
>   tpm="no"
>   libssh2=""
> +quorum="no"
>   
>   # parse CC options first
>   for opt do
> @@ -972,6 +973,10 @@ for opt do
>     ;;
>     --enable-libssh2) libssh2="yes"
>     ;;
> +  --disable-quorum) quorum="no"
> +  ;;
> +  --enable-quorum) quorum="yes"
> +  ;;
>     *) echo "ERROR: unknown option $opt"; show_help="yes"
>     ;;
>     esac
> @@ -1203,6 +1208,8 @@ echo "  --gcov=GCOV              use specified gcov [$gcov_tool]"
>   echo "  --enable-tpm             enable TPM support"
>   echo "  --disable-libssh2        disable ssh block device support"
>   echo "  --enable-libssh2         enable ssh block device support"
> +echo "  --disable-quorum         disable quorum block filter support"
> +echo "  --enable-quorum          enable quorum block filter support"
>   echo ""
>   echo "NOTE: The object files are built at the place where configure is launched"
>   exit 1
> @@ -1895,6 +1902,30 @@ EOF
>   fi
>   
>   ##########################################
> +# Quorum probe (check for gnutls)
> +if test "$quorum" != "no" ; then
> +cat > $TMPC <<EOF
> +#include <gnutls/gnutls.h>
> +#include <gnutls/crypto.h>
> +int main(void) {char data[4096], digest[32];
> +gnutls_hash_fast(GNUTLS_DIG_SHA256, data, 4096, digest);
> +return 0;
> +}
> +EOF
> +quorum_tls_cflags=`$pkg_config --cflags gnutls 2> /dev/null`
> +quorum_tls_libs=`$pkg_config --libs gnutls 2> /dev/null`
> +if compile_prog "$quorum_tls_cflags" "$quorum_tls_libs" ; then
> +  qcow_tls=yes
> +  libs_softmmu="$quorum_tls_libs $libs_softmmu"
> +  libs_tools="$quorum_tls_libs $libs_softmmu"
> +  QEMU_CFLAGS="$QEMU_CFLAGS $quorum_tls_cflags"
> +else
> +  echo "gnutls > 2.10.0 required to compile Quorum"
> +  exit 1
> +fi
> +fi
> +
> +##########################################
>   # VNC SASL detection
>   if test "$vnc" = "yes" -a "$vnc_sasl" != "no" ; then
>     cat > $TMPC <<EOF
> @@ -3758,6 +3789,7 @@ echo "TPM support       $tpm"
>   echo "libssh2 support   $libssh2"
>   echo "TPM passthrough   $tpm_passthrough"
>   echo "QOM debugging     $qom_cast_debug"
> +echo "Quorum            $quorum"
>   
>   if test "$sdl_too_old" = "yes"; then
>   echo "-> Your SDL version is too old - please upgrade to have SDL support"
> @@ -4156,6 +4188,10 @@ if test "$libssh2" = "yes" ; then
>     echo "CONFIG_LIBSSH2=y" >> $config_host_mak
>   fi
>   
> +if test "$quorum" = "yes" ; then
> +  echo "CONFIG_QUORUM=y" >> $config_host_mak
> +fi
> +
>   if test "$virtio_blk_data_plane" = "yes" ; then
>     echo 'CONFIG_VIRTIO_BLK_DATA_PLANE=$(CONFIG_VIRTIO)' >> $config_host_mak
>   fi
> diff --git a/include/monitor/monitor.h b/include/monitor/monitor.h
> index 10fa0e3..51902ed 100644
> --- a/include/monitor/monitor.h
> +++ b/include/monitor/monitor.h
> @@ -49,6 +49,8 @@ typedef enum MonitorEvent {
>       QEVENT_SPICE_MIGRATE_COMPLETED,
>       QEVENT_GUEST_PANICKED,
>       QEVENT_BLOCK_IMAGE_CORRUPTED,
> +    QEVENT_QUORUM_FAILURE,
> +    QEVENT_QUORUM_REPORT_BAD,
>   
>       /* Add to 'monitor_event_names' array in monitor.c when
>        * defining new events here */
> diff --git a/monitor.c b/monitor.c
> index 74f3f1b..89d139f 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -507,6 +507,8 @@ static const char *monitor_event_names[] = {
>       [QEVENT_SPICE_MIGRATE_COMPLETED] = "SPICE_MIGRATE_COMPLETED",
>       [QEVENT_GUEST_PANICKED] = "GUEST_PANICKED",
>       [QEVENT_BLOCK_IMAGE_CORRUPTED] = "BLOCK_IMAGE_CORRUPTED",
> +    [QEVENT_QUORUM_FAILURE] = "QUORUM_FAILURE",
> +    [QEVENT_QUORUM_REPORT_BAD] = "QUORUM_REPORT_BAD",
Could you document these events in docs/qmp/qmp-events.txt?

Max

>   };
>   QEMU_BUILD_BUG_ON(ARRAY_SIZE(monitor_event_names) != QEVENT_MAX)
>   

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 07/11] quorum: Add quorum_getlength().
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 07/11] quorum: Add quorum_getlength() Benoît Canet
@ 2013-10-04 14:49   ` Max Reitz
  0 siblings, 0 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-04 14:49 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

On 2013-10-02 14:39, Benoît Canet wrote:
> Check that every bs file return the same length.
*returns

> If not return -EIO to disable the quorum and
I'd prefer "If not," (mind the comma) or "Otherwise," (comma optional).

Max

> avoid length discrepancy.
>
> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
>   block/quorum.c | 26 ++++++++++++++++++++++++++
>   1 file changed, 26 insertions(+)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index e235ac1..b3649a4 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -541,12 +541,38 @@ static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
>       return &acb->common;
>   }
>   
> +static int64_t quorum_getlength(BlockDriverState *bs)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    int64_t result;
> +    int i;
> +
> +    /* check that all file have the same length */
> +    result = bdrv_getlength(&s->bs[0]);
> +    if (result < 0) {
> +        return result;
> +    }
> +    for (i = 1; i < s->total; i++) {
> +        int64_t value = bdrv_getlength(&s->bs[i]);
> +        if (value < 0) {
> +            return value;
> +        }
> +        if (value != result) {
> +            return -EIO;
> +        }
> +    }
> +
> +    return result;
> +}
> +
>   static BlockDriver bdrv_quorum = {
>       .format_name        = "quorum",
>       .protocol_name      = "quorum",
>   
>       .instance_size      = sizeof(BDRVQuorumState),
>   
> +    .bdrv_getlength     = quorum_getlength,
> +
>       .bdrv_aio_readv     = quorum_aio_readv,
>       .bdrv_aio_writev    = quorum_aio_writev,
>   };

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 09/11] quorum: Add quorum_co_get_block_status.
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 09/11] quorum: Add quorum_co_get_block_status Benoît Canet
@ 2013-10-04 14:51   ` Max Reitz
  0 siblings, 0 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-04 14:51 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

On 2013-10-02 14:39, Benoît Canet wrote:
> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
>   block/quorum.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 67 insertions(+)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index 223ad22..084d030 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -171,6 +171,22 @@ static int quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
>       return memcmp(a->h, b->h, HASH_LENGTH);
>   }
>   
> +static int quorum_64bits_compare(QuorumVoteValue *a, QuorumVoteValue *b)
> +{
> +    int64_t i = a->l;
> +    int64_t j = b->l;
> +
> +    if (i < j) {
> +        return -1;
> +    }
> +
> +    if (i > j) {
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
>   static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
>                                      BlockDriverState *bs,
>                                      QEMUIOVector *qiov,
> @@ -575,6 +591,56 @@ static void quorum_invalidate_cache(BlockDriverState *bs)
>       }
>   }
>   
> +static int64_t coroutine_fn quorum_co_get_block_status(BlockDriverState *bs,
> +                                                       int64_t sector_num,
> +                                                       int nb_sectors,
> +                                                       int *pnum)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    QuorumVoteVersion *winner = NULL;
> +    QuorumVotes result_votes, num_votes;
> +    QuorumVoteValue result_value, num_value;
> +    int i, num;
> +    int64_t result = 0;
> +
> +    QLIST_INIT(&result_votes.vote_list);
> +    QLIST_INIT(&num_votes.vote_list);
> +    result_votes.compare = quorum_64bits_compare;
> +    num_votes.compare = quorum_64bits_compare;
> +
> +    for (i = 0; i < s->total; i++) {
> +        result = bdrv_get_block_status(&s->bs[i], sector_num, nb_sectors, &num);
> +        /* skip failed requests */
> +        if (result < 0) {
> +            continue;
> +        }
> +        result_value.l = result & BDRV_BLOCK_DATA;
> +        num_value.l = num;
> +        quorum_count_vote(&result_votes, &result_value, i);
> +        quorum_count_vote(&num_votes, &num_value, i);
> +    }
> +
> +    winner = quorum_get_vote_winner(&result_votes);
> +    result = winner->value.l;
> +    if (winner->vote_count < s->threshold) {
> +        result = -ERANGE;
> +        goto free_exit;
> +    }
> +
> +    winner = quorum_get_vote_winner(&num_votes);
> +    if (winner->vote_count < s->threshold) {
> +        result = -ERANGE;
> +        goto free_exit;
> +    }
> +    *pnum = winner->value.l;
I wouldn't vote here. I think it would be better to take the minimum 
value for pnum (the minimum of num_votes) and consider the sector 
unallocated if it's unallocated for any child. But voting seems okay, too.

Max

> +
> +free_exit:
> +    quorum_free_vote_list(&result_votes);
> +    quorum_free_vote_list(&num_votes);
> +
> +    return result;
> +}
> +
>   static BlockDriver bdrv_quorum = {
>       .format_name        = "quorum",
>       .protocol_name      = "quorum",
> @@ -586,6 +652,7 @@ static BlockDriver bdrv_quorum = {
>       .bdrv_aio_readv     = quorum_aio_readv,
>       .bdrv_aio_writev    = quorum_aio_writev,
>       .bdrv_invalidate_cache = quorum_invalidate_cache,
> +    .bdrv_co_get_block_status = quorum_co_get_block_status,
>   };
>   
>   static void bdrv_quorum_init(void)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 10/11] quorum: Add quorum_co_flush().
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 10/11] quorum: Add quorum_co_flush() Benoît Canet
@ 2013-10-04 14:53   ` Max Reitz
  0 siblings, 0 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-04 14:53 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

On 2013-10-02 14:39, Benoît Canet wrote:
> Makes a vote to select error if any.
Voting on the error code seems a little over the top to me. ;)

I mean, the idea's nice, but then you'd have to do the same for reads 
and writes, I think. Just doing this for flushes seems a little incoherent.

Max

> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
>   block/quorum.c | 34 ++++++++++++++++++++++++++++++++++
>   1 file changed, 34 insertions(+)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index 084d030..0a28ab1 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -641,12 +641,46 @@ free_exit:
>       return result;
>   }
>   
> +static coroutine_fn int quorum_co_flush(BlockDriverState *bs)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    QuorumVoteVersion *winner = NULL;
> +    QuorumVotes error_votes;
> +    QuorumVoteValue result_value;
> +    int i;
> +    int result = 0;
> +    bool error = false;
> +
> +    QLIST_INIT(&error_votes.vote_list);
> +    error_votes.compare = quorum_64bits_compare;
> +
> +    for (i = 0; i < s->total; i++) {
> +        result = bdrv_co_flush(&s->bs[i]);
> +        if (result) {
> +            error = true;
> +            result_value.l = result;
> +            quorum_count_vote(&error_votes, &result_value, i);
> +        }
> +    }
> +
> +    if (error) {
> +        winner = quorum_get_vote_winner(&error_votes);
> +        result = winner->value.l;
> +    }
> +
> +    quorum_free_vote_list(&error_votes);
> +
> +    return result;
> +}
> +
>   static BlockDriver bdrv_quorum = {
>       .format_name        = "quorum",
>       .protocol_name      = "quorum",
>   
>       .instance_size      = sizeof(BDRVQuorumState),
>   
> +    .bdrv_co_flush_to_disk = quorum_co_flush,
> +
>       .bdrv_getlength     = quorum_getlength,
>   
>       .bdrv_aio_readv     = quorum_aio_readv,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 11/11] quorum: Add quorum_open() and quorum_close().
  2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 11/11] quorum: Add quorum_open() and quorum_close() Benoît Canet
@ 2013-10-04 15:14   ` Max Reitz
  2013-10-04 15:31     ` Eric Blake
  0 siblings, 1 reply; 36+ messages in thread
From: Max Reitz @ 2013-10-04 15:14 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

On 2013-10-02 14:39, Benoît Canet wrote:
> Example of command line:
> -drive if=virtio,file.driver=quorum,\
> file.children.0.file.filename=1.qcow2,\
> file.children.1.file.filename=2.qcow2,\
> file.children.2.file.filename=3.qcow2,\
> file.vote_threshold=3
>
> file.blkverify=on with file.vote_threshold=2 and two file can be passed to
*two files

> emulated blkverify.
>
> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
>   block/quorum.c | 338 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 338 insertions(+)
Okay, in general: Bear in mind that the singular of "children" is "child".

Second, error_setg is a shorthand to error_set(…, 
ERROR_CLASS_GENERIC_ERROR, …), so I'd propose you might make use of it – 
if you're so inclined. ;)

Third, wouldn't it be better to use children[i] instead of children.i 
for addressing them on the command line? I guess, a QMP interface would 
represent them as an array, so using the standard array index notation 
seems better to me.

Fourth, I don't like the parsing functions. They look reasonable for 
what they're supposed to do, but they just do so much hard-coded parsing 
(e.g., expect this prefix and that suffix). This is okay for now, since 
they're just used by this driver, but there's my problem: Is there no 
other, general way of parsing the command line options? (I truly don't 
know) I guess not, since blockdev.c fetches whole arguments such as 
"cache.direct" as well.

So, these parsing functions seem necessary for now, but I don't like 
them. ;)

> diff --git a/block/quorum.c b/block/quorum.c
> index 0a28ab1..3cfc67e 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -17,8 +17,13 @@
>   #include <gnutls/crypto.h>
>   #include "block/block_int.h"
>   #include "qapi/qmp/qjson.h"
> +#include "qapi/qmp/types.h"
> +#include "qemu-common.h"
>   
>   #define HASH_LENGTH 32
> +#define KEY_PREFIX "children."
> +#define KEY_SUFFIX "."
> +#define KEY_FILENAME_SUFFIX ".file.filename"
>   
>   /* This union hold a vote hash value */
>   typedef union QuorumVoteValue {
> @@ -673,12 +678,342 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs)
>       return result;
>   }
>   
> +static int quorum_match_key(const char *key,
> +                            char **key_prefix,
> +                            bool clone_key_prefix)
> +{
> +    const char *start;
> +    char *next;
> +    unsigned long long idx;
> +    int ret;
> +
> +    *key_prefix = NULL;
> +
> +    /* the following code is a translation of the following pseudo code:
> +     *       match = key.match('(^children\.(\d+)\.)file\.filename)
> +     *       if not match:
> +     *           return -1;
> +     *       key_prefix = match.outer_match()
> +     *       idx = match.inner_match()
> +     *
> +     * note: we also match the .file suffix to avoid futher checkings
> +     */
> +
> +    /* this key is not a children */
> +    if (strncmp(key, KEY_PREFIX, strlen(KEY_PREFIX))) {
You could use sizeof instead of strlen here (and for every other string 
constant as well), though I don't know whether this is acceptable coding 
style for qemu. ;)

> +        return -1;
> +    }
> +
> +    /* take first char after prefix */
> +    start = key + strlen(KEY_PREFIX);
> +
> +    /* if the string end here -> scan fail */
> +    if (start[0] == '\0') {
> +        return -1;
> +    }
> +
> +    /* try to retrieve the index */
> +    ret = parse_uint(start, &idx, &next, 10);
> +
> +    /* no int found -> scan fail */
> +    if (ret < 0) {
> +        return -1;
> +    }
> +
> +    /* match the ".file.filename" suffix to avoid matching the same idx
> +     * multiple times and be required to do more checks later
> +     */
> +    if (strncmp(next, KEY_FILENAME_SUFFIX, strlen(KEY_FILENAME_SUFFIX))) {
> +        return -1;
> +    }
> +
> +    if (clone_key_prefix) {
> +        /* include '.' */
> +        int len = next + strlen(KEY_SUFFIX) - key;
> +        *key_prefix = g_strndup(key, len);
> +    }
> +
> +    return idx;
> +}
> +
> +static QDict *quorum_get_children_idx(const QDict *options)
> +{
> +    const QDictEntry *ent;
> +    QDict *result;
> +    char *key_prefix;
> +    int idx;
> +
> +    result = qdict_new();
> +
> +    for (ent = qdict_first(options); ent; ent = qdict_next(options, ent)) {
> +        const char *key = qdict_entry_key(ent);
> +        idx = quorum_match_key(key, &key_prefix, true);
> +
> +        /* if the result is -1 or any negative index value skip this key */
> +        if (idx < 0) {
> +            continue;
> +        }
> +
> +        qdict_put(result, key_prefix, qint_from_int(idx));
> +    }
> +
> +    return result;
> +}
> +
> +static int quorum_fill_validation_array(bool *array,
> +                                        const QDict *dict,
> +                                        int total,
> +                                        Error **errp)
> +{
> +    const QDictEntry *ent;
> +
> +    /* fill the checking array with children indexes */
> +    for (ent = qdict_first(dict); ent; ent = qdict_next(dict, ent)) {
> +        const char *key = qdict_entry_key(ent);
> +        int idx = qdict_get_int(dict, key);
> +
> +        if (idx < 0 || idx >= total) {
> +            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
> +                      "Children index must be between 0 and children count -1");
I'd prefer "- 1" instead of "-1" (or even "Child index must be an 
unsigned integer smaller than the children count").

> +            return -ERANGE;
> +        }
> +
> +        array[idx] = true;
> +    }
> +
> +    return 0;
> +}
> +
> +static int quorum_valid_indexes(const bool *array, int total, Error **errp)
> +{
> +    int i;
> +
> +    for (i = 0; i < total; i++) {
> +        if (array[i] == true) {
> +            continue;
> +        }
> +
> +        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
> +                  "All children indexes between 0 and children count -1  must "
> +                  " be used");
Again, I'd prefer "- 1"; then, there are two spaces both left and right 
of must – is this intentional?

> +        return -ERANGE;
> +    }
> +
> +    return 0;
> +}
> +
> +static int quorum_valid_children_indexes(const QDict *dict,
> +                                         int total,
> +                                         Error **errp)
> +{
> +    bool *array;
> +    int ret = 0;;
> +
> +    /* allocate indexes checking array and put false in it */
> +    array = g_new0(bool, total);
> +
> +    ret = quorum_fill_validation_array(array, dict, total, errp);
> +    if (ret < 0) {
> +        goto free_exit;
> +    }
> +
> +    ret = quorum_valid_indexes(array, total, errp);
> +free_exit:
> +    g_free(array);
> +    return ret;
> +}
> +
> +static int quorum_valid_threshold(int threshold,
> +                                  int total,
> +                                  Error **errp)
> +{
> +
> +    if (threshold < 1) {
> +        error_set(errp, QERR_INVALID_PARAMETER_VALUE,
> +                  "vote_threshold", "value >= 1");
> +        return -ERANGE;
> +    }
> +
> +    if (threshold > total) {
> +        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
> +                  "threshold <= children count must be true");
Well, while this might technically be true, I'd rather go for "threshold 
may not exceed children count" instead. ;)

> +        return -ERANGE;
> +    }
> +
> +    return 0;
> +}
> +
> +static int quorum_bdrv_open_sub(BlockDriverState *bs,
> +                                QDict *bs_dict,
> +                                int flags,
> +                                Error **errp)
> +{
> +    const char *format_name = NULL;
> +    const char *filename = NULL;
> +    BlockDriver *drv = NULL;
> +    int ret = 0;
> +
> +    format_name = qdict_get_try_str(bs_dict, "file.driver");
> +
> +    if (!format_name) {
> +        filename = qdict_get_try_str(bs_dict, "file.filename");
> +        qdict_del(bs_dict, "file.filename");
> +    } else {
> +        drv = bdrv_find_format(format_name);
> +        if (!drv) {
> +            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
> +                      "invalid format name");
> +            return -EINVAL;
> +        }
> +    }
> +
> +    QINCREF(bs_dict);
> +    ret = bdrv_open(bs,
> +                    filename,
> +                    bs_dict,
> +                    flags,
> +                    drv,
> +                    errp);
Why don't you write this on a single line?

> +    QDECREF(bs_dict);
> +    return ret;
> +}
> +
> +static int quorum_open(BlockDriverState *bs,
> +                       QDict *options,
> +                       int flags,
> +                       Error **errp)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    Error *local_err = NULL;
> +    const QDictEntry *ent;
> +    QDict *idx_dict;
> +    bool *opened;
> +    const char *value;
> +    char *next;
> +    int i;
> +    int ret = 0;
> +
> +    /* get a dict of children indexes for validation */
> +    idx_dict = quorum_get_children_idx(options);
> +
> +    /* count how many different children indexes are present and validate */
> +    s->total = qdict_size(idx_dict);
> +    if (s->total < 2) {
> +        error_set(&local_err, ERROR_CLASS_GENERIC_ERROR,
> +                  "Number of provided children must be greater than 1");
> +        ret = -EINVAL;
> +        goto exit;
> +    }
> +
> +    /* validate that the set of index is coherent */
> +    ret = quorum_valid_children_indexes(idx_dict, s->total, &local_err);
> +    if (ret < 0) {
> +        goto exit;
> +    }
> +
> +    /* retrieve the threshold option from the command line */
> +    value = qdict_get_try_str(options, "vote_threshold");
> +    if (!value) {
> +        error_set(&local_err, ERROR_CLASS_GENERIC_ERROR,
> +                  "vote_threshold must be provided");
> +        ret = -EINVAL;
> +        goto exit;
> +    }
> +    qdict_del(options, "vote_threshold");
> +
> +    ret = parse_uint(value, (unsigned long long *) &s->threshold, &next, 10);
> +
> +    /* no int found -> scan fail */
> +    if (ret < 0) {
> +        error_set(&local_err, ERROR_CLASS_GENERIC_ERROR,
> +                  "invalid vote_threshold specified");
> +        ret = -EINVAL;
> +        goto exit;
> +    }
> +
> +    /* and validate it againts s->total */
*against

> +    ret = quorum_valid_threshold(s->threshold, s->total, &local_err);
> +    if (ret < 0) {
> +        goto exit;
> +    }
> +
> +    /* is the driver in blkverify mode */
> +    value = qdict_get_try_str(options, "blkverify");
> +    if (value && !strcmp(value, "on")  &&
> +        s->total == 2 && s->threshold == 2) {
> +        s->is_blkverify = true;
> +    }
If the user specifies anything different from "on" or if he does and 
s->total or s->threshold is not 2, quorum will silently work in 
non-blkverify mode without telling the user. So I'd emit a message here 
if blkverify has been specified and its value is different from "off" 
but s->is_blkverify remains false (i.e., “else if (value && 
strcmp(value, "off")) { fprintf(stderr, "[Some warning]"); }”).

> +    qdict_del(options, "blkverify");
> +
> +    /* allocate the children BlockDriverState array */
> +    s->bs = g_new0(BlockDriverState, s->total);
> +    opened = g_new0(bool, s->total);
> +
> +    /* open children bs */
> +    for (ent = qdict_first(idx_dict);
> +         ent; ent = qdict_next(idx_dict, ent)) {
> +        const char *key = qdict_entry_key(ent);
> +        int idx = qdict_get_int(idx_dict, key);
> +        int ret = 0;
Remove this declaration, it shadows the outer ret. This results in that 
outer ret being 0 even if quorum_bdrv_open_sub returns an error which 
makes this whole function return 0 (instead of propagating the return 
value of quorum_bdrv_open_sub).

> +        QDict *bs_dict; /* dict used to open child */
> +        qdict_extract_subqdict(options, &bs_dict, key);
> +        ret = quorum_bdrv_open_sub(&s->bs[idx],
> +                                   bs_dict,
> +                                   flags,
> +                                   &local_err);
> +        if (ret < 0) {
> +            QDECREF(bs_dict);
I think you should remove this QDECREF. I don't really know why, but I 
think basically bdrv_open takes ownership of it. Anyway, if this QDECREF 
stays, qemu crashes when a child could not be opened due to memory 
corruption, so it's probably not correct. ;)

> +            goto close_exit;
> +        }
> +        opened[idx] = true;
> +    }
> +
> +    g_free(opened);
> +    goto exit;
> +
> +close_exit:
> +    /* cleanup on error */
> +    for (i = 0; i < s->total; i++) {
> +        if (!opened[i]) {
> +            continue;
> +        }
> +        bdrv_close(&s->bs[i]);
> +    }
> +    g_free(s->bs);
> +    g_free(opened);
> +exit:
> +    /* propagate error */
> +    QDECREF(idx_dict);
I think you should swap these two lines.

Max

> +    if (error_is_set(&local_err)) {
> +        error_propagate(errp, local_err);
> +    }
> +    return ret;
> +}
> +
> +static void quorum_close(BlockDriverState *bs)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    int i;
> +
> +    for (i = 0; i < s->total; i++) {
> +        /* Ensure writes reach stable storage */
> +        bdrv_flush(&s->bs[i]);
> +        /* close manually because we'll free s->bs */
> +        bdrv_close(&s->bs[i]);
> +    }
> +
> +    g_free(s->bs);
> +}
> +
>   static BlockDriver bdrv_quorum = {
>       .format_name        = "quorum",
>       .protocol_name      = "quorum",
>   
>       .instance_size      = sizeof(BDRVQuorumState),
>   
> +    .bdrv_file_open     = quorum_open,
> +    .bdrv_close         = quorum_close,
> +
>       .bdrv_co_flush_to_disk = quorum_co_flush,
>   
>       .bdrv_getlength     = quorum_getlength,
> @@ -687,6 +1022,9 @@ static BlockDriver bdrv_quorum = {
>       .bdrv_aio_writev    = quorum_aio_writev,
>       .bdrv_invalidate_cache = quorum_invalidate_cache,
>       .bdrv_co_get_block_status = quorum_co_get_block_status,
> +
> +    /* disable external snapshots while waiting for block filters */
> +    .bdrv_check_ext_snapshot = bdrv_check_ext_snapshot_forbidden,
>   };
>   
>   static void bdrv_quorum_init(void)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 11/11] quorum: Add quorum_open() and quorum_close().
  2013-10-04 15:14   ` Max Reitz
@ 2013-10-04 15:31     ` Eric Blake
  2013-10-07  7:23       ` Max Reitz
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Blake @ 2013-10-04 15:31 UTC (permalink / raw)
  To: Max Reitz; +Cc: kwolf, Benoît Canet, stefanha, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1012 bytes --]

On 10/04/2013 09:14 AM, Max Reitz wrote:
> On 2013-10-02 14:39, Benoît Canet wrote:
>> Example of command line:
>> -drive if=virtio,file.driver=quorum,\
>> file.children.0.file.filename=1.qcow2,\
>> file.children.1.file.filename=2.qcow2,\
>> file.children.2.file.filename=3.qcow2,\
>> file.vote_threshold=3
>>

> 
> Third, wouldn't it be better to use children[i] instead of children.i
> for addressing them on the command line? I guess, a QMP interface would
> represent them as an array, so using the standard array index notation
> seems better to me.

No, we already discussed this.  file.child[0].file.filename=1.qcow2
requires shell quoting to avoid unintentional shell globbing;
file.child.0.file.filename=1.qcow2 is just as legible and can still be
reconstructed into an array under the hood, and has the benefit of not
needing quoting to protect against shell globbing.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 00/11] Quorum block driver
  2013-10-04 14:31 ` [Qemu-devel] [PATCH V9 00/11] Quorum block driver Max Reitz
@ 2013-10-04 22:25   ` Benoît Canet
  2013-10-29  7:59   ` Benoît Canet
  1 sibling, 0 replies; 36+ messages in thread
From: Benoît Canet @ 2013-10-04 22:25 UTC (permalink / raw)
  To: Max Reitz; +Cc: kwolf, qemu-devel, stefanha

Le Friday 04 Oct 2013 à 16:31:56 (+0200), Max Reitz a écrit :
> On 2013-10-02 14:39, Benoît Canet wrote:
> >It must be applied on top of "block: Add BlockDriver.bdrv_check_ext_snapshot."
> >
> >This patchset create a block driver implementing a quorum using total qemu disk
> >images. Writes are mirrored on the $total files.
> >For the reading part the $total files are read at the same time and a vote is
> >done to determine if a qiov version is present $threshold or more times. It then
> >return this majority version to the upper layers.
> >When i < $threshold versions of the data are returned by the lower layer the
> >quorum is broken and the read return -EIO.
> >
> >The goal of this patchset is to be turned in a QEMU block filter living just
> >above raw-*.c and below qcow2/qed when the required infrastructure will be done.
> >
> >Main use of this feature will be people using NFS appliances which can be
> >subjected to bitflip errors.
> >
> >This patchset can be used to replace blkverify and the out of tree blkmirror.
> >
> >v9: integrate latests comments from kevin
> >
> >Benoît Canet (11):
> >   quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB.
> >   quorum: Create BDRVQuorumState and BlkDriver and do init.
> >   quorum: Add quorum_aio_writev and its dependencies.
> >   blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from
> >     blkverify.
> >   quorum: Add quorum_aio_readv.
> >   quorum: Add quorum mechanism.
> >   quorum: Add quorum_getlength().
> >   quorum: Add quorum_invalidate_cache().
> >   quorum: Add quorum_co_get_block_status.
> >   quorum: Add quorum_co_flush().
> >   quorum: Add quorum_open() and quorum_close().
> >
> >  block/Makefile.objs       |    1 +
> >  block/blkverify.c         |  108 +----
> >  block/quorum.c            | 1035 +++++++++++++++++++++++++++++++++++++++++++++
> >  configure                 |   36 ++
> >  include/monitor/monitor.h |    2 +
> >  include/qemu-common.h     |    2 +
> >  monitor.c                 |    2 +
> >  util/iov.c                |  103 +++++
> >  8 files changed, 1183 insertions(+), 106 deletions(-)
> >  create mode 100644 block/quorum.c
> >
> 
> Could you add a qemu-iotest if possible?

Thanks for the review Max.
I will iterate.

Best regards

Benoît

> 
> Max
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 11/11] quorum: Add quorum_open() and quorum_close().
  2013-10-04 15:31     ` Eric Blake
@ 2013-10-07  7:23       ` Max Reitz
  0 siblings, 0 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-07  7:23 UTC (permalink / raw)
  To: Eric Blake; +Cc: kwolf, Benoît Canet, stefanha, qemu-devel

On 2013-10-04 17:31, Eric Blake wrote:
> On 10/04/2013 09:14 AM, Max Reitz wrote:
>> On 2013-10-02 14:39, Benoît Canet wrote:
>>> Example of command line:
>>> -drive if=virtio,file.driver=quorum,\
>>> file.children.0.file.filename=1.qcow2,\
>>> file.children.1.file.filename=2.qcow2,\
>>> file.children.2.file.filename=3.qcow2,\
>>> file.vote_threshold=3
>>>
>> Third, wouldn't it be better to use children[i] instead of children.i
>> for addressing them on the command line? I guess, a QMP interface would
>> represent them as an array, so using the standard array index notation
>> seems better to me.
> No, we already discussed this.  file.child[0].file.filename=1.qcow2
> requires shell quoting to avoid unintentional shell globbing;
> file.child.0.file.filename=1.qcow2 is just as legible and can still be
> reconstructed into an array under the hood, and has the benefit of not
> needing quoting to protect against shell globbing.

Oh, sorry.I thought escaping wouldn't hurt too much; but well, on second 
thought, for an interface that's purely designed for the command line 
it's probably best to avoid the need for it as much as possible.

Max

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 03/11] quorum: Add quorum_aio_writev and its dependencies.
  2013-10-04 14:35   ` Max Reitz
@ 2013-10-28 12:21     ` Benoît Canet
  2013-10-29 17:21       ` Max Reitz
  0 siblings, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-28 12:21 UTC (permalink / raw)
  To: Max Reitz; +Cc: kwolf, qemu-devel, stefanha

Le Friday 04 Oct 2013 à 16:35:18 (+0200), Max Reitz a écrit :
> On 2013-10-02 14:39, Benoît Canet wrote:
> >Signed-off-by: Benoit Canet <benoit@irqsave.net>
> >---
> >  block/quorum.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 123 insertions(+)
> >
> >diff --git a/block/quorum.c b/block/quorum.c
> >index 9557e61..b49e3c6 100644
> >--- a/block/quorum.c
> >+++ b/block/quorum.c
> >@@ -64,11 +64,134 @@ struct QuorumAIOCB {
> >      int vote_ret;
> >  };
> >+static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
> >+{
> >+    QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
> >+    bool finished = false;
> >+
> >+    /* Wait for the request to finish */
> >+    acb->finished = &finished;
> >+    while (!finished) {
> >+        qemu_aio_wait();
> >+    }
> Hm, wouldn't it be better to pass the cancel to the children?
> 
> Max

Hi Max,

Hi don't understand how you would do this.

Best regards

Benoît

> 
> >+}
> >+
> >+static AIOCBInfo quorum_aiocb_info = {
> >+    .aiocb_size         = sizeof(QuorumAIOCB),
> >+    .cancel             = quorum_aio_cancel,
> >+};
> >+
> >+/* return the first error code get by each individual callbacks */
> >+static int quorum_get_first_error(QuorumAIOCB *acb)
> >+{
> >+    BDRVQuorumState *s = acb->bqs;
> >+    int i, ret = 0;
> >+
> >+    for (i = 0; i < s->total; i++) {
> >+        ret = acb->aios[i].ret;
> >+        if (ret) {
> >+            return ret;
> >+        }
> >+    }
> >+
> >+    /* should not pass here */
> >+    assert(false);
> >+}
> >+
> >+static void quorum_aio_finalize(QuorumAIOCB *acb)
> >+{
> >+    BDRVQuorumState *s = acb->bqs;
> >+    int ret;
> >+
> >+    ret = s->threshold <= acb->success_count ? 0 : quorum_get_first_error(acb);
> >+
> >+    acb->common.cb(acb->common.opaque, ret);
> >+    if (acb->finished) {
> >+        *acb->finished = true;
> >+    }
> >+    g_free(acb->aios);
> >+    qemu_aio_release(acb);
> >+}
> >+
> >+static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
> >+                                   BlockDriverState *bs,
> >+                                   QEMUIOVector *qiov,
> >+                                   uint64_t sector_num,
> >+                                   int nb_sectors,
> >+                                   BlockDriverCompletionFunc *cb,
> >+                                   void *opaque)
> >+{
> >+    QuorumAIOCB *acb = qemu_aio_get(&quorum_aiocb_info, bs, cb, opaque);
> >+    int i;
> >+
> >+    acb->bqs = s;
> >+    acb->sector_num = sector_num;
> >+    acb->nb_sectors = nb_sectors;
> >+    acb->qiov = qiov;
> >+    acb->aios = g_new0(QuorumSingleAIOCB, s->total);
> >+    acb->count = 0;
> >+    acb->success_count = 0;
> >+    acb->finished = NULL;
> >+    acb->is_read = false;
> >+    acb->vote_ret = 0;
> >+
> >+    for (i = 0; i < s->total; i++) {
> >+        acb->aios[i].buf = NULL;
> >+        acb->aios[i].ret = 0;
> >+        acb->aios[i].parent = acb;
> >+    }
> >+
> >+    return acb;
> >+}
> >+
> >+static void quorum_aio_cb(void *opaque, int ret)
> >+{
> >+    QuorumSingleAIOCB *sacb = opaque;
> >+    QuorumAIOCB *acb = sacb->parent;
> >+    BDRVQuorumState *s = acb->bqs;
> >+
> >+    sacb->ret = ret;
> >+    acb->count++;
> >+    if (ret == 0) {
> >+        acb->success_count++;
> >+    }
> >+    assert(acb->count <= s->total);
> >+    assert(acb->success_count <= s->total);
> >+    if (acb->count < s->total) {
> >+        return;
> >+    }
> >+
> >+    quorum_aio_finalize(acb);
> >+}
> >+
> >+static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
> >+                                          int64_t sector_num,
> >+                                          QEMUIOVector *qiov,
> >+                                          int nb_sectors,
> >+                                          BlockDriverCompletionFunc *cb,
> >+                                          void *opaque)
> >+{
> >+    BDRVQuorumState *s = bs->opaque;
> >+    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num, nb_sectors,
> >+                                      cb, opaque);
> >+    int i;
> >+
> >+    for (i = 0; i < s->total; i++) {
> >+        acb->aios[i].aiocb = bdrv_aio_writev(&s->bs[i], sector_num, qiov,
> >+                                             nb_sectors, &quorum_aio_cb,
> >+                                             &acb->aios[i]);
> >+    }
> >+
> >+    return &acb->common;
> >+}
> >+
> >  static BlockDriver bdrv_quorum = {
> >      .format_name        = "quorum",
> >      .protocol_name      = "quorum",
> >      .instance_size      = sizeof(BDRVQuorumState),
> >+
> >+    .bdrv_aio_writev    = quorum_aio_writev,
> >  };
> >  static void bdrv_quorum_init(void)
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 06/11] quorum: Add quorum mechanism.
  2013-10-04 14:48   ` Max Reitz
@ 2013-10-28 12:31     ` Benoît Canet
  2013-10-29 16:42       ` Max Reitz
  2013-10-28 13:04     ` Benoît Canet
  1 sibling, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-28 12:31 UTC (permalink / raw)
  To: Max Reitz; +Cc: kwolf, qemu-devel, stefanha

Le Friday 04 Oct 2013 à 16:48:12 (+0200), Max Reitz a écrit :
> On 2013-10-02 14:39, Benoît Canet wrote:
> >Use gnutls's SHA-256 to compare versions.
> Wouldn't CRC32 suffice? (I don't really oppose using SHA, but taking
> in gnutls as a dependency just for comparing several memory areas
> seems a bit much to me)

Initially it gzip's addler32 was used but someone was concerned with the risk
of collisions.
Anyway the code fallback using hashes only when something wrong is detected so
it won't impact the normal case.

Best regards

Benoît

> 
> >Signed-off-by: Benoit Canet <benoit@irqsave.net>
> >---
> >  block/Makefile.objs       |   2 +-
> >  block/quorum.c            | 321 +++++++++++++++++++++++++++++++++++++++++++++-
> >  configure                 |  36 ++++++
> >  include/monitor/monitor.h |   2 +
> >  monitor.c                 |   2 +
> >  5 files changed, 361 insertions(+), 2 deletions(-)
> >
> >diff --git a/block/Makefile.objs b/block/Makefile.objs
> >index 05a65c2..adcdc21 100644
> >--- a/block/Makefile.objs
> >+++ b/block/Makefile.objs
> >@@ -3,7 +3,7 @@ block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-c
> >  block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
> >  block-obj-y += qed-check.o
> >  block-obj-y += vhdx.o
> >-block-obj-y += quorum.o
> >+block-obj-$(CONFIG_QUORUM) += quorum.o
> >  block-obj-y += parallels.o blkdebug.o blkverify.o
> >  block-obj-y += snapshot.o qapi.o
> >  block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
> >diff --git a/block/quorum.c b/block/quorum.c
> >index f0fc0e9..e235ac1 100644
> >--- a/block/quorum.c
> >+++ b/block/quorum.c
> >@@ -13,7 +13,43 @@
> >   * See the COPYING file in the top-level directory.
> >   */
> >+#include <gnutls/gnutls.h>
> >+#include <gnutls/crypto.h>
> >  #include "block/block_int.h"
> >+#include "qapi/qmp/qjson.h"
> >+
> >+#define HASH_LENGTH 32
> >+
> >+/* This union hold a vote hash value */
> *holds
> 
> >+typedef union QuorumVoteValue {
> >+    char h[HASH_LENGTH];       /* SHA-256 hash */
> >+    int64_t l;                 /* simpler 64 bits hash */
> >+} QuorumVoteValue;
> >+
> >+/* A vote item */
> >+typedef struct QuorumVoteItem {
> >+    int index;
> >+    QLIST_ENTRY(QuorumVoteItem) next;
> >+} QuorumVoteItem;
> >+
> >+/* this structure is a vote version. A version is the set of vote sharing the
> *set of votes
> 
> >+ * same vote value.
> >+ * The set of vote will be tracked with the items field and it's count is
> *set of votes or *vote set; also s/it's count/its cardinality/ or
> something like that
> 
> >+ * vote_count.
> >+ */
> >+typedef struct QuorumVoteVersion {
> >+    QuorumVoteValue value;
> >+    int index;
> >+    int vote_count;
> >+    QLIST_HEAD(, QuorumVoteItem) items;
> >+    QLIST_ENTRY(QuorumVoteVersion) next;
> >+} QuorumVoteVersion;
> >+
> >+/* this structure hold a group of vote versions together */
> *holds
> 
> >+typedef struct QuorumVotes {
> >+    QLIST_HEAD(, QuorumVoteVersion) vote_list;
> >+    int (*compare)(QuorumVoteValue *a, QuorumVoteValue *b);
> >+} QuorumVotes;
> >  /* the following structure hold the state of one quorum instance */
> >  typedef struct {
> >@@ -60,10 +96,14 @@ struct QuorumAIOCB {
> >      int success_count;          /* number of successfully completed AIOCB */
> >      bool *finished;             /* completion signal for cancel */
> >+    QuorumVotes votes;
> >+
> >      bool is_read;
> >      int vote_ret;
> >  };
> >+static void quorum_vote(QuorumAIOCB *acb);
> >+
> >  static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
> >  {
> >      QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
> >@@ -111,6 +151,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >          acb->aios[i].ret = 0;
> >      }
> >+    if (acb->vote_ret) {
> >+        ret = acb->vote_ret;
> >+    }
> >+
> >      acb->common.cb(acb->common.opaque, ret);
> >      if (acb->finished) {
> >          *acb->finished = true;
> >@@ -122,6 +166,11 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >      qemu_aio_release(acb);
> >  }
> >+static int quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
> >+{
> >+    return memcmp(a->h, b->h, HASH_LENGTH);
> >+}
> >+
> >  static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
> >                                     BlockDriverState *bs,
> >                                     QEMUIOVector *qiov,
> >@@ -141,6 +190,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
> >      acb->count = 0;
> >      acb->success_count = 0;
> >      acb->finished = NULL;
> >+    acb->votes.compare = quorum_sha256_compare;
> >      acb->is_read = false;
> >      acb->vote_ret = 0;
> >@@ -170,9 +220,278 @@ static void quorum_aio_cb(void *opaque, int ret)
> >          return;
> >      }
> >+    /* Do the vote on read */
> >+    if (acb->is_read) {
> >+        quorum_vote(acb);
> >+    }
> >+
> >      quorum_aio_finalize(acb);
> >  }
> >+static void quorum_report_bad(QuorumAIOCB *acb, int index)
> >+{
> >+    QObject *data;
> >+    data = qobject_from_jsonf("{ 'children-index': %i"
> I'd prefer child-index. Generally, remember that the singular of
> "children" is "child".
> 
> >+                              ", 'sector-num': %" PRId64
> >+                              ", 'sectors-count': %i }",
> >+                              index,
> >+                              acb->sector_num,
> >+                              acb->nb_sectors);
> >+    monitor_protocol_event(QEVENT_QUORUM_REPORT_BAD, data);
> How about decrementing the refcount of data?
> 
> >+}
> >+
> >+static void quorum_report_failure(QuorumAIOCB *acb)
> >+{
> >+    QObject *data;
> >+    data = qobject_from_jsonf("{ 'sector-num': %" PRId64
> >+                              ", 'sectors-count': %i }",
> >+                              acb->sector_num,
> >+                              acb->nb_sectors);
> >+    monitor_protocol_event(QEVENT_QUORUM_FAILURE, data);
> Same here.
> 
> >+}
> >+
> >+static void quorum_report_bad_versions(QuorumAIOCB *acb,
> >+                                      QuorumVoteValue *value)
> >+{
> >+    QuorumVoteVersion *version;
> >+    QuorumVoteItem *item;
> >+
> >+    QLIST_FOREACH(version, &acb->votes.vote_list, next) {
> >+        if (!acb->votes.compare(&version->value, value)) {
> >+            continue;
> >+        }
> >+        QLIST_FOREACH(item, &version->items, next) {
> >+            quorum_report_bad(acb, item->index);
> >+        }
> >+    }
> >+}
> >+
> >+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> >+{
> >+    int i;
> >+    assert(dest->niov == source->niov);
> >+    assert(dest->size == source->size);
> >+    for (i = 0; i < source->niov; i++) {
> >+        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> >+        memcpy(dest->iov[i].iov_base,
> >+               source->iov[i].iov_base,
> >+               source->iov[i].iov_len);
> >+    }
> >+}
> >+
> >+static void quorum_count_vote(QuorumVotes *votes,
> >+                              QuorumVoteValue *value,
> >+                              int index)
> >+{
> >+    QuorumVoteVersion *v = NULL, *version = NULL;
> >+    QuorumVoteItem *item;
> >+
> >+    /* look if we have something with this hash */
> >+    QLIST_FOREACH(v, &votes->vote_list, next) {
> >+        if (!votes->compare(&v->value, value)) {
> >+            version = v;
> >+            break;
> >+        }
> >+    }
> >+
> >+    /* It's a version not yet in the list add it */
> >+    if (!version) {
> >+        version = g_new0(QuorumVoteVersion, 1);
> >+        QLIST_INIT(&version->items);
> >+        memcpy(&version->value, value, sizeof(version->value));
> >+        version->index = index;
> >+        version->vote_count = 0;
> >+        QLIST_INSERT_HEAD(&votes->vote_list, version, next);
> >+    }
> >+
> >+    version->vote_count++;
> >+
> >+    item = g_new0(QuorumVoteItem, 1);
> >+    item->index = index;
> >+    QLIST_INSERT_HEAD(&version->items, item, next);
> >+}
> >+
> >+static void quorum_free_vote_list(QuorumVotes *votes)
> >+{
> >+    QuorumVoteVersion *version, *next_version;
> >+    QuorumVoteItem *item, *next_item;
> >+
> >+    QLIST_FOREACH_SAFE(version, &votes->vote_list, next, next_version) {
> >+        QLIST_REMOVE(version, next);
> >+        QLIST_FOREACH_SAFE(item, &version->items, next, next_item) {
> >+            QLIST_REMOVE(item, next);
> >+            g_free(item);
> >+        }
> >+        g_free(version);
> >+    }
> >+}
> >+
> >+static int quorum_compute_hash(QuorumAIOCB *acb, int i, QuorumVoteValue *hash)
> >+{
> >+    int j, ret;
> >+    gnutls_hash_hd_t dig;
> >+    QEMUIOVector *qiov = &acb->aios[i].qiov;
> >+
> >+    ret = gnutls_hash_init(&dig, GNUTLS_DIG_SHA256);
> >+
> >+    if (ret < 0) {
> >+        return ret;
> >+    }
> >+
> >+    for (j = 0; j < qiov->niov; j++) {
> >+        ret = gnutls_hash(dig, qiov->iov[j].iov_base, qiov->iov[j].iov_len);
> >+        if (ret < 0) {
> >+            return ret;
> gnutls_hash_deinit?
> 
> >+        }
> >+    }
> >+
> >+    gnutls_hash_deinit(dig, (void *) hash);
> >+
> >+    return 0;
> >+}
> >+
> >+static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
> >+{
> >+    int i = 0;
> >+    QuorumVoteVersion *candidate, *winner = NULL;
> >+
> >+    QLIST_FOREACH(candidate, &votes->vote_list, next) {
> >+        if (candidate->vote_count > i) {
> >+            i = candidate->vote_count;
> >+            winner = candidate;
> >+        }
> >+    }
> >+
> >+    return winner;
> >+}
> >+
> >+static bool quorum_iovec_compare(QEMUIOVector *a, QEMUIOVector *b)
> >+{
> >+    int i;
> >+    int result;
> >+
> >+    assert(a->niov == b->niov);
> >+    for (i = 0; i < a->niov; i++) {
> >+        assert(a->iov[i].iov_len == b->iov[i].iov_len);
> >+        result = memcmp(a->iov[i].iov_base,
> >+                        b->iov[i].iov_base,
> >+                        a->iov[i].iov_len);
> >+        if (result) {
> >+            return false;
> >+        }
> >+    }
> >+
> >+    return true;
> >+}
> >+
> >+static void GCC_FMT_ATTR(2, 3) quorum_err(QuorumAIOCB *acb,
> >+                                          const char *fmt, ...)
> >+{
> >+    va_list ap;
> >+
> >+    va_start(ap, fmt);
> >+    fprintf(stderr, "quorum: sector_num=%" PRId64 " nb_sectors=%d ",
> >+            acb->sector_num, acb->nb_sectors);
> >+    vfprintf(stderr, fmt, ap);
> >+    fprintf(stderr, "\n");
> >+    va_end(ap);
> >+    exit(1);
> >+}
> >+
> >+static bool quorum_compare(QuorumAIOCB *acb,
> >+                           QEMUIOVector *a,
> >+                           QEMUIOVector *b)
> >+{
> >+    BDRVQuorumState *s = acb->bqs;
> >+    ssize_t offset;
> >+
> >+    /* This driver will replace blkverify in this particular case */
> >+    if (s->is_blkverify) {
> >+        offset = qemu_iovec_compare(a, b);
> >+        if (offset != -1) {
> >+            quorum_err(acb, "contents mismatch in sector %" PRId64,
> >+                       acb->sector_num +
> >+                       (uint64_t)(offset / BDRV_SECTOR_SIZE));
> >+        }
> >+        return true;
> >+    }
> >+
> >+    return quorum_iovec_compare(a, b);
> >+}
> >+
> >+static void quorum_vote(QuorumAIOCB *acb)
> >+{
> >+    bool quorum = true;
> >+    int i, j, ret;
> >+    QuorumVoteValue hash;
> >+    BDRVQuorumState *s = acb->bqs;
> >+    QuorumVoteVersion *winner;
> >+
> >+    /* get the index of the first successful read */
> >+    for (i = 0; i < s->total; i++) {
> >+        if (!acb->aios[i].ret) {
> >+            break;
> >+        }
> >+    }
> >+
> >+    /* compare this read with all other successful read looking for quorum */
> >+    for (j = i + 1; j < s->total; j++) {
> >+        if (acb->aios[j].ret) {
> >+            continue;
> >+        }
> >+        quorum = quorum_compare(acb, &acb->aios[i].qiov, &acb->aios[j].qiov);
> >+        if (!quorum) {
> >+            break;
> >+       }
> >+    }
> >+
> >+    /* Every successful read agrees -> Quorum */
> >+    if (quorum) {
> >+        quorum_copy_qiov(acb->qiov, &acb->aios[i].qiov);
> If no reads were successful at all, quorum will be true and i will
> be s->total. Therefore, this will result in an array access with an
> index out of bounds here.
> 
> Generally, why is this function executed at all if any read failed?
> If a read fails, the quorum read function will return an error, thus
> the caller will ignore the result anyway.
> 
> >+        return;
> >+    }
> >+
> >+    /* compute hashs for each successful read, also store indexes */
> >+    for (i = 0; i < s->total; i++) {
> >+        if (acb->aios[i].ret) {
> >+            continue;
> >+        }
> >+        ret = quorum_compute_hash(acb, i, &hash);
> >+        /* if ever the hash computation failed */
> >+        if (ret < 0) {
> >+            acb->vote_ret = ret;
> >+            goto free_exit;
> >+        }
> >+        quorum_count_vote(&acb->votes, &hash, i);
> >+    }
> >+
> >+    /* vote to select the most represented version */
> >+    winner = quorum_get_vote_winner(&acb->votes);
> >+    /* every vote version are differents -> error */
> >+    if (!winner) {
> >+        quorum_report_failure(acb);
> >+        acb->vote_ret = -EIO;
> >+        goto free_exit;
> >+    }
> >+
> >+    /* if the winner count is smaller then threshold read fail */
> s/then/than/, s/ read fail/, the read fails/
> 
> >+    if (winner->vote_count < s->threshold) {
> >+        quorum_report_failure(acb);
> >+        acb->vote_ret = -EIO;
> >+        goto free_exit;
> >+    }
> >+
> >+    /* we have a winner: copy it */
> >+    quorum_copy_qiov(acb->qiov, &acb->aios[winner->index].qiov);
> >+
> >+    /* some versions are bad print them */
> >+    quorum_report_bad_versions(acb, &winner->value);
> >+
> >+free_exit:
> >+    /* free lists */
> >+    quorum_free_vote_list(&acb->votes);
> >+}
> >+
> >  static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> >                                           int64_t sector_num,
> >                                           QEMUIOVector *qiov,
> >@@ -194,7 +513,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> >      }
> >      for (i = 0; i < s->total; i++) {
> >-        bdrv_aio_readv(&s->bs[i], sector_num, qiov, nb_sectors,
> >+        bdrv_aio_readv(&s->bs[i], sector_num, &acb->aios[i].qiov, nb_sectors,
> >                         quorum_aio_cb, &acb->aios[i]);
> >      }
> >diff --git a/configure b/configure
> >index 23dbaaf..efbe7d9 100755
> >--- a/configure
> >+++ b/configure
> >@@ -246,6 +246,7 @@ gtk=""
> >  gtkabi="2.0"
> >  tpm="no"
> >  libssh2=""
> >+quorum="no"
> >  # parse CC options first
> >  for opt do
> >@@ -972,6 +973,10 @@ for opt do
> >    ;;
> >    --enable-libssh2) libssh2="yes"
> >    ;;
> >+  --disable-quorum) quorum="no"
> >+  ;;
> >+  --enable-quorum) quorum="yes"
> >+  ;;
> >    *) echo "ERROR: unknown option $opt"; show_help="yes"
> >    ;;
> >    esac
> >@@ -1203,6 +1208,8 @@ echo "  --gcov=GCOV              use specified gcov [$gcov_tool]"
> >  echo "  --enable-tpm             enable TPM support"
> >  echo "  --disable-libssh2        disable ssh block device support"
> >  echo "  --enable-libssh2         enable ssh block device support"
> >+echo "  --disable-quorum         disable quorum block filter support"
> >+echo "  --enable-quorum          enable quorum block filter support"
> >  echo ""
> >  echo "NOTE: The object files are built at the place where configure is launched"
> >  exit 1
> >@@ -1895,6 +1902,30 @@ EOF
> >  fi
> >  ##########################################
> >+# Quorum probe (check for gnutls)
> >+if test "$quorum" != "no" ; then
> >+cat > $TMPC <<EOF
> >+#include <gnutls/gnutls.h>
> >+#include <gnutls/crypto.h>
> >+int main(void) {char data[4096], digest[32];
> >+gnutls_hash_fast(GNUTLS_DIG_SHA256, data, 4096, digest);
> >+return 0;
> >+}
> >+EOF
> >+quorum_tls_cflags=`$pkg_config --cflags gnutls 2> /dev/null`
> >+quorum_tls_libs=`$pkg_config --libs gnutls 2> /dev/null`
> >+if compile_prog "$quorum_tls_cflags" "$quorum_tls_libs" ; then
> >+  qcow_tls=yes
> >+  libs_softmmu="$quorum_tls_libs $libs_softmmu"
> >+  libs_tools="$quorum_tls_libs $libs_softmmu"
> >+  QEMU_CFLAGS="$QEMU_CFLAGS $quorum_tls_cflags"
> >+else
> >+  echo "gnutls > 2.10.0 required to compile Quorum"
> >+  exit 1
> >+fi
> >+fi
> >+
> >+##########################################
> >  # VNC SASL detection
> >  if test "$vnc" = "yes" -a "$vnc_sasl" != "no" ; then
> >    cat > $TMPC <<EOF
> >@@ -3758,6 +3789,7 @@ echo "TPM support       $tpm"
> >  echo "libssh2 support   $libssh2"
> >  echo "TPM passthrough   $tpm_passthrough"
> >  echo "QOM debugging     $qom_cast_debug"
> >+echo "Quorum            $quorum"
> >  if test "$sdl_too_old" = "yes"; then
> >  echo "-> Your SDL version is too old - please upgrade to have SDL support"
> >@@ -4156,6 +4188,10 @@ if test "$libssh2" = "yes" ; then
> >    echo "CONFIG_LIBSSH2=y" >> $config_host_mak
> >  fi
> >+if test "$quorum" = "yes" ; then
> >+  echo "CONFIG_QUORUM=y" >> $config_host_mak
> >+fi
> >+
> >  if test "$virtio_blk_data_plane" = "yes" ; then
> >    echo 'CONFIG_VIRTIO_BLK_DATA_PLANE=$(CONFIG_VIRTIO)' >> $config_host_mak
> >  fi
> >diff --git a/include/monitor/monitor.h b/include/monitor/monitor.h
> >index 10fa0e3..51902ed 100644
> >--- a/include/monitor/monitor.h
> >+++ b/include/monitor/monitor.h
> >@@ -49,6 +49,8 @@ typedef enum MonitorEvent {
> >      QEVENT_SPICE_MIGRATE_COMPLETED,
> >      QEVENT_GUEST_PANICKED,
> >      QEVENT_BLOCK_IMAGE_CORRUPTED,
> >+    QEVENT_QUORUM_FAILURE,
> >+    QEVENT_QUORUM_REPORT_BAD,
> >      /* Add to 'monitor_event_names' array in monitor.c when
> >       * defining new events here */
> >diff --git a/monitor.c b/monitor.c
> >index 74f3f1b..89d139f 100644
> >--- a/monitor.c
> >+++ b/monitor.c
> >@@ -507,6 +507,8 @@ static const char *monitor_event_names[] = {
> >      [QEVENT_SPICE_MIGRATE_COMPLETED] = "SPICE_MIGRATE_COMPLETED",
> >      [QEVENT_GUEST_PANICKED] = "GUEST_PANICKED",
> >      [QEVENT_BLOCK_IMAGE_CORRUPTED] = "BLOCK_IMAGE_CORRUPTED",
> >+    [QEVENT_QUORUM_FAILURE] = "QUORUM_FAILURE",
> >+    [QEVENT_QUORUM_REPORT_BAD] = "QUORUM_REPORT_BAD",
> Could you document these events in docs/qmp/qmp-events.txt?
> 
> Max
> 
> >  };
> >  QEMU_BUILD_BUG_ON(ARRAY_SIZE(monitor_event_names) != QEVENT_MAX)
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 06/11] quorum: Add quorum mechanism.
  2013-10-04 14:48   ` Max Reitz
  2013-10-28 12:31     ` Benoît Canet
@ 2013-10-28 13:04     ` Benoît Canet
  2013-10-29 17:34       ` Max Reitz
  1 sibling, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-28 13:04 UTC (permalink / raw)
  To: Max Reitz; +Cc: kwolf, qemu-devel, stefanha

Le Friday 04 Oct 2013 à 16:48:12 (+0200), Max Reitz a écrit :
> On 2013-10-02 14:39, Benoît Canet wrote:
> >Use gnutls's SHA-256 to compare versions.
> Wouldn't CRC32 suffice? (I don't really oppose using SHA, but taking
> in gnutls as a dependency just for comparing several memory areas
> seems a bit much to me)
> 
> >Signed-off-by: Benoit Canet <benoit@irqsave.net>
> >---
> >  block/Makefile.objs       |   2 +-
> >  block/quorum.c            | 321 +++++++++++++++++++++++++++++++++++++++++++++-
> >  configure                 |  36 ++++++
> >  include/monitor/monitor.h |   2 +
> >  monitor.c                 |   2 +
> >  5 files changed, 361 insertions(+), 2 deletions(-)
> >
> >diff --git a/block/Makefile.objs b/block/Makefile.objs
> >index 05a65c2..adcdc21 100644
> >--- a/block/Makefile.objs
> >+++ b/block/Makefile.objs
> >@@ -3,7 +3,7 @@ block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-c
> >  block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
> >  block-obj-y += qed-check.o
> >  block-obj-y += vhdx.o
> >-block-obj-y += quorum.o
> >+block-obj-$(CONFIG_QUORUM) += quorum.o
> >  block-obj-y += parallels.o blkdebug.o blkverify.o
> >  block-obj-y += snapshot.o qapi.o
> >  block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
> >diff --git a/block/quorum.c b/block/quorum.c
> >index f0fc0e9..e235ac1 100644
> >--- a/block/quorum.c
> >+++ b/block/quorum.c
> >@@ -13,7 +13,43 @@
> >   * See the COPYING file in the top-level directory.
> >   */
> >+#include <gnutls/gnutls.h>
> >+#include <gnutls/crypto.h>
> >  #include "block/block_int.h"
> >+#include "qapi/qmp/qjson.h"
> >+
> >+#define HASH_LENGTH 32
> >+
> >+/* This union hold a vote hash value */
> *holds
> 
> >+typedef union QuorumVoteValue {
> >+    char h[HASH_LENGTH];       /* SHA-256 hash */
> >+    int64_t l;                 /* simpler 64 bits hash */
> >+} QuorumVoteValue;
> >+
> >+/* A vote item */
> >+typedef struct QuorumVoteItem {
> >+    int index;
> >+    QLIST_ENTRY(QuorumVoteItem) next;
> >+} QuorumVoteItem;
> >+
> >+/* this structure is a vote version. A version is the set of vote sharing the
> *set of votes
> 
> >+ * same vote value.
> >+ * The set of vote will be tracked with the items field and it's count is
> *set of votes or *vote set; also s/it's count/its cardinality/ or
> something like that
> 
> >+ * vote_count.
> >+ */
> >+typedef struct QuorumVoteVersion {
> >+    QuorumVoteValue value;
> >+    int index;
> >+    int vote_count;
> >+    QLIST_HEAD(, QuorumVoteItem) items;
> >+    QLIST_ENTRY(QuorumVoteVersion) next;
> >+} QuorumVoteVersion;
> >+
> >+/* this structure hold a group of vote versions together */
> *holds
> 
> >+typedef struct QuorumVotes {
> >+    QLIST_HEAD(, QuorumVoteVersion) vote_list;
> >+    int (*compare)(QuorumVoteValue *a, QuorumVoteValue *b);
> >+} QuorumVotes;
> >  /* the following structure hold the state of one quorum instance */
> >  typedef struct {
> >@@ -60,10 +96,14 @@ struct QuorumAIOCB {
> >      int success_count;          /* number of successfully completed AIOCB */
> >      bool *finished;             /* completion signal for cancel */
> >+    QuorumVotes votes;
> >+
> >      bool is_read;
> >      int vote_ret;
> >  };
> >+static void quorum_vote(QuorumAIOCB *acb);
> >+
> >  static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
> >  {
> >      QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
> >@@ -111,6 +151,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >          acb->aios[i].ret = 0;
> >      }
> >+    if (acb->vote_ret) {
> >+        ret = acb->vote_ret;
> >+    }
> >+
> >      acb->common.cb(acb->common.opaque, ret);
> >      if (acb->finished) {
> >          *acb->finished = true;
> >@@ -122,6 +166,11 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >      qemu_aio_release(acb);
> >  }
> >+static int quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
> >+{
> >+    return memcmp(a->h, b->h, HASH_LENGTH);
> >+}
> >+
> >  static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
> >                                     BlockDriverState *bs,
> >                                     QEMUIOVector *qiov,
> >@@ -141,6 +190,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
> >      acb->count = 0;
> >      acb->success_count = 0;
> >      acb->finished = NULL;
> >+    acb->votes.compare = quorum_sha256_compare;
> >      acb->is_read = false;
> >      acb->vote_ret = 0;
> >@@ -170,9 +220,278 @@ static void quorum_aio_cb(void *opaque, int ret)
> >          return;
> >      }
> >+    /* Do the vote on read */
> >+    if (acb->is_read) {
> >+        quorum_vote(acb);
> >+    }
> >+
> >      quorum_aio_finalize(acb);
> >  }
> >+static void quorum_report_bad(QuorumAIOCB *acb, int index)
> >+{
> >+    QObject *data;
> >+    data = qobject_from_jsonf("{ 'children-index': %i"
> I'd prefer child-index. Generally, remember that the singular of
> "children" is "child".
> 
> >+                              ", 'sector-num': %" PRId64
> >+                              ", 'sectors-count': %i }",
> >+                              index,
> >+                              acb->sector_num,
> >+                              acb->nb_sectors);
> >+    monitor_protocol_event(QEVENT_QUORUM_REPORT_BAD, data);
> How about decrementing the refcount of data?
> 
> >+}
> >+
> >+static void quorum_report_failure(QuorumAIOCB *acb)
> >+{
> >+    QObject *data;
> >+    data = qobject_from_jsonf("{ 'sector-num': %" PRId64
> >+                              ", 'sectors-count': %i }",
> >+                              acb->sector_num,
> >+                              acb->nb_sectors);
> >+    monitor_protocol_event(QEVENT_QUORUM_FAILURE, data);
> Same here.
> 
> >+}
> >+
> >+static void quorum_report_bad_versions(QuorumAIOCB *acb,
> >+                                      QuorumVoteValue *value)
> >+{
> >+    QuorumVoteVersion *version;
> >+    QuorumVoteItem *item;
> >+
> >+    QLIST_FOREACH(version, &acb->votes.vote_list, next) {
> >+        if (!acb->votes.compare(&version->value, value)) {
> >+            continue;
> >+        }
> >+        QLIST_FOREACH(item, &version->items, next) {
> >+            quorum_report_bad(acb, item->index);
> >+        }
> >+    }
> >+}
> >+
> >+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> >+{
> >+    int i;
> >+    assert(dest->niov == source->niov);
> >+    assert(dest->size == source->size);
> >+    for (i = 0; i < source->niov; i++) {
> >+        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> >+        memcpy(dest->iov[i].iov_base,
> >+               source->iov[i].iov_base,
> >+               source->iov[i].iov_len);
> >+    }
> >+}
> >+
> >+static void quorum_count_vote(QuorumVotes *votes,
> >+                              QuorumVoteValue *value,
> >+                              int index)
> >+{
> >+    QuorumVoteVersion *v = NULL, *version = NULL;
> >+    QuorumVoteItem *item;
> >+
> >+    /* look if we have something with this hash */
> >+    QLIST_FOREACH(v, &votes->vote_list, next) {
> >+        if (!votes->compare(&v->value, value)) {
> >+            version = v;
> >+            break;
> >+        }
> >+    }
> >+
> >+    /* It's a version not yet in the list add it */
> >+    if (!version) {
> >+        version = g_new0(QuorumVoteVersion, 1);
> >+        QLIST_INIT(&version->items);
> >+        memcpy(&version->value, value, sizeof(version->value));
> >+        version->index = index;
> >+        version->vote_count = 0;
> >+        QLIST_INSERT_HEAD(&votes->vote_list, version, next);
> >+    }
> >+
> >+    version->vote_count++;
> >+
> >+    item = g_new0(QuorumVoteItem, 1);
> >+    item->index = index;
> >+    QLIST_INSERT_HEAD(&version->items, item, next);
> >+}
> >+
> >+static void quorum_free_vote_list(QuorumVotes *votes)
> >+{
> >+    QuorumVoteVersion *version, *next_version;
> >+    QuorumVoteItem *item, *next_item;
> >+
> >+    QLIST_FOREACH_SAFE(version, &votes->vote_list, next, next_version) {
> >+        QLIST_REMOVE(version, next);
> >+        QLIST_FOREACH_SAFE(item, &version->items, next, next_item) {
> >+            QLIST_REMOVE(item, next);
> >+            g_free(item);
> >+        }
> >+        g_free(version);
> >+    }
> >+}
> >+
> >+static int quorum_compute_hash(QuorumAIOCB *acb, int i, QuorumVoteValue *hash)
> >+{
> >+    int j, ret;
> >+    gnutls_hash_hd_t dig;
> >+    QEMUIOVector *qiov = &acb->aios[i].qiov;
> >+
> >+    ret = gnutls_hash_init(&dig, GNUTLS_DIG_SHA256);
> >+
> >+    if (ret < 0) {
> >+        return ret;
> >+    }
> >+
> >+    for (j = 0; j < qiov->niov; j++) {
> >+        ret = gnutls_hash(dig, qiov->iov[j].iov_base, qiov->iov[j].iov_len);
> >+        if (ret < 0) {
> >+            return ret;
> gnutls_hash_deinit?
> 
> >+        }
> >+    }
> >+
> >+    gnutls_hash_deinit(dig, (void *) hash);
> >+
> >+    return 0;
> >+}
> >+
> >+static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
> >+{
> >+    int i = 0;
> >+    QuorumVoteVersion *candidate, *winner = NULL;
> >+
> >+    QLIST_FOREACH(candidate, &votes->vote_list, next) {
> >+        if (candidate->vote_count > i) {
> >+            i = candidate->vote_count;
> >+            winner = candidate;
> >+        }
> >+    }
> >+
> >+    return winner;
> >+}
> >+
> >+static bool quorum_iovec_compare(QEMUIOVector *a, QEMUIOVector *b)
> >+{
> >+    int i;
> >+    int result;
> >+
> >+    assert(a->niov == b->niov);
> >+    for (i = 0; i < a->niov; i++) {
> >+        assert(a->iov[i].iov_len == b->iov[i].iov_len);
> >+        result = memcmp(a->iov[i].iov_base,
> >+                        b->iov[i].iov_base,
> >+                        a->iov[i].iov_len);
> >+        if (result) {
> >+            return false;
> >+        }
> >+    }
> >+
> >+    return true;
> >+}
> >+
> >+static void GCC_FMT_ATTR(2, 3) quorum_err(QuorumAIOCB *acb,
> >+                                          const char *fmt, ...)
> >+{
> >+    va_list ap;
> >+
> >+    va_start(ap, fmt);
> >+    fprintf(stderr, "quorum: sector_num=%" PRId64 " nb_sectors=%d ",
> >+            acb->sector_num, acb->nb_sectors);
> >+    vfprintf(stderr, fmt, ap);
> >+    fprintf(stderr, "\n");
> >+    va_end(ap);
> >+    exit(1);
> >+}
> >+
> >+static bool quorum_compare(QuorumAIOCB *acb,
> >+                           QEMUIOVector *a,
> >+                           QEMUIOVector *b)
> >+{
> >+    BDRVQuorumState *s = acb->bqs;
> >+    ssize_t offset;
> >+
> >+    /* This driver will replace blkverify in this particular case */
> >+    if (s->is_blkverify) {
> >+        offset = qemu_iovec_compare(a, b);
> >+        if (offset != -1) {
> >+            quorum_err(acb, "contents mismatch in sector %" PRId64,
> >+                       acb->sector_num +
> >+                       (uint64_t)(offset / BDRV_SECTOR_SIZE));
> >+        }
> >+        return true;
> >+    }
> >+
> >+    return quorum_iovec_compare(a, b);
> >+}
> >+
> >+static void quorum_vote(QuorumAIOCB *acb)
> >+{
> >+    bool quorum = true;
> >+    int i, j, ret;
> >+    QuorumVoteValue hash;
> >+    BDRVQuorumState *s = acb->bqs;
> >+    QuorumVoteVersion *winner;
> >+
> >+    /* get the index of the first successful read */
> >+    for (i = 0; i < s->total; i++) {
> >+        if (!acb->aios[i].ret) {
> >+            break;
> >+        }
> >+    }
> >+
> >+    /* compare this read with all other successful read looking for quorum */
> >+    for (j = i + 1; j < s->total; j++) {
> >+        if (acb->aios[j].ret) {
> >+            continue;
> >+        }
> >+        quorum = quorum_compare(acb, &acb->aios[i].qiov, &acb->aios[j].qiov);
> >+        if (!quorum) {
> >+            break;
> >+       }
> >+    }
> >+
> >+    /* Every successful read agrees -> Quorum */
> >+    if (quorum) {
> >+        quorum_copy_qiov(acb->qiov, &acb->aios[i].qiov);
> If no reads were successful at all, quorum will be true and i will
> be s->total. Therefore, this will result in an array access with an
> index out of bounds here.
Thanks a lot for spotting this.

> 
> Generally, why is this function executed at all if any read failed?
> If a read fails, the quorum read function will return an error, thus
> the caller will ignore the result anyway.

I should if there is a quorum between the successfull reads; if all the success
ful reads agrees and their count is greater than threshold.

> 
> >+        return;
> >+    }
> >+
> >+    /* compute hashs for each successful read, also store indexes */
> >+    for (i = 0; i < s->total; i++) {
> >+        if (acb->aios[i].ret) {
> >+            continue;
> >+        }
> >+        ret = quorum_compute_hash(acb, i, &hash);
> >+        /* if ever the hash computation failed */
> >+        if (ret < 0) {
> >+            acb->vote_ret = ret;
> >+            goto free_exit;
> >+        }
> >+        quorum_count_vote(&acb->votes, &hash, i);
> >+    }
> >+
> >+    /* vote to select the most represented version */
> >+    winner = quorum_get_vote_winner(&acb->votes);
> >+    /* every vote version are differents -> error */
> >+    if (!winner) {
> >+        quorum_report_failure(acb);
> >+        acb->vote_ret = -EIO;
> >+        goto free_exit;
> >+    }
> >+
> >+    /* if the winner count is smaller then threshold read fail */
> s/then/than/, s/ read fail/, the read fails/
> 
> >+    if (winner->vote_count < s->threshold) {
> >+        quorum_report_failure(acb);
> >+        acb->vote_ret = -EIO;
> >+        goto free_exit;
> >+    }
> >+
> >+    /* we have a winner: copy it */
> >+    quorum_copy_qiov(acb->qiov, &acb->aios[winner->index].qiov);
> >+
> >+    /* some versions are bad print them */
> >+    quorum_report_bad_versions(acb, &winner->value);
> >+
> >+free_exit:
> >+    /* free lists */
> >+    quorum_free_vote_list(&acb->votes);
> >+}
> >+
> >  static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> >                                           int64_t sector_num,
> >                                           QEMUIOVector *qiov,
> >@@ -194,7 +513,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> >      }
> >      for (i = 0; i < s->total; i++) {
> >-        bdrv_aio_readv(&s->bs[i], sector_num, qiov, nb_sectors,
> >+        bdrv_aio_readv(&s->bs[i], sector_num, &acb->aios[i].qiov, nb_sectors,
> >                         quorum_aio_cb, &acb->aios[i]);
> >      }
> >diff --git a/configure b/configure
> >index 23dbaaf..efbe7d9 100755
> >--- a/configure
> >+++ b/configure
> >@@ -246,6 +246,7 @@ gtk=""
> >  gtkabi="2.0"
> >  tpm="no"
> >  libssh2=""
> >+quorum="no"
> >  # parse CC options first
> >  for opt do
> >@@ -972,6 +973,10 @@ for opt do
> >    ;;
> >    --enable-libssh2) libssh2="yes"
> >    ;;
> >+  --disable-quorum) quorum="no"
> >+  ;;
> >+  --enable-quorum) quorum="yes"
> >+  ;;
> >    *) echo "ERROR: unknown option $opt"; show_help="yes"
> >    ;;
> >    esac
> >@@ -1203,6 +1208,8 @@ echo "  --gcov=GCOV              use specified gcov [$gcov_tool]"
> >  echo "  --enable-tpm             enable TPM support"
> >  echo "  --disable-libssh2        disable ssh block device support"
> >  echo "  --enable-libssh2         enable ssh block device support"
> >+echo "  --disable-quorum         disable quorum block filter support"
> >+echo "  --enable-quorum          enable quorum block filter support"
> >  echo ""
> >  echo "NOTE: The object files are built at the place where configure is launched"
> >  exit 1
> >@@ -1895,6 +1902,30 @@ EOF
> >  fi
> >  ##########################################
> >+# Quorum probe (check for gnutls)
> >+if test "$quorum" != "no" ; then
> >+cat > $TMPC <<EOF
> >+#include <gnutls/gnutls.h>
> >+#include <gnutls/crypto.h>
> >+int main(void) {char data[4096], digest[32];
> >+gnutls_hash_fast(GNUTLS_DIG_SHA256, data, 4096, digest);
> >+return 0;
> >+}
> >+EOF
> >+quorum_tls_cflags=`$pkg_config --cflags gnutls 2> /dev/null`
> >+quorum_tls_libs=`$pkg_config --libs gnutls 2> /dev/null`
> >+if compile_prog "$quorum_tls_cflags" "$quorum_tls_libs" ; then
> >+  qcow_tls=yes
> >+  libs_softmmu="$quorum_tls_libs $libs_softmmu"
> >+  libs_tools="$quorum_tls_libs $libs_softmmu"
> >+  QEMU_CFLAGS="$QEMU_CFLAGS $quorum_tls_cflags"
> >+else
> >+  echo "gnutls > 2.10.0 required to compile Quorum"
> >+  exit 1
> >+fi
> >+fi
> >+
> >+##########################################
> >  # VNC SASL detection
> >  if test "$vnc" = "yes" -a "$vnc_sasl" != "no" ; then
> >    cat > $TMPC <<EOF
> >@@ -3758,6 +3789,7 @@ echo "TPM support       $tpm"
> >  echo "libssh2 support   $libssh2"
> >  echo "TPM passthrough   $tpm_passthrough"
> >  echo "QOM debugging     $qom_cast_debug"
> >+echo "Quorum            $quorum"
> >  if test "$sdl_too_old" = "yes"; then
> >  echo "-> Your SDL version is too old - please upgrade to have SDL support"
> >@@ -4156,6 +4188,10 @@ if test "$libssh2" = "yes" ; then
> >    echo "CONFIG_LIBSSH2=y" >> $config_host_mak
> >  fi
> >+if test "$quorum" = "yes" ; then
> >+  echo "CONFIG_QUORUM=y" >> $config_host_mak
> >+fi
> >+
> >  if test "$virtio_blk_data_plane" = "yes" ; then
> >    echo 'CONFIG_VIRTIO_BLK_DATA_PLANE=$(CONFIG_VIRTIO)' >> $config_host_mak
> >  fi
> >diff --git a/include/monitor/monitor.h b/include/monitor/monitor.h
> >index 10fa0e3..51902ed 100644
> >--- a/include/monitor/monitor.h
> >+++ b/include/monitor/monitor.h
> >@@ -49,6 +49,8 @@ typedef enum MonitorEvent {
> >      QEVENT_SPICE_MIGRATE_COMPLETED,
> >      QEVENT_GUEST_PANICKED,
> >      QEVENT_BLOCK_IMAGE_CORRUPTED,
> >+    QEVENT_QUORUM_FAILURE,
> >+    QEVENT_QUORUM_REPORT_BAD,
> >      /* Add to 'monitor_event_names' array in monitor.c when
> >       * defining new events here */
> >diff --git a/monitor.c b/monitor.c
> >index 74f3f1b..89d139f 100644
> >--- a/monitor.c
> >+++ b/monitor.c
> >@@ -507,6 +507,8 @@ static const char *monitor_event_names[] = {
> >      [QEVENT_SPICE_MIGRATE_COMPLETED] = "SPICE_MIGRATE_COMPLETED",
> >      [QEVENT_GUEST_PANICKED] = "GUEST_PANICKED",
> >      [QEVENT_BLOCK_IMAGE_CORRUPTED] = "BLOCK_IMAGE_CORRUPTED",
> >+    [QEVENT_QUORUM_FAILURE] = "QUORUM_FAILURE",
> >+    [QEVENT_QUORUM_REPORT_BAD] = "QUORUM_REPORT_BAD",
> Could you document these events in docs/qmp/qmp-events.txt?
> 
> Max
> 
> >  };
> >  QEMU_BUILD_BUG_ON(ARRAY_SIZE(monitor_event_names) != QEVENT_MAX)
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 00/11] Quorum block driver
  2013-10-04 14:31 ` [Qemu-devel] [PATCH V9 00/11] Quorum block driver Max Reitz
  2013-10-04 22:25   ` Benoît Canet
@ 2013-10-29  7:59   ` Benoît Canet
  2013-10-29 17:55     ` Max Reitz
  1 sibling, 1 reply; 36+ messages in thread
From: Benoît Canet @ 2013-10-29  7:59 UTC (permalink / raw)
  To: Max Reitz; +Cc: kwolf, qemu-devel, stefanha

Le Friday 04 Oct 2013 à 16:31:56 (+0200), Max Reitz a écrit :
> On 2013-10-02 14:39, Benoît Canet wrote:
> >It must be applied on top of "block: Add BlockDriver.bdrv_check_ext_snapshot."
> >
> >This patchset create a block driver implementing a quorum using total qemu disk
> >images. Writes are mirrored on the $total files.
> >For the reading part the $total files are read at the same time and a vote is
> >done to determine if a qiov version is present $threshold or more times. It then
> >return this majority version to the upper layers.
> >When i < $threshold versions of the data are returned by the lower layer the
> >quorum is broken and the read return -EIO.
> >
> >The goal of this patchset is to be turned in a QEMU block filter living just
> >above raw-*.c and below qcow2/qed when the required infrastructure will be done.
> >
> >Main use of this feature will be people using NFS appliances which can be
> >subjected to bitflip errors.
> >
> >This patchset can be used to replace blkverify and the out of tree blkmirror.
> >
> >v9: integrate latests comments from kevin
> >
> >Benoît Canet (11):
> >   quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB.
> >   quorum: Create BDRVQuorumState and BlkDriver and do init.
> >   quorum: Add quorum_aio_writev and its dependencies.
> >   blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from
> >     blkverify.
> >   quorum: Add quorum_aio_readv.
> >   quorum: Add quorum mechanism.
> >   quorum: Add quorum_getlength().
> >   quorum: Add quorum_invalidate_cache().
> >   quorum: Add quorum_co_get_block_status.
> >   quorum: Add quorum_co_flush().
> >   quorum: Add quorum_open() and quorum_close().
> >
> >  block/Makefile.objs       |    1 +
> >  block/blkverify.c         |  108 +----
> >  block/quorum.c            | 1035 +++++++++++++++++++++++++++++++++++++++++++++
> >  configure                 |   36 ++
> >  include/monitor/monitor.h |    2 +
> >  include/qemu-common.h     |    2 +
> >  monitor.c                 |    2 +
> >  util/iov.c                |  103 +++++
> >  8 files changed, 1183 insertions(+), 106 deletions(-)
> >  create mode 100644 block/quorum.c
> >
> 
> Could you add a qemu-iotest if possible?

I don't see how to specify quorum to qemu-io for bytes checking.

Best regards

Benoît

> 
> Max
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 06/11] quorum: Add quorum mechanism.
  2013-10-28 12:31     ` Benoît Canet
@ 2013-10-29 16:42       ` Max Reitz
  0 siblings, 0 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-29 16:42 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

Am 28.10.2013 13:31, schrieb Benoît Canet:
> Le Friday 04 Oct 2013 à 16:48:12 (+0200), Max Reitz a écrit :
>> On 2013-10-02 14:39, Benoît Canet wrote:
>>> Use gnutls's SHA-256 to compare versions.
>> Wouldn't CRC32 suffice? (I don't really oppose using SHA, but taking
>> in gnutls as a dependency just for comparing several memory areas
>> seems a bit much to me)
> Initially it gzip's addler32 was used but someone was concerned with the risk
> of collisions.
> Anyway the code fallback using hashes only when something wrong is detected so
> it won't impact the normal case.
>
> Best regards
>
> Benoît

Yes, that's correct, but it adds a new dependency to qemu. Personally, I
am unable to decide whether this is better than having a higher risk of
collisions with CRC, so I'll leave the decision to someone more
qualified (like you). ;-)

Max

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 03/11] quorum: Add quorum_aio_writev and its dependencies.
  2013-10-28 12:21     ` Benoît Canet
@ 2013-10-29 17:21       ` Max Reitz
  2013-10-29 20:06         ` Benoît Canet
  0 siblings, 1 reply; 36+ messages in thread
From: Max Reitz @ 2013-10-29 17:21 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

Am 28.10.2013 13:21, schrieb Benoît Canet:
> Le Friday 04 Oct 2013 à 16:35:18 (+0200), Max Reitz a écrit :
>> On 2013-10-02 14:39, Benoît Canet wrote:
>>> Signed-off-by: Benoit Canet <benoit@irqsave.net>
>>> ---
>>>  block/quorum.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 123 insertions(+)
>>>
>>> diff --git a/block/quorum.c b/block/quorum.c
>>> index 9557e61..b49e3c6 100644
>>> --- a/block/quorum.c
>>> +++ b/block/quorum.c
>>> @@ -64,11 +64,134 @@ struct QuorumAIOCB {
>>>      int vote_ret;
>>>  };
>>> +static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
>>> +{
>>> +    QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
>>> +    bool finished = false;
>>> +
>>> +    /* Wait for the request to finish */
>>> +    acb->finished = &finished;
>>> +    while (!finished) {
>>> +        qemu_aio_wait();
>>> +    }
>> Hm, wouldn't it be better to pass the cancel to the children?
>>
>> Max
> Hi Max,
>
> Hi don't understand how you would do this.
>
> Best regards
>
> Benoît

I thought of calling bdrv_aio_cancel() on every aios[i].aiocb – I don't
know if that will work but it seems reasonable to me to try to cancel
all the operations quorum has initiated when the quorum operation itself
is requested to be canceled (instead of waiting for the spawned
operations to finish normally).

Max

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 06/11] quorum: Add quorum mechanism.
  2013-10-28 13:04     ` Benoît Canet
@ 2013-10-29 17:34       ` Max Reitz
  0 siblings, 0 replies; 36+ messages in thread
From: Max Reitz @ 2013-10-29 17:34 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

Am 28.10.2013 14:04, schrieb Benoît Canet:
> Le Friday 04 Oct 2013 à 16:48:12 (+0200), Max Reitz a écrit :
>> On 2013-10-02 14:39, Benoît Canet wrote:
>>> Use gnutls's SHA-256 to compare versions.
>> Wouldn't CRC32 suffice? (I don't really oppose using SHA, but taking
>> in gnutls as a dependency just for comparing several memory areas
>> seems a bit much to me)
>>
>>> Signed-off-by: Benoit Canet <benoit@irqsave.net>
>>> ---
>>>  block/Makefile.objs       |   2 +-
>>>  block/quorum.c            | 321 +++++++++++++++++++++++++++++++++++++++++++++-
>>>  configure                 |  36 ++++++
>>>  include/monitor/monitor.h |   2 +
>>>  monitor.c                 |   2 +
>>>  5 files changed, 361 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>>> index 05a65c2..adcdc21 100644
>>> --- a/block/Makefile.objs
>>> +++ b/block/Makefile.objs
>>> @@ -3,7 +3,7 @@ block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-c
>>>  block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>>>  block-obj-y += qed-check.o
>>>  block-obj-y += vhdx.o
>>> -block-obj-y += quorum.o
>>> +block-obj-$(CONFIG_QUORUM) += quorum.o
>>>  block-obj-y += parallels.o blkdebug.o blkverify.o
>>>  block-obj-y += snapshot.o qapi.o
>>>  block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
>>> diff --git a/block/quorum.c b/block/quorum.c
>>> index f0fc0e9..e235ac1 100644
>>> --- a/block/quorum.c
>>> +++ b/block/quorum.c
>>> @@ -13,7 +13,43 @@
>>>   * See the COPYING file in the top-level directory.
>>>   */
>>> +#include <gnutls/gnutls.h>
>>> +#include <gnutls/crypto.h>
>>>  #include "block/block_int.h"
>>> +#include "qapi/qmp/qjson.h"
>>> +
>>> +#define HASH_LENGTH 32
>>> +
>>> +/* This union hold a vote hash value */
>> *holds
>>
>>> +typedef union QuorumVoteValue {
>>> +    char h[HASH_LENGTH];       /* SHA-256 hash */
>>> +    int64_t l;                 /* simpler 64 bits hash */
>>> +} QuorumVoteValue;
>>> +
>>> +/* A vote item */
>>> +typedef struct QuorumVoteItem {
>>> +    int index;
>>> +    QLIST_ENTRY(QuorumVoteItem) next;
>>> +} QuorumVoteItem;
>>> +
>>> +/* this structure is a vote version. A version is the set of vote sharing the
>> *set of votes
>>
>>> + * same vote value.
>>> + * The set of vote will be tracked with the items field and it's count is
>> *set of votes or *vote set; also s/it's count/its cardinality/ or
>> something like that
>>
>>> + * vote_count.
>>> + */
>>> +typedef struct QuorumVoteVersion {
>>> +    QuorumVoteValue value;
>>> +    int index;
>>> +    int vote_count;
>>> +    QLIST_HEAD(, QuorumVoteItem) items;
>>> +    QLIST_ENTRY(QuorumVoteVersion) next;
>>> +} QuorumVoteVersion;
>>> +
>>> +/* this structure hold a group of vote versions together */
>> *holds
>>
>>> +typedef struct QuorumVotes {
>>> +    QLIST_HEAD(, QuorumVoteVersion) vote_list;
>>> +    int (*compare)(QuorumVoteValue *a, QuorumVoteValue *b);
>>> +} QuorumVotes;
>>>  /* the following structure hold the state of one quorum instance */
>>>  typedef struct {
>>> @@ -60,10 +96,14 @@ struct QuorumAIOCB {
>>>      int success_count;          /* number of successfully completed AIOCB */
>>>      bool *finished;             /* completion signal for cancel */
>>> +    QuorumVotes votes;
>>> +
>>>      bool is_read;
>>>      int vote_ret;
>>>  };
>>> +static void quorum_vote(QuorumAIOCB *acb);
>>> +
>>>  static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
>>>  {
>>>      QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
>>> @@ -111,6 +151,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
>>>          acb->aios[i].ret = 0;
>>>      }
>>> +    if (acb->vote_ret) {
>>> +        ret = acb->vote_ret;
>>> +    }
>>> +
>>>      acb->common.cb(acb->common.opaque, ret);
>>>      if (acb->finished) {
>>>          *acb->finished = true;
>>> @@ -122,6 +166,11 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
>>>      qemu_aio_release(acb);
>>>  }
>>> +static int quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
>>> +{
>>> +    return memcmp(a->h, b->h, HASH_LENGTH);
>>> +}
>>> +
>>>  static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
>>>                                     BlockDriverState *bs,
>>>                                     QEMUIOVector *qiov,
>>> @@ -141,6 +190,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
>>>      acb->count = 0;
>>>      acb->success_count = 0;
>>>      acb->finished = NULL;
>>> +    acb->votes.compare = quorum_sha256_compare;
>>>      acb->is_read = false;
>>>      acb->vote_ret = 0;
>>> @@ -170,9 +220,278 @@ static void quorum_aio_cb(void *opaque, int ret)
>>>          return;
>>>      }
>>> +    /* Do the vote on read */
>>> +    if (acb->is_read) {
>>> +        quorum_vote(acb);
>>> +    }
>>> +
>>>      quorum_aio_finalize(acb);
>>>  }
>>> +static void quorum_report_bad(QuorumAIOCB *acb, int index)
>>> +{
>>> +    QObject *data;
>>> +    data = qobject_from_jsonf("{ 'children-index': %i"
>> I'd prefer child-index. Generally, remember that the singular of
>> "children" is "child".
>>
>>> +                              ", 'sector-num': %" PRId64
>>> +                              ", 'sectors-count': %i }",
>>> +                              index,
>>> +                              acb->sector_num,
>>> +                              acb->nb_sectors);
>>> +    monitor_protocol_event(QEVENT_QUORUM_REPORT_BAD, data);
>> How about decrementing the refcount of data?
>>
>>> +}
>>> +
>>> +static void quorum_report_failure(QuorumAIOCB *acb)
>>> +{
>>> +    QObject *data;
>>> +    data = qobject_from_jsonf("{ 'sector-num': %" PRId64
>>> +                              ", 'sectors-count': %i }",
>>> +                              acb->sector_num,
>>> +                              acb->nb_sectors);
>>> +    monitor_protocol_event(QEVENT_QUORUM_FAILURE, data);
>> Same here.
>>
>>> +}
>>> +
>>> +static void quorum_report_bad_versions(QuorumAIOCB *acb,
>>> +                                      QuorumVoteValue *value)
>>> +{
>>> +    QuorumVoteVersion *version;
>>> +    QuorumVoteItem *item;
>>> +
>>> +    QLIST_FOREACH(version, &acb->votes.vote_list, next) {
>>> +        if (!acb->votes.compare(&version->value, value)) {
>>> +            continue;
>>> +        }
>>> +        QLIST_FOREACH(item, &version->items, next) {
>>> +            quorum_report_bad(acb, item->index);
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
>>> +{
>>> +    int i;
>>> +    assert(dest->niov == source->niov);
>>> +    assert(dest->size == source->size);
>>> +    for (i = 0; i < source->niov; i++) {
>>> +        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
>>> +        memcpy(dest->iov[i].iov_base,
>>> +               source->iov[i].iov_base,
>>> +               source->iov[i].iov_len);
>>> +    }
>>> +}
>>> +
>>> +static void quorum_count_vote(QuorumVotes *votes,
>>> +                              QuorumVoteValue *value,
>>> +                              int index)
>>> +{
>>> +    QuorumVoteVersion *v = NULL, *version = NULL;
>>> +    QuorumVoteItem *item;
>>> +
>>> +    /* look if we have something with this hash */
>>> +    QLIST_FOREACH(v, &votes->vote_list, next) {
>>> +        if (!votes->compare(&v->value, value)) {
>>> +            version = v;
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    /* It's a version not yet in the list add it */
>>> +    if (!version) {
>>> +        version = g_new0(QuorumVoteVersion, 1);
>>> +        QLIST_INIT(&version->items);
>>> +        memcpy(&version->value, value, sizeof(version->value));
>>> +        version->index = index;
>>> +        version->vote_count = 0;
>>> +        QLIST_INSERT_HEAD(&votes->vote_list, version, next);
>>> +    }
>>> +
>>> +    version->vote_count++;
>>> +
>>> +    item = g_new0(QuorumVoteItem, 1);
>>> +    item->index = index;
>>> +    QLIST_INSERT_HEAD(&version->items, item, next);
>>> +}
>>> +
>>> +static void quorum_free_vote_list(QuorumVotes *votes)
>>> +{
>>> +    QuorumVoteVersion *version, *next_version;
>>> +    QuorumVoteItem *item, *next_item;
>>> +
>>> +    QLIST_FOREACH_SAFE(version, &votes->vote_list, next, next_version) {
>>> +        QLIST_REMOVE(version, next);
>>> +        QLIST_FOREACH_SAFE(item, &version->items, next, next_item) {
>>> +            QLIST_REMOVE(item, next);
>>> +            g_free(item);
>>> +        }
>>> +        g_free(version);
>>> +    }
>>> +}
>>> +
>>> +static int quorum_compute_hash(QuorumAIOCB *acb, int i, QuorumVoteValue *hash)
>>> +{
>>> +    int j, ret;
>>> +    gnutls_hash_hd_t dig;
>>> +    QEMUIOVector *qiov = &acb->aios[i].qiov;
>>> +
>>> +    ret = gnutls_hash_init(&dig, GNUTLS_DIG_SHA256);
>>> +
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    for (j = 0; j < qiov->niov; j++) {
>>> +        ret = gnutls_hash(dig, qiov->iov[j].iov_base, qiov->iov[j].iov_len);
>>> +        if (ret < 0) {
>>> +            return ret;
>> gnutls_hash_deinit?
>>
>>> +        }
>>> +    }
>>> +
>>> +    gnutls_hash_deinit(dig, (void *) hash);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
>>> +{
>>> +    int i = 0;
>>> +    QuorumVoteVersion *candidate, *winner = NULL;
>>> +
>>> +    QLIST_FOREACH(candidate, &votes->vote_list, next) {
>>> +        if (candidate->vote_count > i) {
>>> +            i = candidate->vote_count;
>>> +            winner = candidate;
>>> +        }
>>> +    }
>>> +
>>> +    return winner;
>>> +}
>>> +
>>> +static bool quorum_iovec_compare(QEMUIOVector *a, QEMUIOVector *b)
>>> +{
>>> +    int i;
>>> +    int result;
>>> +
>>> +    assert(a->niov == b->niov);
>>> +    for (i = 0; i < a->niov; i++) {
>>> +        assert(a->iov[i].iov_len == b->iov[i].iov_len);
>>> +        result = memcmp(a->iov[i].iov_base,
>>> +                        b->iov[i].iov_base,
>>> +                        a->iov[i].iov_len);
>>> +        if (result) {
>>> +            return false;
>>> +        }
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static void GCC_FMT_ATTR(2, 3) quorum_err(QuorumAIOCB *acb,
>>> +                                          const char *fmt, ...)
>>> +{
>>> +    va_list ap;
>>> +
>>> +    va_start(ap, fmt);
>>> +    fprintf(stderr, "quorum: sector_num=%" PRId64 " nb_sectors=%d ",
>>> +            acb->sector_num, acb->nb_sectors);
>>> +    vfprintf(stderr, fmt, ap);
>>> +    fprintf(stderr, "\n");
>>> +    va_end(ap);
>>> +    exit(1);
>>> +}
>>> +
>>> +static bool quorum_compare(QuorumAIOCB *acb,
>>> +                           QEMUIOVector *a,
>>> +                           QEMUIOVector *b)
>>> +{
>>> +    BDRVQuorumState *s = acb->bqs;
>>> +    ssize_t offset;
>>> +
>>> +    /* This driver will replace blkverify in this particular case */
>>> +    if (s->is_blkverify) {
>>> +        offset = qemu_iovec_compare(a, b);
>>> +        if (offset != -1) {
>>> +            quorum_err(acb, "contents mismatch in sector %" PRId64,
>>> +                       acb->sector_num +
>>> +                       (uint64_t)(offset / BDRV_SECTOR_SIZE));
>>> +        }
>>> +        return true;
>>> +    }
>>> +
>>> +    return quorum_iovec_compare(a, b);
>>> +}
>>> +
>>> +static void quorum_vote(QuorumAIOCB *acb)
>>> +{
>>> +    bool quorum = true;
>>> +    int i, j, ret;
>>> +    QuorumVoteValue hash;
>>> +    BDRVQuorumState *s = acb->bqs;
>>> +    QuorumVoteVersion *winner;
>>> +
>>> +    /* get the index of the first successful read */
>>> +    for (i = 0; i < s->total; i++) {
>>> +        if (!acb->aios[i].ret) {
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    /* compare this read with all other successful read looking for quorum */
>>> +    for (j = i + 1; j < s->total; j++) {
>>> +        if (acb->aios[j].ret) {
>>> +            continue;
>>> +        }
>>> +        quorum = quorum_compare(acb, &acb->aios[i].qiov, &acb->aios[j].qiov);
>>> +        if (!quorum) {
>>> +            break;
>>> +       }
>>> +    }
>>> +
>>> +    /* Every successful read agrees -> Quorum */
>>> +    if (quorum) {
>>> +        quorum_copy_qiov(acb->qiov, &acb->aios[i].qiov);
>> If no reads were successful at all, quorum will be true and i will
>> be s->total. Therefore, this will result in an array access with an
>> index out of bounds here.
> Thanks a lot for spotting this.
>
>> Generally, why is this function executed at all if any read failed?
>> If a read fails, the quorum read function will return an error, thus
>> the caller will ignore the result anyway.
> I should if there is a quorum between the successfull reads; if all the success
> ful reads agrees and their count is greater than threshold.

Ah, I see it now; quorum_aio_readv() returns "success" if more reads
than the threshould succeeded (and return the same result). I thought it
would fail as soon as just a single read operation failed – sorry, my fault.

Max

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 00/11] Quorum block driver
  2013-10-29  7:59   ` Benoît Canet
@ 2013-10-29 17:55     ` Max Reitz
  2013-10-29 17:56       ` Max Reitz
  0 siblings, 1 reply; 36+ messages in thread
From: Max Reitz @ 2013-10-29 17:55 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

Am 29.10.2013 08:59, schrieb Benoît Canet:
> Le Friday 04 Oct 2013 à 16:31:56 (+0200), Max Reitz a écrit :
>> On 2013-10-02 14:39, Benoît Canet wrote:
>>> It must be applied on top of "block: Add BlockDriver.bdrv_check_ext_snapshot."
>>>
>>> This patchset create a block driver implementing a quorum using total qemu disk
>>> images. Writes are mirrored on the $total files.
>>> For the reading part the $total files are read at the same time and a vote is
>>> done to determine if a qiov version is present $threshold or more times. It then
>>> return this majority version to the upper layers.
>>> When i < $threshold versions of the data are returned by the lower layer the
>>> quorum is broken and the read return -EIO.
>>>
>>> The goal of this patchset is to be turned in a QEMU block filter living just
>>> above raw-*.c and below qcow2/qed when the required infrastructure will be done.
>>>
>>> Main use of this feature will be people using NFS appliances which can be
>>> subjected to bitflip errors.
>>>
>>> This patchset can be used to replace blkverify and the out of tree blkmirror.
>>>
>>> v9: integrate latests comments from kevin
>>>
>>> Benoît Canet (11):
>>>   quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB.
>>>   quorum: Create BDRVQuorumState and BlkDriver and do init.
>>>   quorum: Add quorum_aio_writev and its dependencies.
>>>   blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from
>>>     blkverify.
>>>   quorum: Add quorum_aio_readv.
>>>   quorum: Add quorum mechanism.
>>>   quorum: Add quorum_getlength().
>>>   quorum: Add quorum_invalidate_cache().
>>>   quorum: Add quorum_co_get_block_status.
>>>   quorum: Add quorum_co_flush().
>>>   quorum: Add quorum_open() and quorum_close().
>>>
>>>  block/Makefile.objs       |    1 +
>>>  block/blkverify.c         |  108 +----
>>>  block/quorum.c            | 1035 +++++++++++++++++++++++++++++++++++++++++++++
>>>  configure                 |   36 ++
>>>  include/monitor/monitor.h |    2 +
>>>  include/qemu-common.h     |    2 +
>>>  monitor.c                 |    2 +
>>>  util/iov.c                |  103 +++++
>>>  8 files changed, 1183 insertions(+), 106 deletions(-)
>>>  create mode 100644 block/quorum.c
>>>
>> Could you add a qemu-iotest if possible?
> I don't see how to specify quorum to qemu-io for bytes checking.
>
> Best regards
>
> Benoît

As of commit b543c5cd, you can pass options to the block driver through
qemu-io -c open. The problem for quorum is that it cannot handle the
filename option (which qemu-io will always set). I've attached a diff to
change this behavior (which makes the path parameter for open optional
and passes NULL to bdrv_open if it is not given). You're free to make
use of it. ;-)

If this diff is applied, you can invoke qemu-io with quorum for example
as following:
qemu-io -c 'open -o
file.driver=quorum,file.children.0.file.filename=[Image
1],file.children.1.file.filename=[Image 2],file.vote_threshold=[x]'

Max

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 00/11] Quorum block driver
  2013-10-29 17:55     ` Max Reitz
@ 2013-10-29 17:56       ` Max Reitz
  2013-10-29 20:06         ` Benoît Canet
  0 siblings, 1 reply; 36+ messages in thread
From: Max Reitz @ 2013-10-29 17:56 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha

[-- Attachment #1: Type: text/plain, Size: 3208 bytes --]

Am 29.10.2013 18:55, schrieb Max Reitz:
> Am 29.10.2013 08:59, schrieb Benoît Canet:
>> Le Friday 04 Oct 2013 à 16:31:56 (+0200), Max Reitz a écrit :
>>> On 2013-10-02 14:39, Benoît Canet wrote:
>>>> It must be applied on top of "block: Add BlockDriver.bdrv_check_ext_snapshot."
>>>>
>>>> This patchset create a block driver implementing a quorum using total qemu disk
>>>> images. Writes are mirrored on the $total files.
>>>> For the reading part the $total files are read at the same time and a vote is
>>>> done to determine if a qiov version is present $threshold or more times. It then
>>>> return this majority version to the upper layers.
>>>> When i < $threshold versions of the data are returned by the lower layer the
>>>> quorum is broken and the read return -EIO.
>>>>
>>>> The goal of this patchset is to be turned in a QEMU block filter living just
>>>> above raw-*.c and below qcow2/qed when the required infrastructure will be done.
>>>>
>>>> Main use of this feature will be people using NFS appliances which can be
>>>> subjected to bitflip errors.
>>>>
>>>> This patchset can be used to replace blkverify and the out of tree blkmirror.
>>>>
>>>> v9: integrate latests comments from kevin
>>>>
>>>> Benoît Canet (11):
>>>>   quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB.
>>>>   quorum: Create BDRVQuorumState and BlkDriver and do init.
>>>>   quorum: Add quorum_aio_writev and its dependencies.
>>>>   blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from
>>>>     blkverify.
>>>>   quorum: Add quorum_aio_readv.
>>>>   quorum: Add quorum mechanism.
>>>>   quorum: Add quorum_getlength().
>>>>   quorum: Add quorum_invalidate_cache().
>>>>   quorum: Add quorum_co_get_block_status.
>>>>   quorum: Add quorum_co_flush().
>>>>   quorum: Add quorum_open() and quorum_close().
>>>>
>>>>  block/Makefile.objs       |    1 +
>>>>  block/blkverify.c         |  108 +----
>>>>  block/quorum.c            | 1035 +++++++++++++++++++++++++++++++++++++++++++++
>>>>  configure                 |   36 ++
>>>>  include/monitor/monitor.h |    2 +
>>>>  include/qemu-common.h     |    2 +
>>>>  monitor.c                 |    2 +
>>>>  util/iov.c                |  103 +++++
>>>>  8 files changed, 1183 insertions(+), 106 deletions(-)
>>>>  create mode 100644 block/quorum.c
>>>>
>>> Could you add a qemu-iotest if possible?
>> I don't see how to specify quorum to qemu-io for bytes checking.
>>
>> Best regards
>>
>> Benoît
> As of commit b543c5cd, you can pass options to the block driver through
> qemu-io -c open. The problem for quorum is that it cannot handle the
> filename option (which qemu-io will always set). I've attached a diff to
> change this behavior (which makes the path parameter for open optional
> and passes NULL to bdrv_open if it is not given). You're free to make
> use of it. ;-)
>
> If this diff is applied, you can invoke qemu-io with quorum for example
> as following:
> qemu-io -c 'open -o
> file.driver=quorum,file.children.0.file.filename=[Image
> 1],file.children.1.file.filename=[Image 2],file.vote_threshold=[x]'
>
> Max

Obviously forgot to attach the diff.

Max

[-- Attachment #2: qemu-io-open.diff --]
[-- Type: text/plain, Size: 2331 bytes --]

diff --git a/qemu-io.c b/qemu-io.c
index 3b3340a..4d0e9f6 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -57,8 +57,13 @@ static int openfile(char *name, int flags, int growable, QDict *opts)
 
     if (growable) {
         if (bdrv_file_open(&qemuio_bs, name, opts, flags, &local_err)) {
-            fprintf(stderr, "%s: can't open device %s: %s\n", progname, name,
-                    error_get_pretty(local_err));
+            if (name) {
+                fprintf(stderr, "%s: can't open device %s: %s\n", progname,
+                        name, error_get_pretty(local_err));
+            } else {
+                fprintf(stderr, "%s: can't open device: %s\n", progname,
+                        error_get_pretty(local_err));
+            }
             error_free(local_err);
             return 1;
         }
@@ -66,8 +71,13 @@ static int openfile(char *name, int flags, int growable, QDict *opts)
         qemuio_bs = bdrv_new("hda");
 
         if (bdrv_open(qemuio_bs, name, opts, flags, NULL, &local_err) < 0) {
-            fprintf(stderr, "%s: can't open device %s: %s\n", progname, name,
-                    error_get_pretty(local_err));
+            if (name) {
+                fprintf(stderr, "%s: can't open device %s: %s\n", progname,
+                        name, error_get_pretty(local_err));
+            } else {
+                fprintf(stderr, "%s: can't open device: %s\n", progname,
+                        error_get_pretty(local_err));
+            }
             error_free(local_err);
             bdrv_unref(qemuio_bs);
             qemuio_bs = NULL;
@@ -127,6 +137,7 @@ static int open_f(BlockDriverState *bs, int argc, char **argv)
     int c;
     QemuOpts *qopts;
     QDict *opts = NULL;
+    char *filename = NULL;
 
     while ((c = getopt(argc, argv, "snrgo:")) != EOF) {
         switch (c) {
@@ -160,11 +171,13 @@ static int open_f(BlockDriverState *bs, int argc, char **argv)
         flags |= BDRV_O_RDWR;
     }
 
-    if (optind != argc - 1) {
+    if (optind == argc - 1) {
+        filename = argv[optind];
+    } else if (optind < argc - 1) {
         return qemuio_command_usage(&open_cmd);
     }
 
-    return openfile(argv[optind], flags, growable, opts);
+    return openfile(filename, flags, growable, opts);
 }
 
 static int quit_f(BlockDriverState *bs, int argc, char **argv)

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 00/11] Quorum block driver
  2013-10-29 17:56       ` Max Reitz
@ 2013-10-29 20:06         ` Benoît Canet
  0 siblings, 0 replies; 36+ messages in thread
From: Benoît Canet @ 2013-10-29 20:06 UTC (permalink / raw)
  To: Max Reitz; +Cc: Benoît Canet, kwolf, qemu-devel, stefanha

Le Tuesday 29 Oct 2013 à 18:56:34 (+0100), Max Reitz a écrit :
> Am 29.10.2013 18:55, schrieb Max Reitz:
> > Am 29.10.2013 08:59, schrieb Benoît Canet:
> >> Le Friday 04 Oct 2013 à 16:31:56 (+0200), Max Reitz a écrit :
> >>> On 2013-10-02 14:39, Benoît Canet wrote:
> >>>> It must be applied on top of "block: Add BlockDriver.bdrv_check_ext_snapshot."
> >>>>
> >>>> This patchset create a block driver implementing a quorum using total qemu disk
> >>>> images. Writes are mirrored on the $total files.
> >>>> For the reading part the $total files are read at the same time and a vote is
> >>>> done to determine if a qiov version is present $threshold or more times. It then
> >>>> return this majority version to the upper layers.
> >>>> When i < $threshold versions of the data are returned by the lower layer the
> >>>> quorum is broken and the read return -EIO.
> >>>>
> >>>> The goal of this patchset is to be turned in a QEMU block filter living just
> >>>> above raw-*.c and below qcow2/qed when the required infrastructure will be done.
> >>>>
> >>>> Main use of this feature will be people using NFS appliances which can be
> >>>> subjected to bitflip errors.
> >>>>
> >>>> This patchset can be used to replace blkverify and the out of tree blkmirror.
> >>>>
> >>>> v9: integrate latests comments from kevin
> >>>>
> >>>> Benoît Canet (11):
> >>>>   quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB.
> >>>>   quorum: Create BDRVQuorumState and BlkDriver and do init.
> >>>>   quorum: Add quorum_aio_writev and its dependencies.
> >>>>   blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from
> >>>>     blkverify.
> >>>>   quorum: Add quorum_aio_readv.
> >>>>   quorum: Add quorum mechanism.
> >>>>   quorum: Add quorum_getlength().
> >>>>   quorum: Add quorum_invalidate_cache().
> >>>>   quorum: Add quorum_co_get_block_status.
> >>>>   quorum: Add quorum_co_flush().
> >>>>   quorum: Add quorum_open() and quorum_close().
> >>>>
> >>>>  block/Makefile.objs       |    1 +
> >>>>  block/blkverify.c         |  108 +----
> >>>>  block/quorum.c            | 1035 +++++++++++++++++++++++++++++++++++++++++++++
> >>>>  configure                 |   36 ++
> >>>>  include/monitor/monitor.h |    2 +
> >>>>  include/qemu-common.h     |    2 +
> >>>>  monitor.c                 |    2 +
> >>>>  util/iov.c                |  103 +++++
> >>>>  8 files changed, 1183 insertions(+), 106 deletions(-)
> >>>>  create mode 100644 block/quorum.c
> >>>>
> >>> Could you add a qemu-iotest if possible?
> >> I don't see how to specify quorum to qemu-io for bytes checking.
> >>
> >> Best regards
> >>
> >> Benoît
> > As of commit b543c5cd, you can pass options to the block driver through
> > qemu-io -c open. The problem for quorum is that it cannot handle the
> > filename option (which qemu-io will always set). I've attached a diff to
> > change this behavior (which makes the path parameter for open optional
> > and passes NULL to bdrv_open if it is not given). You're free to make
> > use of it. ;-)
> >
> > If this diff is applied, you can invoke qemu-io with quorum for example
> > as following:
> > qemu-io -c 'open -o
> > file.driver=quorum,file.children.0.file.filename=[Image
> > 1],file.children.1.file.filename=[Image 2],file.vote_threshold=[x]'
> >
> > Max
> 
> Obviously forgot to attach the diff.
> 
> Max

> diff --git a/qemu-io.c b/qemu-io.c
> index 3b3340a..4d0e9f6 100644
> --- a/qemu-io.c
> +++ b/qemu-io.c
> @@ -57,8 +57,13 @@ static int openfile(char *name, int flags, int growable, QDict *opts)
>  
>      if (growable) {
>          if (bdrv_file_open(&qemuio_bs, name, opts, flags, &local_err)) {
> -            fprintf(stderr, "%s: can't open device %s: %s\n", progname, name,
> -                    error_get_pretty(local_err));
> +            if (name) {
> +                fprintf(stderr, "%s: can't open device %s: %s\n", progname,
> +                        name, error_get_pretty(local_err));
> +            } else {
> +                fprintf(stderr, "%s: can't open device: %s\n", progname,
> +                        error_get_pretty(local_err));
> +            }
>              error_free(local_err);
>              return 1;
>          }
> @@ -66,8 +71,13 @@ static int openfile(char *name, int flags, int growable, QDict *opts)
>          qemuio_bs = bdrv_new("hda");
>  
>          if (bdrv_open(qemuio_bs, name, opts, flags, NULL, &local_err) < 0) {
> -            fprintf(stderr, "%s: can't open device %s: %s\n", progname, name,
> -                    error_get_pretty(local_err));
> +            if (name) {
> +                fprintf(stderr, "%s: can't open device %s: %s\n", progname,
> +                        name, error_get_pretty(local_err));
> +            } else {
> +                fprintf(stderr, "%s: can't open device: %s\n", progname,
> +                        error_get_pretty(local_err));
> +            }
>              error_free(local_err);
>              bdrv_unref(qemuio_bs);
>              qemuio_bs = NULL;
> @@ -127,6 +137,7 @@ static int open_f(BlockDriverState *bs, int argc, char **argv)
>      int c;
>      QemuOpts *qopts;
>      QDict *opts = NULL;
> +    char *filename = NULL;
>  
>      while ((c = getopt(argc, argv, "snrgo:")) != EOF) {
>          switch (c) {
> @@ -160,11 +171,13 @@ static int open_f(BlockDriverState *bs, int argc, char **argv)
>          flags |= BDRV_O_RDWR;
>      }
>  
> -    if (optind != argc - 1) {
> +    if (optind == argc - 1) {
> +        filename = argv[optind];
> +    } else if (optind < argc - 1) {
>          return qemuio_command_usage(&open_cmd);
>      }
>  
> -    return openfile(argv[optind], flags, growable, opts);
> +    return openfile(filename, flags, growable, opts);
>  }
>  
>  static int quit_f(BlockDriverState *bs, int argc, char **argv)

Thanks a lot.

Best regards

Benoît

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH V9 03/11] quorum: Add quorum_aio_writev and its dependencies.
  2013-10-29 17:21       ` Max Reitz
@ 2013-10-29 20:06         ` Benoît Canet
  0 siblings, 0 replies; 36+ messages in thread
From: Benoît Canet @ 2013-10-29 20:06 UTC (permalink / raw)
  To: Max Reitz; +Cc: Benoît Canet, kwolf, qemu-devel, stefanha

L Tuesday 29 Oct 2013 à 18:21:57 (+0100), Max Reitz a écrit :
> Am 28.10.2013 13:21, schrieb Benoît Canet:
> > Le Friday 04 Oct 2013 à 16:35:18 (+0200), Max Reitz a écrit :
> >> On 2013-10-02 14:39, Benoît Canet wrote:
> >>> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> >>> ---
> >>>  block/quorum.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  1 file changed, 123 insertions(+)
> >>>
> >>> diff --git a/block/quorum.c b/block/quorum.c
> >>> index 9557e61..b49e3c6 100644
> >>> --- a/block/quorum.c
> >>> +++ b/block/quorum.c
> >>> @@ -64,11 +64,134 @@ struct QuorumAIOCB {
> >>>      int vote_ret;
> >>>  };
> >>> +static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
> >>> +{
> >>> +    QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
> >>> +    bool finished = false;
> >>> +
> >>> +    /* Wait for the request to finish */
> >>> +    acb->finished = &finished;
> >>> +    while (!finished) {
> >>> +        qemu_aio_wait();
> >>> +    }
> >> Hm, wouldn't it be better to pass the cancel to the children?
> >>
> >> Max
> > Hi Max,
> >
> > Hi don't understand how you would do this.
> >
> > Best regards
> >
> > Benoît
> 
> I thought of calling bdrv_aio_cancel() on every aios[i].aiocb – I don't
> know if that will work but it seems reasonable to me to try to cancel
> all the operations quorum has initiated when the quorum operation itself
> is requested to be canceled (instead of waiting for the spawned
> operations to finish normally).

That makes sense. I will do this.

Best regards

Benoît

> 
> Max

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2013-10-29 20:07 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-02 12:39 [Qemu-devel] [PATCH V9 00/11] Quorum block driver Benoît Canet
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 01/11] quorum: Create quorum.c, add QuorumSingleAIOCB and QuorumAIOCB Benoît Canet
2013-10-04 14:33   ` Max Reitz
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 02/11] quorum: Create BDRVQuorumState and BlkDriver and do init Benoît Canet
2013-10-04 14:34   ` Max Reitz
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 03/11] quorum: Add quorum_aio_writev and its dependencies Benoît Canet
2013-10-04 14:35   ` Max Reitz
2013-10-28 12:21     ` Benoît Canet
2013-10-29 17:21       ` Max Reitz
2013-10-29 20:06         ` Benoît Canet
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 04/11] blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from blkverify Benoît Canet
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 05/11] quorum: Add quorum_aio_readv Benoît Canet
2013-10-04 14:38   ` Max Reitz
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 06/11] quorum: Add quorum mechanism Benoît Canet
2013-10-04 14:48   ` Max Reitz
2013-10-28 12:31     ` Benoît Canet
2013-10-29 16:42       ` Max Reitz
2013-10-28 13:04     ` Benoît Canet
2013-10-29 17:34       ` Max Reitz
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 07/11] quorum: Add quorum_getlength() Benoît Canet
2013-10-04 14:49   ` Max Reitz
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 08/11] quorum: Add quorum_invalidate_cache() Benoît Canet
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 09/11] quorum: Add quorum_co_get_block_status Benoît Canet
2013-10-04 14:51   ` Max Reitz
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 10/11] quorum: Add quorum_co_flush() Benoît Canet
2013-10-04 14:53   ` Max Reitz
2013-10-02 12:39 ` [Qemu-devel] [PATCH V9 11/11] quorum: Add quorum_open() and quorum_close() Benoît Canet
2013-10-04 15:14   ` Max Reitz
2013-10-04 15:31     ` Eric Blake
2013-10-07  7:23       ` Max Reitz
2013-10-04 14:31 ` [Qemu-devel] [PATCH V9 00/11] Quorum block driver Max Reitz
2013-10-04 22:25   ` Benoît Canet
2013-10-29  7:59   ` Benoît Canet
2013-10-29 17:55     ` Max Reitz
2013-10-29 17:56       ` Max Reitz
2013-10-29 20:06         ` Benoît Canet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.