All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
@ 2014-09-01  7:43 Liu Yuan
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 1/8] block/quorum: initialize qcrs.aiocb for read Liu Yuan
                   ` (10 more replies)
  0 siblings, 11 replies; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  7:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

This patch set mainly add mainly two logics to implement device recover
- notify qourum driver of the broken states from the child driver(s)
- dirty track and sync the device after it is repaired

Thus quorum allow VMs to continue while some child devices are broken and when
the child devices are repaired and return back, we sync dirty bits during
downtime to keep data consistency.

The recovery logic is based on the driver state bitmap and will sync the dirty
bits with a timeslice window in a coroutine in this prtimive implementation.

Simple graph about 2 children with threshold=1 and read-pattern=fifo:
(similary to DRBD)

+ denote device sync iteration
- IO on a single device
= IO on two devices

                                      sync complete, release dirty bitmap
                                         ^
                                         |
  ====-----------------++++----++++----++==========
     |                 |
     |                 v
     |               device repaired and begin to sync
     v
   device broken, create a dirty bitmap

  This sync logic can take care of nested broken problem, that devices are
  broken while in sync. We just start a sync process after the devices are
  repaired again and switch the devices from broken to sound only when the sync
  completes.

For read-pattern=quorum mode, it enjoys the recovery logic without any problem.

Todo:
- use aio interface to sync data (multiple transfer in one go)
- dynamic slice window to control sync bandwidth more smoothly
- add auto-reconnection mechanism to other protocol (if not support yet)
- add tests

Cc: Eric Blake <eblake@redhat.com>
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>

Liu Yuan (8):
  block/quorum: initialize qcrs.aiocb for read
  block: add driver operation callbacks
  block/sheepdog: propagate disconnect/reconnect events to upper driver
  block/quorum: add quorum_aio_release() helper
  quorum: fix quorum_aio_cancel()
  block/quorum: add broken state to BlockDriverState
  block: add two helpers
  quorum: add basic device recovery logic

 block.c                   |  17 +++
 block/quorum.c            | 324 +++++++++++++++++++++++++++++++++++++++++-----
 block/sheepdog.c          |   9 ++
 include/block/block.h     |   9 ++
 include/block/block_int.h |   6 +
 trace-events              |   5 +
 6 files changed, 336 insertions(+), 34 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH 1/8] block/quorum: initialize qcrs.aiocb for read
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
@ 2014-09-01  7:43 ` Liu Yuan
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks Liu Yuan
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  7:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

This is required by quorum_aio_cancel()

Cc: Eric Blake <eblake@redhat.com>
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 block/quorum.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index af48e8c..5866bca 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -653,8 +653,10 @@ static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
     }
 
     for (i = 0; i < s->num_children; i++) {
-        bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
-                       acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
+        acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
+                                            &acb->qcrs[i].qiov,
+                                            acb->nb_sectors, quorum_aio_cb,
+                                            &acb->qcrs[i]);
     }
 
     return &acb->common;
@@ -663,15 +665,14 @@ static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
 static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
 {
     BDRVQuorumState *s = acb->common.bs->opaque;
-
-    acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
-                                                     acb->qiov->size);
-    qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
-    qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
-                     acb->qcrs[acb->child_iter].buf);
-    bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
-                   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
-                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
+    int i = acb->child_iter;
+
+    acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
+    qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
+    qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
+    acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
+                                        &acb->qcrs[i].qiov, acb->nb_sectors,
+                                        quorum_aio_cb, &acb->qcrs[i]);
 
     return &acb->common;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 1/8] block/quorum: initialize qcrs.aiocb for read Liu Yuan
@ 2014-09-01  7:43 ` Liu Yuan
  2014-09-01  8:28   ` Benoît Canet
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver Liu Yuan
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  7:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

Driver operations are defined as callbacks passed from block upper drivers to
lower drivers and are supposed to be called by lower drivers.

Requests handling(queuing, submitting, etc.) are done in protocol tier in the
block layer and connection states are also maintained down there. Driver
operations are supposed to notify the upper tier (such as quorum) of the states
changes.

For now only two operation are added:

driver_disconnect: called when connection is off
driver_reconnect: called when connection is on after disconnection

Which are used to notify upper tier of the connection state.

Cc: Eric Blake <eblake@redhat.com>
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 block.c                   | 7 +++++++
 include/block/block.h     | 7 +++++++
 include/block/block_int.h | 3 +++
 3 files changed, 17 insertions(+)

diff --git a/block.c b/block.c
index c12b8de..22eb3e4 100644
--- a/block.c
+++ b/block.c
@@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
     bs->dev_opaque = opaque;
 }
 
+void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
+                      void *opaque)
+{
+    bs->drv_ops = ops;
+    bs->drv_opaque = opaque;
+}
+
 static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load)
 {
     if (bs->dev_ops && bs->dev_ops->change_media_cb) {
diff --git a/include/block/block.h b/include/block/block.h
index 8f4ad16..a61eaf0 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -82,6 +82,11 @@ typedef struct BlockDevOps {
     void (*resize_cb)(void *opaque);
 } BlockDevOps;
 
+typedef struct BlockDrvOps {
+    void (*driver_reconnect)(BlockDriverState *bs);
+    void (*driver_disconnect)(BlockDriverState *bs);
+} BlockDrvOps;
+
 typedef enum {
     BDRV_REQ_COPY_ON_READ = 0x1,
     BDRV_REQ_ZERO_WRITE   = 0x2,
@@ -234,6 +239,8 @@ void bdrv_detach_dev(BlockDriverState *bs, void *dev);
 void *bdrv_get_attached_dev(BlockDriverState *bs);
 void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
                       void *opaque);
+void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
+                      void *opaque);
 void bdrv_dev_eject_request(BlockDriverState *bs, bool force);
 bool bdrv_dev_has_removable_media(BlockDriverState *bs);
 bool bdrv_dev_is_tray_open(BlockDriverState *bs);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 2334895..9fdec7f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -319,6 +319,9 @@ struct BlockDriverState {
     const BlockDevOps *dev_ops;
     void *dev_opaque;
 
+    const BlockDrvOps *drv_ops;
+    void *drv_opaque;
+
     AioContext *aio_context; /* event loop used for fd handlers, timers, etc */
 
     char filename[1024];
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 1/8] block/quorum: initialize qcrs.aiocb for read Liu Yuan
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks Liu Yuan
@ 2014-09-01  7:43 ` Liu Yuan
  2014-09-01  8:31   ` Benoît Canet
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 4/8] block/quorum: add quorum_aio_release() helper Liu Yuan
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  7:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

This is the reference usage how we propagate connection state to upper tier.

Cc: Eric Blake <eblake@redhat.com>
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 block/sheepdog.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 53c24d6..9c0fc49 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -714,6 +714,11 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
     BDRVSheepdogState *s = opaque;
     AIOReq *aio_req, *next;
+    BlockDriverState *bs = s->bs;
+
+    if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
+        bs->drv_ops->driver_disconnect(bs);
+    }
 
     aio_set_fd_handler(s->aio_context, s->fd, NULL, NULL, NULL);
     close(s->fd);
@@ -756,6 +761,10 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
         QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
         resend_aioreq(s, aio_req);
     }
+
+    if (bs->drv_ops && bs->drv_ops->driver_reconnect) {
+        bs->drv_ops->driver_reconnect(bs);
+    }
 }
 
 /*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH 4/8] block/quorum: add quorum_aio_release() helper
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
                   ` (2 preceding siblings ...)
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver Liu Yuan
@ 2014-09-01  7:43 ` Liu Yuan
  2014-09-01  8:33   ` Benoît Canet
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel() Liu Yuan
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  7:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

Cc: Eric Blake <eblake@redhat.com>
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 block/quorum.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 5866bca..9e056d6 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -130,6 +130,12 @@ struct QuorumAIOCB {
 
 static bool quorum_vote(QuorumAIOCB *acb);
 
+static void quorum_aio_release(QuorumAIOCB *acb)
+{
+    g_free(acb->qcrs);
+    qemu_aio_release(acb);
+}
+
 static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
@@ -141,8 +147,7 @@ static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
         bdrv_aio_cancel(acb->qcrs[i].aiocb);
     }
 
-    g_free(acb->qcrs);
-    qemu_aio_release(acb);
+    quorum_aio_release(acb);
 }
 
 static AIOCBInfo quorum_aiocb_info = {
@@ -168,8 +173,7 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
         }
     }
 
-    g_free(acb->qcrs);
-    qemu_aio_release(acb);
+    quorum_aio_release(acb);
 }
 
 static bool quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
                   ` (3 preceding siblings ...)
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 4/8] block/quorum: add quorum_aio_release() helper Liu Yuan
@ 2014-09-01  7:43 ` Liu Yuan
  2014-09-01  8:35   ` Benoît Canet
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState Liu Yuan
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  7:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

For a fifo read pattern, we only have one running aio (possible other cases that
has less number than num_children in the future), so we need to check if
.acb is NULL against bdrv_aio_cancel() to avoid segfault.

Cc: Eric Blake <eblake@redhat.com>
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 block/quorum.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 9e056d6..b9eeda3 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -144,7 +144,9 @@ static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
 
     /* cancel all callbacks */
     for (i = 0; i < s->num_children; i++) {
-        bdrv_aio_cancel(acb->qcrs[i].aiocb);
+        if (acb->qcrs[i].aiocb) {
+            bdrv_aio_cancel(acb->qcrs[i].aiocb);
+        }
     }
 
     quorum_aio_release(acb);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
                   ` (4 preceding siblings ...)
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel() Liu Yuan
@ 2014-09-01  7:43 ` Liu Yuan
  2014-09-01  8:57   ` Benoît Canet
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 7/8] block: add two helpers Liu Yuan
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  7:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

This allow VM continues to process even if some devices are broken meanwhile
with proper configuration.

We mark the device broken when the protocol tier notify back some broken
state(s) of the device, such as diconnection via driver operations. We could
also reset the device as sound when the protocol tier is repaired.

Origianlly .threshold controls how we should decide the success of read/write
and return the failure only if the success count of read/write is less than
.threshold specified by users. But it doesn't record the states of underlying
states and will impact performance a bit in some cases.

For example, we have 3 children and .threshold is set 2. If one of the devices
broken, we should still return success and continue to run VM. But for every
IO operations, we will blindly send the requests to the broken device.

To store broken state into driver state we can save requests to borken devices
and resend the requests to the repaired ones by setting broken as false.

This is especially useful for network based protocol such as sheepdog, which
has a auto-reconnection mechanism and will never report EIO if the connection
is broken but just store the requests to its local queue and wait for resending.
Without broken state, quorum request will not come back until the connection is
re-established. So we have to skip the broken deivces to allow VM to continue
running with networked backed child (glusterfs, nfs, sheepdog, etc).

With the combination of read-pattern and threshold, we can easily mimic the DRVD
behavior with following configuration:

 read-pattern=fifo,threshold=1 will two children.

Cc: Eric Blake <eblake@redhat.com>
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 block/quorum.c            | 102 ++++++++++++++++++++++++++++++++++++++--------
 include/block/block_int.h |   3 ++
 2 files changed, 87 insertions(+), 18 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index b9eeda3..7b07e35 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -120,6 +120,7 @@ struct QuorumAIOCB {
     int rewrite_count;          /* number of replica to rewrite: count down to
                                  * zero once writes are fired
                                  */
+    int issued_count;           /* actual read&write issued count */
 
     QuorumVotes votes;
 
@@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
     if (acb->is_read) {
         /* on the quorum case acb->child_iter == s->num_children - 1 */
         for (i = 0; i <= acb->child_iter; i++) {
-            qemu_vfree(acb->qcrs[i].buf);
-            qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            if (acb->qcrs[i].buf) {
+                qemu_vfree(acb->qcrs[i].buf);
+                qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            }
         }
     }
 
@@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
     acb->count = 0;
     acb->success_count = 0;
     acb->rewrite_count = 0;
+    acb->issued_count = 0;
     acb->votes.compare = quorum_sha256_compare;
     QLIST_INIT(&acb->votes.vote_list);
     acb->is_read = false;
@@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
     }
 }
 
+static int next_fifo_child(QuorumAIOCB *acb)
+{
+    BDRVQuorumState *s = acb->common.bs->opaque;
+    int i;
+
+    for (i = acb->child_iter; i < s->num_children; i++) {
+        if (!s->bs[i]->broken) {
+            break;
+        }
+    }
+    if (i == s->num_children) {
+        return -1;
+    }
+    return i;
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
     QuorumChildRequest *sacb = opaque;
@@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret)
     BDRVQuorumState *s = acb->common.bs->opaque;
     bool rewrite = false;
 
+    if (ret < 0) {
+        s->bs[acb->child_iter]->broken = true;
+    }
+
     if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
         /* We try to read next child in FIFO order if we fail to read */
-        if (ret < 0 && ++acb->child_iter < s->num_children) {
-            read_fifo_child(acb);
-            return;
+        if (ret < 0) {
+            acb->child_iter = next_fifo_child(acb);
+            if (acb->child_iter > 0) {
+                read_fifo_child(acb);
+                return;
+            }
         }
 
         if (ret == 0) {
@@ -315,9 +342,9 @@ static void quorum_aio_cb(void *opaque, int ret)
     } else {
         quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret);
     }
-    assert(acb->count <= s->num_children);
-    assert(acb->success_count <= s->num_children);
-    if (acb->count < s->num_children) {
+    assert(acb->count <= acb->issued_count);
+    assert(acb->success_count <= acb->issued_count);
+    if (acb->count < acb->issued_count) {
         return;
     }
 
@@ -647,22 +674,46 @@ free_exit:
     return rewrite;
 }
 
+static bool has_enough_children(BDRVQuorumState *s)
+{
+    int good = 0, i;
+
+    for (i = 0; i < s->num_children; i++) {
+        if (s->bs[i]->broken) {
+            continue;
+        }
+        good++;
+    }
+
+    if (good >= s->threshold) {
+        return true;
+    } else {
+        return false;
+    }
+}
+
 static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
 {
     BDRVQuorumState *s = acb->common.bs->opaque;
     int i;
 
+    if (!has_enough_children(s)) {
+        quorum_aio_release(acb);
+        return NULL;
+    }
+
     for (i = 0; i < s->num_children; i++) {
+        if (s->bs[i]->broken) {
+            continue;
+        }
         acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
         qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
         qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
-    }
-
-    for (i = 0; i < s->num_children; i++) {
         acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
                                             &acb->qcrs[i].qiov,
                                             acb->nb_sectors, quorum_aio_cb,
                                             &acb->qcrs[i]);
+        acb->issued_count++;
     }
 
     return &acb->common;
@@ -679,7 +730,7 @@ static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
     acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
                                         &acb->qcrs[i].qiov, acb->nb_sectors,
                                         quorum_aio_cb, &acb->qcrs[i]);
-
+    acb->issued_count = 1;
     return &acb->common;
 }
 
@@ -693,6 +744,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
     BDRVQuorumState *s = bs->opaque;
     QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
                                       nb_sectors, cb, opaque);
+
     acb->is_read = true;
 
     if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
@@ -701,6 +753,11 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
     }
 
     acb->child_iter = 0;
+    acb->child_iter = next_fifo_child(acb);
+    if (acb->child_iter < 0) {
+        quorum_aio_release(acb);
+        return NULL;
+    }
     return read_fifo_child(acb);
 }
 
@@ -716,10 +773,19 @@ static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
                                       cb, opaque);
     int i;
 
+    if (!has_enough_children(s)) {
+        quorum_aio_release(acb);
+        return NULL;
+    }
+
     for (i = 0; i < s->num_children; i++) {
+        if (s->bs[i]->broken) {
+            continue;
+        }
         acb->qcrs[i].aiocb = bdrv_aio_writev(s->bs[i], sector_num, qiov,
                                              nb_sectors, &quorum_aio_cb,
                                              &acb->qcrs[i]);
+        acb->issued_count++;
     }
 
     return &acb->common;
@@ -926,6 +992,12 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     s->threshold = qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHOLD, 0);
+    /* and validate it against s->num_children */
+    ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
+    if (ret < 0) {
+        goto exit;
+    }
+
     ret = parse_read_pattern(qemu_opt_get(opts, QUORUM_OPT_READ_PATTERN));
     if (ret < 0) {
         error_setg(&local_err, "Please set read-pattern as fifo or quorum\n");
@@ -934,12 +1006,6 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     s->read_pattern = ret;
 
     if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
-        /* and validate it against s->num_children */
-        ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
-        if (ret < 0) {
-            goto exit;
-        }
-
         /* is the driver in blkverify mode */
         if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
             s->num_children == 2 && s->threshold == 2) {
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 9fdec7f..599110b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -402,6 +402,9 @@ struct BlockDriverState {
 
     /* The error object in use for blocking operations on backing_hd */
     Error *backing_blocker;
+
+    /* True if the associated device is broken */
+    bool broken;
 };
 
 int get_tmp_filename(char *filename, int size);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH 7/8] block: add two helpers
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
                   ` (5 preceding siblings ...)
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState Liu Yuan
@ 2014-09-01  7:43 ` Liu Yuan
  2014-09-01  8:59   ` Benoît Canet
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic Liu Yuan
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  7:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

These helpers are needed by later quorum sync device logic.

Cc: Eric Blake <eblake@redhat.com>
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 block.c               | 10 ++++++++++
 include/block/block.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/block.c b/block.c
index 22eb3e4..2e2f1d9 100644
--- a/block.c
+++ b/block.c
@@ -2145,6 +2145,16 @@ void *bdrv_get_attached_dev(BlockDriverState *bs)
     return bs->dev;
 }
 
+BlockDriverState *bdrv_get_file(BlockDriverState *bs)
+{
+    return bs->file;
+}
+
+const char *bdrv_get_filename(BlockDriverState *bs)
+{
+    return bs->filename;
+}
+
 void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
                       void *opaque)
 {
diff --git a/include/block/block.h b/include/block/block.h
index a61eaf0..1e116cc 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -237,6 +237,8 @@ int bdrv_attach_dev(BlockDriverState *bs, void *dev);
 void bdrv_attach_dev_nofail(BlockDriverState *bs, void *dev);
 void bdrv_detach_dev(BlockDriverState *bs, void *dev);
 void *bdrv_get_attached_dev(BlockDriverState *bs);
+BlockDriverState *bdrv_get_file(BlockDriverState *bs);
+const char *bdrv_get_filename(BlockDriverState *bs);
 void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
                       void *opaque);
 void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
                   ` (6 preceding siblings ...)
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 7/8] block: add two helpers Liu Yuan
@ 2014-09-01  7:43 ` Liu Yuan
  2014-09-01  9:37   ` Benoît Canet
  2014-09-01  8:19 ` [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Benoît Canet
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  7:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

For some configuration, quorum allow VMs to continue while some child devices
are broken and when the child devices are repaired and return back, we need to
sync dirty bits during downtime to keep data consistency.

The recovery logic is based on the driver state bitmap and will sync the dirty
bits with a timeslice window in a coroutine in this prtimive implementation.

Simple graph about 2 children with threshold=1 and read-pattern=fifo:

+ denote device sync iteration
- IO on a single device
= IO on two devices

                                      sync complete, release dirty bitmap
                                         ^
                                         |
  ====-----------------++++----++++----++==========
     |                 |
     |                 v
     |               device repaired and begin to sync
     v
   device broken, create a dirty bitmap

  This sync logic can take care of nested broken problem, that devices are
  broken while in sync. We just start a sync process after the devices are
  repaired again and switch the devices from broken to sound only when the sync
  completes.

For read-pattern=quorum mode, it enjoys the recovery logic without any problem.

Cc: Eric Blake <eblake@redhat.com>
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 block/quorum.c | 189 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 trace-events   |   5 ++
 2 files changed, 191 insertions(+), 3 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 7b07e35..ffd7c2d 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -23,6 +23,7 @@
 #include "qapi/qmp/qlist.h"
 #include "qapi/qmp/qstring.h"
 #include "qapi-event.h"
+#include "trace.h"
 
 #define HASH_LENGTH 32
 
@@ -31,6 +32,10 @@
 #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
 #define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
+#define SLICE_TIME          100000000ULL /* 100 ms */
+#define CHUNK_SIZE          (1 << 20) /* 1M */
+#define SECTORS_PER_CHUNK   (CHUNK_SIZE >> BDRV_SECTOR_BITS)
+
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
     char h[HASH_LENGTH];       /* SHA-256 hash */
@@ -64,6 +69,7 @@ typedef struct QuorumVotes {
 
 /* the following structure holds the state of one quorum instance */
 typedef struct BDRVQuorumState {
+    BlockDriverState *mybs;/* Quorum block driver base state */
     BlockDriverState **bs; /* children BlockDriverStates */
     int num_children;      /* children count */
     int threshold;         /* if less than threshold children reads gave the
@@ -82,6 +88,10 @@ typedef struct BDRVQuorumState {
                             */
 
     QuorumReadPattern read_pattern;
+    BdrvDirtyBitmap *dirty_bitmap;
+    uint8_t *sync_buf;
+    HBitmapIter hbi;
+    int64_t sector_num;
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -290,12 +300,11 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
     }
 }
 
-static int next_fifo_child(QuorumAIOCB *acb)
+static int get_good_child(BDRVQuorumState *s, int iter)
 {
-    BDRVQuorumState *s = acb->common.bs->opaque;
     int i;
 
-    for (i = acb->child_iter; i < s->num_children; i++) {
+    for (i = iter; i < s->num_children; i++) {
         if (!s->bs[i]->broken) {
             break;
         }
@@ -306,6 +315,13 @@ static int next_fifo_child(QuorumAIOCB *acb)
     return i;
 }
 
+static int next_fifo_child(QuorumAIOCB *acb)
+{
+    BDRVQuorumState *s = acb->common.bs->opaque;
+
+    return get_good_child(s, acb->child_iter);
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
     QuorumChildRequest *sacb = opaque;
@@ -951,6 +967,171 @@ static int parse_read_pattern(const char *opt)
     return -EINVAL;
 }
 
+static void sync_prepare(BDRVQuorumState *qs, int64_t *num)
+{
+    int64_t nb, total = bdrv_nb_sectors(qs->mybs);
+
+    qs->sector_num = hbitmap_iter_next(&qs->hbi);
+    /* Wrap around if previous bits get dirty while syncing */
+    if (qs->sector_num < 0) {
+        bdrv_dirty_iter_init(qs->mybs, qs->dirty_bitmap, &qs->hbi);
+        qs->sector_num = hbitmap_iter_next(&qs->hbi);
+        assert(qs->sector_num >= 0);
+    }
+
+    for (nb = 1; nb < SECTORS_PER_CHUNK && qs->sector_num + nb < total;
+         nb++) {
+        if (!bdrv_get_dirty(qs->mybs, qs->dirty_bitmap, qs->sector_num + nb)) {
+            break;
+        }
+    }
+    *num = nb;
+}
+
+static void sync_finish(BDRVQuorumState *qs, int64_t num)
+{
+    int64_t i;
+
+    for (i = 0; i < num; i++) {
+        /* We need to advance the iterator manually */
+        hbitmap_iter_next(&qs->hbi);
+    }
+    bdrv_reset_dirty(qs->mybs, qs->sector_num, num);
+}
+
+static int quorum_sync_iteration(BDRVQuorumState *qs, BlockDriverState *target)
+{
+    BlockDriverState *source;
+    QEMUIOVector qiov;
+    int ret, good;
+    int64_t nb_sectors;
+    struct iovec iov;
+    const char *sname, *tname = bdrv_get_filename(target);
+
+    good = get_good_child(qs, 0);
+    if (good < 0) {
+        error_report("No good device available.");
+        return -1;
+    }
+    source = qs->bs[good];
+    sname = bdrv_get_filename(source);
+    sync_prepare(qs, &nb_sectors);
+    iov.iov_base = qs->sync_buf;
+    iov.iov_len  = nb_sectors * BDRV_SECTOR_SIZE;
+    qemu_iovec_init_external(&qiov, &iov, 1);
+
+    trace_quorum_sync_iteration(sname, tname, qs->sector_num, nb_sectors);
+    ret = bdrv_co_readv(source, qs->sector_num, nb_sectors, &qiov);
+    if (ret < 0) {
+        error_report("Read source %s failed.", sname);
+        return ret;
+    }
+    ret = bdrv_co_writev(target, qs->sector_num, nb_sectors, &qiov);
+    if (ret < 0) {
+        error_report("Write target %s failed.", tname);
+        return ret;
+    }
+    sync_finish(qs, nb_sectors);
+
+    return 0;
+}
+
+static int quorum_sync_device(BDRVQuorumState *qs, BlockDriverState *target)
+{
+    uint64_t last_pause_ns;
+
+    bdrv_dirty_iter_init(qs->mybs, qs->dirty_bitmap, &qs->hbi);
+    last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+    for (;;) {
+        int64_t cnt;
+
+        cnt = bdrv_get_dirty_count(qs->mybs, qs->dirty_bitmap);
+        if (cnt == 0) {
+            break;
+        }
+        error_report("count %ld", cnt);
+        if (quorum_sync_iteration(qs, target) < 0) {
+            return -1;
+        }
+        cnt = bdrv_get_dirty_count(qs->mybs, qs->dirty_bitmap);
+        if (cnt == 0) {
+            break;
+        }
+
+        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - last_pause_ns >=
+            SLICE_TIME) {
+            co_aio_sleep_ns(bdrv_get_aio_context(target), QEMU_CLOCK_REALTIME,
+                            SLICE_TIME);
+            last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+        }
+    }
+
+    return 0;
+}
+
+static BlockDriverState *file_to_bs(BDRVQuorumState *qs, BlockDriverState *file)
+{
+    int i;
+
+    for (i = 0; i < qs->num_children; i++) {
+        BlockDriverState *f = bdrv_get_file(qs->bs[i]);
+
+        if (f == file) {
+            return qs->bs[i];
+        }
+    }
+
+    error_report("Can't find driver state for %s", bdrv_get_filename(file));
+    abort();
+}
+
+static void quorum_driver_reconnect(BlockDriverState *file)
+{
+    BDRVQuorumState *qs = file->drv_opaque;
+    BlockDriverState *bs = file_to_bs(qs, file);
+    const char *name = bdrv_get_filename(bs);
+
+    trace_quorum_driver_reconnect(name);
+    assert(bs->broken == true);
+    if (quorum_sync_device(qs, bs) < 0) {
+        error_report("Failed to sync device %s", name);
+        return;
+    }
+
+    bdrv_release_dirty_bitmap(qs->mybs, qs->dirty_bitmap);
+    qemu_vfree(qs->sync_buf);
+    bs->broken = false;
+}
+
+static void quorum_driver_disconnect(BlockDriverState *file)
+{
+    BDRVQuorumState *qs = file->drv_opaque;
+    BlockDriverState *bs = file_to_bs(qs, file);
+    const char *name = bdrv_get_filename(bs);
+
+    trace_quorum_driver_disconnect(name);
+    /*
+     * If we are disconnected while being syncing, we expect to reconnect to the
+     * target again and resume the data sync from the last synced point.
+     */
+    if (bs->broken) {
+        return;
+    }
+
+    bs->broken = true;
+    qs->dirty_bitmap = bdrv_create_dirty_bitmap(qs->mybs, BDRV_SECTOR_SIZE,
+                                                NULL);
+    if (!qs->dirty_bitmap) {
+        abort();
+    }
+    qs->sync_buf = qemu_blockalign(bs, CHUNK_SIZE);
+}
+
+static const BlockDrvOps quorum_block_drv_ops = {
+    .driver_reconnect = quorum_driver_reconnect,
+    .driver_disconnect = quorum_driver_disconnect,
+};
+
 static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
                        Error **errp)
 {
@@ -975,6 +1156,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
         goto exit;
     }
 
+    s->mybs = bs;
     /* count how many different children are present */
     s->num_children = qlist_size(list);
     if (s->num_children < 2) {
@@ -1061,6 +1243,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
             goto close_exit;
         }
         opened[i] = true;
+        bdrv_set_drv_ops(bdrv_get_file(s->bs[i]), &quorum_block_drv_ops, s);
     }
 
     g_free(opened);
diff --git a/trace-events b/trace-events
index 81bc915..8da0a13 100644
--- a/trace-events
+++ b/trace-events
@@ -572,6 +572,11 @@ qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t o
 qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
 qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
 
+# block/quorum.c
+quorum_sync_iteration(const char *source, const char *target, int64_t sector, int num) "%s -> %s, sector %"PRId64" nb_sectors %d"
+quorum_driver_reconnect(const char *target) "%s"
+quorum_driver_disconnect(const char *target) "%s"
+
 # hw/display/g364fb.c
 g364fb_read(uint64_t addr, uint32_t val) "read addr=0x%"PRIx64": 0x%x"
 g364fb_write(uint64_t addr, uint32_t new) "write addr=0x%"PRIx64": 0x%x"
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
                   ` (7 preceding siblings ...)
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic Liu Yuan
@ 2014-09-01  8:19 ` Benoît Canet
  2014-09-02 22:19 ` Benoît Canet
  2014-09-07 15:12 ` Benoît Canet
  10 siblings, 0 replies; 31+ messages in thread
From: Benoît Canet @ 2014-09-01  8:19 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote :
> This patch set mainly add mainly two logics to implement device recover
> - notify qourum driver of the broken states from the child driver(s)
> - dirty track and sync the device after it is repaired
> 
> Thus quorum allow VMs to continue while some child devices are broken and when
> the child devices are repaired and return back, we sync dirty bits during
> downtime to keep data consistency.
> 
> The recovery logic is based on the driver state bitmap and will sync the dirty
> bits with a timeslice window in a coroutine in this prtimive implementation.
> 
> Simple graph about 2 children with threshold=1 and read-pattern=fifo:
> (similary to DRBD)
> 
> + denote device sync iteration
> - IO on a single device
> = IO on two devices
> 
>                                       sync complete, release dirty bitmap
>                                          ^
>                                          |
>   ====-----------------++++----++++----++==========
>      |                 |
>      |                 v
>      |               device repaired and begin to sync
>      v
>    device broken, create a dirty bitmap
> 
>   This sync logic can take care of nested broken problem, that devices are
>   broken while in sync. We just start a sync process after the devices are
>   repaired again and switch the devices from broken to sound only when the sync
>   completes.
> 
> For read-pattern=quorum mode, it enjoys the recovery logic without any problem.

Hi Liu,

I had something like that in mind.

This series seems very cool I will review it.

Thanks for contributing to quorum.

Best regards

Benoît

> 
> Todo:
> - use aio interface to sync data (multiple transfer in one go)
> - dynamic slice window to control sync bandwidth more smoothly
> - add auto-reconnection mechanism to other protocol (if not support yet)
> - add tests
> 
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Liu Yuan (8):
>   block/quorum: initialize qcrs.aiocb for read
>   block: add driver operation callbacks
>   block/sheepdog: propagate disconnect/reconnect events to upper driver
>   block/quorum: add quorum_aio_release() helper
>   quorum: fix quorum_aio_cancel()
>   block/quorum: add broken state to BlockDriverState
>   block: add two helpers
>   quorum: add basic device recovery logic
> 
>  block.c                   |  17 +++
>  block/quorum.c            | 324 +++++++++++++++++++++++++++++++++++++++++-----
>  block/sheepdog.c          |   9 ++
>  include/block/block.h     |   9 ++
>  include/block/block_int.h |   6 +
>  trace-events              |   5 +
>  6 files changed, 336 insertions(+), 34 deletions(-)
> 
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks Liu Yuan
@ 2014-09-01  8:28   ` Benoît Canet
  2014-09-01  9:19     ` Liu Yuan
  0 siblings, 1 reply; 31+ messages in thread
From: Benoît Canet @ 2014-09-01  8:28 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 15:43:08 (+0800), Liu Yuan wrote :
> Driver operations are defined as callbacks passed from block upper drivers to
> lower drivers and are supposed to be called by lower drivers.
> 
> Requests handling(queuing, submitting, etc.) are done in protocol tier in the
> block layer and connection states are also maintained down there. Driver
> operations are supposed to notify the upper tier (such as quorum) of the states
> changes.
> 
> For now only two operation are added:
> 
> driver_disconnect: called when connection is off
> driver_reconnect: called when connection is on after disconnection
> 
> Which are used to notify upper tier of the connection state.
> 
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block.c                   | 7 +++++++
>  include/block/block.h     | 7 +++++++
>  include/block/block_int.h | 3 +++
>  3 files changed, 17 insertions(+)
> 
> diff --git a/block.c b/block.c
> index c12b8de..22eb3e4 100644
> --- a/block.c
> +++ b/block.c
> @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
>      bs->dev_opaque = opaque;
>  }
>  
> +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
> +                      void *opaque)
> +{
> +    bs->drv_ops = ops;
> +    bs->drv_opaque = opaque;

We need to be very carefull of the mix between these fields and the infamous
bdrv_swap function.

Also I don't know if "driver operations" is the right name since the BlockDriver structure's
callback could also be named "driver operations".

> +}
> +
>  static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load)
>  {
>      if (bs->dev_ops && bs->dev_ops->change_media_cb) {
> diff --git a/include/block/block.h b/include/block/block.h
> index 8f4ad16..a61eaf0 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -82,6 +82,11 @@ typedef struct BlockDevOps {
>      void (*resize_cb)(void *opaque);
>  } BlockDevOps;
>  
> +typedef struct BlockDrvOps {
> +    void (*driver_reconnect)(BlockDriverState *bs);
> +    void (*driver_disconnect)(BlockDriverState *bs);
> +} BlockDrvOps;
> +
>  typedef enum {
>      BDRV_REQ_COPY_ON_READ = 0x1,
>      BDRV_REQ_ZERO_WRITE   = 0x2,
> @@ -234,6 +239,8 @@ void bdrv_detach_dev(BlockDriverState *bs, void *dev);
>  void *bdrv_get_attached_dev(BlockDriverState *bs);
>  void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
>                        void *opaque);
> +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
> +                      void *opaque);
>  void bdrv_dev_eject_request(BlockDriverState *bs, bool force);
>  bool bdrv_dev_has_removable_media(BlockDriverState *bs);
>  bool bdrv_dev_is_tray_open(BlockDriverState *bs);
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 2334895..9fdec7f 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -319,6 +319,9 @@ struct BlockDriverState {
>      const BlockDevOps *dev_ops;
>      void *dev_opaque;
>  
> +    const BlockDrvOps *drv_ops;
> +    void *drv_opaque;
> +
>      AioContext *aio_context; /* event loop used for fd handlers, timers, etc */
>  
>      char filename[1024];
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver Liu Yuan
@ 2014-09-01  8:31   ` Benoît Canet
  2014-09-01  9:22     ` Liu Yuan
  0 siblings, 1 reply; 31+ messages in thread
From: Benoît Canet @ 2014-09-01  8:31 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 15:43:09 (+0800), Liu Yuan wrote :
> This is the reference usage how we propagate connection state to upper tier.
> 
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block/sheepdog.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index 53c24d6..9c0fc49 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -714,6 +714,11 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
>  {
>      BDRVSheepdogState *s = opaque;
>      AIOReq *aio_req, *next;
> +    BlockDriverState *bs = s->bs;
> +
> +    if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
> +        bs->drv_ops->driver_disconnect(bs);
> +    }

Since this sequence will be strictly the same for all the implementation
could we create a bdrv_signal_disconnect(bs); in the block layer to make this
code generic ?

>  
>      aio_set_fd_handler(s->aio_context, s->fd, NULL, NULL, NULL);
>      close(s->fd);
> @@ -756,6 +761,10 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
>          QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
>          resend_aioreq(s, aio_req);
>      }
> +
> +    if (bs->drv_ops && bs->drv_ops->driver_reconnect) {
> +        bs->drv_ops->driver_reconnect(bs);
> +    }

Same here bdrv_signal_reconnect(bs);

>  }
>  
>  /*
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 4/8] block/quorum: add quorum_aio_release() helper
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 4/8] block/quorum: add quorum_aio_release() helper Liu Yuan
@ 2014-09-01  8:33   ` Benoît Canet
  0 siblings, 0 replies; 31+ messages in thread
From: Benoît Canet @ 2014-09-01  8:33 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 15:43:10 (+0800), Liu Yuan wrote :
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block/quorum.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index 5866bca..9e056d6 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -130,6 +130,12 @@ struct QuorumAIOCB {
>  
>  static bool quorum_vote(QuorumAIOCB *acb);
>  
> +static void quorum_aio_release(QuorumAIOCB *acb)
> +{
> +    g_free(acb->qcrs);
> +    qemu_aio_release(acb);
> +}
> +
>  static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
>  {
>      QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
> @@ -141,8 +147,7 @@ static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
>          bdrv_aio_cancel(acb->qcrs[i].aiocb);
>      }
>  
> -    g_free(acb->qcrs);
> -    qemu_aio_release(acb);
> +    quorum_aio_release(acb);
>  }
>  
>  static AIOCBInfo quorum_aiocb_info = {
> @@ -168,8 +173,7 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
>          }
>      }
>  
> -    g_free(acb->qcrs);
> -    qemu_aio_release(acb);
> +    quorum_aio_release(acb);
>  }
>  
>  static bool quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
> -- 
> 1.9.1
> 

Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel() Liu Yuan
@ 2014-09-01  8:35   ` Benoît Canet
  2014-09-01  9:26     ` Liu Yuan
  0 siblings, 1 reply; 31+ messages in thread
From: Benoît Canet @ 2014-09-01  8:35 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 15:43:11 (+0800), Liu Yuan wrote :
> For a fifo read pattern, we only have one running aio

>(possible other cases that has less number than num_children in the future)
I have trouble understanding this part of the commit message could you try
to clarify it ?

> , so we need to check if
> .acb is NULL against bdrv_aio_cancel() to avoid segfault.
> 
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block/quorum.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index 9e056d6..b9eeda3 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -144,7 +144,9 @@ static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
>  
>      /* cancel all callbacks */
>      for (i = 0; i < s->num_children; i++) {
> -        bdrv_aio_cancel(acb->qcrs[i].aiocb);
> +        if (acb->qcrs[i].aiocb) {
> +            bdrv_aio_cancel(acb->qcrs[i].aiocb);
> +        }
>      }
>  
>      quorum_aio_release(acb);
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState Liu Yuan
@ 2014-09-01  8:57   ` Benoît Canet
  2014-09-01  9:30     ` Liu Yuan
  0 siblings, 1 reply; 31+ messages in thread
From: Benoît Canet @ 2014-09-01  8:57 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 15:43:12 (+0800), Liu Yuan wrote :
> This allow VM continues to process even if some devices are broken meanwhile
> with proper configuration.
> 
> We mark the device broken when the protocol tier notify back some broken
> state(s) of the device, such as diconnection via driver operations. We could
> also reset the device as sound when the protocol tier is repaired.
> 
> Origianlly .threshold controls how we should decide the success of read/write
> and return the failure only if the success count of read/write is less than
> .threshold specified by users. But it doesn't record the states of underlying
> states and will impact performance a bit in some cases.
> 
> For example, we have 3 children and .threshold is set 2. If one of the devices
> broken, we should still return success and continue to run VM. But for every
> IO operations, we will blindly send the requests to the broken device.
> 
> To store broken state into driver state we can save requests to borken devices
> and resend the requests to the repaired ones by setting broken as false.
> 
> This is especially useful for network based protocol such as sheepdog, which
> has a auto-reconnection mechanism and will never report EIO if the connection
> is broken but just store the requests to its local queue and wait for resending.
> Without broken state, quorum request will not come back until the connection is
> re-established. So we have to skip the broken deivces to allow VM to continue
> running with networked backed child (glusterfs, nfs, sheepdog, etc).
> 
> With the combination of read-pattern and threshold, we can easily mimic the DRVD
> behavior with following configuration:
> 
>  read-pattern=fifo,threshold=1 will two children.
> 
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block/quorum.c            | 102 ++++++++++++++++++++++++++++++++++++++--------
>  include/block/block_int.h |   3 ++
>  2 files changed, 87 insertions(+), 18 deletions(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index b9eeda3..7b07e35 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -120,6 +120,7 @@ struct QuorumAIOCB {
>      int rewrite_count;          /* number of replica to rewrite: count down to
>                                   * zero once writes are fired
>                                   */
> +    int issued_count;           /* actual read&write issued count */
>  
>      QuorumVotes votes;
>  
> @@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
>      if (acb->is_read) {
>          /* on the quorum case acb->child_iter == s->num_children - 1 */
>          for (i = 0; i <= acb->child_iter; i++) {
> -            qemu_vfree(acb->qcrs[i].buf);
> -            qemu_iovec_destroy(&acb->qcrs[i].qiov);
> +            if (acb->qcrs[i].buf) {
> +                qemu_vfree(acb->qcrs[i].buf);
> +                qemu_iovec_destroy(&acb->qcrs[i].qiov);
> +            }
>          }
>      }
>  
> @@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
>      acb->count = 0;
>      acb->success_count = 0;
>      acb->rewrite_count = 0;
> +    acb->issued_count = 0;
>      acb->votes.compare = quorum_sha256_compare;
>      QLIST_INIT(&acb->votes.vote_list);
>      acb->is_read = false;
> @@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
>      }
>  }
>  
> +static int next_fifo_child(QuorumAIOCB *acb)
> +{
> +    BDRVQuorumState *s = acb->common.bs->opaque;
> +    int i;
> +
> +    for (i = acb->child_iter; i < s->num_children; i++) {
> +        if (!s->bs[i]->broken) {
> +            break;
> +        }
> +    }
> +    if (i == s->num_children) {
> +        return -1;
> +    }
> +    return i;
> +}
> +
>  static void quorum_aio_cb(void *opaque, int ret)
>  {
>      QuorumChildRequest *sacb = opaque;
> @@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret)
>      BDRVQuorumState *s = acb->common.bs->opaque;
>      bool rewrite = false;
>  
> +    if (ret < 0) {
> +        s->bs[acb->child_iter]->broken = true;
> +    }

child_iter is fifo mode stuff.
Do we need to write if (s->read_pattern == QUORUM_READ_PATTERN_FIFO && ret < 0) here ?


> +
>      if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
>          /* We try to read next child in FIFO order if we fail to read */
> -        if (ret < 0 && ++acb->child_iter < s->num_children) {
> -            read_fifo_child(acb);
> -            return;
> +        if (ret < 0) {
> +            acb->child_iter = next_fifo_child(acb);

You don't seem to increment child_iter anymore.

> +            if (acb->child_iter > 0) {
> +                read_fifo_child(acb);
> +                return;
> +            }
>          }
>  
>          if (ret == 0) {
> @@ -315,9 +342,9 @@ static void quorum_aio_cb(void *opaque, int ret)
>      } else {
>          quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret);
>      }
> -    assert(acb->count <= s->num_children);
> -    assert(acb->success_count <= s->num_children);
> -    if (acb->count < s->num_children) {
> +    assert(acb->count <= acb->issued_count);
> +    assert(acb->success_count <= acb->issued_count);
> +    if (acb->count < acb->issued_count) {
>          return;
>      }
>  
> @@ -647,22 +674,46 @@ free_exit:
>      return rewrite;
>  }
>  
> +static bool has_enough_children(BDRVQuorumState *s)
> +{
> +    int good = 0, i;
> +
> +    for (i = 0; i < s->num_children; i++) {
> +        if (s->bs[i]->broken) {
> +            continue;
> +        }
> +        good++;
> +    }
> +
> +    if (good >= s->threshold) {
> +        return true;
> +    } else {
> +        return false;
> +    }
> +}
> +
>  static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
>  {
>      BDRVQuorumState *s = acb->common.bs->opaque;
>      int i;
>  
> +    if (!has_enough_children(s)) {
> +        quorum_aio_release(acb);
> +        return NULL;
> +    }
> +
>      for (i = 0; i < s->num_children; i++) {
> +        if (s->bs[i]->broken) {
> +            continue;
> +        }
>          acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
>          qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
>          qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
> -    }
> -
> -    for (i = 0; i < s->num_children; i++) {
>          acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
>                                              &acb->qcrs[i].qiov,
>                                              acb->nb_sectors, quorum_aio_cb,
>                                              &acb->qcrs[i]);
> +        acb->issued_count++;
>      }
>  
>      return &acb->common;
> @@ -679,7 +730,7 @@ static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
>      acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
>                                          &acb->qcrs[i].qiov, acb->nb_sectors,
>                                          quorum_aio_cb, &acb->qcrs[i]);
> -
> +    acb->issued_count = 1;
>      return &acb->common;
>  }
>  
> @@ -693,6 +744,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
>      BDRVQuorumState *s = bs->opaque;
>      QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
>                                        nb_sectors, cb, opaque);

> +
Spurious carriage return.

>      acb->is_read = true;
>  
>      if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> @@ -701,6 +753,11 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
>      }
>  
>      acb->child_iter = 0;
> +    acb->child_iter = next_fifo_child(acb);
> +    if (acb->child_iter < 0) {
> +        quorum_aio_release(acb);
> +        return NULL;
> +    }
>      return read_fifo_child(acb);
>  }
>  
> @@ -716,10 +773,19 @@ static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
>                                        cb, opaque);
>      int i;
>  
> +    if (!has_enough_children(s)) {
> +        quorum_aio_release(acb);
> +        return NULL;
> +    }
> +
>      for (i = 0; i < s->num_children; i++) {
> +        if (s->bs[i]->broken) {
> +            continue;
> +        }
>          acb->qcrs[i].aiocb = bdrv_aio_writev(s->bs[i], sector_num, qiov,
>                                               nb_sectors, &quorum_aio_cb,
>                                               &acb->qcrs[i]);
> +        acb->issued_count++;
>      }
>  
>      return &acb->common;
> @@ -926,6 +992,12 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>      }
>  
>      s->threshold = qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHOLD, 0);
> +    /* and validate it against s->num_children */
> +    ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> +    if (ret < 0) {
> +        goto exit;
> +    }
> +
>      ret = parse_read_pattern(qemu_opt_get(opts, QUORUM_OPT_READ_PATTERN));
>      if (ret < 0) {
>          error_setg(&local_err, "Please set read-pattern as fifo or quorum\n");
> @@ -934,12 +1006,6 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>      s->read_pattern = ret;
>  
>      if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> -        /* and validate it against s->num_children */
> -        ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> -        if (ret < 0) {
> -            goto exit;
> -        }
> -
>          /* is the driver in blkverify mode */
>          if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
>              s->num_children == 2 && s->threshold == 2) {
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 9fdec7f..599110b 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -402,6 +402,9 @@ struct BlockDriverState {
>  
>      /* The error object in use for blocking operations on backing_hd */
>      Error *backing_blocker;
> +
> +    /* True if the associated device is broken */
> +    bool broken;
>  };
>  
>  int get_tmp_filename(char *filename, int size);
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 7/8] block: add two helpers
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 7/8] block: add two helpers Liu Yuan
@ 2014-09-01  8:59   ` Benoît Canet
  0 siblings, 0 replies; 31+ messages in thread
From: Benoît Canet @ 2014-09-01  8:59 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 15:43:13 (+0800), Liu Yuan wrote :
> These helpers are needed by later quorum sync device logic.
> 
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block.c               | 10 ++++++++++
>  include/block/block.h |  2 ++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 22eb3e4..2e2f1d9 100644
> --- a/block.c
> +++ b/block.c
> @@ -2145,6 +2145,16 @@ void *bdrv_get_attached_dev(BlockDriverState *bs)
>      return bs->dev;
>  }
>  
> +BlockDriverState *bdrv_get_file(BlockDriverState *bs)
> +{
> +    return bs->file;
> +}
> +
> +const char *bdrv_get_filename(BlockDriverState *bs)
> +{
> +    return bs->filename;
> +}
> +

I don't think why we would need this. We will see in next patch.

>  void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
>                        void *opaque)
>  {
> diff --git a/include/block/block.h b/include/block/block.h
> index a61eaf0..1e116cc 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -237,6 +237,8 @@ int bdrv_attach_dev(BlockDriverState *bs, void *dev);
>  void bdrv_attach_dev_nofail(BlockDriverState *bs, void *dev);
>  void bdrv_detach_dev(BlockDriverState *bs, void *dev);
>  void *bdrv_get_attached_dev(BlockDriverState *bs);
> +BlockDriverState *bdrv_get_file(BlockDriverState *bs);
> +const char *bdrv_get_filename(BlockDriverState *bs);
>  void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
>                        void *opaque);
>  void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
  2014-09-01  8:28   ` Benoît Canet
@ 2014-09-01  9:19     ` Liu Yuan
  2014-09-01  9:28       ` Benoît Canet
  0 siblings, 1 reply; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  9:19 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Mon, Sep 01, 2014 at 10:28:54AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:08 (+0800), Liu Yuan wrote :
> > Driver operations are defined as callbacks passed from block upper drivers to
> > lower drivers and are supposed to be called by lower drivers.
> > 
> > Requests handling(queuing, submitting, etc.) are done in protocol tier in the
> > block layer and connection states are also maintained down there. Driver
> > operations are supposed to notify the upper tier (such as quorum) of the states
> > changes.
> > 
> > For now only two operation are added:
> > 
> > driver_disconnect: called when connection is off
> > driver_reconnect: called when connection is on after disconnection
> > 
> > Which are used to notify upper tier of the connection state.
> > 
> > Cc: Eric Blake <eblake@redhat.com>
> > Cc: Benoit Canet <benoit@irqsave.net>
> > Cc: Kevin Wolf <kwolf@redhat.com>
> > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> > ---
> >  block.c                   | 7 +++++++
> >  include/block/block.h     | 7 +++++++
> >  include/block/block_int.h | 3 +++
> >  3 files changed, 17 insertions(+)
> > 
> > diff --git a/block.c b/block.c
> > index c12b8de..22eb3e4 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
> >      bs->dev_opaque = opaque;
> >  }
> >  
> > +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
> > +                      void *opaque)
> > +{
> > +    bs->drv_ops = ops;
> > +    bs->drv_opaque = opaque;
> 
> We need to be very carefull of the mix between these fields and the infamous
> bdrv_swap function.
> 
> Also I don't know if "driver operations" is the right name since the BlockDriver structure's
> callback could also be named "driver operations".
> 

BlockDrvierState has a "device operation" for callbacks from devices. So I
choose "driver operation". So any sugguestion for better name?

Thanks
Yuan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver
  2014-09-01  8:31   ` Benoît Canet
@ 2014-09-01  9:22     ` Liu Yuan
  0 siblings, 0 replies; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  9:22 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Mon, Sep 01, 2014 at 10:31:47AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:09 (+0800), Liu Yuan wrote :
> > This is the reference usage how we propagate connection state to upper tier.
> > 
> > Cc: Eric Blake <eblake@redhat.com>
> > Cc: Benoit Canet <benoit@irqsave.net>
> > Cc: Kevin Wolf <kwolf@redhat.com>
> > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> > ---
> >  block/sheepdog.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/block/sheepdog.c b/block/sheepdog.c
> > index 53c24d6..9c0fc49 100644
> > --- a/block/sheepdog.c
> > +++ b/block/sheepdog.c
> > @@ -714,6 +714,11 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
> >  {
> >      BDRVSheepdogState *s = opaque;
> >      AIOReq *aio_req, *next;
> > +    BlockDriverState *bs = s->bs;
> > +
> > +    if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
> > +        bs->drv_ops->driver_disconnect(bs);
> > +    }
> 
> Since this sequence will be strictly the same for all the implementation
> could we create a bdrv_signal_disconnect(bs); in the block layer to make this
> code generic ?

I'm not sure if other protocol driver can have the same auto-reconnection logic.
Probably for simplicity, we keep it as is in the patch. Later when we get more
flesh of implementation of other protocols, we can make a better decision.

Thanks
Yuan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
  2014-09-01  8:35   ` Benoît Canet
@ 2014-09-01  9:26     ` Liu Yuan
  2014-09-01  9:32       ` Benoît Canet
  0 siblings, 1 reply; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  9:26 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Mon, Sep 01, 2014 at 10:35:27AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:11 (+0800), Liu Yuan wrote :
> > For a fifo read pattern, we only have one running aio
> 
> >(possible other cases that has less number than num_children in the future)
> I have trouble understanding this part of the commit message could you try
> to clarify it ?

Until this patch, we have two cases, read single or read all. But later patch
allow VMs to continue if some devices are down. So the discrete number 1 and N
becomes a range [1, N], that is possible running requests are from 1 to N.

Thanks
Yuan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
  2014-09-01  9:19     ` Liu Yuan
@ 2014-09-01  9:28       ` Benoît Canet
  2014-09-01  9:40         ` Liu Yuan
  0 siblings, 1 reply; 31+ messages in thread
From: Benoît Canet @ 2014-09-01  9:28 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Benoît Canet, Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 17:19:19 (+0800), Liu Yuan wrote :
> On Mon, Sep 01, 2014 at 10:28:54AM +0200, Benoît Canet wrote:
> > The Monday 01 Sep 2014 à 15:43:08 (+0800), Liu Yuan wrote :
> > > Driver operations are defined as callbacks passed from block upper drivers to
> > > lower drivers and are supposed to be called by lower drivers.
> > > 
> > > Requests handling(queuing, submitting, etc.) are done in protocol tier in the
> > > block layer and connection states are also maintained down there. Driver
> > > operations are supposed to notify the upper tier (such as quorum) of the states
> > > changes.
> > > 
> > > For now only two operation are added:
> > > 
> > > driver_disconnect: called when connection is off
> > > driver_reconnect: called when connection is on after disconnection
> > > 
> > > Which are used to notify upper tier of the connection state.
> > > 
> > > Cc: Eric Blake <eblake@redhat.com>
> > > Cc: Benoit Canet <benoit@irqsave.net>
> > > Cc: Kevin Wolf <kwolf@redhat.com>
> > > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> > > ---
> > >  block.c                   | 7 +++++++
> > >  include/block/block.h     | 7 +++++++
> > >  include/block/block_int.h | 3 +++
> > >  3 files changed, 17 insertions(+)
> > > 
> > > diff --git a/block.c b/block.c
> > > index c12b8de..22eb3e4 100644
> > > --- a/block.c
> > > +++ b/block.c
> > > @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
> > >      bs->dev_opaque = opaque;
> > >  }
> > >  
> > > +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
> > > +                      void *opaque)
> > > +{
> > > +    bs->drv_ops = ops;
> > > +    bs->drv_opaque = opaque;
> > 
> > We need to be very carefull of the mix between these fields and the infamous
> > bdrv_swap function.
> > 
> > Also I don't know if "driver operations" is the right name since the BlockDriver structure's
> > callback could also be named "driver operations".
> > 
> 
> BlockDrvierState has a "device operation" for callbacks from devices. So I
> choose "driver operation". So any sugguestion for better name?

>From what I see in this series the job of these callbacks is to send a message or a signal to
the upper BDS.

Also the name must reflect it goes from the child to the parent.

child_signals ?
child_messages ?

Best regards

Benoît

> 
> Thanks
> Yuan
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState
  2014-09-01  8:57   ` Benoît Canet
@ 2014-09-01  9:30     ` Liu Yuan
  0 siblings, 0 replies; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  9:30 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Mon, Sep 01, 2014 at 10:57:43AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:12 (+0800), Liu Yuan wrote :
> > This allow VM continues to process even if some devices are broken meanwhile
> > with proper configuration.
> > 
> > We mark the device broken when the protocol tier notify back some broken
> > state(s) of the device, such as diconnection via driver operations. We could
> > also reset the device as sound when the protocol tier is repaired.
> > 
> > Origianlly .threshold controls how we should decide the success of read/write
> > and return the failure only if the success count of read/write is less than
> > .threshold specified by users. But it doesn't record the states of underlying
> > states and will impact performance a bit in some cases.
> > 
> > For example, we have 3 children and .threshold is set 2. If one of the devices
> > broken, we should still return success and continue to run VM. But for every
> > IO operations, we will blindly send the requests to the broken device.
> > 
> > To store broken state into driver state we can save requests to borken devices
> > and resend the requests to the repaired ones by setting broken as false.
> > 
> > This is especially useful for network based protocol such as sheepdog, which
> > has a auto-reconnection mechanism and will never report EIO if the connection
> > is broken but just store the requests to its local queue and wait for resending.
> > Without broken state, quorum request will not come back until the connection is
> > re-established. So we have to skip the broken deivces to allow VM to continue
> > running with networked backed child (glusterfs, nfs, sheepdog, etc).
> > 
> > With the combination of read-pattern and threshold, we can easily mimic the DRVD
> > behavior with following configuration:
> > 
> >  read-pattern=fifo,threshold=1 will two children.
> > 
> > Cc: Eric Blake <eblake@redhat.com>
> > Cc: Benoit Canet <benoit@irqsave.net>
> > Cc: Kevin Wolf <kwolf@redhat.com>
> > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> > ---
> >  block/quorum.c            | 102 ++++++++++++++++++++++++++++++++++++++--------
> >  include/block/block_int.h |   3 ++
> >  2 files changed, 87 insertions(+), 18 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index b9eeda3..7b07e35 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -120,6 +120,7 @@ struct QuorumAIOCB {
> >      int rewrite_count;          /* number of replica to rewrite: count down to
> >                                   * zero once writes are fired
> >                                   */
> > +    int issued_count;           /* actual read&write issued count */
> >  
> >      QuorumVotes votes;
> >  
> > @@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >      if (acb->is_read) {
> >          /* on the quorum case acb->child_iter == s->num_children - 1 */
> >          for (i = 0; i <= acb->child_iter; i++) {
> > -            qemu_vfree(acb->qcrs[i].buf);
> > -            qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +            if (acb->qcrs[i].buf) {
> > +                qemu_vfree(acb->qcrs[i].buf);
> > +                qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +            }
> >          }
> >      }
> >  
> > @@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
> >      acb->count = 0;
> >      acb->success_count = 0;
> >      acb->rewrite_count = 0;
> > +    acb->issued_count = 0;
> >      acb->votes.compare = quorum_sha256_compare;
> >      QLIST_INIT(&acb->votes.vote_list);
> >      acb->is_read = false;
> > @@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> >      }
> >  }
> >  
> > +static int next_fifo_child(QuorumAIOCB *acb)
> > +{
> > +    BDRVQuorumState *s = acb->common.bs->opaque;
> > +    int i;
> > +
> > +    for (i = acb->child_iter; i < s->num_children; i++) {
> > +        if (!s->bs[i]->broken) {
> > +            break;
> > +        }
> > +    }
> > +    if (i == s->num_children) {
> > +        return -1;
> > +    }
> > +    return i;
> > +}
> > +
> >  static void quorum_aio_cb(void *opaque, int ret)
> >  {
> >      QuorumChildRequest *sacb = opaque;
> > @@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret)
> >      BDRVQuorumState *s = acb->common.bs->opaque;
> >      bool rewrite = false;
> >  
> > +    if (ret < 0) {
> > +        s->bs[acb->child_iter]->broken = true;
> > +    }
> 
> child_iter is fifo mode stuff.
> Do we need to write if (s->read_pattern == QUORUM_READ_PATTERN_FIFO && ret < 0) here ?

Probably not. child_iter denotes which bs the QuorumChildRequest belongs to.

> 
> > +
> >      if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
> >          /* We try to read next child in FIFO order if we fail to read */
> > -        if (ret < 0 && ++acb->child_iter < s->num_children) {
> > -            read_fifo_child(acb);
> > -            return;
> > +        if (ret < 0) {
> > +            acb->child_iter = next_fifo_child(acb);
> 
> You don't seem to increment child_iter anymore.

No, I use a function next_fifo_child() to get the next_fifo_child because right
now we have to skip the broken devices.


> 
> > +            if (acb->child_iter > 0) {
> > +                read_fifo_child(acb);
> > +                return;
> > +            }
> >          }
> >  
> >          if (ret == 0) {
> > @@ -315,9 +342,9 @@ static void quorum_aio_cb(void *opaque, int ret)
> >      } else {
> >          quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret);
> >      }
> > -    assert(acb->count <= s->num_children);
> > -    assert(acb->success_count <= s->num_children);
> > -    if (acb->count < s->num_children) {
> > +    assert(acb->count <= acb->issued_count);
> > +    assert(acb->success_count <= acb->issued_count);
> > +    if (acb->count < acb->issued_count) {
> >          return;
> >      }
> >  
> > @@ -647,22 +674,46 @@ free_exit:
> >      return rewrite;
> >  }
> >  
> > +static bool has_enough_children(BDRVQuorumState *s)
> > +{
> > +    int good = 0, i;
> > +
> > +    for (i = 0; i < s->num_children; i++) {
> > +        if (s->bs[i]->broken) {
> > +            continue;
> > +        }
> > +        good++;
> > +    }
> > +
> > +    if (good >= s->threshold) {
> > +        return true;
> > +    } else {
> > +        return false;
> > +    }
> > +}
> > +
> >  static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
> >  {
> >      BDRVQuorumState *s = acb->common.bs->opaque;
> >      int i;
> >  
> > +    if (!has_enough_children(s)) {
> > +        quorum_aio_release(acb);
> > +        return NULL;
> > +    }
> > +
> >      for (i = 0; i < s->num_children; i++) {
> > +        if (s->bs[i]->broken) {
> > +            continue;
> > +        }
> >          acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
> >          qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
> >          qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
> > -    }
> > -
> > -    for (i = 0; i < s->num_children; i++) {
> >          acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
> >                                              &acb->qcrs[i].qiov,
> >                                              acb->nb_sectors, quorum_aio_cb,
> >                                              &acb->qcrs[i]);
> > +        acb->issued_count++;
> >      }
> >  
> >      return &acb->common;
> > @@ -679,7 +730,7 @@ static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
> >      acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
> >                                          &acb->qcrs[i].qiov, acb->nb_sectors,
> >                                          quorum_aio_cb, &acb->qcrs[i]);
> > -
> > +    acb->issued_count = 1;
> >      return &acb->common;
> >  }
> >  
> > @@ -693,6 +744,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> >      BDRVQuorumState *s = bs->opaque;
> >      QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> >                                        nb_sectors, cb, opaque);
> 
> > +
> Spurious carriage return.

Good catch, dude. Thanks!

Yuan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
  2014-09-01  9:26     ` Liu Yuan
@ 2014-09-01  9:32       ` Benoît Canet
  2014-09-01  9:46         ` Liu Yuan
  0 siblings, 1 reply; 31+ messages in thread
From: Benoît Canet @ 2014-09-01  9:32 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Benoît Canet, Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 17:26:09 (+0800), Liu Yuan wrote :
> On Mon, Sep 01, 2014 at 10:35:27AM +0200, Benoît Canet wrote:
> > The Monday 01 Sep 2014 à 15:43:11 (+0800), Liu Yuan wrote :
> > > For a fifo read pattern, we only have one running aio
> > 
> > >(possible other cases that has less number than num_children in the future)
> > I have trouble understanding this part of the commit message could you try
> > to clarify it ?
> 
> Until this patch, we have two cases, read single or read all. But later patch
> allow VMs to continue if some devices are down. So the discrete number 1 and N
> becomes a range [1, N], that is possible running requests are from 1 to N.

Why not 

(In some other cases some children of the num_children BDS could be disabled
reducing the number of requests needed) ?

> 
> Thanks
> Yuan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic
  2014-09-01  7:43 ` [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic Liu Yuan
@ 2014-09-01  9:37   ` Benoît Canet
  2014-09-01  9:45     ` Liu Yuan
  0 siblings, 1 reply; 31+ messages in thread
From: Benoît Canet @ 2014-09-01  9:37 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 15:43:14 (+0800), Liu Yuan wrote :
> For some configuration, quorum allow VMs to continue while some child devices
> are broken and when the child devices are repaired and return back, we need to
> sync dirty bits during downtime to keep data consistency.
> 
> The recovery logic is based on the driver state bitmap and will sync the dirty
> bits with a timeslice window in a coroutine in this prtimive implementation.
> 
> Simple graph about 2 children with threshold=1 and read-pattern=fifo:
> 
> + denote device sync iteration
> - IO on a single device
> = IO on two devices
> 
>                                       sync complete, release dirty bitmap
>                                          ^
>                                          |
>   ====-----------------++++----++++----++==========
>      |                 |
>      |                 v
>      |               device repaired and begin to sync
>      v
>    device broken, create a dirty bitmap
> 
>   This sync logic can take care of nested broken problem, that devices are
>   broken while in sync. We just start a sync process after the devices are
>   repaired again and switch the devices from broken to sound only when the sync
>   completes.
> 
> For read-pattern=quorum mode, it enjoys the recovery logic without any problem.
> 
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block/quorum.c | 189 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  trace-events   |   5 ++
>  2 files changed, 191 insertions(+), 3 deletions(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index 7b07e35..ffd7c2d 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -23,6 +23,7 @@
>  #include "qapi/qmp/qlist.h"
>  #include "qapi/qmp/qstring.h"
>  #include "qapi-event.h"
> +#include "trace.h"
>  
>  #define HASH_LENGTH 32
>  
> @@ -31,6 +32,10 @@
>  #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
>  #define QUORUM_OPT_READ_PATTERN   "read-pattern"
>  
> +#define SLICE_TIME          100000000ULL /* 100 ms */
> +#define CHUNK_SIZE          (1 << 20) /* 1M */
> +#define SECTORS_PER_CHUNK   (CHUNK_SIZE >> BDRV_SECTOR_BITS)
> +
>  /* This union holds a vote hash value */
>  typedef union QuorumVoteValue {
>      char h[HASH_LENGTH];       /* SHA-256 hash */
> @@ -64,6 +69,7 @@ typedef struct QuorumVotes {
>  
>  /* the following structure holds the state of one quorum instance */
>  typedef struct BDRVQuorumState {
> +    BlockDriverState *mybs;/* Quorum block driver base state */
>      BlockDriverState **bs; /* children BlockDriverStates */
>      int num_children;      /* children count */
>      int threshold;         /* if less than threshold children reads gave the
> @@ -82,6 +88,10 @@ typedef struct BDRVQuorumState {
>                              */
>  
>      QuorumReadPattern read_pattern;
> +    BdrvDirtyBitmap *dirty_bitmap;
> +    uint8_t *sync_buf;
> +    HBitmapIter hbi;
> +    int64_t sector_num;
>  } BDRVQuorumState;
>  
>  typedef struct QuorumAIOCB QuorumAIOCB;
> @@ -290,12 +300,11 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
>      }
>  }
>  
> -static int next_fifo_child(QuorumAIOCB *acb)
> +static int get_good_child(BDRVQuorumState *s, int iter)
>  {
> -    BDRVQuorumState *s = acb->common.bs->opaque;
>      int i;
>  
> -    for (i = acb->child_iter; i < s->num_children; i++) {
> +    for (i = iter; i < s->num_children; i++) {
>          if (!s->bs[i]->broken) {
>              break;
>          }
> @@ -306,6 +315,13 @@ static int next_fifo_child(QuorumAIOCB *acb)
>      return i;
>  }
>  
> +static int next_fifo_child(QuorumAIOCB *acb)
> +{
> +    BDRVQuorumState *s = acb->common.bs->opaque;
> +
> +    return get_good_child(s, acb->child_iter);
> +}
> +
>  static void quorum_aio_cb(void *opaque, int ret)
>  {
>      QuorumChildRequest *sacb = opaque;
> @@ -951,6 +967,171 @@ static int parse_read_pattern(const char *opt)
>      return -EINVAL;
>  }
>  
> +static void sync_prepare(BDRVQuorumState *qs, int64_t *num)
> +{
> +    int64_t nb, total = bdrv_nb_sectors(qs->mybs);
> +
> +    qs->sector_num = hbitmap_iter_next(&qs->hbi);
> +    /* Wrap around if previous bits get dirty while syncing */
> +    if (qs->sector_num < 0) {
> +        bdrv_dirty_iter_init(qs->mybs, qs->dirty_bitmap, &qs->hbi);
> +        qs->sector_num = hbitmap_iter_next(&qs->hbi);
> +        assert(qs->sector_num >= 0);
> +    }
> +
> +    for (nb = 1; nb < SECTORS_PER_CHUNK && qs->sector_num + nb < total;
> +         nb++) {
> +        if (!bdrv_get_dirty(qs->mybs, qs->dirty_bitmap, qs->sector_num + nb)) {
> +            break;
> +        }
> +    }
> +    *num = nb;
> +}
> +
> +static void sync_finish(BDRVQuorumState *qs, int64_t num)
> +{
> +    int64_t i;
> +
> +    for (i = 0; i < num; i++) {
> +        /* We need to advance the iterator manually */
> +        hbitmap_iter_next(&qs->hbi);
> +    }
> +    bdrv_reset_dirty(qs->mybs, qs->sector_num, num);
> +}
> +
> +static int quorum_sync_iteration(BDRVQuorumState *qs, BlockDriverState *target)
> +{
> +    BlockDriverState *source;
> +    QEMUIOVector qiov;
> +    int ret, good;
> +    int64_t nb_sectors;
> +    struct iovec iov;
> +    const char *sname, *tname = bdrv_get_filename(target);
> +
> +    good = get_good_child(qs, 0);
> +    if (good < 0) {
> +        error_report("No good device available.");
> +        return -1;
> +    }
> +    source = qs->bs[good];
> +    sname = bdrv_get_filename(source);
> +    sync_prepare(qs, &nb_sectors);
> +    iov.iov_base = qs->sync_buf;
> +    iov.iov_len  = nb_sectors * BDRV_SECTOR_SIZE;
> +    qemu_iovec_init_external(&qiov, &iov, 1);
> +
> +    trace_quorum_sync_iteration(sname, tname, qs->sector_num, nb_sectors);
> +    ret = bdrv_co_readv(source, qs->sector_num, nb_sectors, &qiov);
> +    if (ret < 0) {
> +        error_report("Read source %s failed.", sname);

I didn't read this patch throughfully but in quorum if you need to name a child BDS
you must use bs->node_name.

bs->node_name was introduced to be able to merge quorum and uniquely identify a given
node of the BDS graph.

Best regards

Benoît

> +        return ret;
> +    }
> +    ret = bdrv_co_writev(target, qs->sector_num, nb_sectors, &qiov);
> +    if (ret < 0) {
> +        error_report("Write target %s failed.", tname);
> +        return ret;
> +    }
> +    sync_finish(qs, nb_sectors);
> +
> +    return 0;
> +}
> +
> +static int quorum_sync_device(BDRVQuorumState *qs, BlockDriverState *target)
> +{
> +    uint64_t last_pause_ns;
> +
> +    bdrv_dirty_iter_init(qs->mybs, qs->dirty_bitmap, &qs->hbi);
> +    last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +    for (;;) {
> +        int64_t cnt;
> +
> +        cnt = bdrv_get_dirty_count(qs->mybs, qs->dirty_bitmap);
> +        if (cnt == 0) {
> +            break;
> +        }
> +        error_report("count %ld", cnt);
> +        if (quorum_sync_iteration(qs, target) < 0) {
> +            return -1;
> +        }
> +        cnt = bdrv_get_dirty_count(qs->mybs, qs->dirty_bitmap);
> +        if (cnt == 0) {
> +            break;
> +        }
> +
> +        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - last_pause_ns >=
> +            SLICE_TIME) {
> +            co_aio_sleep_ns(bdrv_get_aio_context(target), QEMU_CLOCK_REALTIME,
> +                            SLICE_TIME);
> +            last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static BlockDriverState *file_to_bs(BDRVQuorumState *qs, BlockDriverState *file)
> +{
> +    int i;
> +
> +    for (i = 0; i < qs->num_children; i++) {
> +        BlockDriverState *f = bdrv_get_file(qs->bs[i]);
> +
> +        if (f == file) {
> +            return qs->bs[i];
> +        }
> +    }
> +
> +    error_report("Can't find driver state for %s", bdrv_get_filename(file));
> +    abort();
> +}
> +
> +static void quorum_driver_reconnect(BlockDriverState *file)
> +{
> +    BDRVQuorumState *qs = file->drv_opaque;
> +    BlockDriverState *bs = file_to_bs(qs, file);
> +    const char *name = bdrv_get_filename(bs);
> +
> +    trace_quorum_driver_reconnect(name);
> +    assert(bs->broken == true);
> +    if (quorum_sync_device(qs, bs) < 0) {
> +        error_report("Failed to sync device %s", name);
> +        return;
> +    }
> +
> +    bdrv_release_dirty_bitmap(qs->mybs, qs->dirty_bitmap);
> +    qemu_vfree(qs->sync_buf);
> +    bs->broken = false;
> +}
> +
> +static void quorum_driver_disconnect(BlockDriverState *file)
> +{
> +    BDRVQuorumState *qs = file->drv_opaque;
> +    BlockDriverState *bs = file_to_bs(qs, file);
> +    const char *name = bdrv_get_filename(bs);
> +
> +    trace_quorum_driver_disconnect(name);
> +    /*
> +     * If we are disconnected while being syncing, we expect to reconnect to the
> +     * target again and resume the data sync from the last synced point.
> +     */
> +    if (bs->broken) {
> +        return;
> +    }
> +
> +    bs->broken = true;
> +    qs->dirty_bitmap = bdrv_create_dirty_bitmap(qs->mybs, BDRV_SECTOR_SIZE,
> +                                                NULL);
> +    if (!qs->dirty_bitmap) {
> +        abort();
> +    }
> +    qs->sync_buf = qemu_blockalign(bs, CHUNK_SIZE);
> +}
> +
> +static const BlockDrvOps quorum_block_drv_ops = {
> +    .driver_reconnect = quorum_driver_reconnect,
> +    .driver_disconnect = quorum_driver_disconnect,
> +};
> +
>  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>                         Error **errp)
>  {
> @@ -975,6 +1156,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>          goto exit;
>      }
>  
> +    s->mybs = bs;
>      /* count how many different children are present */
>      s->num_children = qlist_size(list);
>      if (s->num_children < 2) {
> @@ -1061,6 +1243,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>              goto close_exit;
>          }
>          opened[i] = true;
> +        bdrv_set_drv_ops(bdrv_get_file(s->bs[i]), &quorum_block_drv_ops, s);
>      }
>  
>      g_free(opened);
> diff --git a/trace-events b/trace-events
> index 81bc915..8da0a13 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -572,6 +572,11 @@ qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t o
>  qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
>  qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
>  
> +# block/quorum.c
> +quorum_sync_iteration(const char *source, const char *target, int64_t sector, int num) "%s -> %s, sector %"PRId64" nb_sectors %d"
> +quorum_driver_reconnect(const char *target) "%s"
> +quorum_driver_disconnect(const char *target) "%s"
> +
>  # hw/display/g364fb.c
>  g364fb_read(uint64_t addr, uint32_t val) "read addr=0x%"PRIx64": 0x%x"
>  g364fb_write(uint64_t addr, uint32_t new) "write addr=0x%"PRIx64": 0x%x"
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
  2014-09-01  9:28       ` Benoît Canet
@ 2014-09-01  9:40         ` Liu Yuan
  0 siblings, 0 replies; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  9:40 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Mon, Sep 01, 2014 at 11:28:22AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 17:19:19 (+0800), Liu Yuan wrote :
> > On Mon, Sep 01, 2014 at 10:28:54AM +0200, Benoît Canet wrote:
> > > The Monday 01 Sep 2014 à 15:43:08 (+0800), Liu Yuan wrote :
> > > > Driver operations are defined as callbacks passed from block upper drivers to
> > > > lower drivers and are supposed to be called by lower drivers.
> > > > 
> > > > Requests handling(queuing, submitting, etc.) are done in protocol tier in the
> > > > block layer and connection states are also maintained down there. Driver
> > > > operations are supposed to notify the upper tier (such as quorum) of the states
> > > > changes.
> > > > 
> > > > For now only two operation are added:
> > > > 
> > > > driver_disconnect: called when connection is off
> > > > driver_reconnect: called when connection is on after disconnection
> > > > 
> > > > Which are used to notify upper tier of the connection state.
> > > > 
> > > > Cc: Eric Blake <eblake@redhat.com>
> > > > Cc: Benoit Canet <benoit@irqsave.net>
> > > > Cc: Kevin Wolf <kwolf@redhat.com>
> > > > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > > > Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> > > > ---
> > > >  block.c                   | 7 +++++++
> > > >  include/block/block.h     | 7 +++++++
> > > >  include/block/block_int.h | 3 +++
> > > >  3 files changed, 17 insertions(+)
> > > > 
> > > > diff --git a/block.c b/block.c
> > > > index c12b8de..22eb3e4 100644
> > > > --- a/block.c
> > > > +++ b/block.c
> > > > @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
> > > >      bs->dev_opaque = opaque;
> > > >  }
> > > >  
> > > > +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
> > > > +                      void *opaque)
> > > > +{
> > > > +    bs->drv_ops = ops;
> > > > +    bs->drv_opaque = opaque;
> > > 
> > > We need to be very carefull of the mix between these fields and the infamous
> > > bdrv_swap function.
> > > 
> > > Also I don't know if "driver operations" is the right name since the BlockDriver structure's
> > > callback could also be named "driver operations".
> > > 
> > 
> > BlockDrvierState has a "device operation" for callbacks from devices. So I
> > choose "driver operation". So any sugguestion for better name?
> 
> From what I see in this series the job of these callbacks is to send a message or a signal to
> the upper BDS.
> 
> Also the name must reflect it goes from the child to the parent.
> 
> child_signals ?
> child_messages ?
> 

As far as I see, put child in the name will make it too quorum centric. Since it
is operation in BlockDriverState, we need to keep it as generic as we could.

These operations [here we mean disconnect() and reconnect(), but probably later
some other will add more opeartions] are passed from 'upper driver' to protocol
driver [in the code we call the protocol as 'file' driver, a narrow name too],
so I chose to name it as 'driver operation'. If we can rename 'file' as
protocol, include file, nfs, sheepdog, etc, such as

bdrv_create_file -> bdrv_create_protocol
bs.file -> bs.protocol

then the 'driver operation' here would sound better.

Thanks
Yuan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic
  2014-09-01  9:37   ` Benoît Canet
@ 2014-09-01  9:45     ` Liu Yuan
  0 siblings, 0 replies; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  9:45 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Mon, Sep 01, 2014 at 11:37:20AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:14 (+0800), Liu Yuan wrote :
> > For some configuration, quorum allow VMs to continue while some child devices
> > are broken and when the child devices are repaired and return back, we need to
> > sync dirty bits during downtime to keep data consistency.
> > 
> > The recovery logic is based on the driver state bitmap and will sync the dirty
> > bits with a timeslice window in a coroutine in this prtimive implementation.
> > 
> > Simple graph about 2 children with threshold=1 and read-pattern=fifo:
> > 
> > + denote device sync iteration
> > - IO on a single device
> > = IO on two devices
> > 
> >                                       sync complete, release dirty bitmap
> >                                          ^
> >                                          |
> >   ====-----------------++++----++++----++==========
> >      |                 |
> >      |                 v
> >      |               device repaired and begin to sync
> >      v
> >    device broken, create a dirty bitmap
> > 
> >   This sync logic can take care of nested broken problem, that devices are
> >   broken while in sync. We just start a sync process after the devices are
> >   repaired again and switch the devices from broken to sound only when the sync
> >   completes.
> > 
> > For read-pattern=quorum mode, it enjoys the recovery logic without any problem.
> > 
> > Cc: Eric Blake <eblake@redhat.com>
> > Cc: Benoit Canet <benoit@irqsave.net>
> > Cc: Kevin Wolf <kwolf@redhat.com>
> > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> > ---
> >  block/quorum.c | 189 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  trace-events   |   5 ++
> >  2 files changed, 191 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index 7b07e35..ffd7c2d 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -23,6 +23,7 @@
> >  #include "qapi/qmp/qlist.h"
> >  #include "qapi/qmp/qstring.h"
> >  #include "qapi-event.h"
> > +#include "trace.h"
> >  
> >  #define HASH_LENGTH 32
> >  
> > @@ -31,6 +32,10 @@
> >  #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
> >  #define QUORUM_OPT_READ_PATTERN   "read-pattern"
> >  
> > +#define SLICE_TIME          100000000ULL /* 100 ms */
> > +#define CHUNK_SIZE          (1 << 20) /* 1M */
> > +#define SECTORS_PER_CHUNK   (CHUNK_SIZE >> BDRV_SECTOR_BITS)
> > +
> >  /* This union holds a vote hash value */
> >  typedef union QuorumVoteValue {
> >      char h[HASH_LENGTH];       /* SHA-256 hash */
> > @@ -64,6 +69,7 @@ typedef struct QuorumVotes {
> >  
> >  /* the following structure holds the state of one quorum instance */
> >  typedef struct BDRVQuorumState {
> > +    BlockDriverState *mybs;/* Quorum block driver base state */
> >      BlockDriverState **bs; /* children BlockDriverStates */
> >      int num_children;      /* children count */
> >      int threshold;         /* if less than threshold children reads gave the
> > @@ -82,6 +88,10 @@ typedef struct BDRVQuorumState {
> >                              */
> >  
> >      QuorumReadPattern read_pattern;
> > +    BdrvDirtyBitmap *dirty_bitmap;
> > +    uint8_t *sync_buf;
> > +    HBitmapIter hbi;
> > +    int64_t sector_num;
> >  } BDRVQuorumState;
> >  
> >  typedef struct QuorumAIOCB QuorumAIOCB;
> > @@ -290,12 +300,11 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> >      }
> >  }
> >  
> > -static int next_fifo_child(QuorumAIOCB *acb)
> > +static int get_good_child(BDRVQuorumState *s, int iter)
> >  {
> > -    BDRVQuorumState *s = acb->common.bs->opaque;
> >      int i;
> >  
> > -    for (i = acb->child_iter; i < s->num_children; i++) {
> > +    for (i = iter; i < s->num_children; i++) {
> >          if (!s->bs[i]->broken) {
> >              break;
> >          }
> > @@ -306,6 +315,13 @@ static int next_fifo_child(QuorumAIOCB *acb)
> >      return i;
> >  }
> >  
> > +static int next_fifo_child(QuorumAIOCB *acb)
> > +{
> > +    BDRVQuorumState *s = acb->common.bs->opaque;
> > +
> > +    return get_good_child(s, acb->child_iter);
> > +}
> > +
> >  static void quorum_aio_cb(void *opaque, int ret)
> >  {
> >      QuorumChildRequest *sacb = opaque;
> > @@ -951,6 +967,171 @@ static int parse_read_pattern(const char *opt)
> >      return -EINVAL;
> >  }
> >  
> > +static void sync_prepare(BDRVQuorumState *qs, int64_t *num)
> > +{
> > +    int64_t nb, total = bdrv_nb_sectors(qs->mybs);
> > +
> > +    qs->sector_num = hbitmap_iter_next(&qs->hbi);
> > +    /* Wrap around if previous bits get dirty while syncing */
> > +    if (qs->sector_num < 0) {
> > +        bdrv_dirty_iter_init(qs->mybs, qs->dirty_bitmap, &qs->hbi);
> > +        qs->sector_num = hbitmap_iter_next(&qs->hbi);
> > +        assert(qs->sector_num >= 0);
> > +    }
> > +
> > +    for (nb = 1; nb < SECTORS_PER_CHUNK && qs->sector_num + nb < total;
> > +         nb++) {
> > +        if (!bdrv_get_dirty(qs->mybs, qs->dirty_bitmap, qs->sector_num + nb)) {
> > +            break;
> > +        }
> > +    }
> > +    *num = nb;
> > +}
> > +
> > +static void sync_finish(BDRVQuorumState *qs, int64_t num)
> > +{
> > +    int64_t i;
> > +
> > +    for (i = 0; i < num; i++) {
> > +        /* We need to advance the iterator manually */
> > +        hbitmap_iter_next(&qs->hbi);
> > +    }
> > +    bdrv_reset_dirty(qs->mybs, qs->sector_num, num);
> > +}
> > +
> > +static int quorum_sync_iteration(BDRVQuorumState *qs, BlockDriverState *target)
> > +{
> > +    BlockDriverState *source;
> > +    QEMUIOVector qiov;
> > +    int ret, good;
> > +    int64_t nb_sectors;
> > +    struct iovec iov;
> > +    const char *sname, *tname = bdrv_get_filename(target);
> > +
> > +    good = get_good_child(qs, 0);
> > +    if (good < 0) {
> > +        error_report("No good device available.");
> > +        return -1;
> > +    }
> > +    source = qs->bs[good];
> > +    sname = bdrv_get_filename(source);
> > +    sync_prepare(qs, &nb_sectors);
> > +    iov.iov_base = qs->sync_buf;
> > +    iov.iov_len  = nb_sectors * BDRV_SECTOR_SIZE;
> > +    qemu_iovec_init_external(&qiov, &iov, 1);
> > +
> > +    trace_quorum_sync_iteration(sname, tname, qs->sector_num, nb_sectors);
> > +    ret = bdrv_co_readv(source, qs->sector_num, nb_sectors, &qiov);
> > +    if (ret < 0) {
> > +        error_report("Read source %s failed.", sname);
> 
> I didn't read this patch throughfully but in quorum if you need to name a child BDS
> you must use bs->node_name.
> 
> bs->node_name was introduced to be able to merge quorum and uniquely identify a given
> node of the BDS graph.

Ah I see, thanks for reminding. Will do in next version.

Stefan and Kevin, a minor question I need to make sure if it is conventional to
add a helper to access any member of bs outside block.c? In this case, I need to
add bdrv_get_node_name(struct *bs)?

Thanks
Yuan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
  2014-09-01  9:32       ` Benoît Canet
@ 2014-09-01  9:46         ` Liu Yuan
  0 siblings, 0 replies; 31+ messages in thread
From: Liu Yuan @ 2014-09-01  9:46 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Mon, Sep 01, 2014 at 11:32:04AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 17:26:09 (+0800), Liu Yuan wrote :
> > On Mon, Sep 01, 2014 at 10:35:27AM +0200, Benoît Canet wrote:
> > > The Monday 01 Sep 2014 à 15:43:11 (+0800), Liu Yuan wrote :
> > > > For a fifo read pattern, we only have one running aio
> > > 
> > > >(possible other cases that has less number than num_children in the future)
> > > I have trouble understanding this part of the commit message could you try
> > > to clarify it ?
> > 
> > Until this patch, we have two cases, read single or read all. But later patch
> > allow VMs to continue if some devices are down. So the discrete number 1 and N
> > becomes a range [1, N], that is possible running requests are from 1 to N.
> 
> Why not 
> 
> (In some other cases some children of the num_children BDS could be disabled
> reducing the number of requests needed) ?
> 

Sounds better, I'll take yours, thanks!

Yuan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
                   ` (8 preceding siblings ...)
  2014-09-01  8:19 ` [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Benoît Canet
@ 2014-09-02 22:19 ` Benoît Canet
  2014-09-10  7:31   ` Liu Yuan
  2014-09-07 15:12 ` Benoît Canet
  10 siblings, 1 reply; 31+ messages in thread
From: Benoît Canet @ 2014-09-02 22:19 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote :

Liu,

Do you think this could work with qcow2 file backed by NFS servers ?

Best regards

Benoît

> This patch set mainly add mainly two logics to implement device recover
> - notify qourum driver of the broken states from the child driver(s)
> - dirty track and sync the device after it is repaired
> 
> Thus quorum allow VMs to continue while some child devices are broken and when
> the child devices are repaired and return back, we sync dirty bits during
> downtime to keep data consistency.
> 
> The recovery logic is based on the driver state bitmap and will sync the dirty
> bits with a timeslice window in a coroutine in this prtimive implementation.
> 
> Simple graph about 2 children with threshold=1 and read-pattern=fifo:
> (similary to DRBD)
> 
> + denote device sync iteration
> - IO on a single device
> = IO on two devices
> 
>                                       sync complete, release dirty bitmap
>                                          ^
>                                          |
>   ====-----------------++++----++++----++==========
>      |                 |
>      |                 v
>      |               device repaired and begin to sync
>      v
>    device broken, create a dirty bitmap
> 
>   This sync logic can take care of nested broken problem, that devices are
>   broken while in sync. We just start a sync process after the devices are
>   repaired again and switch the devices from broken to sound only when the sync
>   completes.
> 
> For read-pattern=quorum mode, it enjoys the recovery logic without any problem.
> 
> Todo:
> - use aio interface to sync data (multiple transfer in one go)
> - dynamic slice window to control sync bandwidth more smoothly
> - add auto-reconnection mechanism to other protocol (if not support yet)
> - add tests
> 
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Liu Yuan (8):
>   block/quorum: initialize qcrs.aiocb for read
>   block: add driver operation callbacks
>   block/sheepdog: propagate disconnect/reconnect events to upper driver
>   block/quorum: add quorum_aio_release() helper
>   quorum: fix quorum_aio_cancel()
>   block/quorum: add broken state to BlockDriverState
>   block: add two helpers
>   quorum: add basic device recovery logic
> 
>  block.c                   |  17 +++
>  block/quorum.c            | 324 +++++++++++++++++++++++++++++++++++++++++-----
>  block/sheepdog.c          |   9 ++
>  include/block/block.h     |   9 ++
>  include/block/block_int.h |   6 +
>  trace-events              |   5 +
>  6 files changed, 336 insertions(+), 34 deletions(-)
> 
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
  2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
                   ` (9 preceding siblings ...)
  2014-09-02 22:19 ` Benoît Canet
@ 2014-09-07 15:12 ` Benoît Canet
  2014-09-10  7:18   ` Liu Yuan
  10 siblings, 1 reply; 31+ messages in thread
From: Benoît Canet @ 2014-09-07 15:12 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote :
> This patch set mainly add mainly two logics to implement device recover
> - notify qourum driver of the broken states from the child driver(s)
> - dirty track and sync the device after it is repaired
> 
> Thus quorum allow VMs to continue while some child devices are broken and when
> the child devices are repaired and return back, we sync dirty bits during
> downtime to keep data consistency.
> 
> The recovery logic is based on the driver state bitmap and will sync the dirty
> bits with a timeslice window in a coroutine in this prtimive implementation.
> 
> Simple graph about 2 children with threshold=1 and read-pattern=fifo:
> (similary to DRBD)
> 
> + denote device sync iteration
> - IO on a single device
> = IO on two devices
> 
>                                       sync complete, release dirty bitmap
>                                          ^
>                                          |
>   ====-----------------++++----++++----++==========
>      |                 |
>      |                 v
>      |               device repaired and begin to sync
>      v
>    device broken, create a dirty bitmap
> 
>   This sync logic can take care of nested broken problem, that devices are
>   broken while in sync. We just start a sync process after the devices are
>   repaired again and switch the devices from broken to sound only when the sync
>   completes.
> 
> For read-pattern=quorum mode, it enjoys the recovery logic without any problem.
> 
> Todo:
> - use aio interface to sync data (multiple transfer in one go)
> - dynamic slice window to control sync bandwidth more smoothly
> - add auto-reconnection mechanism to other protocol (if not support yet)
> - add tests
> 
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Liu Yuan (8):
>   block/quorum: initialize qcrs.aiocb for read
>   block: add driver operation callbacks
>   block/sheepdog: propagate disconnect/reconnect events to upper driver
>   block/quorum: add quorum_aio_release() helper
>   quorum: fix quorum_aio_cancel()
>   block/quorum: add broken state to BlockDriverState
>   block: add two helpers
>   quorum: add basic device recovery logic
> 
>  block.c                   |  17 +++
>  block/quorum.c            | 324 +++++++++++++++++++++++++++++++++++++++++-----
>  block/sheepdog.c          |   9 ++
>  include/block/block.h     |   9 ++
>  include/block/block_int.h |   6 +
>  trace-events              |   5 +
>  6 files changed, 336 insertions(+), 34 deletions(-)
> 
> -- 
> 1.9.1
> 

Hi liu,

Had you noticed that your series conflict with one of Fam's series in the quorum cancel
function fix patch ?

Could you find an arrangement with Fam so the two patches don't collide anymore ?

Do you intend to respin your series ?

Best regards

Benoît

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
  2014-09-07 15:12 ` Benoît Canet
@ 2014-09-10  7:18   ` Liu Yuan
  2014-09-10 13:12     ` Benoît Canet
  0 siblings, 1 reply; 31+ messages in thread
From: Liu Yuan @ 2014-09-10  7:18 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Sun, Sep 07, 2014 at 05:12:31PM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote :
> > This patch set mainly add mainly two logics to implement device recover
> > - notify qourum driver of the broken states from the child driver(s)
> > - dirty track and sync the device after it is repaired
> > 
> > Thus quorum allow VMs to continue while some child devices are broken and when
> > the child devices are repaired and return back, we sync dirty bits during
> > downtime to keep data consistency.
> > 
> > The recovery logic is based on the driver state bitmap and will sync the dirty
> > bits with a timeslice window in a coroutine in this prtimive implementation.
> > 
> > Simple graph about 2 children with threshold=1 and read-pattern=fifo:
> > (similary to DRBD)
> > 
> > + denote device sync iteration
> > - IO on a single device
> > = IO on two devices
> > 
> >                                       sync complete, release dirty bitmap
> >                                          ^
> >                                          |
> >   ====-----------------++++----++++----++==========
> >      |                 |
> >      |                 v
> >      |               device repaired and begin to sync
> >      v
> >    device broken, create a dirty bitmap
> > 
> >   This sync logic can take care of nested broken problem, that devices are
> >   broken while in sync. We just start a sync process after the devices are
> >   repaired again and switch the devices from broken to sound only when the sync
> >   completes.
> > 
> > For read-pattern=quorum mode, it enjoys the recovery logic without any problem.
> > 
> > Todo:
> > - use aio interface to sync data (multiple transfer in one go)
> > - dynamic slice window to control sync bandwidth more smoothly
> > - add auto-reconnection mechanism to other protocol (if not support yet)
> > - add tests
> > 
> > Cc: Eric Blake <eblake@redhat.com>
> > Cc: Benoit Canet <benoit@irqsave.net>
> > Cc: Kevin Wolf <kwolf@redhat.com>
> > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > Liu Yuan (8):
> >   block/quorum: initialize qcrs.aiocb for read
> >   block: add driver operation callbacks
> >   block/sheepdog: propagate disconnect/reconnect events to upper driver
> >   block/quorum: add quorum_aio_release() helper
> >   quorum: fix quorum_aio_cancel()
> >   block/quorum: add broken state to BlockDriverState
> >   block: add two helpers
> >   quorum: add basic device recovery logic
> > 
> >  block.c                   |  17 +++
> >  block/quorum.c            | 324 +++++++++++++++++++++++++++++++++++++++++-----
> >  block/sheepdog.c          |   9 ++
> >  include/block/block.h     |   9 ++
> >  include/block/block_int.h |   6 +
> >  trace-events              |   5 +
> >  6 files changed, 336 insertions(+), 34 deletions(-)
> > 
> > -- 
> > 1.9.1
> > 
> 
> Hi liu,
> 
> Had you noticed that your series conflict with one of Fam's series in the quorum cancel
> function fix patch ?

Not yet, thanks for reminding.

> Could you find an arrangement with Fam so the two patches don't collide anymore ?
> 
> Do you intend to respin your series ?

Yes, I'll rebase the v2 later before more possible reviews.

Thanks
Yuan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
  2014-09-02 22:19 ` Benoît Canet
@ 2014-09-10  7:31   ` Liu Yuan
  0 siblings, 0 replies; 31+ messages in thread
From: Liu Yuan @ 2014-09-10  7:31 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Wed, Sep 03, 2014 at 12:19:14AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote :
> 
> Liu,
> 
> Do you think this could work with qcow2 file backed by NFS servers ?

It depends on which client we use.

If we use Linux NFS client by 'posix file' protocol, I guess we need to hack
raw-posix.c and insert reconnect/disconnect callbacks.

I'm exploring the possibility of QEMU's nfs client block/nfs.c too.

Either way, this should work with all the protocols with proper hacks.

Thanks
Yuan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
  2014-09-10  7:18   ` Liu Yuan
@ 2014-09-10 13:12     ` Benoît Canet
  0 siblings, 0 replies; 31+ messages in thread
From: Benoît Canet @ 2014-09-10 13:12 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Benoît Canet, Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Wednesday 10 Sep 2014 à 15:18:22 (+0800), Liu Yuan wrote :
> On Sun, Sep 07, 2014 at 05:12:31PM +0200, Benoît Canet wrote:
> > The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote :
> > > This patch set mainly add mainly two logics to implement device recover
> > > - notify qourum driver of the broken states from the child driver(s)
> > > - dirty track and sync the device after it is repaired
> > > 
> > > Thus quorum allow VMs to continue while some child devices are broken and when
> > > the child devices are repaired and return back, we sync dirty bits during
> > > downtime to keep data consistency.
> > > 
> > > The recovery logic is based on the driver state bitmap and will sync the dirty
> > > bits with a timeslice window in a coroutine in this prtimive implementation.
> > > 
> > > Simple graph about 2 children with threshold=1 and read-pattern=fifo:
> > > (similary to DRBD)
> > > 
> > > + denote device sync iteration
> > > - IO on a single device
> > > = IO on two devices
> > > 
> > >                                       sync complete, release dirty bitmap
> > >                                          ^
> > >                                          |
> > >   ====-----------------++++----++++----++==========
> > >      |                 |
> > >      |                 v
> > >      |               device repaired and begin to sync
> > >      v
> > >    device broken, create a dirty bitmap
> > > 
> > >   This sync logic can take care of nested broken problem, that devices are
> > >   broken while in sync. We just start a sync process after the devices are
> > >   repaired again and switch the devices from broken to sound only when the sync
> > >   completes.
> > > 
> > > For read-pattern=quorum mode, it enjoys the recovery logic without any problem.
> > > 
> > > Todo:
> > > - use aio interface to sync data (multiple transfer in one go)
> > > - dynamic slice window to control sync bandwidth more smoothly
> > > - add auto-reconnection mechanism to other protocol (if not support yet)
> > > - add tests
> > > 
> > > Cc: Eric Blake <eblake@redhat.com>
> > > Cc: Benoit Canet <benoit@irqsave.net>
> > > Cc: Kevin Wolf <kwolf@redhat.com>
> > > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > > 
> > > Liu Yuan (8):
> > >   block/quorum: initialize qcrs.aiocb for read
> > >   block: add driver operation callbacks
> > >   block/sheepdog: propagate disconnect/reconnect events to upper driver
> > >   block/quorum: add quorum_aio_release() helper
> > >   quorum: fix quorum_aio_cancel()
> > >   block/quorum: add broken state to BlockDriverState
> > >   block: add two helpers
> > >   quorum: add basic device recovery logic
> > > 
> > >  block.c                   |  17 +++
> > >  block/quorum.c            | 324 +++++++++++++++++++++++++++++++++++++++++-----
> > >  block/sheepdog.c          |   9 ++
> > >  include/block/block.h     |   9 ++
> > >  include/block/block_int.h |   6 +
> > >  trace-events              |   5 +
> > >  6 files changed, 336 insertions(+), 34 deletions(-)
> > > 
> > > -- 
> > > 1.9.1
> > > 
> > 
> > Hi liu,
> > 
> > Had you noticed that your series conflict with one of Fam's series in the quorum cancel
> > function fix patch ?
> 
> Not yet, thanks for reminding.
I think Fam somehow digested you patch.
> 
> > Could you find an arrangement with Fam so the two patches don't collide anymore ?
> > 
> > Do you intend to respin your series ?
> 
> Yes, I'll rebase the v2 later before more possible reviews.
> 
> Thanks
> Yuan
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2014-09-10 13:14 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 1/8] block/quorum: initialize qcrs.aiocb for read Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks Liu Yuan
2014-09-01  8:28   ` Benoît Canet
2014-09-01  9:19     ` Liu Yuan
2014-09-01  9:28       ` Benoît Canet
2014-09-01  9:40         ` Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver Liu Yuan
2014-09-01  8:31   ` Benoît Canet
2014-09-01  9:22     ` Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 4/8] block/quorum: add quorum_aio_release() helper Liu Yuan
2014-09-01  8:33   ` Benoît Canet
2014-09-01  7:43 ` [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel() Liu Yuan
2014-09-01  8:35   ` Benoît Canet
2014-09-01  9:26     ` Liu Yuan
2014-09-01  9:32       ` Benoît Canet
2014-09-01  9:46         ` Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState Liu Yuan
2014-09-01  8:57   ` Benoît Canet
2014-09-01  9:30     ` Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 7/8] block: add two helpers Liu Yuan
2014-09-01  8:59   ` Benoît Canet
2014-09-01  7:43 ` [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic Liu Yuan
2014-09-01  9:37   ` Benoît Canet
2014-09-01  9:45     ` Liu Yuan
2014-09-01  8:19 ` [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Benoît Canet
2014-09-02 22:19 ` Benoît Canet
2014-09-10  7:31   ` Liu Yuan
2014-09-07 15:12 ` Benoît Canet
2014-09-10  7:18   ` Liu Yuan
2014-09-10 13:12     ` Benoît Canet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.