* [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring
@ 2017-09-13 18:18 Max Reitz
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 01/18] block: Add BdrvDeletedStatus Max Reitz
                   ` (18 more replies)
  0 siblings, 19 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:18 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

This series implements an active and synchronous mirroring mode.

Currently, the mirror block job is passive and asynchronous: Depending on
your start conditions, some part of the source disk starts as "dirty".
Then, the block job will (as a background operation) continuously copy
dirty parts to the target disk until all of the source disk is clean.
In the meantime, any write to the source disk dirties the affected area.

One effect of this operational mode is that the job may never converge:
If the writes to the source happen faster than the block job copies data
to the target, the job can never finish.

When the active mode implemented in this series is enabled, every write
request to the source will automatically trigger a synchronous write to
the target right afterwards.  Therefore, the source can never get dirty
faster than data is copied to the target.  Most importantly, once source
and target are in sync (BLOCK_JOB_READY is emitted), they will not
diverge (unless e.g. an I/O error occurs).

Active mirroring also addresses a second issue of the passive mode: we
do not have to read data from the source in order to write it to the
target.  When new data is written to the source in active mode, it is
automatically mirrored to the target, which saves us the superfluous
read from the source.
(Optionally, one can choose to also mirror data read from the source.
This does not necessarily help with convergence, but it saves an extra
read operation, in exchange for slower read access to the source,
because this mirroring is performed synchronously.)
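The difference between the two modes can be sketched as follows.  This is a minimal standalone model, not QEMU code: the in-memory arrays stand in for BlockBackends, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory "disks"; real QEMU writes go through BlockBackends. */
#define DISK_SIZE 64
static uint8_t source_disk[DISK_SIZE];
static uint8_t target_disk[DISK_SIZE];

/* Passive mode: only the source is touched; the written area becomes dirty
 * and a background job has to copy it to the target later. */
static void passive_write(int64_t off, const uint8_t *buf, uint64_t len)
{
    memcpy(source_disk + off, buf, len);
    /* target now lags behind until the background copy catches up */
}

/* Active-sync mode: the same data is mirrored to the target before the
 * request completes, so source and target cannot diverge. */
static void active_sync_write(int64_t off, const uint8_t *buf, uint64_t len)
{
    memcpy(source_disk + off, buf, len);
    memcpy(target_disk + off, buf, len); /* synchronous mirror write */
}
```

With the active variant, the source can never get dirty faster than the target is updated, which is exactly the convergence guarantee described above.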


There may be a couple of things to do on top of this series:
- Allow switching between active and passive mode at runtime: This
  should not be too difficult to implement, the main question is how to
  expose it to the user.
  (I seem to recall we wanted some form of block-job-set-option
  command...?)

- Implement an asynchronous active mode: May be detrimental when it
  comes to convergence, but it might be nice to have anyway.  May or may
  not be complicated to implement.

- Make the target a BdrvChild of the mirror BDS: This series does some
  work to make the mirror BDS a more integral part of the block job (see
  below for more).  One of the things I wanted to do is to make both the
  source and the target plain children of that BDS, and I did have
  patches to do this.  However, at some point continuing to do this for
  the target seemed rather difficult, and also a bit pointless, so I
  decided to keep it for later.
  (To be specific, that "some point" was exactly when I tried to rebase
  onto 045a2f8254c.)


=== Structure of this series ===

The first half (up until patch 10) restructures parts of the mirror
block job:

- Patches 4/5:
  The job is converted to use coroutines instead of AIO.
  (because this is probably where we want to go, and also because active
   mirroring will need to wait on conflicting in-flight operations, and
   I really don't want to wait on in-flight AIO requests)

  This is done primarily by patch 5, with patch 4 being necessary
  beforehand.

- Patches 6/7:
  Every in-flight operation gets a CoQueue so it can be waited on
  (because this allows active mirroring operations to wait for
  conflicting writes)

  This is started by patch 6, and with patch 7, every bit in the
  in-flight bitmap has at least one corresponding operation in the
  MirrorBlockJob.ops_in_flight list that can be waited on.
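The conflict detection behind this can be sketched without coroutines.  This is a simplified stand-in: in the series, each MirrorOp additionally carries a CoQueue that conflicting requests park themselves on, and the list is a QTAILQ rather than the plain pointer chain used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for MirrorOp; the real struct also holds the qiov
 * and a CoQueue (op->waiting_requests) for conflicting coroutines. */
struct mirror_op {
    int64_t offset;
    uint64_t bytes;
    struct mirror_op *next; /* stands in for the ops_in_flight list */
};

/* Return the first in-flight op overlapping [offset, offset + bytes).
 * In the real job, the caller would wait on that op's CoQueue instead of
 * blindly yielding and retrying. */
static struct mirror_op *find_conflict(struct mirror_op *ops_in_flight,
                                       int64_t offset, uint64_t bytes)
{
    for (struct mirror_op *op = ops_in_flight; op; op = op->next) {
        if (offset < op->offset + (int64_t)op->bytes &&
            op->offset < offset + (int64_t)bytes) {
            return op;
        }
    }
    return NULL;
}
```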

- Patches 1/2/3/8/9/10:
  The source node is no longer used through a BlockBackend (patch 8),
  and it is now attached to the mirror BDS as the "file" child
  instead of the "backing" child (patch 10).
  This is mostly because I'd personally like the mirror BDS to be a real
  filter BDS instead of some technicality that needs to be there to
  solve op blocker issues.

  Patches 3 and 9 are necessary for patch 10.

  Patches 1 and 2 were necessary for this when I decided to include
  another patch to make the target node an immediate child of the mirror
  BDS, too.  However, as I wrote above, I later decided to put this idea
  off until later, and as long as the mirror BDS only has a single
  child, those patches are not strictly necessary.
  However, I think that those patches are good to have anyway, so I
  decided to keep them.


The second half (patches 11 to 18) implements active mirroring:
- Patch 11 is required by patch 12, which in turn is required by the
  active-sync mode when mirroring data read from the source to the
  target: that functionality needs to find the parts of the read data
  which are actually dirty, so that we do not copy clean data.
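The dirty-area lookup can be illustrated with a toy bitmap.  This is only a sketch of the idea; bdrv_dirty_iter_next_area() (added in patch 12) plays roughly this role on top of QEMU's real HBitmap, and the granularity and bitmap representation here are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GRANULARITY 8   /* bytes per dirty-bitmap chunk (toy value) */
#define NB_CHUNKS   16

static bool dirty[NB_CHUNKS]; /* one flag per GRANULARITY-byte chunk */

/* Starting at *offset, find the next dirty area before @end and return its
 * length in bytes (0 if none), advancing *offset to the area's start.
 * Read mirroring uses this kind of lookup to copy only dirty data. */
static uint64_t next_dirty_area(int64_t *offset, int64_t end)
{
    int64_t chunk = *offset / GRANULARITY;

    while (chunk * GRANULARITY < end && !dirty[chunk]) {
        chunk++;
    }
    if (chunk * GRANULARITY >= end) {
        return 0; /* no dirty data left in the range */
    }
    *offset = chunk * GRANULARITY;

    int64_t area_end = *offset;
    while (area_end < end && dirty[area_end / GRANULARITY]) {
        area_end += GRANULARITY;
    }
    if (area_end > end) {
        area_end = end;
    }
    return area_end - *offset;
}
```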

- Patches 13 and 14 prepare the job for active operations.

- Patch 15 implements active mirroring.

- Patch 16 allows it to be used (by adding a parameter to
  blockdev-mirror and drive-mirror).

- Patch 18 adds an iotest which relies on functionality introduced by
  patch 17.



Max Reitz (18):
  block: Add BdrvDeletedStatus
  block: BDS deletion during bdrv_drain_recurse
  blockjob: Make drained_{begin,end} public
  block/mirror: Pull out mirror_perform()
  block/mirror: Convert to coroutines
  block/mirror: Use CoQueue to wait on in-flight ops
  block/mirror: Wait for in-flight op conflicts
  block/mirror: Use source as a BdrvChild
  block: Generalize should_update_child() rule
  block/mirror: Make source the file child
  hbitmap: Add @advance param to hbitmap_iter_next()
  block/dirty-bitmap: Add bdrv_dirty_iter_next_area
  block/mirror: Keep write perm for pending writes
  block/mirror: Distinguish active from passive ops
  block/mirror: Add active mirroring
  block/mirror: Add copy mode QAPI interface
  qemu-io: Add background write
  iotests: Add test for active mirroring

 qapi/block-core.json         |  34 ++-
 include/block/block_int.h    |  18 +-
 include/block/blockjob.h     |  15 +
 include/block/dirty-bitmap.h |   2 +
 include/qemu/hbitmap.h       |   4 +-
 block.c                      |  50 +++-
 block/dirty-bitmap.c         |  54 +++-
 block/io.c                   |  72 +++--
 block/mirror.c               | 633 +++++++++++++++++++++++++++++++++----------
 block/qapi.c                 |  25 +-
 blockdev.c                   |   9 +-
 blockjob.c                   |  20 +-
 qemu-io-cmds.c               |  83 +++++-
 tests/test-hbitmap.c         |  26 +-
 util/hbitmap.c               |  10 +-
 tests/qemu-iotests/141.out   |   4 +-
 tests/qemu-iotests/151       | 111 ++++++++
 tests/qemu-iotests/151.out   |   5 +
 tests/qemu-iotests/group     |   1 +
 19 files changed, 964 insertions(+), 212 deletions(-)
 create mode 100755 tests/qemu-iotests/151
 create mode 100644 tests/qemu-iotests/151.out

-- 
2.13.5

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Qemu-devel] [PATCH 01/18] block: Add BdrvDeletedStatus
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
@ 2017-09-13 18:18 ` Max Reitz
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse Max Reitz
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:18 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

Sometimes an operation may delete a BDS.  It may then be non-trivial to
detect that this has happened, because the BDS object itself cannot be
accessed afterwards.  With this patch, one can attach a
BdrvDeletedStatus object to a BDS, which can then be used to safely
query whether the BDS still exists, even after it has been deleted.
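The pattern is an intrusive observer list: callers link a (typically stack-allocated) status object into the node they are watching, and deletion flips every linked flag.  The following standalone sketch mirrors the names in the patch but uses a plain singly-linked list instead of QEMU's QLIST macros, and omits everything else a real BDS contains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct BdrvDeletedStatus {
    bool deleted;
    struct BdrvDeletedStatus *next;
} BdrvDeletedStatus;

typedef struct BlockDriverState {
    BdrvDeletedStatus *deleted_status; /* head of the observer list */
} BlockDriverState;

/* Register interest in @bs's deletion.  The caller owns @st and must keep
 * it alive until it either unlinks it or @bs is deleted. */
static void bdrv_track_deletion(BlockDriverState *bs, BdrvDeletedStatus *st)
{
    st->deleted = false;
    st->next = bs->deleted_status;
    bs->deleted_status = st;
}

/* What bdrv_delete() does before tearing the BDS down: notify observers.
 * The entries are owned by the observers and are not freed here. */
static void bdrv_delete(BlockDriverState *bs)
{
    for (BdrvDeletedStatus *st = bs->deleted_status; st; st = st->next) {
        st->deleted = true;
    }
    /* ...bdrv_close() and freeing would follow in the real function... */
}
```

After bdrv_delete() runs, an observer can consult its own status object instead of the (now invalid) BDS pointer.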

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h | 12 ++++++++++++
 block.c                   |  6 ++++++
 2 files changed, 18 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index ba4c383393..eaeaad9428 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -498,6 +498,13 @@ typedef struct BdrvAioNotifier {
     QLIST_ENTRY(BdrvAioNotifier) list;
 } BdrvAioNotifier;
 
+typedef struct BdrvDeletedStatus {
+    /* Set to true by bdrv_delete() */
+    bool deleted;
+
+    QLIST_ENTRY(BdrvDeletedStatus) next;
+} BdrvDeletedStatus;
+
 struct BdrvChildRole {
     /* If true, bdrv_replace_node() doesn't change the node this BdrvChild
      * points to. */
@@ -706,6 +713,11 @@ struct BlockDriverState {
 
     /* Only read/written by whoever has set active_flush_req to true.  */
     unsigned int flushed_gen;             /* Flushed write generation */
+
+    /* When bdrv_delete() is invoked, it will walk through this list
+     * and set every entry's @deleted field to true.  The entries will
+     * not be freed automatically. */
+    QLIST_HEAD(, BdrvDeletedStatus) deleted_status;
 };
 
 struct BlockBackendRootState {
diff --git a/block.c b/block.c
index 6dd47e414e..0b55c5a41c 100644
--- a/block.c
+++ b/block.c
@@ -3246,10 +3246,16 @@ out:
 
 static void bdrv_delete(BlockDriverState *bs)
 {
+    BdrvDeletedStatus *del_stat;
+
     assert(!bs->job);
     assert(bdrv_op_blocker_is_empty(bs));
     assert(!bs->refcnt);
 
+    QLIST_FOREACH(del_stat, &bs->deleted_status, next) {
+        del_stat->deleted = true;
+    }
+
     bdrv_close(bs);
 
     /* remove from list, if necessary */
-- 
2.13.5


* [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 01/18] block: Add BdrvDeletedStatus Max Reitz
@ 2017-09-13 18:18 ` Max Reitz
  2017-09-18  3:44   ` Fam Zheng
  2017-10-10  8:36   ` Kevin Wolf
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 03/18] blockjob: Make drained_{begin, end} public Max Reitz
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:18 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

Draining a BDS child may lead to the original BDS and/or its other
children being deleted (e.g. if the original BDS represents a block
job).  We should prepare for this in both bdrv_drain_recurse() and
bdrv_drained_begin() by monitoring whether the BDS we are about to drain
still exists at all.
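The core hazard is iterating a list while recursion may mutate it.  A toy model of the fix, with all QEMU specifics stripped out (fixed-size arrays instead of QLISTs, and no deleted-status check, which the real patch adds on top):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

struct node {
    int n_children;
    struct node *children[MAX_CHILDREN];
    int drained;
};

static void drain_recurse(struct node *bs)
{
    /* Snapshot the children list first: recursing into a child may run
     * callbacks that remove entries from the list we are walking.  The
     * real code additionally re-checks a BdrvDeletedStatus per entry
     * before dereferencing it. */
    struct node *snapshot[MAX_CHILDREN];
    int n = bs->n_children;
    for (int i = 0; i < n; i++) {
        snapshot[i] = bs->children[i];
    }

    bs->drained = 1;
    for (int i = 0; i < n; i++) {
        drain_recurse(snapshot[i]);
    }
}
```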

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/io.c | 72 +++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 20 deletions(-)

diff --git a/block/io.c b/block/io.c
index 4378ae4c7d..8ec1a564ad 100644
--- a/block/io.c
+++ b/block/io.c
@@ -182,33 +182,57 @@ static void bdrv_drain_invoke(BlockDriverState *bs)
 
 static bool bdrv_drain_recurse(BlockDriverState *bs)
 {
-    BdrvChild *child, *tmp;
+    BdrvChild *child;
     bool waited;
+    struct BDSToDrain {
+        BlockDriverState *bs;
+        BdrvDeletedStatus del_stat;
+        QLIST_ENTRY(BDSToDrain) next;
+    };
+    QLIST_HEAD(, BDSToDrain) bs_list = QLIST_HEAD_INITIALIZER(bs_list);
+    bool in_main_loop =
+        qemu_get_current_aio_context() == qemu_get_aio_context();
 
     waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
 
     /* Ensure any pending metadata writes are submitted to bs->file.  */
     bdrv_drain_invoke(bs);
 
-    QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
-        BlockDriverState *bs = child->bs;
-        bool in_main_loop =
-            qemu_get_current_aio_context() == qemu_get_aio_context();
-        assert(bs->refcnt > 0);
-        if (in_main_loop) {
-            /* In case the recursive bdrv_drain_recurse processes a
-             * block_job_defer_to_main_loop BH and modifies the graph,
-             * let's hold a reference to bs until we are done.
-             *
-             * IOThread doesn't have such a BH, and it is not safe to call
-             * bdrv_unref without BQL, so skip doing it there.
-             */
-            bdrv_ref(bs);
-        }
-        waited |= bdrv_drain_recurse(bs);
-        if (in_main_loop) {
-            bdrv_unref(bs);
+    /* Draining children may result in other children being removed and maybe
+     * even deleted, so copy the children list first */
+    QLIST_FOREACH(child, &bs->children, next) {
+        struct BDSToDrain *bs2d = g_new0(struct BDSToDrain, 1);
+
+        bs2d->bs = child->bs;
+        QLIST_INSERT_HEAD(&bs->deleted_status, &bs2d->del_stat, next);
+
+        QLIST_INSERT_HEAD(&bs_list, bs2d, next);
+    }
+
+    while (!QLIST_EMPTY(&bs_list)) {
+        struct BDSToDrain *bs2d = QLIST_FIRST(&bs_list);
+        QLIST_REMOVE(bs2d, next);
+
+        if (!bs2d->del_stat.deleted) {
+            QLIST_REMOVE(&bs2d->del_stat, next);
+
+            if (in_main_loop) {
+                /* In case the recursive bdrv_drain_recurse processes a
+                 * block_job_defer_to_main_loop BH and modifies the graph,
+                 * let's hold a reference to the BDS until we are done.
+                 *
+                 * IOThread doesn't have such a BH, and it is not safe to call
+                 * bdrv_unref without BQL, so skip doing it there.
+                 */
+                bdrv_ref(bs2d->bs);
+            }
+            waited |= bdrv_drain_recurse(bs2d->bs);
+            if (in_main_loop) {
+                bdrv_unref(bs2d->bs);
+            }
         }
+
+        g_free(bs2d);
     }
 
     return waited;
@@ -252,17 +276,25 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
 
 void bdrv_drained_begin(BlockDriverState *bs)
 {
+    BdrvDeletedStatus del_stat = { .deleted = false };
+
     if (qemu_in_coroutine()) {
         bdrv_co_yield_to_drain(bs);
         return;
     }
 
+    QLIST_INSERT_HEAD(&bs->deleted_status, &del_stat, next);
+
     if (atomic_fetch_inc(&bs->quiesce_counter) == 0) {
         aio_disable_external(bdrv_get_aio_context(bs));
         bdrv_parent_drained_begin(bs);
     }
 
-    bdrv_drain_recurse(bs);
+    if (!del_stat.deleted) {
+        QLIST_REMOVE(&del_stat, next);
+
+        bdrv_drain_recurse(bs);
+    }
 }
 
 void bdrv_drained_end(BlockDriverState *bs)
-- 
2.13.5


* [Qemu-devel] [PATCH 03/18] blockjob: Make drained_{begin, end} public
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 01/18] block: Add BdrvDeletedStatus Max Reitz
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse Max Reitz
@ 2017-09-13 18:18 ` Max Reitz
  2017-09-18  3:46   ` Fam Zheng
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 04/18] block/mirror: Pull out mirror_perform() Max Reitz
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:18 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

When a block job decides to be represented as a BDS and track its
associated child nodes itself instead of having the BlockJob object
track them, it needs to implement the drained_begin/drained_end child
operations.  In order to do that, it has to be able to control drainage
of the block job (i.e. to pause and resume it).  Therefore, we need to
make these operations public.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/blockjob.h | 15 +++++++++++++++
 blockjob.c               | 20 ++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 67c0968fa5..a59f316788 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -339,6 +339,21 @@ void block_job_ref(BlockJob *job);
 void block_job_unref(BlockJob *job);
 
 /**
+ * block_job_drained_begin:
+ *
+ * Inhibit I/O requests initiated by the block job.
+ */
+void block_job_drained_begin(BlockJob *job);
+
+/**
+ * block_job_drained_end:
+ *
+ * Resume I/O after it has been paused through
+ * block_job_drained_begin().
+ */
+void block_job_drained_end(BlockJob *job);
+
+/**
  * block_job_txn_unref:
  *
  * Release a reference that was previously acquired with block_job_txn_add_job
diff --git a/blockjob.c b/blockjob.c
index 3a0c49137e..4312a121fa 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -217,21 +217,29 @@ static const BdrvChildRole child_job = {
     .stay_at_node       = true,
 };
 
-static void block_job_drained_begin(void *opaque)
+void block_job_drained_begin(BlockJob *job)
 {
-    BlockJob *job = opaque;
     block_job_pause(job);
 }
 
-static void block_job_drained_end(void *opaque)
+static void block_job_drained_begin_op(void *opaque)
+{
+    block_job_drained_begin(opaque);
+}
+
+void block_job_drained_end(BlockJob *job)
 {
-    BlockJob *job = opaque;
     block_job_resume(job);
 }
 
+static void block_job_drained_end_op(void *opaque)
+{
+    block_job_drained_end(opaque);
+}
+
 static const BlockDevOps block_job_dev_ops = {
-    .drained_begin = block_job_drained_begin,
-    .drained_end = block_job_drained_end,
+    .drained_begin = block_job_drained_begin_op,
+    .drained_end = block_job_drained_end_op,
 };
 
 void block_job_remove_all_bdrv(BlockJob *job)
-- 
2.13.5


* [Qemu-devel] [PATCH 04/18] block/mirror: Pull out mirror_perform()
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (2 preceding siblings ...)
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 03/18] blockjob: Make drained_{begin, end} public Max Reitz
@ 2017-09-13 18:18 ` Max Reitz
  2017-09-18  3:48   ` Fam Zheng
  2017-09-25  9:38   ` Vladimir Sementsov-Ogievskiy
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 05/18] block/mirror: Convert to coroutines Max Reitz
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:18 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

When converting mirror's I/O to coroutines, we are going to need a point
where these coroutines are created.  mirror_perform() is going to be
that point.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 53 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 6531652d73..4664b0516f 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -82,6 +82,12 @@ typedef struct MirrorOp {
     uint64_t bytes;
 } MirrorOp;
 
+typedef enum MirrorMethod {
+    MIRROR_METHOD_COPY,
+    MIRROR_METHOD_ZERO,
+    MIRROR_METHOD_DISCARD,
+} MirrorMethod;
+
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
                                             int error)
 {
@@ -324,6 +330,22 @@ static void mirror_do_zero_or_discard(MirrorBlockJob *s,
     }
 }
 
+static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
+                               unsigned bytes, MirrorMethod mirror_method)
+{
+    switch (mirror_method) {
+    case MIRROR_METHOD_COPY:
+        return mirror_do_read(s, offset, bytes);
+    case MIRROR_METHOD_ZERO:
+    case MIRROR_METHOD_DISCARD:
+        mirror_do_zero_or_discard(s, offset, bytes,
+                                  mirror_method == MIRROR_METHOD_DISCARD);
+        return bytes;
+    default:
+        abort();
+    }
+}
+
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 {
     BlockDriverState *source = s->source;
@@ -395,11 +417,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
         unsigned int io_bytes;
         int64_t io_bytes_acct;
         BlockDriverState *file;
-        enum MirrorMethod {
-            MIRROR_METHOD_COPY,
-            MIRROR_METHOD_ZERO,
-            MIRROR_METHOD_DISCARD
-        } mirror_method = MIRROR_METHOD_COPY;
+        MirrorMethod mirror_method = MIRROR_METHOD_COPY;
 
         assert(!(offset % s->granularity));
         ret = bdrv_get_block_status_above(source, NULL,
@@ -439,22 +457,11 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
         }
 
         io_bytes = mirror_clip_bytes(s, offset, io_bytes);
-        switch (mirror_method) {
-        case MIRROR_METHOD_COPY:
-            io_bytes = io_bytes_acct = mirror_do_read(s, offset, io_bytes);
-            break;
-        case MIRROR_METHOD_ZERO:
-        case MIRROR_METHOD_DISCARD:
-            mirror_do_zero_or_discard(s, offset, io_bytes,
-                                      mirror_method == MIRROR_METHOD_DISCARD);
-            if (write_zeroes_ok) {
-                io_bytes_acct = 0;
-            } else {
-                io_bytes_acct = io_bytes;
-            }
-            break;
-        default:
-            abort();
+        io_bytes = mirror_perform(s, offset, io_bytes, mirror_method);
+        if (mirror_method != MIRROR_METHOD_COPY && write_zeroes_ok) {
+            io_bytes_acct = 0;
+        } else {
+            io_bytes_acct = io_bytes;
         }
         assert(io_bytes);
         offset += io_bytes;
@@ -650,8 +657,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
                 continue;
             }
 
-            mirror_do_zero_or_discard(s, sector_num * BDRV_SECTOR_SIZE,
-                                      nb_sectors * BDRV_SECTOR_SIZE, false);
+            mirror_perform(s, sector_num * BDRV_SECTOR_SIZE,
+                           nb_sectors * BDRV_SECTOR_SIZE, MIRROR_METHOD_ZERO);
             sector_num += nb_sectors;
         }
 
-- 
2.13.5


* [Qemu-devel] [PATCH 05/18] block/mirror: Convert to coroutines
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (3 preceding siblings ...)
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 04/18] block/mirror: Pull out mirror_perform() Max Reitz
@ 2017-09-13 18:18 ` Max Reitz
  2017-09-18  6:02   ` Fam Zheng
  2017-10-10  9:14   ` Kevin Wolf
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 06/18] block/mirror: Use CoQueue to wait on in-flight ops Max Reitz
                   ` (13 subsequent siblings)
  18 siblings, 2 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:18 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

In order to talk to the source BDS (and maybe in the future to the
target BDS as well) directly, we need to convert our existing AIO
requests into coroutine I/O requests.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 134 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 78 insertions(+), 56 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 4664b0516f..2b3297aa61 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -80,6 +80,9 @@ typedef struct MirrorOp {
     QEMUIOVector qiov;
     int64_t offset;
     uint64_t bytes;
+
+    /* Set by mirror_co_read() before yielding for the first time */
+    uint64_t bytes_copied;
 } MirrorOp;
 
 typedef enum MirrorMethod {
@@ -101,7 +104,7 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
     }
 }
 
-static void mirror_iteration_done(MirrorOp *op, int ret)
+static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
 {
     MirrorBlockJob *s = op->s;
     struct iovec *iov;
@@ -138,9 +141,8 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
     }
 }
 
-static void mirror_write_complete(void *opaque, int ret)
+static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
 {
-    MirrorOp *op = opaque;
     MirrorBlockJob *s = op->s;
 
     aio_context_acquire(blk_get_aio_context(s->common.blk));
@@ -158,9 +160,8 @@ static void mirror_write_complete(void *opaque, int ret)
     aio_context_release(blk_get_aio_context(s->common.blk));
 }
 
-static void mirror_read_complete(void *opaque, int ret)
+static void coroutine_fn mirror_read_complete(MirrorOp *op, int ret)
 {
-    MirrorOp *op = opaque;
     MirrorBlockJob *s = op->s;
 
     aio_context_acquire(blk_get_aio_context(s->common.blk));
@@ -176,8 +177,11 @@ static void mirror_read_complete(void *opaque, int ret)
 
         mirror_iteration_done(op, ret);
     } else {
-        blk_aio_pwritev(s->target, op->offset, &op->qiov,
-                        0, mirror_write_complete, op);
+        int ret;
+
+        ret = blk_co_pwritev(s->target, op->offset,
+                             op->qiov.size, &op->qiov, 0);
+        mirror_write_complete(op, ret);
     }
     aio_context_release(blk_get_aio_context(s->common.blk));
 }
@@ -242,53 +246,49 @@ static inline void mirror_wait_for_io(MirrorBlockJob *s)
  *          (new_end - offset) if tail is rounded up or down due to
  *          alignment or buffer limit.
  */
-static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
-                               uint64_t bytes)
+static void coroutine_fn mirror_co_read(void *opaque)
 {
+    MirrorOp *op = opaque;
+    MirrorBlockJob *s = op->s;
     BlockBackend *source = s->common.blk;
     int nb_chunks;
     uint64_t ret;
-    MirrorOp *op;
     uint64_t max_bytes;
 
     max_bytes = s->granularity * s->max_iov;
 
     /* We can only handle as much as buf_size at a time. */
-    bytes = MIN(s->buf_size, MIN(max_bytes, bytes));
-    assert(bytes);
-    assert(bytes < BDRV_REQUEST_MAX_BYTES);
-    ret = bytes;
+    op->bytes = MIN(s->buf_size, MIN(max_bytes, op->bytes));
+    assert(op->bytes);
+    assert(op->bytes < BDRV_REQUEST_MAX_BYTES);
+    op->bytes_copied = op->bytes;
 
     if (s->cow_bitmap) {
-        ret += mirror_cow_align(s, &offset, &bytes);
+        op->bytes_copied += mirror_cow_align(s, &op->offset, &op->bytes);
     }
-    assert(bytes <= s->buf_size);
+    /* Cannot exceed BDRV_REQUEST_MAX_BYTES + INT_MAX */
+    assert(op->bytes_copied <= UINT_MAX);
+    assert(op->bytes <= s->buf_size);
     /* The offset is granularity-aligned because:
      * 1) Caller passes in aligned values;
      * 2) mirror_cow_align is used only when target cluster is larger. */
-    assert(QEMU_IS_ALIGNED(offset, s->granularity));
+    assert(QEMU_IS_ALIGNED(op->offset, s->granularity));
     /* The range is sector-aligned, since bdrv_getlength() rounds up. */
-    assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE));
-    nb_chunks = DIV_ROUND_UP(bytes, s->granularity);
+    assert(QEMU_IS_ALIGNED(op->bytes, BDRV_SECTOR_SIZE));
+    nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
 
     while (s->buf_free_count < nb_chunks) {
-        trace_mirror_yield_in_flight(s, offset, s->in_flight);
+        trace_mirror_yield_in_flight(s, op->offset, s->in_flight);
         mirror_wait_for_io(s);
     }
 
-    /* Allocate a MirrorOp that is used as an AIO callback.  */
-    op = g_new(MirrorOp, 1);
-    op->s = s;
-    op->offset = offset;
-    op->bytes = bytes;
-
     /* Now make a QEMUIOVector taking enough granularity-sized chunks
      * from s->buf_free.
      */
     qemu_iovec_init(&op->qiov, nb_chunks);
     while (nb_chunks-- > 0) {
         MirrorBuffer *buf = QSIMPLEQ_FIRST(&s->buf_free);
-        size_t remaining = bytes - op->qiov.size;
+        size_t remaining = op->bytes - op->qiov.size;
 
         QSIMPLEQ_REMOVE_HEAD(&s->buf_free, next);
         s->buf_free_count--;
@@ -297,53 +297,75 @@ static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
 
     /* Copy the dirty cluster.  */
     s->in_flight++;
-    s->bytes_in_flight += bytes;
-    trace_mirror_one_iteration(s, offset, bytes);
+    s->bytes_in_flight += op->bytes;
+    trace_mirror_one_iteration(s, op->offset, op->bytes);
 
-    blk_aio_preadv(source, offset, &op->qiov, 0, mirror_read_complete, op);
-    return ret;
+    ret = blk_co_preadv(source, op->offset, op->bytes, &op->qiov, 0);
+    mirror_read_complete(op, ret);
 }
 
-static void mirror_do_zero_or_discard(MirrorBlockJob *s,
-                                      int64_t offset,
-                                      uint64_t bytes,
-                                      bool is_discard)
+static void coroutine_fn mirror_co_zero(void *opaque)
 {
-    MirrorOp *op;
+    MirrorOp *op = opaque;
+    int ret;
 
-    /* Allocate a MirrorOp that is used as an AIO callback. The qiov is zeroed
-     * so the freeing in mirror_iteration_done is nop. */
-    op = g_new0(MirrorOp, 1);
-    op->s = s;
-    op->offset = offset;
-    op->bytes = bytes;
+    op->s->in_flight++;
+    op->s->bytes_in_flight += op->bytes;
 
-    s->in_flight++;
-    s->bytes_in_flight += bytes;
-    if (is_discard) {
-        blk_aio_pdiscard(s->target, offset,
-                         op->bytes, mirror_write_complete, op);
-    } else {
-        blk_aio_pwrite_zeroes(s->target, offset,
-                              op->bytes, s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
-                              mirror_write_complete, op);
-    }
+    ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
+                               op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
+    mirror_write_complete(op, ret);
+}
+
+static void coroutine_fn mirror_co_discard(void *opaque)
+{
+    MirrorOp *op = opaque;
+    int ret;
+
+    op->s->in_flight++;
+    op->s->bytes_in_flight += op->bytes;
+
+    ret = blk_co_pdiscard(op->s->target, op->offset, op->bytes);
+    mirror_write_complete(op, ret);
 }
 
 static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
                                unsigned bytes, MirrorMethod mirror_method)
 {
+    MirrorOp *op;
+    Coroutine *co;
+    unsigned ret = bytes;
+
+    op = g_new(MirrorOp, 1);
+    *op = (MirrorOp){
+        .s      = s,
+        .offset = offset,
+        .bytes  = bytes,
+    };
+
     switch (mirror_method) {
     case MIRROR_METHOD_COPY:
-        return mirror_do_read(s, offset, bytes);
+        co = qemu_coroutine_create(mirror_co_read, op);
+        break;
     case MIRROR_METHOD_ZERO:
+        co = qemu_coroutine_create(mirror_co_zero, op);
+        break;
     case MIRROR_METHOD_DISCARD:
-        mirror_do_zero_or_discard(s, offset, bytes,
-                                  mirror_method == MIRROR_METHOD_DISCARD);
-        return bytes;
+        co = qemu_coroutine_create(mirror_co_discard, op);
+        break;
     default:
         abort();
     }
+
+    qemu_coroutine_enter(co);
+
+    if (mirror_method == MIRROR_METHOD_COPY) {
+        /* Same assertion as in mirror_co_read() */
+        assert(op->bytes_copied <= UINT_MAX);
+        ret = op->bytes_copied;
+    }
+
+    return ret;
 }
 
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
-- 
2.13.5


* [Qemu-devel] [PATCH 06/18] block/mirror: Use CoQueue to wait on in-flight ops
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (4 preceding siblings ...)
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 05/18] block/mirror: Convert to coroutines Max Reitz
@ 2017-09-13 18:18 ` Max Reitz
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 07/18] block/mirror: Wait for in-flight op conflicts Max Reitz
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:18 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

Attach a CoQueue to each in-flight operation so that if we need to wait
for any of them, we can use the CoQueue to wait instead of just blindly
yielding and hoping for some operation to wake us.

A later patch will use this infrastructure to allow requests accessing
the same area of the virtual disk to specifically wait for each other.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 2b3297aa61..81253fbad1 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -13,6 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/coroutine.h"
 #include "trace.h"
 #include "block/blockjob_int.h"
 #include "block/block_int.h"
@@ -34,6 +35,8 @@ typedef struct MirrorBuffer {
     QSIMPLEQ_ENTRY(MirrorBuffer) next;
 } MirrorBuffer;
 
+typedef struct MirrorOp MirrorOp;
+
 typedef struct MirrorBlockJob {
     BlockJob common;
     RateLimit limit;
@@ -67,15 +70,15 @@ typedef struct MirrorBlockJob {
     unsigned long *in_flight_bitmap;
     int in_flight;
     int64_t bytes_in_flight;
+    QTAILQ_HEAD(MirrorOpList, MirrorOp) ops_in_flight;
     int ret;
     bool unmap;
-    bool waiting_for_io;
     int target_cluster_size;
     int max_iov;
     bool initial_zeroing_ongoing;
 } MirrorBlockJob;
 
-typedef struct MirrorOp {
+struct MirrorOp {
     MirrorBlockJob *s;
     QEMUIOVector qiov;
     int64_t offset;
@@ -83,7 +86,11 @@ typedef struct MirrorOp {
 
     /* Set by mirror_co_read() before yielding for the first time */
     uint64_t bytes_copied;
-} MirrorOp;
+
+    CoQueue waiting_requests;
+
+    QTAILQ_ENTRY(MirrorOp) next;
+};
 
 typedef enum MirrorMethod {
     MIRROR_METHOD_COPY,
@@ -124,7 +131,9 @@ static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
 
     chunk_num = op->offset / s->granularity;
     nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
+
     bitmap_clear(s->in_flight_bitmap, chunk_num, nb_chunks);
+    QTAILQ_REMOVE(&s->ops_in_flight, op, next);
     if (ret >= 0) {
         if (s->cow_bitmap) {
             bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
@@ -134,11 +143,9 @@ static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
         }
     }
     qemu_iovec_destroy(&op->qiov);
-    g_free(op);
 
-    if (s->waiting_for_io) {
-        qemu_coroutine_enter(s->common.co);
-    }
+    qemu_co_queue_restart_all(&op->waiting_requests);
+    g_free(op);
 }
 
 static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
@@ -233,10 +240,11 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
 
 static inline void mirror_wait_for_io(MirrorBlockJob *s)
 {
-    assert(!s->waiting_for_io);
-    s->waiting_for_io = true;
-    qemu_coroutine_yield();
-    s->waiting_for_io = false;
+    MirrorOp *op;
+
+    op = QTAILQ_FIRST(&s->ops_in_flight);
+    assert(op);
+    qemu_co_queue_wait(&op->waiting_requests, NULL);
 }
 
 /* Submit async read while handling COW.
@@ -342,6 +350,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
         .offset = offset,
         .bytes  = bytes,
     };
+    qemu_co_queue_init(&op->waiting_requests);
 
     switch (mirror_method) {
     case MIRROR_METHOD_COPY:
@@ -357,6 +366,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
         abort();
     }
 
+    QTAILQ_INSERT_TAIL(&s->ops_in_flight, op, next);
     qemu_coroutine_enter(co);
 
     if (mirror_method == MIRROR_METHOD_COPY) {
@@ -1292,6 +1302,8 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
         }
     }
 
+    QTAILQ_INIT(&s->ops_in_flight);
+
     trace_mirror_start(bs, s, opaque);
     block_job_start(&s->common);
     return;
-- 
2.13.5


* [Qemu-devel] [PATCH 07/18] block/mirror: Wait for in-flight op conflicts
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (5 preceding siblings ...)
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 06/18] block/mirror: Use CoQueue to wait on in-flight ops Max Reitz
@ 2017-09-13 18:18 ` Max Reitz
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 08/18] block/mirror: Use source as a BdrvChild Max Reitz
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:18 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

This patch makes the mirror code differentiate between simply waiting
for any operation to complete (mirror_wait_for_free_in_flight_slot())
and specifically waiting for all operations touching a certain range of
the virtual disk to complete (mirror_wait_on_conflicts()).

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 70 insertions(+), 15 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 81253fbad1..2ece38094d 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -14,6 +14,7 @@
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
 #include "qemu/coroutine.h"
+#include "qemu/range.h"
 #include "trace.h"
 #include "block/blockjob_int.h"
 #include "block/block_int.h"
@@ -111,6 +112,41 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
     }
 }
 
+static void coroutine_fn mirror_wait_on_conflicts(MirrorOp *self,
+                                                  MirrorBlockJob *s,
+                                                  uint64_t offset,
+                                                  uint64_t bytes)
+{
+    uint64_t self_start_chunk = offset / s->granularity;
+    uint64_t self_end_chunk = DIV_ROUND_UP(offset + bytes, s->granularity);
+    uint64_t self_nb_chunks = self_end_chunk - self_start_chunk;
+
+    while (find_next_bit(s->in_flight_bitmap, self_end_chunk,
+                         self_start_chunk) < self_end_chunk &&
+           s->ret >= 0)
+    {
+        MirrorOp *op;
+
+        QTAILQ_FOREACH(op, &s->ops_in_flight, next) {
+            uint64_t op_start_chunk = op->offset / s->granularity;
+            uint64_t op_nb_chunks = DIV_ROUND_UP(op->offset + op->bytes,
+                                                 s->granularity) -
+                                    op_start_chunk;
+
+            if (op == self) {
+                continue;
+            }
+
+            if (ranges_overlap(self_start_chunk, self_nb_chunks,
+                               op_start_chunk, op_nb_chunks))
+            {
+                qemu_co_queue_wait(&op->waiting_requests, NULL);
+                break;
+            }
+        }
+    }
+}
+
 static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
 {
     MirrorBlockJob *s = op->s;
@@ -238,7 +274,7 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
     return ret;
 }
 
-static inline void mirror_wait_for_io(MirrorBlockJob *s)
+static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
 {
     MirrorOp *op;
 
@@ -287,7 +323,7 @@ static void coroutine_fn mirror_co_read(void *opaque)
 
     while (s->buf_free_count < nb_chunks) {
         trace_mirror_yield_in_flight(s, op->offset, s->in_flight);
-        mirror_wait_for_io(s);
+        mirror_wait_for_free_in_flight_slot(s);
     }
 
     /* Now make a QEMUIOVector taking enough granularity-sized chunks
@@ -381,8 +417,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 {
     BlockDriverState *source = s->source;
-    int64_t offset, first_chunk;
-    uint64_t delay_ns = 0;
+    MirrorOp *pseudo_op;
+    int64_t offset;
+    uint64_t delay_ns = 0, ret = 0;
     /* At least the first dirty chunk is mirrored in one iteration. */
     int nb_chunks = 1;
     int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
@@ -400,11 +437,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     }
     bdrv_dirty_bitmap_unlock(s->dirty_bitmap);
 
-    first_chunk = offset / s->granularity;
-    while (test_bit(first_chunk, s->in_flight_bitmap)) {
-        trace_mirror_yield_in_flight(s, offset, s->in_flight);
-        mirror_wait_for_io(s);
-    }
+    mirror_wait_on_conflicts(NULL, s, offset, 1);
 
     block_job_pause_point(&s->common);
 
@@ -442,6 +475,20 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
                                    nb_chunks * sectors_per_chunk);
     bdrv_dirty_bitmap_unlock(s->dirty_bitmap);
 
+    /* Before claiming an area in the in-flight bitmap, we have to
+     * create a MirrorOp for it so that conflicting requests can wait
+     * for it.  mirror_perform() will create the real MirrorOps later,
+     * for now we just create a pseudo operation that will wake up all
+     * conflicting requests once all real operations have been
+     * launched. */
+    pseudo_op = g_new(MirrorOp, 1);
+    *pseudo_op = (MirrorOp){
+        .offset = offset,
+        .bytes  = nb_chunks * s->granularity,
+    };
+    qemu_co_queue_init(&pseudo_op->waiting_requests);
+    QTAILQ_INSERT_TAIL(&s->ops_in_flight, pseudo_op, next);
+
     bitmap_set(s->in_flight_bitmap, offset / s->granularity, nb_chunks);
     while (nb_chunks > 0 && offset < s->bdev_length) {
         int64_t ret;
@@ -481,11 +528,12 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 
         while (s->in_flight >= MAX_IN_FLIGHT) {
             trace_mirror_yield_in_flight(s, offset, s->in_flight);
-            mirror_wait_for_io(s);
+            mirror_wait_for_free_in_flight_slot(s);
         }
 
         if (s->ret < 0) {
-            return 0;
+            ret = 0;
+            goto fail;
         }
 
         io_bytes = mirror_clip_bytes(s, offset, io_bytes);
@@ -502,7 +550,14 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
             delay_ns = ratelimit_calculate_delay(&s->limit, io_bytes_acct);
         }
     }
-    return delay_ns;
+
+    ret = delay_ns;
+fail:
+    QTAILQ_REMOVE(&s->ops_in_flight, pseudo_op, next);
+    qemu_co_queue_restart_all(&pseudo_op->waiting_requests);
+    g_free(pseudo_op);
+
+    return ret;
 }
 
 static void mirror_free_init(MirrorBlockJob *s)
@@ -529,7 +584,7 @@ static void mirror_free_init(MirrorBlockJob *s)
 static void mirror_wait_for_all_io(MirrorBlockJob *s)
 {
     while (s->in_flight > 0) {
-        mirror_wait_for_io(s);
+        mirror_wait_for_free_in_flight_slot(s);
     }
 }
 
@@ -685,7 +740,7 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
             if (s->in_flight >= MAX_IN_FLIGHT) {
                 trace_mirror_yield(s, UINT64_MAX, s->buf_free_count,
                                    s->in_flight);
-                mirror_wait_for_io(s);
+                mirror_wait_for_free_in_flight_slot(s);
                 continue;
             }
 
@@ -868,7 +923,7 @@ static void coroutine_fn mirror_run(void *opaque)
                 (cnt == 0 && s->in_flight > 0)) {
                 trace_mirror_yield(s, cnt * BDRV_SECTOR_SIZE,
                                    s->buf_free_count, s->in_flight);
-                mirror_wait_for_io(s);
+                mirror_wait_for_free_in_flight_slot(s);
                 continue;
             } else if (cnt != 0) {
                 delay_ns = mirror_iteration(s);
-- 
2.13.5


* [Qemu-devel] [PATCH 08/18] block/mirror: Use source as a BdrvChild
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (6 preceding siblings ...)
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 07/18] block/mirror: Wait for in-flight op conflicts Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-10-10  9:27   ` Kevin Wolf
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 09/18] block: Generalize should_update_child() rule Max Reitz
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

With this, the mirror_top_bs is no longer just a technically required
node in the BDS graph but actually represents the block job operation.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 2ece38094d..9df4157511 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -43,8 +43,8 @@ typedef struct MirrorBlockJob {
     RateLimit limit;
     BlockBackend *target;
     BlockDriverState *mirror_top_bs;
-    BlockDriverState *source;
     BlockDriverState *base;
+    BdrvChild *source;
 
     /* The name of the graph node to replace */
     char *replaces;
@@ -294,7 +294,6 @@ static void coroutine_fn mirror_co_read(void *opaque)
 {
     MirrorOp *op = opaque;
     MirrorBlockJob *s = op->s;
-    BlockBackend *source = s->common.blk;
     int nb_chunks;
     uint64_t ret;
     uint64_t max_bytes;
@@ -344,7 +343,7 @@ static void coroutine_fn mirror_co_read(void *opaque)
     s->bytes_in_flight += op->bytes;
     trace_mirror_one_iteration(s, op->offset, op->bytes);
 
-    ret = blk_co_preadv(source, op->offset, op->bytes, &op->qiov, 0);
+    ret = bdrv_co_preadv(s->source, op->offset, op->bytes, &op->qiov, 0);
     mirror_read_complete(op, ret);
 }
 
@@ -416,7 +415,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
 
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 {
-    BlockDriverState *source = s->source;
+    BlockDriverState *source = s->source->bs;
     MirrorOp *pseudo_op;
     int64_t offset;
     uint64_t delay_ns = 0, ret = 0;
@@ -597,7 +596,7 @@ static void mirror_exit(BlockJob *job, void *opaque)
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
     MirrorExitData *data = opaque;
     AioContext *replace_aio_context = NULL;
-    BlockDriverState *src = s->source;
+    BlockDriverState *src = s->source->bs;
     BlockDriverState *target_bs = blk_bs(s->target);
     BlockDriverState *mirror_top_bs = s->mirror_top_bs;
     Error *local_err = NULL;
@@ -712,7 +711,7 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 {
     int64_t sector_num, end;
     BlockDriverState *base = s->base;
-    BlockDriverState *bs = s->source;
+    BlockDriverState *bs = s->source->bs;
     BlockDriverState *target_bs = blk_bs(s->target);
     int ret, n;
     int64_t count;
@@ -802,7 +801,7 @@ static void coroutine_fn mirror_run(void *opaque)
 {
     MirrorBlockJob *s = opaque;
     MirrorExitData *data;
-    BlockDriverState *bs = s->source;
+    BlockDriverState *bs = s->source->bs;
     BlockDriverState *target_bs = blk_bs(s->target);
     bool need_drain = true;
     int64_t length;
@@ -1284,7 +1283,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
     /* The block job now has a reference to this node */
     bdrv_unref(mirror_top_bs);
 
-    s->source = bs;
+    s->source = mirror_top_bs->backing;
     s->mirror_top_bs = mirror_top_bs;
 
     /* No resize for the target either; while the mirror is still running, a
@@ -1330,6 +1329,9 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
         s->should_complete = true;
     }
 
+    s->source = mirror_top_bs->backing;
+    s->mirror_top_bs = mirror_top_bs;
+
     s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
     if (!s->dirty_bitmap) {
         goto fail;
-- 
2.13.5


* [Qemu-devel] [PATCH 09/18] block: Generalize should_update_child() rule
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (7 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 08/18] block/mirror: Use source as a BdrvChild Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 10/18] block/mirror: Make source the file child Max Reitz
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

Currently, bdrv_replace_node() refuses to create loops from one BDS to
itself if the BDS to be replaced is the backing node of the BDS to
replace it: Say there is a node A and a node B.  Replacing B by A means
making all references to B point to A.  If B is a child of A (i.e. A has
a reference to B), that would mean we would have to make this reference
point to A itself -- so we'd create a loop.

bdrv_replace_node() (through should_update_child()) refuses to do so if
B is the backing node of A.  There is no reason to restrict this rule to
backing nodes, though: The BDS graph should never contain loops, so we
should always refuse to create them.

If B is a child of A and B is to be replaced by A, we should simply
leave B in place there because it is the most sensible choice.

A more specific argument would be: Putting filter drivers into the BDS
graph is basically the same as appending an overlay to a backing chain.
But the main child BDS of a filter driver is not "backing" but "file",
so restricting the no-loop rule to backing nodes would fail here.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h |  2 ++
 block.c                   | 44 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index eaeaad9428..fa8bbf1f8b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -573,6 +573,8 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+typedef QLIST_HEAD(BdrvChildList, BdrvChild) BdrvChildList;
+
 /*
  * Note: the function bdrv_append() copies and swaps contents of
  * BlockDriverStates, so if you add new fields to this struct, please
diff --git a/block.c b/block.c
index 0b55c5a41c..1898b958c9 100644
--- a/block.c
+++ b/block.c
@@ -3134,16 +3134,39 @@ static bool should_update_child(BdrvChild *c, BlockDriverState *to)
         return false;
     }
 
-    if (c->role == &child_backing) {
-        /* If @from is a backing file of @to, ignore the child to avoid
-         * creating a loop. We only want to change the pointer of other
-         * parents. */
-        QLIST_FOREACH(to_c, &to->children, next) {
-            if (to_c == c) {
-                break;
-            }
-        }
-        if (to_c) {
+    /* If the child @c belongs to the BDS @to, replacing the current
+     * c->bs by @to would mean to create a loop.
+     *
+     * Such a case occurs when appending a BDS to a backing chain.
+     * For instance, imagine the following chain:
+     *
+     *   guest device -> node A -> further backing chain...
+     *
+     * Now we create a new BDS B which we want to put on top of this
+     * chain, so we first attach A as its backing node:
+     *
+     *                   node B
+     *                     |
+     *                     v
+     *   guest device -> node A -> further backing chain...
+     *
+     * Finally we want to replace A by B.  When doing that, we want to
+     * replace all pointers to A by pointers to B -- except for the
+     * pointer from B because (1) that would create a loop, and (2)
+     * that pointer should simply stay intact:
+     *
+     *   guest device -> node B
+     *                     |
+     *                     v
+     *                   node A -> further backing chain...
+     *
+     * In general, when replacing a node A (c->bs) by a node B (@to),
+     * if A is a child of B, that means we cannot replace A by B there
+     * because that would create a loop.  Silently detaching A from B
+     * is also not really an option.  So overall just leaving A in
+     * place there is the most sensible choice. */
+    QLIST_FOREACH(to_c, &to->children, next) {
+        if (to_c == c) {
             return false;
         }
     }
@@ -3169,6 +3192,7 @@ void bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
 
     /* Put all parents into @list and calculate their cumulative permissions */
     QLIST_FOREACH_SAFE(c, &from->parents, next_parent, next) {
+        assert(c->bs == from);
         if (!should_update_child(c, to)) {
             continue;
         }
-- 
2.13.5


* [Qemu-devel] [PATCH 10/18] block/mirror: Make source the file child
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (8 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 09/18] block: Generalize should_update_child() rule Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-10-10  9:47   ` Kevin Wolf
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 11/18] hbitmap: Add @advance param to hbitmap_iter_next() Max Reitz
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

Regarding the source BDS, the mirror BDS is arguably a filter node.
Therefore, the source BDS should be its "file" child.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c             | 127 ++++++++++++++++++++++++++++++++++-----------
 block/qapi.c               |  25 ++++++---
 tests/qemu-iotests/141.out |   4 +-
 3 files changed, 119 insertions(+), 37 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 9df4157511..05410c94ca 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -77,8 +77,16 @@ typedef struct MirrorBlockJob {
     int target_cluster_size;
     int max_iov;
     bool initial_zeroing_ongoing;
+
+    /* Signals that we are no longer accessing source and target and the mirror
+     * BDS should thus relinquish all permissions */
+    bool exiting;
 } MirrorBlockJob;
 
+typedef struct MirrorBDSOpaque {
+    MirrorBlockJob *job;
+} MirrorBDSOpaque;
+
 struct MirrorOp {
     MirrorBlockJob *s;
     QEMUIOVector qiov;
@@ -595,12 +603,15 @@ static void mirror_exit(BlockJob *job, void *opaque)
 {
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
     MirrorExitData *data = opaque;
+    MirrorBDSOpaque *bs_opaque = s->mirror_top_bs->opaque;
     AioContext *replace_aio_context = NULL;
     BlockDriverState *src = s->source->bs;
     BlockDriverState *target_bs = blk_bs(s->target);
     BlockDriverState *mirror_top_bs = s->mirror_top_bs;
     Error *local_err = NULL;
 
+    s->exiting = true;
+
     bdrv_release_dirty_bitmap(src, s->dirty_bitmap);
 
     /* Make sure that the source BDS doesn't go away before we called
@@ -622,7 +633,7 @@ static void mirror_exit(BlockJob *job, void *opaque)
 
     /* We don't access the source any more. Dropping any WRITE/RESIZE is
      * required before it could become a backing file of target_bs. */
-    bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
+    bdrv_child_try_set_perm(mirror_top_bs->file, 0, BLK_PERM_ALL,
                             &error_abort);
     if (s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
         BlockDriverState *backing = s->is_none_mode ? src : s->base;
@@ -673,12 +684,11 @@ static void mirror_exit(BlockJob *job, void *opaque)
 
     /* Remove the mirror filter driver from the graph. Before this, get rid of
      * the blockers on the intermediate nodes so that the resulting state is
-     * valid. Also give up permissions on mirror_top_bs->backing, which might
+     * valid. Also give up permissions on mirror_top_bs->file, which might
      * block the removal. */
     block_job_remove_all_bdrv(job);
-    bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
-                            &error_abort);
-    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
+    bdrv_child_try_set_perm(mirror_top_bs->file, 0, BLK_PERM_ALL, &error_abort);
+    bdrv_replace_node(mirror_top_bs, mirror_top_bs->file->bs, &error_abort);
 
     /* We just changed the BDS the job BB refers to (with either or both of the
      * bdrv_replace_node() calls), so switch the BB back so the cleanup does
@@ -687,6 +697,7 @@ static void mirror_exit(BlockJob *job, void *opaque)
     blk_set_perm(job->blk, 0, BLK_PERM_ALL, &error_abort);
     blk_insert_bs(job->blk, mirror_top_bs, &error_abort);
 
+    bs_opaque->job = NULL;
     block_job_completed(&s->common, data->ret);
 
     g_free(data);
@@ -1102,7 +1113,7 @@ static void mirror_drain(BlockJob *job)
 {
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
 
-    /* Need to keep a reference in case blk_drain triggers execution
+    /* Need to keep a reference in case bdrv_drain triggers execution
      * of mirror_complete...
      */
     if (s->target) {
@@ -1135,44 +1146,88 @@ static const BlockJobDriver commit_active_job_driver = {
     .drain                  = mirror_drain,
 };
 
+static void source_child_inherit_fmt_options(int *child_flags,
+                                             QDict *child_options,
+                                             int parent_flags,
+                                             QDict *parent_options)
+{
+    child_backing.inherit_options(child_flags, child_options,
+                                  parent_flags, parent_options);
+}
+
+static char *source_child_get_parent_desc(BdrvChild *c)
+{
+    return child_backing.get_parent_desc(c);
+}
+
+static void source_child_cb_drained_begin(BdrvChild *c)
+{
+    BlockDriverState *bs = c->opaque;
+    MirrorBDSOpaque *s = bs->opaque;
+
+    if (s && s->job) {
+        block_job_drained_begin(&s->job->common);
+    }
+    bdrv_drained_begin(bs);
+}
+
+static void source_child_cb_drained_end(BdrvChild *c)
+{
+    BlockDriverState *bs = c->opaque;
+    MirrorBDSOpaque *s = bs->opaque;
+
+    if (s && s->job) {
+        block_job_drained_end(&s->job->common);
+    }
+    bdrv_drained_end(bs);
+}
+
+static BdrvChildRole source_child_role = {
+    .inherit_options    = source_child_inherit_fmt_options,
+    .get_parent_desc    = source_child_get_parent_desc,
+    .drained_begin      = source_child_cb_drained_begin,
+    .drained_end        = source_child_cb_drained_end,
+};
+
 static int coroutine_fn bdrv_mirror_top_preadv(BlockDriverState *bs,
     uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
-    return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
+    return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
 }
 
 static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
     uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
-    return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
+    return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
 }
 
 static int coroutine_fn bdrv_mirror_top_flush(BlockDriverState *bs)
 {
-    return bdrv_co_flush(bs->backing->bs);
+    return bdrv_co_flush(bs->file->bs);
 }
 
 static int coroutine_fn bdrv_mirror_top_pwrite_zeroes(BlockDriverState *bs,
     int64_t offset, int bytes, BdrvRequestFlags flags)
 {
-    return bdrv_co_pwrite_zeroes(bs->backing, offset, bytes, flags);
+    return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
 }
 
 static int coroutine_fn bdrv_mirror_top_pdiscard(BlockDriverState *bs,
     int64_t offset, int bytes)
 {
-    return bdrv_co_pdiscard(bs->backing->bs, offset, bytes);
+    return bdrv_co_pdiscard(bs->file->bs, offset, bytes);
 }
 
 static void bdrv_mirror_top_refresh_filename(BlockDriverState *bs, QDict *opts)
 {
-    bdrv_refresh_filename(bs->backing->bs);
     pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
-            bs->backing->bs->filename);
+            bs->file->bs->filename);
 }
 
 static void bdrv_mirror_top_close(BlockDriverState *bs)
 {
+    bdrv_unref_child(bs, bs->file);
+    bs->file = NULL;
 }
 
 static void bdrv_mirror_top_child_perm(BlockDriverState *bs, BdrvChild *c,
@@ -1180,6 +1235,14 @@ static void bdrv_mirror_top_child_perm(BlockDriverState *bs, BdrvChild *c,
                                        uint64_t perm, uint64_t shared,
                                        uint64_t *nperm, uint64_t *nshared)
 {
+    MirrorBDSOpaque *s = bs->opaque;
+
+    if (s->job && s->job->exiting) {
+        *nperm = 0;
+        *nshared = BLK_PERM_ALL;
+        return;
+    }
+
     /* Must be able to forward guest writes to the real image */
     *nperm = 0;
     if (perm & BLK_PERM_WRITE) {
@@ -1190,7 +1253,7 @@ static void bdrv_mirror_top_child_perm(BlockDriverState *bs, BdrvChild *c,
 }
 
 /* Dummy node that provides consistent read to its users without requiring it
- * from its backing file and that allows writes on the backing file chain. */
+ * from its source file and that allows writes on the source file. */
 static BlockDriver bdrv_mirror_top = {
     .format_name                = "mirror_top",
     .bdrv_co_preadv             = bdrv_mirror_top_preadv,
@@ -1198,7 +1261,7 @@ static BlockDriver bdrv_mirror_top = {
     .bdrv_co_pwrite_zeroes      = bdrv_mirror_top_pwrite_zeroes,
     .bdrv_co_pdiscard           = bdrv_mirror_top_pdiscard,
     .bdrv_co_flush              = bdrv_mirror_top_flush,
-    .bdrv_co_get_block_status   = bdrv_co_get_block_status_from_backing,
+    .bdrv_co_get_block_status   = bdrv_co_get_block_status_from_file,
     .bdrv_refresh_filename      = bdrv_mirror_top_refresh_filename,
     .bdrv_close                 = bdrv_mirror_top_close,
     .bdrv_child_perm            = bdrv_mirror_top_child_perm,
@@ -1221,6 +1284,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
                              Error **errp)
 {
     MirrorBlockJob *s;
+    MirrorBDSOpaque *bs_opaque;
     BlockDriverState *mirror_top_bs;
     bool target_graph_mod;
     bool target_is_backing;
@@ -1244,9 +1308,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
         buf_size = DEFAULT_MIRROR_BUF_SIZE;
     }
 
-    /* In the case of active commit, add dummy driver to provide consistent
-     * reads on the top, while disabling it in the intermediate nodes, and make
-     * the backing chain writable. */
+    /* Create mirror BDS */
     mirror_top_bs = bdrv_new_open_driver(&bdrv_mirror_top, filter_node_name,
                                          BDRV_O_RDWR, errp);
     if (mirror_top_bs == NULL) {
@@ -1256,14 +1318,19 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
         mirror_top_bs->implicit = true;
     }
     mirror_top_bs->total_sectors = bs->total_sectors;
+    bs_opaque = g_new0(MirrorBDSOpaque, 1);
+    mirror_top_bs->opaque = bs_opaque;
     bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
 
-    /* bdrv_append takes ownership of the mirror_top_bs reference, need to keep
-     * it alive until block_job_create() succeeds even if bs has no parent. */
-    bdrv_ref(mirror_top_bs);
-    bdrv_drained_begin(bs);
-    bdrv_append(mirror_top_bs, bs, &local_err);
-    bdrv_drained_end(bs);
+    /* Create reference for bdrv_attach_child() */
+    bdrv_ref(bs);
+    mirror_top_bs->file = bdrv_attach_child(mirror_top_bs, bs, "file",
+                                            &source_child_role, &local_err);
+    if (!local_err) {
+        bdrv_drained_begin(bs);
+        bdrv_replace_node(bs, mirror_top_bs, &local_err);
+        bdrv_drained_end(bs);
+    }
 
     if (local_err) {
         bdrv_unref(mirror_top_bs);
@@ -1280,6 +1347,8 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
     if (!s) {
         goto fail;
     }
+    bs_opaque->job = s;
+
     /* The block job now has a reference to this node */
     bdrv_unref(mirror_top_bs);
 
@@ -1329,7 +1398,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
         s->should_complete = true;
     }
 
-    s->source = mirror_top_bs->backing;
+    s->source = mirror_top_bs->file;
     s->mirror_top_bs = mirror_top_bs;
 
     s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
@@ -1373,12 +1442,12 @@ fail:
 
         g_free(s->replaces);
         blk_unref(s->target);
-        block_job_early_fail(&s->common);
+        bs_opaque->job = NULL;
+        block_job_unref(&s->common);
     }
 
-    bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
-                            &error_abort);
-    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
+    bdrv_child_try_set_perm(mirror_top_bs->file, 0, BLK_PERM_ALL, &error_abort);
+    bdrv_replace_node(mirror_top_bs, mirror_top_bs->file->bs, &error_abort);
 
     bdrv_unref(mirror_top_bs);
 }
diff --git a/block/qapi.c b/block/qapi.c
index 7fa2437923..ee792d0cbc 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -147,9 +147,13 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
 
         /* Skip automatically inserted nodes that the user isn't aware of for
          * query-block (blk != NULL), but not for query-named-block-nodes */
-        while (blk && bs0->drv && bs0->implicit) {
-            bs0 = backing_bs(bs0);
-            assert(bs0);
+        while (blk && bs0 && bs0->drv && bs0->implicit) {
+            if (bs0->backing) {
+                bs0 = backing_bs(bs0);
+            } else {
+                assert(bs0->file);
+                bs0 = bs0->file->bs;
+            }
         }
     }
 
@@ -337,7 +341,12 @@ static void bdrv_query_info(BlockBackend *blk, BlockInfo **p_info,
 
     /* Skip automatically inserted nodes that the user isn't aware of */
     while (bs && bs->drv && bs->implicit) {
-        bs = backing_bs(bs);
+        if (bs->backing) {
+            bs = backing_bs(bs);
+        } else {
+            assert(bs->file);
+            bs = bs->file->bs;
+        }
     }
 
     info->device = g_strdup(blk_name(blk));
@@ -466,8 +475,12 @@ static BlockStats *bdrv_query_bds_stats(BlockDriverState *bs,
      * a BlockBackend-level command. Stay at the exact node for a node-level
      * command. */
     while (blk_level && bs->drv && bs->implicit) {
-        bs = backing_bs(bs);
-        assert(bs);
+        if (bs->backing) {
+            bs = backing_bs(bs);
+        } else {
+            assert(bs->file);
+            bs = bs->file->bs;
+        }
     }
 
     if (bdrv_get_node_name(bs)[0]) {
diff --git a/tests/qemu-iotests/141.out b/tests/qemu-iotests/141.out
index 82e763b68d..8c4dd6d531 100644
--- a/tests/qemu-iotests/141.out
+++ b/tests/qemu-iotests/141.out
@@ -20,7 +20,7 @@ Formatting 'TEST_DIR/o.IMGFMT', fmt=IMGFMT size=1048576 backing_file=TEST_DIR/t.
 Formatting 'TEST_DIR/o.IMGFMT', fmt=IMGFMT size=1048576 backing_file=TEST_DIR/t.IMGFMT backing_fmt=IMGFMT
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_READY", "data": {"device": "job0", "len": 0, "offset": 0, "speed": 0, "type": "mirror"}}
 {"return": {}}
-{"error": {"class": "GenericError", "desc": "Node 'drv0' is busy: node is used as backing hd of 'NODE_NAME'"}}
+{"error": {"class": "GenericError", "desc": "Block device drv0 is in use"}}
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "job0", "len": 0, "offset": 0, "speed": 0, "type": "mirror"}}
 {"return": {}}
@@ -30,7 +30,7 @@ Formatting 'TEST_DIR/o.IMGFMT', fmt=IMGFMT size=1048576 backing_file=TEST_DIR/t.
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_READY", "data": {"device": "job0", "len": 0, "offset": 0, "speed": 0, "type": "commit"}}
 {"return": {}}
-{"error": {"class": "GenericError", "desc": "Node 'drv0' is busy: node is used as backing hd of 'NODE_NAME'"}}
+{"error": {"class": "GenericError", "desc": "Block device drv0 is in use"}}
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "job0", "len": 0, "offset": 0, "speed": 0, "type": "commit"}}
 {"return": {}}
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [Qemu-devel] [PATCH 11/18] hbitmap: Add @advance param to hbitmap_iter_next()
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (9 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 10/18] block/mirror: Make source the file child Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-09-25 15:38   ` Vladimir Sementsov-Ogievskiy
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 12/18] block/dirty-bitmap: Add bdrv_dirty_iter_next_area Max Reitz
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

This new parameter allows the caller to just query the next dirty
position without moving the iterator.
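
As an editorial aside, the peek-without-advancing semantics can be sketched with a minimal single-word stand-in for HBitmap (this is an illustration only, not QEMU's multi-level implementation):

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal sketch of the @advance parameter: with advance=false, the
 * next call returns the same bit again.  A single 64-bit word stands
 * in for the real multi-level HBitmap. */
typedef struct {
    uint64_t cur;   /* bits not yet visited */
} MiniIter;

static void mini_iter_init(MiniIter *it, uint64_t bits)
{
    it->cur = bits;
}

static int64_t mini_iter_next(MiniIter *it, bool advance)
{
    if (it->cur == 0) {
        return -1;  /* all remaining bits are zero */
    }
    int64_t pos = __builtin_ctzll(it->cur);
    if (advance) {
        /* Clear the lowest set bit, mirroring cur & (cur - 1) in the patch */
        it->cur &= it->cur - 1;
    }
    return pos;
}
```

A non-advancing call is what the next patch uses to check whether a dirty area continues, without consuming the iterator position.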

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/qemu/hbitmap.h |  4 +++-
 block/dirty-bitmap.c   |  2 +-
 tests/test-hbitmap.c   | 26 +++++++++++++-------------
 util/hbitmap.c         | 10 +++++++---
 4 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index d3a74a21fc..6a52575ad5 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -316,11 +316,13 @@ void hbitmap_free_meta(HBitmap *hb);
 /**
  * hbitmap_iter_next:
  * @hbi: HBitmapIter to operate on.
+ * @advance: If true, advance the iterator.  Otherwise, the next call
+ *           of this function will return the same result.
  *
  * Return the next bit that is set in @hbi's associated HBitmap,
  * or -1 if all remaining bits are zero.
  */
-int64_t hbitmap_iter_next(HBitmapIter *hbi);
+int64_t hbitmap_iter_next(HBitmapIter *hbi, bool advance);
 
 /**
  * hbitmap_iter_next_word:
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 30462d4f9a..aee57cf8c8 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -547,7 +547,7 @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter)
 
 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
 {
-    return hbitmap_iter_next(&iter->hbi);
+    return hbitmap_iter_next(&iter->hbi, true);
 }
 
 /* Called within bdrv_dirty_bitmap_lock..unlock */
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index 1acb353889..e6d4d563cb 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -46,7 +46,7 @@ static void hbitmap_test_check(TestHBitmapData *data,
 
     i = first;
     for (;;) {
-        next = hbitmap_iter_next(&hbi);
+        next = hbitmap_iter_next(&hbi, true);
         if (next < 0) {
             next = data->size;
         }
@@ -435,25 +435,25 @@ static void test_hbitmap_iter_granularity(TestHBitmapData *data,
     /* Note that hbitmap_test_check has to be invoked manually in this test.  */
     hbitmap_test_init(data, 131072 << 7, 7);
     hbitmap_iter_init(&hbi, data->hb, 0);
-    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 
     hbitmap_test_set(data, ((L2 + L1 + 1) << 7) + 8, 8);
     hbitmap_iter_init(&hbi, data->hb, 0);
-    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
-    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
+    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 
     hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
-    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 
     hbitmap_test_set(data, (131072 << 7) - 8, 8);
     hbitmap_iter_init(&hbi, data->hb, 0);
-    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
-    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
-    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
+    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
+    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 
     hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
-    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
-    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
+    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 }
 
 static void hbitmap_test_set_boundary_bits(TestHBitmapData *data, ssize_t diff)
@@ -893,7 +893,7 @@ static void test_hbitmap_serialize_zeroes(TestHBitmapData *data,
     for (i = 0; i < num_positions; i++) {
         hbitmap_deserialize_zeroes(data->hb, positions[i], min_l1, true);
         hbitmap_iter_init(&iter, data->hb, 0);
-        next = hbitmap_iter_next(&iter);
+        next = hbitmap_iter_next(&iter, true);
         if (i == num_positions - 1) {
             g_assert_cmpint(next, ==, -1);
         } else {
@@ -919,10 +919,10 @@ static void test_hbitmap_iter_and_reset(TestHBitmapData *data,
 
     hbitmap_iter_init(&hbi, data->hb, BITS_PER_LONG - 1);
 
-    hbitmap_iter_next(&hbi);
+    hbitmap_iter_next(&hbi, true);
 
     hbitmap_reset_all(data->hb);
-    hbitmap_iter_next(&hbi);
+    hbitmap_iter_next(&hbi, true);
 }
 
 int main(int argc, char **argv)
diff --git a/util/hbitmap.c b/util/hbitmap.c
index 21535cc90b..96525983ce 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -141,7 +141,7 @@ unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi)
     return cur;
 }
 
-int64_t hbitmap_iter_next(HBitmapIter *hbi)
+int64_t hbitmap_iter_next(HBitmapIter *hbi, bool advance)
 {
     unsigned long cur = hbi->cur[HBITMAP_LEVELS - 1] &
             hbi->hb->levels[HBITMAP_LEVELS - 1][hbi->pos];
@@ -154,8 +154,12 @@ int64_t hbitmap_iter_next(HBitmapIter *hbi)
         }
     }
 
-    /* The next call will resume work from the next bit.  */
-    hbi->cur[HBITMAP_LEVELS - 1] = cur & (cur - 1);
+    if (advance) {
+        /* The next call will resume work from the next bit.  */
+        hbi->cur[HBITMAP_LEVELS - 1] = cur & (cur - 1);
+    } else {
+        hbi->cur[HBITMAP_LEVELS - 1] = cur;
+    }
     item = ((uint64_t)hbi->pos << BITS_PER_LEVEL) + ctzl(cur);
 
     return item << hbi->granularity;
-- 
2.13.5


* [Qemu-devel] [PATCH 12/18] block/dirty-bitmap: Add bdrv_dirty_iter_next_area
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (10 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 11/18] hbitmap: Add @advance param to hbitmap_iter_next() Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-09-25 15:49   ` Vladimir Sementsov-Ogievskiy
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 13/18] block/mirror: Keep write perm for pending writes Max Reitz
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

This new function allows the caller to look for a consecutive dirty
area in a dirty bitmap.
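
The idea can be sketched in standalone form over a plain array of per-granule dirty flags (a hypothetical simplification; the real bdrv_dirty_iter_next_area() drives an HBitmap iterator and caps the final length at max_offset):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: find the next run of consecutively dirty granules starting
 * at or after @start, such that the run stays below @max_offset. */
static bool next_dirty_area(const bool *dirty, size_t ngranules,
                            uint64_t granularity, uint64_t start,
                            uint64_t max_offset,
                            uint64_t *offset, uint64_t *bytes)
{
    for (size_t g = start / granularity; g < ngranules; g++) {
        if ((g + 1) * granularity > max_offset) {
            break;  /* this granule would exceed max_offset */
        }
        if (!dirty[g]) {
            continue;
        }
        /* Extend the run while the next granule is dirty and fits */
        size_t end = g;
        while (end + 1 < ngranules && dirty[end + 1] &&
               (end + 2) * granularity <= max_offset) {
            end++;
        }
        *offset = g * granularity;
        *bytes = (end - g + 1) * granularity;
        return true;
    }
    return false;
}
```

Calling it repeatedly with start advanced past each returned area enumerates all dirty areas below max_offset.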

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/dirty-bitmap.h |  2 ++
 block/dirty-bitmap.c         | 52 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index a79a58d2c3..7654748700 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -90,6 +90,8 @@ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
                                     int64_t cur_sector, int64_t nr_sectors);
 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
+bool bdrv_dirty_iter_next_area(BdrvDirtyBitmapIter *iter, uint64_t max_offset,
+                               uint64_t *offset, int *bytes);
 void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
 int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index aee57cf8c8..81b2f78016 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -550,6 +550,58 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
     return hbitmap_iter_next(&iter->hbi, true);
 }
 
+/**
+ * Return the next consecutively dirty area in the dirty bitmap
+ * belonging to the given iterator @iter.
+ *
+ * @max_offset: Maximum value that may be returned for
+ *              *offset + *bytes
+ * @offset:     Will contain the start offset of the next dirty area
+ * @bytes:      Will contain the length of the next dirty area
+ *
+ * Returns: True if a dirty area could be found before max_offset
+ *          (which means that *offset and *bytes then contain valid
+ *          values), false otherwise.
+ */
+bool bdrv_dirty_iter_next_area(BdrvDirtyBitmapIter *iter, uint64_t max_offset,
+                               uint64_t *offset, int *bytes)
+{
+    uint32_t granularity = bdrv_dirty_bitmap_granularity(iter->bitmap);
+    uint64_t gran_max_offset;
+    int sector_gran = granularity >> BDRV_SECTOR_BITS;
+    int64_t ret;
+    int size;
+
+    if (DIV_ROUND_UP(max_offset, BDRV_SECTOR_SIZE) == iter->bitmap->size) {
+        /* If max_offset points to the image end, round it up by the
+         * bitmap granularity */
+        gran_max_offset = ROUND_UP(max_offset, granularity);
+    } else {
+        gran_max_offset = max_offset;
+    }
+
+    ret = hbitmap_iter_next(&iter->hbi, false);
+    if (ret < 0 || (ret << BDRV_SECTOR_BITS) + granularity > gran_max_offset) {
+        return false;
+    }
+
+    *offset = ret << BDRV_SECTOR_BITS;
+    size = 0;
+
+    assert(granularity <= INT_MAX);
+
+    do {
+        /* Advance iterator */
+        ret = hbitmap_iter_next(&iter->hbi, true);
+        size += granularity;
+    } while ((ret << BDRV_SECTOR_BITS) + granularity <= gran_max_offset &&
+             hbitmap_iter_next(&iter->hbi, false) == ret + sector_gran &&
+             size <= INT_MAX - granularity);
+
+    *bytes = MIN(size, max_offset - *offset);
+    return true;
+}
+
 /* Called within bdrv_dirty_bitmap_lock..unlock */
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
                                   int64_t cur_sector, int64_t nr_sectors)
-- 
2.13.5


* [Qemu-devel] [PATCH 13/18] block/mirror: Keep write perm for pending writes
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (11 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 12/18] block/dirty-bitmap: Add bdrv_dirty_iter_next_area Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-10-10  9:58   ` Kevin Wolf
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 14/18] block/mirror: Distinguish active from passive ops Max Reitz
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

The owner of the mirror BDS might retire its write permission; but
there may still be pending mirror operations, so the mirror BDS cannot
necessarily retire the write permission on its child at that point.
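
The resulting permission rule reduces to the following sketch (the flag value for BLK_PERM_WRITE is assumed here, and the real callback also computes shared permissions):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for BLK_PERM_WRITE; the real value lives in QEMU's
 * block layer headers. */
#define SKETCH_PERM_WRITE (1u << 0)

/* Keep WRITE on the child while any mirror op is still in flight,
 * even if the parent has already dropped its write permission. */
static uint32_t mirror_child_perm(uint32_t parent_perm, bool ops_in_flight)
{
    uint32_t nperm = 0;
    if ((parent_perm & SKETCH_PERM_WRITE) || ops_in_flight) {
        nperm |= SKETCH_PERM_WRITE;
    }
    return nperm;
}
```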

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 05410c94ca..612fab660e 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1236,6 +1236,7 @@ static void bdrv_mirror_top_child_perm(BlockDriverState *bs, BdrvChild *c,
                                        uint64_t *nperm, uint64_t *nshared)
 {
     MirrorBDSOpaque *s = bs->opaque;
+    bool ops_in_flight = s->job && !QTAILQ_EMPTY(&s->job->ops_in_flight);
 
     if (s->job && s->job->exiting) {
         *nperm = 0;
@@ -1243,9 +1244,10 @@ static void bdrv_mirror_top_child_perm(BlockDriverState *bs, BdrvChild *c,
         return;
     }
 
-    /* Must be able to forward guest writes to the real image */
+    /* Must be able to forward both new and pending guest writes to
+     * the real image */
     *nperm = 0;
-    if (perm & BLK_PERM_WRITE) {
+    if ((perm & BLK_PERM_WRITE) || ops_in_flight) {
         *nperm |= BLK_PERM_WRITE;
     }
 
-- 
2.13.5


* [Qemu-devel] [PATCH 14/18] block/mirror: Distinguish active from passive ops
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (12 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 13/18] block/mirror: Keep write perm for pending writes Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 15/18] block/mirror: Add active mirroring Max Reitz
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

Currently, the mirror block job only knows passive operations.  But once
we introduce active writes, we need to distinguish between the two; for
example, mirror_wait_for_free_in_flight_slot() should wait for a passive
operation because active writes will not use the same in-flight slots.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 612fab660e..8fea619a68 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -96,6 +96,7 @@ struct MirrorOp {
     /* Set by mirror_co_read() before yielding for the first time */
     uint64_t bytes_copied;
 
+    bool is_active_write;
     CoQueue waiting_requests;
 
     QTAILQ_ENTRY(MirrorOp) next;
@@ -286,9 +287,14 @@ static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
 {
     MirrorOp *op;
 
-    op = QTAILQ_FIRST(&s->ops_in_flight);
-    assert(op);
-    qemu_co_queue_wait(&op->waiting_requests, NULL);
+    QTAILQ_FOREACH(op, &s->ops_in_flight, next) {
+        if (!op->is_active_write) {
+            /* Only non-active operations use up in-flight slots */
+            qemu_co_queue_wait(&op->waiting_requests, NULL);
+            return;
+        }
+    }
+    abort();
 }
 
 /* Submit async read while handling COW.
-- 
2.13.5


* [Qemu-devel] [PATCH 15/18] block/mirror: Add active mirroring
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (13 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 14/18] block/mirror: Distinguish active from passive ops Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-09-14 15:57   ` Stefan Hajnoczi
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 16/18] block/mirror: Add copy mode QAPI interface Max Reitz
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

This patch implements active synchronous mirroring.  In active mode, the
passive mechanism will still be in place and is used to copy all
initially dirty clusters off the source disk; but every write request
will write data both to the source and the target disk, so the source
cannot be dirtied faster than data is mirrored to the target.  Also,
once the block job has converged (BLOCK_JOB_READY sent), source and
target are guaranteed to stay in sync (unless an error occurs).

Optionally, dirty data can be copied to the target disk on read
operations, too.

Active mode is completely optional and currently disabled at runtime.  A
later patch will add a way for users to enable it.
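
The convergence argument boils down to the ordering sketched below (write_source_stub/write_target_stub are placeholders, not QEMU APIs): the guest write completes only after the same range has been mirrored, so once the bitmap is clean it stays clean unless the target write fails:

```c
#include <stdbool.h>
#include <stdint.h>

static int write_source_stub(uint64_t off, uint64_t len)
{
    (void)off; (void)len;
    return 0;
}

static int write_target_stub(uint64_t off, uint64_t len)
{
    (void)off; (void)len;
    return 0;
}

/* Sketch of the active write path: mirror each guest write
 * synchronously before completing the request. */
static int active_mirror_write(uint64_t offset, uint64_t bytes,
                               bool *target_diverged)
{
    int ret = write_source_stub(offset, bytes);
    if (ret < 0) {
        return ret;  /* guest write failed; nothing reached the source */
    }
    ret = write_target_stub(offset, bytes);
    if (ret < 0) {
        /* Target missed this write: fall back to dirty tracking */
        *target_diverged = true;
        return 0;    /* the guest write itself still succeeded */
    }
    *target_diverged = false;
    return 0;
}
```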

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 qapi/block-core.json |  23 +++++++
 block/mirror.c       | 187 +++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 205 insertions(+), 5 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index bb11815608..e072cfa67c 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -938,6 +938,29 @@
   'data': ['top', 'full', 'none', 'incremental'] }
 
 ##
+# @MirrorCopyMode:
+#
+# An enumeration whose values tell the mirror block job when to
+# trigger writes to the target.
+#
+# @passive: copy data in background only.
+#
+# @active-write: when data is written to the source, write it
+#                (synchronously) to the target as well.  In addition,
+#                data is copied in background just like in @passive
+#                mode.
+#
+# @active-read-write: write data to the target (synchronously) both
+#                     when it is read from and written to the source.
+#                     In addition, data is copied in background just
+#                     like in @passive mode.
+#
+# Since: 2.11
+##
+{ 'enum': 'MirrorCopyMode',
+  'data': ['passive', 'active-write', 'active-read-write'] }
+
+##
 # @BlockJobType:
 #
 # Type of a block job.
diff --git a/block/mirror.c b/block/mirror.c
index 8fea619a68..c429aa77bb 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -54,8 +54,12 @@ typedef struct MirrorBlockJob {
     Error *replace_blocker;
     bool is_none_mode;
     BlockMirrorBackingMode backing_mode;
+    MirrorCopyMode copy_mode;
     BlockdevOnError on_source_error, on_target_error;
     bool synced;
+    /* Set when the target is synced (dirty bitmap is clean, nothing
+     * in flight) and the job is running in active mode */
+    bool actively_synced;
     bool should_complete;
     int64_t granularity;
     size_t buf_size;
@@ -77,6 +81,7 @@ typedef struct MirrorBlockJob {
     int target_cluster_size;
     int max_iov;
     bool initial_zeroing_ongoing;
+    int in_active_write_counter;
 
     /* Signals that we are no longer accessing source and target and the mirror
      * BDS should thus relinquish all permissions */
@@ -112,6 +117,7 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
                                             int error)
 {
     s->synced = false;
+    s->actively_synced = false;
     if (read) {
         return block_job_error_action(&s->common, s->on_source_error,
                                       true, error);
@@ -283,13 +289,12 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
     return ret;
 }
 
-static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
+static inline void mirror_wait_for_any_operation(MirrorBlockJob *s, bool active)
 {
     MirrorOp *op;
 
     QTAILQ_FOREACH(op, &s->ops_in_flight, next) {
-        if (!op->is_active_write) {
-            /* Only non-active operations use up in-flight slots */
+        if (op->is_active_write == active) {
             qemu_co_queue_wait(&op->waiting_requests, NULL);
             return;
         }
@@ -297,6 +302,12 @@ static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
     abort();
 }
 
+static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
+{
+    /* Only non-active operations use up in-flight slots */
+    mirror_wait_for_any_operation(s, false);
+}
+
 /* Submit async read while handling COW.
  * Returns: The number of bytes copied after and including offset,
  *          excluding any bytes copied prior to offset due to alignment.
@@ -861,6 +872,7 @@ static void coroutine_fn mirror_run(void *opaque)
         /* Report BLOCK_JOB_READY and wait for complete. */
         block_job_event_ready(&s->common);
         s->synced = true;
+        s->actively_synced = true;
         while (!block_job_is_cancelled(&s->common) && !s->should_complete) {
             block_job_yield(&s->common);
         }
@@ -912,6 +924,12 @@ static void coroutine_fn mirror_run(void *opaque)
         int64_t cnt, delta;
         bool should_complete;
 
+        /* Do not start passive operations while there are active
+         * writes in progress */
+        while (s->in_active_write_counter) {
+            mirror_wait_for_any_operation(s, true);
+        }
+
         if (s->ret < 0) {
             ret = s->ret;
             goto immediate_exit;
@@ -961,6 +979,9 @@ static void coroutine_fn mirror_run(void *opaque)
                  */
                 block_job_event_ready(&s->common);
                 s->synced = true;
+                if (s->copy_mode != MIRROR_COPY_MODE_PASSIVE) {
+                    s->actively_synced = true;
+                }
             }
 
             should_complete = s->should_complete ||
@@ -1195,16 +1216,171 @@ static BdrvChildRole source_child_role = {
     .drained_end        = source_child_cb_drained_end,
 };
 
+static void do_sync_target_write(MirrorBlockJob *job, uint64_t offset,
+                                 uint64_t bytes, QEMUIOVector *qiov, int flags)
+{
+    BdrvDirtyBitmapIter *iter;
+    QEMUIOVector target_qiov;
+    uint64_t dirty_offset;
+    int dirty_bytes;
+
+    qemu_iovec_init(&target_qiov, qiov->niov);
+
+    iter = bdrv_dirty_iter_new(job->dirty_bitmap, offset >> BDRV_SECTOR_BITS);
+
+    while (true) {
+        bool valid_area;
+        int ret;
+
+        bdrv_dirty_bitmap_lock(job->dirty_bitmap);
+        valid_area = bdrv_dirty_iter_next_area(iter, offset + bytes,
+                                               &dirty_offset, &dirty_bytes);
+        bdrv_dirty_bitmap_unlock(job->dirty_bitmap);
+        if (!valid_area) {
+            break;
+        }
+
+        job->common.len += dirty_bytes;
+
+        assert(dirty_offset - offset <= SIZE_MAX);
+        if (qiov) {
+            qemu_iovec_reset(&target_qiov);
+            qemu_iovec_concat(&target_qiov, qiov,
+                              dirty_offset - offset, dirty_bytes);
+        }
+
+        ret = blk_co_pwritev(job->target, dirty_offset, dirty_bytes,
+                             qiov ? &target_qiov : NULL, flags);
+        if (ret >= 0) {
+            assert(dirty_offset % BDRV_SECTOR_SIZE == 0);
+            assert(dirty_bytes % BDRV_SECTOR_SIZE == 0);
+            bdrv_reset_dirty_bitmap(job->dirty_bitmap,
+                                    dirty_offset >> BDRV_SECTOR_BITS,
+                                    dirty_bytes >> BDRV_SECTOR_BITS);
+
+            job->common.offset += dirty_bytes;
+        } else {
+            BlockErrorAction action;
+
+            action = mirror_error_action(job, false, -ret);
+            if (action == BLOCK_ERROR_ACTION_REPORT) {
+                if (!job->ret) {
+                    job->ret = ret;
+                }
+                break;
+            }
+        }
+    }
+
+    bdrv_dirty_iter_free(iter);
+    qemu_iovec_destroy(&target_qiov);
+}
+
+static MirrorOp *coroutine_fn active_write_prepare(MirrorBlockJob *s,
+                                                   uint64_t offset,
+                                                   uint64_t bytes)
+{
+    MirrorOp *op;
+    uint64_t start_chunk = offset / s->granularity;
+    uint64_t end_chunk = DIV_ROUND_UP(offset + bytes, s->granularity);
+
+    op = g_new(MirrorOp, 1);
+    *op = (MirrorOp){
+        .s                  = s,
+        .offset             = offset,
+        .bytes              = bytes,
+        .is_active_write    = true,
+    };
+    qemu_co_queue_init(&op->waiting_requests);
+    QTAILQ_INSERT_TAIL(&s->ops_in_flight, op, next);
+
+    s->in_active_write_counter++;
+
+    mirror_wait_on_conflicts(op, s, offset, bytes);
+
+    bitmap_set(s->in_flight_bitmap, start_chunk, end_chunk - start_chunk);
+
+    return op;
+}
+
+static void coroutine_fn active_write_settle(MirrorOp *op)
+{
+    uint64_t start_chunk = op->offset / op->s->granularity;
+    uint64_t end_chunk = DIV_ROUND_UP(op->offset + op->bytes,
+                                      op->s->granularity);
+
+    if (!--op->s->in_active_write_counter && op->s->actively_synced) {
+        /* Assert that we are back in sync once all active write
+         * operations are settled */
+        assert(!bdrv_get_dirty_count(op->s->dirty_bitmap));
+    }
+    bitmap_clear(op->s->in_flight_bitmap, start_chunk, end_chunk - start_chunk);
+    QTAILQ_REMOVE(&op->s->ops_in_flight, op, next);
+    qemu_co_queue_restart_all(&op->waiting_requests);
+    g_free(op);
+}
+
 static int coroutine_fn bdrv_mirror_top_preadv(BlockDriverState *bs,
     uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
-    return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
+    MirrorOp *op = NULL;
+    MirrorBDSOpaque *s = bs->opaque;
+    int ret = 0;
+    bool copy_to_target;
+
+    copy_to_target = s->job->ret >= 0 &&
+                     s->job->copy_mode == MIRROR_COPY_MODE_ACTIVE_READ_WRITE;
+
+    if (copy_to_target) {
+        op = active_write_prepare(s->job, offset, bytes);
+    }
+
+    ret = bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
+    if (ret < 0) {
+        goto out;
+    }
+
+    if (copy_to_target) {
+        do_sync_target_write(s->job, offset, bytes, qiov, 0);
+    }
+
+out:
+    if (copy_to_target) {
+        active_write_settle(op);
+    }
+    return ret;
 }
 
 static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
     uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
-    return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
+    MirrorOp *op = NULL;
+    MirrorBDSOpaque *s = bs->opaque;
+    int ret = 0;
+    bool copy_to_target;
+
+    copy_to_target = s->job->ret >= 0 &&
+                     (s->job->copy_mode == MIRROR_COPY_MODE_ACTIVE_WRITE ||
+                      s->job->copy_mode == MIRROR_COPY_MODE_ACTIVE_READ_WRITE);
+
+    if (copy_to_target) {
+        op = active_write_prepare(s->job, offset, bytes);
+    }
+
+    ret = bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
+    if (ret < 0) {
+        goto out;
+    }
+
+    if (copy_to_target) {
+        do_sync_target_write(s->job, offset, bytes, qiov, flags);
+    }
+
+out:
+    if (copy_to_target) {
+        active_write_settle(op);
+    }
+    return ret;
 }
 
 static int coroutine_fn bdrv_mirror_top_flush(BlockDriverState *bs)
@@ -1398,6 +1574,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
     s->on_target_error = on_target_error;
     s->is_none_mode = is_none_mode;
     s->backing_mode = backing_mode;
+    s->copy_mode = MIRROR_COPY_MODE_PASSIVE;
     s->base = base;
     s->granularity = granularity;
     s->buf_size = ROUND_UP(buf_size, granularity);
-- 
2.13.5


* [Qemu-devel] [PATCH 16/18] block/mirror: Add copy mode QAPI interface
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (14 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 15/18] block/mirror: Add active mirroring Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 17/18] qemu-io: Add background write Max Reitz
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

This patch allows the user to specify whether to use active or only
passive mode for mirror block jobs.  Currently, this setting will remain
constant for the duration of the entire block job.
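For illustration, a hypothetical QMP invocation selecting the new mode
could look like this (node names "source"/"target" and the job ID are
made up; the copy-mode value follows the schema added below):

```json
{ "execute": "blockdev-mirror",
  "arguments": { "job-id": "mirror0",
                 "device": "source",
                 "target": "target",
                 "sync": "full",
                 "copy-mode": "active-write" } }
```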

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 qapi/block-core.json      | 11 +++++++++--
 include/block/block_int.h |  4 +++-
 block/mirror.c            | 11 ++++++-----
 blockdev.c                |  9 ++++++++-
 4 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e072cfa67c..40204d367a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1578,6 +1578,9 @@
 #         written. Both will result in identical contents.
 #         Default is true. (Since 2.4)
 #
+# @copy-mode: when to copy data to the destination; defaults to 'passive'
+#             (Since: 2.11)
+#
 # Since: 1.3
 ##
 { 'struct': 'DriveMirror',
@@ -1587,7 +1590,7 @@
             '*speed': 'int', '*granularity': 'uint32',
             '*buf-size': 'int', '*on-source-error': 'BlockdevOnError',
             '*on-target-error': 'BlockdevOnError',
-            '*unmap': 'bool' } }
+            '*unmap': 'bool', '*copy-mode': 'MirrorCopyMode' } }
 
 ##
 # @BlockDirtyBitmap:
@@ -1766,6 +1769,9 @@
 #                    above @device. If this option is not given, a node name is
 #                    autogenerated. (Since: 2.9)
 #
+# @copy-mode: when to copy data to the destination; defaults to 'passive'
+#             (Since: 2.11)
+#
 # Returns: nothing on success.
 #
 # Since: 2.6
@@ -1786,7 +1792,8 @@
             '*speed': 'int', '*granularity': 'uint32',
             '*buf-size': 'int', '*on-source-error': 'BlockdevOnError',
             '*on-target-error': 'BlockdevOnError',
-            '*filter-node-name': 'str' } }
+            '*filter-node-name': 'str',
+            '*copy-mode': 'MirrorCopyMode' } }
 
 ##
 # @block_set_io_throttle:
diff --git a/include/block/block_int.h b/include/block/block_int.h
index fa8bbf1f8b..517b2680ce 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -934,6 +934,7 @@ void commit_active_start(const char *job_id, BlockDriverState *bs,
  * @filter_node_name: The node name that should be assigned to the filter
  * driver that the mirror job inserts into the graph above @bs. NULL means that
  * a node name should be autogenerated.
+ * @copy_mode: When to trigger writes to the target.
  * @errp: Error object.
  *
  * Start a mirroring operation on @bs.  Clusters that are allocated
@@ -947,7 +948,8 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
                   MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
                   BlockdevOnError on_source_error,
                   BlockdevOnError on_target_error,
-                  bool unmap, const char *filter_node_name, Error **errp);
+                  bool unmap, const char *filter_node_name,
+                  MirrorCopyMode copy_mode, Error **errp);
 
 /*
  * backup_job_create:
diff --git a/block/mirror.c b/block/mirror.c
index c429aa77bb..8a67935cc4 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1464,7 +1464,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
                              const BlockJobDriver *driver,
                              bool is_none_mode, BlockDriverState *base,
                              bool auto_complete, const char *filter_node_name,
-                             bool is_mirror,
+                             bool is_mirror, MirrorCopyMode copy_mode,
                              Error **errp)
 {
     MirrorBlockJob *s;
@@ -1574,7 +1574,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
     s->on_target_error = on_target_error;
     s->is_none_mode = is_none_mode;
     s->backing_mode = backing_mode;
-    s->copy_mode = MIRROR_COPY_MODE_PASSIVE;
+    s->copy_mode = copy_mode;
     s->base = base;
     s->granularity = granularity;
     s->buf_size = ROUND_UP(buf_size, granularity);
@@ -1643,7 +1643,8 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
                   MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
                   BlockdevOnError on_source_error,
                   BlockdevOnError on_target_error,
-                  bool unmap, const char *filter_node_name, Error **errp)
+                  bool unmap, const char *filter_node_name,
+                  MirrorCopyMode copy_mode, Error **errp)
 {
     bool is_none_mode;
     BlockDriverState *base;
@@ -1658,7 +1659,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
                      speed, granularity, buf_size, backing_mode,
                      on_source_error, on_target_error, unmap, NULL, NULL,
                      &mirror_job_driver, is_none_mode, base, false,
-                     filter_node_name, true, errp);
+                     filter_node_name, true, copy_mode, errp);
 }
 
 void commit_active_start(const char *job_id, BlockDriverState *bs,
@@ -1681,7 +1682,7 @@ void commit_active_start(const char *job_id, BlockDriverState *bs,
                      MIRROR_LEAVE_BACKING_CHAIN,
                      on_error, on_error, true, cb, opaque,
                      &commit_active_job_driver, false, base, auto_complete,
-                     filter_node_name, false, &local_err);
+                     filter_node_name, false, MIRROR_COPY_MODE_PASSIVE, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
         goto error_restore_flags;
diff --git a/blockdev.c b/blockdev.c
index 56a6b24a0b..7f9c215e98 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3408,6 +3408,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
                                    bool has_unmap, bool unmap,
                                    bool has_filter_node_name,
                                    const char *filter_node_name,
+                                   bool has_copy_mode, MirrorCopyMode copy_mode,
                                    Error **errp)
 {
 
@@ -3432,6 +3433,9 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
     if (!has_filter_node_name) {
         filter_node_name = NULL;
     }
+    if (!has_copy_mode) {
+        copy_mode = MIRROR_COPY_MODE_PASSIVE;
+    }
 
     if (granularity != 0 && (granularity < 512 || granularity > 1048576 * 64)) {
         error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "granularity",
@@ -3462,7 +3466,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
                  has_replaces ? replaces : NULL,
                  speed, granularity, buf_size, sync, backing_mode,
                  on_source_error, on_target_error, unmap, filter_node_name,
-                 errp);
+                 copy_mode, errp);
 }
 
 void qmp_drive_mirror(DriveMirror *arg, Error **errp)
@@ -3603,6 +3607,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
                            arg->has_on_target_error, arg->on_target_error,
                            arg->has_unmap, arg->unmap,
                            false, NULL,
+                           arg->has_copy_mode, arg->copy_mode,
                            &local_err);
     bdrv_unref(target_bs);
     error_propagate(errp, local_err);
@@ -3623,6 +3628,7 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
                          BlockdevOnError on_target_error,
                          bool has_filter_node_name,
                          const char *filter_node_name,
+                         bool has_copy_mode, MirrorCopyMode copy_mode,
                          Error **errp)
 {
     BlockDriverState *bs;
@@ -3655,6 +3661,7 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
                            has_on_target_error, on_target_error,
                            true, true,
                            has_filter_node_name, filter_node_name,
+                           has_copy_mode, copy_mode,
                            &local_err);
     error_propagate(errp, local_err);
 
-- 
2.13.5


* [Qemu-devel] [PATCH 17/18] qemu-io: Add background write
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (15 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 16/18] block/mirror: Add copy mode QAPI interface Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-09-18  6:46   ` Fam Zheng
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 18/18] iotests: Add test for active mirroring Max Reitz
  2017-09-14 15:42 ` [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Stefan Hajnoczi
  18 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

Add a new parameter -B to qemu-io's write command.  When used, qemu-io
will not wait for the result of the operation and instead execute it in
the background.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 qemu-io-cmds.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 77 insertions(+), 6 deletions(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 2811a89099..c635a248f5 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -481,6 +481,62 @@ static int do_pwrite(BlockBackend *blk, char *buf, int64_t offset,
 typedef struct {
     BlockBackend *blk;
     int64_t offset;
+    int bytes;
+    char *buf;
+    int flags;
+} CoBackgroundWrite;
+
+static void coroutine_fn co_background_pwrite_entry(void *opaque)
+{
+    CoBackgroundWrite *data = opaque;
+    QEMUIOVector qiov;
+    int ret;
+
+    qemu_iovec_init(&qiov, 1);
+    qemu_iovec_add(&qiov, data->buf, data->bytes);
+
+    ret = blk_co_pwritev(data->blk, data->offset, data->bytes, &qiov,
+                         data->flags);
+
+    qemu_iovec_destroy(&qiov);
+    g_free(data->buf);
+
+    if (ret < 0) {
+        Error *err = NULL;
+        error_setg_errno(&err, -ret, "Background write failed");
+        error_report_err(err);
+    }
+}
+
+/* Takes ownership of @buf */
+static int do_background_pwrite(BlockBackend *blk, char *buf, int64_t offset,
+                                int64_t bytes, int flags)
+{
+    Coroutine *co;
+    CoBackgroundWrite *data;
+
+    if (bytes > INT_MAX) {
+        return -ERANGE;
+    }
+
+    data = g_new(CoBackgroundWrite, 1);
+    *data = (CoBackgroundWrite){
+        .blk    = blk,
+        .offset = offset,
+        .bytes  = bytes,
+        .buf    = buf,
+        .flags  = flags,
+    };
+
+    co = qemu_coroutine_create(co_background_pwrite_entry, data);
+    bdrv_coroutine_enter(blk_bs(blk), co);
+
+    return bytes;
+}
+
+typedef struct {
+    BlockBackend *blk;
+    int64_t offset;
     int64_t bytes;
     int64_t *total;
     int flags;
@@ -931,6 +987,7 @@ static void write_help(void)
 " Writes into a segment of the currently open file, using a buffer\n"
 " filled with a set pattern (0xcdcdcdcd).\n"
 " -b, -- write to the VM state rather than the virtual disk\n"
+" -B, -- just start a background write, do not wait for the result\n"
 " -c, -- write compressed data with blk_write_compressed\n"
 " -f, -- use Force Unit Access semantics\n"
 " -p, -- ignored for backwards compatibility\n"
@@ -951,7 +1008,7 @@ static const cmdinfo_t write_cmd = {
     .perm       = BLK_PERM_WRITE,
     .argmin     = 2,
     .argmax     = -1,
-    .args       = "[-bcCfquz] [-P pattern] off len",
+    .args       = "[-bBcCfquz] [-P pattern] off len",
     .oneline    = "writes a number of bytes at a specified offset",
     .help       = write_help,
 };
@@ -961,6 +1018,7 @@ static int write_f(BlockBackend *blk, int argc, char **argv)
     struct timeval t1, t2;
     bool Cflag = false, qflag = false, bflag = false;
     bool Pflag = false, zflag = false, cflag = false;
+    bool background = false;
     int flags = 0;
     int c, cnt;
     char *buf = NULL;
@@ -970,11 +1028,14 @@ static int write_f(BlockBackend *blk, int argc, char **argv)
     int64_t total = 0;
     int pattern = 0xcd;
 
-    while ((c = getopt(argc, argv, "bcCfpP:quz")) != -1) {
+    while ((c = getopt(argc, argv, "bBcCfpP:quz")) != -1) {
         switch (c) {
         case 'b':
             bflag = true;
             break;
+        case 'B':
+            background = true;
+            break;
         case 'c':
             cflag = true;
             break;
@@ -1032,6 +1093,11 @@ static int write_f(BlockBackend *blk, int argc, char **argv)
         return 0;
     }
 
+    if (background && (bflag || cflag || zflag)) {
+        printf("-B cannot be specified together with -b, -c, or -z\n");
+        return 0;
+    }
+
     offset = cvtnum(argv[optind]);
     if (offset < 0) {
         print_cvtnum_err(offset, argv[optind]);
@@ -1074,6 +1140,8 @@ static int write_f(BlockBackend *blk, int argc, char **argv)
         cnt = do_co_pwrite_zeroes(blk, offset, count, flags, &total);
     } else if (cflag) {
         cnt = do_write_compressed(blk, buf, offset, count, &total);
+    } else if (background) {
+        cnt = do_background_pwrite(blk, buf, offset, count, flags);
     } else {
         cnt = do_pwrite(blk, buf, offset, count, flags, &total);
     }
@@ -1088,12 +1156,15 @@ static int write_f(BlockBackend *blk, int argc, char **argv)
         goto out;
     }
 
-    /* Finally, report back -- -C gives a parsable format */
-    t2 = tsub(t2, t1);
-    print_report("wrote", &t2, offset, count, total, cnt, Cflag);
+    if (!background) {
+        /* Finally, report back -- -C gives a parsable format */
+        t2 = tsub(t2, t1);
+        print_report("wrote", &t2, offset, count, total, cnt, Cflag);
+    }
 
 out:
-    if (!zflag) {
+    /* do_background_pwrite() takes ownership of the buffer */
+    if (!zflag && !background) {
         qemu_io_free(buf);
     }
 
-- 
2.13.5


* [Qemu-devel] [PATCH 18/18] iotests: Add test for active mirroring
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (16 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 17/18] qemu-io: Add background write Max Reitz
@ 2017-09-13 18:19 ` Max Reitz
  2017-09-18  6:45   ` Fam Zheng
  2017-09-14 15:42 ` [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Stefan Hajnoczi
  18 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-13 18:19 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Max Reitz, Fam Zheng, Kevin Wolf, Stefan Hajnoczi, John Snow

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 tests/qemu-iotests/151     | 111 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/151.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 117 insertions(+)
 create mode 100755 tests/qemu-iotests/151
 create mode 100644 tests/qemu-iotests/151.out

diff --git a/tests/qemu-iotests/151 b/tests/qemu-iotests/151
new file mode 100755
index 0000000000..49a60773f9
--- /dev/null
+++ b/tests/qemu-iotests/151
@@ -0,0 +1,111 @@
+#!/usr/bin/env python
+#
+# Tests for active mirroring
+#
+# Copyright (C) 2017 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import qemu_img
+
+source_img = os.path.join(iotests.test_dir, 'source.' + iotests.imgfmt)
+target_img = os.path.join(iotests.test_dir, 'target.' + iotests.imgfmt)
+
+class TestActiveMirror(iotests.QMPTestCase):
+    image_len = 128 * 1024 * 1024 # 128 MB
+    potential_writes_in_flight = True
+
+    def setUp(self):
+        qemu_img('create', '-f', iotests.imgfmt, source_img, '128M')
+        qemu_img('create', '-f', iotests.imgfmt, target_img, '128M')
+
+        blk_source = {'node-name': 'source',
+                      'driver': iotests.imgfmt,
+                      'file': {'driver': 'file',
+                               'filename': source_img}}
+
+        blk_target = {'node-name': 'target',
+                      'driver': iotests.imgfmt,
+                      'file': {'driver': 'file',
+                               'filename': target_img}}
+
+        self.vm = iotests.VM()
+        self.vm.add_blockdev(self.qmp_to_opts(blk_source))
+        self.vm.add_blockdev(self.qmp_to_opts(blk_target))
+        self.vm.launch()
+
+    def tearDown(self):
+        self.vm.shutdown()
+
+        if not self.potential_writes_in_flight:
+            self.assertTrue(iotests.compare_images(source_img, target_img),
+                            'mirror target does not match source')
+
+        os.remove(source_img)
+        os.remove(target_img)
+
+    def doActiveIO(self, sync_source_and_target):
+        # Fill the source image
+        self.vm.hmp_qemu_io('source',
+                            'write -P 1 0 %i' % self.image_len)
+
+        # Start some background requests
+        for offset in range(0, self.image_len, 1024 * 1024):
+            self.vm.hmp_qemu_io('source', 'write -B -P 2 %i 1M' % offset)
+
+        # Start the block job
+        result = self.vm.qmp('blockdev-mirror',
+                             job_id='mirror',
+                             filter_node_name='mirror-node',
+                             device='source',
+                             target='target',
+                             sync='full',
+                             copy_mode='active-write')
+        self.assert_qmp(result, 'return', {})
+
+        # Start some more requests
+        for offset in range(0, self.image_len, 1024 * 1024):
+            self.vm.hmp_qemu_io('mirror-node', 'write -B -P 3 %i 1M' % offset)
+
+        # Wait for the READY event
+        self.wait_ready(drive='mirror')
+
+        # Now start some final requests; all of these (which land on
+        # the source) should be settled using the active mechanism.
+        # The mirror code itself asserts that the source BDS's dirty
+        # bitmap will stay clean between READY and COMPLETED.
+        for offset in range(0, self.image_len, 1024 * 1024):
+            self.vm.hmp_qemu_io('mirror-node', 'write -B -P 4 %i 1M' % offset)
+
+        if sync_source_and_target:
+            # If source and target should be in sync after the mirror,
+            # we have to flush before completion
+            self.vm.hmp_qemu_io('mirror-node', 'flush')
+            self.potential_writes_in_flight = False
+
+        self.complete_and_wait(drive='mirror', wait_ready=False)
+
+    def testActiveIO(self):
+        self.doActiveIO(False)
+
+    def testActiveIOFlushed(self):
+        self.doActiveIO(True)
+
+
+
+if __name__ == '__main__':
+    iotests.main(supported_fmts=['qcow2', 'raw'])
diff --git a/tests/qemu-iotests/151.out b/tests/qemu-iotests/151.out
new file mode 100644
index 0000000000..fbc63e62f8
--- /dev/null
+++ b/tests/qemu-iotests/151.out
@@ -0,0 +1,5 @@
+..
+----------------------------------------------------------------------
+Ran 2 tests
+
+OK
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 94e764865a..c64adbe5bf 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -156,6 +156,7 @@
 148 rw auto quick
 149 rw auto sudo
 150 rw auto quick
+151 rw auto
 152 rw auto quick
 153 rw auto quick
 154 rw auto backing quick
-- 
2.13.5


* Re: [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring
  2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
                   ` (17 preceding siblings ...)
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 18/18] iotests: Add test for active mirroring Max Reitz
@ 2017-09-14 15:42 ` Stefan Hajnoczi
  2017-09-16 14:02   ` Max Reitz
  18 siblings, 1 reply; 64+ messages in thread
From: Stefan Hajnoczi @ 2017-09-14 15:42 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Fam Zheng, Kevin Wolf, John Snow

On Wed, Sep 13, 2017 at 08:18:52PM +0200, Max Reitz wrote:
> There may be a couple of things to do on top of this series:
> - Allow switching between active and passive mode at runtime: This
>   should not be too difficult to implement, the main question is how to
>   expose it to the user.
>   (I seem to recall we wanted some form of block-job-set-option
>   command...?)
> 
> - Implement an asynchronous active mode: May be detrimental when it
>   comes to convergence, but it might be nice to have anyway.  May or may
>   not be complicated to implement.

Ideally the user doesn't have to know about async vs sync.  It's an
implementation detail.

Async makes sense during the bulk copy phase (e.g. sync=full) because
guest read/write latencies are mostly unaffected.  Once the entire
device has been copied there are probably still dirty blocks left
because the guest touched them while the mirror job was running.  At
that point it definitely makes sense to switch to synchronous mirroring
in order to converge.


* Re: [Qemu-devel] [PATCH 15/18] block/mirror: Add active mirroring
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 15/18] block/mirror: Add active mirroring Max Reitz
@ 2017-09-14 15:57   ` Stefan Hajnoczi
  2017-09-16 13:58     ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Stefan Hajnoczi @ 2017-09-14 15:57 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Fam Zheng, Kevin Wolf, John Snow

On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
> This patch implements active synchronous mirroring.  In active mode, the
> passive mechanism will still be in place and is used to copy all
> initially dirty clusters off the source disk; but every write request
> will write data both to the source and the target disk, so the source
> cannot be dirtied faster than data is mirrored to the target.  Also,
> once the block job has converged (BLOCK_JOB_READY sent), source and
> target are guaranteed to stay in sync (unless an error occurs).
> 
> Optionally, dirty data can be copied to the target disk on read
> operations, too.
> 
> Active mode is completely optional and currently disabled at runtime.  A
> later patch will add a way for users to enable it.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  qapi/block-core.json |  23 +++++++
>  block/mirror.c       | 187 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 205 insertions(+), 5 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index bb11815608..e072cfa67c 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -938,6 +938,29 @@
>    'data': ['top', 'full', 'none', 'incremental'] }
>  
>  ##
> +# @MirrorCopyMode:
> +#
> +# An enumeration whose values tell the mirror block job when to
> +# trigger writes to the target.
> +#
> +# @passive: copy data in background only.
> +#
> +# @active-write: when data is written to the source, write it
> +#                (synchronously) to the target as well.  In addition,
> +#                data is copied in background just like in @passive
> +#                mode.
> +#
> +# @active-read-write: write data to the target (synchronously) both
> +#                     when it is read from and written to the source.
> +#                     In addition, data is copied in background just
> +#                     like in @passive mode.

I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
means committing the top-most BDS while the guest is accessing it.  The
"passive" mirror block still works on the top-most BDS while the guest
is accessing it.

Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
the terminology used in disk replication (e.g. DRBD).

Ideally the user wouldn't have to worry about async vs sync because QEMU
would switch modes as appropriate in order to converge.  That way
libvirt also doesn't have to worry about this.

>  static int coroutine_fn bdrv_mirror_top_preadv(BlockDriverState *bs,
>      uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
>  {
> -    return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
> +    MirrorOp *op = NULL;
> +    MirrorBDSOpaque *s = bs->opaque;
> +    int ret = 0;
> +    bool copy_to_target;
> +
> +    copy_to_target = s->job->ret >= 0 &&
> +                     s->job->copy_mode == MIRROR_COPY_MODE_ACTIVE_READ_WRITE;
> +
> +    if (copy_to_target) {
> +        op = active_write_prepare(s->job, offset, bytes);
> +    }
> +
> +    ret = bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    if (copy_to_target) {
> +        do_sync_target_write(s->job, offset, bytes, qiov, 0);
> +    }

This mode is dangerous.  See bdrv_co_do_copy_on_readv():

  /* Perform I/O through a temporary buffer so that users who scribble over
   * their read buffer while the operation is in progress do not end up
   * modifying the image file.  This is critical for zero-copy guest I/O
   * where anything might happen inside guest memory.
   */
  void *bounce_buffer;

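To sketch the hazard outside QEMU (a hypothetical, self-contained
illustration; bounce_snapshot is a made-up helper, not a QEMU API):
snapshotting the guest data into a private buffer before the synchronous
target write means later scribbling by the guest cannot reach the
target image.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not QEMU code): copy the guest's data into a
 * private bounce buffer before issuing the synchronous target write.
 * If the guest modifies its buffer while the write is in flight, the
 * target still sees the snapshotted contents. */
static void *bounce_snapshot(const void *guest_buf, size_t len)
{
    void *bounce = malloc(len);
    if (bounce) {
        memcpy(bounce, guest_buf, len);
    }
    return bounce;
}
```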
Stefan


* Re: [Qemu-devel] [PATCH 15/18] block/mirror: Add active mirroring
  2017-09-14 15:57   ` Stefan Hajnoczi
@ 2017-09-16 13:58     ` Max Reitz
  2017-09-18 10:06       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  0 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-16 13:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, Fam Zheng, Kevin Wolf, John Snow

On 2017-09-14 17:57, Stefan Hajnoczi wrote:
> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
>> This patch implements active synchronous mirroring.  In active mode, the
>> passive mechanism will still be in place and is used to copy all
>> initially dirty clusters off the source disk; but every write request
>> will write data both to the source and the target disk, so the source
>> cannot be dirtied faster than data is mirrored to the target.  Also,
>> once the block job has converged (BLOCK_JOB_READY sent), source and
>> target are guaranteed to stay in sync (unless an error occurs).
>>
>> Optionally, dirty data can be copied to the target disk on read
>> operations, too.
>>
>> Active mode is completely optional and currently disabled at runtime.  A
>> later patch will add a way for users to enable it.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  qapi/block-core.json |  23 +++++++
>>  block/mirror.c       | 187 +++++++++++++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 205 insertions(+), 5 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index bb11815608..e072cfa67c 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -938,6 +938,29 @@
>>    'data': ['top', 'full', 'none', 'incremental'] }
>>  
>>  ##
>> +# @MirrorCopyMode:
>> +#
>> +# An enumeration whose values tell the mirror block job when to
>> +# trigger writes to the target.
>> +#
>> +# @passive: copy data in background only.
>> +#
>> +# @active-write: when data is written to the source, write it
>> +#                (synchronously) to the target as well.  In addition,
>> +#                data is copied in background just like in @passive
>> +#                mode.
>> +#
>> +# @active-read-write: write data to the target (synchronously) both
>> +#                     when it is read from and written to the source.
>> +#                     In addition, data is copied in background just
>> +#                     like in @passive mode.
> 
> I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
> means committing the top-most BDS while the guest is accessing it.  The
> "passive" mirror block still works on the top-most BDS while the guest
> is accessing it.
> 
> Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
> the terminology used in disk replication (e.g. DRBD).

I'd be OK with that, too, but I think I remember that in the past at
least Kevin made a clear distinction between active/passive and
sync/async when it comes to mirroring.

> Ideally the user wouldn't have to worry about async vs sync because QEMU
> would switch modes as appropriate in order to converge.  That way
> libvirt also doesn't have to worry about this.

So here you mean async/sync in the way I meant it, i.e., whether the
mirror operations themselves are async/sync?

Maybe we could call the passive mode "background" and the active
mode "synchronous" (or maybe even "foreground")?  Because sync then
means three things:
(1) The block job sync mode
(2) Synchronous write operations to the target
(3) Active mirroring mode

(1) is relatively easy to distinguish because it's just "the job sync
mode".  If we call the passive mode "background" we can distinguish (2)
and (3) at least based on their antonyms: "sync (as in sync/async)" vs.
"sync (as in sync/background)".

>>  static int coroutine_fn bdrv_mirror_top_preadv(BlockDriverState *bs,
>>      uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
>>  {
>> -    return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
>> +    MirrorOp *op = NULL;
>> +    MirrorBDSOpaque *s = bs->opaque;
>> +    int ret = 0;
>> +    bool copy_to_target;
>> +
>> +    copy_to_target = s->job->ret >= 0 &&
>> +                     s->job->copy_mode == MIRROR_COPY_MODE_ACTIVE_READ_WRITE;
>> +
>> +    if (copy_to_target) {
>> +        op = active_write_prepare(s->job, offset, bytes);
>> +    }
>> +
>> +    ret = bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    if (copy_to_target) {
>> +        do_sync_target_write(s->job, offset, bytes, qiov, 0);
>> +    }
> 
> This mode is dangerous.  See bdrv_co_do_copy_on_readv():
> 
>   /* Perform I/O through a temporary buffer so that users who scribble over
>    * their read buffer while the operation is in progress do not end up
>    * modifying the image file.  This is critical for zero-copy guest I/O
>    * where anything might happen inside guest memory.
>    */
>   void *bounce_buffer;

Ooh, yes, right.  I'll just drop it for now, then.

Max




* Re: [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring
  2017-09-14 15:42 ` [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Stefan Hajnoczi
@ 2017-09-16 14:02   ` Max Reitz
  2017-09-18 10:02     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  0 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-16 14:02 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, Fam Zheng, Kevin Wolf, John Snow

On 2017-09-14 17:42, Stefan Hajnoczi wrote:
> On Wed, Sep 13, 2017 at 08:18:52PM +0200, Max Reitz wrote:
>> There may be a couple of things to do on top of this series:
>> - Allow switching between active and passive mode at runtime: This
>>   should not be too difficult to implement, the main question is how to
>>   expose it to the user.
>>   (I seem to recall we wanted some form of block-job-set-option
>>   command...?)
>>
>> - Implement an asynchronous active mode: May be detrimental when it
>>   comes to convergence, but it might be nice to have anyway.  May or may
>>   not be complicated to implement.
> 
> Ideally the user doesn't have to know about async vs sync.  It's an
> implementation detail.
> 
> Async makes sense during the bulk copy phase (e.g. sync=full) because
> guest read/write latencies are mostly unaffected.  Once the entire
> device has been copied there are probably still dirty blocks left
> because the guest touched them while the mirror job was running.  At
> that point it definitely makes sense to switch to synchronous mirroring
> in order to converge.

Makes sense, but I'm not sure whether it really is just an
implementation detail.  If you're in the bulk copy phase in active/async
mode and you have enough write requests with the target being slow
enough, I suspect you might still not get convergence then (because the
writes to the target yield for a long time while ever more write
requests pile up) -- so then you'd just shift the dirty tracking from
the bitmap to a list of requests in progress.

And I think we do want the bulk copy phase to guarantee convergence,
too, usually (when active/foreground/synchronous mode is selected).  If
we don't, then that's a policy decision and would be up to libvirt, as I
see it.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse Max Reitz
@ 2017-09-18  3:44   ` Fam Zheng
  2017-09-18 16:13     ` Max Reitz
  2017-10-10  8:36   ` Kevin Wolf
  1 sibling, 1 reply; 64+ messages in thread
From: Fam Zheng @ 2017-09-18  3:44 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

On Wed, 09/13 20:18, Max Reitz wrote:
> Draining a BDS child may lead to the original BDS and/or its other
> children being deleted (e.g. if the original BDS represents a block
> job).  We should prepare for this in both bdrv_drain_recurse() and
> bdrv_drained_begin() by monitoring whether the BDS we are about to drain
> still exists at all.

Can the deletion happen when an IOThread calls
bdrv_drain_recurse()/bdrv_drained_begin()?  If not, is it enough to do

    ...
    if (in_main_loop) {
        bdrv_ref(bs);
    }
    ...
    if (in_main_loop) {
        bdrv_unref(bs);
    }

to protect the main loop case, so that the BdrvDeletedStatus state is not needed?
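The ref/unref pattern suggested here is the usual pin-across-callback idiom: take a reference before recursing so that even if the recursion (e.g. via a deferred block-job BH) drops the graph's last reference, the node stays valid until we are done with it. A self-contained toy of that pattern, with the invented Node/node_ref() names standing in for BlockDriverState/bdrv_ref():

```c
#include <assert.h>

typedef struct Node {
    int refcnt;
    int freed;      /* for the demo only: marks where g_free() would run */
} Node;

static void node_ref(Node *n)
{
    n->refcnt++;
}

static void node_unref(Node *n)
{
    if (--n->refcnt == 0) {
        n->freed = 1;           /* stands in for actually freeing n */
    }
}

/* A "drain" step that, like a deferred block-job BH modifying the
 * graph, may drop the caller's only reference to n. */
static void drain_step(Node *n)
{
    node_unref(n);
}

/* The pattern from the review: pin the node across the recursion so it
 * cannot be freed under our feet, then drop our pin afterwards.
 * Returns 1 iff the node was still alive when the recursion finished. */
static int drain_with_pin(Node *n)
{
    int still_alive;

    node_ref(n);                /* bdrv_ref() in the real code */
    drain_step(n);              /* may release the graph's reference */
    still_alive = !n->freed;
    node_unref(n);              /* bdrv_unref() in the real code */
    return still_alive;
}
```

Without the surrounding node_ref()/node_unref() pair, drain_step() would free the node while drain_with_pin() still holds a pointer to it.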

Fam

> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/io.c | 72 +++++++++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 52 insertions(+), 20 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 4378ae4c7d..8ec1a564ad 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -182,33 +182,57 @@ static void bdrv_drain_invoke(BlockDriverState *bs)
>  
>  static bool bdrv_drain_recurse(BlockDriverState *bs)
>  {
> -    BdrvChild *child, *tmp;
> +    BdrvChild *child;
>      bool waited;
> +    struct BDSToDrain {
> +        BlockDriverState *bs;
> +        BdrvDeletedStatus del_stat;
> +        QLIST_ENTRY(BDSToDrain) next;
> +    };
> +    QLIST_HEAD(, BDSToDrain) bs_list = QLIST_HEAD_INITIALIZER(bs_list);
> +    bool in_main_loop =
> +        qemu_get_current_aio_context() == qemu_get_aio_context();
>  
>      waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
>  
>      /* Ensure any pending metadata writes are submitted to bs->file.  */
>      bdrv_drain_invoke(bs);
>  
> -    QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
> -        BlockDriverState *bs = child->bs;
> -        bool in_main_loop =
> -            qemu_get_current_aio_context() == qemu_get_aio_context();
> -        assert(bs->refcnt > 0);
> -        if (in_main_loop) {
> -            /* In case the recursive bdrv_drain_recurse processes a
> -             * block_job_defer_to_main_loop BH and modifies the graph,
> -             * let's hold a reference to bs until we are done.
> -             *
> -             * IOThread doesn't have such a BH, and it is not safe to call
> -             * bdrv_unref without BQL, so skip doing it there.
> -             */
> -            bdrv_ref(bs);
> -        }
> -        waited |= bdrv_drain_recurse(bs);
> -        if (in_main_loop) {
> -            bdrv_unref(bs);
> +    /* Draining children may result in other children being removed and maybe
> +     * even deleted, so copy the children list first */
> +    QLIST_FOREACH(child, &bs->children, next) {
> +        struct BDSToDrain *bs2d = g_new0(struct BDSToDrain, 1);
> +
> +        bs2d->bs = child->bs;
> +        QLIST_INSERT_HEAD(&bs->deleted_status, &bs2d->del_stat, next);
> +
> +        QLIST_INSERT_HEAD(&bs_list, bs2d, next);
> +    }
> +
> +    while (!QLIST_EMPTY(&bs_list)) {
> +        struct BDSToDrain *bs2d = QLIST_FIRST(&bs_list);
> +        QLIST_REMOVE(bs2d, next);
> +
> +        if (!bs2d->del_stat.deleted) {
> +            QLIST_REMOVE(&bs2d->del_stat, next);
> +
> +            if (in_main_loop) {
> +                /* In case the recursive bdrv_drain_recurse processes a
> +                 * block_job_defer_to_main_loop BH and modifies the graph,
> +                 * let's hold a reference to the BDS until we are done.
> +                 *
> +                 * IOThread doesn't have such a BH, and it is not safe to call
> +                 * bdrv_unref without BQL, so skip doing it there.
> +                 */
> +                bdrv_ref(bs2d->bs);
> +            }
> +            waited |= bdrv_drain_recurse(bs2d->bs);
> +            if (in_main_loop) {
> +                bdrv_unref(bs2d->bs);
> +            }
>          }
> +
> +        g_free(bs2d);
>      }
>  
>      return waited;
> @@ -252,17 +276,25 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
>  
>  void bdrv_drained_begin(BlockDriverState *bs)
>  {
> +    BdrvDeletedStatus del_stat = { .deleted = false };
> +
>      if (qemu_in_coroutine()) {
>          bdrv_co_yield_to_drain(bs);
>          return;
>      }
>  
> +    QLIST_INSERT_HEAD(&bs->deleted_status, &del_stat, next);
> +
>      if (atomic_fetch_inc(&bs->quiesce_counter) == 0) {
>          aio_disable_external(bdrv_get_aio_context(bs));
>          bdrv_parent_drained_begin(bs);
>      }
>  
> -    bdrv_drain_recurse(bs);
> +    if (!del_stat.deleted) {
> +        QLIST_REMOVE(&del_stat, next);
> +
> +        bdrv_drain_recurse(bs);
> +    }
>  }
>  
>  void bdrv_drained_end(BlockDriverState *bs)
> -- 
> 2.13.5
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 03/18] blockjob: Make drained_{begin, end} public
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 03/18] blockjob: Make drained_{begin, end} public Max Reitz
@ 2017-09-18  3:46   ` Fam Zheng
  0 siblings, 0 replies; 64+ messages in thread
From: Fam Zheng @ 2017-09-18  3:46 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, Kevin Wolf, qemu-devel, Stefan Hajnoczi, John Snow

On Wed, 09/13 20:18, Max Reitz wrote:
> When a block job decides to be represented as a BDS and track its
> associated child nodes itself instead of having the BlockJob object
> track them, it needs to implement the drained_begin/drained_end child
> operations.  In order to do that, it has to be able to control drainage
> of the block job (i.e. to pause and resume it).  Therefore, we need to
> make these operations public.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  include/block/blockjob.h | 15 +++++++++++++++
>  blockjob.c               | 20 ++++++++++++++------
>  2 files changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
> index 67c0968fa5..a59f316788 100644
> --- a/include/block/blockjob.h
> +++ b/include/block/blockjob.h
> @@ -339,6 +339,21 @@ void block_job_ref(BlockJob *job);
>  void block_job_unref(BlockJob *job);
>  
>  /**
> + * block_job_drained_begin:
> + *
> + * Inhibit I/O requests initiated by the block job.
> + */
> +void block_job_drained_begin(BlockJob *job);
> +
> +/**
> + * block_job_drained_end:
> + *
> + * Resume I/O after it has been paused through
> + * block_job_drained_begin().
> + */
> +void block_job_drained_end(BlockJob *job);
> +
> +/**
>   * block_job_txn_unref:
>   *
>   * Release a reference that was previously acquired with block_job_txn_add_job
> diff --git a/blockjob.c b/blockjob.c
> index 3a0c49137e..4312a121fa 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -217,21 +217,29 @@ static const BdrvChildRole child_job = {
>      .stay_at_node       = true,
>  };
>  
> -static void block_job_drained_begin(void *opaque)
> +void block_job_drained_begin(BlockJob *job)
>  {
> -    BlockJob *job = opaque;
>      block_job_pause(job);
>  }
>  
> -static void block_job_drained_end(void *opaque)
> +static void block_job_drained_begin_op(void *opaque)
> +{
> +    block_job_drained_begin(opaque);
> +}
> +
> +void block_job_drained_end(BlockJob *job)
>  {
> -    BlockJob *job = opaque;
>      block_job_resume(job);
>  }
>  
> +static void block_job_drained_end_op(void *opaque)
> +{
> +    block_job_drained_end(opaque);
> +}
> +
>  static const BlockDevOps block_job_dev_ops = {
> -    .drained_begin = block_job_drained_begin,
> -    .drained_end = block_job_drained_end,
> +    .drained_begin = block_job_drained_begin_op,
> +    .drained_end = block_job_drained_end_op,
>  };
>  
>  void block_job_remove_all_bdrv(BlockJob *job)
> -- 
> 2.13.5
> 
> 

Reviewed-by: Fam Zheng <famz@redhat.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 04/18] block/mirror: Pull out mirror_perform()
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 04/18] block/mirror: Pull out mirror_perform() Max Reitz
@ 2017-09-18  3:48   ` Fam Zheng
  2017-09-25  9:38   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 64+ messages in thread
From: Fam Zheng @ 2017-09-18  3:48 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

On Wed, 09/13 20:18, Max Reitz wrote:
> When converting mirror's I/O to coroutines, we are going to need a point
> where these coroutines are created.  mirror_perform() is going to be
> that point.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/mirror.c | 53 ++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 30 insertions(+), 23 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 6531652d73..4664b0516f 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -82,6 +82,12 @@ typedef struct MirrorOp {
>      uint64_t bytes;
>  } MirrorOp;
>  
> +typedef enum MirrorMethod {
> +    MIRROR_METHOD_COPY,
> +    MIRROR_METHOD_ZERO,
> +    MIRROR_METHOD_DISCARD,
> +} MirrorMethod;
> +
>  static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
>                                              int error)
>  {
> @@ -324,6 +330,22 @@ static void mirror_do_zero_or_discard(MirrorBlockJob *s,
>      }
>  }
>  
> +static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
> +                               unsigned bytes, MirrorMethod mirror_method)
> +{
> +    switch (mirror_method) {
> +    case MIRROR_METHOD_COPY:
> +        return mirror_do_read(s, offset, bytes);
> +    case MIRROR_METHOD_ZERO:
> +    case MIRROR_METHOD_DISCARD:
> +        mirror_do_zero_or_discard(s, offset, bytes,
> +                                  mirror_method == MIRROR_METHOD_DISCARD);
> +        return bytes;
> +    default:
> +        abort();
> +    }
> +}
> +
>  static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>  {
>      BlockDriverState *source = s->source;
> @@ -395,11 +417,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>          unsigned int io_bytes;
>          int64_t io_bytes_acct;
>          BlockDriverState *file;
> -        enum MirrorMethod {
> -            MIRROR_METHOD_COPY,
> -            MIRROR_METHOD_ZERO,
> -            MIRROR_METHOD_DISCARD
> -        } mirror_method = MIRROR_METHOD_COPY;
> +        MirrorMethod mirror_method = MIRROR_METHOD_COPY;
>  
>          assert(!(offset % s->granularity));
>          ret = bdrv_get_block_status_above(source, NULL,
> @@ -439,22 +457,11 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>          }
>  
>          io_bytes = mirror_clip_bytes(s, offset, io_bytes);
> -        switch (mirror_method) {
> -        case MIRROR_METHOD_COPY:
> -            io_bytes = io_bytes_acct = mirror_do_read(s, offset, io_bytes);
> -            break;
> -        case MIRROR_METHOD_ZERO:
> -        case MIRROR_METHOD_DISCARD:
> -            mirror_do_zero_or_discard(s, offset, io_bytes,
> -                                      mirror_method == MIRROR_METHOD_DISCARD);
> -            if (write_zeroes_ok) {
> -                io_bytes_acct = 0;
> -            } else {
> -                io_bytes_acct = io_bytes;
> -            }
> -            break;
> -        default:
> -            abort();
> +        io_bytes = mirror_perform(s, offset, io_bytes, mirror_method);
> +        if (mirror_method != MIRROR_METHOD_COPY && write_zeroes_ok) {
> +            io_bytes_acct = 0;
> +        } else {
> +            io_bytes_acct = io_bytes;
>          }
>          assert(io_bytes);
>          offset += io_bytes;
> @@ -650,8 +657,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
>                  continue;
>              }
>  
> -            mirror_do_zero_or_discard(s, sector_num * BDRV_SECTOR_SIZE,
> -                                      nb_sectors * BDRV_SECTOR_SIZE, false);
> +            mirror_perform(s, sector_num * BDRV_SECTOR_SIZE,
> +                           nb_sectors * BDRV_SECTOR_SIZE, MIRROR_METHOD_ZERO);
>              sector_num += nb_sectors;
>          }
>  
> -- 
> 2.13.5
> 

Reviewed-by: Fam Zheng <famz@redhat.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 05/18] block/mirror: Convert to coroutines
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 05/18] block/mirror: Convert to coroutines Max Reitz
@ 2017-09-18  6:02   ` Fam Zheng
  2017-09-18 16:41     ` Max Reitz
  2017-10-10  9:14   ` Kevin Wolf
  1 sibling, 1 reply; 64+ messages in thread
From: Fam Zheng @ 2017-09-18  6:02 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

On Wed, 09/13 20:18, Max Reitz wrote:
> In order to talk to the source BDS (and maybe in the future to the
> target BDS as well) directly, we need to convert our existing AIO
> requests into coroutine I/O requests.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/mirror.c | 134 +++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 78 insertions(+), 56 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 4664b0516f..2b3297aa61 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -80,6 +80,9 @@ typedef struct MirrorOp {
>      QEMUIOVector qiov;
>      int64_t offset;
>      uint64_t bytes;
> +
> +    /* Set by mirror_co_read() before yielding for the first time */
> +    uint64_t bytes_copied;
>  } MirrorOp;
>  
>  typedef enum MirrorMethod {
> @@ -101,7 +104,7 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
>      }
>  }
>  
> -static void mirror_iteration_done(MirrorOp *op, int ret)
> +static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
>  {
>      MirrorBlockJob *s = op->s;
>      struct iovec *iov;
> @@ -138,9 +141,8 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>      }
>  }
>  
> -static void mirror_write_complete(void *opaque, int ret)
> +static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
>  {
> -    MirrorOp *op = opaque;
>      MirrorBlockJob *s = op->s;
>  
>      aio_context_acquire(blk_get_aio_context(s->common.blk));
> @@ -158,9 +160,8 @@ static void mirror_write_complete(void *opaque, int ret)
>      aio_context_release(blk_get_aio_context(s->common.blk));
>  }
>  
> -static void mirror_read_complete(void *opaque, int ret)
> +static void coroutine_fn mirror_read_complete(MirrorOp *op, int ret)
>  {
> -    MirrorOp *op = opaque;
>      MirrorBlockJob *s = op->s;
>  
>      aio_context_acquire(blk_get_aio_context(s->common.blk));
> @@ -176,8 +177,11 @@ static void mirror_read_complete(void *opaque, int ret)
>  
>          mirror_iteration_done(op, ret);
>      } else {
> -        blk_aio_pwritev(s->target, op->offset, &op->qiov,
> -                        0, mirror_write_complete, op);
> +        int ret;
> +
> +        ret = blk_co_pwritev(s->target, op->offset,
> +                             op->qiov.size, &op->qiov, 0);
> +        mirror_write_complete(op, ret);
>      }
>      aio_context_release(blk_get_aio_context(s->common.blk));
>  }
> @@ -242,53 +246,49 @@ static inline void mirror_wait_for_io(MirrorBlockJob *s)
>   *          (new_end - offset) if tail is rounded up or down due to
>   *          alignment or buffer limit.
>   */
> -static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
> -                               uint64_t bytes)
> +static void coroutine_fn mirror_co_read(void *opaque)
>  {
> +    MirrorOp *op = opaque;
> +    MirrorBlockJob *s = op->s;
>      BlockBackend *source = s->common.blk;
>      int nb_chunks;
>      uint64_t ret;
> -    MirrorOp *op;
>      uint64_t max_bytes;
>  
>      max_bytes = s->granularity * s->max_iov;
>  
>      /* We can only handle as much as buf_size at a time. */
> -    bytes = MIN(s->buf_size, MIN(max_bytes, bytes));
> -    assert(bytes);
> -    assert(bytes < BDRV_REQUEST_MAX_BYTES);
> -    ret = bytes;
> +    op->bytes = MIN(s->buf_size, MIN(max_bytes, op->bytes));
> +    assert(op->bytes);
> +    assert(op->bytes < BDRV_REQUEST_MAX_BYTES);
> +    op->bytes_copied = op->bytes;
>  
>      if (s->cow_bitmap) {
> -        ret += mirror_cow_align(s, &offset, &bytes);
> +        op->bytes_copied += mirror_cow_align(s, &op->offset, &op->bytes);
>      }
> -    assert(bytes <= s->buf_size);
> +    /* Cannot exceed BDRV_REQUEST_MAX_BYTES + INT_MAX */
> +    assert(op->bytes_copied <= UINT_MAX);
> +    assert(op->bytes <= s->buf_size);
>      /* The offset is granularity-aligned because:
>       * 1) Caller passes in aligned values;
>       * 2) mirror_cow_align is used only when target cluster is larger. */
> -    assert(QEMU_IS_ALIGNED(offset, s->granularity));
> +    assert(QEMU_IS_ALIGNED(op->offset, s->granularity));
>      /* The range is sector-aligned, since bdrv_getlength() rounds up. */
> -    assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE));
> -    nb_chunks = DIV_ROUND_UP(bytes, s->granularity);
> +    assert(QEMU_IS_ALIGNED(op->bytes, BDRV_SECTOR_SIZE));
> +    nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
>  
>      while (s->buf_free_count < nb_chunks) {
> -        trace_mirror_yield_in_flight(s, offset, s->in_flight);
> +        trace_mirror_yield_in_flight(s, op->offset, s->in_flight);
>          mirror_wait_for_io(s);
>      }
>  
> -    /* Allocate a MirrorOp that is used as an AIO callback.  */
> -    op = g_new(MirrorOp, 1);
> -    op->s = s;
> -    op->offset = offset;
> -    op->bytes = bytes;
> -
>      /* Now make a QEMUIOVector taking enough granularity-sized chunks
>       * from s->buf_free.
>       */
>      qemu_iovec_init(&op->qiov, nb_chunks);
>      while (nb_chunks-- > 0) {
>          MirrorBuffer *buf = QSIMPLEQ_FIRST(&s->buf_free);
> -        size_t remaining = bytes - op->qiov.size;
> +        size_t remaining = op->bytes - op->qiov.size;
>  
>          QSIMPLEQ_REMOVE_HEAD(&s->buf_free, next);
>          s->buf_free_count--;
> @@ -297,53 +297,75 @@ static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
>  
>      /* Copy the dirty cluster.  */
>      s->in_flight++;
> -    s->bytes_in_flight += bytes;
> -    trace_mirror_one_iteration(s, offset, bytes);
> +    s->bytes_in_flight += op->bytes;
> +    trace_mirror_one_iteration(s, op->offset, op->bytes);
>  
> -    blk_aio_preadv(source, offset, &op->qiov, 0, mirror_read_complete, op);
> -    return ret;
> +    ret = blk_co_preadv(source, op->offset, op->bytes, &op->qiov, 0);
> +    mirror_read_complete(op, ret);
>  }
>  
> -static void mirror_do_zero_or_discard(MirrorBlockJob *s,
> -                                      int64_t offset,
> -                                      uint64_t bytes,
> -                                      bool is_discard)
> +static void coroutine_fn mirror_co_zero(void *opaque)
>  {
> -    MirrorOp *op;
> +    MirrorOp *op = opaque;
> +    int ret;
>  
> -    /* Allocate a MirrorOp that is used as an AIO callback. The qiov is zeroed
> -     * so the freeing in mirror_iteration_done is nop. */
> -    op = g_new0(MirrorOp, 1);
> -    op->s = s;
> -    op->offset = offset;
> -    op->bytes = bytes;
> +    op->s->in_flight++;
> +    op->s->bytes_in_flight += op->bytes;
>  
> -    s->in_flight++;
> -    s->bytes_in_flight += bytes;
> -    if (is_discard) {
> -        blk_aio_pdiscard(s->target, offset,
> -                         op->bytes, mirror_write_complete, op);
> -    } else {
> -        blk_aio_pwrite_zeroes(s->target, offset,
> -                              op->bytes, s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
> -                              mirror_write_complete, op);
> -    }
> +    ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
> +                               op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
> +    mirror_write_complete(op, ret);
> +}
> +
> +static void coroutine_fn mirror_co_discard(void *opaque)
> +{
> +    MirrorOp *op = opaque;
> +    int ret;
> +
> +    op->s->in_flight++;
> +    op->s->bytes_in_flight += op->bytes;
> +
> +    ret = blk_co_pdiscard(op->s->target, op->offset, op->bytes);
> +    mirror_write_complete(op, ret);
>  }
>  
>  static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
>                                 unsigned bytes, MirrorMethod mirror_method)
>  {
> +    MirrorOp *op;
> +    Coroutine *co;
> +    unsigned ret = bytes;
> +
> +    op = g_new(MirrorOp, 1);
> +    *op = (MirrorOp){
> +        .s      = s,
> +        .offset = offset,
> +        .bytes  = bytes,
> +    };
> +
>      switch (mirror_method) {
>      case MIRROR_METHOD_COPY:
> -        return mirror_do_read(s, offset, bytes);
> +        co = qemu_coroutine_create(mirror_co_read, op);
> +        break;
>      case MIRROR_METHOD_ZERO:
> +        co = qemu_coroutine_create(mirror_co_zero, op);
> +        break;
>      case MIRROR_METHOD_DISCARD:
> -        mirror_do_zero_or_discard(s, offset, bytes,
> -                                  mirror_method == MIRROR_METHOD_DISCARD);
> -        return bytes;
> +        co = qemu_coroutine_create(mirror_co_discard, op);
> +        break;
>      default:
>          abort();
>      }
> +
> +    qemu_coroutine_enter(co);
> +
> +    if (mirror_method == MIRROR_METHOD_COPY) {
> +        /* Same assertion as in mirror_co_read() */
> +        assert(op->bytes_copied <= UINT_MAX);
> +        ret = op->bytes_copied;
> +    }

This special casing is a bit ugly.  Can you just make mirror_co_zero and
mirror_co_discard set op->bytes_copied too (and perhaps rename it to
op->bytes_handled)?  If so, the comment in MirrorOp needs an update too.

And would it be better to initialize it to -1 before entering the
coroutine, then assert that it is != -1 afterwards?

> +
> +    return ret;
>  }
>  
>  static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
> -- 
> 2.13.5
> 

Fam

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 18/18] iotests: Add test for active mirroring
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 18/18] iotests: Add test for active mirroring Max Reitz
@ 2017-09-18  6:45   ` Fam Zheng
  2017-09-18 16:53     ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Fam Zheng @ 2017-09-18  6:45 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

On Wed, 09/13 20:19, Max Reitz wrote:
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  tests/qemu-iotests/151     | 111 +++++++++++++++++++++++++++++++++++++++++++++
>  tests/qemu-iotests/151.out |   5 ++
>  tests/qemu-iotests/group   |   1 +
>  3 files changed, 117 insertions(+)
>  create mode 100755 tests/qemu-iotests/151
>  create mode 100644 tests/qemu-iotests/151.out
> 
> diff --git a/tests/qemu-iotests/151 b/tests/qemu-iotests/151
> new file mode 100755
> index 0000000000..49a60773f9
> --- /dev/null
> +++ b/tests/qemu-iotests/151
> @@ -0,0 +1,111 @@
> +#!/usr/bin/env python
> +#
> +# Tests for active mirroring
> +#
> +# Copyright (C) 2017 Red Hat, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +#
> +
> +import os
> +import iotests
> +from iotests import qemu_img
> +
> +source_img = os.path.join(iotests.test_dir, 'source.' + iotests.imgfmt)
> +target_img = os.path.join(iotests.test_dir, 'target.' + iotests.imgfmt)
> +
> +class TestActiveMirror(iotests.QMPTestCase):
> +    image_len = 128 * 1024 * 1024 # MB
> +    potential_writes_in_flight = True
> +
> +    def setUp(self):
> +        qemu_img('create', '-f', iotests.imgfmt, source_img, '128M')
> +        qemu_img('create', '-f', iotests.imgfmt, target_img, '128M')
> +
> +        blk_source = {'node-name': 'source',
> +                      'driver': iotests.imgfmt,
> +                      'file': {'driver': 'file',
> +                               'filename': source_img}}
> +
> +        blk_target = {'node-name': 'target',
> +                      'driver': iotests.imgfmt,
> +                      'file': {'driver': 'file',
> +                               'filename': target_img}}
> +
> +        self.vm = iotests.VM()
> +        self.vm.add_blockdev(self.qmp_to_opts(blk_source))
> +        self.vm.add_blockdev(self.qmp_to_opts(blk_target))
> +        self.vm.launch()
> +
> +    def tearDown(self):
> +        self.vm.shutdown()
> +
> +        if not self.potential_writes_in_flight:
> +            self.assertTrue(iotests.compare_images(source_img, target_img),
> +                            'mirror target does not match source')
> +
> +        os.remove(source_img)
> +        os.remove(target_img)
> +
> +    def doActiveIO(self, sync_source_and_target):
> +        # Fill the source image
> +        self.vm.hmp_qemu_io('source',
> +                            'write -P 1 0 %i' % self.image_len);
> +
> +        # Start some background requests
> +        for offset in range(0, self.image_len, 1024 * 1024):
> +            self.vm.hmp_qemu_io('source', 'write -B -P 2 %i 1M' % offset)
> +
> +        # Start the block job
> +        result = self.vm.qmp('blockdev-mirror',
> +                             job_id='mirror',
> +                             filter_node_name='mirror-node',
> +                             device='source',
> +                             target='target',
> +                             sync='full',
> +                             copy_mode='active-write')
> +        self.assert_qmp(result, 'return', {})
> +
> +        # Start some more requests
> +        for offset in range(0, self.image_len, 1024 * 1024):
> +            self.vm.hmp_qemu_io('mirror-node', 'write -B -P 3 %i 1M' % offset)
> +
> +        # Wait for the READY event
> +        self.wait_ready(drive='mirror')
> +
> +        # Now start some final requests; all of these (which land on
> +        # the source) should be settled using the active mechanism.
> +        # The mirror code itself asserts that the source BDS's dirty
> +        # bitmap will stay clean between READY and COMPLETED.
> +        for offset in range(0, self.image_len, 1024 * 1024):
> +            self.vm.hmp_qemu_io('mirror-node', 'write -B -P 4 %i 1M' % offset)
> +
> +        if sync_source_and_target:
> +            # If source and target should be in sync after the mirror,
> +            # we have to flush before completion

Not sure I understand this requirement; does it apply to libvirt and the
user too?  I.e., is it part of the interface?  Why can't mirror_complete
do it automatically?

Fam

> +            self.vm.hmp_qemu_io('mirror-node', 'flush')
> +            self.potential_writes_in_flight = False
> +
> +        self.complete_and_wait(drive='mirror', wait_ready=False)
> +
> +    def testActiveIO(self):
> +        self.doActiveIO(False)
> +
> +    def testActiveIOFlushed(self):
> +        self.doActiveIO(True)
> +
> +
> +
> +if __name__ == '__main__':
> +    iotests.main(supported_fmts=['qcow2', 'raw'])
> diff --git a/tests/qemu-iotests/151.out b/tests/qemu-iotests/151.out
> new file mode 100644
> index 0000000000..fbc63e62f8
> --- /dev/null
> +++ b/tests/qemu-iotests/151.out
> @@ -0,0 +1,5 @@
> +..
> +----------------------------------------------------------------------
> +Ran 2 tests
> +
> +OK
> diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
> index 94e764865a..c64adbe5bf 100644
> --- a/tests/qemu-iotests/group
> +++ b/tests/qemu-iotests/group
> @@ -156,6 +156,7 @@
>  148 rw auto quick
>  149 rw auto sudo
>  150 rw auto quick
> +151 rw auto
>  152 rw auto quick
>  153 rw auto quick
>  154 rw auto backing quick
> -- 
> 2.13.5
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 17/18] qemu-io: Add background write
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 17/18] qemu-io: Add background write Max Reitz
@ 2017-09-18  6:46   ` Fam Zheng
  2017-09-18 17:53     ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Fam Zheng @ 2017-09-18  6:46 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

On Wed, 09/13 20:19, Max Reitz wrote:
> Add a new parameter -B to qemu-io's write command.  When used, qemu-io
> will not wait for the result of the operation and instead execute it in
> the background.

Cannot aio_write be used for this purpose?

Fam

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/18] block/mirror: Add active-sync mirroring
  2017-09-16 14:02   ` Max Reitz
@ 2017-09-18 10:02     ` Stefan Hajnoczi
  2017-09-18 15:42       ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Stefan Hajnoczi @ 2017-09-18 10:02 UTC (permalink / raw)
  To: Max Reitz; +Cc: Stefan Hajnoczi, Kevin Wolf, Fam Zheng, qemu-devel, qemu-block

On Sat, Sep 16, 2017 at 04:02:45PM +0200, Max Reitz wrote:
> On 2017-09-14 17:42, Stefan Hajnoczi wrote:
> > On Wed, Sep 13, 2017 at 08:18:52PM +0200, Max Reitz wrote:
> >> There may be a couple of things to do on top of this series:
> >> - Allow switching between active and passive mode at runtime: This
> >>   should not be too difficult to implement, the main question is how to
> >>   expose it to the user.
> >>   (I seem to recall we wanted some form of block-job-set-option
> >>   command...?)
> >>
> >> - Implement an asynchronous active mode: May be detrimental when it
> >>   comes to convergence, but it might be nice to have anyway.  May or may
> >>   not be complicated to implement.
> > 
> > Ideally the user doesn't have to know about async vs sync.  It's an
> > implementation detail.
> > 
> > Async makes sense during the bulk copy phase (e.g. sync=full) because
> > guest read/write latencies are mostly unaffected.  Once the entire
> > device has been copied there are probably still dirty blocks left
> > because the guest touched them while the mirror job was running.  At
> > that point it definitely makes sense to switch to synchronous mirroring
> > in order to converge.
> 
> Makes sense, but I'm not sure whether it really is just an
> implementation detail.  If you're in the bulk copy phase in active/async
> mode and you have enough write requests with the target being slow
> enough, I suspect you might still not get convergence then (because the
> writes to the target yield for a long time while ever more write
> requests pile up) -- so then you'd just shift the dirty tracking from
> the bitmap to a list of requests in progress.
> 
> And I think we do want the bulk copy phase to guarantee convergence,
> too, usually (when active/foreground/synchronous mode is selected).  If
> we don't, then that's a policy decision and would be up to libvirt, as I
> see it.

This is a good point.  Bulk copy should converge too.

Can we measure the target write rate and guest write rate?  A heuristic
can choose between async vs sync based on the write rates.

For example, if the guest write rate has been larger than the target
write rate for the past 10 seconds during the bulk phase, switch to
synchronous mirroring.
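
A minimal sketch of such a heuristic (hypothetical names and sampling scheme, not part of this series): sample both write rates and switch to synchronous mirroring only when the guest has outpaced the target for the whole window.

```python
from collections import deque

class MirrorModeHeuristic:
    """Decide between async and sync mirroring from sampled write rates.

    record() is fed (timestamp, guest rate, target rate) samples;
    should_go_sync() reports whether the guest has been writing faster
    than the target for the entire sliding window.
    """

    def __init__(self, window_s=10.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, guest_rate, target_rate)

    def record(self, now, guest_rate, target_rate):
        self.samples.append((now, guest_rate, target_rate))
        # Drop samples that have fallen out of the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def should_go_sync(self):
        if len(self.samples) < 2:
            return False
        # The samples must actually span the full window...
        if self.samples[-1][0] - self.samples[0][0] < self.window_s:
            return False
        # ...and the guest must have outpaced the target in every one.
        return all(g > t for _, g, t in self.samples)
```

A single sample where the target keeps up resets the verdict, so short write bursts do not trigger the switch.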

Stefan


* Re: [Qemu-devel] [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring
  2017-09-16 13:58     ` Max Reitz
@ 2017-09-18 10:06       ` Stefan Hajnoczi
  2017-09-18 16:26         ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Stefan Hajnoczi @ 2017-09-18 10:06 UTC (permalink / raw)
  To: Max Reitz; +Cc: Stefan Hajnoczi, Kevin Wolf, Fam Zheng, qemu-devel, qemu-block

On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
> > On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
> >> This patch implements active synchronous mirroring.  In active mode, the
> >> passive mechanism will still be in place and is used to copy all
> >> initially dirty clusters off the source disk; but every write request
> >> will write data both to the source and the target disk, so the source
> >> cannot be dirtied faster than data is mirrored to the target.  Also,
> >> once the block job has converged (BLOCK_JOB_READY sent), source and
> >> target are guaranteed to stay in sync (unless an error occurs).
> >>
> >> Optionally, dirty data can be copied to the target disk on read
> >> operations, too.
> >>
> >> Active mode is completely optional and currently disabled at runtime.  A
> >> later patch will add a way for users to enable it.
> >>
> >> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >> ---
> >>  qapi/block-core.json |  23 +++++++
> >>  block/mirror.c       | 187 +++++++++++++++++++++++++++++++++++++++++++++++++--
> >>  2 files changed, 205 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/qapi/block-core.json b/qapi/block-core.json
> >> index bb11815608..e072cfa67c 100644
> >> --- a/qapi/block-core.json
> >> +++ b/qapi/block-core.json
> >> @@ -938,6 +938,29 @@
> >>    'data': ['top', 'full', 'none', 'incremental'] }
> >>  
> >>  ##
> >> +# @MirrorCopyMode:
> >> +#
> >> +# An enumeration whose values tell the mirror block job when to
> >> +# trigger writes to the target.
> >> +#
> >> +# @passive: copy data in background only.
> >> +#
> >> +# @active-write: when data is written to the source, write it
> >> +#                (synchronously) to the target as well.  In addition,
> >> +#                data is copied in background just like in @passive
> >> +#                mode.
> >> +#
> >> +# @active-read-write: write data to the target (synchronously) both
> >> +#                     when it is read from and written to the source.
> >> +#                     In addition, data is copied in background just
> >> +#                     like in @passive mode.
> > 
> > I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
> > means committing the top-most BDS while the guest is accessing it.  The
> > "passive" mirror block still works on the top-most BDS while the guest
> > is accessing it.
> > 
> > Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
> > the terminology used in disk replication (e.g. DRBD).
> 
> I'd be OK with that, too, but I think I remember that in the past at
> least Kevin made a clear distinction between active/passive and
> sync/async when it comes to mirroring.
> 
> > Ideally the user wouldn't have to worry about async vs sync because QEMU
> > would switch modes as appropriate in order to converge.  That way
> > libvirt also doesn't have to worry about this.
> 
> So here you mean async/sync in the way I meant it, i.e., whether the
> mirror operations themselves are async/sync?

The meaning I had in mind is:

Sync mirroring means a guest write waits until the target write
completes.

Async mirroring means guest writes complete independently of target
writes.
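
With those definitions, the distinction can be illustrated by a toy model (asyncio stands in for QEMU's coroutines; none of these names are QEMU API):

```python
import asyncio

async def write_to_target(log, data):
    await asyncio.sleep(0.01)          # simulate a slower target
    log.append(('target', data))

async def guest_write_sync(log, data):
    """Sync mirroring: the guest write completes only once the
    target write has completed."""
    log.append(('source', data))
    await write_to_target(log, data)

async def guest_write_async(log, data):
    """Async mirroring: the guest write completes immediately; the
    target write proceeds in the background."""
    log.append(('source', data))
    return asyncio.create_task(write_to_target(log, data))

async def demo():
    sync_log, async_log = [], []
    await guest_write_sync(sync_log, 'a')
    pending = await guest_write_async(async_log, 'b')
    # Sync: target already written.  Async: target write still pending.
    assert ('target', 'a') in sync_log
    assert ('target', 'b') not in async_log
    await pending                      # let the background write finish
    assert ('target', 'b') in async_log

asyncio.run(demo())
```

In the async case the interval between the guest write completing and the target write landing is exactly the window the dirty bitmap (or a request list) has to cover.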


* Re: [Qemu-devel] [Qemu-block] [PATCH 00/18] block/mirror: Add active-sync mirroring
  2017-09-18 10:02     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2017-09-18 15:42       ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-18 15:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Kevin Wolf, Fam Zheng, qemu-devel, qemu-block


On 2017-09-18 12:02, Stefan Hajnoczi wrote:
> On Sat, Sep 16, 2017 at 04:02:45PM +0200, Max Reitz wrote:
>> On 2017-09-14 17:42, Stefan Hajnoczi wrote:
>>> On Wed, Sep 13, 2017 at 08:18:52PM +0200, Max Reitz wrote:
>>>> There may be a couple of things to do on top of this series:
>>>> - Allow switching between active and passive mode at runtime: This
>>>>   should not be too difficult to implement, the main question is how to
>>>>   expose it to the user.
>>>>   (I seem to recall we wanted some form of block-job-set-option
>>>>   command...?)
>>>>
>>>> - Implement an asynchronous active mode: May be detrimental when it
>>>>   comes to convergence, but it might be nice to have anyway.  May or may
>>>>   not be complicated to implement.
>>>
>>> Ideally the user doesn't have to know about async vs sync.  It's an
>>> implementation detail.
>>>
>>> Async makes sense during the bulk copy phase (e.g. sync=full) because
>>> guest read/write latencies are mostly unaffected.  Once the entire
>>> device has been copied there are probably still dirty blocks left
>>> because the guest touched them while the mirror job was running.  At
>>> that point it definitely makes sense to switch to synchronous mirroring
>>> in order to converge.
>>
>> Makes sense, but I'm not sure whether it really is just an
>> implementation detail.  If you're in the bulk copy phase in active/async
>> mode and you have enough write requests with the target being slow
>> enough, I suspect you might still not get convergence then (because the
>> writes to the target yield for a long time while ever more write
>> requests pile up) -- so then you'd just shift the dirty tracking from
>> the bitmap to a list of requests in progress.
>>
>> And I think we do want the bulk copy phase to guarantee convergence,
>> too, usually (when active/foreground/synchronous mode is selected).  If
>> we don't, then that's a policy decision and would be up to libvirt, as I
>> see it.
> 
> This is a good point.  Bulk copy should converge too.
> 
> Can we measure the target write rate and guest write rate?  A heuristic
> can choose between async vs sync based on the write rates.
> 
> For example, if the guest write rate has been larger than the target
> write rate for the past 10 seconds during the bulk phase, switch to
> synchronous mirroring.

I guess we can just count how many unfinished target write requests are
piling up.

...or libvirt can simply see that the block job is not progressing and
switch the mode. :-)
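
A minimal sketch of such an in-flight counter (hypothetical names and threshold; not part of this series):

```python
class TargetWriteTracker:
    """Track target writes still in flight.  A backlog that keeps
    growing during the bulk phase signals that asynchronous mirroring
    is not converging and a switch to synchronous mode is warranted."""

    def __init__(self, backlog_limit=64):
        self.in_flight = 0
        self.backlog_limit = backlog_limit

    def write_started(self):
        self.in_flight += 1

    def write_done(self):
        assert self.in_flight > 0
        self.in_flight -= 1

    def should_go_sync(self):
        # More unfinished target writes than the limit: not converging.
        return self.in_flight > self.backlog_limit
```

The threshold is pure policy, which is the crux of the discussion: whether QEMU or the management layer should pick it.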

Max



* Re: [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse
  2017-09-18  3:44   ` Fam Zheng
@ 2017-09-18 16:13     ` Max Reitz
  2017-10-09 18:30       ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-18 16:13 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow


On 2017-09-18 05:44, Fam Zheng wrote:
> On Wed, 09/13 20:18, Max Reitz wrote:
>> Draining a BDS child may lead to the original BDS and/or its other
>> children being deleted (e.g. if the original BDS represents a block
>> job).  We should prepare for this in both bdrv_drain_recurse() and
>> bdrv_drained_begin() by monitoring whether the BDS we are about to drain
>> still exists at all.
> 
> Can the deletion happen when IOThread calls
> bdrv_drain_recurse/bdrv_drained_begin?

I don't think so, because (1) my issue was draining a block job and that
can only be completed in the main loop, and (2) I would like to think
it's always impossible, considering that bdrv_unref() may only be called
with the BQL.

>                                         If not, is it enough to do
> 
>     ...
>     if (in_main_loop) {
>         bdrv_ref(bs);
>     }
>     ...
>     if (in_main_loop) {
>         bdrv_unref(bs);
>     }
> 
> to protect the main loop case? So the BdrvDeletedStatus state is not needed.

We already have that in bdrv_drain_recurse(), don't we?

The issue here is, though, that QLIST_FOREACH_SAFE() stores the next
child pointer to @tmp.  However, once the current child @child is
drained, @tmp may no longer be valid -- it may have been detached from
@bs, and it may even have been deleted.

We could work around the latter by increasing the next child's reference
somehow (but BdrvChild doesn't really have a refcount, and in order to
do so, we would probably have to emulate being a parent or
something...), but then you'd still have the issue of @tmp being
detached from the children list we're trying to iterate over.  So
tmp->next is no longer valid.

Anyway, so the latter is the reason why I decided to introduce the bs_list.

But maybe that actually saves us from having to fiddle with BdrvChild...
Since it's just a list of BDSs now, it may be enough to simply
bdrv_ref() all of the BDSs in that list before draining any of them.  So
we'd keep creating the bs_list and then we'd move the existing
bdrv_ref() from the drain loop into the loop filling bs_list.

And adding a bdrv_ref()/bdrv_unref() pair to bdrv_drained_begin() should
hopefully work there, too.
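
The pin-before-drain pattern described above can be sketched as a toy model (hypothetical names; BdrvChild and QEMU's real refcounting rules are more involved):

```python
class Node:
    """Toy refcounted graph node standing in for a BDS."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph
        self.refcount = 1

    def ref(self):
        self.refcount += 1

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.graph.remove(self)    # the node is deleted only now

def drain_all(graph, drain_one):
    # Snapshot the nodes into a flat list and take a reference on each
    # BEFORE draining anything: draining one node may otherwise detach
    # or delete a sibling the loop still needs.
    bs_list = list(graph)
    for bs in bs_list:
        bs.ref()
    try:
        for bs in bs_list:
            drain_one(bs)
    finally:
        # Deletions triggered during the drain take effect here at the
        # earliest, once iteration is over and it is safe.
        for bs in bs_list:
            bs.unref()
```

Because the snapshot is taken up front, a callback that drops a node's last external reference merely defers its deletion instead of invalidating the iteration.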

Max



* Re: [Qemu-devel] [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring
  2017-09-18 10:06       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2017-09-18 16:26         ` Max Reitz
  2017-09-19  9:44           ` Stefan Hajnoczi
  2017-10-10 10:16           ` Kevin Wolf
  0 siblings, 2 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-18 16:26 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Kevin Wolf, Fam Zheng, qemu-devel, qemu-block


On 2017-09-18 12:06, Stefan Hajnoczi wrote:
> On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
>> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
>>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
>>>> This patch implements active synchronous mirroring.  In active mode, the
>>>> passive mechanism will still be in place and is used to copy all
>>>> initially dirty clusters off the source disk; but every write request
>>>> will write data both to the source and the target disk, so the source
>>>> cannot be dirtied faster than data is mirrored to the target.  Also,
>>>> once the block job has converged (BLOCK_JOB_READY sent), source and
>>>> target are guaranteed to stay in sync (unless an error occurs).
>>>>
>>>> Optionally, dirty data can be copied to the target disk on read
>>>> operations, too.
>>>>
>>>> Active mode is completely optional and currently disabled at runtime.  A
>>>> later patch will add a way for users to enable it.
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>  qapi/block-core.json |  23 +++++++
>>>>  block/mirror.c       | 187 +++++++++++++++++++++++++++++++++++++++++++++++++--
>>>>  2 files changed, 205 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>>> index bb11815608..e072cfa67c 100644
>>>> --- a/qapi/block-core.json
>>>> +++ b/qapi/block-core.json
>>>> @@ -938,6 +938,29 @@
>>>>    'data': ['top', 'full', 'none', 'incremental'] }
>>>>  
>>>>  ##
>>>> +# @MirrorCopyMode:
>>>> +#
>>>> +# An enumeration whose values tell the mirror block job when to
>>>> +# trigger writes to the target.
>>>> +#
>>>> +# @passive: copy data in background only.
>>>> +#
>>>> +# @active-write: when data is written to the source, write it
>>>> +#                (synchronously) to the target as well.  In addition,
>>>> +#                data is copied in background just like in @passive
>>>> +#                mode.
>>>> +#
>>>> +# @active-read-write: write data to the target (synchronously) both
>>>> +#                     when it is read from and written to the source.
>>>> +#                     In addition, data is copied in background just
>>>> +#                     like in @passive mode.
>>>
>>> I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
>>> means committing the top-most BDS while the guest is accessing it.  The
>>> "passive" mirror block still works on the top-most BDS while the guest
>>> is accessing it.
>>>
>>> Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
>>> the terminology used in disk replication (e.g. DRBD).
>>
>> I'd be OK with that, too, but I think I remember that in the past at
>> least Kevin made a clear distinction between active/passive and
>> sync/async when it comes to mirroring.
>>
>>> Ideally the user wouldn't have to worry about async vs sync because QEMU
>>> would switch modes as appropriate in order to converge.  That way
>>> libvirt also doesn't have to worry about this.
>>
>> So here you mean async/sync in the way I meant it, i.e., whether the
>> mirror operations themselves are async/sync?
> 
> The meaning I had in mind is:
> 
> Sync mirroring means a guest write waits until the target write
> completes.

I.e. active-sync, ...

> Async mirroring means guest writes complete independently of target
> writes.

... i.e. passive or active-async in the future.

So you really want qemu to decide whether to use active or passive mode
depending on what's enough to let the block job converge and not
introduce any switch for the user?

I'm not sure whether I like this too much, mostly because "libvirt
doesn't have to worry" doesn't feel quite right to me.  If we don't make
libvirt worry about this, then qemu has to worry.  I'm not sure whether
that's any better.

I think this really does get into policy territory.  Just switching to
active mode the instant target writes are slower than source writes may
not be what the user wants: Maybe it's OK for a short duration because
they don't care about hard convergence too much.  Maybe they want to
switch to active mode already when "only" twice as much is written to
the target as to the source.

I think this is a decision the management layer (or the user) has to make.

Max



* Re: [Qemu-devel] [PATCH 05/18] block/mirror: Convert to coroutines
  2017-09-18  6:02   ` Fam Zheng
@ 2017-09-18 16:41     ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-18 16:41 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow


On 2017-09-18 08:02, Fam Zheng wrote:
> On Wed, 09/13 20:18, Max Reitz wrote:
>> In order to talk to the source BDS (and maybe in the future to the
>> target BDS as well) directly, we need to convert our existing AIO
>> requests into coroutine I/O requests.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  block/mirror.c | 134 +++++++++++++++++++++++++++++++++------------------------
>>  1 file changed, 78 insertions(+), 56 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 4664b0516f..2b3297aa61 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -80,6 +80,9 @@ typedef struct MirrorOp {
>>      QEMUIOVector qiov;
>>      int64_t offset;
>>      uint64_t bytes;
>> +
>> +    /* Set by mirror_co_read() before yielding for the first time */
>> +    uint64_t bytes_copied;
>>  } MirrorOp;
>>  
>>  typedef enum MirrorMethod {
>> @@ -101,7 +104,7 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
>>      }
>>  }
>>  
>> -static void mirror_iteration_done(MirrorOp *op, int ret)
>> +static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
>>  {
>>      MirrorBlockJob *s = op->s;
>>      struct iovec *iov;
>> @@ -138,9 +141,8 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>>      }
>>  }
>>  
>> -static void mirror_write_complete(void *opaque, int ret)
>> +static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
>>  {
>> -    MirrorOp *op = opaque;
>>      MirrorBlockJob *s = op->s;
>>  
>>      aio_context_acquire(blk_get_aio_context(s->common.blk));
>> @@ -158,9 +160,8 @@ static void mirror_write_complete(void *opaque, int ret)
>>      aio_context_release(blk_get_aio_context(s->common.blk));
>>  }
>>  
>> -static void mirror_read_complete(void *opaque, int ret)
>> +static void coroutine_fn mirror_read_complete(MirrorOp *op, int ret)
>>  {
>> -    MirrorOp *op = opaque;
>>      MirrorBlockJob *s = op->s;
>>  
>>      aio_context_acquire(blk_get_aio_context(s->common.blk));
>> @@ -176,8 +177,11 @@ static void mirror_read_complete(void *opaque, int ret)
>>  
>>          mirror_iteration_done(op, ret);
>>      } else {
>> -        blk_aio_pwritev(s->target, op->offset, &op->qiov,
>> -                        0, mirror_write_complete, op);
>> +        int ret;
>> +
>> +        ret = blk_co_pwritev(s->target, op->offset,
>> +                             op->qiov.size, &op->qiov, 0);
>> +        mirror_write_complete(op, ret);
>>      }
>>      aio_context_release(blk_get_aio_context(s->common.blk));
>>  }
>> @@ -242,53 +246,49 @@ static inline void mirror_wait_for_io(MirrorBlockJob *s)
>>   *          (new_end - offset) if tail is rounded up or down due to
>>   *          alignment or buffer limit.
>>   */
>> -static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
>> -                               uint64_t bytes)
>> +static void coroutine_fn mirror_co_read(void *opaque)
>>  {
>> +    MirrorOp *op = opaque;
>> +    MirrorBlockJob *s = op->s;
>>      BlockBackend *source = s->common.blk;
>>      int nb_chunks;
>>      uint64_t ret;
>> -    MirrorOp *op;
>>      uint64_t max_bytes;
>>  
>>      max_bytes = s->granularity * s->max_iov;
>>  
>>      /* We can only handle as much as buf_size at a time. */
>> -    bytes = MIN(s->buf_size, MIN(max_bytes, bytes));
>> -    assert(bytes);
>> -    assert(bytes < BDRV_REQUEST_MAX_BYTES);
>> -    ret = bytes;
>> +    op->bytes = MIN(s->buf_size, MIN(max_bytes, op->bytes));
>> +    assert(op->bytes);
>> +    assert(op->bytes < BDRV_REQUEST_MAX_BYTES);
>> +    op->bytes_copied = op->bytes;
>>  
>>      if (s->cow_bitmap) {
>> -        ret += mirror_cow_align(s, &offset, &bytes);
>> +        op->bytes_copied += mirror_cow_align(s, &op->offset, &op->bytes);
>>      }
>> -    assert(bytes <= s->buf_size);
>> +    /* Cannot exceed BDRV_REQUEST_MAX_BYTES + INT_MAX */
>> +    assert(op->bytes_copied <= UINT_MAX);
>> +    assert(op->bytes <= s->buf_size);
>>      /* The offset is granularity-aligned because:
>>       * 1) Caller passes in aligned values;
>>       * 2) mirror_cow_align is used only when target cluster is larger. */
>> -    assert(QEMU_IS_ALIGNED(offset, s->granularity));
>> +    assert(QEMU_IS_ALIGNED(op->offset, s->granularity));
>>      /* The range is sector-aligned, since bdrv_getlength() rounds up. */
>> -    assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE));
>> -    nb_chunks = DIV_ROUND_UP(bytes, s->granularity);
>> +    assert(QEMU_IS_ALIGNED(op->bytes, BDRV_SECTOR_SIZE));
>> +    nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
>>  
>>      while (s->buf_free_count < nb_chunks) {
>> -        trace_mirror_yield_in_flight(s, offset, s->in_flight);
>> +        trace_mirror_yield_in_flight(s, op->offset, s->in_flight);
>>          mirror_wait_for_io(s);
>>      }
>>  
>> -    /* Allocate a MirrorOp that is used as an AIO callback.  */
>> -    op = g_new(MirrorOp, 1);
>> -    op->s = s;
>> -    op->offset = offset;
>> -    op->bytes = bytes;
>> -
>>      /* Now make a QEMUIOVector taking enough granularity-sized chunks
>>       * from s->buf_free.
>>       */
>>      qemu_iovec_init(&op->qiov, nb_chunks);
>>      while (nb_chunks-- > 0) {
>>          MirrorBuffer *buf = QSIMPLEQ_FIRST(&s->buf_free);
>> -        size_t remaining = bytes - op->qiov.size;
>> +        size_t remaining = op->bytes - op->qiov.size;
>>  
>>          QSIMPLEQ_REMOVE_HEAD(&s->buf_free, next);
>>          s->buf_free_count--;
>> @@ -297,53 +297,75 @@ static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
>>  
>>      /* Copy the dirty cluster.  */
>>      s->in_flight++;
>> -    s->bytes_in_flight += bytes;
>> -    trace_mirror_one_iteration(s, offset, bytes);
>> +    s->bytes_in_flight += op->bytes;
>> +    trace_mirror_one_iteration(s, op->offset, op->bytes);
>>  
>> -    blk_aio_preadv(source, offset, &op->qiov, 0, mirror_read_complete, op);
>> -    return ret;
>> +    ret = blk_co_preadv(source, op->offset, op->bytes, &op->qiov, 0);
>> +    mirror_read_complete(op, ret);
>>  }
>>  
>> -static void mirror_do_zero_or_discard(MirrorBlockJob *s,
>> -                                      int64_t offset,
>> -                                      uint64_t bytes,
>> -                                      bool is_discard)
>> +static void coroutine_fn mirror_co_zero(void *opaque)
>>  {
>> -    MirrorOp *op;
>> +    MirrorOp *op = opaque;
>> +    int ret;
>>  
>> -    /* Allocate a MirrorOp that is used as an AIO callback. The qiov is zeroed
>> -     * so the freeing in mirror_iteration_done is nop. */
>> -    op = g_new0(MirrorOp, 1);
>> -    op->s = s;
>> -    op->offset = offset;
>> -    op->bytes = bytes;
>> +    op->s->in_flight++;
>> +    op->s->bytes_in_flight += op->bytes;
>>  
>> -    s->in_flight++;
>> -    s->bytes_in_flight += bytes;
>> -    if (is_discard) {
>> -        blk_aio_pdiscard(s->target, offset,
>> -                         op->bytes, mirror_write_complete, op);
>> -    } else {
>> -        blk_aio_pwrite_zeroes(s->target, offset,
>> -                              op->bytes, s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
>> -                              mirror_write_complete, op);
>> -    }
>> +    ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
>> +                               op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
>> +    mirror_write_complete(op, ret);
>> +}
>> +
>> +static void coroutine_fn mirror_co_discard(void *opaque)
>> +{
>> +    MirrorOp *op = opaque;
>> +    int ret;
>> +
>> +    op->s->in_flight++;
>> +    op->s->bytes_in_flight += op->bytes;
>> +
>> +    ret = blk_co_pdiscard(op->s->target, op->offset, op->bytes);
>> +    mirror_write_complete(op, ret);
>>  }
>>  
>>  static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
>>                                 unsigned bytes, MirrorMethod mirror_method)
>>  {
>> +    MirrorOp *op;
>> +    Coroutine *co;
>> +    unsigned ret = bytes;
>> +
>> +    op = g_new(MirrorOp, 1);
>> +    *op = (MirrorOp){
>> +        .s      = s,
>> +        .offset = offset,
>> +        .bytes  = bytes,
>> +    };
>> +
>>      switch (mirror_method) {
>>      case MIRROR_METHOD_COPY:
>> -        return mirror_do_read(s, offset, bytes);
>> +        co = qemu_coroutine_create(mirror_co_read, op);
>> +        break;
>>      case MIRROR_METHOD_ZERO:
>> +        co = qemu_coroutine_create(mirror_co_zero, op);
>> +        break;
>>      case MIRROR_METHOD_DISCARD:
>> -        mirror_do_zero_or_discard(s, offset, bytes,
>> -                                  mirror_method == MIRROR_METHOD_DISCARD);
>> -        return bytes;
>> +        co = qemu_coroutine_create(mirror_co_discard, op);
>> +        break;
>>      default:
>>          abort();
>>      }
>> +
>> +    qemu_coroutine_enter(co);
>> +
>> +    if (mirror_method == MIRROR_METHOD_COPY) {
>> +        /* Same assertion as in mirror_co_read() */
>> +        assert(op->bytes_copied <= UINT_MAX);
>> +        ret = op->bytes_copied;
>> +    }
> 
> This special casing is a bit ugly. Can you just make mirror_co_zero and
> mirror_co_discard set op->bytes_copied too? (and perhaps rename to
> op->bytes_handled) If so the comment in MirrorOp needs an update too.

Sure.

> And is it better to initialize it to -1 before entering the coroutine, then
> assert that it is != -1 afterwards?

Sounds good, will do.

Max



* Re: [Qemu-devel] [PATCH 18/18] iotests: Add test for active mirroring
  2017-09-18  6:45   ` Fam Zheng
@ 2017-09-18 16:53     ` Max Reitz
  2017-09-19  8:08       ` Fam Zheng
  0 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-18 16:53 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow


On 2017-09-18 08:45, Fam Zheng wrote:
> On Wed, 09/13 20:19, Max Reitz wrote:
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  tests/qemu-iotests/151     | 111 +++++++++++++++++++++++++++++++++++++++++++++
>>  tests/qemu-iotests/151.out |   5 ++
>>  tests/qemu-iotests/group   |   1 +
>>  3 files changed, 117 insertions(+)
>>  create mode 100755 tests/qemu-iotests/151
>>  create mode 100644 tests/qemu-iotests/151.out
>>
>> diff --git a/tests/qemu-iotests/151 b/tests/qemu-iotests/151
>> new file mode 100755
>> index 0000000000..49a60773f9
>> --- /dev/null
>> +++ b/tests/qemu-iotests/151
>> @@ -0,0 +1,111 @@
>> +#!/usr/bin/env python
>> +#
>> +# Tests for active mirroring
>> +#
>> +# Copyright (C) 2017 Red Hat, Inc.
>> +#
>> +# This program is free software; you can redistribute it and/or modify
>> +# it under the terms of the GNU General Public License as published by
>> +# the Free Software Foundation; either version 2 of the License, or
>> +# (at your option) any later version.
>> +#
>> +# This program is distributed in the hope that it will be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
>> +#
>> +
>> +import os
>> +import iotests
>> +from iotests import qemu_img
>> +
>> +source_img = os.path.join(iotests.test_dir, 'source.' + iotests.imgfmt)
>> +target_img = os.path.join(iotests.test_dir, 'target.' + iotests.imgfmt)
>> +
>> +class TestActiveMirror(iotests.QMPTestCase):
>> +    image_len = 128 * 1024 * 1024 # 128 MB
>> +    potential_writes_in_flight = True
>> +
>> +    def setUp(self):
>> +        qemu_img('create', '-f', iotests.imgfmt, source_img, '128M')
>> +        qemu_img('create', '-f', iotests.imgfmt, target_img, '128M')
>> +
>> +        blk_source = {'node-name': 'source',
>> +                      'driver': iotests.imgfmt,
>> +                      'file': {'driver': 'file',
>> +                               'filename': source_img}}
>> +
>> +        blk_target = {'node-name': 'target',
>> +                      'driver': iotests.imgfmt,
>> +                      'file': {'driver': 'file',
>> +                               'filename': target_img}}
>> +
>> +        self.vm = iotests.VM()
>> +        self.vm.add_blockdev(self.qmp_to_opts(blk_source))
>> +        self.vm.add_blockdev(self.qmp_to_opts(blk_target))
>> +        self.vm.launch()
>> +
>> +    def tearDown(self):
>> +        self.vm.shutdown()
>> +
>> +        if not self.potential_writes_in_flight:
>> +            self.assertTrue(iotests.compare_images(source_img, target_img),
>> +                            'mirror target does not match source')
>> +
>> +        os.remove(source_img)
>> +        os.remove(target_img)
>> +
>> +    def doActiveIO(self, sync_source_and_target):
>> +        # Fill the source image
>> +        self.vm.hmp_qemu_io('source',
>> +                            'write -P 1 0 %i' % self.image_len)
>> +
>> +        # Start some background requests
>> +        for offset in range(0, self.image_len, 1024 * 1024):
>> +            self.vm.hmp_qemu_io('source', 'write -B -P 2 %i 1M' % offset)
>> +
>> +        # Start the block job
>> +        result = self.vm.qmp('blockdev-mirror',
>> +                             job_id='mirror',
>> +                             filter_node_name='mirror-node',
>> +                             device='source',
>> +                             target='target',
>> +                             sync='full',
>> +                             copy_mode='active-write')
>> +        self.assert_qmp(result, 'return', {})
>> +
>> +        # Start some more requests
>> +        for offset in range(0, self.image_len, 1024 * 1024):
>> +            self.vm.hmp_qemu_io('mirror-node', 'write -B -P 3 %i 1M' % offset)
>> +
>> +        # Wait for the READY event
>> +        self.wait_ready(drive='mirror')
>> +
>> +        # Now start some final requests; all of these (which land on
>> +        # the source) should be settled using the active mechanism.
>> +        # The mirror code itself asserts that the source BDS's dirty
>> +        # bitmap will stay clean between READY and COMPLETED.
>> +        for offset in range(0, self.image_len, 1024 * 1024):
>> +            self.vm.hmp_qemu_io('mirror-node', 'write -B -P 4 %i 1M' % offset)
>> +
>> +        if sync_source_and_target:
>> +            # If source and target should be in sync after the mirror,
>> +            # we have to flush before completion
> 
> Not sure I understand this requirement; does it apply to libvirt and users too?
> I.e. is it part of the interface?  Why can't mirror_complete do it
> automatically?

Well, it seems to pass without this flush, but the original intention
was this: When mirror is completed, the source node is replaced by the
target.  All further writes are then only executed on the (former)
target node.  So what might happen (or at least I think it could) is
that qemu-io submits some writes, but before they are actually
performed, the mirror block job is completed and the source is replaced
by the target.  Then, the write operations are performed on the target
but no longer on the source, so source and target are then out of sync.

For the mirror block job, that is fine -- at the point of completion,
source and target were in sync.  The job doesn't care that they get out
of sync after completion.  But here, we have to care or we can't compare
source and target.

The reason why it always seems to pass without a flush is that every
submitted write is actually sent to the mirror node before it yields for
the first time.  But I wouldn't bet on that, so I think it's better to
keep the flush before completing the block job.
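To illustrate the race described above, here is a toy model in Python
(not QEMU code; all names and the timing model are made up for the
sketch): writes are only *submitted* to the mirror node and performed
later, and completing the job before draining them means the source
misses writes that the target still receives.

```python
import asyncio

# Toy model of the race: an "active mirror" node writes to both source
# and target while the job runs; once the job completes, the target
# replaces the source and further writes only reach the target.
class MirrorNode:
    def __init__(self):
        self.source = {}
        self.target = {}
        self.completed = False   # after completion, writes hit target only
        self.in_flight = []

    def submit_write(self, offset, data):
        # The write is merely queued here; it is performed asynchronously.
        self.in_flight.append(asyncio.ensure_future(self._perform(offset, data)))

    async def _perform(self, offset, data):
        await asyncio.sleep(0)          # yield before actually writing
        if not self.completed:
            self.source[offset] = data  # active mirror: write both sides
        self.target[offset] = data

    async def flush(self):
        # Drain all submitted-but-unperformed writes.
        await asyncio.gather(*self.in_flight)
        self.in_flight.clear()

    def complete_job(self):
        self.completed = True           # graph switch: target replaces source

async def run(flush_first):
    node = MirrorNode()
    for off in range(4):
        node.submit_write(off, b'x')
    if flush_first:
        await node.flush()              # drain before completing the job
    node.complete_job()
    await node.flush()                  # let any remaining writes finish
    return node.source == node.target

in_sync_with_flush = asyncio.run(run(True))     # True: drained first
in_sync_without_flush = asyncio.run(run(False)) # False: writes landed on target only
```

With the flush, all in-flight writes are performed on both sides before
the switch, so source and target compare equal afterwards; without it,
the queued writes only reach the target.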

Max

> 
> Fam
> 
>> +            self.vm.hmp_qemu_io('mirror-node', 'flush')
>> +            self.potential_writes_in_flight = False
>> +
>> +        self.complete_and_wait(drive='mirror', wait_ready=False)
>> +
>> +    def testActiveIO(self):
>> +        self.doActiveIO(False)
>> +
>> +    def testActiveIOFlushed(self):
>> +        self.doActiveIO(True)
>> +
>> +
>> +
>> +if __name__ == '__main__':
>> +    iotests.main(supported_fmts=['qcow2', 'raw'])
>> diff --git a/tests/qemu-iotests/151.out b/tests/qemu-iotests/151.out
>> new file mode 100644
>> index 0000000000..fbc63e62f8
>> --- /dev/null
>> +++ b/tests/qemu-iotests/151.out
>> @@ -0,0 +1,5 @@
>> +..
>> +----------------------------------------------------------------------
>> +Ran 2 tests
>> +
>> +OK
>> diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
>> index 94e764865a..c64adbe5bf 100644
>> --- a/tests/qemu-iotests/group
>> +++ b/tests/qemu-iotests/group
>> @@ -156,6 +156,7 @@
>>  148 rw auto quick
>>  149 rw auto sudo
>>  150 rw auto quick
>> +151 rw auto
>>  152 rw auto quick
>>  153 rw auto quick
>>  154 rw auto backing quick
>> -- 
>> 2.13.5
>>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 17/18] qemu-io: Add background write
  2017-09-18  6:46   ` Fam Zheng
@ 2017-09-18 17:53     ` Max Reitz
  2017-09-19  8:03       ` Fam Zheng
  0 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-18 17:53 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

[-- Attachment #1: Type: text/plain, Size: 2188 bytes --]

On 2017-09-18 08:46, Fam Zheng wrote:
> On Wed, 09/13 20:19, Max Reitz wrote:
>> Add a new parameter -B to qemu-io's write command.  When used, qemu-io
>> will not wait for the result of the operation and instead execute it in
>> the background.
> 
> Cannot aio_write be used for this purpose?

Depends.  I have been trained to dislike *_aio_*, so that's probably the
initial reason why I didn't use it.

Second, I'd have to fix aio_write before it can be used.  Currently,
this aborts:

echo 'qemu-io drv0 "aio_write -P 0x11 0 64M"' \
    | x86_64-softmmu/qemu-system-x86_64 -monitor stdio \
          -blockdev node-name=drv0,driver=null-co

because aio_write_done thinks it's a good idea to use qemu-io's
BlockBackend -- but when qemu-io is executed through the HMP, the
BlockBackend is only created for the duration of the qemu-io command
(unless there already is a BB).  So what I'd have to do is add a
blk_ref()/blk_unref() there, but for some reason I really don't like that.

So I'd probably have to give up on using -blockdev in the new iotest and
would have to use -drive again.
(Note: With if=none, it still aborts while doing the block accounting,
and I have looked long enough into it to just decide I'd go with
if=virtio instead.)

So, yes, it appears I can use aio_write, together with -drive if=virtio
instead of -blockdev.

The remaining difference is the following: With aio_write, all writes
come from the same BlockBackend, and they are really asynchronous.
That's nice because it's like a guest behaves.

With write -B, they come from different BBs and the BB is usually
already gone when the write is completed -- or maybe destroying the BB
means that everything is flushed and thus the writes are not necessarily
asynchronous.  That doesn't seem so nice, but this behavior made me
write patch 13, so maybe it actually is a good idea to test this.

So I'm a bit torn.  On one hand it seems to be a good idea to use
aio_write because that's already there and it's good enough to simulate
a guest.  But on the other hand, write -B gives a bit more funny
behavior which in my opinion is always good for a test...

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]


* Re: [Qemu-devel] [PATCH 17/18] qemu-io: Add background write
  2017-09-18 17:53     ` Max Reitz
@ 2017-09-19  8:03       ` Fam Zheng
  2017-09-21 14:40         ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Fam Zheng @ 2017-09-19  8:03 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

On Mon, 09/18 19:53, Max Reitz wrote:
> On 2017-09-18 08:46, Fam Zheng wrote:
> > On Wed, 09/13 20:19, Max Reitz wrote:
> >> Add a new parameter -B to qemu-io's write command.  When used, qemu-io
> >> will not wait for the result of the operation and instead execute it in
> >> the background.
> > 
> > Cannot aio_write be used for this purpose?
> 
> Depends.  I have been trained to dislike *_aio_*, so that's probably the
> initial reason why I didn't use it.
> 
> Second, I'd have to fix aio_write before it can be used.  Currently,
> this aborts:
> 
> echo 'qemu-io drv0 "aio_write -P 0x11 0 64M"' \
>     | x86_64-softmmu/qemu-system-x86_64 -monitor stdio \
>           -blockdev node-name=drv0,driver=null-co
> 
> because aio_write_done thinks it's a good idea to use qemu-io's
> BlockBackend -- but when qemu-io is executed through the HMP, the
> BlockBackend is only created for the duration of the qemu-io command
> (unless there already is a BB).  So what I'd have to do is add a
> blk_ref()/blk_unref() there, but for some reason I really don't like that.

What is the reason? If it crashes it should be fixed anyway, I assume?

Fam


* Re: [Qemu-devel] [PATCH 18/18] iotests: Add test for active mirroring
  2017-09-18 16:53     ` Max Reitz
@ 2017-09-19  8:08       ` Fam Zheng
  0 siblings, 0 replies; 64+ messages in thread
From: Fam Zheng @ 2017-09-19  8:08 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

On Mon, 09/18 18:53, Max Reitz wrote:
> >> +
> >> +        if sync_source_and_target:
> >> +            # If source and target should be in sync after the mirror,
> >> +            # we have to flush before completion
> > 
> > Not sure I understand this requirement -- does it apply to libvirt and the user too?
> > I.e., is it part of the interface?  Why can't mirror_complete do it
> > automatically?
> 
> Well, it seems to pass without this flush, but the original intention
> was this: When mirror is completed, the source node is replaced by the
> target.  All further writes are then only executed on the (former)
> target node.  So what might happen (or at least I think it could) is
> that qemu-io submits some writes, but before they are actually
> performed, the mirror block job is completed and the source is replaced
> by the target.  Then, the write operations are performed on the target
> but no longer on the source, so source and target are then out of sync.
> For the mirror block job, that is fine -- at the point of completion,
> source and target were in sync.  The job doesn't care that they get out
> of sync after completion.  But here, we have to care or we can't compare
> source and target.
> 
> The reason why it always seems to pass without a flush is that every
> submitted write is actually sent to the mirror node before it yields for
> the first time.  But I wouldn't bet on that, so I think it's better to
> keep the flush before completing the block job.

OK, that makes sense. Thanks.

Fam


* Re: [Qemu-devel] [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring
  2017-09-18 16:26         ` Max Reitz
@ 2017-09-19  9:44           ` Stefan Hajnoczi
  2017-09-19  9:57             ` Daniel P. Berrange
  2017-10-10 10:16           ` Kevin Wolf
  1 sibling, 1 reply; 64+ messages in thread
From: Stefan Hajnoczi @ 2017-09-19  9:44 UTC (permalink / raw)
  To: Eric Blake
  Cc: Stefan Hajnoczi, Kevin Wolf, Fam Zheng, qemu-devel, qemu-block,
	Max Reitz

On Mon, Sep 18, 2017 at 06:26:51PM +0200, Max Reitz wrote:
> On 2017-09-18 12:06, Stefan Hajnoczi wrote:
> > On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
> >> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
> >>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
> >>>> This patch implements active synchronous mirroring.  In active mode, the
> >>>> passive mechanism will still be in place and is used to copy all
> >>>> initially dirty clusters off the source disk; but every write request
> >>>> will write data both to the source and the target disk, so the source
> >>>> cannot be dirtied faster than data is mirrored to the target.  Also,
> >>>> once the block job has converged (BLOCK_JOB_READY sent), source and
> >>>> target are guaranteed to stay in sync (unless an error occurs).
> >>>>
> >>>> Optionally, dirty data can be copied to the target disk on read
> >>>> operations, too.
> >>>>
> >>>> Active mode is completely optional and currently disabled at runtime.  A
> >>>> later patch will add a way for users to enable it.
> >>>>
> >>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>>> ---
> >>>>  qapi/block-core.json |  23 +++++++
> >>>>  block/mirror.c       | 187 +++++++++++++++++++++++++++++++++++++++++++++++++--
> >>>>  2 files changed, 205 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
> >>>> index bb11815608..e072cfa67c 100644
> >>>> --- a/qapi/block-core.json
> >>>> +++ b/qapi/block-core.json
> >>>> @@ -938,6 +938,29 @@
> >>>>    'data': ['top', 'full', 'none', 'incremental'] }
> >>>>  
> >>>>  ##
> >>>> +# @MirrorCopyMode:
> >>>> +#
> >>>> +# An enumeration whose values tell the mirror block job when to
> >>>> +# trigger writes to the target.
> >>>> +#
> >>>> +# @passive: copy data in background only.
> >>>> +#
> >>>> +# @active-write: when data is written to the source, write it
> >>>> +#                (synchronously) to the target as well.  In addition,
> >>>> +#                data is copied in background just like in @passive
> >>>> +#                mode.
> >>>> +#
> >>>> +# @active-read-write: write data to the target (synchronously) both
> >>>> +#                     when it is read from and written to the source.
> >>>> +#                     In addition, data is copied in background just
> >>>> +#                     like in @passive mode.
> >>>
> >>> I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
> >>> means committing the top-most BDS while the guest is accessing it.  The
> >>> "passive" mirror block still works on the top-most BDS while the guest
> >>> is accessing it.
> >>>
> >>> Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
> >>> the terminology used in disk replication (e.g. DRBD).
> >>
> >> I'd be OK with that, too, but I think I remember that in the past at
> >> least Kevin made a clear distinction between active/passive and
> >> sync/async when it comes to mirroring.
> >>
> >>> Ideally the user wouldn't have to worry about async vs sync because QEMU
> >>> would switch modes as appropriate in order to converge.  That way
> >>> libvirt also doesn't have to worry about this.
> >>
> >> So here you mean async/sync in the way I meant it, i.e., whether the
> >> mirror operations themselves are async/sync?
> > 
> > The meaning I had in mind is:
> > 
> > Sync mirroring means a guest write waits until the target write
> > completes.
> 
> I.e. active-sync, ...
> 
> > Async mirroring means guest writes completes independently of target
> > writes.
> 
> ... i.e. passive or active-async in the future.
> 
> So you really want qemu to decide whether to use active or passive mode
> depending on what's enough to let the block job converge and not
> introduce any switch for the user?
> 
> I'm not sure whether I like this too much, mostly because "libvirt
> doesn't have to worry" doesn't feel quite right to me.  If we don't make
> libvirt worry about this, then qemu has to worry.  I'm not sure whether
> that's any better.
> 
> I think this really does get into policy territory.  Just switching to
> active mode the instant target writes are slower than source writes may
> not be what the user wants: Maybe it's OK for a short duration because
> they don't care about hard convergence too much.  Maybe they want to
> switch to active mode already when "only" twice as much is written to
> the target as to the source.
> 
> I think this is a decision the management layer (or the user) has to make.

Eric: Does libvirt want to be involved with converging the mirror job
(i.e. if the guest is writing to disk faster than QEMU can copy data to
the target)?

Stefan


* Re: [Qemu-devel] [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring
  2017-09-19  9:44           ` Stefan Hajnoczi
@ 2017-09-19  9:57             ` Daniel P. Berrange
  2017-09-20 14:56               ` Stefan Hajnoczi
  0 siblings, 1 reply; 64+ messages in thread
From: Daniel P. Berrange @ 2017-09-19  9:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Eric Blake, Kevin Wolf, Fam Zheng, qemu-block, qemu-devel,
	Max Reitz, Stefan Hajnoczi

On Tue, Sep 19, 2017 at 10:44:16AM +0100, Stefan Hajnoczi wrote:
> On Mon, Sep 18, 2017 at 06:26:51PM +0200, Max Reitz wrote:
> > On 2017-09-18 12:06, Stefan Hajnoczi wrote:
> > > On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
> > >> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
> > >>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
> > >>>> This patch implements active synchronous mirroring.  In active mode, the
> > >>>> passive mechanism will still be in place and is used to copy all
> > >>>> initially dirty clusters off the source disk; but every write request
> > >>>> will write data both to the source and the target disk, so the source
> > >>>> cannot be dirtied faster than data is mirrored to the target.  Also,
> > >>>> once the block job has converged (BLOCK_JOB_READY sent), source and
> > >>>> target are guaranteed to stay in sync (unless an error occurs).
> > >>>>
> > >>>> Optionally, dirty data can be copied to the target disk on read
> > >>>> operations, too.
> > >>>>
> > >>>> Active mode is completely optional and currently disabled at runtime.  A
> > >>>> later patch will add a way for users to enable it.
> > >>>>
> > >>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> > >>>> ---
> > >>>>  qapi/block-core.json |  23 +++++++
> > >>>>  block/mirror.c       | 187 +++++++++++++++++++++++++++++++++++++++++++++++++--
> > >>>>  2 files changed, 205 insertions(+), 5 deletions(-)
> > >>>>
> > >>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
> > >>>> index bb11815608..e072cfa67c 100644
> > >>>> --- a/qapi/block-core.json
> > >>>> +++ b/qapi/block-core.json
> > >>>> @@ -938,6 +938,29 @@
> > >>>>    'data': ['top', 'full', 'none', 'incremental'] }
> > >>>>  
> > >>>>  ##
> > >>>> +# @MirrorCopyMode:
> > >>>> +#
> > >>>> +# An enumeration whose values tell the mirror block job when to
> > >>>> +# trigger writes to the target.
> > >>>> +#
> > >>>> +# @passive: copy data in background only.
> > >>>> +#
> > >>>> +# @active-write: when data is written to the source, write it
> > >>>> +#                (synchronously) to the target as well.  In addition,
> > >>>> +#                data is copied in background just like in @passive
> > >>>> +#                mode.
> > >>>> +#
> > >>>> +# @active-read-write: write data to the target (synchronously) both
> > >>>> +#                     when it is read from and written to the source.
> > >>>> +#                     In addition, data is copied in background just
> > >>>> +#                     like in @passive mode.
> > >>>
> > >>> I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
> > >>> means committing the top-most BDS while the guest is accessing it.  The
> > >>> "passive" mirror block still works on the top-most BDS while the guest
> > >>> is accessing it.
> > >>>
> > >>> Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
> > >>> the terminology used in disk replication (e.g. DRBD).
> > >>
> > >> I'd be OK with that, too, but I think I remember that in the past at
> > >> least Kevin made a clear distinction between active/passive and
> > >> sync/async when it comes to mirroring.
> > >>
> > >>> Ideally the user wouldn't have to worry about async vs sync because QEMU
> > >>> would switch modes as appropriate in order to converge.  That way
> > >>> libvirt also doesn't have to worry about this.
> > >>
> > >> So here you mean async/sync in the way I meant it, i.e., whether the
> > >> mirror operations themselves are async/sync?
> > > 
> > > The meaning I had in mind is:
> > > 
> > > Sync mirroring means a guest write waits until the target write
> > > completes.
> > 
> > I.e. active-sync, ...
> > 
> > > Async mirroring means guest writes completes independently of target
> > > writes.
> > 
> > ... i.e. passive or active-async in the future.
> > 
> > So you really want qemu to decide whether to use active or passive mode
> > depending on what's enough to let the block job converge and not
> > introduce any switch for the user?
> > 
> > I'm not sure whether I like this too much, mostly because "libvirt
> > doesn't have to worry" doesn't feel quite right to me.  If we don't make
> > libvirt worry about this, then qemu has to worry.  I'm not sure whether
> > that's any better.
> > 
> > I think this really does get into policy territory.  Just switching to
> > active mode the instant target writes are slower than source writes may
> > not be what the user wants: Maybe it's OK for a short duration because
> > they don't care about hard convergence too much.  Maybe they want to
> > switch to active mode already when "only" twice as much is written to
> > the target as to the source.
> > 
> > I think this is a decision the management layer (or the user) has to make.
> 
> Eric: Does libvirt want to be involved with converging the mirror job
> (i.e. if the guest is writing to disk faster than QEMU can copy data to
> the target)?

Libvirt doesn't really want to set the policy - it will just need to expose
the mechanism & information to make such decisions up to the application.

I agree with Max that we don't want QEMU making this policy decision on
its own. Simply switching from passive to active mode is not an approach
that will be satisfactory in all scenarios. It might be preferable to
apply CPU throttling to the guest, or I/O rate throttling, or to temporarily
pause the guest to allow it to catch up, or any number of other policies.
Neither libvirt nor QEMU knows which is best for the scenario at hand.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring
  2017-09-19  9:57             ` Daniel P. Berrange
@ 2017-09-20 14:56               ` Stefan Hajnoczi
  0 siblings, 0 replies; 64+ messages in thread
From: Stefan Hajnoczi @ 2017-09-20 14:56 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Stefan Hajnoczi, Eric Blake, Kevin Wolf, Fam Zheng, qemu-block,
	qemu-devel, Max Reitz

On Tue, Sep 19, 2017 at 10:57:50AM +0100, Daniel P. Berrange wrote:
> On Tue, Sep 19, 2017 at 10:44:16AM +0100, Stefan Hajnoczi wrote:
> > On Mon, Sep 18, 2017 at 06:26:51PM +0200, Max Reitz wrote:
> > > On 2017-09-18 12:06, Stefan Hajnoczi wrote:
> > > > On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
> > > >> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
> > > >>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
> > > >>>> This patch implements active synchronous mirroring.  In active mode, the
> > > >>>> passive mechanism will still be in place and is used to copy all
> > > >>>> initially dirty clusters off the source disk; but every write request
> > > >>>> will write data both to the source and the target disk, so the source
> > > >>>> cannot be dirtied faster than data is mirrored to the target.  Also,
> > > >>>> once the block job has converged (BLOCK_JOB_READY sent), source and
> > > >>>> target are guaranteed to stay in sync (unless an error occurs).
> > > >>>>
> > > >>>> Optionally, dirty data can be copied to the target disk on read
> > > >>>> operations, too.
> > > >>>>
> > > >>>> Active mode is completely optional and currently disabled at runtime.  A
> > > >>>> later patch will add a way for users to enable it.
> > > >>>>
> > > >>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> > > >>>> ---
> > > >>>>  qapi/block-core.json |  23 +++++++
> > > >>>>  block/mirror.c       | 187 +++++++++++++++++++++++++++++++++++++++++++++++++--
> > > >>>>  2 files changed, 205 insertions(+), 5 deletions(-)
> > > >>>>
> > > >>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
> > > >>>> index bb11815608..e072cfa67c 100644
> > > >>>> --- a/qapi/block-core.json
> > > >>>> +++ b/qapi/block-core.json
> > > >>>> @@ -938,6 +938,29 @@
> > > >>>>    'data': ['top', 'full', 'none', 'incremental'] }
> > > >>>>  
> > > >>>>  ##
> > > >>>> +# @MirrorCopyMode:
> > > >>>> +#
> > > >>>> +# An enumeration whose values tell the mirror block job when to
> > > >>>> +# trigger writes to the target.
> > > >>>> +#
> > > >>>> +# @passive: copy data in background only.
> > > >>>> +#
> > > >>>> +# @active-write: when data is written to the source, write it
> > > >>>> +#                (synchronously) to the target as well.  In addition,
> > > >>>> +#                data is copied in background just like in @passive
> > > >>>> +#                mode.
> > > >>>> +#
> > > >>>> +# @active-read-write: write data to the target (synchronously) both
> > > >>>> +#                     when it is read from and written to the source.
> > > >>>> +#                     In addition, data is copied in background just
> > > >>>> +#                     like in @passive mode.
> > > >>>
> > > >>> I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
> > > >>> means committing the top-most BDS while the guest is accessing it.  The
> > > >>> "passive" mirror block still works on the top-most BDS while the guest
> > > >>> is accessing it.
> > > >>>
> > > >>> Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
> > > >>> the terminology used in disk replication (e.g. DRBD).
> > > >>
> > > >> I'd be OK with that, too, but I think I remember that in the past at
> > > >> least Kevin made a clear distinction between active/passive and
> > > >> sync/async when it comes to mirroring.
> > > >>
> > > >>> Ideally the user wouldn't have to worry about async vs sync because QEMU
> > > >>> would switch modes as appropriate in order to converge.  That way
> > > >>> libvirt also doesn't have to worry about this.
> > > >>
> > > >> So here you mean async/sync in the way I meant it, i.e., whether the
> > > >> mirror operations themselves are async/sync?
> > > > 
> > > > The meaning I had in mind is:
> > > > 
> > > > Sync mirroring means a guest write waits until the target write
> > > > completes.
> > > 
> > > I.e. active-sync, ...
> > > 
> > > > Async mirroring means guest writes completes independently of target
> > > > writes.
> > > 
> > > ... i.e. passive or active-async in the future.
> > > 
> > > So you really want qemu to decide whether to use active or passive mode
> > > depending on what's enough to let the block job converge and not
> > > introduce any switch for the user?
> > > 
> > > I'm not sure whether I like this too much, mostly because "libvirt
> > > doesn't have to worry" doesn't feel quite right to me.  If we don't make
> > > libvirt worry about this, then qemu has to worry.  I'm not sure whether
> > > that's any better.
> > > 
> > > I think this really does get into policy territory.  Just switching to
> > > active mode the instant target writes are slower than source writes may
> > > not be what the user wants: Maybe it's OK for a short duration because
> > > they don't care about hard convergence too much.  Maybe they want to
> > > switch to active mode already when "only" twice as much is written to
> > > the target as to the source.
> > > 
> > > I think this is a decision the management layer (or the user) has to make.
> > 
> > Eric: Does libvirt want to be involved with converging the mirror job
> > (i.e. if the guest is writing to disk faster than QEMU can copy data to
> > the target)?
> 
> Libvirt doesn't really want to set the policy - it will just need to expose
> the mechanism & information to make such decisions up to the application.
> 
> I agree with Max that we don't want QEMU making this policy decision on
> its own. Simply switching from passive to active mode is not an approach
> that will be satisfactory in all scenarios. It might be preferable to
> apply CPU throttling to the guest, or I/O rate throttling, or to temporarily
> pause the guest to allow it to catch up, or any number of other policies.
> Neither libvirt nor QEMU knows which is best for the scenario at hand.

Okay.

Stefan


* Re: [Qemu-devel] [PATCH 17/18] qemu-io: Add background write
  2017-09-19  8:03       ` Fam Zheng
@ 2017-09-21 14:40         ` Max Reitz
  2017-09-21 14:59           ` Fam Zheng
  0 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-09-21 14:40 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

[-- Attachment #1: Type: text/plain, Size: 2301 bytes --]

On 2017-09-19 10:03, Fam Zheng wrote:
> On Mon, 09/18 19:53, Max Reitz wrote:
>> On 2017-09-18 08:46, Fam Zheng wrote:
>>> On Wed, 09/13 20:19, Max Reitz wrote:
>>>> Add a new parameter -B to qemu-io's write command.  When used, qemu-io
>>>> will not wait for the result of the operation and instead execute it in
>>>> the background.
>>>
>>> Cannot aio_write be used for this purpose?
>>
>> Depends.  I have been trained to dislike *_aio_*, so that's probably the
>> initial reason why I didn't use it.
>>
>> Second, I'd have to fix aio_write before it can be used.  Currently,
>> this aborts:
>>
>> echo 'qemu-io drv0 "aio_write -P 0x11 0 64M"' \
>>     | x86_64-softmmu/qemu-system-x86_64 -monitor stdio \
>>           -blockdev node-name=drv0,driver=null-co
>>
>> because aio_write_done thinks it's a good idea to use qemu-io's
>> BlockBackend -- but when qemu-io is executed through the HMP, the
>> BlockBackend is only created for the duration of the qemu-io command
>> (unless there already is a BB).  So what I'd have to do is add a
>> blk_ref()/blk_unref() there, but for some reason I really don't like that.
> 
> What is the reason? If it crashes it should be fixed anyway, I assume?

Because the AIO CB (aio_write_done()) continues to use qemu-io's BB --
but in the case of HMP's qemu-io, that is pretty much already gone once
the command is done.

That could be fixed, as I said, by blk_ref()ing the BB before aio_write
returns (and then blk_unref()ing it in aio_write_done()).  However, I'm
not even sure whether aio_write_done() is always executed in the main
thread...

Other than that, I just have a bad feeling about adding the pair, not
sure why.  Probably because it means having to carry a temporary BB
around until the command is done, which is weird.  Well, it's not an
issue permission-wise, because the qemu-io BB simply doesn't take the
proper permissions (no, I'm not going to question how it's then even
possible to write to it, considering we have assertions that check
whether the correct permissions have been taken...), and I can't think
of another way.

In any case, you're right, it probably needs to be fixed anyway -- even
if the fix is just not allowing aio_write with a temporary BB (i.e. from
HMP).

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]


* Re: [Qemu-devel] [PATCH 17/18] qemu-io: Add background write
  2017-09-21 14:40         ` Max Reitz
@ 2017-09-21 14:59           ` Fam Zheng
  2017-09-21 15:03             ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Fam Zheng @ 2017-09-21 14:59 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

On Thu, 09/21 16:40, Max Reitz wrote:
> On 2017-09-19 10:03, Fam Zheng wrote:
> > On Mon, 09/18 19:53, Max Reitz wrote:
> >> On 2017-09-18 08:46, Fam Zheng wrote:
> >>> On Wed, 09/13 20:19, Max Reitz wrote:
> >>>> Add a new parameter -B to qemu-io's write command.  When used, qemu-io
> >>>> will not wait for the result of the operation and instead execute it in
> >>>> the background.
> >>>
> >>> Cannot aio_write be used for this purpose?
> >>
> >> Depends.  I have been trained to dislike *_aio_*, so that's probably the
> >> initial reason why I didn't use it.
> >>
> >> Second, I'd have to fix aio_write before it can be used.  Currently,
> >> this aborts:
> >>
> >> echo 'qemu-io drv0 "aio_write -P 0x11 0 64M"' \
> >>     | x86_64-softmmu/qemu-system-x86_64 -monitor stdio \
> >>           -blockdev node-name=drv0,driver=null-co
> >>
> >> because aio_write_done thinks it's a good idea to use qemu-io's
> >> BlockBackend -- but when qemu-io is executed through the HMP, the
> >> BlockBackend is only created for the duration of the qemu-io command
> >> (unless there already is a BB).  So what I'd have to do is add a
> >> blk_ref()/blk_unref() there, but for some reason I really don't like that.
> > 
> > What is the reason? If it crashes it should be fixed anyway, I assume?
> 
> Because the AIO CB (aio_write_done()) continues to use qemu-io's BB --
> but in case of HMP's qemu-io, that is pretty much already gone once the
> command is done.

I can see that aio_{read,write}_done accesses the BB for accounting; we can
probably skip that part altogether if issued from HMP (because the BB is
gone).  This way you don't need the blk_ref/unref pair.

Fam

> 
> That could be fixed, as I said, by blk_ref()ing the BB before aio_write
> returns (and then blk_unref()ing it in aio_write_done()).  However, I'm
> not even sure whether aio_write_done() is always executed in the main
> thread...
> 
> Other than that, I just have a bad feeling about adding the pair, not
> sure why.  Probably because it means having to carry a temporary BB
> around until the command is done, which is weird.  Well, it's not an
> issue permission-wise, because the qemu-io BB simply doesn't take the
> proper permissions (no, I'm not going to question the fact how it's then
> possible to even write to it, considering we have assertions that check
> whether the correct permissions have been taken...), and I can't think
> of another way.
> 
> In any case, you're right, it probably needs to be fixed anyway -- even
> if the fix is just not allowing aio_write with a temporary BB (i.e. from
> HMP).
> 
> Max
> 


* Re: [Qemu-devel] [PATCH 17/18] qemu-io: Add background write
  2017-09-21 14:59           ` Fam Zheng
@ 2017-09-21 15:03             ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-21 15:03 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

[-- Attachment #1: Type: text/plain, Size: 1983 bytes --]

On 2017-09-21 16:59, Fam Zheng wrote:
> On Thu, 09/21 16:40, Max Reitz wrote:
>> On 2017-09-19 10:03, Fam Zheng wrote:
>>> On Mon, 09/18 19:53, Max Reitz wrote:
>>>> On 2017-09-18 08:46, Fam Zheng wrote:
>>>>> On Wed, 09/13 20:19, Max Reitz wrote:
>>>>>> Add a new parameter -B to qemu-io's write command.  When used, qemu-io
>>>>>> will not wait for the result of the operation and instead execute it in
>>>>>> the background.
>>>>>
>>>>> Cannot aio_write be used for this purpose?
>>>>
>>>> Depends.  I have been trained to dislike *_aio_*, so that's probably the
>>>> initial reason why I didn't use it.
>>>>
>>>> Second, I'd have to fix aio_write before it can be used.  Currently,
>>>> this aborts:
>>>>
>>>> echo 'qemu-io drv0 "aio_write -P 0x11 0 64M"' \
>>>>     | x86_64-softmmu/qemu-system-x86_64 -monitor stdio \
>>>>           -blockdev node-name=drv0,driver=null-co
>>>>
>>>> because aio_write_done thinks it's a good idea to use qemu-io's
>>>> BlockBackend -- but when qemu-io is executed through the HMP, the
>>>> BlockBackend is only created for the duration of the qemu-io command
>>>> (unless there already is a BB).  So what I'd have to do is add a
>>>> blk_ref()/blk_unref() there, but for some reason I really don't like that.
>>>
>>> What is the reason? If it crashes it should be fixed anyway, I assume?
>>
>> Because the AIO CB (aio_write_done()) continues to use qemu-io's BB --
>> but in case of HMP's qemu-io, that is pretty much already gone once the
>> command is done.
> 
> I can see that aio_{read,write}_done accesses the BB for accounting; we can
> probably skip that part altogether when issued from HMP (because the BB is
> gone by then). This way you don't need the blk_ref/unref pair.

Yep, and then it'd be functionally the same as this write -B, so that
sounds good.

Well, I fear that someone will have to rewrite it with coroutines
somewhere in the future anyway, but, er, well, not a problem for now!!1! :-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]


* Re: [Qemu-devel] [PATCH 04/18] block/mirror: Pull out mirror_perform()
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 04/18] block/mirror: Pull out mirror_perform() Max Reitz
  2017-09-18  3:48   ` Fam Zheng
@ 2017-09-25  9:38   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 64+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-09-25  9:38 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, John Snow

13.09.2017 21:18, Max Reitz wrote:
> When converting mirror's I/O to coroutines, we are going to need a point
> where these coroutines are created.  mirror_perform() is going to be
> that point.
>
> Signed-off-by: Max Reitz <mreitz@redhat.com>


Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

> ---
>   block/mirror.c | 53 ++++++++++++++++++++++++++++++-----------------------
>   1 file changed, 30 insertions(+), 23 deletions(-)
>
> diff --git a/block/mirror.c b/block/mirror.c
> index 6531652d73..4664b0516f 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -82,6 +82,12 @@ typedef struct MirrorOp {
>       uint64_t bytes;
>   } MirrorOp;
>   
> +typedef enum MirrorMethod {
> +    MIRROR_METHOD_COPY,
> +    MIRROR_METHOD_ZERO,
> +    MIRROR_METHOD_DISCARD,
> +} MirrorMethod;
> +
>   static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
>                                               int error)
>   {
> @@ -324,6 +330,22 @@ static void mirror_do_zero_or_discard(MirrorBlockJob *s,
>       }
>   }
>   
> +static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
> +                               unsigned bytes, MirrorMethod mirror_method)
> +{
> +    switch (mirror_method) {
> +    case MIRROR_METHOD_COPY:
> +        return mirror_do_read(s, offset, bytes);
> +    case MIRROR_METHOD_ZERO:
> +    case MIRROR_METHOD_DISCARD:
> +        mirror_do_zero_or_discard(s, offset, bytes,
> +                                  mirror_method == MIRROR_METHOD_DISCARD);
> +        return bytes;
> +    default:
> +        abort();
> +    }
> +}
> +
>   static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>   {
>       BlockDriverState *source = s->source;
> @@ -395,11 +417,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>           unsigned int io_bytes;
>           int64_t io_bytes_acct;
>           BlockDriverState *file;
> -        enum MirrorMethod {
> -            MIRROR_METHOD_COPY,
> -            MIRROR_METHOD_ZERO,
> -            MIRROR_METHOD_DISCARD
> -        } mirror_method = MIRROR_METHOD_COPY;
> +        MirrorMethod mirror_method = MIRROR_METHOD_COPY;
>   
>           assert(!(offset % s->granularity));
>           ret = bdrv_get_block_status_above(source, NULL,
> @@ -439,22 +457,11 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>           }
>   
>           io_bytes = mirror_clip_bytes(s, offset, io_bytes);
> -        switch (mirror_method) {
> -        case MIRROR_METHOD_COPY:
> -            io_bytes = io_bytes_acct = mirror_do_read(s, offset, io_bytes);
> -            break;
> -        case MIRROR_METHOD_ZERO:
> -        case MIRROR_METHOD_DISCARD:
> -            mirror_do_zero_or_discard(s, offset, io_bytes,
> -                                      mirror_method == MIRROR_METHOD_DISCARD);
> -            if (write_zeroes_ok) {
> -                io_bytes_acct = 0;
> -            } else {
> -                io_bytes_acct = io_bytes;
> -            }
> -            break;
> -        default:
> -            abort();
> +        io_bytes = mirror_perform(s, offset, io_bytes, mirror_method);
> +        if (mirror_method != MIRROR_METHOD_COPY && write_zeroes_ok) {
> +            io_bytes_acct = 0;
> +        } else {
> +            io_bytes_acct = io_bytes;
>           }
>           assert(io_bytes);
>           offset += io_bytes;
> @@ -650,8 +657,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
>                   continue;
>               }
>   
> -            mirror_do_zero_or_discard(s, sector_num * BDRV_SECTOR_SIZE,
> -                                      nb_sectors * BDRV_SECTOR_SIZE, false);
> +            mirror_perform(s, sector_num * BDRV_SECTOR_SIZE,
> +                           nb_sectors * BDRV_SECTOR_SIZE, MIRROR_METHOD_ZERO);
>               sector_num += nb_sectors;
>           }
>   


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH 11/18] hbitmap: Add @advance param to hbitmap_iter_next()
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 11/18] hbitmap: Add @advance param to hbitmap_iter_next() Max Reitz
@ 2017-09-25 15:38   ` Vladimir Sementsov-Ogievskiy
  2017-09-25 20:40     ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-09-25 15:38 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, John Snow

13.09.2017 21:19, Max Reitz wrote:
> This new parameter allows the caller to just query the next dirty
> position without moving the iterator.
>
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   include/qemu/hbitmap.h |  4 +++-
>   block/dirty-bitmap.c   |  2 +-
>   tests/test-hbitmap.c   | 26 +++++++++++++-------------
>   util/hbitmap.c         | 10 +++++++---
>   4 files changed, 24 insertions(+), 18 deletions(-)
>
> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
> index d3a74a21fc..6a52575ad5 100644
> --- a/include/qemu/hbitmap.h
> +++ b/include/qemu/hbitmap.h
> @@ -316,11 +316,13 @@ void hbitmap_free_meta(HBitmap *hb);
>   /**
>    * hbitmap_iter_next:
>    * @hbi: HBitmapIter to operate on.
> + * @advance: If true, advance the iterator.  Otherwise, the next call
> + *           of this function will return the same result.

it's not quite right, as the hbitmap iterator allows concurrent resetting
of bits, in which case the next call may return a different result.
(see f63ea4e92bad1db)

>    *
>    * Return the next bit that is set in @hbi's associated HBitmap,
>    * or -1 if all remaining bits are zero.
>    */
> -int64_t hbitmap_iter_next(HBitmapIter *hbi);
> +int64_t hbitmap_iter_next(HBitmapIter *hbi, bool advance);
>   
>   /**
>    * hbitmap_iter_next_word:
> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> index 30462d4f9a..aee57cf8c8 100644
> --- a/block/dirty-bitmap.c
> +++ b/block/dirty-bitmap.c
> @@ -547,7 +547,7 @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter)
>   
>   int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
>   {
> -    return hbitmap_iter_next(&iter->hbi);
> +    return hbitmap_iter_next(&iter->hbi, true);
>   }
>   
>   /* Called within bdrv_dirty_bitmap_lock..unlock */
> diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
> index 1acb353889..e6d4d563cb 100644
> --- a/tests/test-hbitmap.c
> +++ b/tests/test-hbitmap.c
> @@ -46,7 +46,7 @@ static void hbitmap_test_check(TestHBitmapData *data,
>   
>       i = first;
>       for (;;) {
> -        next = hbitmap_iter_next(&hbi);
> +        next = hbitmap_iter_next(&hbi, true);
>           if (next < 0) {
>               next = data->size;
>           }
> @@ -435,25 +435,25 @@ static void test_hbitmap_iter_granularity(TestHBitmapData *data,
>       /* Note that hbitmap_test_check has to be invoked manually in this test.  */
>       hbitmap_test_init(data, 131072 << 7, 7);
>       hbitmap_iter_init(&hbi, data->hb, 0);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>   
>       hbitmap_test_set(data, ((L2 + L1 + 1) << 7) + 8, 8);
>       hbitmap_iter_init(&hbi, data->hb, 0);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>   
>       hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>   
>       hbitmap_test_set(data, (131072 << 7) - 8, 8);
>       hbitmap_iter_init(&hbi, data->hb, 0);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>   
>       hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>   }
>   
>   static void hbitmap_test_set_boundary_bits(TestHBitmapData *data, ssize_t diff)
> @@ -893,7 +893,7 @@ static void test_hbitmap_serialize_zeroes(TestHBitmapData *data,
>       for (i = 0; i < num_positions; i++) {
>           hbitmap_deserialize_zeroes(data->hb, positions[i], min_l1, true);
>           hbitmap_iter_init(&iter, data->hb, 0);
> -        next = hbitmap_iter_next(&iter);
> +        next = hbitmap_iter_next(&iter, true);
>           if (i == num_positions - 1) {
>               g_assert_cmpint(next, ==, -1);
>           } else {
> @@ -919,10 +919,10 @@ static void test_hbitmap_iter_and_reset(TestHBitmapData *data,
>   
>       hbitmap_iter_init(&hbi, data->hb, BITS_PER_LONG - 1);
>   
> -    hbitmap_iter_next(&hbi);
> +    hbitmap_iter_next(&hbi, true);
>   
>       hbitmap_reset_all(data->hb);
> -    hbitmap_iter_next(&hbi);
> +    hbitmap_iter_next(&hbi, true);
>   }
>   
>   int main(int argc, char **argv)
> diff --git a/util/hbitmap.c b/util/hbitmap.c
> index 21535cc90b..96525983ce 100644
> --- a/util/hbitmap.c
> +++ b/util/hbitmap.c
> @@ -141,7 +141,7 @@ unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi)
>       return cur;
>   }
>   
> -int64_t hbitmap_iter_next(HBitmapIter *hbi)
> +int64_t hbitmap_iter_next(HBitmapIter *hbi, bool advance)
>   {
>       unsigned long cur = hbi->cur[HBITMAP_LEVELS - 1] &
>               hbi->hb->levels[HBITMAP_LEVELS - 1][hbi->pos];
> @@ -154,8 +154,12 @@ int64_t hbitmap_iter_next(HBitmapIter *hbi)
>           }
>       }
>   
> -    /* The next call will resume work from the next bit.  */
> -    hbi->cur[HBITMAP_LEVELS - 1] = cur & (cur - 1);
> +    if (advance) {
> +        /* The next call will resume work from the next bit.  */
> +        hbi->cur[HBITMAP_LEVELS - 1] = cur & (cur - 1);
> +    } else {
> +        hbi->cur[HBITMAP_LEVELS - 1] = cur;
> +    }
>       item = ((uint64_t)hbi->pos << BITS_PER_LEVEL) + ctzl(cur);
>   
>       return item << hbi->granularity;


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH 12/18] block/dirty-bitmap: Add bdrv_dirty_iter_next_area
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 12/18] block/dirty-bitmap: Add bdrv_dirty_iter_next_area Max Reitz
@ 2017-09-25 15:49   ` Vladimir Sementsov-Ogievskiy
  2017-09-25 20:43     ` Max Reitz
  2017-10-02 13:32     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 2 replies; 64+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-09-25 15:49 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, John Snow

I have a patch on the list which adds an hbitmap_next_zero function; it may help:
https://lists.nongnu.org/archive/html/qemu-devel/2017-02/msg00809.html

13.09.2017 21:19, Max Reitz wrote:
> This new function allows looking for a consecutively dirty area in a
> dirty bitmap.
>
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   include/block/dirty-bitmap.h |  2 ++
>   block/dirty-bitmap.c         | 52 ++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 54 insertions(+)
>
> diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
> index a79a58d2c3..7654748700 100644
> --- a/include/block/dirty-bitmap.h
> +++ b/include/block/dirty-bitmap.h
> @@ -90,6 +90,8 @@ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
>   void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
>                                       int64_t cur_sector, int64_t nr_sectors);
>   int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
> +bool bdrv_dirty_iter_next_area(BdrvDirtyBitmapIter *iter, uint64_t max_offset,
> +                               uint64_t *offset, int *bytes);
>   void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
>   int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
>   int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> index aee57cf8c8..81b2f78016 100644
> --- a/block/dirty-bitmap.c
> +++ b/block/dirty-bitmap.c
> @@ -550,6 +550,58 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
>       return hbitmap_iter_next(&iter->hbi, true);
>   }
>   
> +/**
> + * Return the next consecutively dirty area in the dirty bitmap
> + * belonging to the given iterator @iter.
> + *
> + * @max_offset: Maximum value that may be returned for
> + *              *offset + *bytes
> + * @offset:     Will contain the start offset of the next dirty area
> + * @bytes:      Will contain the length of the next dirty area
> + *
> + * Returns: True if a dirty area could be found before max_offset
> + *          (which means that *offset and *bytes then contain valid
> + *          values), false otherwise.
> + */
> +bool bdrv_dirty_iter_next_area(BdrvDirtyBitmapIter *iter, uint64_t max_offset,
> +                               uint64_t *offset, int *bytes)
> +{
> +    uint32_t granularity = bdrv_dirty_bitmap_granularity(iter->bitmap);
> +    uint64_t gran_max_offset;
> +    int sector_gran = granularity >> BDRV_SECTOR_BITS;
> +    int64_t ret;
> +    int size;
> +
> +    if (DIV_ROUND_UP(max_offset, BDRV_SECTOR_SIZE) == iter->bitmap->size) {
> +        /* If max_offset points to the image end, round it up by the
> +         * bitmap granularity */
> +        gran_max_offset = ROUND_UP(max_offset, granularity);
> +    } else {
> +        gran_max_offset = max_offset;
> +    }
> +
> +    ret = hbitmap_iter_next(&iter->hbi, false);
> +    if (ret < 0 || (ret << BDRV_SECTOR_BITS) + granularity > gran_max_offset) {
> +        return false;
> +    }
> +
> +    *offset = ret << BDRV_SECTOR_BITS;
> +    size = 0;
> +
> +    assert(granularity <= INT_MAX);
> +
> +    do {
> +        /* Advance iterator */
> +        ret = hbitmap_iter_next(&iter->hbi, true);
> +        size += granularity;
> +    } while ((ret << BDRV_SECTOR_BITS) + granularity <= gran_max_offset &&
> +             hbitmap_iter_next(&iter->hbi, false) == ret + sector_gran &&
> +             size <= INT_MAX - granularity);
> +
> +    *bytes = MIN(size, max_offset - *offset);
> +    return true;
> +}
> +
>   /* Called within bdrv_dirty_bitmap_lock..unlock */
>   void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
>                                     int64_t cur_sector, int64_t nr_sectors)


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH 11/18] hbitmap: Add @advance param to hbitmap_iter_next()
  2017-09-25 15:38   ` Vladimir Sementsov-Ogievskiy
@ 2017-09-25 20:40     ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-25 20:40 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, John Snow

[-- Attachment #1: Type: text/plain, Size: 1396 bytes --]

On 2017-09-25 17:38, Vladimir Sementsov-Ogievskiy wrote:
> 13.09.2017 21:19, Max Reitz wrote:
>> This new parameter allows the caller to just query the next dirty
>> position without moving the iterator.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   include/qemu/hbitmap.h |  4 +++-
>>   block/dirty-bitmap.c   |  2 +-
>>   tests/test-hbitmap.c   | 26 +++++++++++++-------------
>>   util/hbitmap.c         | 10 +++++++---
>>   4 files changed, 24 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
>> index d3a74a21fc..6a52575ad5 100644
>> --- a/include/qemu/hbitmap.h
>> +++ b/include/qemu/hbitmap.h
>> @@ -316,11 +316,13 @@ void hbitmap_free_meta(HBitmap *hb);
>>   /**
>>    * hbitmap_iter_next:
>>    * @hbi: HBitmapIter to operate on.
>> + * @advance: If true, advance the iterator.  Otherwise, the next call
>> + *           of this function will return the same result.
> 
> it's not quite right, as the hbitmap iterator allows concurrent resetting
> of bits, in which case the next call may return a different result.
> (see f63ea4e92bad1db)

Ah, right!  I think it should still be useful for what I (currently)
need in patch 12; I would just need a different description then.

(Like "...will return the same result (if that position is still dirty).")

Max




* Re: [Qemu-devel] [PATCH 12/18] block/dirty-bitmap: Add bdrv_dirty_iter_next_area
  2017-09-25 15:49   ` Vladimir Sementsov-Ogievskiy
@ 2017-09-25 20:43     ` Max Reitz
  2017-10-02 13:32     ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-09-25 20:43 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, John Snow

[-- Attachment #1: Type: text/plain, Size: 650 bytes --]

On 2017-09-25 17:49, Vladimir Sementsov-Ogievskiy wrote:
> I have a patch on list, which adds hbitmap_next_zero function, it may help
> https://lists.nongnu.org/archive/html/qemu-devel/2017-02/msg00809.html

Hmmm.  Sounds good, but (1) I would need to directly access the bitmap
instead of the iterator, and (2) I would still need to clear the whole
area in the iterator...

It does sound tempting because I could drop the previous patch, then
(and thus wouldn't have to worry about concurrent resetting), but I
don't think the whole implementation would be simpler.

I'll think about it, but thanks for pointing it out in any case!

Max




* Re: [Qemu-devel] [PATCH 12/18] block/dirty-bitmap: Add bdrv_dirty_iter_next_area
  2017-09-25 15:49   ` Vladimir Sementsov-Ogievskiy
  2017-09-25 20:43     ` Max Reitz
@ 2017-10-02 13:32     ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 64+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-10-02 13:32 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, John Snow

25.09.2017 18:49, Vladimir Sementsov-Ogievskiy wrote:
> I have a patch on the list which adds an hbitmap_next_zero function; it
> may help:
> https://lists.nongnu.org/archive/html/qemu-devel/2017-02/msg00809.html

There is a mistake in this hbitmap_next_zero; I'll send a corrected
version today as part of a small backup-related series.


>
> 13.09.2017 21:19, Max Reitz wrote:
>> This new function allows looking for a consecutively dirty area in a
>> dirty bitmap.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   include/block/dirty-bitmap.h |  2 ++
>>   block/dirty-bitmap.c         | 52 
>> ++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 54 insertions(+)
>>
>> diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
>> index a79a58d2c3..7654748700 100644
>> --- a/include/block/dirty-bitmap.h
>> +++ b/include/block/dirty-bitmap.h
>> @@ -90,6 +90,8 @@ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap 
>> *bitmap,
>>   void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
>>                                       int64_t cur_sector, int64_t 
>> nr_sectors);
>>   int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
>> +bool bdrv_dirty_iter_next_area(BdrvDirtyBitmapIter *iter, uint64_t 
>> max_offset,
>> +                               uint64_t *offset, int *bytes);
>>   void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t 
>> sector_num);
>>   int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
>>   int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
>> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
>> index aee57cf8c8..81b2f78016 100644
>> --- a/block/dirty-bitmap.c
>> +++ b/block/dirty-bitmap.c
>> @@ -550,6 +550,58 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter 
>> *iter)
>>       return hbitmap_iter_next(&iter->hbi, true);
>>   }
>>   +/**
>> + * Return the next consecutively dirty area in the dirty bitmap
>> + * belonging to the given iterator @iter.
>> + *
>> + * @max_offset: Maximum value that may be returned for
>> + *              *offset + *bytes
>> + * @offset:     Will contain the start offset of the next dirty area
>> + * @bytes:      Will contain the length of the next dirty area
>> + *
>> + * Returns: True if a dirty area could be found before max_offset
>> + *          (which means that *offset and *bytes then contain valid
>> + *          values), false otherwise.
>> + */
>> +bool bdrv_dirty_iter_next_area(BdrvDirtyBitmapIter *iter, uint64_t 
>> max_offset,
>> +                               uint64_t *offset, int *bytes)
>> +{
>> +    uint32_t granularity = bdrv_dirty_bitmap_granularity(iter->bitmap);
>> +    uint64_t gran_max_offset;
>> +    int sector_gran = granularity >> BDRV_SECTOR_BITS;
>> +    int64_t ret;
>> +    int size;
>> +
>> +    if (DIV_ROUND_UP(max_offset, BDRV_SECTOR_SIZE) == 
>> iter->bitmap->size) {
>> +        /* If max_offset points to the image end, round it up by the
>> +         * bitmap granularity */
>> +        gran_max_offset = ROUND_UP(max_offset, granularity);
>> +    } else {
>> +        gran_max_offset = max_offset;
>> +    }
>> +
>> +    ret = hbitmap_iter_next(&iter->hbi, false);
>> +    if (ret < 0 || (ret << BDRV_SECTOR_BITS) + granularity > 
>> gran_max_offset) {
>> +        return false;
>> +    }
>> +
>> +    *offset = ret << BDRV_SECTOR_BITS;
>> +    size = 0;
>> +
>> +    assert(granularity <= INT_MAX);
>> +
>> +    do {
>> +        /* Advance iterator */
>> +        ret = hbitmap_iter_next(&iter->hbi, true);
>> +        size += granularity;
>> +    } while ((ret << BDRV_SECTOR_BITS) + granularity <= 
>> gran_max_offset &&
>> +             hbitmap_iter_next(&iter->hbi, false) == ret + 
>> sector_gran &&
>> +             size <= INT_MAX - granularity);
>> +
>> +    *bytes = MIN(size, max_offset - *offset);
>> +    return true;
>> +}
>> +
>>   /* Called within bdrv_dirty_bitmap_lock..unlock */
>>   void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
>>                                     int64_t cur_sector, int64_t 
>> nr_sectors)
>
>


-- 
Best regards,
Vladimir



* Re: [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse
  2017-09-18 16:13     ` Max Reitz
@ 2017-10-09 18:30       ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-10-09 18:30 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, Kevin Wolf, Stefan Hajnoczi, John Snow

[-- Attachment #1: Type: text/plain, Size: 2974 bytes --]

On 2017-09-18 18:13, Max Reitz wrote:
> On 2017-09-18 05:44, Fam Zheng wrote:
>> On Wed, 09/13 20:18, Max Reitz wrote:
>>> Draining a BDS child may lead to the original BDS and/or its other
>>> children being deleted (e.g. if the original BDS represents a block
>>> job).  We should prepare for this in both bdrv_drain_recurse() and
>>> bdrv_drained_begin() by monitoring whether the BDS we are about to drain
>>> still exists at all.
>>
>> Can the deletion happen when IOThread calls
>> bdrv_drain_recurse/bdrv_drained_begin?
> 
> I don't think so, because (1) my issue was draining a block job and that
> can only be completed in the main loop, and (2) I would like to think
> it's always impossible, considering that bdrv_unref() may only be called
> with the BQL.
> 
>>                                         If not, is it enough to do
>>
>>     ...
>>     if (in_main_loop) {
>>         bdrv_ref(bs);
>>     }
>>     ...
>>     if (in_main_loop) {
>>         bdrv_unref(bs);
>>     }
>>
>> to protect the main loop case? So the BdrvDeletedStatus state is not needed.
> 
> We already have that in bdrv_drain_recurse(), don't we?
> 
> The issue here is, though, that QLIST_FOREACH_SAFE() stores the next
> child pointer to @tmp.  However, once the current child @child is
> drained, @tmp may no longer be valid -- it may have been detached from
> @bs, and it may even have been deleted.
> 
> We could work around the latter by increasing the next child's reference
> somehow (but BdrvChild doesn't really have a refcount, and in order to
> do so, we would probably have to emulate being a parent or
> something...), but then you'd still have the issue of @tmp being
> detached from the children list we're trying to iterate over.  So
> tmp->next is no longer valid.
> 
> Anyway, so the latter is the reason why I decided to introduce the bs_list.
> 
> But maybe that actually saves us from having to fiddle with BdrvChild...
>  Since it's just a list of BDSs now, it may be enough to simply
> bdrv_ref() all of the BDSs in that list before draining any of them.  So
>  we'd keep creating the bs_list and then we'd move the existing
> bdrv_ref() from the drain loop into the loop filling bs_list.
> 
> And adding a bdrv_ref()/bdrv_unref() pair to bdrv_drained_begin() should
> hopefully work there, too.

It turns out it isn't so simple after all... because bdrv_close()
invokes bdrv_drained_begin(). So we may end up with an endless recursion
here.

One way to fix this would be to skip the bdrv_drained_begin() in
bdrv_close() if this would result in such a recursion...  But any
solution that comes quickly to my mind would require another BDS field,
too -- just checking the quiesce_counter is probably not enough because
this might just indicate concurrent drainage that stops before
bdrv_close() wants it to stop.

So maybe BdrvDeletedStatus is the simplest solution after all...?

Max




* Re: [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse Max Reitz
  2017-09-18  3:44   ` Fam Zheng
@ 2017-10-10  8:36   ` Kevin Wolf
  2017-10-11 11:41     ` Max Reitz
  1 sibling, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-10-10  8:36 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Fam Zheng, Stefan Hajnoczi, John Snow

Am 13.09.2017 um 20:18 hat Max Reitz geschrieben:
> Draining a BDS child may lead to the original BDS and/or its other
> children being deleted (e.g. if the original BDS represents a block
> job).  We should prepare for this in both bdrv_drain_recurse() and
> bdrv_drained_begin() by monitoring whether the BDS we are about to drain
> still exists at all.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

How hard would it be to write a test case for this? qemu-iotests
probably isn't the right tool, but I feel a C unit test would be
possible.

> -    QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
> -        BlockDriverState *bs = child->bs;
> -        bool in_main_loop =
> -            qemu_get_current_aio_context() == qemu_get_aio_context();
> -        assert(bs->refcnt > 0);

Would it make sense to keep this assertion for the !deleted case?

> -        if (in_main_loop) {
> -            /* In case the recursive bdrv_drain_recurse processes a
> -             * block_job_defer_to_main_loop BH and modifies the graph,
> -             * let's hold a reference to bs until we are done.
> -             *
> -             * IOThread doesn't have such a BH, and it is not safe to call
> -             * bdrv_unref without BQL, so skip doing it there.
> -             */
> -            bdrv_ref(bs);
> -        }
> -        waited |= bdrv_drain_recurse(bs);
> -        if (in_main_loop) {
> -            bdrv_unref(bs);
> +    /* Draining children may result in other children being removed and maybe
> +     * even deleted, so copy the children list first */

Maybe it's just me, but I failed to understand this correctly at first.
How about "being removed from their parent" to clarify that it's not the
BDS that is removed, but just the reference?

Kevin


* Re: [Qemu-devel] [PATCH 05/18] block/mirror: Convert to coroutines
  2017-09-13 18:18 ` [Qemu-devel] [PATCH 05/18] block/mirror: Convert to coroutines Max Reitz
  2017-09-18  6:02   ` Fam Zheng
@ 2017-10-10  9:14   ` Kevin Wolf
  2017-10-11 11:43     ` Max Reitz
  1 sibling, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-10-10  9:14 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Fam Zheng, Stefan Hajnoczi, John Snow

Am 13.09.2017 um 20:18 hat Max Reitz geschrieben:
> In order to talk to the source BDS (and maybe in the future to the
> target BDS as well) directly, we need to convert our existing AIO
> requests into coroutine I/O requests.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Please follow through with it and add a few patches that turn it into
natural coroutine code rather than just any coroutine code. I know I did
the same kind of half-assed conversion in qed, but mirror is code that
is actually used and that people look at for more than just a bad
example.

You'll probably notice more things when you do this, but the obvious
things would be changing mirror_co_read() into a mirror_co_copy() with
the former callbacks inlined; keeping op on the stack instead of
mallocing it in mirror_perform() and freeing it deep inside the nested
functions that used to be callbacks; and probably also cleaning up the
random calls to aio_context_acquire/release() that will now appear in
the middle of the function.
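The shape being asked for can be sketched generically (Python stand-ins for illustration only, not the real mirror functions): the whole copy becomes one sequential coroutine, with the op living on its stack instead of being heap-allocated and freed in a former callback.

```python
import asyncio

# Hypothetical stand-ins for the source read and target write; the real
# mirror code talks to BDS nodes instead.
async def read_chunk(offset, length):
    return b"x" * length

async def write_chunk(offset, data):
    return len(data)

async def mirror_co_copy(offset, length):
    # 'op' lives on this coroutine's stack: no malloc in the caller, no
    # free buried in a nested former callback, just sequential awaits.
    op = {"offset": offset, "length": length}
    data = await read_chunk(op["offset"], op["length"])
    return await write_chunk(op["offset"], data)

assert asyncio.run(mirror_co_copy(0, 8)) == 8
```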

Anyway, that's for follow-up patches (though ideally in the same
series), so for this one you can have:

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

* Re: [Qemu-devel] [PATCH 08/18] block/mirror: Use source as a BdrvChild
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 08/18] block/mirror: Use source as a BdrvChild Max Reitz
@ 2017-10-10  9:27   ` Kevin Wolf
  2017-10-11 11:46     ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-10-10  9:27 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Fam Zheng, Stefan Hajnoczi, John Snow

Am 13.09.2017 um 20:19 hat Max Reitz geschrieben:
> With this, the mirror_top_bs is no longer just a technically required
> node in the BDS graph but actually represents the block job operation.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/mirror.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 2ece38094d..9df4157511 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -43,8 +43,8 @@ typedef struct MirrorBlockJob {
>      RateLimit limit;
>      BlockBackend *target;
>      BlockDriverState *mirror_top_bs;
> -    BlockDriverState *source;
>      BlockDriverState *base;
> +    BdrvChild *source;

Is it actually useful to store source separately when we already have
mirror_top_bs->backing?

Kevin

* Re: [Qemu-devel] [PATCH 10/18] block/mirror: Make source the file child
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 10/18] block/mirror: Make source the file child Max Reitz
@ 2017-10-10  9:47   ` Kevin Wolf
  2017-10-11 12:02     ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-10-10  9:47 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Fam Zheng, Stefan Hajnoczi, John Snow

Am 13.09.2017 um 20:19 hat Max Reitz geschrieben:
> Regarding the source BDS, the mirror BDS is arguably a filter node.
> Therefore, the source BDS should be its "file" child.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

TODO: Justification why this doesn't break things like
bdrv_is_allocated_above() that iterate through the backing chain.

>  block/mirror.c             | 127 ++++++++++++++++++++++++++++++++++-----------
>  block/qapi.c               |  25 ++++++---
>  tests/qemu-iotests/141.out |   4 +-
>  3 files changed, 119 insertions(+), 37 deletions(-)
> 
> diff --git a/block/qapi.c b/block/qapi.c
> index 7fa2437923..ee792d0cbc 100644
> --- a/block/qapi.c
> +++ b/block/qapi.c
> @@ -147,9 +147,13 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
>  
>          /* Skip automatically inserted nodes that the user isn't aware of for
>           * query-block (blk != NULL), but not for query-named-block-nodes */
> -        while (blk && bs0->drv && bs0->implicit) {
> -            bs0 = backing_bs(bs0);
> -            assert(bs0);
> +        while (blk && bs0 && bs0->drv && bs0->implicit) {
> +            if (bs0->backing) {
> +                bs0 = backing_bs(bs0);
> +            } else {
> +                assert(bs0->file);
> +                bs0 = bs0->file->bs;
> +            }
>          }
>      }

Maybe backing_bs() should skip filters? If so, need to show that all
existing users of backing_bs() really want to skip filters. If not,
explain why the missing backing_bs() callers don't need it (I'm quite
sure that some do).

Also, if we attack this at the backing_bs() level, we need to audit
code that it doesn't directly use bs->backing.

> @@ -1135,44 +1146,88 @@ static const BlockJobDriver commit_active_job_driver = {
>      .drain                  = mirror_drain,
>  };
>  
> +static void source_child_inherit_fmt_options(int *child_flags,
> +                                             QDict *child_options,
> +                                             int parent_flags,
> +                                             QDict *parent_options)
> +{
> +    child_backing.inherit_options(child_flags, child_options,
> +                                  parent_flags, parent_options);
> +}
> +
> +static char *source_child_get_parent_desc(BdrvChild *c)
> +{
> +    return child_backing.get_parent_desc(c);
> +}
> +
> +static void source_child_cb_drained_begin(BdrvChild *c)
> +{
> +    BlockDriverState *bs = c->opaque;
> +    MirrorBDSOpaque *s = bs->opaque;
> +
> +    if (s && s->job) {
> +        block_job_drained_begin(&s->job->common);
> +    }
> +    bdrv_drained_begin(bs);
> +}
> +
> +static void source_child_cb_drained_end(BdrvChild *c)
> +{
> +    BlockDriverState *bs = c->opaque;
> +    MirrorBDSOpaque *s = bs->opaque;
> +
> +    if (s && s->job) {
> +        block_job_drained_end(&s->job->common);
> +    }
> +    bdrv_drained_end(bs);
> +}
> +
> +static BdrvChildRole source_child_role = {
> +    .inherit_options    = source_child_inherit_fmt_options,
> +    .get_parent_desc    = source_child_get_parent_desc,
> +    .drained_begin      = source_child_cb_drained_begin,
> +    .drained_end        = source_child_cb_drained_end,
> +};

Wouldn't it make much more sense to use a standard child role and just
implement BlockDriver callbacks for .bdrv_drained_begin/end? It seems
that master still only has .bdrv_co_drain (which is begin), but one of
Manos' pending series adds the missing end callback.

Kevin

* Re: [Qemu-devel] [PATCH 13/18] block/mirror: Keep write perm for pending writes
  2017-09-13 18:19 ` [Qemu-devel] [PATCH 13/18] block/mirror: Keep write perm for pending writes Max Reitz
@ 2017-10-10  9:58   ` Kevin Wolf
  2017-10-11 12:20     ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-10-10  9:58 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-block, qemu-devel, Fam Zheng, Stefan Hajnoczi, John Snow

Am 13.09.2017 um 20:19 hat Max Reitz geschrieben:
> The owner of the mirror BDS might retire its write permission; but there
> may still be pending mirror operations so the mirror BDS cannot
> necessarily retire its write permission for its child then.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

I'm confused. The child of mirror_top_bs is the source, but don't mirror
operations only write to the target?

Kevin

* Re: [Qemu-devel] [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring
  2017-09-18 16:26         ` Max Reitz
  2017-09-19  9:44           ` Stefan Hajnoczi
@ 2017-10-10 10:16           ` Kevin Wolf
  2017-10-11 12:33             ` Max Reitz
  1 sibling, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-10-10 10:16 UTC (permalink / raw)
  To: Max Reitz
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, Fam Zheng, qemu-devel, qemu-block

Am 18.09.2017 um 18:26 hat Max Reitz geschrieben:
> On 2017-09-18 12:06, Stefan Hajnoczi wrote:
> > On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
> >> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
> >>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
> >>>> This patch implements active synchronous mirroring.  In active mode, the
> >>>> passive mechanism will still be in place and is used to copy all
> >>>> initially dirty clusters off the source disk; but every write request
> >>>> will write data both to the source and the target disk, so the source
> >>>> cannot be dirtied faster than data is mirrored to the target.  Also,
> >>>> once the block job has converged (BLOCK_JOB_READY sent), source and
> >>>> target are guaranteed to stay in sync (unless an error occurs).
> >>>>
> >>>> Optionally, dirty data can be copied to the target disk on read
> >>>> operations, too.
> >>>>
> >>>> Active mode is completely optional and currently disabled at runtime.  A
> >>>> later patch will add a way for users to enable it.
> >>>>
> >>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>>> ---
> >>>>  qapi/block-core.json |  23 +++++++
> >>>>  block/mirror.c       | 187 +++++++++++++++++++++++++++++++++++++++++++++++++--
> >>>>  2 files changed, 205 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
> >>>> index bb11815608..e072cfa67c 100644
> >>>> --- a/qapi/block-core.json
> >>>> +++ b/qapi/block-core.json
> >>>> @@ -938,6 +938,29 @@
> >>>>    'data': ['top', 'full', 'none', 'incremental'] }
> >>>>  
> >>>>  ##
> >>>> +# @MirrorCopyMode:
> >>>> +#
> >>>> +# An enumeration whose values tell the mirror block job when to
> >>>> +# trigger writes to the target.
> >>>> +#
> >>>> +# @passive: copy data in background only.
> >>>> +#
> >>>> +# @active-write: when data is written to the source, write it
> >>>> +#                (synchronously) to the target as well.  In addition,
> >>>> +#                data is copied in background just like in @passive
> >>>> +#                mode.
> >>>> +#
> >>>> +# @active-read-write: write data to the target (synchronously) both
> >>>> +#                     when it is read from and written to the source.
> >>>> +#                     In addition, data is copied in background just
> >>>> +#                     like in @passive mode.
> >>>
> >>> I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
> >>> means committing the top-most BDS while the guest is accessing it.  The
> >>> "passive" mirror block still works on the top-most BDS while the guest
> >>> is accessing it.
> >>>
> >>> Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
> >>> the terminology used in disk replication (e.g. DRBD).
> >>
> >> I'd be OK with that, too, but I think I remember that in the past at
> >> least Kevin made a clear distinction between active/passive and
> >> sync/async when it comes to mirroring.
> >>
> >>> Ideally the user wouldn't have to worry about async vs sync because QEMU
> >>> would switch modes as appropriate in order to converge.  That way
> >>> libvirt also doesn't have to worry about this.
> >>
> >> So here you mean async/sync in the way I meant it, i.e., whether the
> >> mirror operations themselves are async/sync?
> > 
> > The meaning I had in mind is:
> > 
> > Sync mirroring means a guest write waits until the target write
> > completes.
> 
> I.e. active-sync, ...
> 
> > Async mirroring means guest writes completes independently of target
> > writes.
> 
> ... i.e. passive or active-async in the future.

So we already have at least three different modes, sync/async doesn't
quite cut it anyway. There's a reason why we have been talking about
both active/passive and sync/async.

When I was looking at the code, it actually occurred to me that there
are more possible different modes than I thought there were: This patch
waits for successful completion on the source before it even attempts to
write to the destination.

Wouldn't it be generally (i.e. in the success case) more useful if we
start both requests at the same time and only wait for both to complete,
avoiding doubling the latency? If the source write fails, we're out of
sync, obviously, so we'd have to mark the block dirty again.

By the way, what happens when the guest modifies the RAM during the
request? Is it acceptable even for writes if source and target differ
after a successful write operation? Don't we need a bounce buffer
anyway?

Kevin


* Re: [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse
  2017-10-10  8:36   ` Kevin Wolf
@ 2017-10-11 11:41     ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-10-11 11:41 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel, Fam Zheng, Stefan Hajnoczi, John Snow

On 2017-10-10 10:36, Kevin Wolf wrote:
> Am 13.09.2017 um 20:18 hat Max Reitz geschrieben:
>> Draining a BDS child may lead to the original BDS and/or its other
>> children being deleted (e.g. if the original BDS represents a block
>> job).  We should prepare for this in both bdrv_drain_recurse() and
>> bdrv_drained_begin() by monitoring whether the BDS we are about to drain
>> still exists at all.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> 
> How hard would it be to write a test case for this? qemu-iotests
> probably isn't the right tool, but I feel a C unit test would be
> possible.

I can look into it, but I can't promise anything.

>> -    QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
>> -        BlockDriverState *bs = child->bs;
>> -        bool in_main_loop =
>> -            qemu_get_current_aio_context() == qemu_get_aio_context();
>> -        assert(bs->refcnt > 0);
> 
> Would it make sense to keep this assertion for the !deleted case?

Sure, why not.

>> -        if (in_main_loop) {
>> -            /* In case the recursive bdrv_drain_recurse processes a
>> -             * block_job_defer_to_main_loop BH and modifies the graph,
>> -             * let's hold a reference to bs until we are done.
>> -             *
>> -             * IOThread doesn't have such a BH, and it is not safe to call
>> -             * bdrv_unref without BQL, so skip doing it there.
>> -             */
>> -            bdrv_ref(bs);
>> -        }
>> -        waited |= bdrv_drain_recurse(bs);
>> -        if (in_main_loop) {
>> -            bdrv_unref(bs);
>> +    /* Draining children may result in other children being removed and maybe
>> +     * even deleted, so copy the children list first */
> 
> Maybe it's just me, but I failed to understand this correctly at first.
> How about "being removed from their parent" to clarify that it's not the
> BDS that is removed, but just the reference?

Well, it's the BdrvChild that's removed; that's what I meant by
"children".  But then the comment speaks of a "children list" and means
creation of a list of BDSs, sooo...  Yes, some change is necessary.

Max



* Re: [Qemu-devel] [PATCH 05/18] block/mirror: Convert to coroutines
  2017-10-10  9:14   ` Kevin Wolf
@ 2017-10-11 11:43     ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-10-11 11:43 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel, Fam Zheng, Stefan Hajnoczi, John Snow

On 2017-10-10 11:14, Kevin Wolf wrote:
> Am 13.09.2017 um 20:18 hat Max Reitz geschrieben:
>> In order to talk to the source BDS (and maybe in the future to the
>> target BDS as well) directly, we need to convert our existing AIO
>> requests into coroutine I/O requests.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> 
> Please follow through with it and add a few patches that turn it into
> natural coroutine code rather than just any coroutine code. I know I did
> the same kind of half-assed conversion in qed, but mirror is code that
> is actually used and that people look at for more than just a bad
> example.
> 
> You'll probably notice more things when you do this, but the obvious
> things would be changing mirror_co_read() into a mirror_co_copy() with
> the former callbacks inlined; keeping op on the stack instead of
> mallocing it in mirror_perform() and freeing it deep inside the nested
> functions that used to be callbacks; and probably also cleaning up the
> random calls to aio_context_acquire/release() that will now appear in
> the middle of the function.
> 
> Anyway, that's for follow-up patches (though ideally in the same
> series), so for this one you can have:
> 
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>

Phew. :-)

I think I'll write the patches (while working on v2), but I'll send them
as a follow-up.

Max



* Re: [Qemu-devel] [PATCH 08/18] block/mirror: Use source as a BdrvChild
  2017-10-10  9:27   ` Kevin Wolf
@ 2017-10-11 11:46     ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-10-11 11:46 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel, Fam Zheng, Stefan Hajnoczi, John Snow

On 2017-10-10 11:27, Kevin Wolf wrote:
> Am 13.09.2017 um 20:19 hat Max Reitz geschrieben:
>> With this, the mirror_top_bs is no longer just a technically required
>> node in the BDS graph but actually represents the block job operation.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  block/mirror.c | 18 ++++++++++--------
>>  1 file changed, 10 insertions(+), 8 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 2ece38094d..9df4157511 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -43,8 +43,8 @@ typedef struct MirrorBlockJob {
>>      RateLimit limit;
>>      BlockBackend *target;
>>      BlockDriverState *mirror_top_bs;
>> -    BlockDriverState *source;
>>      BlockDriverState *base;
>> +    BdrvChild *source;
> 
> Is it actually useful to store source separately when we already have
> mirror_top_bs->backing?

I'll take a look at whether it is, and if it isn't, I'll probably add that
as a separate patch (to keep this one as simple as it is).

I imagine it is not, right.

Max



* Re: [Qemu-devel] [PATCH 10/18] block/mirror: Make source the file child
  2017-10-10  9:47   ` Kevin Wolf
@ 2017-10-11 12:02     ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-10-11 12:02 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel, Fam Zheng, Stefan Hajnoczi, John Snow

On 2017-10-10 11:47, Kevin Wolf wrote:
> Am 13.09.2017 um 20:19 hat Max Reitz geschrieben:
>> Regarding the source BDS, the mirror BDS is arguably a filter node.
>> Therefore, the source BDS should be its "file" child.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> 
> TODO: Justification why this doesn't break things like
> bdrv_is_allocated_above() that iterate through the backing chain.

Er, well, yes.

>>  block/mirror.c             | 127 ++++++++++++++++++++++++++++++++++-----------
>>  block/qapi.c               |  25 ++++++---
>>  tests/qemu-iotests/141.out |   4 +-
>>  3 files changed, 119 insertions(+), 37 deletions(-)
>>
>> diff --git a/block/qapi.c b/block/qapi.c
>> index 7fa2437923..ee792d0cbc 100644
>> --- a/block/qapi.c
>> +++ b/block/qapi.c
>> @@ -147,9 +147,13 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
>>  
>>          /* Skip automatically inserted nodes that the user isn't aware of for
>>           * query-block (blk != NULL), but not for query-named-block-nodes */
>> -        while (blk && bs0->drv && bs0->implicit) {
>> -            bs0 = backing_bs(bs0);
>> -            assert(bs0);
>> +        while (blk && bs0 && bs0->drv && bs0->implicit) {
>> +            if (bs0->backing) {
>> +                bs0 = backing_bs(bs0);
>> +            } else {
>> +                assert(bs0->file);
>> +                bs0 = bs0->file->bs;
>> +            }
>>          }
>>      }
> 
> Maybe backing_bs() should skip filters? If so, need to show that all
> existing users of backing_bs() really want to skip filters. If not,
> explain why the missing backing_bs() callers don't need it (I'm quite
> sure that some do).

Arguably any BDS is a filter BDS regarding its backing BDS.  So maybe I
could add a new function filter_child_bs().

> Also, if we attack this at the backing_bs() level, we need to audit
> code that it doesn't directly use bs->backing.

Yup.

Maybe the best idea is to leave this patch for a follow-up...
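A possible shape for that helper, modeled loosely in Python (all names hypothetical, not the QEMU API): skip implicit nodes through whichever child actually carries the data, mirroring the backing-else-file logic in the block/qapi.c hunk quoted above.

```python
# Toy model of BDS nodes with 'backing' and 'file' children.
class BDS:
    def __init__(self, name, implicit=False, backing=None, file=None):
        self.name, self.implicit = name, implicit
        self.backing, self.file = backing, file

def filter_child_bs(bs):
    # For a filter node, the data-carrying child: backing if present,
    # otherwise file.
    return bs.backing if bs.backing is not None else bs.file

def skip_implicit_nodes(bs):
    # Descend past implicit (automatically inserted) nodes that the
    # user isn't aware of.
    while bs is not None and bs.implicit:
        bs = filter_child_bs(bs)
    return bs

base = BDS("base")
mirror_top = BDS("mirror_top", implicit=True, file=base)  # filter via 'file'
assert skip_implicit_nodes(mirror_top) is base
```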

>> @@ -1135,44 +1146,88 @@ static const BlockJobDriver commit_active_job_driver = {
>>      .drain                  = mirror_drain,
>>  };
>>  
>> +static void source_child_inherit_fmt_options(int *child_flags,
>> +                                             QDict *child_options,
>> +                                             int parent_flags,
>> +                                             QDict *parent_options)
>> +{
>> +    child_backing.inherit_options(child_flags, child_options,
>> +                                  parent_flags, parent_options);
>> +}
>> +
>> +static char *source_child_get_parent_desc(BdrvChild *c)
>> +{
>> +    return child_backing.get_parent_desc(c);
>> +}
>> +
>> +static void source_child_cb_drained_begin(BdrvChild *c)
>> +{
>> +    BlockDriverState *bs = c->opaque;
>> +    MirrorBDSOpaque *s = bs->opaque;
>> +
>> +    if (s && s->job) {
>> +        block_job_drained_begin(&s->job->common);
>> +    }
>> +    bdrv_drained_begin(bs);
>> +}
>> +
>> +static void source_child_cb_drained_end(BdrvChild *c)
>> +{
>> +    BlockDriverState *bs = c->opaque;
>> +    MirrorBDSOpaque *s = bs->opaque;
>> +
>> +    if (s && s->job) {
>> +        block_job_drained_end(&s->job->common);
>> +    }
>> +    bdrv_drained_end(bs);
>> +}
>> +
>> +static BdrvChildRole source_child_role = {
>> +    .inherit_options    = source_child_inherit_fmt_options,
>> +    .get_parent_desc    = source_child_get_parent_desc,
>> +    .drained_begin      = source_child_cb_drained_begin,
>> +    .drained_end        = source_child_cb_drained_end,
>> +};
> 
> Wouldn't it make much more sense to use a standard child role and just
> implement BlockDriver callbacks for .bdrv_drained_begin/end? It seems
> that master still only has .bdrv_co_drain (which is begin), but one of
> Manos' pending series adds the missing end callback.

OK then. :-)

Max



* Re: [Qemu-devel] [PATCH 13/18] block/mirror: Keep write perm for pending writes
  2017-10-10  9:58   ` Kevin Wolf
@ 2017-10-11 12:20     ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-10-11 12:20 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel, Fam Zheng, Stefan Hajnoczi, John Snow

On 2017-10-10 11:58, Kevin Wolf wrote:
> Am 13.09.2017 um 20:19 hat Max Reitz geschrieben:
>> The owner of the mirror BDS might retire its write permission; but there
>> may still be pending mirror operations so the mirror BDS cannot
>> necessarily retire its write permission for its child then.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> 
> I'm confused. The child of mirror_top_bs is the source, but don't mirror
> operations only write to the target?

I do know that the iotest added at the end fails without this patch, if
that helps. :-)

Right, for some reason I never thought about this...  OK, so the issue
is that if you create a BB, submit a request and then delete it (the
only BB), permissions requirements for the mirror BDS are dropped and
then it in turn also drops its permissions on the source.

The issue now occurs whenever the BB is deleted before the write request
checks the permissions on the source.  In passive mode, this does not
happen because nothing yields before the permission check.

In active mode, however, active_write_prepare() may yield due to
mirror_wait_on_conflicts().  Since active_write_prepare() also creates
an operation (before yielding), this patch "fixes" the issue.

I think the real bug fix would be to also have a counter of (write)
operations running on the source (incremented/decremented in
bdrv_mirror_top_pwritev()) and to evaluate that in
bdrv_mirror_top_child_perm().
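That proposed fix can be sketched as follows (a hypothetical Python model, not the QEMU permission API): each write bumps a counter before any yield point, and the child-permission function keeps WRITE as long as the counter is non-zero, even after the owner retires its permission.

```python
WRITE = 1

class MirrorTop:
    def __init__(self):
        self.in_flight_writes = 0   # bumped in pwritev, before yielding
        self.owner_perms = WRITE    # what the parent currently requests

    def begin_write(self):
        self.in_flight_writes += 1

    def end_write(self):
        self.in_flight_writes -= 1

    def child_perm(self):
        # Keep write permission on the source child while the owner
        # still wants it, or while writes are still pending.
        perms = self.owner_perms
        if self.in_flight_writes > 0:
            perms |= WRITE
        return perms

top = MirrorTop()
top.begin_write()        # a write request is in flight...
top.owner_perms = 0      # ...and the owner retires its write permission
assert top.child_perm() & WRITE      # still held for the pending write
top.end_write()
assert not (top.child_perm() & WRITE)
```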

Max



* Re: [Qemu-devel] [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring
  2017-10-10 10:16           ` Kevin Wolf
@ 2017-10-11 12:33             ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-10-11 12:33 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, Fam Zheng, qemu-devel, qemu-block

On 2017-10-10 12:16, Kevin Wolf wrote:
> Am 18.09.2017 um 18:26 hat Max Reitz geschrieben:
>> On 2017-09-18 12:06, Stefan Hajnoczi wrote:
>>> On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
>>>> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
>>>>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
>>>>>> This patch implements active synchronous mirroring.  In active mode, the
>>>>>> passive mechanism will still be in place and is used to copy all
>>>>>> initially dirty clusters off the source disk; but every write request
>>>>>> will write data both to the source and the target disk, so the source
>>>>>> cannot be dirtied faster than data is mirrored to the target.  Also,
>>>>>> once the block job has converged (BLOCK_JOB_READY sent), source and
>>>>>> target are guaranteed to stay in sync (unless an error occurs).
>>>>>>
>>>>>> Optionally, dirty data can be copied to the target disk on read
>>>>>> operations, too.
>>>>>>
>>>>>> Active mode is completely optional and currently disabled at runtime.  A
>>>>>> later patch will add a way for users to enable it.
>>>>>>
>>>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>>>> ---
>>>>>>  qapi/block-core.json |  23 +++++++
>>>>>>  block/mirror.c       | 187 +++++++++++++++++++++++++++++++++++++++++++++++++--
>>>>>>  2 files changed, 205 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>>>>> index bb11815608..e072cfa67c 100644
>>>>>> --- a/qapi/block-core.json
>>>>>> +++ b/qapi/block-core.json
>>>>>> @@ -938,6 +938,29 @@
>>>>>>    'data': ['top', 'full', 'none', 'incremental'] }
>>>>>>  
>>>>>>  ##
>>>>>> +# @MirrorCopyMode:
>>>>>> +#
>>>>>> +# An enumeration whose values tell the mirror block job when to
>>>>>> +# trigger writes to the target.
>>>>>> +#
>>>>>> +# @passive: copy data in background only.
>>>>>> +#
>>>>>> +# @active-write: when data is written to the source, write it
>>>>>> +#                (synchronously) to the target as well.  In addition,
>>>>>> +#                data is copied in background just like in @passive
>>>>>> +#                mode.
>>>>>> +#
>>>>>> +# @active-read-write: write data to the target (synchronously) both
>>>>>> +#                     when it is read from and written to the source.
>>>>>> +#                     In addition, data is copied in background just
>>>>>> +#                     like in @passive mode.
>>>>>
>>>>> I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
>>>>> means committing the top-most BDS while the guest is accessing it.  The
>>>>> "passive" mirror block still works on the top-most BDS while the guest
>>>>> is accessing it.
>>>>>
>>>>> Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
>>>>> the terminology used in disk replication (e.g. DRBD).
>>>>
>>>> I'd be OK with that, too, but I think I remember that in the past at
>>>> least Kevin made a clear distinction between active/passive and
>>>> sync/async when it comes to mirroring.
>>>>
>>>>> Ideally the user wouldn't have to worry about async vs sync because QEMU
>>>>> would switch modes as appropriate in order to converge.  That way
>>>>> libvirt also doesn't have to worry about this.
>>>>
>>>> So here you mean async/sync in the way I meant it, i.e., whether the
>>>> mirror operations themselves are async/sync?
>>>
>>> The meaning I had in mind is:
>>>
>>> Sync mirroring means a guest write waits until the target write
>>> completes.
>>
>> I.e. active-sync, ...
>>
>>> Async mirroring means guest writes completes independently of target
>>> writes.
>>
>> ... i.e. passive or active-async in the future.
> 
> So we already have at least three different modes, sync/async doesn't
> quite cut it anyway. There's a reason why we have been talking about
> both active/passive and sync/async.
> 
> When I was looking at the code, it actually occurred to me that there
> are more possible different modes than I thought there were: This patch
> waits for successful completion on the source before it even attempts to
> write to the destination.
> 
> Wouldn't it be generally (i.e. in the success case) more useful if we
> start both requests at the same time and only wait for both to complete,
> avoiding doubling the latency? If the source write fails, we're out of
> sync, obviously, so we'd have to mark the block dirty again.

I've thought about it, but my issues were:

(1) What to do when something fails
and
(2) I didn't really want to start coroutines from coroutines...

As for (1)...  My notes actually say I've come to a conclusion: If the
target write fails, that's pretty much OK, because then the source is
newer than the target, which is normal for mirroring.  If the source
write fails, we can just consider the target outdated, too (as you've
said).  Also, we'll give an error to the guest, so it's clear that
something has gone wrong.

So (2) was the reason I didn't do it in this series.  I think it's OK to
add this later on and let future me worry about how to coordinate both
requests.

I guess I'd start e.g. the target operation as a new coroutine, then
continue the source operation in the original one, and finally yield
until the target operation has finished?
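That coordination could look roughly like this (a sketch in Python's asyncio rather than QEMU coroutines; all names are hypothetical): start both writes, wait for both to complete, and re-mark the block dirty only when the source succeeded while the target failed.

```python
import asyncio

async def write_source(buf):
    return len(buf)                        # stand-in for the source write

async def write_target(buf):
    raise IOError("target write failed")   # simulate a target error

async def mirrored_write(buf, dirty_blocks, block):
    # Issue both writes concurrently; collect errors instead of raising.
    src, tgt = await asyncio.gather(write_source(buf), write_target(buf),
                                    return_exceptions=True)
    if isinstance(src, Exception):
        # Source failed: guest sees the error, target counts as outdated.
        raise src
    if isinstance(tgt, Exception):
        dirty_blocks.add(block)            # out of sync: copy again later
    return src

dirty = set()
assert asyncio.run(mirrored_write(b"data", dirty, 0)) == 4
assert dirty == {0}                        # block re-marked dirty
```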

> By the way, what happens when the guest modifies the RAM during the
> request? Is it acceptable even for writes if source and target differ
> after a successful write operation? Don't we need a bounce buffer
> anyway?

Sometimes I think that maybe I shouldn't keep my thoughts to myself
after I've come to the conclusion "...naah, it's all bad anyway". :-)

When Stefan mentioned this for reads, I thought about the write
situation, yes.  My conclusion was that the guest would be required (by
protocol) to keep the write buffer constant while the operation is
running, because otherwise the guest has no idea what is going to be on
disk.  So it would be stupid for the guest to modify the write buffer then.

But (1) depending on the emulated hardware, maybe the guest does have an
idea (e.g. some register that tells the guest which offset is currently
written) -- but with the structure of the block layer, I doubt that's
possible in qemu,

and (2) maybe the guest wants to be stupid.  Even if the guest doesn't
know what will end up on disk, we have to make sure that it's the same
on both source and target.

So, yeah, a bounce buffer would be good in all cases.
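The hazard can be demonstrated with a toy model (plain Python, nothing QEMU-specific): without a bounce buffer, a guest that mutates its buffer between the two writes leaves source and target different; copying the data once up front keeps them identical regardless.

```python
def mirrored_write_unsafe(guest_buf, source, target):
    source.append(bytes(guest_buf))
    guest_buf[0] ^= 0xFF            # guest modifies its RAM mid-request
    target.append(bytes(guest_buf))

def mirrored_write_bounced(guest_buf, source, target):
    bounce = bytes(guest_buf)       # snapshot once, before either write
    source.append(bounce)
    guest_buf[0] ^= 0xFF            # guest interference no longer matters
    target.append(bounce)

src, tgt = [], []
mirrored_write_unsafe(bytearray(b"\x00data"), src, tgt)
assert src[-1] != tgt[-1]           # source and target have diverged

src, tgt = [], []
mirrored_write_bounced(bytearray(b"\x00data"), src, tgt)
assert src[-1] == tgt[-1]           # identical despite the modification
```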

Max



end of thread, other threads:[~2017-10-11 12:34 UTC | newest]

Thread overview: 64+ messages
2017-09-13 18:18 [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Max Reitz
2017-09-13 18:18 ` [Qemu-devel] [PATCH 01/18] block: Add BdrvDeletedStatus Max Reitz
2017-09-13 18:18 ` [Qemu-devel] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse Max Reitz
2017-09-18  3:44   ` Fam Zheng
2017-09-18 16:13     ` Max Reitz
2017-10-09 18:30       ` Max Reitz
2017-10-10  8:36   ` Kevin Wolf
2017-10-11 11:41     ` Max Reitz
2017-09-13 18:18 ` [Qemu-devel] [PATCH 03/18] blockjob: Make drained_{begin, end} public Max Reitz
2017-09-18  3:46   ` Fam Zheng
2017-09-13 18:18 ` [Qemu-devel] [PATCH 04/18] block/mirror: Pull out mirror_perform() Max Reitz
2017-09-18  3:48   ` Fam Zheng
2017-09-25  9:38   ` Vladimir Sementsov-Ogievskiy
2017-09-13 18:18 ` [Qemu-devel] [PATCH 05/18] block/mirror: Convert to coroutines Max Reitz
2017-09-18  6:02   ` Fam Zheng
2017-09-18 16:41     ` Max Reitz
2017-10-10  9:14   ` Kevin Wolf
2017-10-11 11:43     ` Max Reitz
2017-09-13 18:18 ` [Qemu-devel] [PATCH 06/18] block/mirror: Use CoQueue to wait on in-flight ops Max Reitz
2017-09-13 18:18 ` [Qemu-devel] [PATCH 07/18] block/mirror: Wait for in-flight op conflicts Max Reitz
2017-09-13 18:19 ` [Qemu-devel] [PATCH 08/18] block/mirror: Use source as a BdrvChild Max Reitz
2017-10-10  9:27   ` Kevin Wolf
2017-10-11 11:46     ` Max Reitz
2017-09-13 18:19 ` [Qemu-devel] [PATCH 09/18] block: Generalize should_update_child() rule Max Reitz
2017-09-13 18:19 ` [Qemu-devel] [PATCH 10/18] block/mirror: Make source the file child Max Reitz
2017-10-10  9:47   ` Kevin Wolf
2017-10-11 12:02     ` Max Reitz
2017-09-13 18:19 ` [Qemu-devel] [PATCH 11/18] hbitmap: Add @advance param to hbitmap_iter_next() Max Reitz
2017-09-25 15:38   ` Vladimir Sementsov-Ogievskiy
2017-09-25 20:40     ` Max Reitz
2017-09-13 18:19 ` [Qemu-devel] [PATCH 12/18] block/dirty-bitmap: Add bdrv_dirty_iter_next_area Max Reitz
2017-09-25 15:49   ` Vladimir Sementsov-Ogievskiy
2017-09-25 20:43     ` Max Reitz
2017-10-02 13:32     ` Vladimir Sementsov-Ogievskiy
2017-09-13 18:19 ` [Qemu-devel] [PATCH 13/18] block/mirror: Keep write perm for pending writes Max Reitz
2017-10-10  9:58   ` Kevin Wolf
2017-10-11 12:20     ` Max Reitz
2017-09-13 18:19 ` [Qemu-devel] [PATCH 14/18] block/mirror: Distinguish active from passive ops Max Reitz
2017-09-13 18:19 ` [Qemu-devel] [PATCH 15/18] block/mirror: Add active mirroring Max Reitz
2017-09-14 15:57   ` Stefan Hajnoczi
2017-09-16 13:58     ` Max Reitz
2017-09-18 10:06       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2017-09-18 16:26         ` Max Reitz
2017-09-19  9:44           ` Stefan Hajnoczi
2017-09-19  9:57             ` Daniel P. Berrange
2017-09-20 14:56               ` Stefan Hajnoczi
2017-10-10 10:16           ` Kevin Wolf
2017-10-11 12:33             ` Max Reitz
2017-09-13 18:19 ` [Qemu-devel] [PATCH 16/18] block/mirror: Add copy mode QAPI interface Max Reitz
2017-09-13 18:19 ` [Qemu-devel] [PATCH 17/18] qemu-io: Add background write Max Reitz
2017-09-18  6:46   ` Fam Zheng
2017-09-18 17:53     ` Max Reitz
2017-09-19  8:03       ` Fam Zheng
2017-09-21 14:40         ` Max Reitz
2017-09-21 14:59           ` Fam Zheng
2017-09-21 15:03             ` Max Reitz
2017-09-13 18:19 ` [Qemu-devel] [PATCH 18/18] iotests: Add test for active mirroring Max Reitz
2017-09-18  6:45   ` Fam Zheng
2017-09-18 16:53     ` Max Reitz
2017-09-19  8:08       ` Fam Zheng
2017-09-14 15:42 ` [Qemu-devel] [PATCH 00/18] block/mirror: Add active-sync mirroring Stefan Hajnoczi
2017-09-16 14:02   ` Max Reitz
2017-09-18 10:02     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2017-09-18 15:42       ` Max Reitz
