* [Qemu-devel] [PATCH 0/9] major rework of drive-mirror
@ 2016-06-14 15:25 Denis V. Lunev
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 1/9] mirror: fix calling of blk_aio_pwritev/blk_aio_preadv Denis V. Lunev
                   ` (11 more replies)
  0 siblings, 12 replies; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-14 15:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: vsementsov, Denis V. Lunev, Stefan Hajnoczi, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake

Block commit of the active image to the backing store on a slow disk
may never complete. For example, with a guest running the following
loop inside:
    while true; do
        dd bs=1k count=1 if=/dev/zero of=x
    done
on top of slow storage, the operation cannot complete within a
reasonable amount of time:
    virsh blockcommit rhel7 sda --active --shallow
    virsh qemu-monitor-event
    virsh qemu-monitor-command rhel7 \
        '{"execute":"block-job-complete",\
          "arguments":{"device":"drive-scsi0-0-0-0"} }'
    virsh qemu-monitor-event
The completion event is never received.

This problem cannot be fixed easily within the current architecture. We
should either prohibit guest writes (which re-dirty the bitmap) or switch
to a synchronous scheme.

This series switches drive-mirror to the synchronous scheme. We could be
more intelligent and switch to synchronous mirroring only after the first
pass over the bitmap; that refinement can be worked out easily during the
discussion. The most difficult parts are here.

The set also adds some performance improvements for handling
known-to-be-zero areas.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>

Denis V. Lunev (9):
  mirror: fix calling of blk_aio_pwritev/blk_aio_preadv
  mirror: create mirror_dirty_init helper for mirror_run
  mirror: optimize dirty bitmap filling in mirror_run a bit
  mirror: efficiently zero out target
  mirror: improve performance of mirroring of empty disk
  block: pass qiov into before_write notifier
  mirror: allow to save buffer for QEMUIOVector in MirrorOp
  mirror: use synch scheme for drive mirror
  mirror: replace bdrv_dirty_bitmap with plain hbitmap

 block/io.c                |  12 +-
 block/mirror.c            | 290 +++++++++++++++++++++++++++++++++++-----------
 include/block/block_int.h |   1 +
 3 files changed, 229 insertions(+), 74 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH 1/9] mirror: fix calling of blk_aio_pwritev/blk_aio_preadv
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
@ 2016-06-14 15:25 ` Denis V. Lunev
  2016-06-14 22:48   ` Eric Blake
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 2/9] mirror: create mirror_dirty_init helper for mirror_run Denis V. Lunev
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-14 15:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: vsementsov, Denis V. Lunev, Stefan Hajnoczi, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake

The 4th argument is flags rather than size. Fortunately, the flags occupy
only the 5 least significant bits, and those bits are always zero in the
size value due to alignment.
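
For reference, the prototype being called looks roughly like this (the
exact signature is quoted from memory and may differ slightly in this
tree):

    /* prototype as assumed here; check include/sysemu/block-backend.h */
    BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
                                QEMUIOVector *qiov, int flags,
                                BlockCompletionFunc *cb, void *opaque);

Since the byte count mistakenly passed as the 4th argument is a multiple
of BDRV_SECTOR_SIZE (512), its low 9 bits are zero, and all defined
request flags fit in those low bits, so the bogus value happened to
behave exactly like flags == 0.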

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 80fd3c7..3760e29 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -156,8 +156,7 @@ static void mirror_read_complete(void *opaque, int ret)
         mirror_iteration_done(op, ret);
         return;
     }
-    blk_aio_pwritev(s->target, op->sector_num * BDRV_SECTOR_SIZE, &op->qiov,
-                    op->nb_sectors * BDRV_SECTOR_SIZE,
+    blk_aio_pwritev(s->target, op->sector_num * BDRV_SECTOR_SIZE, &op->qiov, 0,
                     mirror_write_complete, op);
 }
 
@@ -274,8 +273,7 @@ static int mirror_do_read(MirrorBlockJob *s, int64_t sector_num,
     s->sectors_in_flight += nb_sectors;
     trace_mirror_one_iteration(s, sector_num, nb_sectors);
 
-    blk_aio_preadv(source, sector_num * BDRV_SECTOR_SIZE, &op->qiov,
-                   nb_sectors * BDRV_SECTOR_SIZE,
+    blk_aio_preadv(source, sector_num * BDRV_SECTOR_SIZE, &op->qiov, 0,
                    mirror_read_complete, op);
     return ret;
 }
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH 2/9] mirror: create mirror_dirty_init helper for mirror_run
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 1/9] mirror: fix calling of blk_aio_pwritev/blk_aio_preadv Denis V. Lunev
@ 2016-06-14 15:25 ` Denis V. Lunev
  2016-06-15  2:29   ` Eric Blake
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 3/9] mirror: optimize dirty bitmap filling in mirror_run a bit Denis V. Lunev
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-14 15:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: vsementsov, Denis V. Lunev, Stefan Hajnoczi, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake

The code inside the helper will be extended in the next patch. mirror_run
itself is rather bloated at the moment.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 83 ++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 49 insertions(+), 34 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 3760e29..797659d 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -501,19 +501,62 @@ static void mirror_exit(BlockJob *job, void *opaque)
     bdrv_unref(src);
 }
 
+static int mirror_dirty_init(MirrorBlockJob *s)
+{
+    int64_t sector_num, end;
+    BlockDriverState *base = s->base;
+    BlockDriverState *bs = blk_bs(s->common.blk);
+    BlockDriverState *target_bs = blk_bs(s->target);
+    bool mark_all_dirty = base == NULL && !bdrv_has_zero_init(target_bs);
+    uint64_t last_pause_ns;
+    int ret, n;
+
+    end = s->bdev_length / BDRV_SECTOR_SIZE;
+
+    last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+
+    /* First part, loop on the sectors and initialize the dirty bitmap.  */
+    for (sector_num = 0; sector_num < end; ) {
+        /* Just to make sure we are not exceeding int limit. */
+        int nb_sectors = MIN(INT_MAX >> BDRV_SECTOR_BITS,
+                             end - sector_num);
+        int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+
+        if (now - last_pause_ns > SLICE_TIME) {
+            last_pause_ns = now;
+            block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, 0);
+        }
+
+        if (block_job_is_cancelled(&s->common)) {
+            return 0;
+        }
+
+        ret = bdrv_is_allocated_above(bs, base, sector_num, nb_sectors, &n);
+        if (ret < 0) {
+            return ret;
+        }
+
+        assert(n > 0);
+        if (ret == 1 || mark_all_dirty) {
+            bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n);
+        }
+        sector_num += n;
+    }
+    return 0;
+}
+
 static void coroutine_fn mirror_run(void *opaque)
 {
     MirrorBlockJob *s = opaque;
     MirrorExitData *data;
     BlockDriverState *bs = blk_bs(s->common.blk);
     BlockDriverState *target_bs = blk_bs(s->target);
-    int64_t sector_num, end, length;
+    int64_t length;
     uint64_t last_pause_ns;
     BlockDriverInfo bdi;
     char backing_filename[2]; /* we only need 2 characters because we are only
                                  checking for a NULL string */
     int ret = 0;
-    int n;
     int target_cluster_size = BDRV_SECTOR_SIZE;
 
     if (block_job_is_cancelled(&s->common)) {
@@ -555,7 +598,6 @@ static void coroutine_fn mirror_run(void *opaque)
     s->target_cluster_sectors = target_cluster_size >> BDRV_SECTOR_BITS;
     s->max_iov = MIN(bs->bl.max_iov, target_bs->bl.max_iov);
 
-    end = s->bdev_length / BDRV_SECTOR_SIZE;
     s->buf = qemu_try_blockalign(bs, s->buf_size);
     if (s->buf == NULL) {
         ret = -ENOMEM;
@@ -564,41 +606,14 @@ static void coroutine_fn mirror_run(void *opaque)
 
     mirror_free_init(s);
 
-    last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     if (!s->is_none_mode) {
-        /* First part, loop on the sectors and initialize the dirty bitmap.  */
-        BlockDriverState *base = s->base;
-        bool mark_all_dirty = s->base == NULL && !bdrv_has_zero_init(target_bs);
-
-        for (sector_num = 0; sector_num < end; ) {
-            /* Just to make sure we are not exceeding int limit. */
-            int nb_sectors = MIN(INT_MAX >> BDRV_SECTOR_BITS,
-                                 end - sector_num);
-            int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
-
-            if (now - last_pause_ns > SLICE_TIME) {
-                last_pause_ns = now;
-                block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, 0);
-            }
-
-            if (block_job_is_cancelled(&s->common)) {
-                goto immediate_exit;
-            }
-
-            ret = bdrv_is_allocated_above(bs, base, sector_num, nb_sectors, &n);
-
-            if (ret < 0) {
-                goto immediate_exit;
-            }
-
-            assert(n > 0);
-            if (ret == 1 || mark_all_dirty) {
-                bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n);
-            }
-            sector_num += n;
+        ret = mirror_dirty_init(s);
+        if (ret < 0 || block_job_is_cancelled(&s->common)) {
+            goto immediate_exit;
         }
     }
 
+    last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
     for (;;) {
         uint64_t delay_ns = 0;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH 3/9] mirror: optimize dirty bitmap filling in mirror_run a bit
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 1/9] mirror: fix calling of blk_aio_pwritev/blk_aio_preadv Denis V. Lunev
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 2/9] mirror: create mirror_dirty_init helper for mirror_run Denis V. Lunev
@ 2016-06-14 15:25 ` Denis V. Lunev
  2016-06-15  2:36   ` Eric Blake
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target Denis V. Lunev
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-14 15:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: vsementsov, Denis V. Lunev, Stefan Hajnoczi, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake

There is no need to scan the allocation tables if the mark_all_dirty flag
is set. Just mark the whole bitmap dirty.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 797659d..c7b3639 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -507,12 +507,16 @@ static int mirror_dirty_init(MirrorBlockJob *s)
     BlockDriverState *base = s->base;
     BlockDriverState *bs = blk_bs(s->common.blk);
     BlockDriverState *target_bs = blk_bs(s->target);
-    bool mark_all_dirty = base == NULL && !bdrv_has_zero_init(target_bs);
     uint64_t last_pause_ns;
     int ret, n;
 
     end = s->bdev_length / BDRV_SECTOR_SIZE;
 
+    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
+        bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
+        return 0;
+    }
+
     last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
 
     /* First part, loop on the sectors and initialize the dirty bitmap.  */
@@ -537,7 +541,7 @@ static int mirror_dirty_init(MirrorBlockJob *s)
         }
 
         assert(n > 0);
-        if (ret == 1 || mark_all_dirty) {
+        if (ret == 1) {
             bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n);
         }
         sector_num += n;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
                   ` (2 preceding siblings ...)
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 3/9] mirror: optimize dirty bitmap filling in mirror_run a bit Denis V. Lunev
@ 2016-06-14 15:25 ` Denis V. Lunev
  2016-06-15  3:00   ` Eric Blake
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk Denis V. Lunev
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-14 15:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: vsementsov, Denis V. Lunev, Stefan Hajnoczi, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake

With a bdrv_co_write_zeroes method on the target BDS, zeroes are not sent
over the wire, so the target can be zeroed out very efficiently. This
should be done with the largest chunk possible.

This improves the performance of live migration of an empty disk by
150 times if NBD supports write_zeroes.
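
For reference, the zeroing chunk defined below works out as follows
(assuming BDRV_SECTOR_BITS == 9, i.e. 512-byte sectors):

    3u << (29 - 9) = 3 * 2^20 sectors = 3145728 sectors * 512 bytes = 1.5 GiB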

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index c7b3639..c2f8773 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -21,6 +21,7 @@
 #include "qemu/ratelimit.h"
 #include "qemu/bitmap.h"
 
+#define MIRROR_ZERO_CHUNK   (3u << (29 - BDRV_SECTOR_BITS))      /* 1.5 GiB */
 #define SLICE_TIME    100000000ULL /* ns */
 #define MAX_IN_FLIGHT 16
 #define DEFAULT_MIRROR_BUF_SIZE   (10 << 20)
@@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
 
     end = s->bdev_length / BDRV_SECTOR_SIZE;
 
-    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
+    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
+            target_bs->drv->bdrv_co_write_zeroes == NULL) {
         bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
         return 0;
     }
@@ -546,6 +548,34 @@ static int mirror_dirty_init(MirrorBlockJob *s)
         }
         sector_num += n;
     }
+
+    if (base != NULL || bdrv_has_zero_init(target_bs)) {
+        /* no need to zero out entire disk */
+        return 0;
+    }
+
+    for (sector_num = 0; sector_num < end; ) {
+        int nb_sectors = MIN(MIRROR_ZERO_CHUNK, end - sector_num);
+        int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+
+        if (now - last_pause_ns > SLICE_TIME) {
+            last_pause_ns = now;
+            block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, 0);
+        }
+
+        if (block_job_is_cancelled(&s->common)) {
+            return -EINTR;
+        }
+
+        if (s->in_flight == MAX_IN_FLIGHT) {
+            trace_mirror_yield(s, s->in_flight, s->buf_free_count, -1);
+            mirror_wait_for_io(s);
+            continue;
+        }
+
+        mirror_do_zero_or_discard(s, sector_num, nb_sectors, false);
+        sector_num += nb_sectors;
+    }
     return 0;
 }
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
                   ` (3 preceding siblings ...)
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target Denis V. Lunev
@ 2016-06-14 15:25 ` Denis V. Lunev
  2016-06-15  3:20   ` Eric Blake
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier Denis V. Lunev
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-14 15:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: vsementsov, Denis V. Lunev, Stefan Hajnoczi, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake

We should not take zero blocks into account for delay calculations.
They are not actually read, so IO throttling is not required for them.
Otherwise, migration of a VM with a 16 TB QCOW2 disk holding 4 GB of
data takes days.
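
To put a number on that: with, say, a 10 MB/s rate limit (an assumed
figure, purely for illustration), accounting all 16 TB of virtual size
against the limit means roughly 16 * 10^12 B / 10^7 B/s = 1.6 * 10^6
seconds, i.e. about 18 days, even though only 4 GB are actually read.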

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index c2f8773..d8be80a 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -363,7 +363,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     bitmap_set(s->in_flight_bitmap, sector_num / sectors_per_chunk, nb_chunks);
     while (nb_chunks > 0 && sector_num < end) {
         int ret;
-        int io_sectors;
+        int io_sectors, io_sectors_acct;
         BlockDriverState *file;
         enum MirrorMethod {
             MIRROR_METHOD_COPY,
@@ -399,12 +399,15 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
         switch (mirror_method) {
         case MIRROR_METHOD_COPY:
             io_sectors = mirror_do_read(s, sector_num, io_sectors);
+            io_sectors_acct = io_sectors;
             break;
         case MIRROR_METHOD_ZERO:
             mirror_do_zero_or_discard(s, sector_num, io_sectors, false);
+            io_sectors_acct = 0;
             break;
         case MIRROR_METHOD_DISCARD:
             mirror_do_zero_or_discard(s, sector_num, io_sectors, true);
+            io_sectors_acct = 0;
             break;
         default:
             abort();
@@ -412,7 +415,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
         assert(io_sectors);
         sector_num += io_sectors;
         nb_chunks -= DIV_ROUND_UP(io_sectors, sectors_per_chunk);
-        delay_ns += ratelimit_calculate_delay(&s->limit, io_sectors);
+        delay_ns += ratelimit_calculate_delay(&s->limit, io_sectors_acct);
     }
     return delay_ns;
 }
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
                   ` (4 preceding siblings ...)
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk Denis V. Lunev
@ 2016-06-14 15:25 ` Denis V. Lunev
  2016-06-15  4:07   ` Eric Blake
                     ` (2 more replies)
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 7/9] mirror: allow to save buffer for QEMUIOVector in MirrorOp Denis V. Lunev
                   ` (5 subsequent siblings)
  11 siblings, 3 replies; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-14 15:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: vsementsov, Denis V. Lunev, Stefan Hajnoczi, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake

The qiov of each write request is stored in the tracked request so that
before_write notifiers can see the data being written. The following
patches use this in the mirror job to copy guest writes synchronously to
the mirror target.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/io.c                | 12 +++++++-----
 include/block/block_int.h |  1 +
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index 2d832aa..d2ad09c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -368,12 +368,14 @@ static void tracked_request_end(BdrvTrackedRequest *req)
  */
 static void tracked_request_begin(BdrvTrackedRequest *req,
                                   BlockDriverState *bs,
+                                  QEMUIOVector *qiov,
                                   int64_t offset,
                                   unsigned int bytes,
                                   enum BdrvTrackedRequestType type)
 {
     *req = (BdrvTrackedRequest){
         .bs = bs,
+        .qiov           = qiov,
         .offset         = offset,
         .bytes          = bytes,
         .type           = type,
@@ -1073,7 +1075,7 @@ int coroutine_fn bdrv_co_preadv(BlockDriverState *bs,
         bytes = ROUND_UP(bytes, align);
     }
 
-    tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_READ);
+    tracked_request_begin(&req, bs, NULL, offset, bytes, BDRV_TRACKED_READ);
     ret = bdrv_aligned_preadv(bs, &req, offset, bytes, align,
                               use_local_qiov ? &local_qiov : qiov,
                               flags);
@@ -1391,7 +1393,7 @@ int coroutine_fn bdrv_co_pwritev(BlockDriverState *bs,
      * Pad qiov with the read parts and be sure to have a tracked request not
      * only for bdrv_aligned_pwritev, but also for the reads of the RMW cycle.
      */
-    tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_WRITE);
+    tracked_request_begin(&req, bs, qiov, offset, bytes, BDRV_TRACKED_WRITE);
 
     if (!qiov) {
         ret = bdrv_co_do_zero_pwritev(bs, offset, bytes, flags, &req);
@@ -2098,7 +2100,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
         return 0;
     }
 
-    tracked_request_begin(&req, bs, 0, 0, BDRV_TRACKED_FLUSH);
+    tracked_request_begin(&req, bs, NULL, 0, 0, BDRV_TRACKED_FLUSH);
 
     /* Write back all layers by calling one driver function */
     if (bs->drv->bdrv_co_flush) {
@@ -2228,7 +2230,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
         return 0;
     }
 
-    tracked_request_begin(&req, bs, sector_num, nb_sectors,
+    tracked_request_begin(&req, bs, NULL, sector_num, nb_sectors,
                           BDRV_TRACKED_DISCARD);
     bdrv_set_dirty(bs, sector_num, nb_sectors);
 
@@ -2331,7 +2333,7 @@ static int bdrv_co_do_ioctl(BlockDriverState *bs, int req, void *buf)
     };
     BlockAIOCB *acb;
 
-    tracked_request_begin(&tracked_req, bs, 0, 0, BDRV_TRACKED_IOCTL);
+    tracked_request_begin(&tracked_req, bs, NULL, 0, 0, BDRV_TRACKED_IOCTL);
     if (!drv || !drv->bdrv_aio_ioctl) {
         co.ret = -ENOTSUP;
         goto out;
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 30a9717..72f463a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -69,6 +69,7 @@ enum BdrvTrackedRequestType {
 
 typedef struct BdrvTrackedRequest {
     BlockDriverState *bs;
+    QEMUIOVector *qiov;
     int64_t offset;
     unsigned int bytes;
     enum BdrvTrackedRequestType type;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH 7/9] mirror: allow to save buffer for QEMUIOVector in MirrorOp
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
                   ` (5 preceding siblings ...)
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier Denis V. Lunev
@ 2016-06-14 15:25 ` Denis V. Lunev
  2016-06-15  4:11   ` Eric Blake
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror Denis V. Lunev
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-14 15:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: vsementsov, Denis V. Lunev, Stefan Hajnoczi, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake

Properly handle MirrorOp initialization/deinitialization. The field is
not actually used yet.
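
The intended use appears in the next patch, where the guest data is
cloned into a privately allocated bounce buffer:

    op->buf = qemu_try_blockalign(blk_bs(s->target), req->qiov->size);
    ...
    qemu_iovec_init(&op->qiov, req->qiov->niov);
    qemu_iovec_clone(&op->qiov, req->qiov, op->buf);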

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index d8be80a..7471211 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -73,6 +73,7 @@ typedef struct MirrorOp {
     QEMUIOVector qiov;
     int64_t sector_num;
     int nb_sectors;
+    void *buf;
 } MirrorOp;
 
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
@@ -100,24 +101,28 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
     s->in_flight--;
     s->sectors_in_flight -= op->nb_sectors;
     iov = op->qiov.iov;
-    for (i = 0; i < op->qiov.niov; i++) {
-        MirrorBuffer *buf = (MirrorBuffer *) iov[i].iov_base;
-        QSIMPLEQ_INSERT_TAIL(&s->buf_free, buf, next);
-        s->buf_free_count++;
-    }
 
-    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
-    chunk_num = op->sector_num / sectors_per_chunk;
-    nb_chunks = DIV_ROUND_UP(op->nb_sectors, sectors_per_chunk);
-    bitmap_clear(s->in_flight_bitmap, chunk_num, nb_chunks);
-    if (ret >= 0) {
-        if (s->cow_bitmap) {
-            bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
+    if (op->buf == NULL) {
+        for (i = 0; i < op->qiov.niov; i++) {
+            MirrorBuffer *buf = (MirrorBuffer *) iov[i].iov_base;
+            QSIMPLEQ_INSERT_TAIL(&s->buf_free, buf, next);
+            s->buf_free_count++;
+        }
+
+        sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
+        chunk_num = op->sector_num / sectors_per_chunk;
+        nb_chunks = DIV_ROUND_UP(op->nb_sectors, sectors_per_chunk);
+        bitmap_clear(s->in_flight_bitmap, chunk_num, nb_chunks);
+        if (ret >= 0) {
+            if (s->cow_bitmap) {
+                bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
+            }
+            s->common.offset += (uint64_t)op->nb_sectors * BDRV_SECTOR_SIZE;
         }
-        s->common.offset += (uint64_t)op->nb_sectors * BDRV_SECTOR_SIZE;
     }
 
     qemu_iovec_destroy(&op->qiov);
+    g_free(op->buf);
     g_free(op);
 
     if (s->waiting_for_io) {
@@ -255,6 +260,7 @@ static int mirror_do_read(MirrorBlockJob *s, int64_t sector_num,
     op->s = s;
     op->sector_num = sector_num;
     op->nb_sectors = nb_sectors;
+    op->buf = NULL;
 
     /* Now make a QEMUIOVector taking enough granularity-sized chunks
      * from s->buf_free.
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
                   ` (6 preceding siblings ...)
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 7/9] mirror: allow to save buffer for QEMUIOVector in MirrorOp Denis V. Lunev
@ 2016-06-14 15:25 ` Denis V. Lunev
  2016-06-15  4:18   ` Eric Blake
  2016-06-15  9:48   ` Stefan Hajnoczi
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 9/9] mirror: replace bdrv_dirty_bitmap with plain hbitmap Denis V. Lunev
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-14 15:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: vsementsov, Denis V. Lunev, Stefan Hajnoczi, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake

Block commit of the active image to the backing store on a slow disk
may never complete. For example, with a guest running the following
loop inside:
    while true; do
        dd bs=1k count=1 if=/dev/zero of=x
    done
on top of slow storage, the operation cannot complete within a
reasonable amount of time:
    virsh blockcommit rhel7 sda --active --shallow
    virsh qemu-monitor-event
    virsh qemu-monitor-command rhel7 \
        '{"execute":"block-job-complete",\
          "arguments":{"device":"drive-scsi0-0-0-0"} }'
    virsh qemu-monitor-event
The completion event is never received.

This problem cannot be fixed easily within the current architecture. We
should either prohibit guest writes (which re-dirty the bitmap) or switch
to a synchronous scheme.

This patch implements the latter. It adds a mirror_before_write_notify
callback, through which all data written by the guest is synchronously
written to the mirror target as well. This only partially solves the
problem, though: we still need to switch from bdrv_dirty_bitmap to a
plain hbitmap, which is done in the next patch.
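
For example, with a 64 KiB granularity (128 sectors per chunk), a guest
write covering sectors 100..399 yields aligned_start = 128 and
aligned_end = 384, so only the chunks fully covered by the write are
cleared from the dirty bitmap; the partially covered head and tail
remain dirty and are copied later by the background iteration.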

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/block/mirror.c b/block/mirror.c
index 7471211..086256c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -58,6 +58,9 @@ typedef struct MirrorBlockJob {
     QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
     int buf_free_count;
 
+    NotifierWithReturn before_write;
+    CoQueue dependent_writes;
+
     unsigned long *in_flight_bitmap;
     int in_flight;
     int sectors_in_flight;
@@ -125,6 +128,7 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
     g_free(op->buf);
     g_free(op);
 
+    qemu_co_queue_restart_all(&s->dependent_writes);
     if (s->waiting_for_io) {
         qemu_coroutine_enter(s->common.co, NULL);
     }
@@ -511,6 +515,74 @@ static void mirror_exit(BlockJob *job, void *opaque)
     bdrv_unref(src);
 }
 
+static int coroutine_fn mirror_before_write_notify(
+        NotifierWithReturn *notifier, void *opaque)
+{
+    MirrorBlockJob *s = container_of(notifier, MirrorBlockJob, before_write);
+    BdrvTrackedRequest *req = opaque;
+    MirrorOp *op;
+    int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
+    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
+    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
+    int64_t end_sector = sector_num + nb_sectors;
+    int64_t aligned_start, aligned_end;
+
+    if (req->type != BDRV_TRACKED_DISCARD && req->type != BDRV_TRACKED_WRITE) {
+        /* this is not discard and write, we do not care */
+        return 0;
+    }
+
+    while (1) {
+        bool waited = false;
+        int64_t sn;
+
+        for (sn = sector_num; sn < end_sector; sn += sectors_per_chunk) {
+            int64_t chunk = sn / sectors_per_chunk;
+            if (test_bit(chunk, s->in_flight_bitmap)) {
+                trace_mirror_yield_in_flight(s, chunk, s->in_flight);
+                qemu_co_queue_wait(&s->dependent_writes);
+                waited = true;
+            }
+        }
+
+        if (!waited) {
+            break;
+        }
+    }
+
+    aligned_start = QEMU_ALIGN_UP(sector_num, sectors_per_chunk);
+    aligned_end = QEMU_ALIGN_DOWN(sector_num + nb_sectors, sectors_per_chunk);
+    if (aligned_end > aligned_start) {
+        bdrv_reset_dirty_bitmap(s->dirty_bitmap, aligned_start,
+                                aligned_end - aligned_start);
+    }
+
+    if (req->type == BDRV_TRACKED_DISCARD) {
+        mirror_do_zero_or_discard(s, sector_num, nb_sectors, true);
+        return 0;
+    }
+
+    s->in_flight++;
+    s->sectors_in_flight += nb_sectors;
+
+    /* Allocate a MirrorOp that is used as an AIO callback.  */
+    op = g_new(MirrorOp, 1);
+    op->s = s;
+    op->sector_num = sector_num;
+    op->nb_sectors = nb_sectors;
+    op->buf = qemu_try_blockalign(blk_bs(s->target), req->qiov->size);
+    if (op->buf == NULL) {
+        g_free(op);
+        return -ENOMEM;
+    }
+    qemu_iovec_init(&op->qiov, req->qiov->niov);
+    qemu_iovec_clone(&op->qiov, req->qiov, op->buf);
+
+    blk_aio_pwritev(s->target, req->offset, &op->qiov, 0,
+                    mirror_write_complete, op);
+    return 0;
+}
+
 static int mirror_dirty_init(MirrorBlockJob *s)
 {
     int64_t sector_num, end;
@@ -764,6 +836,8 @@ immediate_exit:
         mirror_drain(s);
     }
 
+    notifier_with_return_remove(&s->before_write);
+
     assert(s->in_flight == 0);
     qemu_vfree(s->buf);
     g_free(s->cow_bitmap);
@@ -905,6 +979,10 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
         return;
     }
 
+    qemu_co_queue_init(&s->dependent_writes);
+    s->before_write.notify = mirror_before_write_notify;
+    bdrv_add_before_write_notifier(bs, &s->before_write);
+
     bdrv_op_block_all(target, s->common.blocker);
 
     s->common.co = qemu_coroutine_create(mirror_run);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH 9/9] mirror: replace bdrv_dirty_bitmap with plain hbitmap
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
                   ` (7 preceding siblings ...)
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror Denis V. Lunev
@ 2016-06-14 15:25 ` Denis V. Lunev
  2016-06-15  9:06 ` [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Kevin Wolf
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-14 15:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: vsementsov, Denis V. Lunev, Stefan Hajnoczi, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake

The previous patch replaced the mechanics of syncing new writes, so we
no longer need to track dirty changes via a BdrvDirtyBitmap.
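
The replacement below is mechanical:

    bdrv_create_dirty_bitmap()  -> hbitmap_alloc() (via mirror_create_bitmap)
    bdrv_release_dirty_bitmap() -> hbitmap_free()
    bdrv_set_dirty_bitmap()     -> hbitmap_set()
    bdrv_reset_dirty_bitmap()   -> hbitmap_reset()
    bdrv_get_dirty()            -> hbitmap_get()
    bdrv_get_dirty_count()      -> hbitmap_count()
    bdrv_dirty_iter_init(),
    bdrv_set_dirty_iter()       -> hbitmap_iter_init()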

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 58 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 086256c..926fd13 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -52,7 +52,7 @@ typedef struct MirrorBlockJob {
     size_t buf_size;
     int64_t bdev_length;
     unsigned long *cow_bitmap;
-    BdrvDirtyBitmap *dirty_bitmap;
+    HBitmap *copy_bitmap;
     HBitmapIter hbi;
     uint8_t *buf;
     QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
@@ -141,7 +141,7 @@ static void mirror_write_complete(void *opaque, int ret)
     if (ret < 0) {
         BlockErrorAction action;
 
-        bdrv_set_dirty_bitmap(s->dirty_bitmap, op->sector_num, op->nb_sectors);
+        hbitmap_set(s->copy_bitmap, op->sector_num, op->nb_sectors);
         action = mirror_error_action(s, false, -ret);
         if (action == BLOCK_ERROR_ACTION_REPORT && s->ret >= 0) {
             s->ret = ret;
@@ -157,7 +157,7 @@ static void mirror_read_complete(void *opaque, int ret)
     if (ret < 0) {
         BlockErrorAction action;
 
-        bdrv_set_dirty_bitmap(s->dirty_bitmap, op->sector_num, op->nb_sectors);
+        hbitmap_set(s->copy_bitmap, op->sector_num, op->nb_sectors);
         action = mirror_error_action(s, true, -ret);
         if (action == BLOCK_ERROR_ACTION_REPORT && s->ret >= 0) {
             s->ret = ret;
@@ -328,9 +328,9 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 
     sector_num = hbitmap_iter_next(&s->hbi);
     if (sector_num < 0) {
-        bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
+        hbitmap_iter_init(&s->hbi, s->copy_bitmap, 0);
         sector_num = hbitmap_iter_next(&s->hbi);
-        trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
+        trace_mirror_restart_iter(s, hbitmap_count(s->copy_bitmap));
         assert(sector_num >= 0);
     }
 
@@ -347,7 +347,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
         int64_t next_sector = sector_num + nb_chunks * sectors_per_chunk;
         int64_t next_chunk = next_sector / sectors_per_chunk;
         if (next_sector >= end ||
-            !bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) {
+            !hbitmap_get(s->copy_bitmap, next_sector)) {
             break;
         }
         if (test_bit(next_chunk, s->in_flight_bitmap)) {
@@ -357,7 +357,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
         hbitmap_next = hbitmap_iter_next(&s->hbi);
         if (hbitmap_next > next_sector || hbitmap_next < 0) {
             /* The bitmap iterator's cache is stale, refresh it */
-            bdrv_set_dirty_iter(&s->hbi, next_sector);
+            hbitmap_iter_init(&s->hbi, s->copy_bitmap, next_sector);
             hbitmap_next = hbitmap_iter_next(&s->hbi);
         }
         assert(hbitmap_next == next_sector);
@@ -368,8 +368,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
      * calling bdrv_get_block_status_above could yield - if some blocks are
      * marked dirty in this window, we need to know.
      */
-    bdrv_reset_dirty_bitmap(s->dirty_bitmap, sector_num,
-                            nb_chunks * sectors_per_chunk);
+    hbitmap_reset(s->copy_bitmap, sector_num, nb_chunks * sectors_per_chunk);
     bitmap_set(s->in_flight_bitmap, sector_num / sectors_per_chunk, nb_chunks);
     while (nb_chunks > 0 && sector_num < end) {
         int ret;
@@ -553,8 +552,8 @@ static int coroutine_fn mirror_before_write_notify(
     aligned_start = QEMU_ALIGN_UP(sector_num, sectors_per_chunk);
     aligned_end = QEMU_ALIGN_DOWN(sector_num + nb_sectors, sectors_per_chunk);
     if (aligned_end > aligned_start) {
-        bdrv_reset_dirty_bitmap(s->dirty_bitmap, aligned_start,
-                                aligned_end - aligned_start);
+        hbitmap_reset(s->copy_bitmap, aligned_start,
+                      aligned_end - aligned_start);
     }
 
     if (req->type == BDRV_TRACKED_DISCARD) {
@@ -596,7 +595,7 @@ static int mirror_dirty_init(MirrorBlockJob *s)
 
     if (base == NULL && !bdrv_has_zero_init(target_bs) &&
             target_bs->drv->bdrv_co_write_zeroes == NULL) {
-        bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
+        hbitmap_set(s->copy_bitmap, 0, end);
         return 0;
     }
 
@@ -625,7 +624,7 @@ static int mirror_dirty_init(MirrorBlockJob *s)
 
         assert(n > 0);
         if (ret == 1) {
-            bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n);
+            hbitmap_set(s->copy_bitmap, sector_num, n);
         }
         sector_num += n;
     }
@@ -729,7 +728,7 @@ static void coroutine_fn mirror_run(void *opaque)
     }
 
     last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
-    bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
+    hbitmap_iter_init(&s->hbi, s->copy_bitmap, 0);
     for (;;) {
         uint64_t delay_ns = 0;
         int64_t cnt;
@@ -740,7 +739,7 @@ static void coroutine_fn mirror_run(void *opaque)
             goto immediate_exit;
         }
 
-        cnt = bdrv_get_dirty_count(s->dirty_bitmap);
+        cnt = hbitmap_count(s->copy_bitmap);
         /* s->common.offset contains the number of bytes already processed so
          * far, cnt is the number of dirty sectors remaining and
          * s->sectors_in_flight is the number of sectors currently being
@@ -787,7 +786,7 @@ static void coroutine_fn mirror_run(void *opaque)
 
                 should_complete = s->should_complete ||
                     block_job_is_cancelled(&s->common);
-                cnt = bdrv_get_dirty_count(s->dirty_bitmap);
+                cnt = hbitmap_count(s->copy_bitmap);
             }
         }
 
@@ -802,7 +801,7 @@ static void coroutine_fn mirror_run(void *opaque)
              */
             trace_mirror_before_drain(s, cnt);
             bdrv_co_drain(bs);
-            cnt = bdrv_get_dirty_count(s->dirty_bitmap);
+            cnt = hbitmap_count(s->copy_bitmap);
         }
 
         ret = 0;
@@ -842,7 +841,7 @@ immediate_exit:
     qemu_vfree(s->buf);
     g_free(s->cow_bitmap);
     g_free(s->in_flight_bitmap);
-    bdrv_release_dirty_bitmap(bs, s->dirty_bitmap);
+    hbitmap_free(s->copy_bitmap);
 
     data = g_malloc(sizeof(*data));
     data->ret = ret;
@@ -925,6 +924,25 @@ static const BlockJobDriver commit_active_job_driver = {
     .complete      = mirror_complete,
 };
 
+static HBitmap *mirror_create_bitmap(BlockDriverState *bs,
+                                     uint32_t granularity, Error **errp)
+{
+    int64_t bitmap_size;
+    uint32_t sector_granularity;
+
+    assert((granularity & (granularity - 1)) == 0);
+
+    sector_granularity = granularity >> BDRV_SECTOR_BITS;
+    assert(sector_granularity);
+    bitmap_size = bdrv_nb_sectors(bs);
+    if (bitmap_size < 0) {
+        error_setg_errno(errp, -bitmap_size, "could not get length of device");
+        errno = -bitmap_size;
+        return NULL;
+    }
+    return hbitmap_alloc(bitmap_size, ctz32(sector_granularity));
+}
+
 static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
                              const char *replaces,
                              int64_t speed, uint32_t granularity,
@@ -971,8 +989,8 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
     s->buf_size = ROUND_UP(buf_size, granularity);
     s->unmap = unmap;
 
-    s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
-    if (!s->dirty_bitmap) {
+    s->copy_bitmap = mirror_create_bitmap(bs, granularity, errp);
+    if (s->copy_bitmap == NULL) {
         g_free(s->replaces);
         blk_unref(s->target);
         block_job_unref(&s->common);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 1/9] mirror: fix calling of blk_aio_pwritev/blk_aio_preadv
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 1/9] mirror: fix calling of blk_aio_pwritev/blk_aio_preadv Denis V. Lunev
@ 2016-06-14 22:48   ` Eric Blake
  0 siblings, 0 replies; 40+ messages in thread
From: Eric Blake @ 2016-06-14 22:48 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

[-- Attachment #1: Type: text/plain, Size: 868 bytes --]

On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> The 4th argument is flags rather than size. Fortunately, the flags occupy
> only the 5 least significant bits, and those bits are always zero in the
> size value due to alignment.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)

Duplicate of this patch, already on Kevin's block tree:
https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg03377.html

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 2/9] mirror: create mirror_dirty_init helper for mirror_run
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 2/9] mirror: create mirror_dirty_init helper for mirror_run Denis V. Lunev
@ 2016-06-15  2:29   ` Eric Blake
  0 siblings, 0 replies; 40+ messages in thread
From: Eric Blake @ 2016-06-15  2:29 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

[-- Attachment #1: Type: text/plain, Size: 849 bytes --]

On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> The code inside the helper will be extended in the next patch. mirror_run
> itself is rather bloated at the moment.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 83 ++++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 49 insertions(+), 34 deletions(-)
> 

Looks like a nice split.
Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 3/9] mirror: optimize dirty bitmap filling in mirror_run a bit
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 3/9] mirror: optimize dirty bitmap filling in mirror_run a bit Denis V. Lunev
@ 2016-06-15  2:36   ` Eric Blake
  2016-06-15  8:41     ` Denis V. Lunev
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Blake @ 2016-06-15  2:36 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

[-- Attachment #1: Type: text/plain, Size: 2153 bytes --]

On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> There is no need to scan the allocation tables if the mark_all_dirty flag
> is set. Just mark the whole bitmap dirty.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 797659d..c7b3639 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -507,12 +507,16 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>      BlockDriverState *base = s->base;
>      BlockDriverState *bs = blk_bs(s->common.blk);
>      BlockDriverState *target_bs = blk_bs(s->target);
> -    bool mark_all_dirty = base == NULL && !bdrv_has_zero_init(target_bs);
>      uint64_t last_pause_ns;
>      int ret, n;
>  
>      end = s->bdev_length / BDRV_SECTOR_SIZE;
>  
> +    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
> +        bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);

Won't work as written.  'end' is 64 bits, but bdrv_set_dirty_bitmap() is
limited to a 32-bit sector count.  It might first be worth updating
bdrv_set_dirty_bitmap() and friends to be byte-based rather than
sector-based (but still tracking a sector per bit, obviously), as well
as expanding them to operate on 64-bit sizes rather than 32-bit.

I'm also slightly worried that the existing code repeated things in a
loop, and therefore had pause points every iteration and could thus
remain responsive to an early cancel.  But doing the entire operation in
one chunk (assuming you fix the bitmap code to handle a 64-bit size) may
end up running for so long without interruption that you lose the
early-cancellation benefit you currently get by virtue of the 32-bit
limit.
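
One hypothetical shape for such a loop, reusing the pause/cancel checks
already present in mirror_dirty_init:

    /* hypothetical sketch: set the whole bitmap in bounded chunks so
     * the job stays cancellable between chunks */
    for (sector_num = 0; sector_num < end; ) {
        int nb_sectors = MIN(INT_MAX >> BDRV_SECTOR_BITS, end - sector_num);
        /* pause/cancel checks as in the allocation scan loop ... */
        bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, nb_sectors);
        sector_num += nb_sectors;
    }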

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target Denis V. Lunev
@ 2016-06-15  3:00   ` Eric Blake
  2016-06-15  8:46     ` Denis V. Lunev
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Blake @ 2016-06-15  3:00 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

[-- Attachment #1: Type: text/plain, Size: 4086 bytes --]

On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> With a bdrv_co_write_zeroes method on the target BDS, zeroes are not sent
> over the wire, so the target can be zeroed out very efficiently. This
> should be done with the largest chunk possible.
> 
> This improves the performance of live migration of an empty disk by
> 150 times if NBD supports write_zeroes.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index c7b3639..c2f8773 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -21,6 +21,7 @@
>  #include "qemu/ratelimit.h"
>  #include "qemu/bitmap.h"
>  
> +#define MIRROR_ZERO_CHUNK   (3u << (29 - BDRV_SECTOR_BITS))      /* 1.5 GiB */

Probably nicer to track this in bytes.  And do you really want a
hard-coded arbitrary limit, or is it better to live with
MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
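
That is, something along these lines (a sketch, assuming the byte-based
limit from the pwrite_zeroes rework):

    /* sketch only: clamp each chunk to what the target advertises */
    int64_t max_bytes = MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX);
    int nb_sectors = MIN(max_bytes >> BDRV_SECTOR_BITS, end - sector_num);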

> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>  
>      end = s->bdev_length / BDRV_SECTOR_SIZE;
>  
> -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
> +            target_bs->drv->bdrv_co_write_zeroes == NULL) {

Indentation is off, although if checkpatch.pl doesn't complain I guess
it doesn't matter that much.

Why should you care whether the target_bs->drv implements a callback?
Can't you just rely on the normal bdrv_*() functions to do the dirty
work of picking the most efficient implementation without you having to
bypass the block layer?  In fact, isn't that the whole goal of
bdrv_make_zero() - why not call that instead of reimplementing it?
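
For reference, that helper is declared roughly as follows (signature
quoted from memory, may differ in this tree):

    int bdrv_make_zero(BlockDriverState *bs, BdrvRequestFlags flags);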

Patch needs rebasing - we've redone this into bdrv_co_pwrite_zeroes and
a byte interface, since upstream commit c1499a5e.

>          bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
>          return 0;
>      }
> @@ -546,6 +548,34 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>          }
>          sector_num += n;
>      }
> +
> +    if (base != NULL || bdrv_has_zero_init(target_bs)) {

You're now repeating the conditional that used to be 'bool
mark_all_dirty' (well, this is !mark_all_dirty); is it worth keeping the
simpler bool around?

> +        /* no need to zero out entire disk */
> +        return 0;
> +    }
> +
> +    for (sector_num = 0; sector_num < end; ) {
> +        int nb_sectors = MIN(MIRROR_ZERO_CHUNK, end - sector_num);

Why limit yourself to 1.5G? It's either too small for what you can
really do, or too large for what the device permits.  See my above
comment about MIN_NON_ZERO.

> +        int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +
> +        if (now - last_pause_ns > SLICE_TIME) {
> +            last_pause_ns = now;
> +            block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, 0);
> +        }
> +
> +        if (block_job_is_cancelled(&s->common)) {
> +            return -EINTR;
> +        }
> +
> +        if (s->in_flight == MAX_IN_FLIGHT) {
> +            trace_mirror_yield(s, s->in_flight, s->buf_free_count, -1);
> +            mirror_wait_for_io(s);
> +            continue;
> +        }

Hmm - I guess your mirror yield points are why you couldn't just
directly use bdrv_make_zero(); but is that a case where some refactoring
could share the code rather than duplicating it?

> +
> +        mirror_do_zero_or_discard(s, sector_num, nb_sectors, false);
> +        sector_num += nb_sectors;
> +    }
>      return 0;
>  }
>  
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk Denis V. Lunev
@ 2016-06-15  3:20   ` Eric Blake
  2016-06-15  9:19     ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Blake @ 2016-06-15  3:20 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

[-- Attachment #1: Type: text/plain, Size: 915 bytes --]

On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> We should not take zero blocks into account for delay calculations.
> They are not actually read, so IO throttling is not required for them.
> Otherwise, migration of a VM with a 16 TB QCOW2 disk holding 4 GB of
> data takes days.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)

Seems reasonable, but I'll let others more familiar with throttling give
the final say.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier Denis V. Lunev
@ 2016-06-15  4:07   ` Eric Blake
  2016-06-15  9:21   ` Stefan Hajnoczi
  2016-06-15  9:22   ` Stefan Hajnoczi
  2 siblings, 0 replies; 40+ messages in thread
From: Eric Blake @ 2016-06-15  4:07 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

[-- Attachment #1: Type: text/plain, Size: 2476 bytes --]

On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> Signed-off-by: Denis V. Lunev <den@openvz.org>

The commit message says what, but not why.  It's useful to give
reviewers a hint as to why a patch makes sense (such as a future patch
being able to use the write notifier to make mirroring more efficient if
it knows what is being mirrored).

> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/io.c                | 12 +++++++-----
>  include/block/block_int.h |  1 +
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 

> @@ -2228,7 +2230,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
>          return 0;
>      }
>  
> -    tracked_request_begin(&req, bs, sector_num, nb_sectors,
> +    tracked_request_begin(&req, bs, NULL, sector_num, nb_sectors,
>                            BDRV_TRACKED_DISCARD);
>      bdrv_set_dirty(bs, sector_num, nb_sectors);
>  
> @@ -2331,7 +2333,7 @@ static int bdrv_co_do_ioctl(BlockDriverState *bs, int req, void *buf)
>      };
>      BlockAIOCB *acb;
>  
> -    tracked_request_begin(&tracked_req, bs, 0, 0, BDRV_TRACKED_IOCTL);
> +    tracked_request_begin(&tracked_req, bs, NULL, 0, 0, BDRV_TRACKED_IOCTL);
>      if (!drv || !drv->bdrv_aio_ioctl) {
>          co.ret = -ENOTSUP;
>          goto out;
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 30a9717..72f463a 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -69,6 +69,7 @@ enum BdrvTrackedRequestType {
>  
>  typedef struct BdrvTrackedRequest {
>      BlockDriverState *bs;
> +    QEMUIOVector *qiov;
>      int64_t offset;
>      unsigned int bytes;

I guess bytes and qiov->size are not always redundant, because you pass
NULL for qiov for a zero or discard operation (an alternative would be
to always pass a qiov, even for zero/discard, so that we only need a
single size).  But I've been pointing out our inconsistent use of qiov
for zeroes in multiple places, so I don't think it's worth changing in
this series, but in one of its own if we want to do it.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 7/9] mirror: allow to save buffer for QEMUIOVector in MirrorOp
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 7/9] mirror: allow to save buffer for QEMUIOVector in MirrorOp Denis V. Lunev
@ 2016-06-15  4:11   ` Eric Blake
  0 siblings, 0 replies; 40+ messages in thread
From: Eric Blake @ 2016-06-15  4:11 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

[-- Attachment #1: Type: text/plain, Size: 3672 bytes --]

On 06/14/2016 09:25 AM, Denis V. Lunev wrote:

In the subject: 'allow to save buffer' is not idiomatic English; better
would be 'allow saving the buffer' or simply 'save the buffer'

> Properly cook MirrorOp initialization/deinitialization. The field is not
> yet used actually.

Another "what" but not "why" - expanding the commit message to mention
"why" makes it easier to review.

> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 32 +++++++++++++++++++-------------
>  1 file changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index d8be80a..7471211 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -73,6 +73,7 @@ typedef struct MirrorOp {
>      QEMUIOVector qiov;
>      int64_t sector_num;
>      int nb_sectors;
> +    void *buf;
>  } MirrorOp;
>  
>  static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
> @@ -100,24 +101,28 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>      s->in_flight--;
>      s->sectors_in_flight -= op->nb_sectors;
>      iov = op->qiov.iov;
> -    for (i = 0; i < op->qiov.niov; i++) {
> -        MirrorBuffer *buf = (MirrorBuffer *) iov[i].iov_base;
> -        QSIMPLEQ_INSERT_TAIL(&s->buf_free, buf, next);
> -        s->buf_free_count++;
> -    }
>  
> -    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> -    chunk_num = op->sector_num / sectors_per_chunk;
> -    nb_chunks = DIV_ROUND_UP(op->nb_sectors, sectors_per_chunk);
> -    bitmap_clear(s->in_flight_bitmap, chunk_num, nb_chunks);
> -    if (ret >= 0) {
> -        if (s->cow_bitmap) {
> -            bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
> +    if (op->buf == NULL) {
> +        for (i = 0; i < op->qiov.niov; i++) {
> +            MirrorBuffer *buf = (MirrorBuffer *) iov[i].iov_base;
> +            QSIMPLEQ_INSERT_TAIL(&s->buf_free, buf, next);
> +            s->buf_free_count++;
> +        }
> +
> +        sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> +        chunk_num = op->sector_num / sectors_per_chunk;
> +        nb_chunks = DIV_ROUND_UP(op->nb_sectors, sectors_per_chunk);
> +        bitmap_clear(s->in_flight_bitmap, chunk_num, nb_chunks);

I still think it might be smarter to fix bitmap operations to work on
byte inputs (still sectors, or rather granularity chunks, under the
hood, but no need to make users scale things just to have it scaled again).
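
Roughly something like this wrapper (just a sketch, the names are made
up):

    /* byte-based helper; still one bit per granularity chunk inside */
    static void mirror_clear_in_flight(MirrorBlockJob *s, int64_t offset,
                                       uint64_t bytes)
    {
        int64_t chunk = offset / s->granularity;
        int64_t nb_chunks = DIV_ROUND_UP(bytes, s->granularity);
        bitmap_clear(s->in_flight_bitmap, chunk, nb_chunks);
    }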

> +        if (ret >= 0) {
> +            if (s->cow_bitmap) {
> +                bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
> +            }
> +            s->common.offset += (uint64_t)op->nb_sectors * BDRV_SECTOR_SIZE;
>          }
> -        s->common.offset += (uint64_t)op->nb_sectors * BDRV_SECTOR_SIZE;
>      }
>  
>      qemu_iovec_destroy(&op->qiov);
> +    g_free(op->buf);
>      g_free(op);
>  
>      if (s->waiting_for_io) {
> @@ -255,6 +260,7 @@ static int mirror_do_read(MirrorBlockJob *s, int64_t sector_num,
>      op->s = s;
>      op->sector_num = sector_num;
>      op->nb_sectors = nb_sectors;
> +    op->buf = NULL;
>  
>      /* Now make a QEMUIOVector taking enough granularity-sized chunks
>       * from s->buf_free.
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror Denis V. Lunev
@ 2016-06-15  4:18   ` Eric Blake
  2016-06-15  8:52     ` Denis V. Lunev
  2016-06-15  9:48   ` Stefan Hajnoczi
  1 sibling, 1 reply; 40+ messages in thread
From: Eric Blake @ 2016-06-15  4:18 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

[-- Attachment #1: Type: text/plain, Size: 2394 bytes --]

On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> Block commit of the active image to the backing store on a slow disk
> could never end. For example with the guest with the following loop
> inside
>     while true; do
>         dd bs=1k count=1 if=/dev/zero of=x
>     done
> running above slow storage could not complete the operation with a

s/with/within/

> resonable amount of time:

s/resonable/reasonable/

>     virsh blockcommit rhel7 sda --active --shallow
>     virsh qemu-monitor-event
>     virsh qemu-monitor-command rhel7 \
>         '{"execute":"block-job-complete",\
>           "arguments":{"device":"drive-scsi0-0-0-0"} }'
>     virsh qemu-monitor-event
> Completion event is never received.
> 
> This problem could not be fixed easily with the current architecture. We
> should either prohibit guest writes (making dirty bitmap dirty) or switch
> to the sycnchronous scheme.

s/sycnchronous/synchronous/

> 
> This patch implements the latter. It adds mirror_before_write_notify
> callback. In this case all data written from the guest is synchnonously

s/synchnonously/synchronously/

> written to the mirror target. Though the problem is solved partially.
> We should switch from bdrv_dirty_bitmap to simple hbitmap. This will be
> done in the next patch.
> 

In other words, the mere act of mirroring a guest will now be
guest-visible in that the guest is auto-throttled while waiting for the
mirroring to be written out.  It seems like you would want to be able to
opt in or out of this scheme.  Is it something that can be toggled
mid-operation (try asynchronous, and switch to synchronous if a timeout
elapses)?

> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 78 insertions(+)
> 

I'll leave the actual idea to others to review, because there may be
some ramifications that I'm not thinking of.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 3/9] mirror: optimize dirty bitmap filling in mirror_run a bit
  2016-06-15  2:36   ` Eric Blake
@ 2016-06-15  8:41     ` Denis V. Lunev
  2016-06-15 12:25       ` Eric Blake
  0 siblings, 1 reply; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-15  8:41 UTC (permalink / raw)
  To: Eric Blake, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

On 06/15/2016 05:36 AM, Eric Blake wrote:
> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>> There is no need to scan allocation tables if we have mark_all_dirty flag
>> set. Just mark it all dirty.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <famz@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Jeff Cody <jcody@redhat.com>
>> CC: Eric Blake <eblake@redhat.com>
>> ---
>>   block/mirror.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 797659d..c7b3639 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -507,12 +507,16 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>       BlockDriverState *base = s->base;
>>       BlockDriverState *bs = blk_bs(s->common.blk);
>>       BlockDriverState *target_bs = blk_bs(s->target);
>> -    bool mark_all_dirty = base == NULL && !bdrv_has_zero_init(target_bs);
>>       uint64_t last_pause_ns;
>>       int ret, n;
>>   
>>       end = s->bdev_length / BDRV_SECTOR_SIZE;
>>   
>> +    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>> +        bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
> Won't work as written.  'end' is 64 bits, but bdrv_set_dirty_bitmap() is
> limited to a 32-bit sector count.  Might be first worth updating
> bdrv_set_dirty_bitmap() and friends to be byte-based rather than
> sector-based (but still tracking a sector per bit, obviously), as well
> as expand it to operate on 64-bit sizes rather than 32-bit.
very nice catch! thank you
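
Until the bitmap API is converted to byte-based/64-bit counts, a chunked
loop like this sketch should work:

    /* sketch: mark everything dirty in INT_MAX-sized pieces */
    int64_t sector_num = 0;
    while (sector_num < end) {
        int n = MIN(end - sector_num, INT_MAX);
        bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n);
        sector_num += n;
    }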

> I'm also worried slightly that the existing code repeated things in a
> loop, and therefore had pause points every iteration and could thus
> remain responsive to an early cancel.  But doing the entire operation in
> one chunk (assuming you fix bitmap code to handle a 64-bit size) may end
> up running for so long without interruption that you lose the benefits
> of an early interruption that you have by virtue of a 32-bit limit.
>
I do not think this actually needs to be worried about. We just perform
a memset over a not-that-big area (a 1 Tb disk will have a 2 Mb dirty
area bitmap) under the default parameters.
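
With the default 64 Kb mirror granularity the arithmetic is roughly:

    1 Tb / 64 Kb per bit = 2^24 bits = 2 Mb of bitmap to memset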

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target
  2016-06-15  3:00   ` Eric Blake
@ 2016-06-15  8:46     ` Denis V. Lunev
  2016-06-15 12:34       ` Eric Blake
  0 siblings, 1 reply; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-15  8:46 UTC (permalink / raw)
  To: Eric Blake, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

On 06/15/2016 06:00 AM, Eric Blake wrote:
> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>> With a bdrv_co_write_zeroes method on a target BDS zeroes will not be placed
>> into the wire. Thus the target could be very efficiently zeroed out. This
>> is should be done with the largest chunk possible.
>>
>> This improves the performance of the live migration of the empty disk by
>> 150 times if NBD supports write_zeroes.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <famz@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Jeff Cody <jcody@redhat.com>
>> CC: Eric Blake <eblake@redhat.com>
>> ---
>>   block/mirror.c | 32 +++++++++++++++++++++++++++++++-
>>   1 file changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index c7b3639..c2f8773 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -21,6 +21,7 @@
>>   #include "qemu/ratelimit.h"
>>   #include "qemu/bitmap.h"
>>   
>> +#define MIRROR_ZERO_CHUNK   (3u << (29 - BDRV_SECTOR_BITS))      /* 1.5 Gb */
> Probably nicer to track this in bytes.  And do you really want a
> hard-coded arbitrary limit, or is it better to live with
> MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
Unfortunately we should. INT_MAX is not aligned as required.
Maybe we should align INT_MAX properly to fulfill the
write_zeroes alignment.

Hmm, maybe we can align INT_MAX down properly. OK,
I'll try to do that gracefully.
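
Roughly like this, as a sketch of the idea:

    /* clamp each zeroing chunk to INT_MAX rounded down to the
     * granularity so it stays aligned for write_zeroes */
    int max_sectors = QEMU_ALIGN_DOWN(INT_MAX, sectors_per_chunk);
    int nb_sectors = MIN(max_sectors, end - sector_num);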

>> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>   
>>       end = s->bdev_length / BDRV_SECTOR_SIZE;
>>   
>> -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
>> +            target_bs->drv->bdrv_co_write_zeroes == NULL) {
> Indentation is off, although if checkpatch.pl doesn't complain I guess
> it doesn't matter that much.
>
> Why should you care whether the target_bs->drv implements a callback?
> Can't you just rely on the normal bdrv_*() functions to do the dirty
> work of picking the most efficient implementation without you having to
> bypass the block layer?  In fact, isn't that the whole goal of
> bdrv_make_zero() - why not call that instead of reimplementing it?
This is the idea of the patch, actually. If the callback is not
implemented, the zeroes will actually be written or sent over the wire.
In that case there is not much sense in doing this: the amount of data
actually written would increase significantly (some areas would be
written twice - first with zeroes and then with the actual data).

On the other hand, if the callback is implemented, very little data goes
over the wire or is actually written, and thus we get a benefit. I am
trying to avoid very small chunks of data. Here (during the migration
process) the data is sent in 10 Mb chunks, and that takes a LOT of time
with NBD. We can currently send 1.5 Gb chunks instead; they occupy the
same 26 bytes on the transport layer.

> Patch needs rebasing - we've redone this into bdrv_co_pwrite_zeroes and
> a byte interface, since upstream commit c1499a5e.
sure!

>>           bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
>>           return 0;
>>       }
>> @@ -546,6 +548,34 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>           }
>>           sector_num += n;
>>       }
>> +
>> +    if (base != NULL || bdrv_has_zero_init(target_bs)) {
> You're now repeating the conditional that used to be 'bool
> mark_all_dirty' (well, this is !mark_all_dirty); is it worth keeping the
> simpler bool around?
Not quite. The difference is in the presence of the callback,
but sure, I can cache it. No problem.

>> +        /* no need to zero out entire disk */
>> +        return 0;
>> +    }
>> +
>> +    for (sector_num = 0; sector_num < end; ) {
>> +        int nb_sectors = MIN(MIRROR_ZERO_CHUNK, end - sector_num);
> Why limit yourself to 1.5G? It's either too small for what you can
> really do, or too large for what the device permits.  See my above
> comment about MIN_NON_ZERO.
alignment, covered above

>> +        int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>> +
>> +        if (now - last_pause_ns > SLICE_TIME) {
>> +            last_pause_ns = now;
>> +            block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, 0);
>> +        }
>> +
>> +        if (block_job_is_cancelled(&s->common)) {
>> +            return -EINTR;
>> +        }
>> +
>> +        if (s->in_flight == MAX_IN_FLIGHT) {
>> +            trace_mirror_yield(s, s->in_flight, s->buf_free_count, -1);
>> +            mirror_wait_for_io(s);
>> +            continue;
>> +        }
> Hmm - I guess your mirror yield points are why you couldn't just
> directly use bdrv_make_zero(); but is that something where some code
> refactoring can share more code rather than duplicating it?
The purpose is to put several requests on the wire in parallel.
The original mirror code does this nicely and is thus reused.

>> +
>> +        mirror_do_zero_or_discard(s, sector_num, nb_sectors, false);
>> +        sector_num += nb_sectors;
>> +    }
>>       return 0;
>>   }
>>   
>>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror
  2016-06-15  4:18   ` Eric Blake
@ 2016-06-15  8:52     ` Denis V. Lunev
  0 siblings, 0 replies; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-15  8:52 UTC (permalink / raw)
  To: Eric Blake, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

On 06/15/2016 07:18 AM, Eric Blake wrote:
> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>> Block commit of the active image to the backing store on a slow disk
>> could never end. For example with the guest with the following loop
>> inside
>>      while true; do
>>          dd bs=1k count=1 if=/dev/zero of=x
>>      done
>> running above slow storage could not complete the operation with a
> s/with/within/
>
>> resonable amount of time:
> s/resonable/reasonable/
>
>>      virsh blockcommit rhel7 sda --active --shallow
>>      virsh qemu-monitor-event
>>      virsh qemu-monitor-command rhel7 \
>>          '{"execute":"block-job-complete",\
>>            "arguments":{"device":"drive-scsi0-0-0-0"} }'
>>      virsh qemu-monitor-event
>> Completion event is never received.
>>
>> This problem could not be fixed easily with the current architecture. We
>> should either prohibit guest writes (making dirty bitmap dirty) or switch
>> to the sycnchronous scheme.
> s/sycnchronous/synchronous/
>
>> This patch implements the latter. It adds mirror_before_write_notify
>> callback. In this case all data written from the guest is synchnonously
> s/synchnonously/synchronously/
>
>> written to the mirror target. Though the problem is solved partially.
>> We should switch from bdrv_dirty_bitmap to simple hbitmap. This will be
>> done in the next patch.
>>
> In other words, the mere act of mirroring a guest will now be
> guest-visible in that the guest is auto-throttled while waiting for the
> mirroring to be written out.  It seems like you would want to be able to
> opt in or out of this scheme.  Is it something that can be toggled
> mid-operation (try asynchronous, and switch to synchronous if a timeout
> elapses)?
>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <famz@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Jeff Cody <jcody@redhat.com>
>> CC: Eric Blake <eblake@redhat.com>
>> ---
>>   block/mirror.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 78 insertions(+)
>>
> I'll leave the actual idea to others to review, because there may be
> some ramifications that I'm not thinking of.
>
I would like to start the discussion with this series.
Yes, maybe we need the policy and should switch
to the synchronous scheme after the first stage of mirroring
(when the 'complete' command is sent by the management
layer).

This could be done relatively easily on the basis of these
patches. Really. Though I want to obtain some general
acceptance in advance.

Den

P.S. Thank you very much for looking at this ;)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 0/9] major rework of drive-mirror
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
                   ` (8 preceding siblings ...)
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 9/9] mirror: replace bdrv_dirty_bitmap with plain hbitmap Denis V. Lunev
@ 2016-06-15  9:06 ` Kevin Wolf
  2016-06-15  9:34   ` Denis V. Lunev
  2016-06-15  9:50 ` Stefan Hajnoczi
  2016-06-15 11:09 ` Dr. David Alan Gilbert
  11 siblings, 1 reply; 40+ messages in thread
From: Kevin Wolf @ 2016-06-15  9:06 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: qemu-devel, qemu-block, vsementsov, Stefan Hajnoczi, Fam Zheng,
	Max Reitz, Jeff Cody, Eric Blake, pbonzini

Am 14.06.2016 um 17:25 hat Denis V. Lunev geschrieben:
> Block commit of the active image to the backing store on a slow disk
> could never end. For example with the guest with the following loop
> inside
>     while true; do
>         dd bs=1k count=1 if=/dev/zero of=x
>     done
> running above slow storage could not complete the operation with a
> resonable amount of time:
>     virsh blockcommit rhel7 sda --active --shallow
>     virsh qemu-monitor-event
>     virsh qemu-monitor-command rhel7 \
>         '{"execute":"block-job-complete",\
>           "arguments":{"device":"drive-scsi0-0-0-0"} }'
>     virsh qemu-monitor-event
> Completion event is never received.
> 
> This problem could not be fixed easily with the current architecture. We
> should either prohibit guest writes (making dirty bitmap dirty) or switch
> to the sycnchronous scheme.
> 
> This series switches driver mirror to synch scheme. Actually we can have
> something more intelligent and switch to sync mirroring just after
> the first pass over the bitmap. Though this could be done relatively
> easily during discussion. The most difficult things are here.
> 
> The set also adds some performance improvements dealing with
> known-to-be-zero areas.

I only read the cover letter and had a quick look at the patch doing the
actual switch, so this is by far not a real review, but I have a few
general comments anyway:


First of all, let's make sure we're all using the same terminology. In
past discussions about mirror modes, we distinguished active/passive and
synchronous/asynchronous.

* An active mirror mirrors requests immediately when they are made by
  the guest. A passive mirror just remembers that it needs to mirror
  something and does it whenever it wants.

* A synchronous mirror completes the guest request only after the data
  has successfully been written to both the live image and the target.
  An asynchronous one can complete the guest request before the mirror
  I/O has completed.

In these terms, the currently implemented mirror is a passive
asynchronous one. If I understand correctly, what you are doing in this
series is to convert it unconditionally to an active asynchronous one.


The "unconditionally" part is my first complaint: The active mirror does
potentially a lot more I/O, so it's not clear that you want to use it.
This should be the user's choice. (We always intended to add an active
mirror sooner or later, but so far nobody needed it desperately enough.)


The second big thing is that I don't want to see new users of the
notifiers in I/O functions. Let's try if we can't add a filter
BlockDriver instead. Then we'd add an option to set the filter node-name
in the mirror QMP command so that the management tool is aware of the
node and can refer to it.

If we don't do this now, we'll have to introduce it later and can't be
sure that the management tool knows about it. This would complicate
things quite a bit because we would have to make sure that the added
node stays invisible to the management tool.


I think these two things are the big architectural questions. The rest
is hopefully more or less implementation details.

Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk
  2016-06-15  3:20   ` Eric Blake
@ 2016-06-15  9:19     ` Stefan Hajnoczi
  2016-06-15 10:37       ` Denis V. Lunev
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2016-06-15  9:19 UTC (permalink / raw)
  To: Eric Blake
  Cc: Denis V. Lunev, qemu-devel, qemu-block, vsementsov, Fam Zheng,
	Kevin Wolf, Max Reitz, Jeff Cody

[-- Attachment #1: Type: text/plain, Size: 1092 bytes --]

On Tue, Jun 14, 2016 at 09:20:47PM -0600, Eric Blake wrote:
> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> > We should not take into account zero blocks for delay calculations.
> > They are not read and thus IO throttling is not required. In the
> > other case VM migration with 16 Tb QCOW2 disk with 4 Gb of data takes
> > days.
> > 
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > CC: Fam Zheng <famz@redhat.com>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Max Reitz <mreitz@redhat.com>
> > CC: Jeff Cody <jcody@redhat.com>
> > CC: Eric Blake <eblake@redhat.com>
> > ---
> >  block/mirror.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> Seems reasonable, but I'll let others more familiar with throttling give
> the final say.

There is a bounce buffer fallback when !drv->bdrv_co_pwrite_zeroes.  In
that case we need to account for the bytes transferred.  I don't see
where the patch takes this into account.
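
Something along these lines seems needed (a sketch only; 'is_zero_op' is
a made-up name and the exact condition may differ):

    /* account the transfer unless zeroes never hit the wire */
    if (!is_zero_op || target_bs->drv->bdrv_co_pwrite_zeroes == NULL) {
        delay_ns = ratelimit_calculate_delay(&s->limit, io_sectors);
    }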

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier Denis V. Lunev
  2016-06-15  4:07   ` Eric Blake
@ 2016-06-15  9:21   ` Stefan Hajnoczi
  2016-06-15  9:24     ` Denis V. Lunev
  2016-06-15  9:22   ` Stefan Hajnoczi
  2 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2016-06-15  9:21 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: qemu-devel, qemu-block, vsementsov, Fam Zheng, Kevin Wolf,
	Max Reitz, Jeff Cody, Eric Blake

[-- Attachment #1: Type: text/plain, Size: 671 bytes --]

On Tue, Jun 14, 2016 at 06:25:13PM +0300, Denis V. Lunev wrote:
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/io.c                | 12 +++++++-----
>  include/block/block_int.h |  1 +
>  2 files changed, 8 insertions(+), 5 deletions(-)

This patch is missing a commit description.  Why is this necessary?

If you add .qiov then can we remove the .bytes field?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier Denis V. Lunev
  2016-06-15  4:07   ` Eric Blake
  2016-06-15  9:21   ` Stefan Hajnoczi
@ 2016-06-15  9:22   ` Stefan Hajnoczi
  2 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2016-06-15  9:22 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: qemu-devel, qemu-block, vsementsov, Fam Zheng, Kevin Wolf,
	Max Reitz, Jeff Cody, Eric Blake

[-- Attachment #1: Type: text/plain, Size: 671 bytes --]

On Tue, Jun 14, 2016 at 06:25:13PM +0300, Denis V. Lunev wrote:
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/io.c                | 12 +++++++-----
>  include/block/block_int.h |  1 +
>  2 files changed, 8 insertions(+), 5 deletions(-)

I read Eric's reply and realized the duplication with my earlier reply.
Feel free to ignore mine and just respond to Eric.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier
  2016-06-15  9:21   ` Stefan Hajnoczi
@ 2016-06-15  9:24     ` Denis V. Lunev
  0 siblings, 0 replies; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-15  9:24 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, vsementsov, Fam Zheng, Kevin Wolf,
	Max Reitz, Jeff Cody, Eric Blake

On 06/15/2016 12:21 PM, Stefan Hajnoczi wrote:
> On Tue, Jun 14, 2016 at 06:25:13PM +0300, Denis V. Lunev wrote:
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <famz@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Jeff Cody <jcody@redhat.com>
>> CC: Eric Blake <eblake@redhat.com>
>> ---
>>   block/io.c                | 12 +++++++-----
>>   include/block/block_int.h |  1 +
>>   2 files changed, 8 insertions(+), 5 deletions(-)
> This patch is missing a commit description.  Why is this necessary?
>
> If you add .qiov then can we remove the .bytes field?
That would not be convenient, unfortunately.
Here I have a pointer to the qiov, and this pointer is
not always available. Actually, it is available
only for write operations.

The purpose is to make the data being written
available to the notifier.

Den

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 0/9] major rework of drive-mirror
  2016-06-15  9:06 ` [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Kevin Wolf
@ 2016-06-15  9:34   ` Denis V. Lunev
  2016-06-15 10:25     ` Kevin Wolf
  0 siblings, 1 reply; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-15  9:34 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-devel, qemu-block, vsementsov, Stefan Hajnoczi, Fam Zheng,
	Max Reitz, Jeff Cody, Eric Blake, pbonzini

On 06/15/2016 12:06 PM, Kevin Wolf wrote:
> Am 14.06.2016 um 17:25 hat Denis V. Lunev geschrieben:
>> Block commit of the active image to the backing store on a slow disk
>> could never end. For example with the guest with the following loop
>> inside
>>      while true; do
>>          dd bs=1k count=1 if=/dev/zero of=x
>>      done
>> running above slow storage could not complete the operation with a
>> resonable amount of time:
>>      virsh blockcommit rhel7 sda --active --shallow
>>      virsh qemu-monitor-event
>>      virsh qemu-monitor-command rhel7 \
>>          '{"execute":"block-job-complete",\
>>            "arguments":{"device":"drive-scsi0-0-0-0"} }'
>>      virsh qemu-monitor-event
>> Completion event is never received.
>>
>> This problem could not be fixed easily with the current architecture. We
>> should either prohibit guest writes (making dirty bitmap dirty) or switch
>> to the sycnchronous scheme.
>>
>> This series switches driver mirror to synch scheme. Actually we can have
>> something more intelligent and switch to sync mirroring just after
>> the first pass over the bitmap. Though this could be done relatively
>> easily during discussion. The most difficult things are here.
>>
>> The set also adds some performance improvements dealing with
>> known-to-be-zero areas.
> I only read the cover letter and had a quick look at the patch doing the
> actual switch, so this is by far not a real review, but I have a few
> general comments anyway:
>
>
> First of all, let's make sure we're all using the same terminology. In
> past discussions about mirror modes, we distinguished active/passive and
> synchronous/asynchronous.
>
> * An active mirror mirrors requests immediately when they are made by
>    the guest. A passive mirror just remembers that it needs to mirror
>    something and does it whenever it wants.
>
> * A synchronous mirror completes the guest request only after the data
>    has successfully been written to both the live image and the target.
>    An asynchronous one can complete the guest request before the mirror
>    I/O has completed.
>
> In these terms, the currently implemented mirror is a passive
> asynchronous one. If I understand correctly, what you are doing in this
> series is to convert it unconditionally to an active asynchronous one.
>
>
> The "unconditionally" part is my first complaint: The active mirror does
> potentially a lot more I/O, so it's not clear that you want to use it.
> This should be the user's choice. (We always intended to add an active
> mirror sooner or later, but so far nobody needed it desperately enough.)
this sounds reasonable


>
> The second big thing is that I don't want to see new users of the
> notifiers in I/O functions. Let's try if we can't add a filter
> BlockDriver instead. Then we'd add an option to set the filter node-name
> in the mirror QMP command so that the management tool is aware of the
> node and can refer to it.
This will be much more difficult to implement, in my experience. Can you
share more details why filters are bad?

> If we don't do this now, we'll have to introduce it later and can't be
> sure that the management tool knows about it. This would complicate
> things quite a bit because we would have to make sure that the added
> node stays invisible to the management tool.
>
>
> I think these two things are the big architectural questions. The rest
> is hopefully more or less implementation details.
I completely agree with you.

We have the following choices:
1. implement parameter to use 'active'/'passive' mode from the very beginning
2. switch to 'active' mode upon receiving "block-job-complete" command unconditionally
3. switch to 'active' mode upon receiving "block-job-complete" command with proper parameter
4. switch to 'active' mode after timeout (I personally do not like this option)

I think that choices 1 and 3 do not contradict each other and
could be implemented together.

Den

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror
  2016-06-14 15:25 ` [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror Denis V. Lunev
  2016-06-15  4:18   ` Eric Blake
@ 2016-06-15  9:48   ` Stefan Hajnoczi
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2016-06-15  9:48 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: qemu-devel, qemu-block, vsementsov, Fam Zheng, Kevin Wolf,
	Max Reitz, Jeff Cody, Eric Blake

[-- Attachment #1: Type: text/plain, Size: 6084 bytes --]

On Tue, Jun 14, 2016 at 06:25:15PM +0300, Denis V. Lunev wrote:
> Block commit of the active image to the backing store on a slow disk
> could never end. For example with the guest with the following loop
> inside
>     while true; do
>         dd bs=1k count=1 if=/dev/zero of=x
>     done
> running above slow storage could not complete the operation with a
> resonable amount of time:
>     virsh blockcommit rhel7 sda --active --shallow
>     virsh qemu-monitor-event
>     virsh qemu-monitor-command rhel7 \
>         '{"execute":"block-job-complete",\
>           "arguments":{"device":"drive-scsi0-0-0-0"} }'
>     virsh qemu-monitor-event
> Completion event is never received.
> 
> This problem could not be fixed easily with the current architecture. We
> should either prohibit guest writes (making dirty bitmap dirty) or switch
> to the sycnchronous scheme.
> 
> This patch implements the latter. It adds mirror_before_write_notify
> callback. In this case all data written from the guest is synchnonously
> written to the mirror target. Though the problem is solved partially.
> We should switch from bdrv_dirty_bitmap to simple hbitmap. This will be
> done in the next patch.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 78 insertions(+)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 7471211..086256c 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -58,6 +58,9 @@ typedef struct MirrorBlockJob {
>      QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
>      int buf_free_count;
>  
> +    NotifierWithReturn before_write;
> +    CoQueue dependent_writes;
> +
>      unsigned long *in_flight_bitmap;
>      int in_flight;
>      int sectors_in_flight;
> @@ -125,6 +128,7 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>      g_free(op->buf);

qemu_vfree() must be used for qemu_blockalign() memory.

>      g_free(op);
>  
> +    qemu_co_queue_restart_all(&s->dependent_writes);
>      if (s->waiting_for_io) {
>          qemu_coroutine_enter(s->common.co, NULL);
>      }
> @@ -511,6 +515,74 @@ static void mirror_exit(BlockJob *job, void *opaque)
>      bdrv_unref(src);
>  }
>  
> +static int coroutine_fn mirror_before_write_notify(
> +        NotifierWithReturn *notifier, void *opaque)
> +{
> +    MirrorBlockJob *s = container_of(notifier, MirrorBlockJob, before_write);
> +    BdrvTrackedRequest *req = opaque;
> +    MirrorOp *op;
> +    int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> +    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
> +    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
> +    int64_t end_sector = sector_num + nb_sectors;
> +    int64_t aligned_start, aligned_end;
> +
> +    if (req->type != BDRV_TRACKED_DISCARD && req->type != BDRV_TRACKED_WRITE) {
> +        /* this is not discard and write, we do not care */
> +        return 0;
> +    }
> +
> +    while (1) {
> +        bool waited = false;
> +        int64_t sn;
> +
> +        for (sn = sector_num; sn < end_sector; sn += sectors_per_chunk) {
> +            int64_t chunk = sn / sectors_per_chunk;
> +            if (test_bit(chunk, s->in_flight_bitmap)) {
> +                trace_mirror_yield_in_flight(s, chunk, s->in_flight);
> +                qemu_co_queue_wait(&s->dependent_writes);
> +                waited = true;
> +            }
> +        }
> +
> +        if (!waited) {
> +            break;
> +        }
> +    }
> +
> +    aligned_start = QEMU_ALIGN_UP(sector_num, sectors_per_chunk);
> +    aligned_end = QEMU_ALIGN_DOWN(sector_num + nb_sectors, sectors_per_chunk);
> +    if (aligned_end > aligned_start) {
> +        bdrv_reset_dirty_bitmap(s->dirty_bitmap, aligned_start,
> +                                aligned_end - aligned_start);
> +    }
> +
> +    if (req->type == BDRV_TRACKED_DISCARD) {
> +        mirror_do_zero_or_discard(s, sector_num, nb_sectors, true);
> +        return 0;
> +    }
> +
> +    s->in_flight++;
> +    s->sectors_in_flight += nb_sectors;
> +
> +    /* Allocate a MirrorOp that is used as an AIO callback.  */
> +    op = g_new(MirrorOp, 1);
> +    op->s = s;
> +    op->sector_num = sector_num;
> +    op->nb_sectors = nb_sectors;
> +    op->buf = qemu_try_blockalign(blk_bs(s->target), req->qiov->size);
> +    if (op->buf == NULL) {
> +        g_free(op);
> +        return -ENOMEM;
> +    }
> +    qemu_iovec_init(&op->qiov, req->qiov->niov);
> +    qemu_iovec_clone(&op->qiov, req->qiov, op->buf);

Now op->qiov's iovec[] array is equivalent to req->qiov but points to
op->buf.  But you never copied the data from req->qiov to op->buf so
junk gets written to the target!
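
Presumably a single copy right after the clone would fix it; a sketch:

    /* copy the guest data into the bounce buffer backing op->qiov */
    qemu_iovec_to_buf(req->qiov, 0, op->buf, req->qiov->size);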

> +    blk_aio_pwritev(s->target, req->offset, &op->qiov, 0,
> +                    mirror_write_complete, op);
> +    return 0;
> +}

The commit message and description claim this is synchronous but it is
not.  Async requests are still being generated by guest I/O.  There is no
rate limiting if s->target is slower than bs.  In that case the backlog of
queued AIO requests keeps growing (including the bounce buffers).  The
guest will exhaust host memory, or aio functions will fail (i.e. the
Linux AIO maximum request count is reached).

If you want this to be synchronous you have to yield the coroutine until
the request completes.  Synchronous writes increase latency so this
cannot be the new default.
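
For illustration only, a minimal sketch (with made-up names; this is not
code from the series) of what waiting for the target write could look
like:

    /* hypothetical: make the guest write wait for the mirror write */
    typedef struct {
        Coroutine *co;   /* coroutine of the guest request to wake up */
        int ret;
        bool done;
    } SyncMirrorWrite;

    static void sync_mirror_write_cb(void *opaque, int ret)
    {
        SyncMirrorWrite *w = opaque;
        w->ret = ret;
        w->done = true;
        qemu_coroutine_enter(w->co, NULL);  /* wake the yielded coroutine */
    }

    /* ...inside the before_write path, replacing the bare blk_aio_pwritev: */
    SyncMirrorWrite w = { .co = qemu_coroutine_self(), .done = false };
    blk_aio_pwritev(s->target, req->offset, &op->qiov, 0,
                    sync_mirror_write_cb, &w);
    while (!w.done) {
        qemu_coroutine_yield();  /* guest write blocks until the target write ends */
    }
    return w.ret;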

A different solution is to detect when the dirty bitmap reaches a
minimum threshold and then employ I/O throttling on bs.  That way the
guest experiences no vcpu/network downtime and the I/O performance only
drops during the convergence phase.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 0/9] major rework of drive-mirror
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
                   ` (9 preceding siblings ...)
  2016-06-15  9:06 ` [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Kevin Wolf
@ 2016-06-15  9:50 ` Stefan Hajnoczi
  2016-06-15 11:09 ` Dr. David Alan Gilbert
  11 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2016-06-15  9:50 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: qemu-devel, qemu-block, vsementsov, Fam Zheng, Kevin Wolf,
	Max Reitz, Jeff Cody, Eric Blake

[-- Attachment #1: Type: text/plain, Size: 1489 bytes --]

On Tue, Jun 14, 2016 at 06:25:07PM +0300, Denis V. Lunev wrote:
> Block commit of the active image to the backing store on a slow disk
> could never end. For example with the guest with the following loop
> inside
>     while true; do
>         dd bs=1k count=1 if=/dev/zero of=x
>     done
> running above slow storage could not complete the operation with a
> resonable amount of time:
>     virsh blockcommit rhel7 sda --active --shallow
>     virsh qemu-monitor-event
>     virsh qemu-monitor-command rhel7 \
>         '{"execute":"block-job-complete",\
>           "arguments":{"device":"drive-scsi0-0-0-0"} }'
>     virsh qemu-monitor-event
> Completion event is never received.
> 
> This problem could not be fixed easily with the current architecture. We
> should either prohibit guest writes (making dirty bitmap dirty) or switch
> to the sycnchronous scheme.
> 
> This series switches driver mirror to synch scheme. Actually we can have
> something more intelligent and switch to sync mirroring just after
> the first pass over the bitmap. Though this could be done relatively
> easily during discussion. The most difficult things are here.
> 
> The set also adds some performance improvements dealing with
> known-to-be-zero areas.

Patches 1-5 look pretty good to me.

I think the synchronous I/O mirroring approach will need more discussion
and it could be split into a separate series so the earlier patches can
be merged already.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 0/9] major rework of drive-mirror
  2016-06-15  9:34   ` Denis V. Lunev
@ 2016-06-15 10:25     ` Kevin Wolf
  2016-06-15 10:44       ` Denis V. Lunev
  0 siblings, 1 reply; 40+ messages in thread
From: Kevin Wolf @ 2016-06-15 10:25 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: qemu-devel, qemu-block, vsementsov, Stefan Hajnoczi, Fam Zheng,
	Max Reitz, Jeff Cody, Eric Blake, pbonzini

Am 15.06.2016 um 11:34 hat Denis V. Lunev geschrieben:
> On 06/15/2016 12:06 PM, Kevin Wolf wrote:
> >The second big thing is that I don't want to see new users of the
> >notifiers in I/O functions. Let's try if we can't add a filter
> >BlockDriver instead. Then we'd add an option to set the filter node-name
> >in the mirror QMP command so that the management tool is aware of the
> >node and can refer to it.
> This will be much more difficult to implement, in my experience.

I agree that it will be more difficult, though I'm not sure whether it's
much more.

> Can you share more details why filters are bad?

Basically, notifiers (and other code for supporting other features in
the common I/O path) just don't fit the block layer design very well.
They are more hacked on instead of designed in.

This shows in different ways in different places, so I can't fully tell
you the problems that using notifiers in the mirror code would cause
(which is a problem in itself), but for example the copy-on-read code
that has been added to the core block layer instead of a separate filter
driver had trouble with ignoring its own internal requests that it made
to the BDS, which was hacked around with a new request flag, i.e. making
the core block layer even more complicated.


An example that I can think of with respect to mirroring is taking a
snapshot during the operation. Op blockers prevent this from happening
at the moment, but it seems to be a very reasonable thing for a user to
want. Let me use the filter node approach to visualise this:

    (raw-posix)    (qcow2)
    source_file <- source <- mirror <- virtio-blk
                               |
    target_file <- target <----+

There are two ways to create a snapshot: Either you put it logically
between the mirror and virtio-blk, so that only source (now a backing
file), but no new writes will be mirrored. This is easy in both
approaches, but maybe the less commonly wanted thing.

The other option is putting the new overlay between source and mirror:

    (raw-posix)    (qcow2)
    source_file <- source <- snap <- mirror <- virtio-blk
                                       |
    target_file <- target <------------+

With the mirror intercept being its own BDS node, making this change is
very easy and doesn't involve any code that is specific to mirroring.

If it doesn't have a filter driver and uses notifiers, however, the
mirror is directly tied to the source BDS, and now it doesn't just work,
but you need extra code that explicitly moves the notifier from the
source BDS to the snap BDS. And you need such code not only for
mirroring, but for everything that potentially hooks into the I/O
functions.


Maybe these two examples give you an idea why I want to use the concepts
that are designed into the block layer with much thought rather than
hacked on notifiers. Notifiers may be simpler to implement in the first
place, but they lead to a less consistent and harder to maintain block
layer in the long run.
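
To make the filter idea a bit more concrete, the write path of such a
node could look roughly like this (a sketch only; MirrorFilterState and
the driver wiring are made up):

    /* hypothetical filter write: mirror to the target, then forward */
    static int coroutine_fn mirror_filter_co_writev(BlockDriverState *bs,
                                                    int64_t sector_num,
                                                    int nb_sectors,
                                                    QEMUIOVector *qiov)
    {
        MirrorFilterState *s = bs->opaque;  /* hypothetical per-node state */
        int ret;

        /* write to the mirror target first */
        ret = bdrv_co_writev(s->target->bs, sector_num, nb_sectors, qiov);
        if (ret < 0) {
            return ret;
        }
        /* then forward the request to the filtered source node */
        return bdrv_co_writev(bs->file->bs, sector_num, nb_sectors, qiov);
    }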

> >If we don't do this now, we'll have to introduce it later and can't be
> >sure that the management tool knows about it. This would complicate
> >things quite a bit because we would have to make sure that the added
> >node stays invisible to the management tool.
> >
> >
> >I think these two things are the big architectural questions. The rest
> >is hopefully more or less implementation details.
> I completely agree with you.
> 
> We have the following choices:
> 1. implement parameter to use 'active'/'passive' mode from the very
> beginning
> 2. switch to 'active' mode upon receiving "block-job-complete"
> command unconditionally
> 3. switch to 'active' mode upon receiving "block-job-complete"
> command with proper parameter
> 4. switch to 'active' mode after timeout (I personally do not like
> this option)
> 
> I think that choices 1 and 3 do not contradict each other and
> could be implemented together.

I think we definitely want 1. and some way to switch after the fact.
That you suggest three different conditions for doing the switch
suggests that it is policy and doesn't belong in qemu, but in the
management tools. So how about complementing 1. with 5.?

5. Provide a QMP command to switch between active and passive mode

Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk
  2016-06-15  9:19     ` Stefan Hajnoczi
@ 2016-06-15 10:37       ` Denis V. Lunev
  2016-06-16 10:10         ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-15 10:37 UTC (permalink / raw)
  To: Stefan Hajnoczi, Eric Blake
  Cc: Kevin Wolf, vsementsov, Fam Zheng, qemu-block, Jeff Cody,
	qemu-devel, Max Reitz, Denis V. Lunev

On 06/15/2016 12:19 PM, Stefan Hajnoczi wrote:
> On Tue, Jun 14, 2016 at 09:20:47PM -0600, Eric Blake wrote:
>> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>>> We should not take into account zero blocks for delay calculations.
>>> They are not read and thus IO throttling is not required. In the
>>> other case VM migration with 16 Tb QCOW2 disk with 4 Gb of data takes
>>> days.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> CC: Fam Zheng <famz@redhat.com>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Max Reitz <mreitz@redhat.com>
>>> CC: Jeff Cody <jcody@redhat.com>
>>> CC: Eric Blake <eblake@redhat.com>
>>> ---
>>>   block/mirror.c | 7 +++++--
>>>   1 file changed, 5 insertions(+), 2 deletions(-)
>> Seems reasonable, but I'll let others more familiar with throttling give
>> the final say.
> There is a bounce buffer fallback when !drv->bdrv_co_pwrite_zeroes.  In
> that case we need to account for the bytes transferred.  I don't see
> where the patch takes this into account.
Interesting point.

I think we are charging for the IO performed. Thus, in the
absence of the callback, we should charge for io_sectors/2.
The write will be full scale; there is no reading.

Correct?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 0/9] major rework of drive-mirror
  2016-06-15 10:25     ` Kevin Wolf
@ 2016-06-15 10:44       ` Denis V. Lunev
  0 siblings, 0 replies; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-15 10:44 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-devel, qemu-block, vsementsov, Stefan Hajnoczi, Fam Zheng,
	Max Reitz, Jeff Cody, Eric Blake, pbonzini

On 06/15/2016 01:25 PM, Kevin Wolf wrote:
> Am 15.06.2016 um 11:34 hat Denis V. Lunev geschrieben:
>> On 06/15/2016 12:06 PM, Kevin Wolf wrote:
>>> The second big thing is that I don't want to see new users of the
>>> notifiers in I/O functions. Let's try if we can't add a filter
>>> BlockDriver instead. Then we'd add an option to set the filter node-name
>>> in the mirror QMP command so that the management tool is aware of the
>>> node and can refer to it.
>> This will be much more difficult to implement, in my experience.
> I agree that it will be more difficult, though I'm not sure whether it's
> much more.
>
>> Can you share more details why filters are bad?
> Basically, notifiers (and other code for supporting other features in
> the common I/O path) just don't fit the block layer design very well.
> They are more hacked on instead of designed in.
>
> This shows in different ways in different places, so I can't fully tell
> you the problems that using notifiers in the mirror code would cause
> (which is a problem in itself), but for example the copy-on-read code
> that has been added to the core block layer instead of a separate filter
> driver had trouble with ignoring its own internal requests that it made
> to the BDS, which was hacked around with a new request flag, i.e. making
> the core block layer even more complicated.
>
>
> An example that I can think of with respect to mirroring is taking a
> snapshot during the operation. Op blockers prevent this from happening
> at the moment, but it seems to be a very reasonable thing for a user to
> want. Let me use the filter node approach to visualise this:
>
>      (raw-posix)    (qcow2)
>      source_file <- source <- mirror <- virtio-blk
>                                 |
>      target_file <- target <----+
>
> There are two ways to create a snapshot: Either you put it logically
> between the mirror and virtio-blk, so that only source (now a backing
> file), but no new writes will be mirrored. This is easy in both
> approaches, but maybe the less commonly wanted thing.
>
> The other option is putting the new overlay between source and mirror:
>
>      (raw-posix)    (qcow2)
>      source_file <- source <- snap <- mirror <- virtio-blk
>                                         |
>      target_file <- target <------------+
>
> With the mirror intercept being its own BDS node, making this change is
> very easy and doesn't involve any code that is specific to mirroring.
>
> If it doesn't have a filter driver and uses notifiers, however, the
> mirror is directly tied to the source BDS, and now it doesn't just work,
> but you need extra code that explicitly moves the notifier from the
> source BDS to the snap BDS. And you need such code not only for
> mirroring, but for everything that potentially hooks into the I/O
> functions.
>
>
> Maybe these two examples give you an idea why I want to use the concepts
> that are designed into the block layer with much thought rather than
> hacked on notifiers. Notifiers may be simpler to implement in the first
> place, but they lead to a less consistent and harder to maintain block
> layer in the long run.
OK, great explanation, thank you.


>>> If we don't do this now, we'll have to introduce it later and can't be
>>> sure that the management tool knows about it. This would complicate
>>> things quite a bit because we would have to make sure that the added
>>> node stays invisible to the management tool.
>>>
>>>
>>> I think these two things are the big architectural questions. The rest
>>> is hopefully more or less implementation details.
>> I completely agree with you.
>>
>> We have the following choices:
>> 1. implement parameter to use 'active'/'passive' mode from the very
>> beginning
>> 2. switch to 'active' mode upon receiving "block-job-complete"
>> command unconditionally
>> 3. switch to 'active' mode upon receiving "block-job-complete"
>> command with proper parameter
>> 4. switch to 'active' mode after timeout (I personally do not like
>> this option)
>>
>> I think that choices 1 and 3 do not contradict each other and
>> could be implemented together.
> I think we definitely want 1. and some way to switch after the fact.
> That you suggest three different conditions for doing the switch
> suggests that it is policy and doesn't belong in qemu, but in the
> management tools. So how about complementing 1. with 5.?
>
> 5. Provide a QMP command to switch between active and passive mode
Reasonable. Looks OK to me.

Den

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 0/9] major rework of drive-mirror
  2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
                   ` (10 preceding siblings ...)
  2016-06-15  9:50 ` Stefan Hajnoczi
@ 2016-06-15 11:09 ` Dr. David Alan Gilbert
  11 siblings, 0 replies; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-15 11:09 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: qemu-devel, qemu-block, Kevin Wolf, vsementsov, Fam Zheng,
	Jeff Cody, Max Reitz, Stefan Hajnoczi

* Denis V. Lunev (den@openvz.org) wrote:
> Block commit of the active image to the backing store on a slow disk
> could never end. For example with the guest with the following loop
> inside
>     while true; do
>         dd bs=1k count=1 if=/dev/zero of=x
>     done
> running above slow storage could not complete the operation with a
> resonable amount of time:
>     virsh blockcommit rhel7 sda --active --shallow
>     virsh qemu-monitor-event
>     virsh qemu-monitor-command rhel7 \
>         '{"execute":"block-job-complete",\
>           "arguments":{"device":"drive-scsi0-0-0-0"} }'
>     virsh qemu-monitor-event
> Completion event is never received.
> 
> This problem could not be fixed easily with the current architecture. We
> should either prohibit guest writes (making dirty bitmap dirty) or switch
> to the sycnchronous scheme.
> 
> This series switches driver mirror to synch scheme. Actually we can have
> something more intelligent and switch to sync mirroring just after
> the first pass over the bitmap. Though this could be done relatively
> easily during discussion. The most difficult things are here.

Some random thoughts:
   a) People have been asking for post-copy for block storage, and this
solves the same problem a different way.
   b) I guess the other way is to turn on write-throttling - that would
also guarantee the rate at which writes happen.
   c) Don't you end up with something very much like what the COLO block
replication guys have ended up with?

Dave

> The set also adds some performance improvements dealing with
> known-to-be-zero areas.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> 
> Denis V. Lunev (9):
>   mirror: fix calling of blk_aio_pwritev/blk_aio_preadv
>   mirror: create mirror_dirty_init helper for mirror_run
>   mirror: optimize dirty bitmap filling in mirror_run a bit
>   mirror: efficiently zero out target
>   mirror: improve performance of mirroring of empty disk
>   block: pass qiov into before_write notifier
>   mirror: allow to save buffer for QEMUIOVector in MirrorOp
>   mirror: use synch scheme for drive mirror
>   mirror: replace bdrv_dirty_bitmap with plain hbitmap
> 
>  block/io.c                |  12 +-
>  block/mirror.c            | 290 +++++++++++++++++++++++++++++++++++-----------
>  include/block/block_int.h |   1 +
>  3 files changed, 229 insertions(+), 74 deletions(-)
> 
> -- 
> 2.5.0
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 3/9] mirror: optimize dirty bitmap filling in mirror_run a bit
  2016-06-15  8:41     ` Denis V. Lunev
@ 2016-06-15 12:25       ` Eric Blake
  0 siblings, 0 replies; 40+ messages in thread
From: Eric Blake @ 2016-06-15 12:25 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody


On 06/15/2016 02:41 AM, Denis V. Lunev wrote:
> On 06/15/2016 05:36 AM, Eric Blake wrote:
>> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>>> There is no need to scan allocation tables if we have the
>>> mark_all_dirty flag set. Just mark it all dirty.
>>>

>>>       int ret, n;
>>>         end = s->bdev_length / BDRV_SECTOR_SIZE;
>>>   +    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>>> +        bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
>> Won't work as written.  'end' is 64 bits, but bdrv_set_dirty_bitmap() is
>> limited to a 32-bit sector count.  Might be first worth updating
>> bdrv_set_dirty_bitmap() and friends to be byte-based rather than
>> sector-based (but still tracking a sector per bit, obviously), as well
>> as expand it to operate on 64-bit sizes rather than 32-bit.
> very nice catch! thank you
> 
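
For what it's worth, even before converting the dirty bitmap API to
64-bit byte-based calls, the immediate overflow can be avoided by
looping in chunks. A minimal sketch, assuming the current sector-based
signature with its 32-bit sector count:

int64_t sector = 0;

while (sector < end) {
    /* clamp each call to what the 32-bit interface can take */
    int nb_sectors = MIN(end - sector, INT_MAX);

    bdrv_set_dirty_bitmap(s->dirty_bitmap, sector, nb_sectors);
    sector += nb_sectors;
}

The byte-based 64-bit conversion is still the nicer long-term fix.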

[meta-comment - your reply style is a bit hard to read. It's best to
include a blank line both before and after any text you write when
replying in context, as the extra spacing calls visual attention to your
reply rather than being part of a wall of text]

>> I'm also worried slightly that the existing code repeated things in a
>> loop, and therefore had pause points every iteration and could thus
>> remain responsive to an early cancel.  But doing the entire operation in
>> one chunk (assuming you fix bitmap code to handle a 64-bit size) may end
>> up running for so long without interruption that you lose the benefits
>> of an early interruption that you have by virtue of a 32-bit limit.
>>
> I do not think this should actually be a concern. We just perform a
> memset over a fairly small area (a 1 TB disk has a 2 MB dirty area
> bitmap) under default parameters.
> 
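
(The arithmetic checks out, assuming the default 64k mirror
granularity: 1 TB / 64 KB = 16M chunks, i.e. 16M bits = 2 MB of
bitmap, so the memset itself is indeed cheap.)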

Okay, so the real slowdown in the loop was the rest of the code
(checking backing status in bdrv_is_allocated_above()), not in
actually writing to the bitmap (bdrv_set_dirty_bitmap()).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target
  2016-06-15  8:46     ` Denis V. Lunev
@ 2016-06-15 12:34       ` Eric Blake
  2016-06-15 13:18         ` Denis V. Lunev
  2016-07-06 14:33         ` Denis V. Lunev
  0 siblings, 2 replies; 40+ messages in thread
From: Eric Blake @ 2016-06-15 12:34 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody


On 06/15/2016 02:46 AM, Denis V. Lunev wrote:
> On 06/15/2016 06:00 AM, Eric Blake wrote:
>> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>>> With a bdrv_co_write_zeroes method on a target BDS, zeroes will not
>>> be sent over the wire. Thus the target can be zeroed out very
>>> efficiently. This should be done with the largest chunk possible.
>>>

>> Probably nicer to track this in bytes.  And do you really want a
>> hard-coded arbitrary limit, or is it better to live with
>> MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
> unfortunately we should. INT_MAX is not aligned as required.
> Maybe we should align INT_MAX properly to fulfill the
> write_zeroes alignment.
> 
> Hmm, maybe we can align INT_MAX down properly. OK,
> I'll try to do that gracefully.

It's fairly easy to round a max_transfer or max_pwrite_zeroes down to an
aligned value; we already have code in io.c that does that in
bdrv_co_do_pwrite_zeroes().
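
Roughly along these lines (just a sketch, and assuming the byte-based
limit and alignment fields from the in-flight block limits work):

int max = MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX);
int alignment = target_bs->bl.pwrite_zeroes_alignment;

if (alignment) {
    /* round down so every request keeps the required alignment */
    max = QEMU_ALIGN_DOWN(max, alignment);
}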

> 
>>> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>>         end = s->bdev_length / BDRV_SECTOR_SIZE;
>>>   -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>>> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
>>> +            target_bs->drv->bdrv_co_write_zeroes == NULL) {
>> Indentation is off, although if checkpatch.pl doesn't complain I guess
>> it doesn't matter that much.
>>
>> Why should you care whether the target_bs->drv implements a callback?
>> Can't you just rely on the normal bdrv_*() functions to do the dirty
>> work of picking the most efficient implementation without you having to
>> bypass the block layer?  In fact, isn't that the whole goal of
>> bdrv_make_zero() - why not call that instead of reimplementing it?
>> this is the idea of the patch, actually. If the callback is not
>> implemented, we will have zeroes actually written or sent over the
>> wire. In that case there is not much sense in doing this: the amount
>> of data actually written will be significantly increased (some areas
>> will be written twice - with zeroes and with the actual data).
>

But that's where bdrv_can_write_zeroes_with_unmap() comes in handy - you
can use the public interface to learn whether bdrv_make_zero() will be
efficient or not, without having to probe what the backend supports.

>> On the other hand, if the callback is implemented, we will have a
>> very small amount of data on the wire and actually written, and thus
>> will have a benefit. I am trying to avoid very small chunks of data.
>> Here (during the migration process) the data is sent in 10 MB chunks,
>> which takes a LOT of time with NBD. We can (currently) send chunks of
>> 1.5 GB. They occupy the same 26 bytes of data on the transport layer.

I agree that we don't want to pre-initialize the device to zero unless
write zeroes is an efficient operation, but I don't think that the
existence of bs->drv->bdrv_co_[p]write_zeroes is the right way to find
that out.

I also think that we need to push harder on the NBD list that under the
new block limits proposal, we WANT to be able to advertise when the new
NBD_CMD_WRITE_ZEROES command will accept a larger size than
NBD_CMD_WRITE (as currently written, the BLOCK_INFO extension proposal
states that if a server advertises a max transaction size to the client,
then the client must honor that size for all commands including
NBD_CMD_WRITE_ZEROES, which would mean your 1.5G request [or my proposed
2G - 4k request] is invalid and would have to be a bunch of 32M requests).
https://sourceforge.net/p/nbd/mailman/message/35081223/

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target
  2016-06-15 12:34       ` Eric Blake
@ 2016-06-15 13:18         ` Denis V. Lunev
  2016-07-06 14:33         ` Denis V. Lunev
  1 sibling, 0 replies; 40+ messages in thread
From: Denis V. Lunev @ 2016-06-15 13:18 UTC (permalink / raw)
  To: Eric Blake, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

On 06/15/2016 03:34 PM, Eric Blake wrote:
> On 06/15/2016 02:46 AM, Denis V. Lunev wrote:
>> On 06/15/2016 06:00 AM, Eric Blake wrote:
>>> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>>>> With a bdrv_co_write_zeroes method on a target BDS, zeroes will not
>>>> be sent over the wire. Thus the target can be zeroed out very
>>>> efficiently. This should be done with the largest chunk possible.
>>>>
>>> Probably nicer to track this in bytes.  And do you really want a
>>> hard-coded arbitrary limit, or is it better to live with
>>> MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
>> unfortunately we should. INT_MAX is not aligned as required.
>> Maybe we should align INT_MAX properly to fulfill the
>> write_zeroes alignment.
>>
>> Hmm, maybe we can align INT_MAX down properly. OK,
>> I'll try to do that gracefully.
> It's fairly easy to round a max_transfer or max_pwrite_zeroes down to an
> aligned value; we already have code in io.c that does that in
> bdrv_co_do_pwrite_zeroes().

OK.



>>>> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>>>          end = s->bdev_length / BDRV_SECTOR_SIZE;
>>>>    -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>>>> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
>>>> +            target_bs->drv->bdrv_co_write_zeroes == NULL) {
>>> Indentation is off, although if checkpatch.pl doesn't complain I guess
>>> it doesn't matter that much.
>>>
>>> Why should you care whether the target_bs->drv implements a callback?
>>> Can't you just rely on the normal bdrv_*() functions to do the dirty
>>> work of picking the most efficient implementation without you having to
>>> bypass the block layer?  In fact, isn't that the whole goal of
>>> bdrv_make_zero() - why not call that instead of reimplementing it?
>> this is the idea of the patch, actually. If the callback is not
>> implemented, we will have zeroes actually written or sent over the
>> wire. In that case there is not much sense in doing this: the amount
>> of data actually written will be significantly increased (some areas
>> will be written twice - with zeroes and with the actual data).
>>
> But that's where bdrv_can_write_zeroes_with_unmap() comes in handy - you
> can use the public interface to learn whether bdrv_make_zero() will be
> efficient or not, without having to probe what the backend supports.

bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs)
{
     BlockDriverInfo bdi;

     if (bs->backing || !(bs->open_flags & BDRV_O_UNMAP)) {
         return false;
     }

     if (bdrv_get_info(bs, &bdi) == 0) {
         return bdi.can_write_zeroes_with_unmap;
     }

     return false;
}


This function looks rotten. We CAN efficiently zero out
QCOW2 images even with a backing store available. At the same time,
the availability of bdrv_co_write_zeroes does not
guarantee that it works (NFS, CIFS etc. for raw_posix.c).
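
If a probe is wanted at all, I would imagine something closer to this
(purely a sketch of the idea, not a proposal):

/* Hypothetical helper: the target can be zeroed efficiently if the
 * driver implements write_zeroes at all, or if unmap-based zeroing
 * is usable. Note this still cannot detect drivers whose callback
 * exists but fails at runtime with -ENOTSUP (NFS, CIFS etc. behind
 * raw_posix.c). */
static bool bdrv_target_zeroes_efficiently(BlockDriverState *bs)
{
    return bs->drv->bdrv_co_write_zeroes != NULL ||
           bdrv_can_write_zeroes_with_unmap(bs);
}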




>> On the other hand, if the callback is implemented, we will have a
>> very small amount of data on the wire and actually written, and thus
>> will have a benefit. I am trying to avoid very small chunks of data.
>> Here (during the migration process) the data is sent in 10 MB chunks,
>> which takes a LOT of time with NBD. We can (currently) send chunks of
>> 1.5 GB. They occupy the same 26 bytes of data on the transport layer.
> I agree that we don't want to pre-initialize the device to zero unless
> write zeroes is an efficient operation, but I don't think that the
> existence of bs->drv->bdrv_co_[p]write_zeroes is the right way to find
> that out.
>
> I also think that we need to push harder on the NBD list that under the
> new block limits proposal, we WANT to be able to advertise when the new
> NBD_CMD_WRITE_ZEROES command will accept a larger size than
> NBD_CMD_WRITE (as currently written, the BLOCK_INFO extension proposal
> states that if a server advertises a max transaction size to the client,
> then the client must honor that size for all commands including
> NBD_CMD_WRITE_ZEROES, which would mean your 1.5G request [or my proposed
> 2G - 4k request] is invalid and would have to be a bunch of 32M requests).
> https://sourceforge.net/p/nbd/mailman/message/35081223/
>


I see...

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk
  2016-06-15 10:37       ` Denis V. Lunev
@ 2016-06-16 10:10         ` Stefan Hajnoczi
  2016-06-17  2:53           ` Eric Blake
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2016-06-16 10:10 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Stefan Hajnoczi, Eric Blake, Kevin Wolf, vsementsov, Fam Zheng,
	qemu-block, Jeff Cody, qemu-devel, Max Reitz, Denis V. Lunev


On Wed, Jun 15, 2016 at 01:37:18PM +0300, Denis V. Lunev wrote:
> On 06/15/2016 12:19 PM, Stefan Hajnoczi wrote:
> > On Tue, Jun 14, 2016 at 09:20:47PM -0600, Eric Blake wrote:
> > > On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> > > > We should not take zero blocks into account for delay calculations.
> > > > They are not read, and thus IO throttling is not required. Otherwise,
> > > > VM migration with a 16 TB QCOW2 disk holding 4 GB of data takes days.
> > > > 
> > > > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > > > Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > > > CC: Fam Zheng <famz@redhat.com>
> > > > CC: Kevin Wolf <kwolf@redhat.com>
> > > > CC: Max Reitz <mreitz@redhat.com>
> > > > CC: Jeff Cody <jcody@redhat.com>
> > > > CC: Eric Blake <eblake@redhat.com>
> > > > ---
> > > >   block/mirror.c | 7 +++++--
> > > >   1 file changed, 5 insertions(+), 2 deletions(-)
> > > Seems reasonable, but I'll let others more familiar with throttling give
> > > the final say.
> > There is a bounce buffer fallback when !drv->bdrv_co_pwrite_zeroes.  In
> > that case we need to account for the bytes transferred.  I don't see
> > where the patch takes this into account.
> Interesting point.
> 
> I think we are charging for IO performed. Thus, in the
> absence of the callback, we should charge for io_sectors/2.
> The write will be full-scale; there is no reading.
> 
> Correct?

io_sectors currently only accounts for bytes written, not bytes read.

Therefore, I think we need:

/* Don't charge for efficient zero writes */
if (drv->bdrv_co_pwrite_zeroes) {
    io_sectors = 0;
}

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk
  2016-06-16 10:10         ` Stefan Hajnoczi
@ 2016-06-17  2:53           ` Eric Blake
  2016-06-17 13:56             ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Blake @ 2016-06-17  2:53 UTC (permalink / raw)
  To: Stefan Hajnoczi, Denis V. Lunev
  Cc: Stefan Hajnoczi, Kevin Wolf, vsementsov, Fam Zheng, qemu-block,
	Jeff Cody, qemu-devel, Max Reitz, Denis V. Lunev


On 06/16/2016 04:10 AM, Stefan Hajnoczi wrote:

> 
> io_sectors currently only accounts for bytes written, not bytes read.
> 
> Therefore, I think we need:
> 
> /* Don't charge for efficient zero writes */
> if (drv->bdrv_co_pwrite_zeroes) {
>     io_sectors = 0;
> }

That's not sufficient.  NBD will have conditional support for write
zeroes, depending on whether the server supports it (that is, once my
patches for NBD_CMD_WRITE_ZEROES get out of a holding pattern on all the
other patches in the queue being flushed...).  So NBD will have the
bdrv_co_pwrite_zeroes callback, but that doesn't mean it will always
work - if the server doesn't support it, calling bdrv_co_pwrite_zeroes
will fail with -ENOTSUP and fall back to less-efficient writes that need
to be accounted for.
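
In other words, the fallback happens inside the block layer where the
caller cannot see it. A sketch of the flow (details elided; the
callback name follows the pwrite_zeroes conversion):

ret = drv->bdrv_co_pwrite_zeroes(bs, offset, count, flags);
if (ret == -ENOTSUP) {
    /* the block layer quietly falls back to writing a bounce
     * buffer of explicit zeroes - real I/O on the wire that an
     * unconditional "io_sectors = 0" would wrongly skip */
}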

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk
  2016-06-17  2:53           ` Eric Blake
@ 2016-06-17 13:56             ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2016-06-17 13:56 UTC (permalink / raw)
  To: Eric Blake
  Cc: Stefan Hajnoczi, Denis V. Lunev, Kevin Wolf, vsementsov,
	Fam Zheng, qemu-block, Jeff Cody, qemu-devel, Max Reitz,
	Denis V. Lunev


On Thu, Jun 16, 2016 at 08:53:12PM -0600, Eric Blake wrote:
> On 06/16/2016 04:10 AM, Stefan Hajnoczi wrote:
> 
> > 
> > io_sectors currently only accounts for bytes written, not bytes read.
> > 
> > Therefore, I think we need:
> > 
> > /* Don't charge for efficient zero writes */
> > if (drv->bdrv_co_pwrite_zeroes) {
> >     io_sectors = 0;
> > }
> 
> That's not sufficient.  NBD will have conditional support for write
> zeroes, depending on whether the server supports it (that is, once my
> patches for NBD_CMD_WRITE_ZEROES get out of a holding pattern on all the
> other patches in the queue being flushed...).  So NBD will have the
> bdrv_co_pwrite_zeroes callback, but that doesn't mean it will always
> work - if the server doesn't support it, calling bdrv_co_pwrite_zeroes
> will fail with -ENOTSUP and fall back to less-efficient writes that need
> to be accounted for.

Good point!  Optimizations are tricky :-).

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target
  2016-06-15 12:34       ` Eric Blake
  2016-06-15 13:18         ` Denis V. Lunev
@ 2016-07-06 14:33         ` Denis V. Lunev
  1 sibling, 0 replies; 40+ messages in thread
From: Denis V. Lunev @ 2016-07-06 14:33 UTC (permalink / raw)
  To: Eric Blake, qemu-devel, qemu-block
  Cc: vsementsov, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody

On 06/15/2016 03:34 PM, Eric Blake wrote:
> On 06/15/2016 02:46 AM, Denis V. Lunev wrote:
>> On 06/15/2016 06:00 AM, Eric Blake wrote:
>>> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>>>> With a bdrv_co_write_zeroes method on a target BDS, zeroes will not
>>>> be sent over the wire. Thus the target can be zeroed out very
>>>> efficiently. This should be done with the largest chunk possible.
>>>>
>>> Probably nicer to track this in bytes.  And do you really want a
>>> hard-coded arbitrary limit, or is it better to live with
>>> MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
>> unfortunately we should. INT_MAX is not aligned as required.
>> Maybe we should align INT_MAX properly to fulfill the
>> write_zeroes alignment.
>>
>> Hmm, maybe we can align INT_MAX down properly. OK,
>> I'll try to do that gracefully.
> It's fairly easy to round a max_transfer or max_pwrite_zeroes down to an
> aligned value; we already have code in io.c that does that in
> bdrv_co_do_pwrite_zeroes().
>
>>>> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>>>          end = s->bdev_length / BDRV_SECTOR_SIZE;
>>>>    -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>>>> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
>>>> +            target_bs->drv->bdrv_co_write_zeroes == NULL) {
>>> Indentation is off, although if checkpatch.pl doesn't complain I guess
>>> it doesn't matter that much.
>>>
>>> Why should you care whether the target_bs->drv implements a callback?
>>> Can't you just rely on the normal bdrv_*() functions to do the dirty
>>> work of picking the most efficient implementation without you having to
>>> bypass the block layer?  In fact, isn't that the whole goal of
>>> bdrv_make_zero() - why not call that instead of reimplementing it?
>> this is the idea of the patch, actually. If the callback is not
>> implemented, we will have zeroes actually written or sent over the
>> wire. In that case there is not much sense in doing this: the amount
>> of data actually written will be significantly increased (some areas
>> will be written twice - with zeroes and with the actual data).
>>
> But that's where bdrv_can_write_zeroes_with_unmap() comes in handy - you
> can use the public interface to learn whether bdrv_make_zero() will be
> efficient or not, without having to probe what the backend supports.
>
>> On the other hand, if the callback is implemented, we will have a
>> very small amount of data on the wire and actually written, and thus
>> will have a benefit. I am trying to avoid very small chunks of data.
>> Here (during the migration process) the data is sent in 10 MB chunks,
>> which takes a LOT of time with NBD. We can (currently) send chunks of
>> 1.5 GB. They occupy the same 26 bytes of data on the transport layer.
> I agree that we don't want to pre-initialize the device to zero unless
> write zeroes is an efficient operation, but I don't think that the
> existence of bs->drv->bdrv_co_[p]write_zeroes is the right way to find
> that out.
>
> I also think that we need to push harder on the NBD list that under the
> new block limits proposal, we WANT to be able to advertise when the new
> NBD_CMD_WRITE_ZEROES command will accept a larger size than
> NBD_CMD_WRITE (as currently written, the BLOCK_INFO extension proposal
> states that if a server advertises a max transaction size to the client,
> then the client must honor that size for all commands including
> NBD_CMD_WRITE_ZEROES, which would mean your 1.5G request [or my proposed
> 2G - 4k request] is invalid and would have to be a bunch of 32M requests).
> https://sourceforge.net/p/nbd/mailman/message/35081223/
>

Eric, do you know how to post a reply to this letter on the list?
Maybe you can write a reply and add me in CC there.
This change in NBD is really necessary.

Den

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2016-07-06 14:34 UTC | newest]

Thread overview: 40+ messages
2016-06-14 15:25 [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Denis V. Lunev
2016-06-14 15:25 ` [Qemu-devel] [PATCH 1/9] mirror: fix calling of blk_aio_pwritev/blk_aio_preadv Denis V. Lunev
2016-06-14 22:48   ` Eric Blake
2016-06-14 15:25 ` [Qemu-devel] [PATCH 2/9] mirror: create mirror_dirty_init helper for mirror_run Denis V. Lunev
2016-06-15  2:29   ` Eric Blake
2016-06-14 15:25 ` [Qemu-devel] [PATCH 3/9] mirror: optimize dirty bitmap filling in mirror_run a bit Denis V. Lunev
2016-06-15  2:36   ` Eric Blake
2016-06-15  8:41     ` Denis V. Lunev
2016-06-15 12:25       ` Eric Blake
2016-06-14 15:25 ` [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target Denis V. Lunev
2016-06-15  3:00   ` Eric Blake
2016-06-15  8:46     ` Denis V. Lunev
2016-06-15 12:34       ` Eric Blake
2016-06-15 13:18         ` Denis V. Lunev
2016-07-06 14:33         ` Denis V. Lunev
2016-06-14 15:25 ` [Qemu-devel] [PATCH 5/9] mirror: improve performance of mirroring of empty disk Denis V. Lunev
2016-06-15  3:20   ` Eric Blake
2016-06-15  9:19     ` Stefan Hajnoczi
2016-06-15 10:37       ` Denis V. Lunev
2016-06-16 10:10         ` Stefan Hajnoczi
2016-06-17  2:53           ` Eric Blake
2016-06-17 13:56             ` Stefan Hajnoczi
2016-06-14 15:25 ` [Qemu-devel] [PATCH 6/9] block: pass qiov into before_write notifier Denis V. Lunev
2016-06-15  4:07   ` Eric Blake
2016-06-15  9:21   ` Stefan Hajnoczi
2016-06-15  9:24     ` Denis V. Lunev
2016-06-15  9:22   ` Stefan Hajnoczi
2016-06-14 15:25 ` [Qemu-devel] [PATCH 7/9] mirror: allow to save buffer for QEMUIOVector in MirrorOp Denis V. Lunev
2016-06-15  4:11   ` Eric Blake
2016-06-14 15:25 ` [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror Denis V. Lunev
2016-06-15  4:18   ` Eric Blake
2016-06-15  8:52     ` Denis V. Lunev
2016-06-15  9:48   ` Stefan Hajnoczi
2016-06-14 15:25 ` [Qemu-devel] [PATCH 9/9] mirror: replace bdrv_dirty_bitmap with plain hbitmap Denis V. Lunev
2016-06-15  9:06 ` [Qemu-devel] [PATCH 0/9] major rework of drive-mirror Kevin Wolf
2016-06-15  9:34   ` Denis V. Lunev
2016-06-15 10:25     ` Kevin Wolf
2016-06-15 10:44       ` Denis V. Lunev
2016-06-15  9:50 ` Stefan Hajnoczi
2016-06-15 11:09 ` Dr. David Alan Gilbert
