* [Qemu-devel] [PATCH 00/21] new backup architecture
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

Hi all.

This is a new architecture for backup. It solves some current problems:
1. intersecting requests: currently, at request start we wait for all intersecting requests, which means that
    a. we may wait even for clusters unrelated to our request
    b. we are not fully async: if we are going to copy clusters 1,2,3,4 while 2 and 4 are in flight, why should we wait for 2 and 4 to be fully copied? Why not start 1 and 3 in parallel with 2 and 4?

2. a notifier request is internally synchronous: if a notifier starts copying clusters 1,2,3,4, they will be copied one by one in a synchronous loop.

3. a notifier waits for the full copy of the corresponding clusters (when actually it only needs to wait for the _read_ operations to finish)

In short, what is done:
1. fully async scheme
2. no intersecting requests
3. notifiers wait only for reads, not for writes
4. notifiers wait only for their corresponding clusters
5. time limit for notifiers
6. skip unallocated clusters in full mode
7. use HBitmap as the main backup bitmap and just initialize it from the dirty bitmap in the incremental case
8. retrying: do not reread on write failure

# Intro

Instead of sync copying + async notifiers as in the old backup, or aio requests as in mirror, this scheme just starts 24 workers - separate coroutines, each of them copying clusters synchronously. Copying is done one cluster at a time; there are no large requests.
The only difference for clusters awaited by write notifiers is a higher priority. So, notifiers do not start io requests themselves; they just mark some clusters as awaited and yield. Then, when some worker completes the read of the last cluster awaited by a notifier, it enters that notifier's coroutine.
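
To make this concrete, here is a minimal sketch of a worker coroutine. It is only a sketch: get_work() and notify_waiting_notifiers() stand in for the real helpers of this series, and backup_copy_cluster()'s signature is simplified.

static void coroutine_fn backup_worker_co(void *opaque)
{
    BackupBlockJob *job = opaque;
    int64_t cluster;

    /* get_work() prefers clusters awaited by notifiers and returns -1
     * when there is nothing left for this worker to copy */
    while ((cluster = get_work(job)) != -1) {
        /* synchronous (coroutine-blocking) copy of exactly one cluster */
        backup_copy_cluster(job, cluster);

        /* if this was the last awaited cluster of some notifier, enter
         * that notifier's coroutine */
        notify_waiting_notifiers(job, cluster);
    }
}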

# Some data structures

Instead of done_bitmap there is copy_bitmap, as in mirror.
HBitmap copy_bitmap
    Exactly what should be copied:
    0 - may mean one of three things:
        - this cluster should not be copied at all
        - this cluster is in flight
        - this cluster is already copied
    1 - means that the cluster should be copied but has not been touched yet (no async io exists for it)

New bitmap: notif_wait_bitmap - not an HBitmap, just a Bitmap.
    Exactly the in-flight clusters whose read operation has not finished:
    0 - may mean one of three things:
        - this cluster should not be copied at all
        - this cluster is in flight and has _already_ been read into memory
        - this cluster is already copied
    1 - means that the cluster is in flight, but the read operation has not
        finished yet
    The only exception is none mode: in this case 1 means in flight (io read or write). This is needed for image fleecing.

Cluster states (copy_bitmap, notif_wait_bitmap)

0, 0 - Ignored (should not be copied at all) or In flight (read done) or Copied
0, 1 - In flight, read operation not finished (or write op. - for none-mode)
1, 0 - Should be copied, but not touched yet
1, 1 - Impossible state
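
Just to illustrate the encoding (the series itself needs no explicit enum; the names below exist only in this sketch):

typedef enum ClusterState {
    CLUSTER_SKIP_OR_DONE, /* 0,0: ignored, in flight (read done), or copied */
    CLUSTER_IN_FLIGHT,    /* 0,1: read (or write, for none-mode) unfinished */
    CLUSTER_TO_COPY,      /* 1,0: should be copied, not touched yet */
} ClusterState;

static ClusterState cluster_state(BackupBlockJob *job, int64_t cluster)
{
    bool copy = hbitmap_get(job->copy_bitmap, cluster);
    bool wait = test_bit(cluster, job->notif_wait_bitmap);

    assert(!(copy && wait)); /* 1,1 is an impossible state */

    return copy ? CLUSTER_TO_COPY :
           wait ? CLUSTER_IN_FLIGHT : CLUSTER_SKIP_OR_DONE;
}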

NotifierRequest - a request from a notifier; it changes the order in which workers copy clusters.
NotifierRequest {
    int64_t start;
    int64_t end;
    int nb_wait; // number of clusters (in the range) that should be copied but are not yet read, i.e. clusters awaited by this notifier
    Coroutine *notif; // corresponding notifier coroutine
}

notifier_reqs - list of notifier requests
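
A sketch of the notifier side follows. The real backup_before_write_notify() has a NotifierWithReturn-based signature; here it is simplified, and clusters_to_read(), notifier_reqs_add() and notifier_reqs_remove() are stand-ins for list handling that is elided.

static int coroutine_fn notifier_wait_sketch(BackupBlockJob *job,
                                             int64_t start, int64_t end)
{
    NotifierRequest nr = {
        .start = start,
        .end = end,
        .notif = qemu_coroutine_self(),
    };

    /* count clusters in [start, end) that still need to be read */
    nr.nb_wait = clusters_to_read(job, start, end);

    if (nr.nb_wait > 0) {
        notifier_reqs_add(job, &nr); /* workers now prioritize this range */
        qemu_coroutine_yield();      /* a worker enters us when nb_wait
                                      * drops to zero (all reads done) */
        notifier_reqs_remove(job, &nr);
    }

    return 0;
}

On the worker side, finishing the read of a cluster decrements nb_wait of every request covering it; when a counter reaches zero, the worker enters the corresponding notif coroutine (the qemu_coroutine_add_next helper added in this series fits here).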

# More info

At backup start, copy_bitmap is initialized from sync_bitmap for incremental backup. For top/full backup it is initialized to all ones, but in parallel with the workers the main coroutine skips unallocated clusters.

Worker coroutines copy clusters, preferring those awaited by notifiers (for which a NotifierRequest exists in the list). The function get_work helps them; see the sketch below.
Workers will copy clusters awaited by notifiers even if the block job is paused - the same behaviour as in the old architecture.
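
A sketch of get_work()'s priority rule (pending_notifier_req() is a stand-in; the real function also clears the returned cluster in copy_bitmap and sets it in notif_wait_bitmap, which is elided here):

static int64_t get_work(BackupBlockJob *job)
{
    NotifierRequest *nr = pending_notifier_req(job); /* stand-in helper */
    HBitmapIter hbi;
    int64_t cluster;

    if (nr) {
        /* prefer a not-yet-touched cluster awaited by a notifier */
        hbitmap_iter_init(&hbi, job->copy_bitmap, nr->start);
        cluster = hbitmap_iter_next(&hbi);
        if (cluster != -1 && cluster < nr->end) {
            return cluster;
        }
    }

    /* otherwise take the next cluster that still needs copying */
    hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
    return hbitmap_iter_next(&hbi);
}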

The old backup fails a guest write if the notifier fails to back up the corresponding clusters. In the new scheme there is a small difference: the notifier just waits for 5s, and if backup can't copy all corresponding clusters in this time, the guest write fails.
Error scenarios were considered on the list; the final solution was to give the user the possibility to choose what should fail: the backup or the guest write. I'll add this later.

A worker can exit (no more clusters to copy, or fatal error) or pause (error, user pause, or throttling). When the last worker goes to pause, it wakes up the main block-job coroutine, which will handle the user pause or errors. We need to handle errors in the main coroutine because of the nature of block_job_error_action, which may yield.

There is also a bonus: a new io-retrying scheme: if there is an error on read or write, the worker just yields in the retrying loop, and if it is resumed (with job->error_exit = false) it continues from the same place. So if a write fails after a successful read, we will not reread.
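
A sketch of that retry idea, reusing the backup_do_read()/backup_do_write() helpers split out later in this series (the pause/resume bookkeeping is simplified, and job->error_exit is the flag mentioned above):

static int coroutine_fn copy_with_retry(BackupBlockJob *job, int64_t offset,
                                        unsigned int bytes, QEMUIOVector *qiov)
{
    int ret;

    while ((ret = backup_do_read(job, offset, bytes, qiov, false)) < 0) {
        qemu_coroutine_yield();  /* pause in place on error */
        if (job->error_exit) {
            return ret;          /* fatal: report the error */
        }
        /* resumed: retry the read */
    }

    while ((ret = backup_do_write(job, offset, bytes, qiov)) < 0) {
        qemu_coroutine_yield();  /* pause in place on error */
        if (job->error_exit) {
            return ret;
        }
        /* resumed: retry only the write - the data is still in qiov,
         * so the successful read is never repeated */
    }

    return 0;
}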

Vladimir Sementsov-Ogievskiy (21):
  backup: move from done_bitmap to copy_bitmap
  backup: init copy_bitmap from sync_bitmap for incremental
  backup: improve non-dirty bits progress processing
  backup: use copy_bitmap in incremental backup
  hbitmap: improve dirty iter
  backup: rewrite top mode cluster skipping
  backup: refactor: merge top/full/incremental backup code
  backup: skip unallocated clusters for full mode
  backup: separate copy function
  backup: refactor backup_copy_cluster()
  backup: move r/w error handling code to r/w functions
  iotests: add supported_cache_modes to main function
  coroutine: add qemu_coroutine_add_next
  block: add trace point on bdrv_close_all
  bitmap: add bitmap_count_between() function
  hbitmap: add hbitmap_count_between() function
  backup: make all reads not serializing
  backup: new async architecture
  backup: refactor backup_do_cow
  backup: move bitmap handling from backup_do_cow to get_work
  backup: refactor: remove backup_do_cow()

 block.c                       |   1 +
 block/backup.c                | 871 +++++++++++++++++++++++++++++++-----------
 block/trace-events            |  34 +-
 blockjob.c                    |  29 +-
 include/block/blockjob.h      |  15 +-
 include/qemu/bitmap.h         |   4 +
 include/qemu/coroutine.h      |   2 +
 include/qemu/hbitmap.h        |  26 +-
 tests/qemu-iotests/055        |   4 +-
 tests/qemu-iotests/129        |   6 +-
 tests/qemu-iotests/iotests.py |   7 +-
 util/bitmap.c                 |  27 ++
 util/hbitmap.c                |  32 +-
 util/qemu-coroutine.c         |   7 +
 14 files changed, 805 insertions(+), 260 deletions(-)

-- 
1.8.3.1

* [Qemu-devel] [PATCH 01/21] backup: move from done_bitmap to copy_bitmap
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

Use HBitmap copy_bitmap instead of done_bitmap. This is needed to unify
the backup loop for full/incremental modes in future patches.

We reset the bit in copy_bitmap immediately after checking it in
backup_do_cow(). This is safe, because all other intersecting requests
will wait for our request to finish anyway.

The other difference is that in case of error we will have a zeroed bit
in copy_bitmap, whereas with done_bitmap we would have an unset bit.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index ea38733..6b27e55 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -39,11 +39,12 @@ typedef struct BackupBlockJob {
     BlockdevOnError on_target_error;
     CoRwlock flush_rwlock;
     uint64_t sectors_read;
-    unsigned long *done_bitmap;
     int64_t cluster_size;
     bool compress;
     NotifierWithReturn before_write;
     QLIST_HEAD(, CowRequest) inflight_reqs;
+
+    HBitmap *copy_bitmap;
 } BackupBlockJob;
 
 /* Size of a cluster in sectors, instead of bytes. */
@@ -115,10 +116,11 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
     cow_request_begin(&cow_request, job, start, end);
 
     for (; start < end; start++) {
-        if (test_bit(start, job->done_bitmap)) {
+        if (!hbitmap_get(job->copy_bitmap, start)) {
             trace_backup_do_cow_skip(job, start);
             continue; /* already copied */
         }
+        hbitmap_reset(job->copy_bitmap, start, 1);
 
         trace_backup_do_cow_process(job, start);
 
@@ -141,6 +143,7 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
             if (error_is_read) {
                 *error_is_read = true;
             }
+            hbitmap_set(job->copy_bitmap, start, 1);
             goto out;
         }
 
@@ -157,11 +160,10 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
             if (error_is_read) {
                 *error_is_read = false;
             }
+            hbitmap_set(job->copy_bitmap, start, 1);
             goto out;
         }
 
-        set_bit(start, job->done_bitmap);
-
         /* Publish progress, guest I/O counts as progress too.  Note that the
          * offset field is an opaque progress value, it is not a disk offset.
          */
@@ -271,7 +273,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
     }
 
     len = DIV_ROUND_UP(backup_job->common.len, backup_job->cluster_size);
-    bitmap_zero(backup_job->done_bitmap, len);
+    hbitmap_set(backup_job->copy_bitmap, 0, len);
 }
 
 void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
@@ -450,7 +452,8 @@ static void coroutine_fn backup_run(void *opaque)
     start = 0;
     end = DIV_ROUND_UP(job->common.len, job->cluster_size);
 
-    job->done_bitmap = bitmap_new(end);
+    job->copy_bitmap = hbitmap_alloc(end, 0);
+    hbitmap_set(job->copy_bitmap, 0, end);
 
     job->before_write.notify = backup_before_write_notify;
     bdrv_add_before_write_notifier(bs, &job->before_write);
@@ -524,7 +527,7 @@ static void coroutine_fn backup_run(void *opaque)
     /* wait until pending backup_do_cow() calls have completed */
     qemu_co_rwlock_wrlock(&job->flush_rwlock);
     qemu_co_rwlock_unlock(&job->flush_rwlock);
-    g_free(job->done_bitmap);
+    hbitmap_free(job->copy_bitmap);
 
     data = g_malloc(sizeof(*data));
     data->ret = ret;
-- 
1.8.3.1

* [Qemu-devel] [PATCH 02/21] backup: init copy_bitmap from sync_bitmap for incremental
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

We should not copy non-dirty clusters in write notifiers, so initialize
copy_bitmap from sync_bitmap.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/backup.c b/block/backup.c
index 6b27e55..621b1c0 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -437,6 +437,34 @@ out:
     return ret;
 }
 
+/* init copy_bitmap from sync_bitmap */
+static void backup_incremental_init_copy_bitmap(BackupBlockJob *job)
+{
+    int64_t sector;
+    BdrvDirtyBitmapIter *dbi;
+    uint32_t sect_gran =
+        bdrv_dirty_bitmap_granularity(job->sync_bitmap) >> BDRV_SECTOR_BITS;
+    int64_t sz = bdrv_dirty_bitmap_size(job->sync_bitmap);
+    int64_t sectors_per_cluster = cluster_size_sectors(job);
+    uint32_t cl_gran = MAX(1, sect_gran / sectors_per_cluster);
+
+    dbi = bdrv_dirty_iter_new(job->sync_bitmap, 0);
+    while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
+        int64_t cluster = sector / sectors_per_cluster;
+        int64_t next_sector = (cluster + cl_gran) * sectors_per_cluster;
+
+        hbitmap_set(job->copy_bitmap, cluster, cl_gran);
+
+        if (next_sector >= sz) {
+            break;
+        }
+
+        bdrv_set_dirty_iter(dbi, (cluster + cl_gran) * sectors_per_cluster);
+    }
+
+    bdrv_dirty_iter_free(dbi);
+}
+
 static void coroutine_fn backup_run(void *opaque)
 {
     BackupBlockJob *job = opaque;
@@ -453,20 +481,22 @@ static void coroutine_fn backup_run(void *opaque)
     end = DIV_ROUND_UP(job->common.len, job->cluster_size);
 
     job->copy_bitmap = hbitmap_alloc(end, 0);
-    hbitmap_set(job->copy_bitmap, 0, end);
 
     job->before_write.notify = backup_before_write_notify;
     bdrv_add_before_write_notifier(bs, &job->before_write);
 
     if (job->sync_mode == MIRROR_SYNC_MODE_NONE) {
+        hbitmap_set(job->copy_bitmap, 0, end);
         while (!block_job_is_cancelled(&job->common)) {
             /* Yield until the job is cancelled.  We just let our before_write
              * notify callback service CoW requests. */
             block_job_yield(&job->common);
         }
     } else if (job->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
+        backup_incremental_init_copy_bitmap(job);
         ret = backup_run_incremental(job);
     } else {
+        hbitmap_set(job->copy_bitmap, 0, end);
         /* Both FULL and TOP SYNC_MODE's require copying.. */
         for (; start < end; start++) {
             bool error_is_read;
-- 
1.8.3.1

* [Qemu-devel] [PATCH 03/21] backup: improve non-dirty bits progress processing
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

Set fake progress for non-dirty clusters during copy_bitmap
initialization, in order to:
1. set progress in the same place where the corresponding clusters are
consumed from copy_bitmap (or, as here, never initialized)
2. give the user progress information earlier
3. simplify the code

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 18 +++---------------
 1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 621b1c0..f1f87f6 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -383,7 +383,6 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
     int64_t sector;
     int64_t cluster;
     int64_t end;
-    int64_t last_cluster = -1;
     int64_t sectors_per_cluster = cluster_size_sectors(job);
     BdrvDirtyBitmapIter *dbi;
 
@@ -395,12 +394,6 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
     while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
         cluster = sector / sectors_per_cluster;
 
-        /* Fake progress updates for any clusters we skipped */
-        if (cluster != last_cluster + 1) {
-            job->common.offset += ((cluster - last_cluster - 1) *
-                                   job->cluster_size);
-        }
-
         for (end = cluster + clusters_per_iter; cluster < end; cluster++) {
             do {
                 if (yield_and_check(job)) {
@@ -422,14 +415,6 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
         if (granularity < job->cluster_size) {
             bdrv_set_dirty_iter(dbi, cluster * sectors_per_cluster);
         }
-
-        last_cluster = cluster - 1;
-    }
-
-    /* Play some final catchup with the progress meter */
-    end = DIV_ROUND_UP(job->common.len, job->cluster_size);
-    if (last_cluster + 1 < end) {
-        job->common.offset += ((end - last_cluster - 1) * job->cluster_size);
     }
 
 out:
@@ -462,6 +447,9 @@ static void backup_incremental_init_copy_bitmap(BackupBlockJob *job)
         bdrv_set_dirty_iter(dbi, (cluster + cl_gran) * sectors_per_cluster);
     }
 
+    job->common.offset = job->common.len -
+                         hbitmap_count(job->copy_bitmap) * job->cluster_size;
+
     bdrv_dirty_iter_free(dbi);
 }
 
-- 
1.8.3.1

* [Qemu-devel] [PATCH 04/21] backup: use copy_bitmap in incremental backup
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

We can use copy_bitmap instead of sync_bitmap. copy_bitmap is
initialized from sync_bitmap and is more informative: we will not try
to process data that is already in progress (handled by the write
notifier).

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 58 +++++++++++++++++++---------------------------------------
 1 file changed, 19 insertions(+), 39 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index f1f87f6..938b7df 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -376,50 +376,30 @@ static bool coroutine_fn yield_and_check(BackupBlockJob *job)
 
 static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
 {
+    int ret;
     bool error_is_read;
-    int ret = 0;
-    int clusters_per_iter;
-    uint32_t granularity;
-    int64_t sector;
-    int64_t cluster;
-    int64_t end;
     int64_t sectors_per_cluster = cluster_size_sectors(job);
-    BdrvDirtyBitmapIter *dbi;
-
-    granularity = bdrv_dirty_bitmap_granularity(job->sync_bitmap);
-    clusters_per_iter = MAX((granularity / job->cluster_size), 1);
-    dbi = bdrv_dirty_iter_new(job->sync_bitmap, 0);
-
-    /* Find the next dirty sector(s) */
-    while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
-        cluster = sector / sectors_per_cluster;
-
-        for (end = cluster + clusters_per_iter; cluster < end; cluster++) {
-            do {
-                if (yield_and_check(job)) {
-                    goto out;
-                }
-                ret = backup_do_cow(job, cluster * sectors_per_cluster,
-                                    sectors_per_cluster, &error_is_read,
-                                    false);
-                if ((ret < 0) &&
-                    backup_error_action(job, error_is_read, -ret) ==
-                    BLOCK_ERROR_ACTION_REPORT) {
-                    goto out;
-                }
-            } while (ret < 0);
-        }
+    int64_t cluster;
+    HBitmapIter hbi;
 
-        /* If the bitmap granularity is smaller than the backup granularity,
-         * we need to advance the iterator pointer to the next cluster. */
-        if (granularity < job->cluster_size) {
-            bdrv_set_dirty_iter(dbi, cluster * sectors_per_cluster);
-        }
+    hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
+    while ((cluster = hbitmap_iter_next(&hbi)) != -1) {
+        do {
+            if (yield_and_check(job)) {
+                return 0;
+            }
+            ret = backup_do_cow(job, cluster * sectors_per_cluster,
+                                sectors_per_cluster, &error_is_read,
+                                false);
+            if ((ret < 0) &&
+                backup_error_action(job, error_is_read, -ret) ==
+                BLOCK_ERROR_ACTION_REPORT) {
+                return ret;
+            }
+        } while (ret < 0);
     }
 
-out:
-    bdrv_dirty_iter_free(dbi);
-    return ret;
+    return 0;
 }
 
 /* init copy_bitmap from sync_bitmap */
-- 
1.8.3.1

* [Qemu-devel] [PATCH 05/21] hbitmap: improve dirty iter
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

Make the dirty iterator resistant to resetting bits in the
corresponding HBitmap.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/qemu/hbitmap.h | 24 ++----------------------
 util/hbitmap.c         | 23 ++++++++++++++++++++++-
 2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index eb46475..2873a46 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -243,10 +243,7 @@ void hbitmap_free(HBitmap *hb);
  * the lowest-numbered bit that is set in @hb, starting at @first.
  *
  * Concurrent setting of bits is acceptable, and will at worst cause the
- * iteration to miss some of those bits.  Resetting bits before the current
- * position of the iterator is also okay.  However, concurrent resetting of
- * bits can lead to unexpected behavior if the iterator has not yet reached
- * those bits.
+ * iteration to miss some of those bits.  Resetting bits is also okay.
  */
 void hbitmap_iter_init(HBitmapIter *hbi, const HBitmap *hb, uint64_t first);
 
@@ -285,24 +282,7 @@ void hbitmap_free_meta(HBitmap *hb);
  * Return the next bit that is set in @hbi's associated HBitmap,
  * or -1 if all remaining bits are zero.
  */
-static inline int64_t hbitmap_iter_next(HBitmapIter *hbi)
-{
-    unsigned long cur = hbi->cur[HBITMAP_LEVELS - 1];
-    int64_t item;
-
-    if (cur == 0) {
-        cur = hbitmap_iter_skip_words(hbi);
-        if (cur == 0) {
-            return -1;
-        }
-    }
-
-    /* The next call will resume work from the next bit.  */
-    hbi->cur[HBITMAP_LEVELS - 1] = cur & (cur - 1);
-    item = ((uint64_t)hbi->pos << BITS_PER_LEVEL) + ctzl(cur);
-
-    return item << hbi->granularity;
-}
+int64_t hbitmap_iter_next(HBitmapIter *hbi);
 
 /**
  * hbitmap_iter_next_word:
diff --git a/util/hbitmap.c b/util/hbitmap.c
index 5d1a21c..48d8b2d 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -106,8 +106,9 @@ unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi)
 
     unsigned long cur;
     do {
-        cur = hbi->cur[--i];
+        i--;
         pos >>= BITS_PER_LEVEL;
+        cur = hbi->cur[i] & hb->levels[i][pos];
     } while (cur == 0);
 
     /* Check for end of iteration.  We always use fewer than BITS_PER_LONG
@@ -139,6 +140,26 @@ unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi)
     return cur;
 }
 
+int64_t hbitmap_iter_next(HBitmapIter *hbi)
+{
+    unsigned long cur = hbi->cur[HBITMAP_LEVELS - 1] &
+            hbi->hb->levels[HBITMAP_LEVELS - 1][hbi->pos];
+    int64_t item;
+
+    if (cur == 0) {
+        cur = hbitmap_iter_skip_words(hbi);
+        if (cur == 0) {
+            return -1;
+        }
+    }
+
+    /* The next call will resume work from the next bit.  */
+    hbi->cur[HBITMAP_LEVELS - 1] = cur & (cur - 1);
+    item = ((uint64_t)hbi->pos << BITS_PER_LEVEL) + ctzl(cur);
+
+    return item << hbi->granularity;
+}
+
 void hbitmap_iter_init(HBitmapIter *hbi, const HBitmap *hb, uint64_t first)
 {
     unsigned i, bit;
-- 
1.8.3.1

* [Qemu-devel] [PATCH 06/21] backup: rewrite top mode cluster skipping
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

TOP backup mode skips unallocated clusters. This patch marks skipped
clusters in copy_bitmap to prevent them from being copied by the write
notifier (they may still be copied before being skipped, but that is
not critical).

Also, update job->common.offset appropriately, so that it eventually
reaches job->common.len.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 135 ++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 104 insertions(+), 31 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 938b7df..e2b944a 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -53,6 +53,11 @@ static inline int64_t cluster_size_sectors(BackupBlockJob *job)
   return job->cluster_size / BDRV_SECTOR_SIZE;
 }
 
+static inline int64_t max_query_sectors(BackupBlockJob *job)
+{
+    return (INT_MAX & ~(job->cluster_size - 1)) >> BDRV_SECTOR_BITS;
+}
+
 /* See if in-flight requests overlap and wait for them to complete */
 static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
                                                        int64_t start,
@@ -374,6 +379,101 @@ static bool coroutine_fn yield_and_check(BackupBlockJob *job)
     return false;
 }
 
+static void backup_skip_clusters(BackupBlockJob *job,
+                                 int64_t start, int64_t end)
+{
+    CowRequest cow_request;
+
+    wait_for_overlapping_requests(job, start, end);
+    cow_request_begin(&cow_request, job, start, end);
+
+    if (end * job->cluster_size > job->common.len) {
+        int64_t n;
+        end--;
+        n = job->common.len - end * job->cluster_size;
+        assert(n > 0);
+
+        if (hbitmap_get(job->copy_bitmap, end)) {
+            hbitmap_reset(job->copy_bitmap, end, 1);
+            job->common.offset += n;
+        }
+    }
+
+    for ( ; start < end; start++) {
+        if (!hbitmap_get(job->copy_bitmap, start)) {
+            continue;
+        }
+
+        hbitmap_reset(job->copy_bitmap, start, 1);
+        job->common.offset += job->cluster_size;
+    }
+
+    cow_request_end(&cow_request);
+}
+
+static int backup_skip_unallocated_clusters(BackupBlockJob *job,
+                                            BlockDriverState *base,
+                                            int64_t start, int *n)
+{
+    int ret;
+    int64_t sectors_per_cluster = cluster_size_sectors(job);
+    BlockDriverState *bs = blk_bs(job->common.blk);
+    int64_t sector_end = job->common.len >> BDRV_SECTOR_BITS;
+    int64_t sector = start * sectors_per_cluster;
+    int max_sectors = MIN(max_query_sectors(job), sector_end - sector);
+    int n_sectors = 0;
+
+    ret = bdrv_is_allocated_above(bs, base, sector, max_sectors, &n_sectors);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (sector + n_sectors == sector_end || ret == 1) {
+        *n = DIV_ROUND_UP(n_sectors, sectors_per_cluster);
+    } else if (n_sectors < sectors_per_cluster) {
+        *n = 1;
+        ret = 1;
+    } else {
+        *n = n_sectors / sectors_per_cluster;
+    }
+
+    if (ret == 0) {
+        backup_skip_clusters(job, start, start + *n);
+    }
+
+    return 0;
+}
+
+static void backup_skip_loop(BackupBlockJob *job, BlockDriverState *base)
+{
+    HBitmapIter hbi;
+    int64_t cluster;
+    int64_t end = DIV_ROUND_UP(job->common.len, job->cluster_size);
+
+    hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
+    while ((cluster = hbitmap_iter_next(&hbi)) != -1) {
+        int n, ret;
+
+        if (yield_and_check(job)) {
+            return;
+        }
+
+        ret = backup_skip_unallocated_clusters(job, base, cluster, &n);
+        if (ret < 0) {
+            n = 1;
+        }
+
+        cluster += n;
+        if (cluster >= end) {
+            return;
+        }
+
+        if (n > 1) {
+            hbitmap_iter_init(&hbi, job->copy_bitmap, cluster);
+        }
+    }
+}
+
 static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
 {
     int ret;
@@ -465,6 +565,10 @@ static void coroutine_fn backup_run(void *opaque)
         ret = backup_run_incremental(job);
     } else {
         hbitmap_set(job->copy_bitmap, 0, end);
+        if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
+            backup_skip_loop(job, backing_bs(blk_bs(job->common.blk)));
+        }
+
         /* Both FULL and TOP SYNC_MODE's require copying.. */
         for (; start < end; start++) {
             bool error_is_read;
@@ -472,37 +576,6 @@ static void coroutine_fn backup_run(void *opaque)
                 break;
             }
 
-            if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
-                int i, n;
-                int alloced = 0;
-
-                /* Check to see if these blocks are already in the
-                 * backing file. */
-
-                for (i = 0; i < sectors_per_cluster;) {
-                    /* bdrv_is_allocated() only returns true/false based
-                     * on the first set of sectors it comes across that
-                     * are are all in the same state.
-                     * For that reason we must verify each sector in the
-                     * backup cluster length.  We end up copying more than
-                     * needed but at some point that is always the case. */
-                    alloced =
-                        bdrv_is_allocated(bs,
-                                start * sectors_per_cluster + i,
-                                sectors_per_cluster - i, &n);
-                    i += n;
-
-                    if (alloced == 1 || n == 0) {
-                        break;
-                    }
-                }
-
-                /* If the above loop never found any sectors that are in
-                 * the topmost image, skip this backup. */
-                if (alloced == 0) {
-                    continue;
-                }
-            }
             /* FULL sync mode we copy the whole drive. */
             ret = backup_do_cow(job, start * sectors_per_cluster,
                                 sectors_per_cluster, &error_is_read, false);
-- 
1.8.3.1

* [Qemu-devel] [PATCH 07/21] backup: refactor: merge top/full/incremental backup code
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

Merge the top/full/incremental backup modes into one backup_loop.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 41 +++++++++--------------------------------
 1 file changed, 9 insertions(+), 32 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index e2b944a..2afd1b6 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -474,7 +474,7 @@ static void backup_skip_loop(BackupBlockJob *job, BlockDriverState *base)
     }
 }
 
-static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
+static int coroutine_fn backup_loop(BackupBlockJob *job)
 {
     int ret;
     bool error_is_read;
@@ -538,14 +538,12 @@ static void coroutine_fn backup_run(void *opaque)
     BackupBlockJob *job = opaque;
     BackupCompleteData *data;
     BlockDriverState *bs = blk_bs(job->common.blk);
-    int64_t start, end;
-    int64_t sectors_per_cluster = cluster_size_sectors(job);
+    int64_t end;
     int ret = 0;
 
     QLIST_INIT(&job->inflight_reqs);
     qemu_co_rwlock_init(&job->flush_rwlock);
 
-    start = 0;
     end = DIV_ROUND_UP(job->common.len, job->cluster_size);
 
     job->copy_bitmap = hbitmap_alloc(end, 0);
@@ -560,37 +558,16 @@ static void coroutine_fn backup_run(void *opaque)
              * notify callback service CoW requests. */
             block_job_yield(&job->common);
         }
-    } else if (job->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
-        backup_incremental_init_copy_bitmap(job);
-        ret = backup_run_incremental(job);
     } else {
-        hbitmap_set(job->copy_bitmap, 0, end);
-        if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
-            backup_skip_loop(job, backing_bs(blk_bs(job->common.blk)));
-        }
-
-        /* Both FULL and TOP SYNC_MODE's require copying.. */
-        for (; start < end; start++) {
-            bool error_is_read;
-            if (yield_and_check(job)) {
-                break;
-            }
-
-            /* FULL sync mode we copy the whole drive. */
-            ret = backup_do_cow(job, start * sectors_per_cluster,
-                                sectors_per_cluster, &error_is_read, false);
-            if (ret < 0) {
-                /* Depending on error action, fail now or retry cluster */
-                BlockErrorAction action =
-                    backup_error_action(job, error_is_read, -ret);
-                if (action == BLOCK_ERROR_ACTION_REPORT) {
-                    break;
-                } else {
-                    start--;
-                    continue;
-                }
+        if (job->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
+            backup_incremental_init_copy_bitmap(job);
+        } else {
+            hbitmap_set(job->copy_bitmap, 0, end);
+            if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
+                backup_skip_loop(job, backing_bs(blk_bs(job->common.blk)));
             }
         }
+        ret = backup_loop(job);
     }
 
     notifier_with_return_remove(&job->before_write);
-- 
1.8.3.1

* [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

In case of full backup we can skip unallocated clusters if the target
disk is already zero-initialized.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c         | 8 ++++++--
 tests/qemu-iotests/055 | 2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 2afd1b6..4ef8daf 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -562,9 +562,13 @@ static void coroutine_fn backup_run(void *opaque)
         if (job->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
             backup_incremental_init_copy_bitmap(job);
         } else {
+            /* top or full mode */
+            bool is_top = job->sync_mode == MIRROR_SYNC_MODE_TOP;
+            BlockDriverState *base =
+                    is_top ? backing_bs(blk_bs(job->common.blk)) : NULL;
             hbitmap_set(job->copy_bitmap, 0, end);
-            if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
-                backup_skip_loop(job, backing_bs(blk_bs(job->common.blk)));
+            if (is_top || bdrv_has_zero_init(blk_bs(job->target))) {
+                backup_skip_loop(job, base);
             }
         }
         ret = backup_loop(job);
diff --git a/tests/qemu-iotests/055 b/tests/qemu-iotests/055
index 1d3fd04..388b7b2 100755
--- a/tests/qemu-iotests/055
+++ b/tests/qemu-iotests/055
@@ -30,6 +30,7 @@ target_img = os.path.join(iotests.test_dir, 'target.img')
 blockdev_target_img = os.path.join(iotests.test_dir, 'blockdev-target.img')
 
 image_len = 64 * 1024 * 1024 # MB
+pause_write = '3M'
 
 def setUpModule():
     qemu_img('create', '-f', iotests.imgfmt, test_img, str(image_len))
@@ -39,6 +40,7 @@ def setUpModule():
     qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xd5 1M 32k', test_img)
     qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 32M 124k', test_img)
     qemu_io('-f', iotests.imgfmt, '-c', 'write -P0x33 67043328 64k', test_img)
+    qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 40M ' + pause_write, test_img)
 
 def tearDownModule():
     os.remove(test_img)
-- 
1.8.3.1

* [Qemu-devel] [PATCH 09/21] backup: separate copy function
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 103 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 61 insertions(+), 42 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 4ef8daf..2c8b7ba 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -95,6 +95,65 @@ static void cow_request_end(CowRequest *req)
     qemu_co_queue_restart_all(&req->wait_queue);
 }
 
+static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
+                                            int64_t cluster,
+                                            bool *error_is_read,
+                                            bool is_write_notifier,
+                                            void *bounce_buffer)
+{
+    BlockBackend *blk = job->common.blk;
+    int n;
+    struct iovec iov;
+    QEMUIOVector bounce_qiov;
+    int ret = 0;
+    int64_t sectors_per_cluster = cluster_size_sectors(job);
+
+    trace_backup_do_cow_process(job, cluster);
+
+    n = MIN(sectors_per_cluster,
+            job->common.len / BDRV_SECTOR_SIZE -
+            cluster * sectors_per_cluster);
+
+    iov.iov_base = bounce_buffer;
+    iov.iov_len = n * BDRV_SECTOR_SIZE;
+    qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+
+    ret = blk_co_preadv(blk, cluster * job->cluster_size,
+                        bounce_qiov.size, &bounce_qiov,
+                        is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
+    if (ret < 0) {
+        trace_backup_do_cow_read_fail(job, cluster, ret);
+        if (error_is_read) {
+            *error_is_read = true;
+        }
+        return ret;
+    }
+
+    if (buffer_is_zero(iov.iov_base, iov.iov_len)) {
+        ret = blk_co_pwrite_zeroes(job->target, cluster * job->cluster_size,
+                                   bounce_qiov.size, BDRV_REQ_MAY_UNMAP);
+    } else {
+        ret = blk_co_pwritev(job->target, cluster * job->cluster_size,
+                             bounce_qiov.size, &bounce_qiov,
+                             job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
+    }
+    if (ret < 0) {
+        trace_backup_do_cow_write_fail(job, cluster, ret);
+        if (error_is_read) {
+            *error_is_read = false;
+        }
+        return ret;
+    }
+
+    /* Publish progress, guest I/O counts as progress too.  Note that the
+     * offset field is an opaque progress value, it is not a disk offset.
+     */
+    job->sectors_read += n;
+    job->common.offset += n * BDRV_SECTOR_SIZE;
+
+    return 0;
+}
+
 static int coroutine_fn backup_do_cow(BackupBlockJob *job,
                                       int64_t sector_num, int nb_sectors,
                                       bool *error_is_read,
@@ -102,13 +161,10 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
 {
     BlockBackend *blk = job->common.blk;
     CowRequest cow_request;
-    struct iovec iov;
-    QEMUIOVector bounce_qiov;
     void *bounce_buffer = NULL;
     int ret = 0;
     int64_t sectors_per_cluster = cluster_size_sectors(job);
     int64_t start, end;
-    int n;
 
     qemu_co_rwlock_rdlock(&job->flush_rwlock);
 
@@ -127,53 +183,16 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
         }
         hbitmap_reset(job->copy_bitmap, start, 1);
 
-        trace_backup_do_cow_process(job, start);
-
-        n = MIN(sectors_per_cluster,
-                job->common.len / BDRV_SECTOR_SIZE -
-                start * sectors_per_cluster);
-
         if (!bounce_buffer) {
             bounce_buffer = blk_blockalign(blk, job->cluster_size);
         }
-        iov.iov_base = bounce_buffer;
-        iov.iov_len = n * BDRV_SECTOR_SIZE;
-        qemu_iovec_init_external(&bounce_qiov, &iov, 1);
 
-        ret = blk_co_preadv(blk, start * job->cluster_size,
-                            bounce_qiov.size, &bounce_qiov,
-                            is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
+        ret = backup_copy_cluster(job, start, error_is_read, is_write_notifier,
+                                  bounce_buffer);
         if (ret < 0) {
-            trace_backup_do_cow_read_fail(job, start, ret);
-            if (error_is_read) {
-                *error_is_read = true;
-            }
-            hbitmap_set(job->copy_bitmap, start, 1);
-            goto out;
-        }
-
-        if (buffer_is_zero(iov.iov_base, iov.iov_len)) {
-            ret = blk_co_pwrite_zeroes(job->target, start * job->cluster_size,
-                                       bounce_qiov.size, BDRV_REQ_MAY_UNMAP);
-        } else {
-            ret = blk_co_pwritev(job->target, start * job->cluster_size,
-                                 bounce_qiov.size, &bounce_qiov,
-                                 job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
-        }
-        if (ret < 0) {
-            trace_backup_do_cow_write_fail(job, start, ret);
-            if (error_is_read) {
-                *error_is_read = false;
-            }
             hbitmap_set(job->copy_bitmap, start, 1);
             goto out;
         }
-
-        /* Publish progress, guest I/O counts as progress too.  Note that the
-         * offset field is an opaque progress value, it is not a disk offset.
-         */
-        job->sectors_read += n;
-        job->common.offset += n * BDRV_SECTOR_SIZE;
     }
 
 out:
-- 
1.8.3.1

* [Qemu-devel] [PATCH 10/21] backup: refactor backup_copy_cluster()
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

Split out read and write functions.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c     | 53 +++++++++++++++++++++++++++++++++++++++--------------
 block/trace-events |  4 ++--
 2 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 2c8b7ba..5c31607 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -95,18 +95,53 @@ static void cow_request_end(CowRequest *req)
     qemu_co_queue_restart_all(&req->wait_queue);
 }
 
+static int coroutine_fn backup_do_read(BackupBlockJob *job,
+                                       int64_t offset, unsigned int bytes,
+                                       QEMUIOVector *qiov,
+                                       bool is_write_notifier)
+{
+    int ret = blk_co_preadv(job->common.blk, offset, bytes, qiov,
+                            is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
+    if (ret < 0) {
+        trace_backup_do_read_fail(job, offset, bytes, ret);
+    }
+
+    return ret;
+}
+
+static int coroutine_fn backup_do_write(BackupBlockJob *job,
+                                        int64_t offset, unsigned int bytes,
+                                        QEMUIOVector *qiov)
+{
+    int ret;
+    assert(qiov->niov == 1);
+
+    if (buffer_is_zero(qiov->iov->iov_base, qiov->iov->iov_len)) {
+        ret = blk_co_pwrite_zeroes(job->target, offset, bytes,
+                                   BDRV_REQ_MAY_UNMAP);
+    } else {
+        ret = blk_co_pwritev(job->target, offset, bytes, qiov,
+                             job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
+    }
+    if (ret < 0) {
+        trace_backup_do_write_fail(job, offset, bytes, ret);
+    }
+
+    return ret;
+}
+
 static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
                                             int64_t cluster,
                                             bool *error_is_read,
                                             bool is_write_notifier,
                                             void *bounce_buffer)
 {
-    BlockBackend *blk = job->common.blk;
     int n;
     struct iovec iov;
     QEMUIOVector bounce_qiov;
     int ret = 0;
     int64_t sectors_per_cluster = cluster_size_sectors(job);
+    int64_t offset = cluster * job->cluster_size;
 
     trace_backup_do_cow_process(job, cluster);
 
@@ -118,27 +153,17 @@ static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
     iov.iov_len = n * BDRV_SECTOR_SIZE;
     qemu_iovec_init_external(&bounce_qiov, &iov, 1);
 
-    ret = blk_co_preadv(blk, cluster * job->cluster_size,
-                        bounce_qiov.size, &bounce_qiov,
-                        is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
+    ret = backup_do_read(job, offset, bounce_qiov.size, &bounce_qiov,
+                         is_write_notifier);
     if (ret < 0) {
-        trace_backup_do_cow_read_fail(job, cluster, ret);
         if (error_is_read) {
             *error_is_read = true;
         }
         return ret;
     }
 
-    if (buffer_is_zero(iov.iov_base, iov.iov_len)) {
-        ret = blk_co_pwrite_zeroes(job->target, cluster * job->cluster_size,
-                                   bounce_qiov.size, BDRV_REQ_MAY_UNMAP);
-    } else {
-        ret = blk_co_pwritev(job->target, cluster * job->cluster_size,
-                             bounce_qiov.size, &bounce_qiov,
-                             job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
-    }
+    ret = backup_do_write(job, offset, bounce_qiov.size, &bounce_qiov);
     if (ret < 0) {
-        trace_backup_do_cow_write_fail(job, cluster, ret);
         if (error_is_read) {
             *error_is_read = false;
         }
diff --git a/block/trace-events b/block/trace-events
index cfc05f2..832e8ed 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -43,8 +43,8 @@ backup_do_cow_enter(void *job, int64_t start, int64_t sector_num, int nb_sectors
 backup_do_cow_return(void *job, int64_t sector_num, int nb_sectors, int ret) "job %p sector_num %"PRId64" nb_sectors %d ret %d"
 backup_do_cow_skip(void *job, int64_t start) "job %p start %"PRId64
 backup_do_cow_process(void *job, int64_t start) "job %p start %"PRId64
-backup_do_cow_read_fail(void *job, int64_t start, int ret) "job %p start %"PRId64" ret %d"
-backup_do_cow_write_fail(void *job, int64_t start, int ret) "job %p start %"PRId64" ret %d"
+backup_do_read_fail(void *job, int64_t offset, unsigned bytes, int ret) "job %p offset %"PRId64" bytes %u ret %d"
+backup_do_write_fail(void *job, int64_t offset, unsigned bytes, int ret) "job %p offset %"PRId64" bytes %u ret %d"
 
 # blockdev.c
 qmp_block_job_cancel(void *job) "job %p"
-- 
1.8.3.1

* [Qemu-devel] [PATCH 11/21] backup: move r/w error handling code to r/w functions
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

This simplifies the code: we do not need the error_is_read parameter or
the retry loop in backup_loop. Also, retrying for read and write is
separate, so we will not reread if a write fails after a successful
read.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 141 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 74 insertions(+), 67 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 5c31607..442e6da 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -95,18 +95,65 @@ static void cow_request_end(CowRequest *req)
     qemu_co_queue_restart_all(&req->wait_queue);
 }
 
+static BlockErrorAction backup_error_action(BackupBlockJob *job,
+                                            bool read, int error)
+{
+    if (read) {
+        return block_job_error_action(&job->common, job->on_source_error,
+                                      true, error);
+    } else {
+        return block_job_error_action(&job->common, job->on_target_error,
+                                      false, error);
+    }
+}
+
+static bool coroutine_fn yield_and_check(BackupBlockJob *job)
+{
+    if (block_job_is_cancelled(&job->common)) {
+        return true;
+    }
+
+    /* we need to yield so that bdrv_drain_all() returns.
+     * (without, VM does not reboot)
+     */
+    if (job->common.speed) {
+        uint64_t delay_ns = ratelimit_calculate_delay(&job->limit,
+                                                      job->sectors_read);
+        job->sectors_read = 0;
+        block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, delay_ns);
+    } else {
+        block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, 0);
+    }
+
+    if (block_job_is_cancelled(&job->common)) {
+        return true;
+    }
+
+    return false;
+}
+
 static int coroutine_fn backup_do_read(BackupBlockJob *job,
                                        int64_t offset, unsigned int bytes,
                                        QEMUIOVector *qiov,
                                        bool is_write_notifier)
 {
-    int ret = blk_co_preadv(job->common.blk, offset, bytes, qiov,
-                            is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
+    int ret;
+    BdrvRequestFlags flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
+
+retry:
+    ret = blk_co_preadv(job->common.blk, offset, bytes, qiov, flags);
     if (ret < 0) {
         trace_backup_do_read_fail(job, offset, bytes, ret);
+
+        BlockErrorAction action = backup_error_action(job, true, -ret);
+        if (action != BLOCK_ERROR_ACTION_REPORT && !yield_and_check(job)) {
+            goto retry;
+        }
+
+        return ret;
     }
 
-    return ret;
+    return 0;
 }
 
 static int coroutine_fn backup_do_write(BackupBlockJob *job,
@@ -114,9 +161,13 @@ static int coroutine_fn backup_do_write(BackupBlockJob *job,
                                         QEMUIOVector *qiov)
 {
     int ret;
+    bool zeroes;
+
     assert(qiov->niov == 1);
+    zeroes = buffer_is_zero(qiov->iov->iov_base, qiov->iov->iov_len);
 
-    if (buffer_is_zero(qiov->iov->iov_base, qiov->iov->iov_len)) {
+retry:
+    if (zeroes) {
         ret = blk_co_pwrite_zeroes(job->target, offset, bytes,
                                    BDRV_REQ_MAY_UNMAP);
     } else {
@@ -125,14 +176,20 @@ static int coroutine_fn backup_do_write(BackupBlockJob *job,
     }
     if (ret < 0) {
         trace_backup_do_write_fail(job, offset, bytes, ret);
+
+        BlockErrorAction action = backup_error_action(job, false, -ret);
+        if (action != BLOCK_ERROR_ACTION_REPORT && !yield_and_check(job)) {
+            goto retry;
+        }
+
+        return ret;
     }
 
-    return ret;
+    return 0;
 }
 
 static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
                                             int64_t cluster,
-                                            bool *error_is_read,
                                             bool is_write_notifier,
                                             void *bounce_buffer)
 {
@@ -156,17 +213,11 @@ static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
     ret = backup_do_read(job, offset, bounce_qiov.size, &bounce_qiov,
                          is_write_notifier);
     if (ret < 0) {
-        if (error_is_read) {
-            *error_is_read = true;
-        }
         return ret;
     }
 
     ret = backup_do_write(job, offset, bounce_qiov.size, &bounce_qiov);
     if (ret < 0) {
-        if (error_is_read) {
-            *error_is_read = false;
-        }
         return ret;
     }
 
@@ -181,7 +232,6 @@ static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
 
 static int coroutine_fn backup_do_cow(BackupBlockJob *job,
                                       int64_t sector_num, int nb_sectors,
-                                      bool *error_is_read,
                                       bool is_write_notifier)
 {
     BlockBackend *blk = job->common.blk;
@@ -212,8 +262,7 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
             bounce_buffer = blk_blockalign(blk, job->cluster_size);
         }
 
-        ret = backup_copy_cluster(job, start, error_is_read, is_write_notifier,
-                                  bounce_buffer);
+        ret = backup_copy_cluster(job, start, is_write_notifier, bounce_buffer);
         if (ret < 0) {
             hbitmap_set(job->copy_bitmap, start, 1);
             goto out;
@@ -247,7 +296,7 @@ static int coroutine_fn backup_before_write_notify(
     assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
     assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
-    return backup_do_cow(job, sector_num, nb_sectors, NULL, true);
+    return backup_do_cow(job, sector_num, nb_sectors, true);
 }
 
 static void backup_set_speed(BlockJob *job, int64_t speed, Error **errp)
@@ -374,18 +423,6 @@ static void backup_drain(BlockJob *job)
     }
 }
 
-static BlockErrorAction backup_error_action(BackupBlockJob *job,
-                                            bool read, int error)
-{
-    if (read) {
-        return block_job_error_action(&job->common, job->on_source_error,
-                                      true, error);
-    } else {
-        return block_job_error_action(&job->common, job->on_target_error,
-                                      false, error);
-    }
-}
-
 typedef struct {
     int ret;
 } BackupCompleteData;
@@ -398,31 +435,6 @@ static void backup_complete(BlockJob *job, void *opaque)
     g_free(data);
 }
 
-static bool coroutine_fn yield_and_check(BackupBlockJob *job)
-{
-    if (block_job_is_cancelled(&job->common)) {
-        return true;
-    }
-
-    /* we need to yield so that bdrv_drain_all() returns.
-     * (without, VM does not reboot)
-     */
-    if (job->common.speed) {
-        uint64_t delay_ns = ratelimit_calculate_delay(&job->limit,
-                                                      job->sectors_read);
-        job->sectors_read = 0;
-        block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, delay_ns);
-    } else {
-        block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, 0);
-    }
-
-    if (block_job_is_cancelled(&job->common)) {
-        return true;
-    }
-
-    return false;
-}
-
 static void backup_skip_clusters(BackupBlockJob *job,
                                  int64_t start, int64_t end)
 {
@@ -521,26 +533,21 @@ static void backup_skip_loop(BackupBlockJob *job, BlockDriverState *base)
 static int coroutine_fn backup_loop(BackupBlockJob *job)
 {
     int ret;
-    bool error_is_read;
     int64_t sectors_per_cluster = cluster_size_sectors(job);
     int64_t cluster;
     HBitmapIter hbi;
 
     hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
     while ((cluster = hbitmap_iter_next(&hbi)) != -1) {
-        do {
-            if (yield_and_check(job)) {
-                return 0;
-            }
-            ret = backup_do_cow(job, cluster * sectors_per_cluster,
-                                sectors_per_cluster, &error_is_read,
-                                false);
-            if ((ret < 0) &&
-                backup_error_action(job, error_is_read, -ret) ==
-                BLOCK_ERROR_ACTION_REPORT) {
-                return ret;
-            }
-        } while (ret < 0);
+        if (yield_and_check(job)) {
+            return 0;
+        }
+
+        ret = backup_do_cow(job, cluster * sectors_per_cluster,
+                            sectors_per_cluster, false);
+        if (ret < 0) {
+            return ret;
+        }
     }
 
     return 0;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 12/21] iotests: add supported_cache_modes to main function
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (10 preceding siblings ...)
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 11/21] backup: move r/w error handling code to r/w functions Vladimir Sementsov-Ogievskiy
@ 2016-12-23 14:28 ` Vladimir Sementsov-Ogievskiy
  2017-01-31 14:58   ` Stefan Hajnoczi
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 13/21] coroutine: add qemu_coroutine_add_next Vladimir Sementsov-Ogievskiy
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

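Add a supported_cache_modes argument to iotests.main() and a
verify_cache_mode() helper that calls notrun() when the current cache
mode is not in the given list. A test can then restrict itself with,
for example, iotests.main(supported_fmts=['qcow2'],
supported_cache_modes=['none']).
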
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 tests/qemu-iotests/iotests.py | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index bec8eb4..a5ef9c3 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -348,6 +348,10 @@ def verify_platform(supported_oses=['linux']):
     if True not in [sys.platform.startswith(x) for x in supported_oses]:
         notrun('not suitable for this OS: %s' % sys.platform)
 
+def verify_cache_mode(supported_cache_modes=[]):
+    if supported_cache_modes and (cachemode not in supported_cache_modes):
+        notrun('not suitable for this cache mode: %s' % cachemode)
+
 def supports_quorum():
     return 'quorum' in qemu_img_pipe('--help')
 
@@ -356,7 +360,7 @@ def verify_quorum():
     if not supports_quorum():
         notrun('quorum support missing')
 
-def main(supported_fmts=[], supported_oses=['linux']):
+def main(supported_fmts=[], supported_oses=['linux'], supported_cache_modes=[]):
     '''Run tests'''
 
     global debug
@@ -373,6 +377,7 @@ def main(supported_fmts=[], supported_oses=['linux']):
     verbosity = 1
     verify_image_format(supported_fmts)
     verify_platform(supported_oses)
+    verify_cache_mode(supported_cache_modes)
 
     # We need to filter out the time taken from the output so that qemu-iotest
     # can reliably diff the results against master output.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 13/21] coroutine: add qemu_coroutine_add_next
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (11 preceding siblings ...)
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 12/21] iotests: add supported_cache_modes to main function Vladimir Sementsov-Ogievskiy
@ 2016-12-23 14:28 ` Vladimir Sementsov-Ogievskiy
  2017-01-31 15:03   ` Stefan Hajnoczi
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 14/21] block: add trace point on bdrv_close_all Vladimir Sementsov-Ogievskiy
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

Simply add the coroutine to self->co_queue_wakeup, so that it will be
entered once the current coroutine yields or terminates.
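
A sketch of the intended use, as it appears later in this series in
backup_worker_pause() (the job fields belong to the backup job there):

    /* let the main job coroutine run right after this worker yields */
    if (job->nb_busy_workers == 0 && job->waiting_for_workers) {
        qemu_coroutine_add_next(job->common.co);
    }
    qemu_co_queue_wait(&job->paused_workers);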

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/qemu/coroutine.h | 2 ++
 util/qemu-coroutine.c    | 7 +++++++
 2 files changed, 9 insertions(+)

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index e6a60d5..6e87c87 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -226,4 +226,12 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
  */
 void coroutine_fn yield_until_fd_readable(int fd);
 
+/**
+ * qemu_coroutine_add_next:
+ *
+ * Add @next to the current coroutine's wakeup queue, so that it is
+ * entered once the current coroutine yields or terminates.
+ */
+void qemu_coroutine_add_next(Coroutine *next);
+
 #endif /* QEMU_COROUTINE_H */
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 737bffa..300b96d 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -151,3 +151,10 @@ bool qemu_coroutine_entered(Coroutine *co)
 {
     return co->caller;
 }
+
+void qemu_coroutine_add_next(Coroutine *next)
+{
+    Coroutine *self = qemu_coroutine_self();
+    QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, next, co_queue_next);
+    trace_qemu_co_queue_next(next);
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 14/21] block: add trace point on bdrv_close_all
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (12 preceding siblings ...)
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 13/21] coroutine: add qemu_coroutine_add_next Vladimir Sementsov-Ogievskiy
@ 2016-12-23 14:28 ` Vladimir Sementsov-Ogievskiy
  2017-01-31 15:03   ` Stefan Hajnoczi
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 15/21] bitmap: add bitmap_count_between() function Vladimir Sementsov-Ogievskiy
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

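Add a trace point at the top of bdrv_close_all(), so that it is visible
in traces when shutdown starts cancelling block jobs and closing block
devices.
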
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block.c            | 1 +
 block/trace-events | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block.c b/block.c
index 39ddea3..3aa433b 100644
--- a/block.c
+++ b/block.c
@@ -2372,6 +2372,7 @@ static void bdrv_close(BlockDriverState *bs)
 
 void bdrv_close_all(void)
 {
+    trace_bdrv_close_all();
     block_job_cancel_sync_all();
     nbd_export_close_all();
 
diff --git a/block/trace-events b/block/trace-events
index 832e8ed..e7a7372 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -3,6 +3,7 @@
 # block.c
 bdrv_open_common(void *bs, const char *filename, int flags, const char *format_name) "bs %p filename \"%s\" flags %#x format_name \"%s\""
 bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
+bdrv_close_all(void) ""
 
 # block/block-backend.c
 blk_co_preadv(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags %x"
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 15/21] bitmap: add bitmap_count_between() function
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (13 preceding siblings ...)
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 14/21] block: add trace point on bdrv_close_all Vladimir Sementsov-Ogievskiy
@ 2016-12-23 14:28 ` Vladimir Sementsov-Ogievskiy
  2017-01-31 15:15   ` Stefan Hajnoczi
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 16/21] hbitmap: add hbitmap_count_between() function Vladimir Sementsov-Ogievskiy
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

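Add a function to count the number of set bits in the half-open range
[start, end) of a bitmap.

A minimal usage sketch (values chosen for illustration):

    unsigned long map[1] = { 0xff00UL };                 /* bits 8..15 set */
    unsigned long n = bitmap_count_between(map, 8, 20);  /* n == 8 */
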
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 include/qemu/bitmap.h |  4 ++++
 util/bitmap.c         | 27 +++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
index 63ea2d0..3bfc33e 100644
--- a/include/qemu/bitmap.h
+++ b/include/qemu/bitmap.h
@@ -216,6 +216,10 @@ static inline int bitmap_intersects(const unsigned long *src1,
     }
 }
 
+unsigned long bitmap_count_between(const unsigned long *src,
+                                   unsigned long start,
+                                   unsigned long end);
+
 void bitmap_set(unsigned long *map, long i, long len);
 void bitmap_set_atomic(unsigned long *map, long i, long len);
 void bitmap_clear(unsigned long *map, long start, long nr);
diff --git a/util/bitmap.c b/util/bitmap.c
index 43ed011..e5aaa1c 100644
--- a/util/bitmap.c
+++ b/util/bitmap.c
@@ -336,3 +336,30 @@ int slow_bitmap_intersects(const unsigned long *bitmap1,
     }
     return 0;
 }
+
+unsigned long bitmap_count_between(const unsigned long *src,
+                                   unsigned long start,
+                                   unsigned long end)
+{
+    unsigned long first = start / BITS_PER_LONG;
+    unsigned long lim = end / BITS_PER_LONG;
+    unsigned long sum, i;
+
+    assert(start < end);
+
+    sum = ctpopl(src[first] & BITMAP_FIRST_WORD_MASK(start));
+
+    for (i = first + 1; i < lim; ++i) {
+        sum += ctpopl(src[i]);
+    }
+
+    if (end % BITS_PER_LONG) {
+        if (first == lim) {
+            sum -= ctpopl(src[first] & BITMAP_FIRST_WORD_MASK(end));
+        } else {
+            sum += ctpopl(src[i] & BITMAP_LAST_WORD_MASK(end));
+        }
+    }
+
+    return sum;
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 16/21] hbitmap: add hbitmap_count_between() function
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (14 preceding siblings ...)
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 15/21] bitmap: add bitmap_count_between() function Vladimir Sementsov-Ogievskiy
@ 2016-12-23 14:28 ` Vladimir Sementsov-Ogievskiy
  2017-01-31 15:56   ` Stefan Hajnoczi
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 17/21] backup: make all reads not serializing Vladimir Sementsov-Ogievskiy
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:28 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

Add this function; it only supports HBitmaps with granularity = 0. Like
bitmap_count_between(), it counts the set bits in the half-open range
[start, end), converting internally to the inclusive range taken by
hb_count_between().
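
A minimal usage sketch (hb is an HBitmap with bits 5 and 6 set, for
illustration only):

    hbitmap_count_between(hb, 0, 5);   /* == 0, range is [0, 5) */
    hbitmap_count_between(hb, 5, 7);   /* == 2 */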

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 include/qemu/hbitmap.h | 2 ++
 util/hbitmap.c         | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index 2873a46..1967e13 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -314,4 +314,6 @@ static inline size_t hbitmap_iter_next_word(HBitmapIter *hbi, unsigned long *p_c
 }
 
 
+uint64_t hbitmap_count_between(HBitmap *hb, uint64_t start, uint64_t end);
+
 #endif
diff --git a/util/hbitmap.c b/util/hbitmap.c
index 48d8b2d..d06bfa3 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -232,6 +232,15 @@ static uint64_t hb_count_between(HBitmap *hb, uint64_t start, uint64_t last)
     return count;
 }
 
+/* hbitmap_count_between() is only for HBitmaps with granularity = 0 */
+uint64_t hbitmap_count_between(HBitmap *hb, uint64_t start, uint64_t end)
+{
+    assert(hb->granularity == 0);
+    assert(start < end);
+
+    return hb_count_between(hb, start, end - 1);
+}
+
 /* Setting starts at the last layer and propagates up if an element
  * changes.
  */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 17/21] backup: make all reads not serializing
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (15 preceding siblings ...)
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 16/21] hbitmap: add hbitmap_count_between() function Vladimir Sementsov-Ogievskiy
@ 2016-12-23 14:29 ` Vladimir Sementsov-Ogievskiy
  2017-01-31 16:30   ` Stefan Hajnoczi
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 18/21] backup: new async architecture Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:29 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

To simplify things, make all reads non-serializing, not only those from
notifiers. This is needed because after the following patch there will be
no strong division between reads issued from notifiers and other reads:
they will all be issued from one place.
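
(BDRV_REQ_NO_SERIALISING makes a read not wait for overlapping
serialising requests; the write-notifier reads already used it so that
they can run while the guest write that triggered them is still in
flight.)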

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 442e6da..c2f7665 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -134,14 +134,13 @@ static bool coroutine_fn yield_and_check(BackupBlockJob *job)
 
 static int coroutine_fn backup_do_read(BackupBlockJob *job,
                                        int64_t offset, unsigned int bytes,
-                                       QEMUIOVector *qiov,
-                                       bool is_write_notifier)
+                                       QEMUIOVector *qiov)
 {
     int ret;
-    BdrvRequestFlags flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
 
 retry:
-    ret = blk_co_preadv(job->common.blk, offset, bytes, qiov, flags);
+    ret = blk_co_preadv(job->common.blk, offset, bytes, qiov,
+                        BDRV_REQ_NO_SERIALISING);
     if (ret < 0) {
         trace_backup_do_read_fail(job, offset, bytes, ret);
 
@@ -190,7 +189,6 @@ retry:
 
 static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
                                             int64_t cluster,
-                                            bool is_write_notifier,
                                             void *bounce_buffer)
 {
     int n;
@@ -210,8 +208,7 @@ static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
     iov.iov_len = n * BDRV_SECTOR_SIZE;
     qemu_iovec_init_external(&bounce_qiov, &iov, 1);
 
-    ret = backup_do_read(job, offset, bounce_qiov.size, &bounce_qiov,
-                         is_write_notifier);
+    ret = backup_do_read(job, offset, bounce_qiov.size, &bounce_qiov);
     if (ret < 0) {
         return ret;
     }
@@ -231,8 +228,7 @@ static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
 }
 
 static int coroutine_fn backup_do_cow(BackupBlockJob *job,
-                                      int64_t sector_num, int nb_sectors,
-                                      bool is_write_notifier)
+                                      int64_t sector_num, int nb_sectors)
 {
     BlockBackend *blk = job->common.blk;
     CowRequest cow_request;
@@ -262,7 +258,7 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
             bounce_buffer = blk_blockalign(blk, job->cluster_size);
         }
 
-        ret = backup_copy_cluster(job, start, is_write_notifier, bounce_buffer);
+        ret = backup_copy_cluster(job, start, bounce_buffer);
         if (ret < 0) {
             hbitmap_set(job->copy_bitmap, start, 1);
             goto out;
@@ -296,7 +292,7 @@ static int coroutine_fn backup_before_write_notify(
     assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
     assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
-    return backup_do_cow(job, sector_num, nb_sectors, true);
+    return backup_do_cow(job, sector_num, nb_sectors);
 }
 
 static void backup_set_speed(BlockJob *job, int64_t speed, Error **errp)
@@ -544,7 +540,7 @@ static int coroutine_fn backup_loop(BackupBlockJob *job)
         }
 
         ret = backup_do_cow(job, cluster * sectors_per_cluster,
-                            sectors_per_cluster, false);
+                            sectors_per_cluster);
         if (ret < 0) {
             return ret;
         }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 18/21] backup: new async architecture
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (16 preceding siblings ...)
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 17/21] backup: make all reads not serializing Vladimir Sementsov-Ogievskiy
@ 2016-12-23 14:29 ` Vladimir Sementsov-Ogievskiy
  2017-01-31 16:46   ` Stefan Hajnoczi
  2017-02-01 16:13   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 20/21] backup: move bitmap handling from backup_do_cow to get_work Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  21 siblings, 2 replies; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:29 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

New async scheme: all copying is done by worker coroutines. The main
block-job coroutine handles the initial skipping of unallocated
clusters, as well as all pauses, error handling and throttling.

Notifiers just mark clusters as awaited (by adding a NotifierRequest to
the list) and wait for some time (5 s) for these clusters to be copied.

Because of the improvements (async, faster, skipping of unallocated
clusters) some tests failed; they are fixed here too.
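
Concretely, the number of worker coroutines is fixed at NB_WORKERS (24),
and a notifier waits at most WRITE_NOTIFY_TIMEOUT_NS (5 s); if its
clusters have still not been read by then, the guest write fails with
-EINVAL.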

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c           | 641 ++++++++++++++++++++++++++++++++++++-----------
 block/trace-events       |  31 ++-
 blockjob.c               |  29 ++-
 include/block/blockjob.h |  15 +-
 tests/qemu-iotests/055   |   2 +-
 tests/qemu-iotests/129   |   6 +-
 6 files changed, 569 insertions(+), 155 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index c2f7665..558b871 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -27,6 +27,18 @@
 
 #define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)
 #define SLICE_TIME 100000000ULL /* ns */
+#define BACKUP_LOOP_DELAY 100000000ULL /* ns */
+#define WRITE_NOTIFY_TIMEOUT_NS 5000000000LL
+
+#define NB_WORKERS 24
+
+typedef struct NotifierRequest {
+    int64_t start; /* in clusters */
+    int64_t end; /* in clusters */
+    int nb_wait; /* awaited clusters */
+    Coroutine *notif;
+    QSIMPLEQ_ENTRY(NotifierRequest) list;
+} NotifierRequest;
 
 typedef struct BackupBlockJob {
     BlockJob common;
@@ -37,7 +49,6 @@ typedef struct BackupBlockJob {
     RateLimit limit;
     BlockdevOnError on_source_error;
     BlockdevOnError on_target_error;
-    CoRwlock flush_rwlock;
     uint64_t sectors_read;
     int64_t cluster_size;
     bool compress;
@@ -45,19 +56,26 @@ typedef struct BackupBlockJob {
     QLIST_HEAD(, CowRequest) inflight_reqs;
 
     HBitmap *copy_bitmap;
+    HBitmapIter linear_hbi;
+
+    CoQueue paused_workers;
+    int running_workers;
+    int nb_busy_workers;
+
+    bool delayed;
+    bool waiting_for_workers;
+    bool error_exit;
+    bool has_errors;
+    BlockErrorAction error_action;
+    int main_error;
+    bool main_error_is_read;
+
+    unsigned long *notif_wait_bitmap;
+    QSIMPLEQ_HEAD(, NotifierRequest) notifier_reqs;
+    NotifierRequest *current_notif;
+    HBitmapIter current_notif_hbi;
 } BackupBlockJob;
 
-/* Size of a cluster in sectors, instead of bytes. */
-static inline int64_t cluster_size_sectors(BackupBlockJob *job)
-{
-  return job->cluster_size / BDRV_SECTOR_SIZE;
-}
-
-static inline int64_t max_query_sectors(BackupBlockJob *job)
-{
-    return (INT_MAX & ~(job->cluster_size - 1)) >> BDRV_SECTOR_BITS;
-}
-
 /* See if in-flight requests overlap and wait for them to complete */
 static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
                                                        int64_t start,
@@ -95,6 +113,129 @@ static void cow_request_end(CowRequest *req)
     qemu_co_queue_restart_all(&req->wait_queue);
 }
 
+static inline void set_notif_cur(BackupBlockJob *job, NotifierRequest *nr)
+{
+    if (nr != NULL) {
+        hbitmap_iter_init(&job->current_notif_hbi, job->copy_bitmap, nr->start);
+    }
+    job->current_notif = nr;
+}
+
+/* NULL result means that request is already done */
+static NotifierRequest *add_notif_req(BackupBlockJob *job,
+                                      int64_t start, int64_t end,
+                                      Coroutine *notif)
+{
+    NotifierRequest *nr;
+    int nb_wait = bitmap_count_between(job->notif_wait_bitmap, start, end) +
+                  hbitmap_count_between(job->copy_bitmap, start, end);
+
+    if (nb_wait == 0) {
+        return NULL;
+    }
+
+    nr = g_new0(NotifierRequest, 1);
+
+    nr->start = start;
+    nr->end = end;
+    nr->notif = notif;
+    nr->nb_wait = nb_wait;
+
+    QSIMPLEQ_INSERT_TAIL(&job->notifier_reqs, nr, list);
+    if (job->current_notif == NULL) {
+        set_notif_cur(job, nr);
+    }
+
+    return nr;
+}
+
+static void detach_notif_req(BackupBlockJob *job, NotifierRequest *req)
+{
+    assert(req);
+
+    if (job->current_notif == req) {
+        set_notif_cur(job, QSIMPLEQ_NEXT(req, list));
+    }
+
+    QSIMPLEQ_REMOVE(&job->notifier_reqs, req, NotifierRequest, list);
+}
+
+static int64_t next_notif_cluster(BackupBlockJob *job)
+{
+    while (job->current_notif != NULL) {
+        int64_t cluster = hbitmap_iter_next(&job->current_notif_hbi);
+        if (cluster != -1 && cluster < job->current_notif->end) {
+            return cluster;
+        }
+
+        /* nothing to do for current notifier */
+        set_notif_cur(job, QSIMPLEQ_NEXT(job->current_notif, list));
+    }
+
+    /* nothing to do for notifiers */
+    return -1;
+}
+
+static void finish_cluster_for_notif(BackupBlockJob *job, int64_t cluster)
+{
+    NotifierRequest *req, *next;
+    QSIMPLEQ_HEAD(, NotifierRequest) finished_reqs;
+    QSIMPLEQ_INIT(&finished_reqs);
+
+    trace_backup_finish_cluster_for_notif(job, cluster);
+
+    clear_bit(cluster, job->notif_wait_bitmap);
+
+    /* handle notifiers */
+    QSIMPLEQ_FOREACH_SAFE(req, &job->notifier_reqs, list, next) {
+        if (cluster >= req->start && cluster < req->end) {
+            assert(req->nb_wait > 0);
+            req->nb_wait--;
+            if (req->nb_wait == 0) {
+                detach_notif_req(job, req);
+                QSIMPLEQ_INSERT_TAIL(&finished_reqs, req, list);
+            }
+        }
+    }
+
+    while ((req = QSIMPLEQ_FIRST(&finished_reqs))) {
+        QSIMPLEQ_REMOVE(&finished_reqs, req, NotifierRequest, list);
+        qemu_coroutine_enter(req->notif);
+    }
+}
+
+static void release_pending_notifiers(BackupBlockJob *job)
+{
+    NotifierRequest *req, *next;
+    QSIMPLEQ_HEAD(, NotifierRequest) finished_reqs;
+    QSIMPLEQ_INIT(&finished_reqs);
+
+    QSIMPLEQ_FOREACH_SAFE(req, &job->notifier_reqs, list, next) {
+        req->nb_wait = 0;
+        detach_notif_req(job, req);
+        QSIMPLEQ_INSERT_TAIL(&finished_reqs, req, list);
+    }
+
+    while ((req = QSIMPLEQ_FIRST(&finished_reqs))) {
+        QSIMPLEQ_REMOVE(&finished_reqs, req, NotifierRequest, list);
+        qemu_coroutine_enter(req->notif);
+    }
+}
+
+static void coroutine_fn backup_do_cow(BackupBlockJob *job,
+                                       int64_t sector_num, int nb_sectors);
+
+/* Size of a cluster in sectors, instead of bytes. */
+static inline int64_t cluster_size_sectors(BackupBlockJob *job)
+{
+    return job->cluster_size / BDRV_SECTOR_SIZE;
+}
+
+static inline int64_t max_query_sectors(BackupBlockJob *job)
+{
+    return (INT_MAX & ~(job->cluster_size - 1)) >> BDRV_SECTOR_BITS;
+}
+
 static BlockErrorAction backup_error_action(BackupBlockJob *job,
                                             bool read, int error)
 {
@@ -107,12 +248,8 @@ static BlockErrorAction backup_error_action(BackupBlockJob *job,
     }
 }
 
-static bool coroutine_fn yield_and_check(BackupBlockJob *job)
+static void backup_job_sleep(BackupBlockJob *job)
 {
-    if (block_job_is_cancelled(&job->common)) {
-        return true;
-    }
-
     /* we need to yield so that bdrv_drain_all() returns.
      * (without, VM does not reboot)
      */
@@ -120,11 +257,42 @@ static bool coroutine_fn yield_and_check(BackupBlockJob *job)
         uint64_t delay_ns = ratelimit_calculate_delay(&job->limit,
                                                       job->sectors_read);
         job->sectors_read = 0;
+        job->delayed = true;
+        trace_backup_sleep_delay(block_job_is_cancelled(&job->common),
+                                 block_job_should_pause(&job->common));
         block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, delay_ns);
+        job->delayed = false;
     } else {
-        block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, 0);
+        trace_backup_sleep_zero(block_job_is_cancelled(&job->common),
+                                block_job_should_pause(&job->common));
+        block_job_pause_point(&job->common);
+    }
+
+    trace_backup_sleep_finish();
+}
+
+/* backup_busy_sleep
+ * Just yield, without setting busy=false
+ */
+static void backup_busy_sleep(BackupBlockJob *job)
+{
+    trace_backup_loop_busy_sleep_start();
+
+    job->waiting_for_workers = true;
+    qemu_coroutine_yield();
+    job->waiting_for_workers = false;
+
+    trace_backup_loop_busy_sleep_finish();
+}
+
+static bool coroutine_fn yield_and_check(BackupBlockJob *job)
+{
+    if (block_job_is_cancelled(&job->common)) {
+        return true;
     }
 
+    backup_job_sleep(job);
+
     if (block_job_is_cancelled(&job->common)) {
         return true;
     }
@@ -132,69 +300,244 @@ static bool coroutine_fn yield_and_check(BackupBlockJob *job)
     return false;
 }
 
-static int coroutine_fn backup_do_read(BackupBlockJob *job,
-                                       int64_t offset, unsigned int bytes,
-                                       QEMUIOVector *qiov)
+static void backup_job_wait_workers(BackupBlockJob *job)
 {
-    int ret;
+    if (job->nb_busy_workers == 0) {
+        return;
+    }
 
-retry:
-    ret = blk_co_preadv(job->common.blk, offset, bytes, qiov,
-                        BDRV_REQ_NO_SERIALISING);
-    if (ret < 0) {
-        trace_backup_do_read_fail(job, offset, bytes, ret);
+    job->waiting_for_workers = true;
+
+    trace_backup_job_wait_workers_start();
+    qemu_coroutine_yield();
+    trace_backup_job_wait_workers_end();
+
+    assert(job->nb_busy_workers == 0);
+    job->waiting_for_workers = false;
+}
 
-        BlockErrorAction action = backup_error_action(job, true, -ret);
-        if (action != BLOCK_ERROR_ACTION_REPORT && !yield_and_check(job)) {
-            goto retry;
+static void backup_worker_pause(BackupBlockJob *job)
+{
+    job->nb_busy_workers--;
+
+    trace_backup_worker_pause(qemu_coroutine_self(), job->nb_busy_workers,
+                              job->waiting_for_workers);
+
+    if (job->nb_busy_workers == 0 && job->waiting_for_workers) {
+        qemu_coroutine_add_next(job->common.co);
+    }
+
+    qemu_co_queue_wait(&job->paused_workers);
+
+    trace_backup_worker_unpause(qemu_coroutine_self());
+
+    job->nb_busy_workers++;
+}
+
+static void backup_worker_error(BackupBlockJob *job, int error, bool is_read)
+{
+    BlockErrorAction action = block_job_get_error_action(&job->common,
+        is_read ? job->on_source_error : job->on_target_error, error);
+
+    if (!job->has_errors || job->error_action == BLOCK_ERROR_ACTION_IGNORE ||
+            action == BLOCK_ERROR_ACTION_REPORT) {
+        job->has_errors = true;
+        job->error_action = action;
+        job->main_error = error;
+        job->main_error_is_read = is_read;
+    }
+
+    backup_worker_pause(job);
+}
+
+#define BACKUP_WORKER_STOP -1
+#define BACKUP_WORKER_PAUSE -2
+
+static inline bool check_delay(BackupBlockJob *job)
+{
+    uint64_t delay_ns;
+
+    if (!job->common.speed) {
+        return false;
+    }
+
+    delay_ns = ratelimit_calculate_delay(&job->limit, job->sectors_read);
+    job->sectors_read = 0;
+
+    if (delay_ns == 0) {
+        if (job->delayed) {
+            job->delayed = false;
+            qemu_co_queue_restart_all(&job->paused_workers);
         }
+        return false;
+    }
 
-        return ret;
+    return job->delayed = true;
+}
+
+static inline int64_t backup_get_work(BackupBlockJob *job)
+{
+    int64_t cluster;
+
+    if (block_job_is_cancelled(&job->common) || job->error_exit) {
+        return BACKUP_WORKER_STOP;
     }
 
-    return 0;
+    cluster = next_notif_cluster(job);
+    if (cluster != -1) {
+        return cluster;
+    }
+
+    if (block_job_should_pause(&job->common) ||
+        job->sync_mode == MIRROR_SYNC_MODE_NONE || check_delay(job))
+    {
+        return BACKUP_WORKER_PAUSE;
+    }
+
+    cluster = hbitmap_iter_next(&job->linear_hbi);
+
+    return cluster == -1 ? BACKUP_WORKER_STOP : cluster;
+}
+
+static void coroutine_fn backup_worker_co(void *opaque)
+{
+    BackupBlockJob *job = opaque;
+
+    job->running_workers++;
+    job->nb_busy_workers++;
+
+    while (true) {
+        int64_t cluster = backup_get_work(job);
+        trace_backup_worker_got_work(job, qemu_coroutine_self(), cluster);
+
+        switch (cluster) {
+        case BACKUP_WORKER_STOP:
+            job->nb_busy_workers--;
+            job->running_workers--;
+            if (job->nb_busy_workers == 0 && job->waiting_for_workers) {
+                qemu_coroutine_add_next(job->common.co);
+            }
+            trace_backup_worker_stop(qemu_coroutine_self(),
+                                     job->nb_busy_workers, job->running_workers,
+                                     job->waiting_for_workers);
+            return;
+        case BACKUP_WORKER_PAUSE:
+            backup_worker_pause(job);
+            break;
+        default:
+            backup_do_cow(job, cluster * cluster_size_sectors(job),
+                          cluster_size_sectors(job));
+        }
+    }
+}
+
+static bool coroutine_fn handle_errors(BackupBlockJob *job)
+{
+    BlockErrorAction action;
+
+    if (!job->has_errors) {
+        return false;
+    }
+
+    backup_job_wait_workers(job);
+
+    action = backup_error_action(job, job->main_error_is_read,
+                                 -job->main_error);
+
+    if (action == BLOCK_ERROR_ACTION_REPORT) {
+        return true;
+    }
+
+    job->has_errors = false;
+    return false;
+}
+
+/* This loop runs in the block job coroutine (job->common.co) and handles
+ * block-job specific pauses, delays and yields. It doesn't copy any data.
+ */
+static void coroutine_fn backup_loop(BackupBlockJob *job)
+{
+    bool err_ret;
+
+    trace_backup_loop_enter();
+
+    while (job->running_workers > 0) {
+        if (handle_errors(job)) {
+            trace_backup_loop_error_exit();
+            return;
+        }
+
+        backup_job_sleep(job);
+
+        if (job->running_workers == 0) {
+            break;
+        }
+
+        if (handle_errors(job)) {
+            trace_backup_loop_error_exit();
+            return;
+        }
+
+        /* we may be resumed after pause or delay, so we should resume paused
+         * workers */
+        qemu_co_queue_restart_all(&job->paused_workers);
+
+        backup_busy_sleep(job);
+    }
+
+    err_ret = handle_errors(job);
+    trace_backup_loop_finish(err_ret);
 }
 
-static int coroutine_fn backup_do_write(BackupBlockJob *job,
+static void coroutine_fn backup_do_read(BackupBlockJob *job,
                                         int64_t offset, unsigned int bytes,
                                         QEMUIOVector *qiov)
 {
     int ret;
-    bool zeroes;
-
-    assert(qiov->niov == 1);
-    zeroes = buffer_is_zero(qiov->iov->iov_base, qiov->iov->iov_len);
 
-retry:
-    if (zeroes) {
-        ret = blk_co_pwrite_zeroes(job->target, offset, bytes,
-                                   BDRV_REQ_MAY_UNMAP);
-    } else {
-        ret = blk_co_pwritev(job->target, offset, bytes, qiov,
-                             job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
+    while (!job->error_exit &&
+           (ret = blk_co_preadv(job->common.blk, offset, bytes, qiov,
+                                BDRV_REQ_NO_SERIALISING)) < 0)
+    {
+        trace_backup_do_read_fail(job, qemu_coroutine_self(), offset, bytes,
+                                  ret);
+        backup_worker_error(job, ret, true);
     }
-    if (ret < 0) {
-        trace_backup_do_write_fail(job, offset, bytes, ret);
+}
 
-        BlockErrorAction action = backup_error_action(job, false, -ret);
-        if (action != BLOCK_ERROR_ACTION_REPORT && !yield_and_check(job)) {
-            goto retry;
-        }
+static void coroutine_fn backup_do_write(BackupBlockJob *job,
+                                         int64_t offset, unsigned int bytes,
+                                         QEMUIOVector *qiov)
+{
+    int ret;
+    BdrvRequestFlags flags = 0;
 
-        return ret;
+    assert(qiov->niov == 1);
+
+    if (buffer_is_zero(qiov->iov->iov_base, qiov->iov->iov_len)) {
+        flags = BDRV_REQ_MAY_UNMAP | BDRV_REQ_ZERO_WRITE;
+        qiov = NULL;
+    } else if (job->compress) {
+        flags = BDRV_REQ_WRITE_COMPRESSED;
     }
 
-    return 0;
+    while (!job->error_exit &&
+           (ret = blk_co_pwritev(job->target, offset,
+                                 bytes, qiov, flags)) < 0)
+    {
+        trace_backup_do_write_fail(job, qemu_coroutine_self(), offset, bytes,
+                                   ret);
+        backup_worker_error(job, ret, false);
+    }
 }
 
-static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
+static void coroutine_fn backup_copy_cluster(BackupBlockJob *job,
                                             int64_t cluster,
                                             void *bounce_buffer)
 {
     int n;
     struct iovec iov;
     QEMUIOVector bounce_qiov;
-    int ret = 0;
     int64_t sectors_per_cluster = cluster_size_sectors(job);
     int64_t offset = cluster * job->cluster_size;
 
@@ -208,14 +553,22 @@ static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
     iov.iov_len = n * BDRV_SECTOR_SIZE;
     qemu_iovec_init_external(&bounce_qiov, &iov, 1);
 
-    ret = backup_do_read(job, offset, bounce_qiov.size, &bounce_qiov);
-    if (ret < 0) {
-        return ret;
+    backup_do_read(job, offset, bounce_qiov.size, &bounce_qiov);
+    if (job->error_exit) {
+        return;
     }
 
-    ret = backup_do_write(job, offset, bounce_qiov.size, &bounce_qiov);
-    if (ret < 0) {
-        return ret;
+    if (job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+        finish_cluster_for_notif(job, cluster);
+    }
+
+    backup_do_write(job, offset, bounce_qiov.size, &bounce_qiov);
+    if (job->error_exit) {
+        return;
+    }
+
+    if (job->sync_mode == MIRROR_SYNC_MODE_NONE) {
+        finish_cluster_for_notif(job, cluster);
     }
 
     /* Publish progress, guest I/O counts as progress too.  Note that the
@@ -223,29 +576,21 @@ static int coroutine_fn backup_copy_cluster(BackupBlockJob *job,
      */
     job->sectors_read += n;
     job->common.offset += n * BDRV_SECTOR_SIZE;
-
-    return 0;
 }
 
-static int coroutine_fn backup_do_cow(BackupBlockJob *job,
-                                      int64_t sector_num, int nb_sectors)
+static void coroutine_fn backup_do_cow(BackupBlockJob *job,
+                                       int64_t sector_num, int nb_sectors)
 {
     BlockBackend *blk = job->common.blk;
-    CowRequest cow_request;
     void *bounce_buffer = NULL;
-    int ret = 0;
     int64_t sectors_per_cluster = cluster_size_sectors(job);
     int64_t start, end;
 
-    qemu_co_rwlock_rdlock(&job->flush_rwlock);
-
     start = sector_num / sectors_per_cluster;
     end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
 
-    trace_backup_do_cow_enter(job, start, sector_num, nb_sectors);
-
-    wait_for_overlapping_requests(job, start, end);
-    cow_request_begin(&cow_request, job, start, end);
+    trace_backup_do_cow_enter(job, qemu_coroutine_self(), start, sector_num,
+                              nb_sectors);
 
     for (; start < end; start++) {
         if (!hbitmap_get(job->copy_bitmap, start)) {
@@ -253,46 +598,64 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
             continue; /* already copied */
         }
         hbitmap_reset(job->copy_bitmap, start, 1);
+        set_bit(start, job->notif_wait_bitmap);
 
         if (!bounce_buffer) {
             bounce_buffer = blk_blockalign(blk, job->cluster_size);
         }
 
-        ret = backup_copy_cluster(job, start, bounce_buffer);
-        if (ret < 0) {
-            hbitmap_set(job->copy_bitmap, start, 1);
-            goto out;
-        }
+        backup_copy_cluster(job, start, bounce_buffer);
     }
 
-out:
     if (bounce_buffer) {
         qemu_vfree(bounce_buffer);
     }
 
-    cow_request_end(&cow_request);
-
-    trace_backup_do_cow_return(job, sector_num, nb_sectors, ret);
-
-    qemu_co_rwlock_unlock(&job->flush_rwlock);
-
-    return ret;
+    trace_backup_do_cow_return(job, qemu_coroutine_self(), sector_num,
+                               nb_sectors);
 }
 
 static int coroutine_fn backup_before_write_notify(
         NotifierWithReturn *notifier,
         void *opaque)
 {
-    BackupBlockJob *job = container_of(notifier, BackupBlockJob, before_write);
-    BdrvTrackedRequest *req = opaque;
-    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
-    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
+    BdrvTrackedRequest *tr = opaque;
+    NotifierRequest *nr;
+    BackupBlockJob *job = (BackupBlockJob *)tr->bs->job;
+    int64_t start = tr->offset / job->cluster_size;
+    int64_t end = DIV_ROUND_UP(tr->offset + tr->bytes, job->cluster_size);
+    int ret = 0;
+
+    assert((tr->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((tr->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    nr = add_notif_req(job, start, end, qemu_coroutine_self());
 
-    assert(req->bs == blk_bs(job->common.blk));
-    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
-    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+    if (nr == NULL) {
+        trace_backup_before_write_notify_skip(job, tr->offset, tr->bytes);
+    } else {
+        trace_backup_before_write_notify_start(job, tr->offset, tr->bytes, nr,
+                                               nr->start, nr->end, nr->nb_wait);
+
+        if (!job->has_errors) {
+            qemu_co_queue_restart_all(&job->paused_workers);
+        }
+        co_aio_sleep_ns(blk_get_aio_context(job->common.blk),
+                        QEMU_CLOCK_REALTIME, WRITE_NOTIFY_TIMEOUT_NS);
+        if (nr->nb_wait > 0) {
+            /* timer expired and read request not finished */
+            ret = -EINVAL;
+            detach_notif_req(job, nr);
+            trace_backup_before_write_notify_timeout(job, nr, nr->start,
+                                                     nr->end, nr->nb_wait);
+        } else {
+            trace_backup_before_write_notify_success(job, nr, nr->start,
+                                                     nr->end, nr->nb_wait);
+        }
+        g_free(nr);
+    }
 
-    return backup_do_cow(job, sector_num, nb_sectors);
+    return ret;
 }
 
 static void backup_set_speed(BlockJob *job, int64_t speed, Error **errp)
@@ -434,11 +797,6 @@ static void backup_complete(BlockJob *job, void *opaque)
 static void backup_skip_clusters(BackupBlockJob *job,
                                  int64_t start, int64_t end)
 {
-    CowRequest cow_request;
-
-    wait_for_overlapping_requests(job, start, end);
-    cow_request_begin(&cow_request, job, start, end);
-
     if (end * job->cluster_size > job->common.len) {
         int64_t n;
         end--;
@@ -446,8 +804,10 @@ static void backup_skip_clusters(BackupBlockJob *job,
         assert(n > 0);
 
         if (hbitmap_get(job->copy_bitmap, end)) {
+            trace_backup_skip_cluster(job, end);
             hbitmap_reset(job->copy_bitmap, end, 1);
             job->common.offset += n;
+            finish_cluster_for_notif(job, end);
         }
     }
 
@@ -456,11 +816,11 @@ static void backup_skip_clusters(BackupBlockJob *job,
             continue;
         }
 
+        trace_backup_skip_cluster(job, start);
         hbitmap_reset(job->copy_bitmap, start, 1);
         job->common.offset += job->cluster_size;
+        finish_cluster_for_notif(job, start);
     }
-
-    cow_request_end(&cow_request);
 }
 
 static int backup_skip_unallocated_clusters(BackupBlockJob *job,
@@ -526,29 +886,6 @@ static void backup_skip_loop(BackupBlockJob *job, BlockDriverState *base)
     }
 }
 
-static int coroutine_fn backup_loop(BackupBlockJob *job)
-{
-    int ret;
-    int64_t sectors_per_cluster = cluster_size_sectors(job);
-    int64_t cluster;
-    HBitmapIter hbi;
-
-    hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
-    while ((cluster = hbitmap_iter_next(&hbi)) != -1) {
-        if (yield_and_check(job)) {
-            return 0;
-        }
-
-        ret = backup_do_cow(job, cluster * sectors_per_cluster,
-                            sectors_per_cluster);
-        if (ret < 0) {
-            return ret;
-        }
-    }
-
-    return 0;
-}
-
 /* init copy_bitmap from sync_bitmap */
 static void backup_incremental_init_copy_bitmap(BackupBlockJob *job)
 {
@@ -586,50 +923,62 @@ static void coroutine_fn backup_run(void *opaque)
     BackupCompleteData *data;
     BlockDriverState *bs = blk_bs(job->common.blk);
     int64_t end;
-    int ret = 0;
+    int i;
+    bool is_top = job->sync_mode == MIRROR_SYNC_MODE_TOP;
+    bool is_full = job->sync_mode == MIRROR_SYNC_MODE_FULL;
+
+    trace_backup_run();
 
+    qemu_co_queue_init(&job->paused_workers);
     QLIST_INIT(&job->inflight_reqs);
-    qemu_co_rwlock_init(&job->flush_rwlock);
+    QSIMPLEQ_INIT(&job->notifier_reqs);
 
     end = DIV_ROUND_UP(job->common.len, job->cluster_size);
 
     job->copy_bitmap = hbitmap_alloc(end, 0);
+    job->notif_wait_bitmap = bitmap_new(end);
 
     job->before_write.notify = backup_before_write_notify;
     bdrv_add_before_write_notifier(bs, &job->before_write);
 
-    if (job->sync_mode == MIRROR_SYNC_MODE_NONE) {
+    /* init copy_bitmap */
+    if (job->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
+        backup_incremental_init_copy_bitmap(job);
+    } else {
         hbitmap_set(job->copy_bitmap, 0, end);
+    }
+    hbitmap_iter_init(&job->linear_hbi, job->copy_bitmap, 0);
+
+    for (i = 0; i < NB_WORKERS; ++i) {
+        Coroutine *co = qemu_coroutine_create(backup_worker_co, job);
+        qemu_coroutine_enter(co);
+    }
+
+    if (job->sync_mode == MIRROR_SYNC_MODE_NONE) {
         while (!block_job_is_cancelled(&job->common)) {
-            /* Yield until the job is cancelled.  We just let our before_write
-             * notify callback service CoW requests. */
+            /* Yield until the job is cancelled.  We just let backup workers
+             * service CoW requests. */
             block_job_yield(&job->common);
         }
-    } else {
-        if (job->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
-            backup_incremental_init_copy_bitmap(job);
-        } else {
-            /* top or full mode */
-            bool is_top = job->sync_mode == MIRROR_SYNC_MODE_TOP;
-            BlockDriverState *base =
-                    is_top ? backing_bs(blk_bs(job->common.blk)) : NULL;
-            hbitmap_set(job->copy_bitmap, 0, end);
-            if (is_top || bdrv_has_zero_init(blk_bs(job->target))) {
-                backup_skip_loop(job, base);
-            }
-        }
-        ret = backup_loop(job);
+    } else if (is_top || (is_full && bdrv_has_zero_init(blk_bs(job->target)))) {
+        BlockDriverState *base =
+                is_top ? backing_bs(blk_bs(job->common.blk)) : NULL;
+        backup_skip_loop(job, base);
     }
 
+    /* wait for all workers to exit */
+    backup_loop(job);
+
     notifier_with_return_remove(&job->before_write);
 
-    /* wait until pending backup_do_cow() calls have completed */
-    qemu_co_rwlock_wrlock(&job->flush_rwlock);
-    qemu_co_rwlock_unlock(&job->flush_rwlock);
+    /* in case of error or cancel there may be pending notifiers */
+    release_pending_notifiers(job);
+
     hbitmap_free(job->copy_bitmap);
+    g_free(job->notif_wait_bitmap);
 
     data = g_malloc(sizeof(*data));
-    data->ret = ret;
+    data->ret = job->main_error;
     block_job_defer_to_main_loop(&job->common, backup_complete, data);
 }
 
diff --git a/block/trace-events b/block/trace-events
index e7a7372..a55297d 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -40,12 +40,35 @@ mirror_yield_buf_busy(void *s, int nb_chunks, int in_flight) "s %p requested chu
 mirror_break_buf_busy(void *s, int nb_chunks, int in_flight) "s %p requested chunks %d in_flight %d"
 
 # block/backup.c
-backup_do_cow_enter(void *job, int64_t start, int64_t sector_num, int nb_sectors) "job %p start %"PRId64" sector_num %"PRId64" nb_sectors %d"
-backup_do_cow_return(void *job, int64_t sector_num, int nb_sectors, int ret) "job %p sector_num %"PRId64" nb_sectors %d ret %d"
+backup_do_cow_enter(void *job, void *self, int64_t start, int64_t sector_num, int nb_sectors) "job %p self %p start %"PRId64" sector_num %"PRId64" nb_sectors %d"
+backup_do_cow_return(void *job, void *self, int64_t sector_num, int nb_sectors) "job %p self %p sector_num %"PRId64" nb_sectors %d"
 backup_do_cow_skip(void *job, int64_t start) "job %p start %"PRId64
 backup_do_cow_process(void *job, int64_t start) "job %p start %"PRId64
-backup_do_read_fail(void *job, int64_t offset, unsigned bytes, int ret) "job %p offset %"PRId64" bytes %u ret %d"
-backup_do_write_fail(void *job, int64_t offset, unsigned bytes, int ret) "job %p offset %"PRId64" bytes %u ret %d"
+backup_do_read_fail(void *job, void *self, int64_t offset, unsigned bytes, int ret) "job %p self %p offset %"PRId64" bytes %u ret %d"
+backup_do_write_fail(void *job, void *self, int64_t offset, unsigned bytes, int ret) "job %p self %p offset %"PRId64" bytes %u ret %d"
+backup_job_wait_workers_start(void) ""
+backup_job_wait_workers_end(void) ""
+backup_loop_enter(void) ""
+backup_loop_error_exit(void) ""
+backup_loop_sleep_start(void) ""
+backup_loop_sleep_finish(void) ""
+backup_loop_busy_sleep_start(void) ""
+backup_loop_busy_sleep_finish(void) ""
+backup_loop_finish(bool with_err) "with_err: %d"
+backup_run(void) ""
+backup_sleep_delay(bool cancelled, bool pause) "cancelled %d pause %d"
+backup_sleep_zero(bool cancelled, bool pause) "cancelled %d pause %d"
+backup_sleep_finish(void) ""
+backup_worker_pause(void *p, int nb_busy_workers, bool waiting) "co %p nb_busy_workers %d waiting %d"
+backup_worker_unpause(void *p) "co %p"
+backup_worker_stop(void *p, int nb_busy_workers, int running_workers, bool waiting) "co %p nb_busy_workers %d running_workers %d waiting %d"
backup_worker_got_work(void *job, void *worker, int64_t cluster) "job %p worker %p cluster %"PRId64
+backup_before_write_notify_skip(void *job, int64_t offset, int64_t bytes) "job %p offset %"PRId64" bytes %"PRId64
+backup_before_write_notify_start(void *job, int64_t offset, int64_t bytes, void *nr, int64_t start, int64_t end, int nb_wait) "job %p offset %"PRId64" bytes %"PRId64" notif %p start %"PRId64" end %"PRId64" nb_wait %d"
+backup_before_write_notify_timeout(void *job, void *nr, int64_t start, int64_t end, int nb_wait) "job %p notif %p start %"PRId64" end %"PRId64" nb_wait %d"
+backup_before_write_notify_success(void *job, void *nr, int64_t start, int64_t end, int nb_wait) "job %p notif %p start %"PRId64" end %"PRId64" nb_wait %d"
+backup_skip_cluster(void *job, int64_t cluster) "job %p cluster %"PRId64
+backup_finish_cluster_for_notif(void *job, int64_t cluster) "job %p cluster %"PRId64
 
 # blockdev.c
 qmp_block_job_cancel(void *job) "job %p"
diff --git a/blockjob.c b/blockjob.c
index 513620c..1f384c6 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -407,7 +407,7 @@ void block_job_user_pause(BlockJob *job)
     block_job_pause(job);
 }
 
-static bool block_job_should_pause(BlockJob *job)
+bool block_job_should_pause(BlockJob *job)
 {
     return job->pause_count > 0;
 }
@@ -666,6 +666,33 @@ void block_job_event_ready(BlockJob *job)
                                     job->speed, &error_abort);
 }
 
+BlockErrorAction block_job_get_error_action(BlockJob *job,
+                                            BlockdevOnError on_err, int error)
+{
+    BlockErrorAction action;
+
+    switch (on_err) {
+    case BLOCKDEV_ON_ERROR_ENOSPC:
+    case BLOCKDEV_ON_ERROR_AUTO:
+        action = (error == ENOSPC) ?
+                 BLOCK_ERROR_ACTION_STOP : BLOCK_ERROR_ACTION_REPORT;
+        break;
+    case BLOCKDEV_ON_ERROR_STOP:
+        action = BLOCK_ERROR_ACTION_STOP;
+        break;
+    case BLOCKDEV_ON_ERROR_REPORT:
+        action = BLOCK_ERROR_ACTION_REPORT;
+        break;
+    case BLOCKDEV_ON_ERROR_IGNORE:
+        action = BLOCK_ERROR_ACTION_IGNORE;
+        break;
+    default:
+        abort();
+    }
+
+    return action;
+}
+
 BlockErrorAction block_job_error_action(BlockJob *job, BlockdevOnError on_err,
                                         int is_read, int error)
 {
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 1acb256..7d24cf6 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -248,6 +248,15 @@ void block_job_user_pause(BlockJob *job);
 bool block_job_user_paused(BlockJob *job);
 
 /**
+ * block_job_should_pause:
+ * @job: The job being queried.
+ *
+ * Returns whether the job is currently paused, or will pause
+ * as soon as it reaches a sleeping point.
+ */
+bool block_job_should_pause(BlockJob *job);
+
+/**
  * block_job_resume:
  * @job: The job to be resumed.
  *
@@ -309,7 +318,16 @@ int block_job_complete_sync(BlockJob *job, Error **errp);
  */
 void block_job_iostatus_reset(BlockJob *job);
 
+/**
+ * block_job_get_error_action:
+ *
+ * Map @error to the action requested by @on_err, without emitting an
+ * event or touching the job's iostatus (unlike block_job_error_action()).
+ */
+BlockErrorAction block_job_get_error_action(BlockJob *job,
+                                            BlockdevOnError on_err, int error);
+
 /**
  * block_job_txn_new:
  *
  * Allocate and return a new block job transaction.  Jobs can be added to the
diff --git a/tests/qemu-iotests/055 b/tests/qemu-iotests/055
index 388b7b2..e15905f 100755
--- a/tests/qemu-iotests/055
+++ b/tests/qemu-iotests/055
@@ -553,4 +553,4 @@ class TestDriveCompression(iotests.QMPTestCase):
             self.do_test_compress_pause('blockdev-backup', format, target='drive1')
 
 if __name__ == '__main__':
-    iotests.main(supported_fmts=['raw', 'qcow2'])
+    iotests.main(supported_fmts=['raw', 'qcow2'], supported_cache_modes=['none'])
diff --git a/tests/qemu-iotests/129 b/tests/qemu-iotests/129
index 9e87e1c..3d5e137 100644
--- a/tests/qemu-iotests/129
+++ b/tests/qemu-iotests/129
@@ -66,8 +66,10 @@ class TestStopWithBlockJob(iotests.QMPTestCase):
         result = self.vm.qmp("stop")
         self.assert_qmp(result, 'return', {})
         result = self.vm.qmp("query-block-jobs")
-        self.assert_qmp(result, 'return[0]/busy', True)
-        self.assert_qmp(result, 'return[0]/ready', False)
+        if result['return']:
+            # the block job may already be gone; only check it while it exists
+            self.assert_qmp(result, 'return[0]/busy', True)
+            self.assert_qmp(result, 'return[0]/ready', False)
 
     def test_drive_mirror(self):
         self.do_test_stop("drive-mirror", device="drive0",
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 20/21] backup: move bitmap handling from backup_do_cow to get_work
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (17 preceding siblings ...)
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 18/21] backup: new async architecture Vladimir Sementsov-Ogievskiy
@ 2016-12-23 14:29 ` Vladimir Sementsov-Ogievskiy
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 21/21] backup: refactor: remove backup_do_cow() Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:29 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

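Move the copy_bitmap/notif_wait_bitmap manipulation from backup_do_cow()
into backup_get_work(), so that a cluster is claimed at the moment a
worker picks it up.
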
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 900bbd3..b79a481 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -383,19 +383,24 @@ static inline int64_t backup_get_work(BackupBlockJob *job)
     }
 
     cluster = next_notif_cluster(job);
-    if (cluster != -1) {
-        return cluster;
-    }
+    if (cluster == -1) {
+        if (block_job_should_pause(&job->common) ||
+            job->sync_mode == MIRROR_SYNC_MODE_NONE || check_delay(job))
+        {
+            return BACKUP_WORKER_PAUSE;
+        }
 
-    if (block_job_should_pause(&job->common) ||
-        job->sync_mode == MIRROR_SYNC_MODE_NONE || check_delay(job))
-    {
-        return BACKUP_WORKER_PAUSE;
+        cluster = hbitmap_iter_next(&job->linear_hbi);
+        if (cluster == -1) {
+            return BACKUP_WORKER_STOP;
+        }
     }
 
-    cluster = hbitmap_iter_next(&job->linear_hbi);
-    return cluster == -1 ? BACKUP_WORKER_STOP : cluster;
+    assert(hbitmap_get(job->copy_bitmap, cluster));
+    hbitmap_reset(job->copy_bitmap, cluster, 1);
+    set_bit(cluster, job->notif_wait_bitmap);
+
+    return cluster;
 }
 
 static void coroutine_fn backup_worker_co(void *opaque)
@@ -583,13 +588,6 @@ static void coroutine_fn backup_do_cow(BackupBlockJob *job, int64_t cluster)
 
     trace_backup_do_cow_enter(job, qemu_coroutine_self(), cluster);
 
-    if (!hbitmap_get(job->copy_bitmap, cluster)) {
-        trace_backup_do_cow_skip(job, cluster);
-        return; /* already copied */
-    }
-    hbitmap_reset(job->copy_bitmap, cluster, 1);
-    set_bit(cluster, job->notif_wait_bitmap);
-
     bounce_buffer = blk_blockalign(blk, job->cluster_size);
     backup_copy_cluster(job, cluster, bounce_buffer);
     qemu_vfree(bounce_buffer);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 21/21] backup: refactor: remove backup_do_cow()
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (18 preceding siblings ...)
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 20/21] backup: move bitmap handling from backup_do_cow to get_work Vladimir Sementsov-Ogievskiy
@ 2016-12-23 14:29 ` Vladimir Sementsov-Ogievskiy
  2017-01-09 11:04 ` [Qemu-devel] [PATCH 00/21] new backup architecture Stefan Hajnoczi
  2017-01-31 10:20 ` Stefan Hajnoczi
  21 siblings, 0 replies; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-23 14:29 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, vsementsov, pbonzini, jcody

Call backup_copy_cluster() directly from backup_worker_co(). Move the
I/O buffer allocation into backup_worker_co(), so the buffer is no
longer reallocated for each cluster.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c     | 28 +++++++++-------------------
 block/trace-events |  6 ++----
 2 files changed, 11 insertions(+), 23 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index b79a481..4a3a874 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -222,8 +222,6 @@ static void release_pending_notifiers(BackupBlockJob *job)
     }
 }
 
-static void coroutine_fn backup_do_cow(BackupBlockJob *job, int64_t cluster);
-
 /* Size of a cluster in sectors, instead of bytes. */
 static inline int64_t cluster_size_sectors(BackupBlockJob *job)
 {
@@ -403,9 +401,14 @@ static inline int64_t backup_get_work(BackupBlockJob *job)
     return cluster;
 }
 
+static void coroutine_fn backup_copy_cluster(BackupBlockJob *job,
+                                            int64_t cluster,
+                                            void *bounce_buffer);
+
 static void coroutine_fn backup_worker_co(void *opaque)
 {
     BackupBlockJob *job = opaque;
+    void *io_buf = blk_blockalign(job->common.blk, job->cluster_size);
 
     job->running_workers++;
     job->nb_busy_workers++;
@@ -424,12 +427,13 @@ static void coroutine_fn backup_worker_co(void *opaque)
             trace_backup_worker_stop(qemu_coroutine_self(),
                                      job->nb_busy_workers, job->running_workers,
                                      job->waiting_for_workers);
+            qemu_vfree(io_buf);
             return;
         case BACKUP_WORKER_PAUSE:
             backup_worker_pause(job);
             break;
         default:
-            backup_do_cow(job, cluster);
+            backup_copy_cluster(job, cluster, io_buf);
         }
     }
 }
@@ -546,8 +550,7 @@ static void coroutine_fn backup_copy_cluster(BackupBlockJob *job,
     int64_t sectors_per_cluster = cluster_size_sectors(job);
     int64_t offset = cluster * job->cluster_size;
 
-    trace_backup_do_cow_process(job, cluster);
-
+    trace_backup_copy_cluster_enter(job, qemu_coroutine_self(), cluster);
     n = MIN(sectors_per_cluster,
             job->common.len / BDRV_SECTOR_SIZE -
             cluster * sectors_per_cluster);
@@ -579,20 +582,7 @@ static void coroutine_fn backup_copy_cluster(BackupBlockJob *job,
      */
     job->sectors_read += n;
     job->common.offset += n * BDRV_SECTOR_SIZE;
-}
-
-static void coroutine_fn backup_do_cow(BackupBlockJob *job, int64_t cluster)
-{
-    BlockBackend *blk = job->common.blk;
-    void *bounce_buffer;
-
-    trace_backup_do_cow_enter(job, qemu_coroutine_self(), cluster);
-
-    bounce_buffer = blk_blockalign(blk, job->cluster_size);
-    backup_copy_cluster(job, cluster, bounce_buffer);
-    qemu_vfree(bounce_buffer);
-
-    trace_backup_do_cow_return(job, qemu_coroutine_self(), cluster);
+    trace_backup_copy_cluster_success(job, qemu_coroutine_self(), cluster);
 }
 
 static int coroutine_fn backup_before_write_notify(
diff --git a/block/trace-events b/block/trace-events
index 54cde28..770b407 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -40,10 +40,8 @@ mirror_yield_buf_busy(void *s, int nb_chunks, int in_flight) "s %p requested chu
 mirror_break_buf_busy(void *s, int nb_chunks, int in_flight) "s %p requested chunks %d in_flight %d"
 
 # block/backup.c
-backup_do_cow_enter(void *job, void *self, int64_t cluster) "job %p self %p cluster %"PRId64
-backup_do_cow_return(void *job, void *self, int64_t cluster) "job %p self %p cluster %"PRId64
-backup_do_cow_skip(void *job, int64_t cluster) "job %p cluster %"PRId64
-backup_do_cow_process(void *job, int64_t cluster) "job %p cluster %"PRId64
+backup_copy_cluster_enter(void *job, void *self, int64_t cluster) "job %p self %p cluster %"PRId64
+backup_copy_cluster_success(void *job, void *self, int64_t cluster) "job %p self %p cluster %"PRId64
 backup_do_read_fail(void *job, void *self, int64_t offset, unsigned bytes, int ret) "job %p self %p offset %"PRId64" bytes %u ret %d"
 backup_do_write_fail(void *job, void *self, int64_t offset, unsigned bytes, int ret) "job %p self %p offset %"PRId64" bytes %u ret %d"
 backup_job_wait_workers_start(void) ""
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 00/21] new backup architecture
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (19 preceding siblings ...)
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 21/21] backup: refactor: remove backup_do_cow() Vladimir Sementsov-Ogievskiy
@ 2017-01-09 11:04 ` Stefan Hajnoczi
  2017-01-10  6:05   ` Jeff Cody
  2017-01-31 10:20 ` Stefan Hajnoczi
  21 siblings, 1 reply; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-09 11:04 UTC (permalink / raw)
  To: jsnow, jcody
  Cc: qemu-block, qemu-devel, kwolf, mreitz, famz, den, pbonzini,
	Vladimir Sementsov-Ogievskiy

On Fri, Dec 23, 2016 at 05:28:43PM +0300, Vladimir Sementsov-Ogievskiy wrote:

Jeff or John: are you reviewing this?

> [...]


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 00/21] new backup architecture
  2017-01-09 11:04 ` [Qemu-devel] [PATCH 00/21] new backup architecture Stefan Hajnoczi
@ 2017-01-10  6:05   ` Jeff Cody
  2017-01-10 18:48     ` John Snow
  0 siblings, 1 reply; 59+ messages in thread
From: Jeff Cody @ 2017-01-10  6:05 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: jsnow, qemu-block, qemu-devel, kwolf, mreitz, famz, den,
	pbonzini, Vladimir Sementsov-Ogievskiy

On Mon, Jan 09, 2017 at 11:04:27AM +0000, Stefan Hajnoczi wrote:
> On Fri, Dec 23, 2016 at 05:28:43PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 
> Jeff or John: are you reviewing this?

It's in my review queue, but it would probably be a good one for John to
review as well if he has time.

> 
> > [...]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 00/21] new backup architecture
  2017-01-10  6:05   ` Jeff Cody
@ 2017-01-10 18:48     ` John Snow
  0 siblings, 0 replies; 59+ messages in thread
From: John Snow @ 2017-01-10 18:48 UTC (permalink / raw)
  To: Jeff Cody, Stefan Hajnoczi
  Cc: qemu-block, qemu-devel, kwolf, mreitz, famz, den, pbonzini,
	Vladimir Sementsov-Ogievskiy



On 01/10/2017 01:05 AM, Jeff Cody wrote:
> On Mon, Jan 09, 2017 at 11:04:27AM +0000, Stefan Hajnoczi wrote:
>> On Fri, Dec 23, 2016 at 05:28:43PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>
>> Jeff or John: are you reviewing this?
> 
> It's in my review queue, but it would probably be a good one for John to
> review as well if he has time.
> 

Sorry, just back from vacation. It may take me a bit to get to it.

>>
>>> [...]
> 
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 01/21] backup: move from done_bitmap to copy_bitmap
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 01/21] backup: move from done_bitmap to copy_bitmap Vladimir Sementsov-Ogievskiy
@ 2017-01-23  5:34   ` Jeff Cody
  2017-01-23 12:20   ` Vladimir Sementsov-Ogievskiy
  2017-01-31 10:25   ` Stefan Hajnoczi
  2 siblings, 0 replies; 59+ messages in thread
From: Jeff Cody @ 2017-01-23  5:34 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den,
	stefanha, pbonzini

On Fri, Dec 23, 2016 at 05:28:44PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Use HBitmap copy_bitmap instead of done_bitmap. This is needed to unify
> the backup loop for full/incremental modes in future patches.
> 
> We reset the bit of copy_bitmap immediately after checking it in
> backup_do_cow(). This is safe, because all other intersecting requests
> will wait for our request to finish anyway.
> 
> The other difference is that in case of error we will have a zeroed bit
> in copy_bitmap, where with done_bitmap we would have an unset bit.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/block/backup.c b/block/backup.c
> index ea38733..6b27e55 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -39,11 +39,12 @@ typedef struct BackupBlockJob {
>      BlockdevOnError on_target_error;
>      CoRwlock flush_rwlock;
>      uint64_t sectors_read;
> -    unsigned long *done_bitmap;
>      int64_t cluster_size;
>      bool compress;
>      NotifierWithReturn before_write;
>      QLIST_HEAD(, CowRequest) inflight_reqs;
> +
> +    HBitmap *copy_bitmap;
>  } BackupBlockJob;
>  
>  /* Size of a cluster in sectors, instead of bytes. */
> @@ -115,10 +116,11 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
>      cow_request_begin(&cow_request, job, start, end);
>  
>      for (; start < end; start++) {
> -        if (test_bit(start, job->done_bitmap)) {
> +        if (!hbitmap_get(job->copy_bitmap, start)) {
>              trace_backup_do_cow_skip(job, start);
>              continue; /* already copied */
>          }
> +        hbitmap_reset(job->copy_bitmap, start, 1);
>  
>          trace_backup_do_cow_process(job, start);
>  
> @@ -141,6 +143,7 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
>              if (error_is_read) {
>                  *error_is_read = true;
>              }
> +            hbitmap_set(job->copy_bitmap, start, 1);
>              goto out;
>          }
>  
> @@ -157,11 +160,10 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
>              if (error_is_read) {
>                  *error_is_read = false;
>              }
> +            hbitmap_set(job->copy_bitmap, start, 1);
>              goto out;
>          }
>  
> -        set_bit(start, job->done_bitmap);
> -
>          /* Publish progress, guest I/O counts as progress too.  Note that the
>           * offset field is an opaque progress value, it is not a disk offset.
>           */
> @@ -271,7 +273,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
>      }
>  
>      len = DIV_ROUND_UP(backup_job->common.len, backup_job->cluster_size);
> -    bitmap_zero(backup_job->done_bitmap, len);
> +    hbitmap_set(backup_job->copy_bitmap, 0, len);
>  }
>  
>  void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
> @@ -450,7 +452,8 @@ static void coroutine_fn backup_run(void *opaque)
>      start = 0;
>      end = DIV_ROUND_UP(job->common.len, job->cluster_size);
>  
> -    job->done_bitmap = bitmap_new(end);
> +    job->copy_bitmap = hbitmap_alloc(end, 0);
> +    hbitmap_set(job->copy_bitmap, 0, end);
>  
>      job->before_write.notify = backup_before_write_notify;
>      bdrv_add_before_write_notifier(bs, &job->before_write);
> @@ -524,7 +527,7 @@ static void coroutine_fn backup_run(void *opaque)
>      /* wait until pending backup_do_cow() calls have completed */
>      qemu_co_rwlock_wrlock(&job->flush_rwlock);
>      qemu_co_rwlock_unlock(&job->flush_rwlock);
> -    g_free(job->done_bitmap);
> +    hbitmap_free(job->copy_bitmap);
>  
>      data = g_malloc(sizeof(*data));
>      data->ret = ret;
> -- 
> 1.8.3.1
>

Reviewed-by: Jeff Cody <jcody@redhat.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 01/21] backup: move from done_bitmap to copy_bitmap
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 01/21] backup: move from done_bitmap to copy_bitmap Vladimir Sementsov-Ogievskiy
  2017-01-23  5:34   ` Jeff Cody
@ 2017-01-23 12:20   ` Vladimir Sementsov-Ogievskiy
  2017-01-31 10:25   ` Stefan Hajnoczi
  2 siblings, 0 replies; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-01-23 12:20 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: kwolf, mreitz, jsnow, famz, den, stefanha, pbonzini, jcody

23.12.2016 17:28, Vladimir Sementsov-Ogievskiy wrote:
> Use HBitmap copy_bitmap instead of done_bitmap. This is needed to unify
> the backup loop for full/incremental modes in future patches.
>
> We reset the bit of copy_bitmap immediately after checking it in
> backup_do_cow(). This is safe, because all other intersecting requests
> will wait for our request to finish anyway.
>
> The other difference is that in case of error we will have a zeroed bit
> in copy_bitmap, where with done_bitmap we would have an unset bit.

This sentence is stale. It actually refers to a mistake in my old 
(unpublished) versions, which is fixed here. In fact, in case of error I set 
the bit in copy_bitmap back, to match the old behavior with done_bitmap.

> [...]


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 02/21] backup: init copy_bitmap from sync_bitmap for incremental
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 02/21] backup: init copy_bitmap from sync_bitmap for incremental Vladimir Sementsov-Ogievskiy
@ 2017-01-24  7:09   ` Fam Zheng
  2017-01-24  9:00     ` Vladimir Sementsov-Ogievskiy
  2017-01-31 10:36   ` Stefan Hajnoczi
  1 sibling, 1 reply; 59+ messages in thread
From: Fam Zheng @ 2017-01-24  7:09 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, den, stefanha,
	pbonzini, jcody

On Fri, 12/23 17:28, Vladimir Sementsov-Ogievskiy wrote:
> We should not copy non-dirty clusters in write notifiers. So,
> initialize copy_bitmap from sync_bitmap.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/block/backup.c b/block/backup.c
> index 6b27e55..621b1c0 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -437,6 +437,34 @@ out:
>      return ret;
>  }
>  
> +/* init copy_bitmap from sync_bitmap */
> +static void backup_incremental_init_copy_bitmap(BackupBlockJob *job)
> +{
> +    int64_t sector;
> +    BdrvDirtyBitmapIter *dbi;
> +    uint32_t sect_gran =
> +        bdrv_dirty_bitmap_granularity(job->sync_bitmap) >> BDRV_SECTOR_BITS;
> +    int64_t sz = bdrv_dirty_bitmap_size(job->sync_bitmap);
> +    int64_t sectors_per_cluster = cluster_size_sectors(job);
> +    uint32_t cl_gran = MAX(1, sect_gran / sectors_per_cluster);
> +
> +    dbi = bdrv_dirty_iter_new(job->sync_bitmap, 0);
> +    while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
> +        int64_t cluster = sector / sectors_per_cluster;
> +        int64_t next_sector = (cluster + cl_gran) * sectors_per_cluster;
> +
> +        hbitmap_set(job->copy_bitmap, cluster, cl_gran);
> +
> +        if (next_sector >= sz) {
> +            break;
> +        }
> +
> +        bdrv_set_dirty_iter(dbi, (cluster + cl_gran) * sectors_per_cluster);
> +    }
> +
> +    bdrv_dirty_iter_free(dbi);
> +}
> +
>  static void coroutine_fn backup_run(void *opaque)
>  {
>      BackupBlockJob *job = opaque;
> @@ -453,20 +481,22 @@ static void coroutine_fn backup_run(void *opaque)
>      end = DIV_ROUND_UP(job->common.len, job->cluster_size);
>  
>      job->copy_bitmap = hbitmap_alloc(end, 0);
> -    hbitmap_set(job->copy_bitmap, 0, end);
>  
>      job->before_write.notify = backup_before_write_notify;
>      bdrv_add_before_write_notifier(bs, &job->before_write);
>  
>      if (job->sync_mode == MIRROR_SYNC_MODE_NONE) {
> +        hbitmap_set(job->copy_bitmap, 0, end);

This is confusing. It seems job->copy_bitmap is actually a superset of clusters
to copy instead of the exact one?  Because "none" mode doesn't need blanket copy
- only overwritten clusters are copied...

>          while (!block_job_is_cancelled(&job->common)) {
>              /* Yield until the job is cancelled.  We just let our before_write
>               * notify callback service CoW requests. */
>              block_job_yield(&job->common);
>          }
>      } else if (job->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
> +        backup_incremental_init_copy_bitmap(job);

... But here you are only marking bits in sync_bitmap.

Fam

>          ret = backup_run_incremental(job);
>      } else {
> +        hbitmap_set(job->copy_bitmap, 0, end);
>          /* Both FULL and TOP SYNC_MODE's require copying.. */
>          for (; start < end; start++) {
>              bool error_is_read;
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 03/21] backup: improve non-dirty bits progress processing
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 03/21] backup: improve non-dirty bits progress processing Vladimir Sementsov-Ogievskiy
@ 2017-01-24  7:17   ` Fam Zheng
  2017-01-24  9:12     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 59+ messages in thread
From: Fam Zheng @ 2017-01-24  7:17 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, jcody, mreitz, stefanha, pbonzini,
	den, jsnow

On Fri, 12/23 17:28, Vladimir Sementsov-Ogievskiy wrote:
> Set fake progress for non-dirty clusters in copy_bitmap initialization,
> to:
> 1. set progress in the same place where corresponding clusters are
> consumed from copy_bitmap (or not initialized, as here)
> 2. give the user progress information earlier
> 3. simplify the code
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c | 18 +++---------------
>  1 file changed, 3 insertions(+), 15 deletions(-)
> 
> diff --git a/block/backup.c b/block/backup.c
> index 621b1c0..f1f87f6 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -383,7 +383,6 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
>      int64_t sector;
>      int64_t cluster;
>      int64_t end;
> -    int64_t last_cluster = -1;
>      int64_t sectors_per_cluster = cluster_size_sectors(job);
>      BdrvDirtyBitmapIter *dbi;
>  
> @@ -395,12 +394,6 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
>      while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
>          cluster = sector / sectors_per_cluster;
>  
> -        /* Fake progress updates for any clusters we skipped */
> -        if (cluster != last_cluster + 1) {
> -            job->common.offset += ((cluster - last_cluster - 1) *
> -                                   job->cluster_size);
> -        }
> -
>          for (end = cluster + clusters_per_iter; cluster < end; cluster++) {
>              do {
>                  if (yield_and_check(job)) {
> @@ -422,14 +415,6 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
>          if (granularity < job->cluster_size) {
>              bdrv_set_dirty_iter(dbi, cluster * sectors_per_cluster);
>          }
> -
> -        last_cluster = cluster - 1;
> -    }
> -
> -    /* Play some final catchup with the progress meter */
> -    end = DIV_ROUND_UP(job->common.len, job->cluster_size);
> -    if (last_cluster + 1 < end) {
> -        job->common.offset += ((end - last_cluster - 1) * job->cluster_size);
>      }
>  
>  out:
> @@ -462,6 +447,9 @@ static void backup_incremental_init_copy_bitmap(BackupBlockJob *job)
>          bdrv_set_dirty_iter(dbi, (cluster + cl_gran) * sectors_per_cluster);
>      }
>  
> +    job->common.offset = job->common.len -
> +                         hbitmap_count(job->copy_bitmap) * job->cluster_size;
> +
>      bdrv_dirty_iter_free(dbi);
>  }

Is this effectively moving the progress reporting from cluster copying to dirty
bitmap initialization? If so the progress will stay "100%" once
backup_incremental_init_copy_bitmap returns, but the backup has merely started.
I don't think this is a good idea.

Fam

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode Vladimir Sementsov-Ogievskiy
@ 2017-01-24  7:59   ` Fam Zheng
  2017-01-24  9:18     ` Vladimir Sementsov-Ogievskiy
  2017-01-31 14:33   ` Stefan Hajnoczi
  2017-01-31 14:38   ` Stefan Hajnoczi
  2 siblings, 1 reply; 59+ messages in thread
From: Fam Zheng @ 2017-01-24  7:59 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, den, stefanha,
	pbonzini, jcody

On Fri, 12/23 17:28, Vladimir Sementsov-Ogievskiy wrote:
> diff --git a/tests/qemu-iotests/055 b/tests/qemu-iotests/055
> index 1d3fd04..388b7b2 100755
> --- a/tests/qemu-iotests/055
> +++ b/tests/qemu-iotests/055
> @@ -30,6 +30,7 @@ target_img = os.path.join(iotests.test_dir, 'target.img')
>  blockdev_target_img = os.path.join(iotests.test_dir, 'blockdev-target.img')
>  
>  image_len = 64 * 1024 * 1024 # MB
> +pause_write = '3M'
>  
>  def setUpModule():
>      qemu_img('create', '-f', iotests.imgfmt, test_img, str(image_len))
> @@ -39,6 +40,7 @@ def setUpModule():
>      qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xd5 1M 32k', test_img)
>      qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 32M 124k', test_img)
>      qemu_io('-f', iotests.imgfmt, '-c', 'write -P0x33 67043328 64k', test_img)
> +    qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 40M ' + pause_write, test_img)

What does this iotest change do?

Fam

>  
>  def tearDownModule():
>      os.remove(test_img)
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 02/21] backup: init copy_bitmap from sync_bitmap for incremental
  2017-01-24  7:09   ` Fam Zheng
@ 2017-01-24  9:00     ` Vladimir Sementsov-Ogievskiy
  2017-01-24  9:46       ` Fam Zheng
  0 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-01-24  9:00 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, den, stefanha,
	pbonzini, jcody

24.01.2017 10:09, Fam Zheng wrote:
> On Fri, 12/23 17:28, Vladimir Sementsov-Ogievskiy wrote:
>> We should not copy non-dirty clusters in write notifiers. So,
>> initialize copy_bitmap from sync_bitmap.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> [...]
>>   static void coroutine_fn backup_run(void *opaque)
>>   {
>>       BackupBlockJob *job = opaque;
>> @@ -453,20 +481,22 @@ static void coroutine_fn backup_run(void *opaque)
>>       end = DIV_ROUND_UP(job->common.len, job->cluster_size);
>>   
>>       job->copy_bitmap = hbitmap_alloc(end, 0);
>> -    hbitmap_set(job->copy_bitmap, 0, end);
>>   
>>       job->before_write.notify = backup_before_write_notify;
>>       bdrv_add_before_write_notifier(bs, &job->before_write);
>>   
>>       if (job->sync_mode == MIRROR_SYNC_MODE_NONE) {
>> +        hbitmap_set(job->copy_bitmap, 0, end);
> This is confusing. It seems job->copy_bitmap is actually a superset of clusters
> to copy instead of the exact one?  Because "none" mode doesn't need blanket copy
> - only overwritten clusters are copied...

We can't guess which clusters will eventually need copying in none mode. 
None mode works by allowing only notifier handling, with no linear copying. 
But notifier handling will copy only clusters marked in copy_bitmap, so I 
set it from 0 to end.
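
To make that concrete, here is a toy model of the per-mode copy_bitmap
initialization in this series (plain Python, not QEMU code; a set stands
in for the HBitmap and the mode names are strings for brevity):

    def init_copy_bitmap(sync_mode, nb_clusters, dirty_clusters=()):
        # incremental: copy exactly the clusters the dirty bitmap marks
        if sync_mode == 'incremental':
            return set(dirty_clusters)
        # none: any cluster may be overwritten and only the write
        # notifier consumes bits, so mark everything; full/top also
        # mark everything, since both require copying
        return set(range(nb_clusters))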

> [...]


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 03/21] backup: improve non-dirty bits progress processing
  2017-01-24  7:17   ` Fam Zheng
@ 2017-01-24  9:12     ` Vladimir Sementsov-Ogievskiy
  2017-01-31 10:56       ` Stefan Hajnoczi
  0 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-01-24  9:12 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-block, qemu-devel, kwolf, jcody, mreitz, stefanha, pbonzini,
	den, jsnow

24.01.2017 10:17, Fam Zheng wrote:
> On Fri, 12/23 17:28, Vladimir Sementsov-Ogievskiy wrote:
>> Set fake progress for non-dirty clusters in copy_bitmap initialization,
>> to:
>> 1. set progress in the same place where corresponding clusters are
>> consumed from copy_bitmap (or not initialized, as here)
>> 2. give the user progress information earlier
>> 3. simplify the code
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> [...]
>> @@ -462,6 +447,9 @@ static void backup_incremental_init_copy_bitmap(BackupBlockJob *job)
>>           bdrv_set_dirty_iter(dbi, (cluster + cl_gran) * sectors_per_cluster);
>>       }
>>   
>> +    job->common.offset = job->common.len -
>> +                         hbitmap_count(job->copy_bitmap) * job->cluster_size;
>> +
>>       bdrv_dirty_iter_free(dbi);
>>   }
> Is this effectively moving the progress reporting from cluster copying to dirty
> bitmap initialization? If so the progress will stay "100%" once
> backup_incremental_init_copy_bitmap returns, but the backup has merely started.
> I don't think this is a good idea.
>
> Fam

Progress tracking for incremental backup is currently bad too. Holes are 
bad for progress either way. To report accurate progress we should 
calculate it as (clusters _physically_ copied) / (total clusters to be 
_physically_ copied), and with the current QAPI scheme that is not 
possible. The formula may be approximated by (offset - skipped_clusters) / 
(len - skipped_clusters), where offset and len are the old good fields, 
and skipped_clusters would have to be added to query-block-jobs. With such 
an approximation it is obvious that the result is more precise if we skip 
all clusters that should be skipped as early as possible. The best place, 
of course, is initialization. That is not possible (if I understand things 
right) for full mode, which skips clusters using get_block_status - 
potentially a long operation - so notifier handling has to start before 
the skipping is finished...

Anyway, it is the job of the management tool to show a beautiful progress 
bar; qemu only provides the information for it, and with the current 
scheme the only way to report cluster skipping in the copying progress is 
through the offset field. It is not bad to provide this information 
earlier. And again, to produce a really good progress value we need at 
least one more field - skipped_clusters.
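
For illustration, a sketch of the approximation above (plain Python; 
skipped_bytes stands for the hypothetical skipped_clusters field converted 
to bytes, offset and length for the existing query-block-jobs fields):

    def backup_progress(offset, length, skipped_bytes):
        # approximate (physically copied) / (physically to copy)
        # as (offset - skipped) / (len - skipped)
        if length <= skipped_bytes:
            return 1.0  # nothing needs a physical copy
        return float(offset - skipped_bytes) / (length - skipped_bytes)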


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode
  2017-01-24  7:59   ` Fam Zheng
@ 2017-01-24  9:18     ` Vladimir Sementsov-Ogievskiy
  2017-01-24  9:36       ` Fam Zheng
  0 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-01-24  9:18 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, den, stefanha,
	pbonzini, jcody

24.01.2017 10:59, Fam Zheng wrote:
> On Fri, 12/23 17:28, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>> +    qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 40M ' + pause_write, test_img)
> What does this iotest change do?

Without this, the backup block-job finishes before the test's next query to 
it and the test fails. This is a widespread problem I keep running into 
while improving backup, and it deserves a better solution than adjusting 
the amount of data being copied...

>
> Fam
>
>>   
>>   def tearDownModule():
>>       os.remove(test_img)
>> -- 
>> 1.8.3.1
>>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode
  2017-01-24  9:18     ` Vladimir Sementsov-Ogievskiy
@ 2017-01-24  9:36       ` Fam Zheng
  2017-01-24 10:13         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 59+ messages in thread
From: Fam Zheng @ 2017-01-24  9:36 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, den, stefanha,
	pbonzini, jcody

On Tue, 01/24 12:18, Vladimir Sementsov-Ogievskiy wrote:
> 24.01.2017 10:59, Fam Zheng wrote:
> > On Fri, 12/23 17:28, Vladimir Sementsov-Ogievskiy wrote:
> > > diff --git a/tests/qemu-iotests/055 b/tests/qemu-iotests/055
> > > index 1d3fd04..388b7b2 100755
> > > --- a/tests/qemu-iotests/055
> > > +++ b/tests/qemu-iotests/055
> > > @@ -30,6 +30,7 @@ target_img = os.path.join(iotests.test_dir, 'target.img')
> > >   blockdev_target_img = os.path.join(iotests.test_dir, 'blockdev-target.img')
> > >   image_len = 64 * 1024 * 1024 # MB
> > > +pause_write = '3M'
> > >   def setUpModule():
> > >       qemu_img('create', '-f', iotests.imgfmt, test_img, str(image_len))
> > > @@ -39,6 +40,7 @@ def setUpModule():
> > >       qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xd5 1M 32k', test_img)
> > >       qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 32M 124k', test_img)
> > >       qemu_io('-f', iotests.imgfmt, '-c', 'write -P0x33 67043328 64k', test_img)
> > > +    qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 40M ' + pause_write, test_img)
> > What does this iotest change do?
> 
> Without this, the backup block job finishes before the test's next query to
> it and the test fails. This is a widespread problem I keep hitting while
> improving backup, and it deserves a better solution than adjusting the
> amount of data being copied...

You can use blkdebug to pause I/O before querying the block job, or simply
throttle it down.

Fam

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 02/21] backup: init copy_bitmap from sync_bitmap for incremental
  2017-01-24  9:00     ` Vladimir Sementsov-Ogievskiy
@ 2017-01-24  9:46       ` Fam Zheng
  2017-01-24 10:16         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 59+ messages in thread
From: Fam Zheng @ 2017-01-24  9:46 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, den, stefanha,
	pbonzini, jcody

On Tue, 01/24 12:00, Vladimir Sementsov-Ogievskiy wrote:
> > >   static void coroutine_fn backup_run(void *opaque)
> > >   {
> > >       BackupBlockJob *job = opaque;
> > > @@ -453,20 +481,22 @@ static void coroutine_fn backup_run(void *opaque)
> > >       end = DIV_ROUND_UP(job->common.len, job->cluster_size);
> > >       job->copy_bitmap = hbitmap_alloc(end, 0);
> > > -    hbitmap_set(job->copy_bitmap, 0, end);
> > >       job->before_write.notify = backup_before_write_notify;
> > >       bdrv_add_before_write_notifier(bs, &job->before_write);
> > >       if (job->sync_mode == MIRROR_SYNC_MODE_NONE) {
> > > +        hbitmap_set(job->copy_bitmap, 0, end);
> > This is confusing. It seems job->copy_bitmap is actually a superset of clusters
> > to copy instead of the exact one?  Because "none" mode doesn't need blanket copy
> > - only overwritten clusters are copied...
> 
> We can't guess which clusters will ultimately be copied in none mode. None
> mode works by allowing only notifier handling, with no linear copying. But
> notifier handling will copy only clusters marked in copy_bitmap, so I set
> it from 0 to end.

Yes, that's how I understand it too; what I dislike is the bit of
inconsistency: "allowed to copy" here versus "planned to copy" in the
other modes. I don't know how to improve that, but I think at least the
special case of none mode could be mentioned in the comment on
copy_bitmap.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode
  2017-01-24  9:36       ` Fam Zheng
@ 2017-01-24 10:13         ` Vladimir Sementsov-Ogievskiy
  2017-01-24 11:12           ` Fam Zheng
  0 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-01-24 10:13 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, den, stefanha,
	pbonzini, jcody

24.01.2017 12:36, Fam Zheng wrote:
> On Tue, 01/24 12:18, Vladimir Sementsov-Ogievskiy wrote:
>> 24.01.2017 10:59, Fam Zheng wrote:
>>> On Fri, 12/23 17:28, Vladimir Sementsov-Ogievskiy wrote:
>>>> diff --git a/tests/qemu-iotests/055 b/tests/qemu-iotests/055
>>>> index 1d3fd04..388b7b2 100755
>>>> --- a/tests/qemu-iotests/055
>>>> +++ b/tests/qemu-iotests/055
>>>> @@ -30,6 +30,7 @@ target_img = os.path.join(iotests.test_dir, 'target.img')
>>>>    blockdev_target_img = os.path.join(iotests.test_dir, 'blockdev-target.img')
>>>>    image_len = 64 * 1024 * 1024 # MB
>>>> +pause_write = '3M'
>>>>    def setUpModule():
>>>>        qemu_img('create', '-f', iotests.imgfmt, test_img, str(image_len))
>>>> @@ -39,6 +40,7 @@ def setUpModule():
>>>>        qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xd5 1M 32k', test_img)
>>>>        qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 32M 124k', test_img)
>>>>        qemu_io('-f', iotests.imgfmt, '-c', 'write -P0x33 67043328 64k', test_img)
>>>> +    qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 40M ' + pause_write, test_img)
>>> What does this iotest change do?
>> Without this, the backup block job finishes before the test's next query
>> to it and the test fails. This is a widespread problem I keep hitting
>> while improving backup, and it deserves a better solution than adjusting
>> the amount of data being copied...
> You can use blkdebug to pause I/O before querying the block job, or simply
> throttle it down.
>
> Fam

Throttling down is no better, I think, than just copying an additional
3M; it is an environment-dependent hack too. And we can't use throttling
in all tests: throttling itself should be tested separately.

In another thread there was an idea to not delete block jobs when they
finish, but to keep them in some 'finished' state. That would be great
for this case and would be generic.

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 02/21] backup: init copy_bitmap from sync_bitmap for incremental
  2017-01-24  9:46       ` Fam Zheng
@ 2017-01-24 10:16         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-01-24 10:16 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, den, stefanha,
	pbonzini, jcody

24.01.2017 12:46, Fam Zheng wrote:
> On Tue, 01/24 12:00, Vladimir Sementsov-Ogievskiy wrote:
>>>>    static void coroutine_fn backup_run(void *opaque)
>>>>    {
>>>>        BackupBlockJob *job = opaque;
>>>> @@ -453,20 +481,22 @@ static void coroutine_fn backup_run(void *opaque)
>>>>        end = DIV_ROUND_UP(job->common.len, job->cluster_size);
>>>>        job->copy_bitmap = hbitmap_alloc(end, 0);
>>>> -    hbitmap_set(job->copy_bitmap, 0, end);
>>>>        job->before_write.notify = backup_before_write_notify;
>>>>        bdrv_add_before_write_notifier(bs, &job->before_write);
>>>>        if (job->sync_mode == MIRROR_SYNC_MODE_NONE) {
>>>> +        hbitmap_set(job->copy_bitmap, 0, end);
>>> This is confusing. It seems job->copy_bitmap is actually a superset of clusters
>>> to copy instead of the exact one?  Because "none" mode doesn't need blanket copy
>>> - only overwritten clusters are copied...
>> We can't guess which clusters will ultimately be copied in none mode. None
>> mode works by allowing only notifier handling, with no linear copying. But
>> notifier handling will copy only clusters marked in copy_bitmap, so I set
>> it from 0 to end.
> Yes, that's how I understand it too; what I dislike is the bit of
> inconsistency: "allowed to copy" here versus "planned to copy" in the
> other modes. I don't know how to improve that, but I think at least the
> special case of none mode could be mentioned in the comment on
> copy_bitmap.

OK, a comment would be good. I'll add it.
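
Something along these lines, perhaps (just a sketch of what the comment
could say):

/* Clusters that still need to be copied. For sync=none the meaning is
 * rather "clusters a write notifier is allowed to copy": the bitmap is
 * fully set at start, because there is no linear copying phase and we
 * cannot know in advance which clusters the guest will overwrite. */
HBitmap *copy_bitmap;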


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode
  2017-01-24 10:13         ` Vladimir Sementsov-Ogievskiy
@ 2017-01-24 11:12           ` Fam Zheng
  0 siblings, 0 replies; 59+ messages in thread
From: Fam Zheng @ 2017-01-24 11:12 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, den, stefanha,
	pbonzini, jcody

On Tue, 01/24 13:13, Vladimir Sementsov-Ogievskiy wrote:
> 24.01.2017 12:36, Fam Zheng wrote:
> > On Tue, 01/24 12:18, Vladimir Sementsov-Ogievskiy wrote:
> > > 24.01.2017 10:59, Fam Zheng wrote:
> > > > On Fri, 12/23 17:28, Vladimir Sementsov-Ogievskiy wrote:
> > > > > diff --git a/tests/qemu-iotests/055 b/tests/qemu-iotests/055
> > > > > index 1d3fd04..388b7b2 100755
> > > > > --- a/tests/qemu-iotests/055
> > > > > +++ b/tests/qemu-iotests/055
> > > > > @@ -30,6 +30,7 @@ target_img = os.path.join(iotests.test_dir, 'target.img')
> > > > >    blockdev_target_img = os.path.join(iotests.test_dir, 'blockdev-target.img')
> > > > >    image_len = 64 * 1024 * 1024 # MB
> > > > > +pause_write = '3M'
> > > > >    def setUpModule():
> > > > >        qemu_img('create', '-f', iotests.imgfmt, test_img, str(image_len))
> > > > > @@ -39,6 +40,7 @@ def setUpModule():
> > > > >        qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xd5 1M 32k', test_img)
> > > > >        qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 32M 124k', test_img)
> > > > >        qemu_io('-f', iotests.imgfmt, '-c', 'write -P0x33 67043328 64k', test_img)
> > > > > +    qemu_io('-f', iotests.imgfmt, '-c', 'write -P0xdc 40M ' + pause_write, test_img)
> > > > What does this iotest change do?
> > > Without this, the backup block job finishes before the test's next query
> > > to it and the test fails. This is a widespread problem I keep hitting
> > > while improving backup, and it deserves a better solution than adjusting
> > > the amount of data being copied...
> > You can use blkdebug to pause I/O before querying the block job, or simply
> > throttle it down.
> > 
> > Fam
> 
> Throttling down is no better, I think, than just copying an additional 3M;
> it is an environment-dependent hack too. And we can't use throttling in
> all tests: throttling itself should be tested separately.

Then maybe opt for blkdebug?

I didn't mean the speed parameter of block jobs, but the
block_set_io_throttle command. It is not perfect, but still better than
simply enlarging the data size, and slightly simpler than blkdebug
pause/resume.

Fam

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 00/21] new backup architecture
  2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
                   ` (20 preceding siblings ...)
  2017-01-09 11:04 ` [Qemu-devel] [PATCH 00/21] new backup architecture Stefan Hajnoczi
@ 2017-01-31 10:20 ` Stefan Hajnoczi
  21 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 10:20 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:43PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> This is a new architecture for backup. It solves some current problems:
> 1. intersecting requests: for now at request start we wait for all intersecting requests, which means that
>     a. we may wait even for unrelated to our request clusters
>     b. not full async: if we are going to copy clusters 1,2,3,4, when 2 and 4 are in flight, why should we wait for 2 and 4 to be fully copied? Why not to start 1 and 3 in parallel with 2 and 4?
> 
> 2. notifier request is internally synchronous: if notifier starts copying clusters 1,2,3,4, they will be copied one by one in synchronous loop.
> 
> 3. notifier wait full copying of corresponding clusters (when actually it may wait only for _read_ operations to be finished)

Please include benchmark results since this is a performance
optimization.  I think this new level of complexity is worth it because
it should be possible to achieve significantly higher throughput, but
data is still necessary.

The cover letter mentions spawning 24 coroutines.  Did you compare the
memory footprint against the old backup architecture?  Sometimes users
complain when they notice QEMU using significantly more memory than in
previous versions.  If there's a good justification or a way to minimize
the impact then it's fine, but please check.

Stefan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 01/21] backup: move from done_bitmap to copy_bitmap
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 01/21] backup: move from done_bitmap to copy_bitmap Vladimir Sementsov-Ogievskiy
  2017-01-23  5:34   ` Jeff Cody
  2017-01-23 12:20   ` Vladimir Sementsov-Ogievskiy
@ 2017-01-31 10:25   ` Stefan Hajnoczi
  2 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 10:25 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:44PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Use HBitmap copy_bitmap instead of done_bitmap. This is needed to unify
> the backup loop for full/incremental modes in future patches.
> 
> We reset the copy_bitmap bit immediately after checking it in
> backup_do_cow(). This is safe, because all other intersecting requests
> will wait for our request to finish anyway.
> 
> The other difference is that in case of error the bit will have been
> cleared in copy_bitmap, whereas in done_bitmap it would simply never
> have been set.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 02/21] backup: init copy_bitmap from sync_bitmap for incremental
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 02/21] backup: init copy_bitmap from sync_bitmap for incremental Vladimir Sementsov-Ogievskiy
  2017-01-24  7:09   ` Fam Zheng
@ 2017-01-31 10:36   ` Stefan Hajnoczi
  1 sibling, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 10:36 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> +/* init copy_bitmap from sync_bitmap */
> +static void backup_incremental_init_copy_bitmap(BackupBlockJob *job)
> +{
> +    int64_t sector;
> +    BdrvDirtyBitmapIter *dbi;
> +    uint32_t sect_gran =
> +        bdrv_dirty_bitmap_granularity(job->sync_bitmap) >> BDRV_SECTOR_BITS;
> +    int64_t sz = bdrv_dirty_bitmap_size(job->sync_bitmap);
> +    int64_t sectors_per_cluster = cluster_size_sectors(job);
> +    uint32_t cl_gran = MAX(1, sect_gran / sectors_per_cluster);
> +
> +    dbi = bdrv_dirty_iter_new(job->sync_bitmap, 0);
> +    while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
> +        int64_t cluster = sector / sectors_per_cluster;
> +        int64_t next_sector = (cluster + cl_gran) * sectors_per_cluster;
> +
> +        hbitmap_set(job->copy_bitmap, cluster, cl_gran);
> +
> +        if (next_sector >= sz) {
> +            break;
> +        }
> +
> +        bdrv_set_dirty_iter(dbi, (cluster + cl_gran) * sectors_per_cluster);

Simpler and clearer:
bdrv_set_dirty_iter(dbi, next_sector);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 03/21] backup: improve non-dirty bits progress processing
  2017-01-24  9:12     ` Vladimir Sementsov-Ogievskiy
@ 2017-01-31 10:56       ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 10:56 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Fam Zheng, qemu-block, qemu-devel, kwolf, jcody, mreitz,
	pbonzini, den, jsnow

On Tue, Jan 24, 2017 at 12:12:47PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 24.01.2017 10:17, Fam Zheng wrote:
> > On Fri, 12/23 17:28, Vladimir Sementsov-Ogievskiy wrote:
> > > Set fake progress for non-dirty clusters in copy_bitmap initialization,
> > > to:
> > > 1. set progress in the same place where corresponding clusters are
> > > consumed from copy_bitmap (or not initialized, as here)
> > > 2. earlier progress information for user
> > > 3. simplify the code
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > ---
> > >   block/backup.c | 18 +++---------------
> > >   1 file changed, 3 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/block/backup.c b/block/backup.c
> > > index 621b1c0..f1f87f6 100644
> > > --- a/block/backup.c
> > > +++ b/block/backup.c
> > > @@ -383,7 +383,6 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
> > >       int64_t sector;
> > >       int64_t cluster;
> > >       int64_t end;
> > > -    int64_t last_cluster = -1;
> > >       int64_t sectors_per_cluster = cluster_size_sectors(job);
> > >       BdrvDirtyBitmapIter *dbi;
> > > @@ -395,12 +394,6 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
> > >       while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
> > >           cluster = sector / sectors_per_cluster;
> > > -        /* Fake progress updates for any clusters we skipped */
> > > -        if (cluster != last_cluster + 1) {
> > > -            job->common.offset += ((cluster - last_cluster - 1) *
> > > -                                   job->cluster_size);
> > > -        }
> > > -
> > >           for (end = cluster + clusters_per_iter; cluster < end; cluster++) {
> > >               do {
> > >                   if (yield_and_check(job)) {
> > > @@ -422,14 +415,6 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
> > >           if (granularity < job->cluster_size) {
> > >               bdrv_set_dirty_iter(dbi, cluster * sectors_per_cluster);
> > >           }
> > > -
> > > -        last_cluster = cluster - 1;
> > > -    }
> > > -
> > > -    /* Play some final catchup with the progress meter */
> > > -    end = DIV_ROUND_UP(job->common.len, job->cluster_size);
> > > -    if (last_cluster + 1 < end) {
> > > -        job->common.offset += ((end - last_cluster - 1) * job->cluster_size);
> > >       }
> > >   out:
> > > @@ -462,6 +447,9 @@ static void backup_incremental_init_copy_bitmap(BackupBlockJob *job)
> > >           bdrv_set_dirty_iter(dbi, (cluster + cl_gran) * sectors_per_cluster);
> > >       }
> > > +    job->common.offset = job->common.len -
> > > +                         hbitmap_count(job->copy_bitmap) * job->cluster_size;
> > > +
> > >       bdrv_dirty_iter_free(dbi);
> > >   }
> > Is this effectively moving the progress reporting from cluster copying to dirty
> > bitmap initialization? If so the progress will stay "100%" once
> > backup_incremental_init_copy_bitmap returns, but the backup has merely started.
> > I don't think this is a good idea.
> > 
> > Fam
> 
> Progress tracking for incremental backup is currently bad too; holes are
> bad for progress either way. To report good progress we should calculate
> it as (clusters _physically_ copied) / (total clusters to be _physically_
> copied), and with the current QAPI scheme that is not possible. The
> formula may be approximated by (offset - skipped_clusters) / (len -
> skipped_clusters), where offset and len are the good old fields, and
> skipped_clusters would have to be added to query-block-jobs. With such an
> approximation it is obvious that the result will be more precise if we
> skip all skippable clusters as early as possible. The best place, of
> course, is initialization. That is not possible (if I understand things
> right) for full mode, which skips clusters using get_block_status, a
> potentially long operation, so notifier handling must start before the
> skipping pass has finished...
> 
> Anyway, it is the management tool's job to show beautiful progress; QEMU
> only provides the information for it, and with the current scheme the only
> way to convey cluster skipping in the copying progress is through the
> offset field. Providing that information earlier is not a bad thing. And
> again, to produce really beautiful progress we need at least one more
> field: skipped_clusters.

If you want to change the semantics of the progress indicator, please
send it as a separate series.

I'm not convinced by this patch but it shouldn't block the rest of
the series from being merged.

Stefan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 04/21] backup: use copy_bitmap in incremental backup
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 04/21] backup: use copy_bitmap in incremental backup Vladimir Sementsov-Ogievskiy
@ 2017-01-31 11:01   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 11:01 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:47PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We can use copy_bitmap instead of sync_bitmap. copy_bitmap is
> initialized from sync_bitmap and it is more informative: we will not try
> to process data, that is already in progress (by write notifier).
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c | 58 +++++++++++++++++++---------------------------------------
>  1 file changed, 19 insertions(+), 39 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 05/21] hbitmap: improve dirty iter
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 05/21] hbitmap: improve dirty iter Vladimir Sementsov-Ogievskiy
@ 2017-01-31 11:20   ` Stefan Hajnoczi
  2017-01-31 11:29   ` Stefan Hajnoczi
  1 sibling, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 11:20 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:48PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Make the dirty iter resistant to resetting bits in the corresponding HBitmap.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/qemu/hbitmap.h | 24 ++----------------------
>  util/hbitmap.c         | 23 ++++++++++++++++++++++-
>  2 files changed, 24 insertions(+), 23 deletions(-)

Please add a test case to tests/test-hbitmap.c.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 05/21] hbitmap: improve dirty iter
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 05/21] hbitmap: improve dirty iter Vladimir Sementsov-Ogievskiy
  2017-01-31 11:20   ` Stefan Hajnoczi
@ 2017-01-31 11:29   ` Stefan Hajnoczi
  1 sibling, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 11:29 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:48PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Make the dirty iter resistant to resetting bits in the corresponding HBitmap.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/qemu/hbitmap.h | 24 ++----------------------
>  util/hbitmap.c         | 23 ++++++++++++++++++++++-
>  2 files changed, 24 insertions(+), 23 deletions(-)
> 
> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
> index eb46475..2873a46 100644
> --- a/include/qemu/hbitmap.h
> +++ b/include/qemu/hbitmap.h
> @@ -243,10 +243,7 @@ void hbitmap_free(HBitmap *hb);
>   * the lowest-numbered bit that is set in @hb, starting at @first.
>   *
>   * Concurrent setting of bits is acceptable, and will at worst cause the
> - * iteration to miss some of those bits.  Resetting bits before the current
> - * position of the iterator is also okay.  However, concurrent resetting of
> - * bits can lead to unexpected behavior if the iterator has not yet reached
> - * those bits.
> + * iteration to miss some of those bits.  Resetting bits is also okay.
>   */
>  void hbitmap_iter_init(HBitmapIter *hbi, const HBitmap *hb, uint64_t first);

Do you need to modify hbitmap_iter_next_word() as well?  Otherwise it
uses the hbi->cur[] cached information and does not handle clearing bits
ahead of the iterator.
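
Roughly, the hazard looks like this (a sketch, restating the above):

hbitmap_iter_init(&iter, hb, 0);      /* caches a word containing bit N */
hbitmap_reset(hb, N, 1);              /* clear bit N ahead of the iterator */
hbitmap_iter_next_word(&iter, &cur);  /* may still report bit N as set */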

Stefan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 06/21] backup: rewrite top mode cluster skipping
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 06/21] backup: rewrite top mode cluster skipping Vladimir Sementsov-Ogievskiy
@ 2017-01-31 13:31   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 13:31 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:49PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> @@ -374,6 +379,101 @@ static bool coroutine_fn yield_and_check(BackupBlockJob *job)
>      return false;
>  }
>  
> +static void backup_skip_clusters(BackupBlockJob *job,
> +                                 int64_t start, int64_t end)

Missing coroutine_fn to document that this function must be called from
coroutine context.

> +{
> +    CowRequest cow_request;
> +
> +    wait_for_overlapping_requests(job, start, end);
> +    cow_request_begin(&cow_request, job, start, end);
> +
> +    if (end * job->cluster_size > job->common.len) {
> +        int64_t n;
> +        end--;
> +        n = job->common.len - end * job->cluster_size;
> +        assert(n > 0);
> +
> +        if (hbitmap_get(job->copy_bitmap, end)) {
> +            hbitmap_reset(job->copy_bitmap, end, 1);
> +            job->common.offset += n;
> +        }
> +    }
> +
> +    for ( ; start < end; start++) {
> +        if (!hbitmap_get(job->copy_bitmap, start)) {
> +            continue;
> +        }
> +
> +        hbitmap_reset(job->copy_bitmap, start, 1);
> +        job->common.offset += job->cluster_size;
> +    }
> +
> +    cow_request_end(&cow_request);
> +}
> +
> +static int backup_skip_unallocated_clusters(BackupBlockJob *job,
> +                                            BlockDriverState *base,
> +                                            int64_t start, int *n)

Missing coroutine_fn.

> +{
> +    int ret;
> +    int64_t sectors_per_cluster = cluster_size_sectors(job);
> +    BlockDriverState *bs = blk_bs(job->common.blk);
> +    int64_t sector_end = job->common.len >> BDRV_SECTOR_BITS;
> +    int64_t sector = start * sectors_per_cluster;
> +    int max_sectors = MIN(max_query_sectors(job), sector_end - sector);
> +    int n_sectors = 0;
> +
> +    ret = bdrv_is_allocated_above(bs, base, sector, max_sectors, &n_sectors);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (sector + n_sectors == sector_end || ret == 1) {
> +        *n = DIV_ROUND_UP(n_sectors, sectors_per_cluster);
> +    } else if (n_sectors < sectors_per_cluster) {
> +        *n = 1;
> +        ret = 1;
> +    } else {
> +        *n = n_sectors / sectors_per_cluster;
> +    }
> +
> +    if (ret == 0) {
> +        backup_skip_clusters(job, start, start + *n);
> +    }
> +
> +    return 0;
> +}
> +
> +static void backup_skip_loop(BackupBlockJob *job, BlockDriverState *base)

Missing coroutine_fn.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 07/21] backup: refactor: merge top/full/incremental backup code
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 07/21] backup: refactor: merge top/full/incremental backup code Vladimir Sementsov-Ogievskiy
@ 2017-01-31 14:26   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 14:26 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:50PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Merge top/full/incremental modes backup into one backup_loop.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c | 41 +++++++++--------------------------------
>  1 file changed, 9 insertions(+), 32 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode Vladimir Sementsov-Ogievskiy
  2017-01-24  7:59   ` Fam Zheng
@ 2017-01-31 14:33   ` Stefan Hajnoczi
  2017-01-31 14:38   ` Stefan Hajnoczi
  2 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 14:33 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:51PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> In case of full backup we can skip unallocated clusters if the target
> disk is already zero-initialized.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c         | 8 ++++++--
>  tests/qemu-iotests/055 | 2 ++
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/block/backup.c b/block/backup.c
> index 2afd1b6..4ef8daf 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -562,9 +562,13 @@ static void coroutine_fn backup_run(void *opaque)
>          if (job->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>              backup_incremental_init_copy_bitmap(job);
>          } else {
> +            /* top or full mode */
> +            bool is_top = job->sync_mode == MIRROR_SYNC_MODE_TOP;
> +            BlockDriverState *base =
> +                    is_top ? backing_bs(blk_bs(job->common.blk)) : NULL;
>              hbitmap_set(job->copy_bitmap, 0, end);
> -            if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
> -                backup_skip_loop(job, backing_bs(blk_bs(job->common.blk)));
> +            if (is_top || bdrv_has_zero_init(blk_bs(job->target))) {
> +                backup_skip_loop(job, base);
>              }

An alternative that I find easier to read:

hbitmap_set(job->copy_bitmap, 0, end);
if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
    backup_skip_loop(job, backing_bs(blk_bs(job->common.blk)));
} else if (bdrv_has_zero_init(blk_bs(job->target))) {
    backup_skip_loop(job, NULL);
}

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode Vladimir Sementsov-Ogievskiy
  2017-01-24  7:59   ` Fam Zheng
  2017-01-31 14:33   ` Stefan Hajnoczi
@ 2017-01-31 14:38   ` Stefan Hajnoczi
  2 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 14:38 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:51PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> +            if (is_top || bdrv_has_zero_init(blk_bs(job->target))) {

  /*
   * Returns 1 if newly created images are guaranteed to contain
   * only zeros, 0 otherwise.
   */
  int (*bdrv_has_zero_init)(BlockDriverState *bs);

This function does not cover arbitrary images opened with bdrv_open(),
only those created with bdrv_img_create() immediately before opening.

This means that the check is only valid with backup->mode !=
NEW_IMAGE_MODE_EXISTING.
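
One way to honor that (a sketch; target_is_new is a hypothetical flag
that the QMP layer would set only when it created the target image
itself):

if (is_top || (job->target_is_new &&
               bdrv_has_zero_init(blk_bs(job->target)))) {
    backup_skip_loop(job, base);
}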

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 09/21] backup: separate copy function
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 09/21] backup: separate copy function Vladimir Sementsov-Ogievskiy
@ 2017-01-31 14:40   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 14:40 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:52PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c | 103 ++++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 61 insertions(+), 42 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 10/21] backup: refactor backup_copy_cluster()
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 10/21] backup: refactor backup_copy_cluster() Vladimir Sementsov-Ogievskiy
@ 2017-01-31 14:57   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 14:57 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:53PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>     Split out read and write functions.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c     | 53 +++++++++++++++++++++++++++++++++++++++--------------
>  block/trace-events |  4 ++--
>  2 files changed, 41 insertions(+), 16 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 11/21] backup: move r/w error handling code to r/w functions
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 11/21] backup: move r/w error handling code to r/w functions Vladimir Sementsov-Ogievskiy
@ 2017-01-31 14:57   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 14:57 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:54PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> It simplifies the code: we no longer need the error_is_read parameter or
> retrying in backup_loop. Also, retries for read and write are separate, so
> we will not re-read if a write fails after a successful read.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c | 141 ++++++++++++++++++++++++++++++---------------------------
>  1 file changed, 74 insertions(+), 67 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 12/21] iotests: add supported_cache_modes to main function
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 12/21] iotests: add supported_cache_modes to main function Vladimir Sementsov-Ogievskiy
@ 2017-01-31 14:58   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 14:58 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:55PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  tests/qemu-iotests/iotests.py | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 13/21] coroutine: add qemu_coroutine_add_next
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 13/21] coroutine: add qemu_coroutine_add_next Vladimir Sementsov-Ogievskiy
@ 2017-01-31 15:03   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 15:03 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:56PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Simply add the coroutine to self->co_queue_wakeup.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/qemu/coroutine.h | 2 ++
>  util/qemu-coroutine.c    | 7 +++++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
> index e6a60d5..6e87c87 100644
> --- a/include/qemu/coroutine.h
> +++ b/include/qemu/coroutine.h
> @@ -226,4 +226,6 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
>   */
>  void coroutine_fn yield_until_fd_readable(int fd);
>  
> +void qemu_coroutine_add_next(Coroutine *next);

Missing coroutine_fn.

Please add a doc comment and a testcase in tests/test-coroutine.c.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 14/21] block: add trace point on bdrv_close_all
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 14/21] block: add trace point on bdrv_close_all Vladimir Sementsov-Ogievskiy
@ 2017-01-31 15:03   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 15:03 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:57PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block.c            | 1 +
>  block/trace-events | 1 +
>  2 files changed, 2 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 15/21] bitmap: add bitmap_count_between() function
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 15/21] bitmap: add bitmap_count_between() function Vladimir Sementsov-Ogievskiy
@ 2017-01-31 15:15   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 15:15 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Reviewed-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> ---
>  include/qemu/bitmap.h |  4 ++++
>  util/bitmap.c         | 27 +++++++++++++++++++++++++++
>  2 files changed, 31 insertions(+)
> 
> diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
> index 63ea2d0..3bfc33e 100644
> --- a/include/qemu/bitmap.h
> +++ b/include/qemu/bitmap.h
> @@ -216,6 +216,10 @@ static inline int bitmap_intersects(const unsigned long *src1,
>      }
>  }
>  
> +unsigned long bitmap_count_between(const unsigned long *src,
> +                                   unsigned long start,
> +                                   unsigned long end);

Please add this to the table of functions documented at the top of the
header file.

> +
>  void bitmap_set(unsigned long *map, long i, long len);
>  void bitmap_set_atomic(unsigned long *map, long i, long len);
>  void bitmap_clear(unsigned long *map, long start, long nr);
> diff --git a/util/bitmap.c b/util/bitmap.c
> index 43ed011..e5aaa1c 100644
> --- a/util/bitmap.c
> +++ b/util/bitmap.c
> @@ -336,3 +336,30 @@ int slow_bitmap_intersects(const unsigned long *bitmap1,
>      }
>      return 0;
>  }
> +
> +unsigned long bitmap_count_between(const unsigned long *src,
> +                                   unsigned long start,
> +                                   unsigned long end)

Most of the other functions use long start and long nr.  Please follow
that convention.
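
For instance (a sketch following that convention):

unsigned long bitmap_count_between(const unsigned long *src,
                                   long start, long nr);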

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 16/21] hbitmap: add hbitmap_count_between() function
  2016-12-23 14:28 ` [Qemu-devel] [PATCH 16/21] hbitmap: add hbitmap_count_between() function Vladimir Sementsov-Ogievskiy
@ 2017-01-31 15:56   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 15:56 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:28:59PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Add this function only for HBitmaps with granularity = 0.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Reviewed-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> ---
>  include/qemu/hbitmap.h | 2 ++
>  util/hbitmap.c         | 9 +++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
> index 2873a46..1967e13 100644
> --- a/include/qemu/hbitmap.h
> +++ b/include/qemu/hbitmap.h
> @@ -314,4 +314,6 @@ static inline size_t hbitmap_iter_next_word(HBitmapIter *hbi, unsigned long *p_c
>  }
>  
>  
> +uint64_t hbitmap_count_between(HBitmap *hb, uint64_t start, uint64_t end);

Missing doc comments.
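
Something like this would do (a sketch, in the style of the other
comments in this header):

/**
 * hbitmap_count_between:
 * @hb: HBitmap to operate on; its granularity must be 0.
 * @start: First bit of the range (inclusive).
 * @end: End of the range (exclusive).
 *
 * Return the number of set bits in [@start, @end).
 */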

> +
>  #endif
> diff --git a/util/hbitmap.c b/util/hbitmap.c
> index 48d8b2d..d06bfa3 100644
> --- a/util/hbitmap.c
> +++ b/util/hbitmap.c
> @@ -232,6 +232,15 @@ static uint64_t hb_count_between(HBitmap *hb, uint64_t start, uint64_t last)
>      return count;
>  }
>  
> +/* hbitmap_count_between() is only for HBitmap's with granularity = 0 */
> +uint64_t hbitmap_count_between(HBitmap *hb, uint64_t start, uint64_t end)
> +{
> +    assert(hb->granularity == 0);
> +    assert(start < end);
> +
> +    return hb_count_between(hb, start, end - 1);
> +}

Please rename hb_count_between() and make it externally visible instead
of adding a similarly-named wrapper.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 17/21] backup: make all reads not serializing
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 17/21] backup: make all reads not serializing Vladimir Sementsov-Ogievskiy
@ 2017-01-31 16:30   ` Stefan Hajnoczi
  0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 16:30 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:29:00PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> To simplify things make all reads not serializing, not only from
> notifiers. This is needed because after the following patch, there would
> not be strong division between reads from notifiers or not - they all
> would be called from one place.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c | 20 ++++++++------------
>  1 file changed, 8 insertions(+), 12 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 18/21] backup: new async architecture
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 18/21] backup: new async architecture Vladimir Sementsov-Ogievskiy
@ 2017-01-31 16:46   ` Stefan Hajnoczi
  2017-02-01 16:13   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  1 sibling, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-01-31 16:46 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, mreitz, jsnow, famz, den, pbonzini, jcody

On Fri, Dec 23, 2016 at 05:29:01PM +0300, Vladimir Sementsov-Ogievskiy wrote:

Just a few comments.  I need to resume review of this series tomorrow.

> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
> index 1acb256..7d24cf6 100644
> --- a/include/block/blockjob.h
> +++ b/include/block/blockjob.h
> @@ -248,6 +248,15 @@ void block_job_user_pause(BlockJob *job);
>  bool block_job_user_paused(BlockJob *job);
>  
>  /**
> + * block_job_should_pause:
> + * @job: The job being queried.
> + *
> + * Returns whether the job is currently paused, or will pause
> + * as soon as it reaches a sleeping point.

s/sleeping point/pause point/

> + */
> +bool block_job_should_pause(BlockJob *job);
> +
> +/**
>   * block_job_resume:
>   * @job: The job to be resumed.
>   *
> @@ -309,7 +318,11 @@ int block_job_complete_sync(BlockJob *job, Error **errp);
>   */
>  void block_job_iostatus_reset(BlockJob *job);
>  
> -/**
> +
> +BlockErrorAction block_job_get_error_action(BlockJob *job,
> +                                            BlockdevOnError on_err, int error);

Missing doc comment.

> +
> +/*
>   * block_job_txn_new:
>   *
>   * Allocate and return a new block job transaction.  Jobs can be added to the
> diff --git a/tests/qemu-iotests/055 b/tests/qemu-iotests/055
> index 388b7b2..e15905f 100755
> --- a/tests/qemu-iotests/055
> +++ b/tests/qemu-iotests/055
> @@ -553,4 +553,4 @@ class TestDriveCompression(iotests.QMPTestCase):
>              self.do_test_compress_pause('blockdev-backup', format, target='drive1')
>  
>  if __name__ == '__main__':
> -    iotests.main(supported_fmts=['raw', 'qcow2'])
> +    iotests.main(supported_fmts=['raw', 'qcow2'], supported_cache_modes=['none'])

The test has no real dependency on cache=none.  Did you just add this to
influence performance in the hope that the test runs more
deterministically?

> diff --git a/tests/qemu-iotests/129 b/tests/qemu-iotests/129
> index 9e87e1c..3d5e137 100644
> --- a/tests/qemu-iotests/129
> +++ b/tests/qemu-iotests/129
> @@ -66,8 +66,10 @@ class TestStopWithBlockJob(iotests.QMPTestCase):
>          result = self.vm.qmp("stop")
>          self.assert_qmp(result, 'return', {})
>          result = self.vm.qmp("query-block-jobs")
> -        self.assert_qmp(result, 'return[0]/busy', True)
> -        self.assert_qmp(result, 'return[0]/ready', False)
> +        if result['return']:
> +            # make additional check if block job is not released yet
> +            self.assert_qmp(result, 'return[0]/busy', True)
> +            self.assert_qmp(result, 'return[0]/ready', False)

This is silencing a failure rather than improving the test case.  Why
does the test case fail for you?  The rate limit is 1 KB/s, so the job
should have no chance of completing.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 18/21] backup: new async architecture
  2016-12-23 14:29 ` [Qemu-devel] [PATCH 18/21] backup: new async architecture Vladimir Sementsov-Ogievskiy
  2017-01-31 16:46   ` Stefan Hajnoczi
@ 2017-02-01 16:13   ` Stefan Hajnoczi
  1 sibling, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2017-02-01 16:13 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, qemu-devel, kwolf, famz, mreitz, stefanha, pbonzini, den

On Fri, Dec 23, 2016 at 05:29:01PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> @@ -120,11 +257,42 @@ static bool coroutine_fn yield_and_check(BackupBlockJob *job)
>          uint64_t delay_ns = ratelimit_calculate_delay(&job->limit,
>                                                        job->sectors_read);
>          job->sectors_read = 0;
> +        job->delayed = true;
> +        trace_backup_sleep_delay(block_job_is_cancelled(&job->common),
> +                                 block_job_should_pause(&job->common));
>          block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, delay_ns);
> +        job->delayed = false;
>      } else {
> -        block_job_sleep_ns(&job->common, QEMU_CLOCK_REALTIME, 0);
> +        trace_backup_sleep_zero(block_job_is_cancelled(&job->common),
> +                                block_job_should_pause(&job->common));
> +        block_job_pause_point(&job->common);
> +    }
> +
> +    trace_backup_sleep_finish();
> +}
> +
> +/* backup_busy_sleep
> + * Just yield, without setting busy=false

How does this interact with block_job_detach_aio_context() and
block_job_drain()?  Jobs must reach pause points regularly so that
cancellation and detaching AioContexts work.

> @@ -132,69 +300,246 @@ static bool coroutine_fn yield_and_check(BackupBlockJob *job)
>      return false;
>  }
>  
> -static int coroutine_fn backup_do_read(BackupBlockJob *job,
> -                                       int64_t offset, unsigned int bytes,
> -                                       QEMUIOVector *qiov)
> +static void backup_job_wait_workers(BackupBlockJob *job)

Missing coroutine_fn.  This function yields and will crash unless called
from coroutine context.  Please use coroutine_fn throughout the code so
it's clear when a function is only allowed to be called from coroutine
context.

> +static void backup_worker_pause(BackupBlockJob *job)
> +{
> +    job->nb_busy_workers--;
> +
> +    trace_backup_worker_pause(qemu_coroutine_self(), job->nb_busy_workers,
> +                              job->waiting_for_workers);
> +
> +    if (job->nb_busy_workers == 0 && job->waiting_for_workers) {
> +        qemu_coroutine_add_next(job->common.co);
> +    }
> +
> +    qemu_co_queue_wait(&job->paused_workers);
> +
> +    trace_backup_worker_unpause(qemu_coroutine_self());
> +
> +    job->nb_busy_workers++;

This is similar to rwlock.  The main coroutine would use wrlock and the
workers would use rdlock.

rwlock avoids situations where the main blockjob re-enters a worker and
vice versa.  qemu_coroutine_add_next() is a lower-level primitive and
does not prevent this type of bug.

Please use rwlock.
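
With QEMU's CoRwlock that could look roughly like this (a sketch;
pause_lock is an assumed new CoRwlock field in BackupBlockJob,
initialized with qemu_co_rwlock_init()):

/* worker coroutine: hold the lock for reading while doing work */
qemu_co_rwlock_rdlock(&job->pause_lock);
/* ... copy one cluster ... */
qemu_co_rwlock_unlock(&job->pause_lock);

/* main job coroutine: the write lock waits until all readers are gone */
qemu_co_rwlock_wrlock(&job->pause_lock);
/* ... all workers are now quiescent ... */
qemu_co_rwlock_unlock(&job->pause_lock);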

> +static inline bool check_delay(BackupBlockJob *job)
> +{
> +    uint64_t delay_ns;
> +
> +    if (!job->common.speed) {
> +        return false;
> +    }
> +
> +    delay_ns = ratelimit_calculate_delay(&job->limit, job->sectors_read);
> +    job->sectors_read = 0;
> +
> +    if (delay_ns == 0) {
> +        if (job->delayed) {
> +            job->delayed = false;
> +            qemu_co_queue_restart_all(&job->paused_workers);
>          }
> +        return false;
> +    }
>  
> -        return ret;
> +    return job->delayed = true;

This looks like a "== vs =" bug.  Please reformat it so readers don't
have to puzzle out what you meant:

job->delayed = true;
return true;

> +static void coroutine_fn backup_worker_co(void *opaque)
> +{
> +    BackupBlockJob *job = opaque;
> +
> +    job->running_workers++;
> +    job->nb_busy_workers++;
> +
> +    while (true) {
> +        int64_t cluster = backup_get_work(job);
> +        trace_backup_worker_got_work(job, qemu_coroutine_self(), cluster);
> +
> +        switch (cluster) {
> +        case BACKUP_WORKER_STOP:
> +            job->nb_busy_workers--;
> +            job->running_workers--;
> +            if (job->nb_busy_workers == 0 && job->waiting_for_workers) {
> +                qemu_coroutine_add_next(job->common.co);

Is there a reason for using qemu_coroutine_add_next() instead of
qemu_coroutine_enter()?

I think neither function prevents a crash if backup_get_work() returns
BACKUP_WORKER_STOP and this coroutine has never yielded yet.  We would
try to re-enter the main blockjob coroutine.

>  static int coroutine_fn backup_before_write_notify(
>          NotifierWithReturn *notifier,
>          void *opaque)
>  {
> -    BackupBlockJob *job = container_of(notifier, BackupBlockJob, before_write);
> -    BdrvTrackedRequest *req = opaque;
> -    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
> -    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
> +    BdrvTrackedRequest *tr = opaque;
> +    NotifierRequest *nr;
> +    BackupBlockJob *job = (BackupBlockJob *)tr->bs->job;
> +    int64_t start = tr->offset / job->cluster_size;
> +    int64_t end = DIV_ROUND_UP(tr->offset + tr->bytes, job->cluster_size);
> +    int ret = 0;
> +
> +    assert((tr->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> +    assert((tr->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);

Are these assertions still necessary?  req->bytes >> BDRV_SECTOR_BITS
rounds down so we needed a multiple of BDRV_SECTOR_SIZE.  Now
DIV_ROUND_UP() is used so it doesn't matter.

> +
> +    nr = add_notif_req(job, start, end, qemu_coroutine_self());
>  
> -    assert(req->bs == blk_bs(job->common.blk));
> -    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> -    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
> +    if (nr == NULL) {
> +        trace_backup_before_write_notify_skip(job, tr->offset, tr->bytes);
> +    } else {
> +        trace_backup_before_write_notify_start(job, tr->offset, tr->bytes, nr,
> +                                               nr->start, nr->end, nr->nb_wait);
> +
> +        if (!job->has_errors) {
> +            qemu_co_queue_restart_all(&job->paused_workers);
> +        }
> +        co_aio_sleep_ns(blk_get_aio_context(job->common.blk),
> +                        QEMU_CLOCK_REALTIME, WRITE_NOTIFY_TIMEOUT_NS);
> +        if (nr->nb_wait > 0) {
> +            /* timer expired and read request not finished */
> +            ret = -EINVAL;

#define	EINVAL		22	/* Invalid argument */

Why did you choose this errno?

EIO is the errno that kernel uses when I/O fails (e.g. hardware timeout).

> @@ -586,50 +923,62 @@ static void coroutine_fn backup_run(void *opaque)
>      BackupCompleteData *data;
>      BlockDriverState *bs = blk_bs(job->common.blk);
>      int64_t end;
> -    int ret = 0;
> +    int i;
> +    bool is_top = job->sync_mode == MIRROR_SYNC_MODE_TOP;
> +    bool is_full = job->sync_mode == MIRROR_SYNC_MODE_FULL;
> +
> +    trace_backup_run();

This trace event isn't useful if the guest has multiple drives.  Please
include arguments that correlate the event with specific objects like
the BlockBackend/BlockDriverState/BlockJob instances, I/O request sector
and length, etc.
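
For example, a trace-events definition along these lines (the exact
argument list is up to you, this is just a sketch):

    backup_run(void *job, void *bs) "job %p bs %p"

would at least let the event be correlated with a specific job and
BlockDriverState.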

>  BlockErrorAction block_job_error_action(BlockJob *job, BlockdevOnError on_err,
>                                          int is_read, int error)
>  {

block_job_get_error_action() was copy-pasted from
block_job_error_action().  Now there are two functions with similar
names that duplicate code.

Why can't you use the existing block_job_error_action() function?
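
If the only difference is that the new caller must not trigger the side
effects (pausing the job, emitting the QMP event), then factor the
decision out and let block_job_error_action() call the same helper.
Rough shape (a sketch based on the existing switch in blockjob.c, not a
drop-in patch):

    /* Pure decision: map on_err + errno to an action, no side effects. */
    static BlockErrorAction get_error_action(BlockdevOnError on_err, int error)
    {
        switch (on_err) {
        case BLOCKDEV_ON_ERROR_ENOSPC:
            return (error == ENOSPC) ? BLOCK_ERROR_ACTION_STOP
                                     : BLOCK_ERROR_ACTION_REPORT;
        case BLOCKDEV_ON_ERROR_STOP:
            return BLOCK_ERROR_ACTION_STOP;
        case BLOCKDEV_ON_ERROR_REPORT:
            return BLOCK_ERROR_ACTION_REPORT;
        case BLOCKDEV_ON_ERROR_IGNORE:
            return BLOCK_ERROR_ACTION_IGNORE;
        default:
            abort();
        }
    }

block_job_error_action() keeps the pause/event handling on top of this,
and the backup code calls the helper directly.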


Thread overview: 59+ messages
2016-12-23 14:28 [Qemu-devel] [PATCH 00/21] new backup architecture Vladimir Sementsov-Ogievskiy
2016-12-23 14:28 ` [Qemu-devel] [PATCH 01/21] backup: move from done_bitmap to copy_bitmap Vladimir Sementsov-Ogievskiy
2017-01-23  5:34   ` Jeff Cody
2017-01-23 12:20   ` Vladimir Sementsov-Ogievskiy
2017-01-31 10:25   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 02/21] backup: init copy_bitmap from sync_bitmap for incremental Vladimir Sementsov-Ogievskiy
2017-01-24  7:09   ` Fam Zheng
2017-01-24  9:00     ` Vladimir Sementsov-Ogievskiy
2017-01-24  9:46       ` Fam Zheng
2017-01-24 10:16         ` Vladimir Sementsov-Ogievskiy
2017-01-31 10:36   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 03/21] backup: improve non-dirty bits progress processing Vladimir Sementsov-Ogievskiy
2017-01-24  7:17   ` Fam Zheng
2017-01-24  9:12     ` Vladimir Sementsov-Ogievskiy
2017-01-31 10:56       ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 04/21] backup: use copy_bitmap in incremental backup Vladimir Sementsov-Ogievskiy
2017-01-31 11:01   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 05/21] hbitmap: improve dirty iter Vladimir Sementsov-Ogievskiy
2017-01-31 11:20   ` Stefan Hajnoczi
2017-01-31 11:29   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 06/21] backup: rewrite top mode cluster skipping Vladimir Sementsov-Ogievskiy
2017-01-31 13:31   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 07/21] backup: refactor: merge top/full/incremental backup code Vladimir Sementsov-Ogievskiy
2017-01-31 14:26   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 08/21] backup: skip unallocated clusters for full mode Vladimir Sementsov-Ogievskiy
2017-01-24  7:59   ` Fam Zheng
2017-01-24  9:18     ` Vladimir Sementsov-Ogievskiy
2017-01-24  9:36       ` Fam Zheng
2017-01-24 10:13         ` Vladimir Sementsov-Ogievskiy
2017-01-24 11:12           ` Fam Zheng
2017-01-31 14:33   ` Stefan Hajnoczi
2017-01-31 14:38   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 09/21] backup: separate copy function Vladimir Sementsov-Ogievskiy
2017-01-31 14:40   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 10/21] backup: refactor backup_copy_cluster() Vladimir Sementsov-Ogievskiy
2017-01-31 14:57   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 11/21] backup: move r/w error handling code to r/w functions Vladimir Sementsov-Ogievskiy
2017-01-31 14:57   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 12/21] iotests: add supported_cache_modes to main function Vladimir Sementsov-Ogievskiy
2017-01-31 14:58   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 13/21] coroutine: add qemu_coroutine_add_next Vladimir Sementsov-Ogievskiy
2017-01-31 15:03   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 14/21] block: add trace point on bdrv_close_all Vladimir Sementsov-Ogievskiy
2017-01-31 15:03   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 15/21] bitmap: add bitmap_count_between() function Vladimir Sementsov-Ogievskiy
2017-01-31 15:15   ` Stefan Hajnoczi
2016-12-23 14:28 ` [Qemu-devel] [PATCH 16/21] hbitmap: add hbitmap_count_between() function Vladimir Sementsov-Ogievskiy
2017-01-31 15:56   ` Stefan Hajnoczi
2016-12-23 14:29 ` [Qemu-devel] [PATCH 17/21] backup: make all reads not serializing Vladimir Sementsov-Ogievskiy
2017-01-31 16:30   ` Stefan Hajnoczi
2016-12-23 14:29 ` [Qemu-devel] [PATCH 18/21] backup: new async architecture Vladimir Sementsov-Ogievskiy
2017-01-31 16:46   ` Stefan Hajnoczi
2017-02-01 16:13   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2016-12-23 14:29 ` [Qemu-devel] [PATCH 20/21] backup: move bitmap handling from backup_do_cow to get_work Vladimir Sementsov-Ogievskiy
2016-12-23 14:29 ` [Qemu-devel] [PATCH 21/21] backup: refactor: remove backup_do_cow() Vladimir Sementsov-Ogievskiy
2017-01-09 11:04 ` [Qemu-devel] [PATCH 00/21] new backup architecture Stefan Hajnoczi
2017-01-10  6:05   ` Jeff Cody
2017-01-10 18:48     ` John Snow
2017-01-31 10:20 ` Stefan Hajnoczi
