All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v5 0/4] fix image fleecing
@ 2018-07-09 16:37 Vladimir Sementsov-Ogievskiy
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 1/4] block/io: fix copy_range Vladimir Sementsov-Ogievskiy
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-07-09 16:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: pl, pbonzini, ronniesahlberg, famz, stefanha, mreitz, kwolf,
	jcody, jsnow, den, vsementsov

Hi all.

This fixes the image fleecing scheme for 3.0; details are in patch 04.
Patch 01 is a significant fix too.

v5: - keep one _internal recursive function, as in the original code [Fam]
    - grammar and rewording [Eric] (changed a bit;
                                    feel free to fix/reword in flight)

PS: 222 now fails for me, regardless of whether this series is applied...

v4:
01: fix copy_range architecture here too
02: rebase on 01
03: rebase

v3:
02: fix typo in assert, to fix build

v2:
01,02: new patches

03: - improve comment
    - fix assert in bdrv_aligned_pwritev
    - add asserts to not use the flag on read requests
    - support copy_range

04: - expand "detected cases" range 
    - s/fleecing/serialize_target_writes
    - support backup_cow_with_offload
    - drop restriction on compressed writes


v1 cover:

It's a continuation of discussion under
"[PATCH v2 2/3] block/fleecing-filter: new filter driver for fleecing" [1].

Here is my attempt to implement Kevin's idea that all backup writes (in
the fleecing case) should be serialized. (However, I've skipped fixing
the related permissions for now.)

It looks like these patches may replace patch [1] to make the fleecing
scheme safe. But I'm not sure; a review by Kevin is necessary.

A test is still needed to prove that this patch is necessary and that
it works.


PS: 222 now fails for me, regardless of whether this series is applied:
    222         [failed, exit status 1] - output mismatch (see 222.out.bad)
    --- /work/src/qemu/up-new-fleecing/tests/qemu-iotests/222.out   2018-07-03 20:18:24.247252451 +0300
    +++ /work/src/qemu/up-new-fleecing/tests/qemu-iotests/222.out.bad       2018-07-09 19:36:39.586747425 +0300
    @@ -40,28 +40,7 @@
     --- Verifying Data ---
     
     read -P0x5d 0 64k
    -read -P0xd5 1M 64k
    -read -P0xdc 32M 64k
    -read -P0xcd 0x3ff0000 64k
    -read -P0 0x00f8000 32k
    -read -P0 0x2010000 32k
    -read -P0 0x3fe0000 64k
    -
    ---- Cleanup ---
    -
    -{u'return': {}}
    -{u'timestamp': {u'seconds': 'SECS', u'microseconds': 'USECS'}, u'data': {u'device': u'drive0', u'type': u'backup', u'speed': 0, u'len': 67108864, u'offset': 393216}, u'event': u'BLOCK_JOB_CANCELLED'}
    -{u'return': {}}
    -{u'return': {}}
    -
    ---- Confirming writes ---
    -
    -read -P0xab 0 64k
    -read -P0xad 0x00f8000 64k
    -read -P0x1d 0x2008000 64k
    -read -P0xea 0x3fe0000 64k
    -read -P0xd5 0x108000 32k
    -read -P0xdc 32M 32k
    -read -P0xcd 0x3ff0000 64k
    -
    -Done
    +Traceback (most recent call last):
    +  File "222", line 132, in <module>
    +    assert qemu_io_silent('-r', '-f', 'raw', '-c', cmd, nbd_uri) == 0
    +AssertionError
    Failures: 222
    Failed 1 of 1 tests



Vladimir Sementsov-Ogievskiy (4):
  block/io: fix copy_range
  block: split flags in copy_range
  block: add BDRV_REQ_SERIALISING flag
  block/backup: fix fleecing scheme: use serialized writes

 include/block/block.h          |  27 ++++++++++-
 include/block/block_int.h      |  14 ++++--
 include/sysemu/block-backend.h |   3 +-
 block/backup.c                 |  20 +++++---
 block/block-backend.c          |   5 +-
 block/file-posix.c             |  21 ++++----
 block/io.c                     | 106 ++++++++++++++++++++++++++++-------------
 block/iscsi.c                  |   9 ++--
 block/qcow2.c                  |  20 ++++----
 block/raw-format.c             |  24 ++++++----
 qemu-img.c                     |   2 +-
 11 files changed, 174 insertions(+), 77 deletions(-)

-- 
2.11.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Qemu-devel] [PATCH v5 1/4] block/io: fix copy_range
  2018-07-09 16:37 [Qemu-devel] [PATCH v5 0/4] fix image fleecing Vladimir Sementsov-Ogievskiy
@ 2018-07-09 16:37 ` Vladimir Sementsov-Ogievskiy
  2018-07-10  1:49   ` Fam Zheng
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 2/4] block: split flags in copy_range Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-07-09 16:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: pl, pbonzini, ronniesahlberg, famz, stefanha, mreitz, kwolf,
	jcody, jsnow, den, vsementsov

Two things are fixed here:

1. Architecture

On each recursion step, we descend to the child of either src or dst,
but only one of them. So it is wrong to create tracked requests for
both on each step: doing so duplicates tracked requests.

2. Wait for serializing requests on the write path independently of
   BDRV_REQ_NO_SERIALISING

Before commit 9ded4a01149 "backup: Use copy offloading",
BDRV_REQ_NO_SERIALISING was used in only one case: the read in a
copy-on-write operation during backup. Also, the flag was handled only
on the read path (in bdrv_co_preadv and bdrv_aligned_preadv).

After 9ded4a01149, the flag is also used to skip waiting for
serializing operations on the backup target (in the same copy-on-write
case). This behavior change is unsubstantiated and potentially
dangerous; let's drop it and add additional asserts and documentation.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block.h | 11 +++++++++++
 block/io.c            | 42 +++++++++++++++++++++++++++---------------
 2 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index bc76b1e59f..5320f82ac9 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -50,6 +50,17 @@ typedef enum {
      * opened with BDRV_O_UNMAP.
      */
     BDRV_REQ_MAY_UNMAP          = 0x4,
+
+    /* The BDRV_REQ_NO_SERIALISING flag is only valid for reads and means that
+     * we don't want wait_serialising_requests() during read operation.
+     *
+     * This flag is used for backup copy on write operation, when we need to
+     * read old data before write (write notifier triggered). It is okay since
+     * we already waited for other serializing requests in the initiating write
+     * (see bdrv_aligned_pwritev), and it is necessary if the initiating write
+     * is already serializing (without the flag, the read would deadlock
+     * waiting for the serialising write to complete).
+     */
     BDRV_REQ_NO_SERIALISING     = 0x8,
     BDRV_REQ_FUA                = 0x10,
     BDRV_REQ_WRITE_COMPRESSED   = 0x20,
diff --git a/block/io.c b/block/io.c
index 1a2272fad3..94b6ec94e2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1572,6 +1572,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
     max_transfer = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX),
                                    align);
 
+    /* BDRV_REQ_NO_SERIALISING is only for read operation */
+    assert(!(flags & BDRV_REQ_NO_SERIALISING));
     waited = wait_serialising_requests(req);
     assert(!waited || !req->serialising);
     assert(req->overlap_offset <= offset);
@@ -2896,7 +2898,7 @@ static int coroutine_fn bdrv_co_copy_range_internal(BdrvChild *src,
                                                     BdrvRequestFlags flags,
                                                     bool recurse_src)
 {
-    BdrvTrackedRequest src_req, dst_req;
+    BdrvTrackedRequest req;
     int ret;
 
     if (!dst || !dst->bs) {
@@ -2923,32 +2925,42 @@ static int coroutine_fn bdrv_co_copy_range_internal(BdrvChild *src,
         || src->bs->encrypted || dst->bs->encrypted) {
         return -ENOTSUP;
     }
-    bdrv_inc_in_flight(src->bs);
-    bdrv_inc_in_flight(dst->bs);
-    tracked_request_begin(&src_req, src->bs, src_offset,
-                          bytes, BDRV_TRACKED_READ);
-    tracked_request_begin(&dst_req, dst->bs, dst_offset,
-                          bytes, BDRV_TRACKED_WRITE);
 
-    if (!(flags & BDRV_REQ_NO_SERIALISING)) {
-        wait_serialising_requests(&src_req);
-        wait_serialising_requests(&dst_req);
-    }
     if (recurse_src) {
+        bdrv_inc_in_flight(src->bs);
+        tracked_request_begin(&req, src->bs, src_offset, bytes,
+                              BDRV_TRACKED_READ);
+
+        if (!(flags & BDRV_REQ_NO_SERIALISING)) {
+            wait_serialising_requests(&req);
+        }
+
         ret = src->bs->drv->bdrv_co_copy_range_from(src->bs,
                                                     src, src_offset,
                                                     dst, dst_offset,
                                                     bytes, flags);
+
+        tracked_request_end(&req);
+        bdrv_dec_in_flight(src->bs);
     } else {
+        bdrv_inc_in_flight(dst->bs);
+        tracked_request_begin(&req, dst->bs, dst_offset, bytes,
+                              BDRV_TRACKED_WRITE);
+
+        /* BDRV_REQ_NO_SERIALISING is only for read operation,
+         * so we ignore it in flags.
+         */
+        wait_serialising_requests(&req);
+
         ret = dst->bs->drv->bdrv_co_copy_range_to(dst->bs,
                                                   src, src_offset,
                                                   dst, dst_offset,
                                                   bytes, flags);
+
+        tracked_request_end(&req);
+        bdrv_dec_in_flight(dst->bs);
     }
-    tracked_request_end(&src_req);
-    tracked_request_end(&dst_req);
-    bdrv_dec_in_flight(src->bs);
-    bdrv_dec_in_flight(dst->bs);
+
     return ret;
 }
 
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [Qemu-devel] [PATCH v5 2/4] block: split flags in copy_range
  2018-07-09 16:37 [Qemu-devel] [PATCH v5 0/4] fix image fleecing Vladimir Sementsov-Ogievskiy
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 1/4] block/io: fix copy_range Vladimir Sementsov-Ogievskiy
@ 2018-07-09 16:37 ` Vladimir Sementsov-Ogievskiy
  2018-07-10  1:52   ` Fam Zheng
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 3/4] block: add BDRV_REQ_SERIALISING flag Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-07-09 16:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: pl, pbonzini, ronniesahlberg, famz, stefanha, mreitz, kwolf,
	jcody, jsnow, den, vsementsov

Pass read flags and write flags separately. This is needed to handle
BDRV_REQ_NO_SERIALISING cleanly in the following patches.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block.h          |  3 ++-
 include/block/block_int.h      | 14 +++++++++----
 include/sysemu/block-backend.h |  3 ++-
 block/backup.c                 |  2 +-
 block/block-backend.c          |  5 +++--
 block/file-posix.c             | 21 +++++++++++--------
 block/io.c                     | 46 +++++++++++++++++++++++-------------------
 block/iscsi.c                  |  9 ++++++---
 block/qcow2.c                  | 20 +++++++++---------
 block/raw-format.c             | 24 ++++++++++++++--------
 qemu-img.c                     |  2 +-
 11 files changed, 90 insertions(+), 59 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 5320f82ac9..57a96d4988 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -690,5 +690,6 @@ void bdrv_unregister_buf(BlockDriverState *bs, void *host);
  **/
 int coroutine_fn bdrv_co_copy_range(BdrvChild *src, uint64_t src_offset,
                                     BdrvChild *dst, uint64_t dst_offset,
-                                    uint64_t bytes, BdrvRequestFlags flags);
+                                    uint64_t bytes, BdrvRequestFlags read_flags,
+                                    BdrvRequestFlags write_flags);
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index af71b414be..119b6e4ea5 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -218,7 +218,8 @@ struct BlockDriver {
                                                 BdrvChild *dst,
                                                 uint64_t dst_offset,
                                                 uint64_t bytes,
-                                                BdrvRequestFlags flags);
+                                                BdrvRequestFlags read_flags,
+                                                BdrvRequestFlags write_flags);
 
     /* Map [offset, offset + nbytes) range onto a child of bs to copy data to,
      * and invoke bdrv_co_copy_range_to(child, src, ...), or perform the copy
@@ -234,7 +235,8 @@ struct BlockDriver {
                                               BdrvChild *dst,
                                               uint64_t dst_offset,
                                               uint64_t bytes,
-                                              BdrvRequestFlags flags);
+                                              BdrvRequestFlags read_flags,
+                                              BdrvRequestFlags write_flags);
 
     /*
      * Building block for bdrv_block_status[_above] and
@@ -1153,10 +1155,14 @@ void blockdev_close_all_bdrv_states(void);
 
 int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, uint64_t src_offset,
                                          BdrvChild *dst, uint64_t dst_offset,
-                                         uint64_t bytes, BdrvRequestFlags flags);
+                                         uint64_t bytes,
+                                         BdrvRequestFlags read_flags,
+                                         BdrvRequestFlags write_flags);
 int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, uint64_t src_offset,
                                        BdrvChild *dst, uint64_t dst_offset,
-                                       uint64_t bytes, BdrvRequestFlags flags);
+                                       uint64_t bytes,
+                                       BdrvRequestFlags read_flags,
+                                       BdrvRequestFlags write_flags);
 
 int refresh_total_sectors(BlockDriverState *bs, int64_t hint);
 
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 8d03d493c2..830d873f24 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -234,6 +234,7 @@ void blk_unregister_buf(BlockBackend *blk, void *host);
 
 int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
                                    BlockBackend *blk_out, int64_t off_out,
-                                   int bytes, BdrvRequestFlags flags);
+                                   int bytes, BdrvRequestFlags read_flags,
+                                   BdrvRequestFlags write_flags);
 
 #endif
diff --git a/block/backup.c b/block/backup.c
index 81895ddbe2..f3e4e814b6 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -163,7 +163,7 @@ static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
     hbitmap_reset(job->copy_bitmap, start / job->cluster_size,
                   nr_clusters);
     ret = blk_co_copy_range(blk, start, job->target, start, nbytes,
-                            is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
+                            is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0, 0);
     if (ret < 0) {
         trace_backup_do_cow_copy_range_fail(job, start, ret);
         hbitmap_set(job->copy_bitmap, start / job->cluster_size,
diff --git a/block/block-backend.c b/block/block-backend.c
index 6b75bca317..ac8c3e0b1c 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2218,7 +2218,8 @@ void blk_unregister_buf(BlockBackend *blk, void *host)
 
 int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
                                    BlockBackend *blk_out, int64_t off_out,
-                                   int bytes, BdrvRequestFlags flags)
+                                   int bytes, BdrvRequestFlags read_flags,
+                                   BdrvRequestFlags write_flags)
 {
     int r;
     r = blk_check_byte_request(blk_in, off_in, bytes);
@@ -2231,5 +2232,5 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
     }
     return bdrv_co_copy_range(blk_in->root, off_in,
                               blk_out->root, off_out,
-                              bytes, flags);
+                              bytes, read_flags, write_flags);
 }
diff --git a/block/file-posix.c b/block/file-posix.c
index 98987b80f1..4fec8cb53c 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2589,18 +2589,23 @@ static void raw_abort_perm_update(BlockDriverState *bs)
     raw_handle_perm_lock(bs, RAW_PL_ABORT, 0, 0, NULL);
 }
 
-static int coroutine_fn raw_co_copy_range_from(BlockDriverState *bs,
-                                               BdrvChild *src, uint64_t src_offset,
-                                               BdrvChild *dst, uint64_t dst_offset,
-                                               uint64_t bytes, BdrvRequestFlags flags)
+static int coroutine_fn raw_co_copy_range_from(
+        BlockDriverState *bs, BdrvChild *src, uint64_t src_offset,
+        BdrvChild *dst, uint64_t dst_offset, uint64_t bytes,
+        BdrvRequestFlags read_flags, BdrvRequestFlags write_flags)
 {
-    return bdrv_co_copy_range_to(src, src_offset, dst, dst_offset, bytes, flags);
+    return bdrv_co_copy_range_to(src, src_offset, dst, dst_offset, bytes,
+                                 read_flags, write_flags);
 }
 
 static int coroutine_fn raw_co_copy_range_to(BlockDriverState *bs,
-                                             BdrvChild *src, uint64_t src_offset,
-                                             BdrvChild *dst, uint64_t dst_offset,
-                                             uint64_t bytes, BdrvRequestFlags flags)
+                                             BdrvChild *src,
+                                             uint64_t src_offset,
+                                             BdrvChild *dst,
+                                             uint64_t dst_offset,
+                                             uint64_t bytes,
+                                             BdrvRequestFlags read_flags,
+                                             BdrvRequestFlags write_flags)
 {
     BDRVRawState *s = bs->opaque;
     BDRVRawState *src_s;
diff --git a/block/io.c b/block/io.c
index 94b6ec94e2..5d5651ad84 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2890,13 +2890,11 @@ void bdrv_unregister_buf(BlockDriverState *bs, void *host)
     }
 }
 
-static int coroutine_fn bdrv_co_copy_range_internal(BdrvChild *src,
-                                                    uint64_t src_offset,
-                                                    BdrvChild *dst,
-                                                    uint64_t dst_offset,
-                                                    uint64_t bytes,
-                                                    BdrvRequestFlags flags,
-                                                    bool recurse_src)
+static int coroutine_fn bdrv_co_copy_range_internal(
+        BdrvChild *src, uint64_t src_offset, BdrvChild *dst,
+        uint64_t dst_offset, uint64_t bytes,
+        BdrvRequestFlags read_flags, BdrvRequestFlags write_flags,
+        bool recurse_src)
 {
     BdrvTrackedRequest req;
     int ret;
@@ -2908,8 +2906,8 @@ static int coroutine_fn bdrv_co_copy_range_internal(BdrvChild *src,
     if (ret) {
         return ret;
     }
-    if (flags & BDRV_REQ_ZERO_WRITE) {
-        return bdrv_co_pwrite_zeroes(dst, dst_offset, bytes, flags);
+    if (write_flags & BDRV_REQ_ZERO_WRITE) {
+        return bdrv_co_pwrite_zeroes(dst, dst_offset, bytes, write_flags);
     }
 
     if (!src || !src->bs) {
@@ -2931,14 +2929,15 @@ static int coroutine_fn bdrv_co_copy_range_internal(BdrvChild *src,
         tracked_request_begin(&req, src->bs, src_offset, bytes,
                               BDRV_TRACKED_READ);
 
-        if (!(flags & BDRV_REQ_NO_SERIALISING)) {
+        if (!(read_flags & BDRV_REQ_NO_SERIALISING)) {
             wait_serialising_requests(&req);
         }
 
         ret = src->bs->drv->bdrv_co_copy_range_from(src->bs,
                                                     src, src_offset,
                                                     dst, dst_offset,
-                                                    bytes, flags);
+                                                    bytes,
+                                                    read_flags, write_flags);
 
         tracked_request_end(&req);
         bdrv_dec_in_flight(src->bs);
@@ -2947,15 +2946,15 @@ static int coroutine_fn bdrv_co_copy_range_internal(BdrvChild *src,
         tracked_request_begin(&req, dst->bs, dst_offset, bytes,
                               BDRV_TRACKED_WRITE);
 
-        /* BDRV_REQ_NO_SERIALISING is only for read operation,
-         * so we ignore it in flags.
-         */
+        /* BDRV_REQ_NO_SERIALISING is only for read operation */
+        assert(!(write_flags & BDRV_REQ_NO_SERIALISING));
         wait_serialising_requests(&req);
 
         ret = dst->bs->drv->bdrv_co_copy_range_to(dst->bs,
                                                   src, src_offset,
                                                   dst, dst_offset,
-                                                  bytes, flags);
+                                                  bytes,
+                                                  read_flags, write_flags);
 
         tracked_request_end(&req);
         bdrv_dec_in_flight(dst->bs);
@@ -2970,10 +2969,12 @@ static int coroutine_fn bdrv_co_copy_range_internal(BdrvChild *src,
  * semantics. */
 int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, uint64_t src_offset,
                                          BdrvChild *dst, uint64_t dst_offset,
-                                         uint64_t bytes, BdrvRequestFlags flags)
+                                         uint64_t bytes,
+                                         BdrvRequestFlags read_flags,
+                                         BdrvRequestFlags write_flags)
 {
     return bdrv_co_copy_range_internal(src, src_offset, dst, dst_offset,
-                                       bytes, flags, true);
+                                       bytes, read_flags, write_flags, true);
 }
 
 /* Copy range from @src to @dst.
@@ -2982,19 +2983,22 @@ int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, uint64_t src_offset,
  * semantics. */
 int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, uint64_t src_offset,
                                        BdrvChild *dst, uint64_t dst_offset,
-                                       uint64_t bytes, BdrvRequestFlags flags)
+                                       uint64_t bytes,
+                                       BdrvRequestFlags read_flags,
+                                       BdrvRequestFlags write_flags)
 {
     return bdrv_co_copy_range_internal(src, src_offset, dst, dst_offset,
-                                       bytes, flags, false);
+                                       bytes, read_flags, write_flags, false);
 }
 
 int coroutine_fn bdrv_co_copy_range(BdrvChild *src, uint64_t src_offset,
                                     BdrvChild *dst, uint64_t dst_offset,
-                                    uint64_t bytes, BdrvRequestFlags flags)
+                                    uint64_t bytes, BdrvRequestFlags read_flags,
+                                    BdrvRequestFlags write_flags)
 {
     return bdrv_co_copy_range_from(src, src_offset,
                                    dst, dst_offset,
-                                   bytes, flags);
+                                   bytes, read_flags, write_flags);
 }
 
 static void bdrv_parent_cb_resize(BlockDriverState *bs)
diff --git a/block/iscsi.c b/block/iscsi.c
index ead2bd5aa7..38b917a1e5 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -2193,9 +2193,11 @@ static int coroutine_fn iscsi_co_copy_range_from(BlockDriverState *bs,
                                                  BdrvChild *dst,
                                                  uint64_t dst_offset,
                                                  uint64_t bytes,
-                                                 BdrvRequestFlags flags)
+                                                 BdrvRequestFlags read_flags,
+                                                 BdrvRequestFlags write_flags)
 {
-    return bdrv_co_copy_range_to(src, src_offset, dst, dst_offset, bytes, flags);
+    return bdrv_co_copy_range_to(src, src_offset, dst, dst_offset, bytes,
+                                 read_flags, write_flags);
 }
 
 static struct scsi_task *iscsi_xcopy_task(int param_len)
@@ -2332,7 +2334,8 @@ static int coroutine_fn iscsi_co_copy_range_to(BlockDriverState *bs,
                                                BdrvChild *dst,
                                                uint64_t dst_offset,
                                                uint64_t bytes,
-                                               BdrvRequestFlags flags)
+                                               BdrvRequestFlags read_flags,
+                                               BdrvRequestFlags write_flags)
 {
     IscsiLun *dst_lun = dst->bs->opaque;
     IscsiLun *src_lun;
diff --git a/block/qcow2.c b/block/qcow2.c
index 33b61b7480..867ce02d50 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3253,13 +3253,14 @@ static int coroutine_fn
 qcow2_co_copy_range_from(BlockDriverState *bs,
                          BdrvChild *src, uint64_t src_offset,
                          BdrvChild *dst, uint64_t dst_offset,
-                         uint64_t bytes, BdrvRequestFlags flags)
+                         uint64_t bytes, BdrvRequestFlags read_flags,
+                         BdrvRequestFlags write_flags)
 {
     BDRVQcow2State *s = bs->opaque;
     int ret;
     unsigned int cur_bytes; /* number of bytes in current iteration */
     BdrvChild *child = NULL;
-    BdrvRequestFlags cur_flags;
+    BdrvRequestFlags cur_write_flags;
 
     assert(!bs->encrypted);
     qemu_co_mutex_lock(&s->lock);
@@ -3268,7 +3269,7 @@ qcow2_co_copy_range_from(BlockDriverState *bs,
         uint64_t copy_offset = 0;
         /* prepare next request */
         cur_bytes = MIN(bytes, INT_MAX);
-        cur_flags = flags;
+        cur_write_flags = write_flags;
 
         ret = qcow2_get_cluster_offset(bs, src_offset, &cur_bytes, &copy_offset);
         if (ret < 0) {
@@ -3280,20 +3281,20 @@ qcow2_co_copy_range_from(BlockDriverState *bs,
             if (bs->backing && bs->backing->bs) {
                 int64_t backing_length = bdrv_getlength(bs->backing->bs);
                 if (src_offset >= backing_length) {
-                    cur_flags |= BDRV_REQ_ZERO_WRITE;
+                    cur_write_flags |= BDRV_REQ_ZERO_WRITE;
                 } else {
                     child = bs->backing;
                     cur_bytes = MIN(cur_bytes, backing_length - src_offset);
                     copy_offset = src_offset;
                 }
             } else {
-                cur_flags |= BDRV_REQ_ZERO_WRITE;
+                cur_write_flags |= BDRV_REQ_ZERO_WRITE;
             }
             break;
 
         case QCOW2_CLUSTER_ZERO_PLAIN:
         case QCOW2_CLUSTER_ZERO_ALLOC:
-            cur_flags |= BDRV_REQ_ZERO_WRITE;
+            cur_write_flags |= BDRV_REQ_ZERO_WRITE;
             break;
 
         case QCOW2_CLUSTER_COMPRESSED:
@@ -3317,7 +3318,7 @@ qcow2_co_copy_range_from(BlockDriverState *bs,
         ret = bdrv_co_copy_range_from(child,
                                       copy_offset,
                                       dst, dst_offset,
-                                      cur_bytes, cur_flags);
+                                      cur_bytes, read_flags, cur_write_flags);
         qemu_co_mutex_lock(&s->lock);
         if (ret < 0) {
             goto out;
@@ -3338,7 +3339,8 @@ static int coroutine_fn
 qcow2_co_copy_range_to(BlockDriverState *bs,
                        BdrvChild *src, uint64_t src_offset,
                        BdrvChild *dst, uint64_t dst_offset,
-                       uint64_t bytes, BdrvRequestFlags flags)
+                       uint64_t bytes, BdrvRequestFlags read_flags,
+                       BdrvRequestFlags write_flags)
 {
     BDRVQcow2State *s = bs->opaque;
     int offset_in_cluster;
@@ -3382,7 +3384,7 @@ qcow2_co_copy_range_to(BlockDriverState *bs,
         ret = bdrv_co_copy_range_to(src, src_offset,
                                     bs->file,
                                     cluster_offset + offset_in_cluster,
-                                    cur_bytes, flags);
+                                    cur_bytes, read_flags, write_flags);
         qemu_co_mutex_lock(&s->lock);
         if (ret < 0) {
             goto fail;
diff --git a/block/raw-format.c b/block/raw-format.c
index b78da564d4..a3591985f6 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -498,9 +498,13 @@ static int raw_probe_geometry(BlockDriverState *bs, HDGeometry *geo)
 }
 
 static int coroutine_fn raw_co_copy_range_from(BlockDriverState *bs,
-                                               BdrvChild *src, uint64_t src_offset,
-                                               BdrvChild *dst, uint64_t dst_offset,
-                                               uint64_t bytes, BdrvRequestFlags flags)
+                                               BdrvChild *src,
+                                               uint64_t src_offset,
+                                               BdrvChild *dst,
+                                               uint64_t dst_offset,
+                                               uint64_t bytes,
+                                               BdrvRequestFlags read_flags,
+                                               BdrvRequestFlags write_flags)
 {
     int ret;
 
@@ -509,13 +513,17 @@ static int coroutine_fn raw_co_copy_range_from(BlockDriverState *bs,
         return ret;
     }
     return bdrv_co_copy_range_from(bs->file, src_offset, dst, dst_offset,
-                                   bytes, flags);
+                                   bytes, read_flags, write_flags);
 }
 
 static int coroutine_fn raw_co_copy_range_to(BlockDriverState *bs,
-                                             BdrvChild *src, uint64_t src_offset,
-                                             BdrvChild *dst, uint64_t dst_offset,
-                                             uint64_t bytes, BdrvRequestFlags flags)
+                                             BdrvChild *src,
+                                             uint64_t src_offset,
+                                             BdrvChild *dst,
+                                             uint64_t dst_offset,
+                                             uint64_t bytes,
+                                             BdrvRequestFlags read_flags,
+                                             BdrvRequestFlags write_flags)
 {
     int ret;
 
@@ -524,7 +532,7 @@ static int coroutine_fn raw_co_copy_range_to(BlockDriverState *bs,
         return ret;
     }
     return bdrv_co_copy_range_to(src, src_offset, bs->file, dst_offset, bytes,
-                                 flags);
+                                 read_flags, write_flags);
 }
 
 BlockDriver bdrv_raw = {
diff --git a/qemu-img.c b/qemu-img.c
index 7651d8172c..f4074ebf75 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1783,7 +1783,7 @@ static int coroutine_fn convert_co_copy_range(ImgConvertState *s, int64_t sector
 
         ret = blk_co_copy_range(blk, offset, s->target,
                                 sector_num << BDRV_SECTOR_BITS,
-                                n << BDRV_SECTOR_BITS, 0);
+                                n << BDRV_SECTOR_BITS, 0, 0);
         if (ret < 0) {
             return ret;
         }
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [Qemu-devel] [PATCH v5 3/4] block: add BDRV_REQ_SERIALISING flag
  2018-07-09 16:37 [Qemu-devel] [PATCH v5 0/4] fix image fleecing Vladimir Sementsov-Ogievskiy
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 1/4] block/io: fix copy_range Vladimir Sementsov-Ogievskiy
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 2/4] block: split flags in copy_range Vladimir Sementsov-Ogievskiy
@ 2018-07-09 16:37 ` Vladimir Sementsov-Ogievskiy
  2018-07-10  2:51   ` Fam Zheng
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 4/4] block/backup: fix fleecing scheme: use serialized writes Vladimir Sementsov-Ogievskiy
  2018-07-10 11:13 ` [Qemu-devel] [PATCH v5 0/4] fix image fleecing Kevin Wolf
  4 siblings, 1 reply; 10+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-07-09 16:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: pl, pbonzini, ronniesahlberg, famz, stefanha, mreitz, kwolf,
	jcody, jsnow, den, vsementsov

Serialized writes should be used in the copy-on-write operations of
backup(sync=none) for the image fleecing scheme.

We need to change an assert in bdrv_aligned_pwritev that was added in
28de2dcd88de. The assert may fail now, because the call to
wait_serialising_requests here may become the first call to it for this
request with the serializing flag set. That happens when the request is
aligned (otherwise, we would have already set the serializing flag
before calling bdrv_aligned_pwritev and correspondingly waited for all
intersecting requests). However, for aligned requests we do not need to
worry about invalidating previously read data, as there is no such
data. Therefore, let's just update the assert to ignore aligned
requests.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block.h | 13 ++++++++++++-
 block/io.c            | 26 +++++++++++++++++++++++++-
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 57a96d4988..6623d15432 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -69,8 +69,19 @@ typedef enum {
      * content. */
     BDRV_REQ_WRITE_UNCHANGED    = 0x40,
 
+    /* BDRV_REQ_SERIALISING forces request serialisation for writes.
+     * It is used to ensure that writes to the backing file of a backup process
+     * target cannot race with a read of the backup target that defers to the
+     * backing file.
+     *
+     * Note that BDRV_REQ_SERIALISING is _not_ opposite in meaning to
+     * BDRV_REQ_NO_SERIALISING. A more descriptive name for the latter might be
+     * _DO_NOT_WAIT_FOR_SERIALISING, except that is too long.
+     */
+    BDRV_REQ_SERIALISING        = 0x80,
+
     /* Mask of valid flags */
-    BDRV_REQ_MASK               = 0x7f,
+    BDRV_REQ_MASK               = 0xff,
 } BdrvRequestFlags;
 
 typedef struct BlockSizes {
diff --git a/block/io.c b/block/io.c
index 5d5651ad84..102cb0e1ba 100644
--- a/block/io.c
+++ b/block/io.c
@@ -623,6 +623,16 @@ static void mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
     req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
 }
 
+static bool is_request_serialising_and_aligned(BdrvTrackedRequest *req)
+{
+    /* If the request is serialising, overlap_offset and overlap_bytes are set,
+     * so we can check whether the request is aligned. Otherwise, we don't care
+     * and return false.
+     */
+
+    return req->serialising && (req->offset == req->overlap_offset) &&
+           (req->bytes == req->overlap_bytes);
+}
+
 /**
  * Round a region to cluster boundaries
  */
@@ -1291,6 +1301,9 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
         mark_request_serialising(req, bdrv_get_cluster_size(bs));
     }
 
+    /* BDRV_REQ_SERIALISING is only for write operation */
+    assert(!(flags & BDRV_REQ_SERIALISING));
+
     if (!(flags & BDRV_REQ_NO_SERIALISING)) {
         wait_serialising_requests(req);
     }
@@ -1574,8 +1587,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
 
     /* BDRV_REQ_NO_SERIALISING is only for read operation */
     assert(!(flags & BDRV_REQ_NO_SERIALISING));
+
+    if (flags & BDRV_REQ_SERIALISING) {
+        mark_request_serialising(req, bdrv_get_cluster_size(bs));
+    }
+
     waited = wait_serialising_requests(req);
-    assert(!waited || !req->serialising);
+    assert(!waited || !req->serialising ||
+           is_request_serialising_and_aligned(req));
     assert(req->overlap_offset <= offset);
     assert(offset + bytes <= req->overlap_offset + req->overlap_bytes);
     if (flags & BDRV_REQ_WRITE_UNCHANGED) {
@@ -2929,6 +2948,8 @@ static int coroutine_fn bdrv_co_copy_range_internal(
         tracked_request_begin(&req, src->bs, src_offset, bytes,
                               BDRV_TRACKED_READ);
 
+        /* BDRV_REQ_SERIALISING is only for write operation */
+        assert(!(read_flags & BDRV_REQ_SERIALISING));
         if (!(read_flags & BDRV_REQ_NO_SERIALISING)) {
             wait_serialising_requests(&req);
         }
@@ -2948,6 +2969,9 @@ static int coroutine_fn bdrv_co_copy_range_internal(
 
         /* BDRV_REQ_NO_SERIALISING is only for read operation */
         assert(!(write_flags & BDRV_REQ_NO_SERIALISING));
+        if (write_flags & BDRV_REQ_SERIALISING) {
+            mark_request_serialising(&req, bdrv_get_cluster_size(dst->bs));
+        }
         wait_serialising_requests(&req);
 
         ret = dst->bs->drv->bdrv_co_copy_range_to(dst->bs,
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [Qemu-devel] [PATCH v5 4/4] block/backup: fix fleecing scheme: use serialized writes
  2018-07-09 16:37 [Qemu-devel] [PATCH v5 0/4] fix image fleecing Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 3/4] block: add BDRV_REQ_SERIALISING flag Vladimir Sementsov-Ogievskiy
@ 2018-07-09 16:37 ` Vladimir Sementsov-Ogievskiy
  2018-07-10  2:57   ` Fam Zheng
  2018-07-10 11:13 ` [Qemu-devel] [PATCH v5 0/4] fix image fleecing Kevin Wolf
  4 siblings, 1 reply; 10+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-07-09 16:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: pl, pbonzini, ronniesahlberg, famz, stefanha, mreitz, kwolf,
	jcody, jsnow, den, vsementsov

The fleecing scheme works as follows: we want a kind of temporary
snapshot of active drive A. We create temporary image B, with
B->backing = A. Then we start backup(sync=none) from A to B. From this
point, B reads as a point-in-time snapshot of A (A continues to be the
active drive, accepting guest IO).

This scheme needs some additional synchronization between reads from B
and backup COW operations; otherwise, the following situation is
theoretically possible:

(assume B is qcow2, and the client is an NBD client reading from B)

1. the client starts reading, takes the qcow2 mutex in qcow2_co_preadv,
   and gets as far as loading an L2 table (assume a cache miss)

2. guest write => backup COW => qcow2 write =>
   tries to take the qcow2 mutex => waits

3. the L2 table is loaded; we see that the cluster is UNALLOCATED, go
   to "case QCOW2_CLUSTER_UNALLOCATED" and unlock the mutex before
   bdrv_co_preadv(bs->backing, ...)

4. aha, the mutex is unlocked, so the backup COW continues, and we
   finally finish the guest write and change the cluster in our active
   disk A

5. finally, we do bdrv_co_preadv(bs->backing, ...) and read the
   _new updated_ data.

To avoid this, let's make backup writes serializing, so that they do
not intersect with reads from B.

Note: we expand the range of handled cases from (sync=none and
B->backing = A) to just (A in the backing chain of B), to finally allow
safe reading from B during backup in all cases when A is in the backing
chain of B, i.e. B formally looks like a point-in-time snapshot of A.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index f3e4e814b6..319fc922e8 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -47,6 +47,8 @@ typedef struct BackupBlockJob {
     HBitmap *copy_bitmap;
     bool use_copy_range;
     int64_t copy_range_size;
+
+    bool serialize_target_writes;
 } BackupBlockJob;
 
 static const BlockJobDriver backup_job_driver;
@@ -102,6 +104,8 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
     QEMUIOVector qiov;
     BlockBackend *blk = job->common.blk;
     int nbytes;
+    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
+    int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
 
     hbitmap_reset(job->copy_bitmap, start / job->cluster_size, 1);
     nbytes = MIN(job->cluster_size, job->len - start);
@@ -112,8 +116,7 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
     iov.iov_len = nbytes;
     qemu_iovec_init_external(&qiov, &iov, 1);
 
-    ret = blk_co_preadv(blk, start, qiov.size, &qiov,
-                        is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
+    ret = blk_co_preadv(blk, start, qiov.size, &qiov, read_flags);
     if (ret < 0) {
         trace_backup_do_cow_read_fail(job, start, ret);
         if (error_is_read) {
@@ -124,11 +127,11 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
 
     if (qemu_iovec_is_zero(&qiov)) {
         ret = blk_co_pwrite_zeroes(job->target, start,
-                                   qiov.size, BDRV_REQ_MAY_UNMAP);
+                                   qiov.size, write_flags | BDRV_REQ_MAY_UNMAP);
     } else {
         ret = blk_co_pwritev(job->target, start,
-                             qiov.size, &qiov,
-                             job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
+                             qiov.size, &qiov, write_flags |
+                             (job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0));
     }
     if (ret < 0) {
         trace_backup_do_cow_write_fail(job, start, ret);
@@ -156,6 +159,8 @@ static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
     int nr_clusters;
     BlockBackend *blk = job->common.blk;
     int nbytes;
+    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
+    int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
 
     assert(QEMU_IS_ALIGNED(job->copy_range_size, job->cluster_size));
     nbytes = MIN(job->copy_range_size, end - start);
@@ -163,7 +168,7 @@ static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
     hbitmap_reset(job->copy_bitmap, start / job->cluster_size,
                   nr_clusters);
     ret = blk_co_copy_range(blk, start, job->target, start, nbytes,
-                            is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0, 0);
+                            read_flags, write_flags);
     if (ret < 0) {
         trace_backup_do_cow_copy_range_fail(job, start, ret);
         hbitmap_set(job->copy_bitmap, start / job->cluster_size,
@@ -701,6 +706,9 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
                        sync_bitmap : NULL;
     job->compress = compress;
 
+    /* Detect image-fleecing (and similar) schemes */
+    job->serialize_target_writes = bdrv_chain_contains(target, bs);
+
     /* If there is no backing file on the target, we cannot rely on COW if our
      * backup cluster size is smaller than the target cluster size. Even for
      * targets with a backing file, try to avoid COW if possible. */
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v5 1/4] block/io: fix copy_range
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 1/4] block/io: fix copy_range Vladimir Sementsov-Ogievskiy
@ 2018-07-10  1:49   ` Fam Zheng
  0 siblings, 0 replies; 10+ messages in thread
From: Fam Zheng @ 2018-07-10  1:49 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-devel, qemu-block, pl, pbonzini, ronniesahlberg, stefanha,
	mreitz, kwolf, jcody, jsnow, den

On Mon, 07/09 19:37, Vladimir Sementsov-Ogievskiy wrote:
> Here two things are fixed:
> 
> 1. Architecture
> 
> On each recursion step, we go to the child of src or dst, only for one
> of them. So, it's wrong to create tracked requests for both on each
> step. It leads to tracked requests duplication.
> 
> 2. Wait for serializing requests on write path independently of
>    BDRV_REQ_NO_SERIALISING
> 
> Before commit 9ded4a01149 "backup: Use copy offloading",
> BDRV_REQ_NO_SERIALISING was used for only one case: read in
> copy-on-write operation during backup. Also, the flag was handled only
> on read path (in bdrv_co_preadv and bdrv_aligned_preadv).
> 
> After 9ded4a01149, flag is used for not waiting serializing operations
> on backup target (in same case of copy-on-write operation). This
> behavior change is unsubstantiated and potentially dangerous, let's
> drop it and add additional asserts and documentation.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Reviewed-by: Fam Zheng <famz@redhat.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/4] block: split flags in copy_range
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 2/4] block: split flags in copy_range Vladimir Sementsov-Ogievskiy
@ 2018-07-10  1:52   ` Fam Zheng
  0 siblings, 0 replies; 10+ messages in thread
From: Fam Zheng @ 2018-07-10  1:52 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-devel, qemu-block, pl, pbonzini, ronniesahlberg, stefanha,
	mreitz, kwolf, jcody, jsnow, den

On Mon, 07/09 19:37, Vladimir Sementsov-Ogievskiy wrote:
> Pass read flags and write flags separately. This is needed to handle
> coming BDRV_REQ_NO_SERIALISING clearly in following patches.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Reviewed-by: Fam Zheng <famz@redhat.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] block: add BDRV_REQ_SERIALISING flag
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 3/4] block: add BDRV_REQ_SERIALISING flag Vladimir Sementsov-Ogievskiy
@ 2018-07-10  2:51   ` Fam Zheng
  0 siblings, 0 replies; 10+ messages in thread
From: Fam Zheng @ 2018-07-10  2:51 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-devel, qemu-block, pl, pbonzini, ronniesahlberg, stefanha,
	mreitz, kwolf, jcody, jsnow, den

On Mon, 07/09 19:37, Vladimir Sementsov-Ogievskiy wrote:
> Serialized writes should be used in the copy-on-write operations of
> backup(sync=none) for the image fleecing scheme.
> 
> We need to change an assert in bdrv_aligned_pwritev that was added in
> 28de2dcd88de. The assert may fail now, because the call to
> wait_serialising_requests here may become the first call to it for this
> request with the serializing flag set. That happens when the request is
> aligned (otherwise, we would have already set the serializing flag
> before calling bdrv_aligned_pwritev and correspondingly waited for all
> intersecting requests). However, for aligned requests we do not need to
> worry about invalidating previously read data, as there is no such
> data. Therefore, let's just update the assert to ignore aligned
> requests.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/block/block.h | 13 ++++++++++++-
>  block/io.c            | 26 +++++++++++++++++++++++++-
>  2 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index 57a96d4988..6623d15432 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -69,8 +69,19 @@ typedef enum {
>       * content. */
>      BDRV_REQ_WRITE_UNCHANGED    = 0x40,
>  
> +    /* BDRV_REQ_SERIALISING forces request serialisation for writes.
> +     * It is used to ensure that writes to the backing file of a backup process
> +     * target cannot race with a read of the backup target that defers to the
> +     * backing file.
> +     *
> +     * Note that BDRV_REQ_SERIALISING is _not_ opposite in meaning to
> +     * BDRV_REQ_NO_SERIALISING. A more descriptive name for the latter might be
> +     * _DO_NOT_WAIT_FOR_SERIALISING, except that is too long.
> +     */
> +    BDRV_REQ_SERIALISING        = 0x80,
> +
>      /* Mask of valid flags */
> -    BDRV_REQ_MASK               = 0x7f,
> +    BDRV_REQ_MASK               = 0xff,
>  } BdrvRequestFlags;
>  
>  typedef struct BlockSizes {
> diff --git a/block/io.c b/block/io.c
> index 5d5651ad84..102cb0e1ba 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -623,6 +623,16 @@ static void mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
>      req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
>  }
>  
> +static bool is_request_serialising_and_aligned(BdrvTrackedRequest *req)
> +{
> +    /* if request is serialising, overlap_offset and overlap_bytes are set, so
> +     * we can check is request aligned. Otherwise don't care and return false

"if the request is aligned".

> +     */
> +
> +    return req->serialising && (req->offset == req->overlap_offset) &&
> +           (req->bytes == req->overlap_bytes);
> +}
> +
>  /**
>   * Round a region to cluster boundaries
>   */
> @@ -1291,6 +1301,9 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
>          mark_request_serialising(req, bdrv_get_cluster_size(bs));
>      }
>  
> +    /* BDRV_REQ_SERIALISING is only for write operation */
> +    assert(!(flags & BDRV_REQ_SERIALISING));
> +
>      if (!(flags & BDRV_REQ_NO_SERIALISING)) {
>          wait_serialising_requests(req);
>      }
> @@ -1574,8 +1587,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
>  
>      /* BDRV_REQ_NO_SERIALISING is only for read operation */
>      assert(!(flags & BDRV_REQ_NO_SERIALISING));
> +
> +    if (flags & BDRV_REQ_SERIALISING) {
> +        mark_request_serialising(req, bdrv_get_cluster_size(bs));
> +    }
> +
>      waited = wait_serialising_requests(req);
> -    assert(!waited || !req->serialising);
> +    assert(!waited || !req->serialising ||
> +           is_request_serialising_and_aligned(req));

Note to myself: what can happen is that we have a CoR request in flight which
was marked serialising, and this aligned write overlaps with it. In this case,
we will wait (waited == true), and req->serialising == true as set by
BDRV_REQ_SERIALISING. (Previously we must have called
wait_serialising_requests() in bdrv_co_pwritev() or bdrv_co_do_zero_pwritev()
before reaching here, if req->serialising, in which case no wait should happen.)

>      assert(req->overlap_offset <= offset);
>      assert(offset + bytes <= req->overlap_offset + req->overlap_bytes);
>      if (flags & BDRV_REQ_WRITE_UNCHANGED) {
> @@ -2929,6 +2948,8 @@ static int coroutine_fn bdrv_co_copy_range_internal(
>          tracked_request_begin(&req, src->bs, src_offset, bytes,
>                                BDRV_TRACKED_READ);
>  
> +        /* BDRV_REQ_SERIALISING is only for write operation */
> +        assert(!(read_flags & BDRV_REQ_SERIALISING));
>          if (!(read_flags & BDRV_REQ_NO_SERIALISING)) {
>              wait_serialising_requests(&req);
>          }
> @@ -2948,6 +2969,9 @@ static int coroutine_fn bdrv_co_copy_range_internal(
>  
>          /* BDRV_REQ_NO_SERIALISING is only for read operation */
>          assert(!(write_flags & BDRV_REQ_NO_SERIALISING));
> +        if (write_flags & BDRV_REQ_SERIALISING) {
> +            mark_request_serialising(&req, bdrv_get_cluster_size(dst->bs));
> +        }
>          wait_serialising_requests(&req);
>  
>          ret = dst->bs->drv->bdrv_co_copy_range_to(dst->bs,
> -- 
> 2.11.1
> 

Reviewed-by: Fam Zheng <famz@redhat.com>

Fam

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v5 4/4] block/backup: fix fleecing scheme: use serialized writes
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 4/4] block/backup: fix fleecing scheme: use serialized writes Vladimir Sementsov-Ogievskiy
@ 2018-07-10  2:57   ` Fam Zheng
  0 siblings, 0 replies; 10+ messages in thread
From: Fam Zheng @ 2018-07-10  2:57 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-devel, qemu-block, pl, pbonzini, ronniesahlberg, stefanha,
	mreitz, kwolf, jcody, jsnow, den

On Mon, 07/09 19:37, Vladimir Sementsov-Ogievskiy wrote:
> The fleecing scheme works as follows: we want a kind of temporary
> snapshot of active drive A. We create temporary image B, with
> B->backing = A. Then we start backup(sync=none) from A to B. From this
> point, B reads as a point-in-time snapshot of A (A continues to be the
> active drive, accepting guest IO).
> 
> This scheme needs some additional synchronization between reads from B
> and backup COW operations; otherwise, the following situation is
> theoretically possible:
> 
> (assume B is qcow2, and the client is an NBD client reading from B)
> 
> 1. the client starts reading, takes the qcow2 mutex in qcow2_co_preadv,
>    and gets as far as loading an L2 table (assume a cache miss)
> 
> 2. guest write => backup COW => qcow2 write =>
>    tries to take the qcow2 mutex => waits
> 
> 3. the L2 table is loaded; we see that the cluster is UNALLOCATED, go
>    to "case QCOW2_CLUSTER_UNALLOCATED" and unlock the mutex before
>    bdrv_co_preadv(bs->backing, ...)
> 
> 4. aha, the mutex is unlocked, so the backup COW continues, and we
>    finally finish the guest write and change the cluster in our active
>    disk A
> 
> 5. finally, we do bdrv_co_preadv(bs->backing, ...) and read the
>    _new updated_ data.
> 
> To avoid this, let's make backup writes serializing, so that they do
> not intersect with reads from B.
> 
> Note: we expand the range of handled cases from (sync=none and
> B->backing = A) to just (A in the backing chain of B), to finally allow
> safe reading from B during backup in all cases when A is in the backing
> chain of B, i.e. B formally looks like a point-in-time snapshot of A.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/block/backup.c b/block/backup.c
> index f3e4e814b6..319fc922e8 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -47,6 +47,8 @@ typedef struct BackupBlockJob {
>      HBitmap *copy_bitmap;
>      bool use_copy_range;
>      int64_t copy_range_size;
> +
> +    bool serialize_target_writes;
>  } BackupBlockJob;
>  
>  static const BlockJobDriver backup_job_driver;
> @@ -102,6 +104,8 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
>      QEMUIOVector qiov;
>      BlockBackend *blk = job->common.blk;
>      int nbytes;
> +    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
> +    int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
>  
>      hbitmap_reset(job->copy_bitmap, start / job->cluster_size, 1);
>      nbytes = MIN(job->cluster_size, job->len - start);
> @@ -112,8 +116,7 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
>      iov.iov_len = nbytes;
>      qemu_iovec_init_external(&qiov, &iov, 1);
>  
> -    ret = blk_co_preadv(blk, start, qiov.size, &qiov,
> -                        is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
> +    ret = blk_co_preadv(blk, start, qiov.size, &qiov, read_flags);
>      if (ret < 0) {
>          trace_backup_do_cow_read_fail(job, start, ret);
>          if (error_is_read) {
> @@ -124,11 +127,11 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
>  
>      if (qemu_iovec_is_zero(&qiov)) {
>          ret = blk_co_pwrite_zeroes(job->target, start,
> -                                   qiov.size, BDRV_REQ_MAY_UNMAP);
> +                                   qiov.size, write_flags | BDRV_REQ_MAY_UNMAP);
>      } else {
>          ret = blk_co_pwritev(job->target, start,
> -                             qiov.size, &qiov,
> -                             job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
> +                             qiov.size, &qiov, write_flags |
> +                             (job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0));
>      }
>      if (ret < 0) {
>          trace_backup_do_cow_write_fail(job, start, ret);
> @@ -156,6 +159,8 @@ static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
>      int nr_clusters;
>      BlockBackend *blk = job->common.blk;
>      int nbytes;
> +    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
> +    int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
>  
>      assert(QEMU_IS_ALIGNED(job->copy_range_size, job->cluster_size));
>      nbytes = MIN(job->copy_range_size, end - start);
> @@ -163,7 +168,7 @@ static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
>      hbitmap_reset(job->copy_bitmap, start / job->cluster_size,
>                    nr_clusters);
>      ret = blk_co_copy_range(blk, start, job->target, start, nbytes,
> -                            is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0, 0);
> +                            read_flags, write_flags);
>      if (ret < 0) {
>          trace_backup_do_cow_copy_range_fail(job, start, ret);
>          hbitmap_set(job->copy_bitmap, start / job->cluster_size,
> @@ -701,6 +706,9 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>                         sync_bitmap : NULL;
>      job->compress = compress;
>  
> +    /* Detect image-fleecing (and similar) schemes */
> +    job->serialize_target_writes = bdrv_chain_contains(target, bs);
> +
>      /* If there is no backing file on the target, we cannot rely on COW if our
>       * backup cluster size is smaller than the target cluster size. Even for
>       * targets with a backing file, try to avoid COW if possible. */

I always thought mirror is by far the most complicated job, but now backup is
catching up quickly!

Reviewed-by: Fam Zheng <famz@redhat.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v5 0/4] fix image fleecing
  2018-07-09 16:37 [Qemu-devel] [PATCH v5 0/4] fix image fleecing Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 4/4] block/backup: fix fleecing scheme: use serialized writes Vladimir Sementsov-Ogievskiy
@ 2018-07-10 11:13 ` Kevin Wolf
  4 siblings, 0 replies; 10+ messages in thread
From: Kevin Wolf @ 2018-07-10 11:13 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-devel, qemu-block, pl, pbonzini, ronniesahlberg, famz,
	stefanha, mreitz, jcody, jsnow, den

Am 09.07.2018 um 18:37 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Hi all.
> 
> This fixes image fleecing scheme for 3.0, details are in 04 patch.
> 01 is a significant fix too.
> 
> v5: - keep one _internal recursive function like in original code [Fam]
>     - grammar and rewording [Eric] (a bit changed,
>                                     be free to fix/reword in flight)
> 
> PS: 222 now fails for me, no matter with or without these series...

Thanks, applied to the block branch.

Kevin

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-07-10 11:14 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-09 16:37 [Qemu-devel] [PATCH v5 0/4] fix image fleecing Vladimir Sementsov-Ogievskiy
2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 1/4] block/io: fix copy_range Vladimir Sementsov-Ogievskiy
2018-07-10  1:49   ` Fam Zheng
2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 2/4] block: split flags in copy_range Vladimir Sementsov-Ogievskiy
2018-07-10  1:52   ` Fam Zheng
2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 3/4] block: add BDRV_REQ_SERIALISING flag Vladimir Sementsov-Ogievskiy
2018-07-10  2:51   ` Fam Zheng
2018-07-09 16:37 ` [Qemu-devel] [PATCH v5 4/4] block/backup: fix fleecing scheme: use serialized writes Vladimir Sementsov-Ogievskiy
2018-07-10  2:57   ` Fam Zheng
2018-07-10 11:13 ` [Qemu-devel] [PATCH v5 0/4] fix image fleecing Kevin Wolf
