All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation
@ 2018-12-03 10:14 Anton Nefedov
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags Anton Nefedov
                   ` (8 more replies)
  0 siblings, 9 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 10:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto,
	Vladimir Sementsov-Ogievskiy, Anton Nefedov

new in v10:
    - patches 1-3,6,7: rebase after REQ_WRITE_UNCHANGED
    - patch 3: drop supported_zero_flags. My bad, no write_zeroes in quorum.
    - patch 4: almost trivial rebase. RB-tags not stripped.
               Choose another constant for BDRV_REQ_ALLOCATE
    - patch 5: rebase. Instead of marking REQ_ALLOCATE serialising, accompany
               it with REQ_SERIALISING.
    - patch 7: add symmetric copy-on-read change
    - patch 8: trivial rebase. RB-tags not stripped.

v9: http://lists.nongnu.org/archive/html/qemu-devel/2018-05/msg01667.html
    - fixed commentary wording in patches 4, 8
    - rebased (no conflicts)

v8: http://lists.nongnu.org/archive/html/qemu-devel/2018-03/msg03291.html

----

This pull request is to start to improve a few performance points of
qcow2 format:

  1. non cluster-aligned write requests (to unallocated clusters) explicitly
     pad data with zeroes if there is no backing data.
     Resulting increase in ops number and potential cluster fragmentation
     (on the host file) is already solved by:
       ee22a9d qcow2: Merge the writing of the COW regions with the guest data
     However, in case of zero COW regions, that can be avoided at all
     but the whole clusters are preallocated and zeroed in a single
     efficient write_zeroes() operation

  2. moreover, efficient write_zeroes() operation can be used to preallocate
     space megabytes (*configurable number) ahead which gives noticeable
     improvement on some storage types (e.g. distributed storage)
     where the space allocation operation might be expensive)
     (Not included in this patchset since v6).

  3. this will also allow to enable simultaneous writes to the same unallocated
     cluster after the space has been allocated & zeroed but before
     the first data is written and the cluster is linked to L2.
     (Not included in this patchset).

Efficient write_zeroes usually implies that the blocks are not actually
written to but only reserved and marked as zeroed by the storage.
In this patchset, file-posix driver is marked as supporting this operation
if it supports (/configured to support) fallocate() operation.

Existing bdrv_write_zeroes() falls back to writing zero buffers if
write_zeroes is not supported by the driver.
A new flag (BDRV_REQ_ALLOCATE) is introduced to avoid that but return ENOTSUP.
Such allocate requests are also implemented to possibly overlap with the
other requests. No wait is performed but an error returned in such case as well.
So the operation should be considered advisory and a fallback scenario still
handled by the caller (in this case, qcow2 driver).

simple perf test:

  qemu-img create -f qcow2 test.img 4G && \
  qemu-img bench -c $((1024*1024)) -f qcow2 -n -s 4k -t none -w test.img

test results (seconds):

    +-----------+-------+------+-------+------+------+
    |   file    |    before    |     after    | gain |
    +-----------+-------+------+-------+------+------+
    |    ssd    |      61.153  |      36.313  |  41% |
    |    hdd    |     112.676  |     122.056  |  -8% |
    +-----------+--------------+--------------+------+

Anton Nefedov (9):
  mirror: inherit supported write/zero flags
  blkverify: set supported write/zero flags
  quorum: set supported write flags
  block: introduce BDRV_REQ_ALLOCATE flag
  block: treat BDRV_REQ_ALLOCATE as serialising
  file-posix: support BDRV_REQ_ALLOCATE
  block: support BDRV_REQ_ALLOCATE in passthrough drivers
  qcow2: skip writing zero buffers to empty COW areas
  iotest 134: test cluster-misaligned encrypted write

 qapi/block-core.json       |  4 +-
 block/qcow2.h              |  6 +++
 include/block/block.h      |  9 ++++-
 include/block/block_int.h  |  2 +-
 block/blkdebug.c           |  2 +-
 block/blkverify.c          | 10 ++++-
 block/copy-on-read.c       |  4 +-
 block/file-posix.c         |  8 +++-
 block/io.c                 | 49 ++++++++++++++++++-----
 block/mirror.c             |  8 +++-
 block/qcow2-cluster.c      |  2 +-
 block/qcow2.c              | 80 +++++++++++++++++++++++++++++++++++++-
 block/quorum.c             | 19 ++++++++-
 block/raw-format.c         |  2 +-
 block/trace-events         |  1 +
 tests/qemu-iotests/060     | 26 ++++++++-----
 tests/qemu-iotests/060.out |  5 ++-
 tests/qemu-iotests/134     |  9 +++++
 tests/qemu-iotests/134.out | 10 +++++
 19 files changed, 220 insertions(+), 36 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags
  2018-12-03 10:14 [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation Anton Nefedov
@ 2018-12-03 10:14 ` Anton Nefedov
  2018-12-05 12:43   ` Vladimir Sementsov-Ogievskiy
                     ` (2 more replies)
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 2/9] blkverify: set " Anton Nefedov
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 10:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto,
	Vladimir Sementsov-Ogievskiy, Anton Nefedov

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/mirror.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 56d9ef7474..56908c9b19 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1528,8 +1528,12 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
         mirror_top_bs->implicit = true;
     }
     mirror_top_bs->total_sectors = bs->total_sectors;
-    mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
-    mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED;
+    mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
+        (BDRV_REQ_FUA & bs->supported_write_flags);
+    mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
+        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
+         & bs->supported_zero_flags);
+
     bs_opaque = g_new0(MirrorBDSOpaque, 1);
     mirror_top_bs->opaque = bs_opaque;
     bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v10 2/9] blkverify: set supported write/zero flags
  2018-12-03 10:14 [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation Anton Nefedov
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags Anton Nefedov
@ 2018-12-03 10:14 ` Anton Nefedov
  2018-12-07 14:32   ` Alberto Garcia
  2018-12-12 12:26   ` Vladimir Sementsov-Ogievskiy
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags Anton Nefedov
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 10:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto,
	Vladimir Sementsov-Ogievskiy, Anton Nefedov

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/blkverify.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/blkverify.c b/block/blkverify.c
index 89bf4386e3..bb52596cbb 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -141,8 +141,14 @@ static int blkverify_open(BlockDriverState *bs, QDict *options, int flags,
         goto fail;
     }
 
-    bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
-    bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED;
+    bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
+        (BDRV_REQ_FUA &
+         bs->file->bs->supported_write_flags &
+         s->test_file->bs->supported_write_flags);
+    bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
+        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+         bs->file->bs->supported_zero_flags &
+         s->test_file->bs->supported_zero_flags);
 
     ret = 0;
 fail:
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags
  2018-12-03 10:14 [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation Anton Nefedov
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags Anton Nefedov
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 2/9] blkverify: set " Anton Nefedov
@ 2018-12-03 10:14 ` Anton Nefedov
  2018-12-07 14:33   ` Alberto Garcia
  2018-12-12 12:33   ` Vladimir Sementsov-Ogievskiy
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 4/9] block: introduce BDRV_REQ_ALLOCATE flag Anton Nefedov
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 10:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto,
	Vladimir Sementsov-Ogievskiy, Anton Nefedov

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/quorum.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 16b3c8067c..d21a6a3b8e 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -857,6 +857,19 @@ static QemuOptsList quorum_runtime_opts = {
     },
 };
 
+static void quorum_set_supported_flags(BlockDriverState *bs)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int i;
+
+    bs->supported_write_flags = BDRV_REQ_FUA;
+    for (i = 0; i < s->num_children; i++) {
+        bs->supported_write_flags &= s->children[i]->bs->supported_write_flags;
+    }
+
+    bs->supported_write_flags |= BDRV_REQ_WRITE_UNCHANGED;
+}
+
 static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
                        Error **errp)
 {
@@ -950,7 +963,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     }
     s->next_child_index = s->num_children;
 
-    bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
+    quorum_set_supported_flags(bs);
 
     g_free(opened);
     goto exit;
@@ -1025,6 +1038,8 @@ static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
     s->children = g_renew(BdrvChild *, s->children, s->num_children + 1);
     s->children[s->num_children++] = child;
 
+    quorum_set_supported_flags(bs);
+
 out:
     bdrv_drained_end(bs);
 }
@@ -1063,6 +1078,8 @@ static void quorum_del_child(BlockDriverState *bs, BdrvChild *child,
     bdrv_unref_child(bs, child);
 
     bdrv_drained_end(bs);
+
+    quorum_set_supported_flags(bs);
 }
 
 static void quorum_refresh_filename(BlockDriverState *bs, QDict *options)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v10 4/9] block: introduce BDRV_REQ_ALLOCATE flag
  2018-12-03 10:14 [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation Anton Nefedov
                   ` (2 preceding siblings ...)
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags Anton Nefedov
@ 2018-12-03 10:14 ` Anton Nefedov
  2018-12-05 12:59   ` Vladimir Sementsov-Ogievskiy
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising Anton Nefedov
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 10:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto,
	Vladimir Sementsov-Ogievskiy, Anton Nefedov

The flag is supposed to indicate that the region of the disk image has
to be sufficiently allocated so it reads as zeroes.

The call with the flag set must return -ENOTSUP if allocation cannot
be done efficiently.
This has to be made sure of by both
  - the drivers that support the flag
  - and the common block layer (so it will not fall back to any slowpath
    (like writing zero buffers) in case the driver does not support
    the flag).

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
---
 include/block/block.h     |  9 ++++++++-
 include/block/block_int.h |  2 +-
 block/io.c                | 18 ++++++++++++++++--
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 7f5453b45b..f571082415 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -83,8 +83,15 @@ typedef enum {
      */
     BDRV_REQ_SERIALISING        = 0x80,
 
+    /* The BDRV_REQ_ALLOCATE flag is used to indicate that the driver has to
+     * efficiently allocate the space so it reads as zeroes, or return an error.
+     * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
+     * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
+     */
+    BDRV_REQ_ALLOCATE           = 0x100,
+
     /* Mask of valid flags */
-    BDRV_REQ_MASK               = 0xff,
+    BDRV_REQ_MASK               = 0x1ff,
 } BdrvRequestFlags;
 
 typedef struct BlockSizes {
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f605622216..ff84c5d8aa 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -724,7 +724,7 @@ struct BlockDriverState {
      * their children. */
     unsigned int supported_write_flags;
     /* Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
-     * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED) */
+     * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED, BDRV_REQ_ALLOCATE) */
     unsigned int supported_zero_flags;
 
     /* the following member gives a name to every node on the bs graph. */
diff --git a/block/io.c b/block/io.c
index bd9d688f8b..d9d7644858 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1534,7 +1534,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
             assert(!bs->supported_zero_flags);
         }
 
-        if (ret == -ENOTSUP) {
+        if (ret == -ENOTSUP && !(flags & BDRV_REQ_ALLOCATE)) {
             /* Fall back to bounce buffer if write zeroes is unsupported */
             BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
 
@@ -1702,7 +1702,9 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
         !(flags & BDRV_REQ_ZERO_WRITE) && drv->bdrv_co_pwrite_zeroes &&
         qemu_iovec_is_zero(qiov)) {
         flags |= BDRV_REQ_ZERO_WRITE;
-        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
+        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP &&
+            !(flags & BDRV_REQ_ALLOCATE))
+        {
             flags |= BDRV_REQ_MAY_UNMAP;
         }
     }
@@ -1773,6 +1775,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
 
     assert(flags & BDRV_REQ_ZERO_WRITE);
     if (head_padding_bytes || tail_padding_bytes) {
+        if (flags & BDRV_REQ_ALLOCATE) {
+            return -ENOTSUP;
+        }
         buf = qemu_blockalign(bs, align);
         iov = (struct iovec) {
             .iov_base   = buf,
@@ -1858,6 +1863,9 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
     bool use_local_qiov = false;
     int ret;
 
+    assert(!((flags & BDRV_REQ_ALLOCATE) && (flags & BDRV_REQ_MAY_UNMAP)));
+    assert(!((flags & BDRV_REQ_ALLOCATE) && !(flags & BDRV_REQ_ZERO_WRITE)));
+
     trace_bdrv_co_pwritev(child->bs, offset, bytes, flags);
 
     if (!bs->drv) {
@@ -1980,6 +1988,12 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
 {
     trace_bdrv_co_pwrite_zeroes(child->bs, offset, bytes, flags);
 
+    if ((flags & BDRV_REQ_ALLOCATE) &&
+        !(child->bs->supported_zero_flags & BDRV_REQ_ALLOCATE))
+    {
+        return -ENOTSUP;
+    }
+
     if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
         flags &= ~BDRV_REQ_MAY_UNMAP;
     }
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising
  2018-12-03 10:14 [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation Anton Nefedov
                   ` (3 preceding siblings ...)
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 4/9] block: introduce BDRV_REQ_ALLOCATE flag Anton Nefedov
@ 2018-12-03 10:14 ` Anton Nefedov
  2018-12-05 13:14   ` Vladimir Sementsov-Ogievskiy
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE Anton Nefedov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 10:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto,
	Vladimir Sementsov-Ogievskiy, Anton Nefedov

The idea is that ALLOCATE requests may overlap with other requests.
Reuse the existing block layer infrastructure for serialising requests.
Use the following approach:
  - mark ALLOCATE also SERIALISING, so subsequent requests to the area wait
  - ALLOCATE request itself must never wait if another request is in flight
    already. Return EAGAIN, let the caller reconsider.

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/io.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/block/io.c b/block/io.c
index d9d7644858..6ff946f63d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -720,12 +720,13 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
     bdrv_wakeup(bs);
 }
 
-static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+static bool coroutine_fn find_or_wait_serialising_requests(
+    BdrvTrackedRequest *self, bool wait)
 {
     BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
     bool retry;
-    bool waited = false;
+    bool found = false;
 
     if (!atomic_read(&bs->serialising_in_flight)) {
         return false;
@@ -751,11 +752,14 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
                  * will wait for us as soon as it wakes up, then just go on
                  * (instead of producing a deadlock in the former case). */
                 if (!req->waiting_for) {
+                    found = true;
+                    if (!wait) {
+                        break;
+                    }
                     self->waiting_for = req;
                     qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
                     self->waiting_for = NULL;
                     retry = true;
-                    waited = true;
                     break;
                 }
             }
@@ -763,7 +767,12 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
         qemu_co_mutex_unlock(&bs->reqs_lock);
     } while (retry);
 
-    return waited;
+    return found;
+}
+
+static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+{
+    return find_or_wait_serialising_requests(self, true);
 }
 
 static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
@@ -1585,7 +1594,7 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
                           BdrvTrackedRequest *req, int flags)
 {
     BlockDriverState *bs = child->bs;
-    bool waited;
+    bool found;
     int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
 
     if (bs->read_only) {
@@ -1602,9 +1611,13 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
         mark_request_serialising(req, bdrv_get_cluster_size(bs));
     }
 
-    waited = wait_serialising_requests(req);
+    found = find_or_wait_serialising_requests(req,
+                                              !(flags & BDRV_REQ_ALLOCATE));
+    if (found && (flags & BDRV_REQ_ALLOCATE)) {
+        return -EAGAIN;
+    }
 
-    assert(!waited || !req->serialising ||
+    assert(!found || !req->serialising ||
            is_request_serialising_and_aligned(req));
     assert(req->overlap_offset <= offset);
     assert(offset + bytes <= req->overlap_offset + req->overlap_bytes);
@@ -1866,6 +1879,10 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
     assert(!((flags & BDRV_REQ_ALLOCATE) && (flags & BDRV_REQ_MAY_UNMAP)));
     assert(!((flags & BDRV_REQ_ALLOCATE) && !(flags & BDRV_REQ_ZERO_WRITE)));
 
+    if (flags & BDRV_REQ_ALLOCATE) {
+        flags |= BDRV_REQ_SERIALISING;
+    }
+
     trace_bdrv_co_pwritev(child->bs, offset, bytes, flags);
 
     if (!bs->drv) {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE
  2018-12-03 10:14 [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation Anton Nefedov
                   ` (4 preceding siblings ...)
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising Anton Nefedov
@ 2018-12-03 10:14 ` Anton Nefedov
  2018-12-05 13:25   ` Vladimir Sementsov-Ogievskiy
  2018-12-07 15:09   ` Alberto Garcia
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 7/9] block: support BDRV_REQ_ALLOCATE in passthrough drivers Anton Nefedov
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 10:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto,
	Vladimir Sementsov-Ogievskiy, Anton Nefedov

Current write_zeroes implementation is good enough to satisfy this flag too

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/file-posix.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 07bbdab953..b0b7ab0159 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -603,6 +603,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
         } else {
             s->discard_zeroes = true;
             s->has_fallocate = true;
+            bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
         }
     } else {
         if (!(S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode))) {
@@ -646,10 +647,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 #ifdef CONFIG_XFS
     if (platform_test_xfs_fd(s->fd)) {
         s->is_xfs = true;
+        bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
     }
 #endif
 
-    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
+    bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
     ret = 0;
 fail:
     if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
@@ -1520,6 +1522,10 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
         }
         s->has_fallocate = false;
     }
+
+    if (!s->has_fallocate) {
+        aiocb->bs->supported_zero_flags &= ~BDRV_REQ_ALLOCATE;
+    }
 #endif
 
     return -ENOTSUP;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas
  2018-12-03 10:14 [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation Anton Nefedov
                   ` (6 preceding siblings ...)
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 7/9] block: support BDRV_REQ_ALLOCATE in passthrough drivers Anton Nefedov
@ 2018-12-03 10:14 ` Anton Nefedov
  2018-12-03 13:59   ` Alberto Garcia
                     ` (3 more replies)
  2018-12-03 10:15 ` [Qemu-devel] [PATCH v10 9/9] iotest 134: test cluster-misaligned encrypted write Anton Nefedov
  8 siblings, 4 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 10:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto,
	Vladimir Sementsov-Ogievskiy, Anton Nefedov

If COW areas of the newly allocated clusters are zeroes on the backing image,
efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
cluster instead of writing explicit zero buffers later in perform_cow().

iotest 060:
write to the discarded cluster does not trigger COW anymore.
Use a backing image instead.

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
---
 qapi/block-core.json       |  4 +-
 block/qcow2.h              |  6 +++
 block/qcow2-cluster.c      |  2 +-
 block/qcow2.c              | 80 +++++++++++++++++++++++++++++++++++++-
 block/trace-events         |  1 +
 tests/qemu-iotests/060     | 26 ++++++++-----
 tests/qemu-iotests/060.out |  5 ++-
 7 files changed, 109 insertions(+), 15 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index d4fe710836..50598aa8fe 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3004,6 +3004,8 @@
 #
 # @cor_write: a write due to copy-on-read (since 2.11)
 #
+# @cluster_alloc_space: an allocation of file space for a cluster (since 4.0)
+#
 # Since: 2.9
 ##
 { 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
@@ -3022,7 +3024,7 @@
             'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
             'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
             'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
-            'cor_write'] }
+            'cor_write', 'cluster_alloc_space'] }
 
 ##
 # @BlkdebugInjectErrorOptions:
diff --git a/block/qcow2.h b/block/qcow2.h
index 8662b68575..8a64077897 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -389,6 +389,12 @@ typedef struct QCowL2Meta
      */
     Qcow2COWRegion cow_end;
 
+    /**
+     * Indicates that COW regions are already handled and do not require
+     * any more processing.
+     */
+    bool skip_cow;
+
     /**
      * The I/O vector with the data from the actual guest write request.
      * If non-NULL, this is meant to be merged together with the data
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index d37fe08b3d..3685c5f67e 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -806,7 +806,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     assert(start->offset + start->nb_bytes <= end->offset);
     assert(!m->data_qiov || m->data_qiov->size == data_bytes);
 
-    if (start->nb_bytes == 0 && end->nb_bytes == 0) {
+    if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
         return 0;
     }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 991d6ac91b..027188a1a3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2015,6 +2015,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
             continue;
         }
 
+        /* If COW regions are handled already, skip this too */
+        if (m->skip_cow) {
+            continue;
+        }
+
         /* The data (middle) region must be immediately after the
          * start region */
         if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
@@ -2040,6 +2045,68 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
     return false;
 }
 
+static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+    int64_t nr;
+    return !bytes ||
+        (!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
+}
+
+static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
+{
+    /* This check is designed for optimization shortcut so it must be
+     * efficient.
+     * Instead of is_zero(), use is_unallocated() as it is faster (but not
+     * as accurate and can result in false negatives). */
+    return is_unallocated(bs, m->offset + m->cow_start.offset,
+                          m->cow_start.nb_bytes) &&
+           is_unallocated(bs, m->offset + m->cow_end.offset,
+                          m->cow_end.nb_bytes);
+}
+
+static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
+{
+    BDRVQcow2State *s = bs->opaque;
+    QCowL2Meta *m;
+
+    if (!(bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE)) {
+        return 0;
+    }
+
+    if (bs->encrypted) {
+        return 0;
+    }
+
+    for (m = l2meta; m != NULL; m = m->next) {
+        int ret;
+
+        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
+            continue;
+        }
+
+        if (!is_zero_cow(bs, m)) {
+            continue;
+        }
+
+        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
+        /* instead of writing zero COW buffers,
+           efficiently zero out the whole clusters */
+        ret = bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset,
+                                    m->nb_clusters * s->cluster_size,
+                                    BDRV_REQ_ALLOCATE);
+        if (ret < 0) {
+            if (ret != -ENOTSUP && ret != -EAGAIN) {
+                return ret;
+            }
+            continue;
+        }
+
+        trace_qcow2_skip_cow(qemu_coroutine_self(), m->offset, m->nb_clusters);
+        m->skip_cow = true;
+    }
+    return 0;
+}
+
 static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -2122,24 +2189,33 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
             goto fail;
         }
 
+        qemu_co_mutex_unlock(&s->lock);
+
+        ret = handle_alloc_space(bs, l2meta);
+        if (ret < 0) {
+            qemu_co_mutex_lock(&s->lock);
+            goto fail;
+        }
+
         /* If we need to do COW, check if it's possible to merge the
          * writing of the guest data together with that of the COW regions.
          * If it's not possible (or not necessary) then write the
          * guest data now. */
         if (!merge_cow(offset, cur_bytes, &hd_qiov, l2meta)) {
-            qemu_co_mutex_unlock(&s->lock);
             BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
             trace_qcow2_writev_data(qemu_coroutine_self(),
                                     cluster_offset + offset_in_cluster);
             ret = bdrv_co_pwritev(bs->file,
                                   cluster_offset + offset_in_cluster,
                                   cur_bytes, &hd_qiov, 0);
-            qemu_co_mutex_lock(&s->lock);
             if (ret < 0) {
+                qemu_co_mutex_lock(&s->lock);
                 goto fail;
             }
         }
 
+        qemu_co_mutex_lock(&s->lock);
+
         ret = qcow2_handle_l2meta(bs, &l2meta, true);
         if (ret) {
             goto fail;
diff --git a/block/trace-events b/block/trace-events
index 3e8c47bb24..f3f6d66dcc 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -69,6 +69,7 @@ qcow2_writev_done_part(void *co, int cur_bytes) "co %p cur_bytes %d"
 qcow2_writev_data(void *co, uint64_t offset) "co %p offset 0x%" PRIx64
 qcow2_pwrite_zeroes_start_req(void *co, int64_t offset, int count) "co %p offset 0x%" PRIx64 " count %d"
 qcow2_pwrite_zeroes(void *co, int64_t offset, int count) "co %p offset 0x%" PRIx64 " count %d"
+qcow2_skip_cow(void *co, uint64_t offset, int nb_clusters) "co %p offset 0x%" PRIx64 " nb_clusters %d"
 
 # block/qcow2-cluster.c
 qcow2_alloc_clusters_offset(void *co, uint64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index af0588ae9a..089df94657 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -150,27 +150,33 @@ $QEMU_IO -c "$OPEN_RO" -c "read -P 1 0 512" | _filter_qemu_io
 echo
 echo "=== Testing overlap while COW is in flight ==="
 echo
+BACKING_IMG=$TEST_IMG.base
+TEST_IMG=$BACKING_IMG _make_test_img 1G
+
+$QEMU_IO -c 'write 64k 64k' "$BACKING_IMG" | _filter_qemu_io
+
 # compat=0.10 is required in order to make the following discard actually
-# unallocate the sector rather than make it a zero sector - we want COW, after
-# all.
-IMGOPTS='compat=0.10' _make_test_img 1G
+# unallocate the sector rather than make it a zero sector as we would like
+# to reuse it for another guest offset
+IMGOPTS='compat=0.10' _make_test_img -b "$BACKING_IMG" 1G
 # Write two clusters, the second one enforces creation of an L2 table after
 # the first data cluster.
 $QEMU_IO -c 'write 0k 64k' -c 'write 512M 64k' "$TEST_IMG" | _filter_qemu_io
-# Discard the first cluster. This cluster will soon enough be reallocated and
-# used for COW.
+# Discard the first cluster. This cluster will soon enough be reallocated
 $QEMU_IO -c 'discard 0k 64k' "$TEST_IMG" | _filter_qemu_io
 # Now, corrupt the image by marking the second L2 table cluster as free.
 poke_file "$TEST_IMG" '131084' "\x00\x00" # 0x2000c
-# Start a write operation requiring COW on the image stopping it right before
-# doing the read; then, trigger the corruption prevention by writing anything to
-# any unallocated cluster, leading to an attempt to overwrite the second L2
+# Start a write operation requiring COW on the image;
+# this write will reuse the host offset released by a previous discard.
+# Stop it right before doing the read.
+# Then, trigger the corruption prevention by writing anything to
+# another unallocated cluster, leading to an attempt to overwrite the second L2
 # table. Finally, resume the COW write and see it fail (but not crash).
 echo "open -o file.driver=blkdebug $TEST_IMG
 break cow_read 0
-aio_write 0k 1k
+aio_write 64k 1k
 wait_break 0
-write 64k 64k
+write 128k 64k
 resume 0" | $QEMU_IO | _filter_qemu_io
 
 echo
diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index d67c6234a4..ec8f15d2f0 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -97,7 +97,10 @@ read 512/512 bytes at offset 0
 
 === Testing overlap while COW is in flight ===
 
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=1073741824
+wrote 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 backing_file=TEST_DIR/t.IMGFMT.base
 wrote 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 65536/65536 bytes at offset 536870912
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v10 7/9] block: support BDRV_REQ_ALLOCATE in passthrough drivers
  2018-12-03 10:14 [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation Anton Nefedov
                   ` (5 preceding siblings ...)
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE Anton Nefedov
@ 2018-12-03 10:14 ` Anton Nefedov
  2018-12-05 13:28   ` Vladimir Sementsov-Ogievskiy
  2018-12-07 15:00   ` Alberto Garcia
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
  2018-12-03 10:15 ` [Qemu-devel] [PATCH v10 9/9] iotest 134: test cluster-misaligned encrypted write Anton Nefedov
  8 siblings, 2 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 10:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto,
	Vladimir Sementsov-Ogievskiy, Anton Nefedov

Support the flag if the underlying BDS supports it

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/blkdebug.c     | 2 +-
 block/blkverify.c    | 2 +-
 block/copy-on-read.c | 4 ++--
 block/mirror.c       | 2 +-
 block/raw-format.c   | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 0759452925..f0fc2ec276 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -401,7 +401,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
     bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
         (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
     bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
             bs->file->bs->supported_zero_flags);
     ret = -EINVAL;
 
diff --git a/block/blkverify.c b/block/blkverify.c
index bb52596cbb..9cb4f94b68 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -146,7 +146,7 @@ static int blkverify_open(BlockDriverState *bs, QDict *options, int flags,
          bs->file->bs->supported_write_flags &
          s->test_file->bs->supported_write_flags);
     bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
          bs->file->bs->supported_zero_flags &
          s->test_file->bs->supported_zero_flags);
 
diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 64dcc424b5..1eb993699a 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -38,8 +38,8 @@ static int cor_open(BlockDriverState *bs, QDict *options, int flags,
                                     bs->file->bs->supported_write_flags);
 
     bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-                               ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
-                                    bs->file->bs->supported_zero_flags);
+        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
+         bs->file->bs->supported_zero_flags);
 
     return 0;
 }
diff --git a/block/mirror.c b/block/mirror.c
index 56908c9b19..9d836a6bd3 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1531,7 +1531,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
     mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
         (BDRV_REQ_FUA & bs->supported_write_flags);
     mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
+        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE)
          & bs->supported_zero_flags);
 
     bs_opaque = g_new0(MirrorBDSOpaque, 1);
diff --git a/block/raw-format.c b/block/raw-format.c
index 6f6dc99b2c..ad7453dc83 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -432,7 +432,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
     bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
         (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
     bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
             bs->file->bs->supported_zero_flags);
 
     if (bs->probed && !bdrv_is_read_only(bs)) {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v10 9/9] iotest 134: test cluster-misaligned encrypted write
  2018-12-03 10:14 [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation Anton Nefedov
                   ` (7 preceding siblings ...)
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
@ 2018-12-03 10:15 ` Anton Nefedov
  8 siblings, 0 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 10:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto,
	Vladimir Sementsov-Ogievskiy, Anton Nefedov

COW (even empty/zero) areas require encryption too

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
---
 tests/qemu-iotests/134     |  9 +++++++++
 tests/qemu-iotests/134.out | 10 ++++++++++
 2 files changed, 19 insertions(+)

diff --git a/tests/qemu-iotests/134 b/tests/qemu-iotests/134
index cacabcd28b..792c8ca12f 100755
--- a/tests/qemu-iotests/134
+++ b/tests/qemu-iotests/134
@@ -57,6 +57,15 @@ echo
 echo "== reading whole image =="
 $QEMU_IO --object $SECRET -c "read 0 $size" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
 
+echo
+echo "== rewriting cluster part =="
+$QEMU_IO --object $SECRET -c "write -P 0xb 512 512" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+
+echo
+echo "== verify pattern =="
+$QEMU_IO --object $SECRET -c "read -P 0 0 512"  --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+$QEMU_IO --object $SECRET -c "read -P 0xb 512 512"  --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+
 echo
 echo "== rewriting whole image =="
 $QEMU_IO --object $SECRET -c "write -P 0xa 0 $size" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
diff --git a/tests/qemu-iotests/134.out b/tests/qemu-iotests/134.out
index 972be49d91..09d46f6b17 100644
--- a/tests/qemu-iotests/134.out
+++ b/tests/qemu-iotests/134.out
@@ -5,6 +5,16 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 encryption=on encrypt.
 read 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+== rewriting cluster part ==
+wrote 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== verify pattern ==
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 == rewriting whole image ==
 wrote 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
@ 2018-12-03 13:59   ` Alberto Garcia
  2018-12-03 14:04     ` Anton Nefedov
  2018-12-05 14:01   ` Vladimir Sementsov-Ogievskiy
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Alberto Garcia @ 2018-12-03 13:59 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev,
	Vladimir Sementsov-Ogievskiy

On Mon 03 Dec 2018 11:14:59 AM CET, Anton Nefedov wrote:
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -3004,6 +3004,8 @@
>  #
>  # @cor_write: a write due to copy-on-read (since 2.11)
>  #
> +# @cluster_alloc_space: an allocation of file space for a cluster (since 4.0)

Since 4.0 ??

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas
  2018-12-03 13:59   ` Alberto Garcia
@ 2018-12-03 14:04     ` Anton Nefedov
  0 siblings, 0 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-03 14:04 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev,
	Vladimir Sementsov-Ogievskiy



On 3/12/2018 4:59 PM, Alberto Garcia wrote:
> On Mon 03 Dec 2018 11:14:59 AM CET, Anton Nefedov wrote:
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -3004,6 +3004,8 @@
>>   #
>>   # @cor_write: a write due to copy-on-read (since 2.11)
>>   #
>> +# @cluster_alloc_space: an allocation of file space for a cluster (since 4.0)
> 
> Since 4.0 ??
> 
> Berto
> 

I heard that 3.1 will be followed by 4.0

/Anton

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags Anton Nefedov
@ 2018-12-05 12:43   ` Vladimir Sementsov-Ogievskiy
  2018-12-05 13:27     ` Anton Nefedov
  2018-12-07 14:31   ` Alberto Garcia
  2018-12-12 12:15   ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-05 12:43 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

Could you please write, what is the behavior change and why here?

Is it a bug, that FUA was not inherited before?

03.12.2018 13:14, Anton Nefedov wrote:
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>   block/mirror.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 56d9ef7474..56908c9b19 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -1528,8 +1528,12 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
>           mirror_top_bs->implicit = true;
>       }
>       mirror_top_bs->total_sectors = bs->total_sectors;
> -    mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
> -    mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED;
> +    mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
> +        (BDRV_REQ_FUA & bs->supported_write_flags);
> +    mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
> +         & bs->supported_zero_flags);
> +
>       bs_opaque = g_new0(MirrorBDSOpaque, 1);
>       mirror_top_bs->opaque = bs_opaque;
>       bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 4/9] block: introduce BDRV_REQ_ALLOCATE flag
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 4/9] block: introduce BDRV_REQ_ALLOCATE flag Anton Nefedov
@ 2018-12-05 12:59   ` Vladimir Sementsov-Ogievskiy
  2018-12-05 13:38     ` Anton Nefedov
  0 siblings, 1 reply; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-05 12:59 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

03.12.2018 13:14, Anton Nefedov wrote:
> The flag is supposed to indicate that the region of the disk image has
> to be sufficiently allocated so it reads as zeroes.
> 
> The call with the flag set must return -ENOTSUP if allocation cannot
> be done efficiently.
> This has to be made sure of by both
>    - the drivers that support the flag
>    - and the common block layer (so it will not fall back to any slowpath
>      (like writing zero buffers) in case the driver does not support
>      the flag).
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> Reviewed-by: Alberto Garcia <berto@igalia.com>
> ---
>   include/block/block.h     |  9 ++++++++-
>   include/block/block_int.h |  2 +-
>   block/io.c                | 18 ++++++++++++++++--
>   3 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index 7f5453b45b..f571082415 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -83,8 +83,15 @@ typedef enum {
>        */
>       BDRV_REQ_SERIALISING        = 0x80,
>   
> +    /* The BDRV_REQ_ALLOCATE flag is used to indicate that the driver has to
> +     * efficiently allocate the space so it reads as zeroes, or return an error.
> +     * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
> +     * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.

and, may be, it can't be set with FUA too?

> +     */
> +    BDRV_REQ_ALLOCATE           = 0x100,
> +
>       /* Mask of valid flags */
> -    BDRV_REQ_MASK               = 0xff,
> +    BDRV_REQ_MASK               = 0x1ff,
>   } BdrvRequestFlags;
>   
>   typedef struct BlockSizes {
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index f605622216..ff84c5d8aa 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -724,7 +724,7 @@ struct BlockDriverState {
>        * their children. */
>       unsigned int supported_write_flags;
>       /* Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
> -     * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED) */
> +     * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED, BDRV_REQ_ALLOCATE) */
>       unsigned int supported_zero_flags;
>   
>       /* the following member gives a name to every node on the bs graph. */
> diff --git a/block/io.c b/block/io.c
> index bd9d688f8b..d9d7644858 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1534,7 +1534,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>               assert(!bs->supported_zero_flags);
>           }
>   
> -        if (ret == -ENOTSUP) {
> +        if (ret == -ENOTSUP && !(flags & BDRV_REQ_ALLOCATE)) {
>               /* Fall back to bounce buffer if write zeroes is unsupported */
>               BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>   
> @@ -1702,7 +1702,9 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
>           !(flags & BDRV_REQ_ZERO_WRITE) && drv->bdrv_co_pwrite_zeroes &&
>           qemu_iovec_is_zero(qiov)) {
>           flags |= BDRV_REQ_ZERO_WRITE;
> -        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
> +        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP &&
> +            !(flags & BDRV_REQ_ALLOCATE))


dead check. we are in if (!(flags & BDRV_REQ_ZERO_WRITE)), so (flags & BDRV_REQ_ALLOCATE) must be zero as well.

> +        {
>               flags |= BDRV_REQ_MAY_UNMAP;
>           }
>       }
> @@ -1773,6 +1775,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
>   
>       assert(flags & BDRV_REQ_ZERO_WRITE);
>       if (head_padding_bytes || tail_padding_bytes) {
> +        if (flags & BDRV_REQ_ALLOCATE) {
> +            return -ENOTSUP;
> +        }
>           buf = qemu_blockalign(bs, align);
>           iov = (struct iovec) {
>               .iov_base   = buf,
> @@ -1858,6 +1863,9 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
>       bool use_local_qiov = false;
>       int ret;
>   
> +    assert(!((flags & BDRV_REQ_ALLOCATE) && (flags & BDRV_REQ_MAY_UNMAP)));
> +    assert(!((flags & BDRV_REQ_ALLOCATE) && !(flags & BDRV_REQ_ZERO_WRITE)));

what about FUA?

> +
>       trace_bdrv_co_pwritev(child->bs, offset, bytes, flags);
>   
>       if (!bs->drv) {
> @@ -1980,6 +1988,12 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
>   {
>       trace_bdrv_co_pwrite_zeroes(child->bs, offset, bytes, flags);
>   
> +    if ((flags & BDRV_REQ_ALLOCATE) &&
> +        !(child->bs->supported_zero_flags & BDRV_REQ_ALLOCATE))
> +    {
> +        return -ENOTSUP;
> +    }
> +
>       if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
>           flags &= ~BDRV_REQ_MAY_UNMAP;
>       }
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising Anton Nefedov
@ 2018-12-05 13:14   ` Vladimir Sementsov-Ogievskiy
  2018-12-05 14:01     ` Anton Nefedov
  0 siblings, 1 reply; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-05 13:14 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

03.12.2018 13:14, Anton Nefedov wrote:
> The idea is that ALLOCATE requests may overlap with other requests.

please, describe why

> Reuse the existing block layer infrastructure for serialising requests.
> Use the following approach:
>    - mark ALLOCATE also SERIALISING, so subsequent requests to the area wait
>    - ALLOCATE request itself must never wait if another request is in flight
>      already. Return EAGAIN, let the caller reconsider.
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>   block/io.c | 31 ++++++++++++++++++++++++-------
>   1 file changed, 24 insertions(+), 7 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index d9d7644858..6ff946f63d 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -720,12 +720,13 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
>       bdrv_wakeup(bs);
>   }
>   
> -static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
> +static bool coroutine_fn find_or_wait_serialising_requests(
> +    BdrvTrackedRequest *self, bool wait)
>   {
>       BlockDriverState *bs = self->bs;
>       BdrvTrackedRequest *req;
>       bool retry;
> -    bool waited = false;
> +    bool found = false;
>   
>       if (!atomic_read(&bs->serialising_in_flight)) {
>           return false;
> @@ -751,11 +752,14 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>                    * will wait for us as soon as it wakes up, then just go on
>                    * (instead of producing a deadlock in the former case). */
>                   if (!req->waiting_for) {
> +                    found = true;
> +                    if (!wait) {
> +                        break;
> +                    }
>                       self->waiting_for = req;
>                       qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
>                       self->waiting_for = NULL;
>                       retry = true;
> -                    waited = true;
>                       break;
>                   }
>               }
> @@ -763,7 +767,12 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>           qemu_co_mutex_unlock(&bs->reqs_lock);
>       } while (retry);
>   
> -    return waited;
> +    return found;
> +}
> +
> +static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
> +{
> +    return find_or_wait_serialising_requests(self, true);
>   }
>   
>   static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
> @@ -1585,7 +1594,7 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
>                             BdrvTrackedRequest *req, int flags)
>   {
>       BlockDriverState *bs = child->bs;
> -    bool waited;
> +    bool found;
>       int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
>   
>       if (bs->read_only) {
> @@ -1602,9 +1611,13 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
>           mark_request_serialising(req, bdrv_get_cluster_size(bs));
>       }
>   
> -    waited = wait_serialising_requests(req);
> +    found = find_or_wait_serialising_requests(req,
> +                                              !(flags & BDRV_REQ_ALLOCATE));
> +    if (found && (flags & BDRV_REQ_ALLOCATE)) {
> +        return -EAGAIN;
> +    }
>   
> -    assert(!waited || !req->serialising ||
> +    assert(!found || !req->serialising ||
>              is_request_serialising_and_aligned(req));
>       assert(req->overlap_offset <= offset);
>       assert(offset + bytes <= req->overlap_offset + req->overlap_bytes);
> @@ -1866,6 +1879,10 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
>       assert(!((flags & BDRV_REQ_ALLOCATE) && (flags & BDRV_REQ_MAY_UNMAP)));
>       assert(!((flags & BDRV_REQ_ALLOCATE) && !(flags & BDRV_REQ_ZERO_WRITE)));
>   
> +    if (flags & BDRV_REQ_ALLOCATE) {
> +        flags |= BDRV_REQ_SERIALISING;
> +    }
> +
>       trace_bdrv_co_pwritev(child->bs, offset, bytes, flags);
>   
>       if (!bs->drv) {
> 

patch looks correct technically, I just don't know the reasoning..

the only thing, is that it would be good to add a comment in BDRV flags definition section, that _ALLOCATE implies _SERIALIZE.


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE Anton Nefedov
@ 2018-12-05 13:25   ` Vladimir Sementsov-Ogievskiy
  2018-12-05 14:11     ` Anton Nefedov
  2018-12-07 15:09   ` Alberto Garcia
  1 sibling, 1 reply; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-05 13:25 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

03.12.2018 13:14, Anton Nefedov wrote:
> Current write_zeroes implementation is good enough to satisfy this flag too
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>   block/file-posix.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 07bbdab953..b0b7ab0159 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -603,6 +603,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>           } else {
>               s->discard_zeroes = true;
>               s->has_fallocate = true;
> +            bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
>           }
>       } else {
>           if (!(S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode))) {
> @@ -646,10 +647,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>   #ifdef CONFIG_XFS
>       if (platform_test_xfs_fd(s->fd)) {
>           s->is_xfs = true;
> +        bs->supported_zero_flags = BDRV_REQ_ALLOCATE;

why we should handle xfs separately? is there a case with xfs, not handled by previous generic if ()?

>       }
>   #endif
>   
> -    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> +    bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
>       ret = 0;
>   fail:
>       if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
> @@ -1520,6 +1522,10 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>           }
>           s->has_fallocate = false;
>       }
> +
> +    if (!s->has_fallocate) {
> +        aiocb->bs->supported_zero_flags &= ~BDRV_REQ_ALLOCATE;
> +    }
>   #endif
>   
>       return -ENOTSUP;
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags
  2018-12-05 12:43   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-05 13:27     ` Anton Nefedov
  0 siblings, 0 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-05 13:27 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

On 5/12/2018 3:43 PM, Vladimir Sementsov-Ogievskiy wrote:
> Could you please write, what is the behavior change and why here?
> 

The idea is that passthrough drivers should report the flags if there
are no obstacles to support it.
Technically, these changes are not connected to this series, but since
BDRV_REQ_ALLOCATE flag is added, so we might want to expose it where
possible.

> Is it a bug, that FUA was not inherited before?
> 

I don't think it's a bug really since there is a fallback path in
block/io.c.

> 03.12.2018 13:14, Anton Nefedov wrote:
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>    block/mirror.c | 8 ++++++--
>>    1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 56d9ef7474..56908c9b19 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -1528,8 +1528,12 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
>>            mirror_top_bs->implicit = true;
>>        }
>>        mirror_top_bs->total_sectors = bs->total_sectors;
>> -    mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
>> -    mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED;
>> +    mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
>> +        (BDRV_REQ_FUA & bs->supported_write_flags);
>> +    mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
>> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
>> +         & bs->supported_zero_flags);
>> +
>>        bs_opaque = g_new0(MirrorBDSOpaque, 1);
>>        mirror_top_bs->opaque = bs_opaque;
>>        bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
>>
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 7/9] block: support BDRV_REQ_ALLOCATE in passthrough drivers
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 7/9] block: support BDRV_REQ_ALLOCATE in passthrough drivers Anton Nefedov
@ 2018-12-05 13:28   ` Vladimir Sementsov-Ogievskiy
  2018-12-07 15:00   ` Alberto Garcia
  1 sibling, 0 replies; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-05 13:28 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

03.12.2018 13:14, Anton Nefedov wrote:
> Support the flag if the underlying BDS supports it
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

> ---
>   block/blkdebug.c     | 2 +-
>   block/blkverify.c    | 2 +-
>   block/copy-on-read.c | 4 ++--
>   block/mirror.c       | 2 +-
>   block/raw-format.c   | 2 +-
>   5 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index 0759452925..f0fc2ec276 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -401,7 +401,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
>       bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
>           (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
>       bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
> -        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
>               bs->file->bs->supported_zero_flags);
>       ret = -EINVAL;
>   
> diff --git a/block/blkverify.c b/block/blkverify.c
> index bb52596cbb..9cb4f94b68 100644
> --- a/block/blkverify.c
> +++ b/block/blkverify.c
> @@ -146,7 +146,7 @@ static int blkverify_open(BlockDriverState *bs, QDict *options, int flags,
>            bs->file->bs->supported_write_flags &
>            s->test_file->bs->supported_write_flags);
>       bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
> -        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
>            bs->file->bs->supported_zero_flags &
>            s->test_file->bs->supported_zero_flags);
>   
> diff --git a/block/copy-on-read.c b/block/copy-on-read.c
> index 64dcc424b5..1eb993699a 100644
> --- a/block/copy-on-read.c
> +++ b/block/copy-on-read.c
> @@ -38,8 +38,8 @@ static int cor_open(BlockDriverState *bs, QDict *options, int flags,
>                                       bs->file->bs->supported_write_flags);
>   
>       bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
> -                               ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
> -                                    bs->file->bs->supported_zero_flags);
> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
> +         bs->file->bs->supported_zero_flags);
>   
>       return 0;
>   }
> diff --git a/block/mirror.c b/block/mirror.c
> index 56908c9b19..9d836a6bd3 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -1531,7 +1531,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
>       mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
>           (BDRV_REQ_FUA & bs->supported_write_flags);
>       mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
> -        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE)
>            & bs->supported_zero_flags);
>   
>       bs_opaque = g_new0(MirrorBDSOpaque, 1);
> diff --git a/block/raw-format.c b/block/raw-format.c
> index 6f6dc99b2c..ad7453dc83 100644
> --- a/block/raw-format.c
> +++ b/block/raw-format.c
> @@ -432,7 +432,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
>       bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
>           (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
>       bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
> -        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
>               bs->file->bs->supported_zero_flags);
>   
>       if (bs->probed && !bdrv_is_read_only(bs)) {
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 4/9] block: introduce BDRV_REQ_ALLOCATE flag
  2018-12-05 12:59   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-05 13:38     ` Anton Nefedov
  0 siblings, 0 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-05 13:38 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto



On 5/12/2018 3:59 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> The flag is supposed to indicate that the region of the disk image has
>> to be sufficiently allocated so it reads as zeroes.
>>
>> The call with the flag set must return -ENOTSUP if allocation cannot
>> be done efficiently.
>> This has to be made sure of by both
>>     - the drivers that support the flag
>>     - and the common block layer (so it will not fall back to any slowpath
>>       (like writing zero buffers) in case the driver does not support
>>       the flag).
>>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> Reviewed-by: Alberto Garcia <berto@igalia.com>
>> ---
>>    include/block/block.h     |  9 ++++++++-
>>    include/block/block_int.h |  2 +-
>>    block/io.c                | 18 ++++++++++++++++--
>>    3 files changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 7f5453b45b..f571082415 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -83,8 +83,15 @@ typedef enum {
>>         */
>>        BDRV_REQ_SERIALISING        = 0x80,
>>    
>> +    /* The BDRV_REQ_ALLOCATE flag is used to indicate that the driver has to
>> +     * efficiently allocate the space so it reads as zeroes, or return an error.
>> +     * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
>> +     * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
> 
> and, may be, it can't be set with FUA too?
> 

I don't quite see why it cannot. Even the efficient allocate call
usually implies some Unit Access, it's up to the protocol driver to
decide which exactly.

>> +     */
>> +    BDRV_REQ_ALLOCATE           = 0x100,
>> +
>>        /* Mask of valid flags */
>> -    BDRV_REQ_MASK               = 0xff,
>> +    BDRV_REQ_MASK               = 0x1ff,
>>    } BdrvRequestFlags;
>>    
>>    typedef struct BlockSizes {
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index f605622216..ff84c5d8aa 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -724,7 +724,7 @@ struct BlockDriverState {
>>         * their children. */
>>        unsigned int supported_write_flags;
>>        /* Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
>> -     * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED) */
>> +     * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED, BDRV_REQ_ALLOCATE) */
>>        unsigned int supported_zero_flags;
>>    
>>        /* the following member gives a name to every node on the bs graph. */
>> diff --git a/block/io.c b/block/io.c
>> index bd9d688f8b..d9d7644858 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -1534,7 +1534,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>                assert(!bs->supported_zero_flags);
>>            }
>>    
>> -        if (ret == -ENOTSUP) {
>> +        if (ret == -ENOTSUP && !(flags & BDRV_REQ_ALLOCATE)) {
>>                /* Fall back to bounce buffer if write zeroes is unsupported */
>>                BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>>    
>> @@ -1702,7 +1702,9 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
>>            !(flags & BDRV_REQ_ZERO_WRITE) && drv->bdrv_co_pwrite_zeroes &&
>>            qemu_iovec_is_zero(qiov)) {
>>            flags |= BDRV_REQ_ZERO_WRITE;
>> -        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
>> +        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP &&
>> +            !(flags & BDRV_REQ_ALLOCATE))
> 
> 
> dead check. we are in if (!(flags & BDRV_REQ_ZERO_WRITE)), so (flags & BDRV_REQ_ALLOCATE) must be zero as well.
> 

Agree.

>> +        {
>>                flags |= BDRV_REQ_MAY_UNMAP;
>>            }
>>        }
>> @@ -1773,6 +1775,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
>>    
>>        assert(flags & BDRV_REQ_ZERO_WRITE);
>>        if (head_padding_bytes || tail_padding_bytes) {
>> +        if (flags & BDRV_REQ_ALLOCATE) {
>> +            return -ENOTSUP;
>> +        }
>>            buf = qemu_blockalign(bs, align);
>>            iov = (struct iovec) {
>>                .iov_base   = buf,
>> @@ -1858,6 +1863,9 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
>>        bool use_local_qiov = false;
>>        int ret;
>>    
>> +    assert(!((flags & BDRV_REQ_ALLOCATE) && (flags & BDRV_REQ_MAY_UNMAP)));
>> +    assert(!((flags & BDRV_REQ_ALLOCATE) && !(flags & BDRV_REQ_ZERO_WRITE)));
> 
> what about FUA?
> 

pls see above

>> +
>>        trace_bdrv_co_pwritev(child->bs, offset, bytes, flags);
>>    
>>        if (!bs->drv) {
>> @@ -1980,6 +1988,12 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
>>    {
>>        trace_bdrv_co_pwrite_zeroes(child->bs, offset, bytes, flags);
>>    
>> +    if ((flags & BDRV_REQ_ALLOCATE) &&
>> +        !(child->bs->supported_zero_flags & BDRV_REQ_ALLOCATE))
>> +    {
>> +        return -ENOTSUP;
>> +    }
>> +
>>        if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
>>            flags &= ~BDRV_REQ_MAY_UNMAP;
>>        }
>>
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
  2018-12-03 13:59   ` Alberto Garcia
@ 2018-12-05 14:01   ` Vladimir Sementsov-Ogievskiy
  2018-12-05 16:59     ` Anton Nefedov
  2018-12-13 12:02   ` Vladimir Sementsov-Ogievskiy
  2018-12-14 16:20   ` Vladimir Sementsov-Ogievskiy
  3 siblings, 1 reply; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-05 14:01 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

03.12.2018 13:14, Anton Nefedov wrote:
> If COW areas of the newly allocated clusters are zeroes on the backing image,
> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
> cluster instead of writing explicit zero buffers later in perform_cow().
> 
> iotest 060:
> write to the discarded cluster does not trigger COW anymore.
> Use a backing image instead.
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> Reviewed-by: Alberto Garcia <berto@igalia.com>
> ---
>   qapi/block-core.json       |  4 +-
>   block/qcow2.h              |  6 +++
>   block/qcow2-cluster.c      |  2 +-
>   block/qcow2.c              | 80 +++++++++++++++++++++++++++++++++++++-
>   block/trace-events         |  1 +
>   tests/qemu-iotests/060     | 26 ++++++++-----
>   tests/qemu-iotests/060.out |  5 ++-
>   7 files changed, 109 insertions(+), 15 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index d4fe710836..50598aa8fe 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -3004,6 +3004,8 @@
>   #
>   # @cor_write: a write due to copy-on-read (since 2.11)
>   #
> +# @cluster_alloc_space: an allocation of file space for a cluster (since 4.0)
> +#
>   # Since: 2.9
>   ##
>   { 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
> @@ -3022,7 +3024,7 @@
>               'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
>               'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
>               'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
> -            'cor_write'] }
> +            'cor_write', 'cluster_alloc_space'] }
>   
>   ##
>   # @BlkdebugInjectErrorOptions:
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 8662b68575..8a64077897 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -389,6 +389,12 @@ typedef struct QCowL2Meta
>        */
>       Qcow2COWRegion cow_end;
>   
> +    /**
> +     * Indicates that COW regions are already handled and do not require
> +     * any more processing.
> +     */
> +    bool skip_cow;
> +
>       /**
>        * The I/O vector with the data from the actual guest write request.
>        * If non-NULL, this is meant to be merged together with the data
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index d37fe08b3d..3685c5f67e 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -806,7 +806,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
>       assert(start->offset + start->nb_bytes <= end->offset);
>       assert(!m->data_qiov || m->data_qiov->size == data_bytes);
>   
> -    if (start->nb_bytes == 0 && end->nb_bytes == 0) {
> +    if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
>           return 0;
>       }
>   
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 991d6ac91b..027188a1a3 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2015,6 +2015,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>               continue;
>           }
>   
> +        /* If COW regions are handled already, skip this too */
> +        if (m->skip_cow) {
> +            continue;
> +        }
> +
>           /* The data (middle) region must be immediately after the
>            * start region */
>           if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
> @@ -2040,6 +2045,68 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>       return false;
>   }
>   
> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
> +{
> +    int64_t nr;
> +    return !bytes ||
> +        (!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);

bdrv_is_allocated_above may return error < 0

> +}
> +
> +static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
> +{
> +    /* This check is designed for optimization shortcut so it must be
> +     * efficient.
> +     * Instead of is_zero(), use is_unallocated() as it is faster (but not
> +     * as accurate and can result in false negatives). */
> +    return is_unallocated(bs, m->offset + m->cow_start.offset,
> +                          m->cow_start.nb_bytes) &&
> +           is_unallocated(bs, m->offset + m->cow_end.offset,
> +                          m->cow_end.nb_bytes);
> +}
> +
> +static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    QCowL2Meta *m;
> +
> +    if (!(bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE)) {
> +        return 0;
> +    }
> +
> +    if (bs->encrypted) {
> +        return 0;
> +    }
> +
> +    for (m = l2meta; m != NULL; m = m->next) {
> +        int ret;
> +
> +        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
> +            continue;
> +        }
> +
> +        if (!is_zero_cow(bs, m)) {
> +            continue;
> +        }
> +
> +        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
> +        /* instead of writing zero COW buffers,
> +           efficiently zero out the whole clusters */
> +        ret = bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset,
> +                                    m->nb_clusters * s->cluster_size,
> +                                    BDRV_REQ_ALLOCATE);
> +        if (ret < 0) {
> +            if (ret != -ENOTSUP && ret != -EAGAIN) {

why do you ignore ENOTSUP?

And also, hmm, should not we have problems with serializing? If already started unaligned request,
it is serializing.. And now, we are going to start nested serializing request, which will definitely
return EAGAIN..

> +                return ret;
> +            }
> +            continue;
> +        }
> +
> +        trace_qcow2_skip_cow(qemu_coroutine_self(), m->offset, m->nb_clusters);
> +        m->skip_cow = true;
> +    }
> +    return 0;
> +}
> +
>   static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>                                            uint64_t bytes, QEMUIOVector *qiov,
>                                            int flags)
> @@ -2122,24 +2189,33 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>               goto fail;
>           }
>   
> +        qemu_co_mutex_unlock(&s->lock);
> +
> +        ret = handle_alloc_space(bs, l2meta);
> +        if (ret < 0) {
> +            qemu_co_mutex_lock(&s->lock);
> +            goto fail;
> +        }
> +
>           /* If we need to do COW, check if it's possible to merge the
>            * writing of the guest data together with that of the COW regions.
>            * If it's not possible (or not necessary) then write the
>            * guest data now. */
>           if (!merge_cow(offset, cur_bytes, &hd_qiov, l2meta)) {
> -            qemu_co_mutex_unlock(&s->lock);
>               BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
>               trace_qcow2_writev_data(qemu_coroutine_self(),
>                                       cluster_offset + offset_in_cluster);
>               ret = bdrv_co_pwritev(bs->file,
>                                     cluster_offset + offset_in_cluster,
>                                     cur_bytes, &hd_qiov, 0);
> -            qemu_co_mutex_lock(&s->lock);
>               if (ret < 0) {
> +                qemu_co_mutex_lock(&s->lock);
>                   goto fail;
>               }
>           }
>   
> +        qemu_co_mutex_lock(&s->lock);
> +
>           ret = qcow2_handle_l2meta(bs, &l2meta, true);
>           if (ret) {
>               goto fail;
> diff --git a/block/trace-events b/block/trace-events
> index 3e8c47bb24..f3f6d66dcc 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -69,6 +69,7 @@ qcow2_writev_done_part(void *co, int cur_bytes) "co %p cur_bytes %d"
>   qcow2_writev_data(void *co, uint64_t offset) "co %p offset 0x%" PRIx64
>   qcow2_pwrite_zeroes_start_req(void *co, int64_t offset, int count) "co %p offset 0x%" PRIx64 " count %d"
>   qcow2_pwrite_zeroes(void *co, int64_t offset, int count) "co %p offset 0x%" PRIx64 " count %d"
> +qcow2_skip_cow(void *co, uint64_t offset, int nb_clusters) "co %p offset 0x%" PRIx64 " nb_clusters %d"
>   
>   # block/qcow2-cluster.c
>   qcow2_alloc_clusters_offset(void *co, uint64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
> diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
> index af0588ae9a..089df94657 100755
> --- a/tests/qemu-iotests/060
> +++ b/tests/qemu-iotests/060
> @@ -150,27 +150,33 @@ $QEMU_IO -c "$OPEN_RO" -c "read -P 1 0 512" | _filter_qemu_io
>   echo
>   echo "=== Testing overlap while COW is in flight ==="
>   echo
> +BACKING_IMG=$TEST_IMG.base
> +TEST_IMG=$BACKING_IMG _make_test_img 1G
> +
> +$QEMU_IO -c 'write 64k 64k' "$BACKING_IMG" | _filter_qemu_io
> +
>   # compat=0.10 is required in order to make the following discard actually
> -# unallocate the sector rather than make it a zero sector - we want COW, after
> -# all.
> -IMGOPTS='compat=0.10' _make_test_img 1G
> +# unallocate the sector rather than make it a zero sector as we would like
> +# to reuse it for another guest offset
> +IMGOPTS='compat=0.10' _make_test_img -b "$BACKING_IMG" 1G
>   # Write two clusters, the second one enforces creation of an L2 table after
>   # the first data cluster.
>   $QEMU_IO -c 'write 0k 64k' -c 'write 512M 64k' "$TEST_IMG" | _filter_qemu_io
> -# Discard the first cluster. This cluster will soon enough be reallocated and
> -# used for COW.
> +# Discard the first cluster. This cluster will soon enough be reallocated
>   $QEMU_IO -c 'discard 0k 64k' "$TEST_IMG" | _filter_qemu_io
>   # Now, corrupt the image by marking the second L2 table cluster as free.
>   poke_file "$TEST_IMG" '131084' "\x00\x00" # 0x2000c
> -# Start a write operation requiring COW on the image stopping it right before
> -# doing the read; then, trigger the corruption prevention by writing anything to
> -# any unallocated cluster, leading to an attempt to overwrite the second L2
> +# Start a write operation requiring COW on the image;
> +# this write will reuse the host offset released by a previous discard.
> +# Stop it right before doing the read.
> +# Then, trigger the corruption prevention by writing anything to
> +# another unallocated cluster, leading to an attempt to overwrite the second L2
>   # table. Finally, resume the COW write and see it fail (but not crash).
>   echo "open -o file.driver=blkdebug $TEST_IMG
>   break cow_read 0
> -aio_write 0k 1k
> +aio_write 64k 1k
>   wait_break 0
> -write 64k 64k
> +write 128k 64k
>   resume 0" | $QEMU_IO | _filter_qemu_io
>   
>   echo
> diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
> index d67c6234a4..ec8f15d2f0 100644
> --- a/tests/qemu-iotests/060.out
> +++ b/tests/qemu-iotests/060.out
> @@ -97,7 +97,10 @@ read 512/512 bytes at offset 0
>   
>   === Testing overlap while COW is in flight ===
>   
> -Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
> +Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=1073741824
> +wrote 65536/65536 bytes at offset 65536
> +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 backing_file=TEST_DIR/t.IMGFMT.base
>   wrote 65536/65536 bytes at offset 0
>   64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   wrote 65536/65536 bytes at offset 536870912
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising
  2018-12-05 13:14   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-05 14:01     ` Anton Nefedov
  2018-12-12 12:48       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 42+ messages in thread
From: Anton Nefedov @ 2018-12-05 14:01 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto



On 5/12/2018 4:14 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> The idea is that ALLOCATE requests may overlap with other requests.
> 
> please, describe why
> 

It is not used in this series from some point, but the idea is that the
caller might use ALLOCATE requests on a larger extent.

Described in the series header:

   2. moreover, efficient write_zeroes() operation can be used to
preallocate
      space megabytes (*configurable number) ahead which gives noticeable
      improvement on some storage types (e.g. distributed storage)
      where the space allocation operation might be expensive)
      (Not included in this patchset since v6).

So, it's possible to drop from this series and add later but I'd like
to receive general remarks on whether this is an acceptable way.

>> Reuse the existing block layer infrastructure for serialising requests.
>> Use the following approach:
>>     - mark ALLOCATE also SERIALISING, so subsequent requests to the area wait
>>     - ALLOCATE request itself must never wait if another request is in flight
>>       already. Return EAGAIN, let the caller reconsider.
>>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>    block/io.c | 31 ++++++++++++++++++++++++-------
>>    1 file changed, 24 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/io.c b/block/io.c
>> index d9d7644858..6ff946f63d 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -720,12 +720,13 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
>>        bdrv_wakeup(bs);
>>    }
>>    
>> -static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>> +static bool coroutine_fn find_or_wait_serialising_requests(
>> +    BdrvTrackedRequest *self, bool wait)
>>    {
>>        BlockDriverState *bs = self->bs;
>>        BdrvTrackedRequest *req;
>>        bool retry;
>> -    bool waited = false;
>> +    bool found = false;
>>    
>>        if (!atomic_read(&bs->serialising_in_flight)) {
>>            return false;
>> @@ -751,11 +752,14 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>>                     * will wait for us as soon as it wakes up, then just go on
>>                     * (instead of producing a deadlock in the former case). */
>>                    if (!req->waiting_for) {
>> +                    found = true;
>> +                    if (!wait) {
>> +                        break;
>> +                    }
>>                        self->waiting_for = req;
>>                        qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
>>                        self->waiting_for = NULL;
>>                        retry = true;
>> -                    waited = true;
>>                        break;
>>                    }
>>                }
>> @@ -763,7 +767,12 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>>            qemu_co_mutex_unlock(&bs->reqs_lock);
>>        } while (retry);
>>    
>> -    return waited;
>> +    return found;
>> +}
>> +
>> +static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>> +{
>> +    return find_or_wait_serialising_requests(self, true);
>>    }
>>    
>>    static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
>> @@ -1585,7 +1594,7 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
>>                              BdrvTrackedRequest *req, int flags)
>>    {
>>        BlockDriverState *bs = child->bs;
>> -    bool waited;
>> +    bool found;
>>        int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
>>    
>>        if (bs->read_only) {
>> @@ -1602,9 +1611,13 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
>>            mark_request_serialising(req, bdrv_get_cluster_size(bs));
>>        }
>>    
>> -    waited = wait_serialising_requests(req);
>> +    found = find_or_wait_serialising_requests(req,
>> +                                              !(flags & BDRV_REQ_ALLOCATE));
>> +    if (found && (flags & BDRV_REQ_ALLOCATE)) {
>> +        return -EAGAIN;
>> +    }
>>    
>> -    assert(!waited || !req->serialising ||
>> +    assert(!found || !req->serialising ||
>>               is_request_serialising_and_aligned(req));
>>        assert(req->overlap_offset <= offset);
>>        assert(offset + bytes <= req->overlap_offset + req->overlap_bytes);
>> @@ -1866,6 +1879,10 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
>>        assert(!((flags & BDRV_REQ_ALLOCATE) && (flags & BDRV_REQ_MAY_UNMAP)));
>>        assert(!((flags & BDRV_REQ_ALLOCATE) && !(flags & BDRV_REQ_ZERO_WRITE)));
>>    
>> +    if (flags & BDRV_REQ_ALLOCATE) {
>> +        flags |= BDRV_REQ_SERIALISING;
>> +    }
>> +
>>        trace_bdrv_co_pwritev(child->bs, offset, bytes, flags);
>>    
>>        if (!bs->drv) {
>>
> 
> patch looks correct technically, I just don't know the reasoning..
> 
> the only thing, is that it would be good to add a comment in BDRV flags definition section, that _ALLOCATE implies _SERIALIZE.
> 
> 

I see your point, but it does not imply SERIALIZE in sense that the
caller must set both (as with ALLOCATE and WRITE_ZEROES) - it's set
implicitly.

maybe:

diff --git a/include/block/block.h b/include/block/block.h
index f571082415..a0bf3fed93 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -87,6 +87,9 @@ typedef enum {
       * efficiently allocate the space so it reads as zeroes, or return 
an error.
       * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
       * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
+     * This flag implicitly behaves as BDRV_REQ_SERIALISING i.e. it is
+     * protected from conflicts with overlapping requests. If such 
conflict is
+     * detected, -EAGAIN is returned.
       */
      BDRV_REQ_ALLOCATE           = 0x100,



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE
  2018-12-05 13:25   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-05 14:11     ` Anton Nefedov
  2018-12-12 17:19       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 42+ messages in thread
From: Anton Nefedov @ 2018-12-05 14:11 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

On 5/12/2018 4:25 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> Current write_zeroes implementation is good enough to satisfy this flag too
>>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>    block/file-posix.c | 8 +++++++-
>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index 07bbdab953..b0b7ab0159 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -603,6 +603,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>            } else {
>>                s->discard_zeroes = true;
>>                s->has_fallocate = true;
>> +            bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
>>            }
>>        } else {
>>            if (!(S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode))) {
>> @@ -646,10 +647,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>    #ifdef CONFIG_XFS
>>        if (platform_test_xfs_fd(s->fd)) {
>>            s->is_xfs = true;
>> +        bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
> 
> why we should handle xfs separately? is there a case with xfs, not handled by previous generic if ()?
> 

The driver supports ALLOCATE either when it's XFS, or when fallocate is
available. Currently there is no test for fallocate, it's just implied
it's supported until ENOTSUP returned.
I think it's safer (for possible future changes) to set it twice even
though you're right and first condition currently covers the XFS 
condition too.

>>        }
>>    #endif
>>    
>> -    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>> +    bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
>>        ret = 0;
>>    fail:
>>        if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
>> @@ -1520,6 +1522,10 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>>            }
>>            s->has_fallocate = false;
>>        }
>> +
>> +    if (!s->has_fallocate) {
>> +        aiocb->bs->supported_zero_flags &= ~BDRV_REQ_ALLOCATE;
>> +    }
>>    #endif
>>    
>>        return -ENOTSUP;
>>
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas
  2018-12-05 14:01   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-05 16:59     ` Anton Nefedov
  2018-12-05 17:42       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 42+ messages in thread
From: Anton Nefedov @ 2018-12-05 16:59 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

On 5/12/2018 5:01 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> If COW areas of the newly allocated clusters are zeroes on the backing image,
>> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
>> cluster instead of writing explicit zero buffers later in perform_cow().
>>
>> iotest 060:
>> write to the discarded cluster does not trigger COW anymore.
>> Use a backing image instead.
>>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> Reviewed-by: Alberto Garcia <berto@igalia.com>
>> ---
>>    qapi/block-core.json       |  4 +-
>>    block/qcow2.h              |  6 +++
>>    block/qcow2-cluster.c      |  2 +-
>>    block/qcow2.c              | 80 +++++++++++++++++++++++++++++++++++++-
>>    block/trace-events         |  1 +
>>    tests/qemu-iotests/060     | 26 ++++++++-----
>>    tests/qemu-iotests/060.out |  5 ++-
>>    7 files changed, 109 insertions(+), 15 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index d4fe710836..50598aa8fe 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -3004,6 +3004,8 @@
>>    #
>>    # @cor_write: a write due to copy-on-read (since 2.11)
>>    #
>> +# @cluster_alloc_space: an allocation of file space for a cluster (since 4.0)
>> +#
>>    # Since: 2.9
>>    ##
>>    { 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
>> @@ -3022,7 +3024,7 @@
>>                'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
>>                'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
>>                'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
>> -            'cor_write'] }
>> +            'cor_write', 'cluster_alloc_space'] }
>>    
>>    ##
>>    # @BlkdebugInjectErrorOptions:
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index 8662b68575..8a64077897 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -389,6 +389,12 @@ typedef struct QCowL2Meta
>>         */
>>        Qcow2COWRegion cow_end;
>>    
>> +    /**
>> +     * Indicates that COW regions are already handled and do not require
>> +     * any more processing.
>> +     */
>> +    bool skip_cow;
>> +
>>        /**
>>         * The I/O vector with the data from the actual guest write request.
>>         * If non-NULL, this is meant to be merged together with the data
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index d37fe08b3d..3685c5f67e 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -806,7 +806,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
>>        assert(start->offset + start->nb_bytes <= end->offset);
>>        assert(!m->data_qiov || m->data_qiov->size == data_bytes);
>>    
>> -    if (start->nb_bytes == 0 && end->nb_bytes == 0) {
>> +    if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
>>            return 0;
>>        }
>>    
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 991d6ac91b..027188a1a3 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2015,6 +2015,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>                continue;
>>            }
>>    
>> +        /* If COW regions are handled already, skip this too */
>> +        if (m->skip_cow) {
>> +            continue;
>> +        }
>> +
>>            /* The data (middle) region must be immediately after the
>>             * start region */
>>            if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
>> @@ -2040,6 +2045,68 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>        return false;
>>    }
>>    
>> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
>> +{
>> +    int64_t nr;
>> +    return !bytes ||
>> +        (!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
> 
> bdrv_is_allocated_above may return error < 0
> 

Probably I just took is_zero() as an example.
But somewhere there's even a rationale (bdrv_co_do_copy_on_readv):

         ret = bdrv_is_allocated(bs, cluster_offset,
                                 MIN(cluster_bytes, max_transfer), &pnum);
         if (ret < 0) {
             /* Safe to treat errors in querying allocation as if
              * unallocated; we'll probably fail again soon on the
              * read, but at least that will set a decent errno.
              */
             pnum = MIN(cluster_bytes, max_transfer);
         }

>> +}
>> +
>> +static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
>> +{
>> +    /* This check is designed for optimization shortcut so it must be
>> +     * efficient.
>> +     * Instead of is_zero(), use is_unallocated() as it is faster (but not
>> +     * as accurate and can result in false negatives). */
>> +    return is_unallocated(bs, m->offset + m->cow_start.offset,
>> +                          m->cow_start.nb_bytes) &&
>> +           is_unallocated(bs, m->offset + m->cow_end.offset,
>> +                          m->cow_end.nb_bytes);
>> +}
>> +
>> +static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    QCowL2Meta *m;
>> +
>> +    if (!(bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE)) {
>> +        return 0;
>> +    }
>> +
>> +    if (bs->encrypted) {
>> +        return 0;
>> +    }
>> +
>> +    for (m = l2meta; m != NULL; m = m->next) {
>> +        int ret;
>> +
>> +        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
>> +            continue;
>> +        }
>> +
>> +        if (!is_zero_cow(bs, m)) {
>> +            continue;
>> +        }
>> +
>> +        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
>> +        /* instead of writing zero COW buffers,
>> +           efficiently zero out the whole clusters */
>> +        ret = bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset,
>> +                                    m->nb_clusters * s->cluster_size,
>> +                                    BDRV_REQ_ALLOCATE);
>> +        if (ret < 0) {
>> +            if (ret != -ENOTSUP && ret != -EAGAIN) {
> 
> why do you ignore ENOTSUP?
> 

This might particularly happen after first pwrite_zeroes() to
file-posix.
It generally reports BDRV_REQ_ALLOCATE on open, but then it might reset
the flag if fallocate() returns ENOTSUP.

Nevermind file-posix; with any file driver what can we possibly do here:
this operation is non-mandatory and the error is not critical.
Never trying again doesn't look quite right, the file driver can reset
BDRV_REQ_ALLOCATE if it likes and we won't bother it anymore.


> And also, hmm, should not we have problems with serializing? If already started unaligned request,
> it is serializing.. And now, we are going to start nested serializing request, which will definitely
> return EAGAIN..
> 

The request we're processing here is on qcow2 level yet, it won't affect
the bs->file serialization queue.

[..]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas
  2018-12-05 16:59     ` Anton Nefedov
@ 2018-12-05 17:42       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-05 17:42 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

05.12.2018 19:59, Anton Nefedov wrote:
> On 5/12/2018 5:01 PM, Vladimir Sementsov-Ogievskiy wrote:
>> 03.12.2018 13:14, Anton Nefedov wrote:
>>> If COW areas of the newly allocated clusters are zeroes on the backing image,
>>> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
>>> cluster instead of writing explicit zero buffers later in perform_cow().
>>>
>>> iotest 060:
>>> write to the discarded cluster does not trigger COW anymore.
>>> Use a backing image instead.
>>>
>>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>>> Reviewed-by: Alberto Garcia <berto@igalia.com>
>>> ---
>>>     qapi/block-core.json       |  4 +-
>>>     block/qcow2.h              |  6 +++
>>>     block/qcow2-cluster.c      |  2 +-
>>>     block/qcow2.c              | 80 +++++++++++++++++++++++++++++++++++++-
>>>     block/trace-events         |  1 +
>>>     tests/qemu-iotests/060     | 26 ++++++++-----
>>>     tests/qemu-iotests/060.out |  5 ++-
>>>     7 files changed, 109 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>> index d4fe710836..50598aa8fe 100644
>>> --- a/qapi/block-core.json
>>> +++ b/qapi/block-core.json
>>> @@ -3004,6 +3004,8 @@
>>>     #
>>>     # @cor_write: a write due to copy-on-read (since 2.11)
>>>     #
>>> +# @cluster_alloc_space: an allocation of file space for a cluster (since 4.0)
>>> +#
>>>     # Since: 2.9
>>>     ##
>>>     { 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
>>> @@ -3022,7 +3024,7 @@
>>>                 'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
>>>                 'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
>>>                 'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
>>> -            'cor_write'] }
>>> +            'cor_write', 'cluster_alloc_space'] }
>>>     
>>>     ##
>>>     # @BlkdebugInjectErrorOptions:
>>> diff --git a/block/qcow2.h b/block/qcow2.h
>>> index 8662b68575..8a64077897 100644
>>> --- a/block/qcow2.h
>>> +++ b/block/qcow2.h
>>> @@ -389,6 +389,12 @@ typedef struct QCowL2Meta
>>>          */
>>>         Qcow2COWRegion cow_end;
>>>     
>>> +    /**
>>> +     * Indicates that COW regions are already handled and do not require
>>> +     * any more processing.
>>> +     */
>>> +    bool skip_cow;
>>> +
>>>         /**
>>>          * The I/O vector with the data from the actual guest write request.
>>>          * If non-NULL, this is meant to be merged together with the data
>>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>>> index d37fe08b3d..3685c5f67e 100644
>>> --- a/block/qcow2-cluster.c
>>> +++ b/block/qcow2-cluster.c
>>> @@ -806,7 +806,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
>>>         assert(start->offset + start->nb_bytes <= end->offset);
>>>         assert(!m->data_qiov || m->data_qiov->size == data_bytes);
>>>     
>>> -    if (start->nb_bytes == 0 && end->nb_bytes == 0) {
>>> +    if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
>>>             return 0;
>>>         }
>>>     
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index 991d6ac91b..027188a1a3 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -2015,6 +2015,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>>                 continue;
>>>             }
>>>     
>>> +        /* If COW regions are handled already, skip this too */
>>> +        if (m->skip_cow) {
>>> +            continue;
>>> +        }
>>> +
>>>             /* The data (middle) region must be immediately after the
>>>              * start region */
>>>             if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
>>> @@ -2040,6 +2045,68 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>>         return false;
>>>     }
>>>     
>>> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
>>> +{
>>> +    int64_t nr;
>>> +    return !bytes ||
>>> +        (!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
>>
>> bdrv_is_allocated_above may return error < 0
>>
> 
> Probably I just took is_zero() as an example.
> But somewhere there's even a rationale (bdrv_co_do_copy_on_readv):
> 
>           ret = bdrv_is_allocated(bs, cluster_offset,
>                                   MIN(cluster_bytes, max_transfer), &pnum);
>           if (ret < 0) {
>               /* Safe to treat errors in querying allocation as if
>                * unallocated; we'll probably fail again soon on the
>                * read, but at least that will set a decent errno.
>                */
>               pnum = MIN(cluster_bytes, max_transfer);
>           }

aha, anyway, !bdrv_is_allocated_above is true when and only when the function
successfully returned "unallocated".

ahaha, this rationale has funny mistake: s/unallocated/allocated )


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags Anton Nefedov
  2018-12-05 12:43   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-07 14:31   ` Alberto Garcia
  2018-12-12 12:15   ` Vladimir Sementsov-Ogievskiy
  2 siblings, 0 replies; 42+ messages in thread
From: Alberto Garcia @ 2018-12-07 14:31 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev,
	Vladimir Sementsov-Ogievskiy

On Mon 03 Dec 2018 11:14:53 AM CET, Anton Nefedov wrote:
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 2/9] blkverify: set supported write/zero flags
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 2/9] blkverify: set " Anton Nefedov
@ 2018-12-07 14:32   ` Alberto Garcia
  2018-12-12 12:26   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 42+ messages in thread
From: Alberto Garcia @ 2018-12-07 14:32 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev,
	Vladimir Sementsov-Ogievskiy

On Mon 03 Dec 2018 11:14:54 AM CET, Anton Nefedov wrote:
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags Anton Nefedov
@ 2018-12-07 14:33   ` Alberto Garcia
  2018-12-07 14:46     ` Anton Nefedov
  2018-12-12 12:33   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 42+ messages in thread
From: Alberto Garcia @ 2018-12-07 14:33 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev,
	Vladimir Sementsov-Ogievskiy

On Mon 03 Dec 2018 11:14:55 AM CET, Anton Nefedov wrote:
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>  block/quorum.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index 16b3c8067c..d21a6a3b8e 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -857,6 +857,19 @@ static QemuOptsList quorum_runtime_opts = {
>      },
>  };
>  
> +static void quorum_set_supported_flags(BlockDriverState *bs)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    int i;
> +
> +    bs->supported_write_flags = BDRV_REQ_FUA;
> +    for (i = 0; i < s->num_children; i++) {
> +        bs->supported_write_flags &= s->children[i]->bs->supported_write_flags;
> +    }
> +
> +    bs->supported_write_flags |= BDRV_REQ_WRITE_UNCHANGED;
> +}

You don't set supported_zero_flags here anymore ?

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags
  2018-12-07 14:33   ` Alberto Garcia
@ 2018-12-07 14:46     ` Anton Nefedov
  2018-12-07 14:54       ` Alberto Garcia
  0 siblings, 1 reply; 42+ messages in thread
From: Anton Nefedov @ 2018-12-07 14:46 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev,
	Vladimir Sementsov-Ogievskiy

On 7/12/2018 5:33 PM, Alberto Garcia wrote:
> On Mon 03 Dec 2018 11:14:55 AM CET, Anton Nefedov wrote:
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>   block/quorum.c | 19 ++++++++++++++++++-
>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/quorum.c b/block/quorum.c
>> index 16b3c8067c..d21a6a3b8e 100644
>> --- a/block/quorum.c
>> +++ b/block/quorum.c
>> @@ -857,6 +857,19 @@ static QemuOptsList quorum_runtime_opts = {
>>       },
>>   };
>>   
>> +static void quorum_set_supported_flags(BlockDriverState *bs)
>> +{
>> +    BDRVQuorumState *s = bs->opaque;
>> +    int i;
>> +
>> +    bs->supported_write_flags = BDRV_REQ_FUA;
>> +    for (i = 0; i < s->num_children; i++) {
>> +        bs->supported_write_flags &= s->children[i]->bs->supported_write_flags;
>> +    }
>> +
>> +    bs->supported_write_flags |= BDRV_REQ_WRITE_UNCHANGED;
>> +}
> 
> You don't set supported_zero_flags here anymore ?
> 
> Berto
> 

Yes, I noticed (that late) that quorum doesn't actually implement
write_zeroes(). bdrv_co_do_pwrite_zeroes() specifically checks that
there must be no supported flags set in such case.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags
  2018-12-07 14:46     ` Anton Nefedov
@ 2018-12-07 14:54       ` Alberto Garcia
  0 siblings, 0 replies; 42+ messages in thread
From: Alberto Garcia @ 2018-12-07 14:54 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev,
	Vladimir Sementsov-Ogievskiy

On Fri 07 Dec 2018 03:46:13 PM CET, Anton Nefedov wrote:
>>> +static void quorum_set_supported_flags(BlockDriverState *bs)
>>> +{
>>> +    BDRVQuorumState *s = bs->opaque;
>>> +    int i;
>>> +
>>> +    bs->supported_write_flags = BDRV_REQ_FUA;
>>> +    for (i = 0; i < s->num_children; i++) {
>>> +        bs->supported_write_flags &= s->children[i]->bs->supported_write_flags;
>>> +    }
>>> +
>>> +    bs->supported_write_flags |= BDRV_REQ_WRITE_UNCHANGED;
>>> +}
>> 
>> You don't set supported_zero_flags here anymore ?
>
> Yes, I noticed (that late) that quorum doesn't actually implement
> write_zeroes(). bdrv_co_do_pwrite_zeroes() specifically checks that
> there must be no supported flags set in such case.

Oh, I see.

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 7/9] block: support BDRV_REQ_ALLOCATE in passthrough drivers
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 7/9] block: support BDRV_REQ_ALLOCATE in passthrough drivers Anton Nefedov
  2018-12-05 13:28   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-07 15:00   ` Alberto Garcia
  1 sibling, 0 replies; 42+ messages in thread
From: Alberto Garcia @ 2018-12-07 15:00 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev,
	Vladimir Sementsov-Ogievskiy

On Mon 03 Dec 2018 11:14:59 AM CET, Anton Nefedov wrote:
> Support the flag if the underlying BDS supports it
>
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE Anton Nefedov
  2018-12-05 13:25   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-07 15:09   ` Alberto Garcia
  1 sibling, 0 replies; 42+ messages in thread
From: Alberto Garcia @ 2018-12-07 15:09 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev,
	Vladimir Sementsov-Ogievskiy

On Mon 03 Dec 2018 11:14:58 AM CET, Anton Nefedov wrote:
> Current write_zeroes implementation is good enough to satisfy this flag too
>
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags Anton Nefedov
  2018-12-05 12:43   ` Vladimir Sementsov-Ogievskiy
  2018-12-07 14:31   ` Alberto Garcia
@ 2018-12-12 12:15   ` Vladimir Sementsov-Ogievskiy
  2 siblings, 0 replies; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-12 12:15 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

03.12.2018 13:14, Anton Nefedov wrote:
> Signed-off-by: Anton Nefedov<anton.nefedov@virtuozzo.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 2/9] blkverify: set supported write/zero flags
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 2/9] blkverify: set " Anton Nefedov
  2018-12-07 14:32   ` Alberto Garcia
@ 2018-12-12 12:26   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-12 12:26 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

03.12.2018 13:14, Anton Nefedov wrote:
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags Anton Nefedov
  2018-12-07 14:33   ` Alberto Garcia
@ 2018-12-12 12:33   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-12 12:33 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

03.12.2018 13:14, Anton Nefedov wrote:
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>



-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising
  2018-12-05 14:01     ` Anton Nefedov
@ 2018-12-12 12:48       ` Vladimir Sementsov-Ogievskiy
  2018-12-13 11:57         ` Anton Nefedov
  0 siblings, 1 reply; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-12 12:48 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

05.12.2018 17:01, Anton Nefedov wrote:
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -87,6 +87,9 @@ typedef enum {
>         * efficiently allocate the space so it reads as zeroes, or return
> an error.
>         * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
>         * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
> +     * This flag implicitly behaves as BDRV_REQ_SERIALISING i.e. it is
> +     * protected from conflicts with overlapping requests. If such
> conflict is
> +     * detected, -EAGAIN is returned.

"behaves as" sounds like "do the same" for me, so better is "implicitly sets" or something like this.

>         */
>        BDRV_REQ_ALLOCATE           = 0x100,
> 
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE
  2018-12-05 14:11     ` Anton Nefedov
@ 2018-12-12 17:19       ` Vladimir Sementsov-Ogievskiy
  2018-12-13 12:01         ` Anton Nefedov
  0 siblings, 1 reply; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-12 17:19 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

05.12.2018 17:11, Anton Nefedov wrote:
> On 5/12/2018 4:25 PM, Vladimir Sementsov-Ogievskiy wrote:
>> 03.12.2018 13:14, Anton Nefedov wrote:
>>> Current write_zeroes implementation is good enough to satisfy this flag too
>>>
>>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>>> ---
>>>     block/file-posix.c | 8 +++++++-
>>>     1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>> index 07bbdab953..b0b7ab0159 100644
>>> --- a/block/file-posix.c
>>> +++ b/block/file-posix.c
>>> @@ -603,6 +603,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>>             } else {
>>>                 s->discard_zeroes = true;
>>>                 s->has_fallocate = true;
>>> +            bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
>>>             }
>>>         } else {
>>>             if (!(S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode))) {
>>> @@ -646,10 +647,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>>     #ifdef CONFIG_XFS
>>>         if (platform_test_xfs_fd(s->fd)) {
>>>             s->is_xfs = true;
>>> +        bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
>>
>> why we should handle xfs separately? is there a case with xfs, not handled by previous generic if ()?
>>
> 
> The driver supports ALLOCATE either when it's XFS, or when fallocate is
> available. Currently there is no test for fallocate, it's just implied
> it's supported until ENOTSUP returned.
> I think it's safer (for possible future changes) to set it twice even
> though you're right and first condition currently covers the XFS
> condition too.

Aha, ok.



> 
>>>         }
>>>     #endif
>>>     
>>> -    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>>> +    bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
>>>         ret = 0;
>>>     fail:
>>>         if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
>>> @@ -1520,6 +1522,10 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>>>             }
>>>             s->has_fallocate = false;
>>>         }
>>> +
>>> +    if (!s->has_fallocate) {
>>> +        aiocb->bs->supported_zero_flags &= ~BDRV_REQ_ALLOCATE;
>>> +    }

hm, if CONFIG_FALLOCATE is disabled, flag will remain in supported_zero_flags

>>>     #endif
>>>     
>>>         return -ENOTSUP;
>>>
>>
>>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising
  2018-12-12 12:48       ` Vladimir Sementsov-Ogievskiy
@ 2018-12-13 11:57         ` Anton Nefedov
  0 siblings, 0 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-13 11:57 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto



On 12/12/2018 3:48 PM, Vladimir Sementsov-Ogievskiy wrote:
> 05.12.2018 17:01, Anton Nefedov wrote:
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -87,6 +87,9 @@ typedef enum {
>>          * efficiently allocate the space so it reads as zeroes, or return
>> an error.
>>          * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
>>          * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
>> +     * This flag implicitly behaves as BDRV_REQ_SERIALISING i.e. it is
>> +     * protected from conflicts with overlapping requests. If such
>> conflict is
>> +     * detected, -EAGAIN is returned.
> 
> "behaves as" sounds like "do the same" for me, so better is "implicitly sets" or something like this.
> 

"implicitly sets" is good enough for me

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE
  2018-12-12 17:19       ` Vladimir Sementsov-Ogievskiy
@ 2018-12-13 12:01         ` Anton Nefedov
  0 siblings, 0 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-13 12:01 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

On 12/12/2018 8:19 PM, Vladimir Sementsov-Ogievskiy wrote:
> 05.12.2018 17:11, Anton Nefedov wrote:
>> On 5/12/2018 4:25 PM, Vladimir Sementsov-Ogievskiy wrote:
>>> 03.12.2018 13:14, Anton Nefedov wrote:
>>>>          }
>>>>      #endif
>>>>      
>>>> -    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>>>> +    bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
>>>>          ret = 0;
>>>>      fail:
>>>>          if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
>>>> @@ -1520,6 +1522,10 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>>>>              }
>>>>              s->has_fallocate = false;
>>>>          }
>>>> +
>>>> +    if (!s->has_fallocate) {
>>>> +        aiocb->bs->supported_zero_flags &= ~BDRV_REQ_ALLOCATE;
>>>> +    }
 >>>>>     #endif
> 
> hm, if CONFIG_FALLOCATE is disabled, flag will remain in supported_zero_flags
> 

right..
I think there should be a separate patch to reset s->has_* in
non-CONFIG_FALLOCATE* cases. Then I'll move this hunk just one line down
under the following #endif

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
  2018-12-03 13:59   ` Alberto Garcia
  2018-12-05 14:01   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-13 12:02   ` Vladimir Sementsov-Ogievskiy
  2018-12-13 13:57     ` Anton Nefedov
  2018-12-14 16:20   ` Vladimir Sementsov-Ogievskiy
  3 siblings, 1 reply; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-13 12:02 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

03.12.2018 13:14, Anton Nefedov wrote:
> If COW areas of the newly allocated clusters are zeroes on the backing image,
> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
> cluster instead of writing explicit zero buffers later in perform_cow().
> 

[...]

> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2015,6 +2015,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>               continue;
>           }
>   
> +        /* If COW regions are handled already, skip this too */
> +        if (m->skip_cow) {
> +            continue;
> +        }
> +
>           /* The data (middle) region must be immediately after the
>            * start region */
>           if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
> @@ -2040,6 +2045,68 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>       return false;
>   }
>   
> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
> +{
> +    int64_t nr;
> +    return !bytes ||
> +        (!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);

hm, nr may be < bytes if it is up to file length. And we lose this case, when, it
may be considered as unallocated too.

Doesn't harm, however.

> +}
> +
> +static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
> +{
> +    /* This check is designed for optimization shortcut so it must be
> +     * efficient.
> +     * Instead of is_zero(), use is_unallocated() as it is faster (but not
> +     * as accurate and can result in false negatives). */

But, in case of allocated zeros, we'll read them anyway (as part of COW process),
so, it may be handled in the same way too. May be not here, but after read we can
check for zeroes, and again effectively write zeros to the whole cluster.

Again it may be done separately, I don't sure it worth doing.

> +    return is_unallocated(bs, m->offset + m->cow_start.offset,
> +                          m->cow_start.nb_bytes) &&
> +           is_unallocated(bs, m->offset + m->cow_end.offset,
> +                          m->cow_end.nb_bytes);
> +}
> +
> +static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    QCowL2Meta *m;
> +
> +    if (!(bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE)) {
> +        return 0;
> +    }
> +
> +    if (bs->encrypted) {
> +        return 0;
> +    }
> +
> +    for (m = l2meta; m != NULL; m = m->next) {
> +        int ret;
> +
> +        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
> +            continue;
> +        }
> +
> +        if (!is_zero_cow(bs, m)) {
> +            continue;
> +        }

pre_write_overlap_check should be here

> +
> +        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
> +        /* instead of writing zero COW buffers,
> +           efficiently zero out the whole clusters */
> +        ret = bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset,
> +                                    m->nb_clusters * s->cluster_size,
> +                                    BDRV_REQ_ALLOCATE);
> +        if (ret < 0) {
> +            if (ret != -ENOTSUP && ret != -EAGAIN) {
> +                return ret;
> +            }
> +            continue;
> +        }
> +
> +        trace_qcow2_skip_cow(qemu_coroutine_self(), m->offset, m->nb_clusters);
> +        m->skip_cow = true;
> +    }
> +    return 0;
> +}
> +


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas
  2018-12-13 12:02   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-13 13:57     ` Anton Nefedov
  0 siblings, 0 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-13 13:57 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

On 13/12/2018 3:02 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> If COW areas of the newly allocated clusters are zeroes on the backing image,
>> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
>> cluster instead of writing explicit zero buffers later in perform_cow().
>>
> 
> [...]
> 
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2015,6 +2015,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>                continue;
>>            }
>>    
>> +        /* If COW regions are handled already, skip this too */
>> +        if (m->skip_cow) {
>> +            continue;
>> +        }
>> +
>>            /* The data (middle) region must be immediately after the
>>             * start region */
>>            if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
>> @@ -2040,6 +2045,68 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>        return false;
>>    }
>>    
>> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
>> +{
>> +    int64_t nr;
>> +    return !bytes ||
>> +        (!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
> 
> hm, nr may be < bytes if it is up to file length. And we lose this case, when, it
> may be considered as unallocated too.
> 
> Doesn't harm, however.
> 

Thanks guys for your review.

This case I think is too rare to care about.

>> +}
>> +
>> +static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
>> +{
>> +    /* This check is designed for optimization shortcut so it must be
>> +     * efficient.
>> +     * Instead of is_zero(), use is_unallocated() as it is faster (but not
>> +     * as accurate and can result in false negatives). */
> 
> But, in case of allocated zeros, we'll read them anyway (as part of COW process),
> so, it may be handled in the same way too. May be not here, but after read we can
> check for zeroes, and again effectively write zeros to the whole cluster.
> 
> Again it may be done separately, I don't sure it worth doing.
> 

Detecting zeroes after we read them (and not here) might make sense;
performance gain should be about the same (minus some CPU to check the
read buffer for zeroes).

The question is how frequent it might hit:
  - raw backing image with holes
  - qcow2 backing image with sub-cluster zero areas
    (e.g. smaller cluster size)
  - backing image contains lots of space with explicit zeroes
    (e.g. guest FS with 'shred' extensively used to delete files)

None of these probably occur that frequent.
But might be a next step and a separate series.

>> +    return is_unallocated(bs, m->offset + m->cow_start.offset,
>> +                          m->cow_start.nb_bytes) &&
>> +           is_unallocated(bs, m->offset + m->cow_end.offset,
>> +                          m->cow_end.nb_bytes);
>> +}
>> +
>> +static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    QCowL2Meta *m;
>> +
>> +    if (!(bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE)) {
>> +        return 0;
>> +    }
>> +
>> +    if (bs->encrypted) {
>> +        return 0;
>> +    }
>> +
>> +    for (m = l2meta; m != NULL; m = m->next) {
>> +        int ret;
>> +
>> +        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
>> +            continue;
>> +        }
>> +
>> +        if (!is_zero_cow(bs, m)) {
>> +            continue;
>> +        }
> 
> pre_write_overlap_check should be here
> 

Existing pre_write_overlap_check in qcow2_co_pwritev() should cover it,
since the check aligns the range to cluster boundaries.

However I think I missed a comment about it here. For suspicious
readers :) and just in case someone starts moving this code around.

I propose:

diff --git a/block/qcow2.c b/block/qcow2.c
index 027188a1a3..b3b3124083 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2088,6 +2088,9 @@ static int handle_alloc_space(BlockDriverState 
*bs, QCowL2Meta *l2meta)
              continue;
          }

+        /* Conventional place for qcow2_pre_write_overlap_check() but 
in this
+           case it is already done for these clusters */
+
          BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
          /* instead of writing zero COW buffers,
             efficiently zero out the whole clusters */


>> +
>> +        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
>> +        /* instead of writing zero COW buffers,
>> +           efficiently zero out the whole clusters */
>> +        ret = bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset,
>> +                                    m->nb_clusters * s->cluster_size,
>> +                                    BDRV_REQ_ALLOCATE);
>> +        if (ret < 0) {
>> +            if (ret != -ENOTSUP && ret != -EAGAIN) {
>> +                return ret;
>> +            }
>> +            continue;
>> +        }
>> +
>> +        trace_qcow2_skip_cow(qemu_coroutine_self(), m->offset, m->nb_clusters);
>> +        m->skip_cow = true;
>> +    }
>> +    return 0;
>> +}
>> +
> 
> 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas
  2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
                     ` (2 preceding siblings ...)
  2018-12-13 12:02   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-14 16:20   ` Vladimir Sementsov-Ogievskiy
  2018-12-17 10:17     ` Anton Nefedov
  3 siblings, 1 reply; 42+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-12-14 16:20 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto

03.12.2018 13:14, Anton Nefedov wrote:
> If COW areas of the newly allocated clusters are zeroes on the backing image,
> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
> cluster instead of writing explicit zero buffers later in perform_cow().
> 
> iotest 060:
> write to the discarded cluster does not trigger COW anymore.
> Use a backing image instead.
> 

[..]

> --- a/tests/qemu-iotests/060
> +++ b/tests/qemu-iotests/060
> @@ -150,27 +150,33 @@ $QEMU_IO -c "$OPEN_RO" -c "read -P 1 0 512" | _filter_qemu_io
>   echo
>   echo "=== Testing overlap while COW is in flight ==="
>   echo
> +BACKING_IMG=$TEST_IMG.base
> +TEST_IMG=$BACKING_IMG _make_test_img 1G
> +
> +$QEMU_IO -c 'write 64k 64k' "$BACKING_IMG" | _filter_qemu_io
> +
>   # compat=0.10 is required in order to make the following discard actually
> -# unallocate the sector rather than make it a zero sector - we want COW, after
> -# all.
> -IMGOPTS='compat=0.10' _make_test_img 1G
> +# unallocate the sector rather than make it a zero sector as we would like
> +# to reuse it for another guest offset
> +IMGOPTS='compat=0.10' _make_test_img -b "$BACKING_IMG" 1G
>   # Write two clusters, the second one enforces creation of an L2 table after
>   # the first data cluster.
>   $QEMU_IO -c 'write 0k 64k' -c 'write 512M 64k' "$TEST_IMG" | _filter_qemu_io
> -# Discard the first cluster. This cluster will soon enough be reallocated and
> -# used for COW.
> +# Discard the first cluster. This cluster will soon enough be reallocated
>   $QEMU_IO -c 'discard 0k 64k' "$TEST_IMG" | _filter_qemu_io
>   # Now, corrupt the image by marking the second L2 table cluster as free.
>   poke_file "$TEST_IMG" '131084' "\x00\x00" # 0x2000c
> -# Start a write operation requiring COW on the image stopping it right before
> -# doing the read; then, trigger the corruption prevention by writing anything to
> -# any unallocated cluster, leading to an attempt to overwrite the second L2
> +# Start a write operation requiring COW on the image;
> +# this write will reuse the host offset released by a previous discard.
> +# Stop it right before doing the read.
> +# Then, trigger the corruption prevention by writing anything to
> +# another unallocated cluster, leading to an attempt to overwrite the second L2
>   # table. Finally, resume the COW write and see it fail (but not crash).
>   echo "open -o file.driver=blkdebug $TEST_IMG
>   break cow_read 0
> -aio_write 0k 1k
> +aio_write 64k 1k
>   wait_break 0
> -write 64k 64k
> +write 128k 64k

don't understand why you need these changes.

works for me, without them, if write to backing at 0 offset, of course.

As I understand, discard create unallocated holes in top qcow2 for old qcow2 version.

>   resume 0" | $QEMU_IO | _filter_qemu_io
>   
>   echo
> diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
> index d67c6234a4..ec8f15d2f0 100644
> --- a/tests/qemu-iotests/060.out
> +++ b/tests/qemu-iotests/060.out
> @@ -97,7 +97,10 @@ read 512/512 bytes at offset 0
>   
>   === Testing overlap while COW is in flight ===
>   
> -Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
> +Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=1073741824
> +wrote 65536/65536 bytes at offset 65536
> +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 backing_file=TEST_DIR/t.IMGFMT.base
>   wrote 65536/65536 bytes at offset 0
>   64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   wrote 65536/65536 bytes at offset 536870912
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas
  2018-12-14 16:20   ` Vladimir Sementsov-Ogievskiy
@ 2018-12-17 10:17     ` Anton Nefedov
  0 siblings, 0 replies; 42+ messages in thread
From: Anton Nefedov @ 2018-12-17 10:17 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: qemu-block, kwolf, mreitz, eblake, Denis Lunev, berto



On 14/12/2018 7:20 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> If COW areas of the newly allocated clusters are zeroes on the backing image,
>> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
>> cluster instead of writing explicit zero buffers later in perform_cow().
>>
>> iotest 060:
>> write to the discarded cluster does not trigger COW anymore.
>> Use a backing image instead.
>>
> 
> [..]
> 
>> --- a/tests/qemu-iotests/060
>> +++ b/tests/qemu-iotests/060
>> @@ -150,27 +150,33 @@ $QEMU_IO -c "$OPEN_RO" -c "read -P 1 0 512" | _filter_qemu_io
>>    echo
>>    echo "=== Testing overlap while COW is in flight ==="
>>    echo
>> +BACKING_IMG=$TEST_IMG.base
>> +TEST_IMG=$BACKING_IMG _make_test_img 1G
>> +
>> +$QEMU_IO -c 'write 64k 64k' "$BACKING_IMG" | _filter_qemu_io
>> +
>>    # compat=0.10 is required in order to make the following discard actually
>> -# unallocate the sector rather than make it a zero sector - we want COW, after
>> -# all.
>> -IMGOPTS='compat=0.10' _make_test_img 1G
>> +# unallocate the sector rather than make it a zero sector as we would like
>> +# to reuse it for another guest offset
>> +IMGOPTS='compat=0.10' _make_test_img -b "$BACKING_IMG" 1G
>>    # Write two clusters, the second one enforces creation of an L2 table after
>>    # the first data cluster.
>>    $QEMU_IO -c 'write 0k 64k' -c 'write 512M 64k' "$TEST_IMG" | _filter_qemu_io
>> -# Discard the first cluster. This cluster will soon enough be reallocated and
>> -# used for COW.
>> +# Discard the first cluster. This cluster will soon enough be reallocated
>>    $QEMU_IO -c 'discard 0k 64k' "$TEST_IMG" | _filter_qemu_io
>>    # Now, corrupt the image by marking the second L2 table cluster as free.
>>    poke_file "$TEST_IMG" '131084' "\x00\x00" # 0x2000c
>> -# Start a write operation requiring COW on the image stopping it right before
>> -# doing the read; then, trigger the corruption prevention by writing anything to
>> -# any unallocated cluster, leading to an attempt to overwrite the second L2
>> +# Start a write operation requiring COW on the image;
>> +# this write will reuse the host offset released by a previous discard.
>> +# Stop it right before doing the read.
>> +# Then, trigger the corruption prevention by writing anything to
>> +# another unallocated cluster, leading to an attempt to overwrite the second L2
>>    # table. Finally, resume the COW write and see it fail (but not crash).
>>    echo "open -o file.driver=blkdebug $TEST_IMG
>>    break cow_read 0
>> -aio_write 0k 1k
>> +aio_write 64k 1k
>>    wait_break 0
>> -write 64k 64k
>> +write 128k 64k
> 
> don't understand why you need these changes.
> 
> works for me, without them, if write to backing at 0 offset, of course.
> 
> As I understand, discard create unallocated holes in top qcow2 for old qcow2 version.
> 

Ok, so COW happens regardless if this guest offset has been discarded
before. These offset changes are indeed not needed. Just the backing
file.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2018-12-17 10:18 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-03 10:14 [Qemu-devel] [PATCH v10 0/9] qcow2: cluster space preallocation Anton Nefedov
2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags Anton Nefedov
2018-12-05 12:43   ` Vladimir Sementsov-Ogievskiy
2018-12-05 13:27     ` Anton Nefedov
2018-12-07 14:31   ` Alberto Garcia
2018-12-12 12:15   ` Vladimir Sementsov-Ogievskiy
2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 2/9] blkverify: set " Anton Nefedov
2018-12-07 14:32   ` Alberto Garcia
2018-12-12 12:26   ` Vladimir Sementsov-Ogievskiy
2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags Anton Nefedov
2018-12-07 14:33   ` Alberto Garcia
2018-12-07 14:46     ` Anton Nefedov
2018-12-07 14:54       ` Alberto Garcia
2018-12-12 12:33   ` Vladimir Sementsov-Ogievskiy
2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 4/9] block: introduce BDRV_REQ_ALLOCATE flag Anton Nefedov
2018-12-05 12:59   ` Vladimir Sementsov-Ogievskiy
2018-12-05 13:38     ` Anton Nefedov
2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising Anton Nefedov
2018-12-05 13:14   ` Vladimir Sementsov-Ogievskiy
2018-12-05 14:01     ` Anton Nefedov
2018-12-12 12:48       ` Vladimir Sementsov-Ogievskiy
2018-12-13 11:57         ` Anton Nefedov
2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE Anton Nefedov
2018-12-05 13:25   ` Vladimir Sementsov-Ogievskiy
2018-12-05 14:11     ` Anton Nefedov
2018-12-12 17:19       ` Vladimir Sementsov-Ogievskiy
2018-12-13 12:01         ` Anton Nefedov
2018-12-07 15:09   ` Alberto Garcia
2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 7/9] block: support BDRV_REQ_ALLOCATE in passthrough drivers Anton Nefedov
2018-12-05 13:28   ` Vladimir Sementsov-Ogievskiy
2018-12-07 15:00   ` Alberto Garcia
2018-12-03 10:14 ` [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
2018-12-03 13:59   ` Alberto Garcia
2018-12-03 14:04     ` Anton Nefedov
2018-12-05 14:01   ` Vladimir Sementsov-Ogievskiy
2018-12-05 16:59     ` Anton Nefedov
2018-12-05 17:42       ` Vladimir Sementsov-Ogievskiy
2018-12-13 12:02   ` Vladimir Sementsov-Ogievskiy
2018-12-13 13:57     ` Anton Nefedov
2018-12-14 16:20   ` Vladimir Sementsov-Ogievskiy
2018-12-17 10:17     ` Anton Nefedov
2018-12-03 10:15 ` [Qemu-devel] [PATCH v10 9/9] iotest 134: test cluster-misaligned encrypted write Anton Nefedov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.