All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 00/10] preallocate filter
@ 2020-08-21 14:11 Vladimir Sementsov-Ogievskiy
  2020-08-21 14:11 ` [PATCH v5 01/10] block: simplify comment to BDRV_REQ_SERIALISING Vladimir Sementsov-Ogievskiy
                   ` (10 more replies)
  0 siblings, 11 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, den

Hi all!

Here is a filter, which does preallocation on write.

In Virtuozzo we have to deal with some custom distributed storage
solution, where allocation is relatively expensive operation. We have to
workaround it in Qemu, so here is a new filter.

v5: rewrite patch 08 as Nir suggested

v4:
01-04: add r-bs
05: add coroutine_fn tag
06: use QEMU_LOCK_GUARD and fix reqs_lock leak
07: grammar
08-10: add r-b

Vladimir Sementsov-Ogievskiy (10):
  block: simplify comment to BDRV_REQ_SERIALISING
  block/io.c: drop assertion on double waiting for request serialisation
  block/io: split out bdrv_find_conflicting_request
  block/io: bdrv_wait_serialising_requests_locked: drop extra bs arg
  block: bdrv_mark_request_serialising: split non-waiting function
  block: introduce BDRV_REQ_NO_WAIT flag
  block: introduce preallocate filter
  iotests.py: add verify_o_direct helper
  iotests.py: add filter_img_check
  iotests: add 298 to test new preallocate filter driver

 docs/system/qemu-block-drivers.rst.inc |  26 +++
 qapi/block-core.json                   |  20 +-
 include/block/block.h                  |  20 +-
 include/block/block_int.h              |   3 +-
 block/file-posix.c                     |   2 +-
 block/io.c                             | 130 ++++++-----
 block/preallocate.c                    | 291 +++++++++++++++++++++++++
 block/Makefile.objs                    |   1 +
 tests/qemu-iotests/298                 |  50 +++++
 tests/qemu-iotests/298.out             |   6 +
 tests/qemu-iotests/group               |   1 +
 tests/qemu-iotests/iotests.py          |  16 ++
 12 files changed, 498 insertions(+), 68 deletions(-)
 create mode 100644 block/preallocate.c
 create mode 100644 tests/qemu-iotests/298
 create mode 100644 tests/qemu-iotests/298.out

-- 
2.21.3



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v5 01/10] block: simplify comment to BDRV_REQ_SERIALISING
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
@ 2020-08-21 14:11 ` Vladimir Sementsov-Ogievskiy
  2020-08-21 14:11 ` [PATCH v5 02/10] block/io.c: drop assertion on double waiting for request serialisation Vladimir Sementsov-Ogievskiy
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, den

1. BDRV_REQ_NO_SERIALISING doesn't exist already, don't mention it.

2. We are going to add one more user of BDRV_REQ_SERIALISING, so
   comment about backup becomes a bit confusing here. The use case in
   backup is documented in block/backup.c, so let's just drop
   duplication here.

3. The fact that BDRV_REQ_SERIALISING is only for write requests is
   omitted. Add a note.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block.h | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 6e36154061..b8f4e86e8d 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -53,16 +53,7 @@ typedef enum {
      * content. */
     BDRV_REQ_WRITE_UNCHANGED    = 0x40,
 
-    /*
-     * BDRV_REQ_SERIALISING forces request serialisation for writes.
-     * It is used to ensure that writes to the backing file of a backup process
-     * target cannot race with a read of the backup target that defers to the
-     * backing file.
-     *
-     * Note, that BDRV_REQ_SERIALISING is _not_ opposite in meaning to
-     * BDRV_REQ_NO_SERIALISING. A more descriptive name for the latter might be
-     * _DO_NOT_WAIT_FOR_SERIALISING, except that is too long.
-     */
+    /* Forces request serialisation. Use only with write requests. */
     BDRV_REQ_SERIALISING        = 0x80,
 
     /* Execute the request only if the operation can be offloaded or otherwise
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v5 02/10] block/io.c: drop assertion on double waiting for request serialisation
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
  2020-08-21 14:11 ` [PATCH v5 01/10] block: simplify comment to BDRV_REQ_SERIALISING Vladimir Sementsov-Ogievskiy
@ 2020-08-21 14:11 ` Vladimir Sementsov-Ogievskiy
  2020-08-21 14:11 ` [PATCH v5 03/10] block/io: split out bdrv_find_conflicting_request Vladimir Sementsov-Ogievskiy
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, Paolo Bonzini, den

The comments states, that on misaligned request we should have already
been waiting. But for bdrv_padding_rmw_read, we called
bdrv_mark_request_serialising with align = request_alignment, and now
we serialise with align = cluster_size. So we may have to wait again
with larger alignment.

Note, that the only user of BDRV_REQ_SERIALISING is backup which issues
cluster-aligned requests, so seems the assertion should not fire for
now. But it's wrong anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/io.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/block/io.c b/block/io.c
index ad3a51ed53..b18680a842 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1881,7 +1881,6 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
                           BdrvTrackedRequest *req, int flags)
 {
     BlockDriverState *bs = child->bs;
-    bool waited;
     int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
 
     if (bs->read_only) {
@@ -1893,15 +1892,7 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
     assert(!(flags & ~BDRV_REQ_MASK));
 
     if (flags & BDRV_REQ_SERIALISING) {
-        waited = bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
-        /*
-         * For a misaligned request we should have already waited earlier,
-         * because we come after bdrv_padding_rmw_read which must be called
-         * with the request already marked as serialising.
-         */
-        assert(!waited ||
-               (req->offset == req->overlap_offset &&
-                req->bytes == req->overlap_bytes));
+        bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
     } else {
         bdrv_wait_serialising_requests(req);
     }
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v5 03/10] block/io: split out bdrv_find_conflicting_request
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
  2020-08-21 14:11 ` [PATCH v5 01/10] block: simplify comment to BDRV_REQ_SERIALISING Vladimir Sementsov-Ogievskiy
  2020-08-21 14:11 ` [PATCH v5 02/10] block/io.c: drop assertion on double waiting for request serialisation Vladimir Sementsov-Ogievskiy
@ 2020-08-21 14:11 ` Vladimir Sementsov-Ogievskiy
  2020-08-21 14:11 ` [PATCH v5 04/10] block/io: bdrv_wait_serialising_requests_locked: drop extra bs arg Vladimir Sementsov-Ogievskiy
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, den

To be reused in separate.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c | 71 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 41 insertions(+), 30 deletions(-)

diff --git a/block/io.c b/block/io.c
index b18680a842..5b96715058 100644
--- a/block/io.c
+++ b/block/io.c
@@ -727,43 +727,54 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
     return true;
 }
 
+/* Called with self->bs->reqs_lock held */
+static BdrvTrackedRequest *
+bdrv_find_conflicting_request(BdrvTrackedRequest *self)
+{
+    BdrvTrackedRequest *req;
+
+    QLIST_FOREACH(req, &self->bs->tracked_requests, list) {
+        if (req == self || (!req->serialising && !self->serialising)) {
+            continue;
+        }
+        if (tracked_request_overlaps(req, self->overlap_offset,
+                                     self->overlap_bytes))
+        {
+            /*
+             * Hitting this means there was a reentrant request, for
+             * example, a block driver issuing nested requests.  This must
+             * never happen since it means deadlock.
+             */
+            assert(qemu_coroutine_self() != req->co);
+
+            /*
+             * If the request is already (indirectly) waiting for us, or
+             * will wait for us as soon as it wakes up, then just go on
+             * (instead of producing a deadlock in the former case).
+             */
+            if (!req->waiting_for) {
+                return req;
+            }
+        }
+    }
+
+    return NULL;
+}
+
 static bool coroutine_fn
 bdrv_wait_serialising_requests_locked(BlockDriverState *bs,
                                       BdrvTrackedRequest *self)
 {
     BdrvTrackedRequest *req;
-    bool retry;
     bool waited = false;
 
-    do {
-        retry = false;
-        QLIST_FOREACH(req, &bs->tracked_requests, list) {
-            if (req == self || (!req->serialising && !self->serialising)) {
-                continue;
-            }
-            if (tracked_request_overlaps(req, self->overlap_offset,
-                                         self->overlap_bytes))
-            {
-                /* Hitting this means there was a reentrant request, for
-                 * example, a block driver issuing nested requests.  This must
-                 * never happen since it means deadlock.
-                 */
-                assert(qemu_coroutine_self() != req->co);
-
-                /* If the request is already (indirectly) waiting for us, or
-                 * will wait for us as soon as it wakes up, then just go on
-                 * (instead of producing a deadlock in the former case). */
-                if (!req->waiting_for) {
-                    self->waiting_for = req;
-                    qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
-                    self->waiting_for = NULL;
-                    retry = true;
-                    waited = true;
-                    break;
-                }
-            }
-        }
-    } while (retry);
+    while ((req = bdrv_find_conflicting_request(self))) {
+        self->waiting_for = req;
+        qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
+        self->waiting_for = NULL;
+        waited = true;
+    }
+
     return waited;
 }
 
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v5 04/10] block/io: bdrv_wait_serialising_requests_locked: drop extra bs arg
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2020-08-21 14:11 ` [PATCH v5 03/10] block/io: split out bdrv_find_conflicting_request Vladimir Sementsov-Ogievskiy
@ 2020-08-21 14:11 ` Vladimir Sementsov-Ogievskiy
  2020-08-21 14:11 ` [PATCH v5 05/10] block: bdrv_mark_request_serialising: split non-waiting function Vladimir Sementsov-Ogievskiy
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, den

bs is linked in req, so no needs to pass it separately. Most of
tracked-requests API doesn't have bs argument. Actually, after this
patch only tracked_request_begin has it, but it's for purpose.

While being here, also add a comment about what "_locked" is.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index 5b96715058..36bbe4b9b1 100644
--- a/block/io.c
+++ b/block/io.c
@@ -761,16 +761,16 @@ bdrv_find_conflicting_request(BdrvTrackedRequest *self)
     return NULL;
 }
 
+/* Called with self->bs->reqs_lock held */
 static bool coroutine_fn
-bdrv_wait_serialising_requests_locked(BlockDriverState *bs,
-                                      BdrvTrackedRequest *self)
+bdrv_wait_serialising_requests_locked(BdrvTrackedRequest *self)
 {
     BdrvTrackedRequest *req;
     bool waited = false;
 
     while ((req = bdrv_find_conflicting_request(self))) {
         self->waiting_for = req;
-        qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
+        qemu_co_queue_wait(&req->wait_queue, &self->bs->reqs_lock);
         self->waiting_for = NULL;
         waited = true;
     }
@@ -794,7 +794,7 @@ bool bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
 
     req->overlap_offset = MIN(req->overlap_offset, overlap_offset);
     req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
-    waited = bdrv_wait_serialising_requests_locked(bs, req);
+    waited = bdrv_wait_serialising_requests_locked(req);
     qemu_co_mutex_unlock(&bs->reqs_lock);
     return waited;
 }
@@ -876,7 +876,7 @@ static bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self
     }
 
     qemu_co_mutex_lock(&bs->reqs_lock);
-    waited = bdrv_wait_serialising_requests_locked(bs, self);
+    waited = bdrv_wait_serialising_requests_locked(self);
     qemu_co_mutex_unlock(&bs->reqs_lock);
 
     return waited;
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v5 05/10] block: bdrv_mark_request_serialising: split non-waiting function
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2020-08-21 14:11 ` [PATCH v5 04/10] block/io: bdrv_wait_serialising_requests_locked: drop extra bs arg Vladimir Sementsov-Ogievskiy
@ 2020-08-21 14:11 ` Vladimir Sementsov-Ogievskiy
  2020-08-25 12:43   ` Max Reitz
  2020-08-21 14:11 ` [PATCH v5 06/10] block: introduce BDRV_REQ_NO_WAIT flag Vladimir Sementsov-Ogievskiy
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, den

We'll need a separate function, which will only "mark" request
serialising with specified align but not wait for conflicting
requests. So, it will be like old bdrv_mark_request_serialising(),
before merging bdrv_wait_serialising_requests_locked() into it.

To reduce the possible mess, let's do the following:

Public function that does both marking and waiting will be called
bdrv_make_request_serialising, and private function which will only
"mark" will be called tracked_request_set_serialising().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block_int.h |  3 ++-
 block/file-posix.c        |  2 +-
 block/io.c                | 35 +++++++++++++++++++++++------------
 3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 38dec0275b..4d56a1b141 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1039,7 +1039,8 @@ extern unsigned int bdrv_drain_all_count;
 void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
 void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
 
-bool coroutine_fn bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align);
+bool coroutine_fn bdrv_make_request_serialising(BdrvTrackedRequest *req,
+                                                uint64_t align);
 BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState *bs);
 
 int get_tmp_filename(char *filename, int size);
diff --git a/block/file-posix.c b/block/file-posix.c
index 9a00d4190a..560d1c0a94 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2957,7 +2957,7 @@ raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int bytes,
         req->bytes = end - req->offset;
         req->overlap_bytes = req->bytes;
 
-        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
+        bdrv_make_request_serialising(req, bs->bl.request_alignment);
     }
 #endif
 
diff --git a/block/io.c b/block/io.c
index 36bbe4b9b1..dd28befb08 100644
--- a/block/io.c
+++ b/block/io.c
@@ -778,15 +778,14 @@ bdrv_wait_serialising_requests_locked(BdrvTrackedRequest *self)
     return waited;
 }
 
-bool bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
+/* Called with req->bs->reqs_lock held */
+static void tracked_request_set_serialising(BdrvTrackedRequest *req,
+                                            uint64_t align)
 {
-    BlockDriverState *bs = req->bs;
     int64_t overlap_offset = req->offset & ~(align - 1);
     uint64_t overlap_bytes = ROUND_UP(req->offset + req->bytes, align)
                                - overlap_offset;
-    bool waited;
 
-    qemu_co_mutex_lock(&bs->reqs_lock);
     if (!req->serialising) {
         atomic_inc(&req->bs->serialising_in_flight);
         req->serialising = true;
@@ -794,9 +793,6 @@ bool bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
 
     req->overlap_offset = MIN(req->overlap_offset, overlap_offset);
     req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
-    waited = bdrv_wait_serialising_requests_locked(req);
-    qemu_co_mutex_unlock(&bs->reqs_lock);
-    return waited;
 }
 
 /**
@@ -882,6 +878,21 @@ static bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self
     return waited;
 }
 
+bool coroutine_fn bdrv_make_request_serialising(BdrvTrackedRequest *req,
+                                                uint64_t align)
+{
+    bool waited;
+
+    qemu_co_mutex_lock(&req->bs->reqs_lock);
+
+    tracked_request_set_serialising(req, align);
+    waited = bdrv_wait_serialising_requests_locked(req);
+
+    qemu_co_mutex_unlock(&req->bs->reqs_lock);
+
+    return waited;
+}
+
 static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
                                    size_t size)
 {
@@ -1492,7 +1503,7 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
          * with each other for the same cluster.  For example, in copy-on-read
          * it ensures that the CoR read and write operations are atomic and
          * guest writes cannot interleave between them. */
-        bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
+        bdrv_make_request_serialising(req, bdrv_get_cluster_size(bs));
     } else {
         bdrv_wait_serialising_requests(req);
     }
@@ -1903,7 +1914,7 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
     assert(!(flags & ~BDRV_REQ_MASK));
 
     if (flags & BDRV_REQ_SERIALISING) {
-        bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
+        bdrv_make_request_serialising(req, bdrv_get_cluster_size(bs));
     } else {
         bdrv_wait_serialising_requests(req);
     }
@@ -2069,7 +2080,7 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
 
     padding = bdrv_init_padding(bs, offset, bytes, &pad);
     if (padding) {
-        bdrv_mark_request_serialising(req, align);
+        bdrv_make_request_serialising(req, align);
 
         bdrv_padding_rmw_read(child, req, &pad, true);
 
@@ -2183,7 +2194,7 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
     }
 
     if (bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad)) {
-        bdrv_mark_request_serialising(&req, align);
+        bdrv_make_request_serialising(&req, align);
         bdrv_padding_rmw_read(child, &req, &pad, false);
     }
 
@@ -3347,7 +3358,7 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
      * new area, we need to make sure that no write requests are made to it
      * concurrently or they might be overwritten by preallocation. */
     if (new_bytes) {
-        bdrv_mark_request_serialising(&req, 1);
+        bdrv_make_request_serialising(&req, 1);
     }
     if (bs->read_only) {
         error_setg(errp, "Image is read-only");
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v5 06/10] block: introduce BDRV_REQ_NO_WAIT flag
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (4 preceding siblings ...)
  2020-08-21 14:11 ` [PATCH v5 05/10] block: bdrv_mark_request_serialising: split non-waiting function Vladimir Sementsov-Ogievskiy
@ 2020-08-21 14:11 ` Vladimir Sementsov-Ogievskiy
  2020-08-25 13:10   ` Max Reitz
  2020-08-21 14:11 ` [PATCH v5 07/10] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, den

Add flag to make serialising request no wait: if there are conflicting
requests, just return error immediately. It's will be used in upcoming
preallocate filter.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block.h |  9 ++++++++-
 block/io.c            | 11 ++++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index b8f4e86e8d..877fda06a4 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -67,8 +67,15 @@ typedef enum {
      * written to qiov parameter which may be NULL.
      */
     BDRV_REQ_PREFETCH  = 0x200,
+
+    /*
+     * If we need to wait for other requests, just fail immediately. Used
+     * only together with BDRV_REQ_SERIALISING.
+     */
+    BDRV_REQ_NO_WAIT = 0x400,
+
     /* Mask of valid flags */
-    BDRV_REQ_MASK               = 0x3ff,
+    BDRV_REQ_MASK               = 0x7ff,
 } BdrvRequestFlags;
 
 typedef struct BlockSizes {
diff --git a/block/io.c b/block/io.c
index dd28befb08..c93b1e98a3 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1912,9 +1912,18 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
     assert(!(bs->open_flags & BDRV_O_INACTIVE));
     assert((bs->open_flags & BDRV_O_NO_IO) == 0);
     assert(!(flags & ~BDRV_REQ_MASK));
+    assert(!((flags & BDRV_REQ_NO_WAIT) && !(flags & BDRV_REQ_SERIALISING)));
 
     if (flags & BDRV_REQ_SERIALISING) {
-        bdrv_make_request_serialising(req, bdrv_get_cluster_size(bs));
+        QEMU_LOCK_GUARD(&bs->reqs_lock);
+
+        tracked_request_set_serialising(req, bdrv_get_cluster_size(bs));
+
+        if ((flags & BDRV_REQ_NO_WAIT) && bdrv_find_conflicting_request(req)) {
+            return -EBUSY;
+        }
+
+        bdrv_wait_serialising_requests_locked(req);
     } else {
         bdrv_wait_serialising_requests(req);
     }
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (5 preceding siblings ...)
  2020-08-21 14:11 ` [PATCH v5 06/10] block: introduce BDRV_REQ_NO_WAIT flag Vladimir Sementsov-Ogievskiy
@ 2020-08-21 14:11 ` Vladimir Sementsov-Ogievskiy
  2020-08-24 17:52   ` Vladimir Sementsov-Ogievskiy
  2020-08-25 15:11   ` Max Reitz
  2020-08-21 14:11 ` [PATCH v5 08/10] iotests.py: add verify_o_direct helper Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, den

It's intended to be inserted between format and protocol nodes to
preallocate additional space (expanding protocol file) on writes
crossing EOF. It improves performance for file-systems with slow
allocation.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 docs/system/qemu-block-drivers.rst.inc |  26 +++
 qapi/block-core.json                   |  20 +-
 block/preallocate.c                    | 291 +++++++++++++++++++++++++
 block/Makefile.objs                    |   1 +
 4 files changed, 337 insertions(+), 1 deletion(-)
 create mode 100644 block/preallocate.c

diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
index b052a6d14e..5e8a35c571 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -952,3 +952,29 @@ on host and see if there are locks held by the QEMU process on the image file.
 More than one byte could be locked by the QEMU instance, each byte of which
 reflects a particular permission that is acquired or protected by the running
 block driver.
+
+Filter drivers
+~~~~~~~~~~~~~~
+
+QEMU supports several filter drivers, which don't store any data, but do some
+additional tasks, hooking io requests.
+
+.. program:: filter-drivers
+.. option:: preallocate
+
+  Preallocate filter driver is intended to be inserted between format
+  and protocol nodes and does preallocation of some additional space
+  (expanding the protocol file) on write. It may be used for
+  file-systems with slow allocation.
+
+  Supported options:
+
+  .. program:: preallocate
+  .. option:: prealloc-align
+
+    On preallocation, align file length to this number, default 1M.
+
+  .. program:: preallocate
+  .. option:: prealloc-size
+
+    How much to preallocate, default 128M.
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 197bdc1c36..b40448063b 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2805,7 +2805,7 @@
             'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
             'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
-            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
+            'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
             { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
             'sheepdog',
             'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
@@ -3074,6 +3074,23 @@
   'data': { 'aes': 'QCryptoBlockOptionsQCow',
             'luks': 'QCryptoBlockOptionsLUKS'} }
 
+##
+# @BlockdevOptionsPreallocate:
+#
+# Filter driver intended to be inserted between format and protocol node
+# and do preallocation in protocol node on write.
+#
+# @prealloc-align: on preallocation, align file length to this number,
+#                 default 1048576 (1M)
+#
+# @prealloc-size: how much to preallocate, default 134217728 (128M)
+#
+# Since: 5.2
+##
+{ 'struct': 'BlockdevOptionsPreallocate',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { '*prealloc-align': 'int', '*prealloc-size': 'int' } }
+
 ##
 # @BlockdevOptionsQcow2:
 #
@@ -3979,6 +3996,7 @@
       'null-co':    'BlockdevOptionsNull',
       'nvme':       'BlockdevOptionsNVMe',
       'parallels':  'BlockdevOptionsGenericFormat',
+      'preallocate':'BlockdevOptionsPreallocate',
       'qcow2':      'BlockdevOptionsQcow2',
       'qcow':       'BlockdevOptionsQcow',
       'qed':        'BlockdevOptionsGenericCOWFormat',
diff --git a/block/preallocate.c b/block/preallocate.c
new file mode 100644
index 0000000000..bdf54dbd2f
--- /dev/null
+++ b/block/preallocate.c
@@ -0,0 +1,291 @@
+/*
+ * preallocate filter driver
+ *
+ * The driver performs preallocate operation: it is injected above
+ * some node, and before each write over EOF it does additional preallocating
+ * write-zeroes request.
+ *
+ * Copyright (c) 2020 Virtuozzo International GmbH.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qapi/error.h"
+#include "qemu/module.h"
+#include "qemu/option.h"
+#include "qemu/units.h"
+#include "block/block_int.h"
+
+
+typedef struct BDRVPreallocateState {
+    int64_t prealloc_size;
+    int64_t prealloc_align;
+
+    /*
+     * Filter is started as not-active, so it doesn't do any preallocations nor
+     * requires BLK_PERM_RESIZE on its child. This is needed to create filter
+     * above another node-child and then do bdrv_replace_node to insert the
+     * filter.
+     *
+     * Filter becomes active the first time it detects that its parents have
+     * BLK_PERM_RESIZE on it.
+     *
+     * Filter becomes active forever: it doesn't lose active status if parents
+     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink the file on
+     * filter close.
+     */
+    bool active;
+
+    /*
+     * Track real data end, to crop preallocation on close  data_end may be
+     * negative, which means that actual status is unknown (nothing cropped in
+     * this case)
+     */
+    int64_t data_end;
+} BDRVPreallocateState;
+
+#define PREALLOCATE_OPT_PREALLOC_ALIGN "prealloc-align"
+#define PREALLOCATE_OPT_PREALLOC_SIZE "prealloc-size"
+static QemuOptsList runtime_opts = {
+    .name = "preallocate",
+    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
+    .desc = {
+        {
+            .name = PREALLOCATE_OPT_PREALLOC_ALIGN,
+            .type = QEMU_OPT_SIZE,
+            .help = "on preallocation, align file length to this number, "
+                "default 1M",
+        },
+        {
+            .name = PREALLOCATE_OPT_PREALLOC_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "how much to preallocate, default 128M",
+        },
+        { /* end of list */ }
+    },
+};
+
+static int preallocate_open(BlockDriverState *bs, QDict *options, int flags,
+                            Error **errp)
+{
+    QemuOpts *opts;
+    BDRVPreallocateState *s = bs->opaque;
+
+    /*
+     * Parameters are hardcoded now. May need to add corresponding options in
+     * future.
+     */
+    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &error_abort);
+    s->prealloc_align =
+        qemu_opt_get_size(opts, PREALLOCATE_OPT_PREALLOC_ALIGN, 1 * MiB);
+    s->prealloc_size =
+        qemu_opt_get_size(opts, PREALLOCATE_OPT_PREALLOC_SIZE, 128 * MiB);
+    qemu_opts_del(opts);
+
+    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
+                               BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
+                               false, errp);
+    if (!bs->file) {
+        return -EINVAL;
+    }
+
+    s->data_end = bdrv_getlength(bs->file->bs);
+    if (s->data_end < 0) {
+        return s->data_end;
+    }
+
+    bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
+        (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
+
+    bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
+        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
+            bs->file->bs->supported_zero_flags);
+
+    return 0;
+}
+
+static void preallocate_close(BlockDriverState *bs)
+{
+    BDRVPreallocateState *s = bs->opaque;
+
+    if (s->active && s->data_end >= 0 &&
+        bdrv_getlength(bs->file->bs) > s->data_end)
+    {
+        bdrv_truncate(bs->file, s->data_end, true, PREALLOC_MODE_OFF, 0, NULL);
+    }
+}
+
+static void preallocate_child_perm(BlockDriverState *bs, BdrvChild *c,
+                                   BdrvChildRole role,
+                                   BlockReopenQueue *reopen_queue,
+                                   uint64_t perm, uint64_t shared,
+                                   uint64_t *nperm, uint64_t *nshared)
+{
+    BDRVPreallocateState *s = bs->opaque;
+
+    bdrv_default_perms(bs, c, role, reopen_queue, perm, shared, nperm, nshared);
+
+    s->active = s->active || (perm & BLK_PERM_RESIZE);
+
+    if (s->active) {
+        /* Force RESIZE permission, to be able to crop file on close() */
+        *nperm |= BLK_PERM_RESIZE;
+    }
+}
+
+static coroutine_fn int preallocate_co_preadv_part(
+        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+        QEMUIOVector *qiov, size_t qiov_offset, int flags)
+{
+    return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
+                               flags);
+}
+
+static int coroutine_fn preallocate_co_pdiscard(BlockDriverState *bs,
+                                               int64_t offset, int bytes)
+{
+    return bdrv_co_pdiscard(bs->file, offset, bytes);
+}
+
+static bool coroutine_fn do_preallocate(BlockDriverState *bs, int64_t offset,
+                                       int64_t bytes, bool write_zero)
+{
+    BDRVPreallocateState *s = bs->opaque;
+    int64_t len, start, end;
+
+    if (!s->active) {
+        return false;
+    }
+
+    if (s->data_end >= 0) {
+        s->data_end = MAX(s->data_end,
+                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
+    }
+
+    len = bdrv_getlength(bs->file->bs);
+    if (len < 0) {
+        return false;
+    }
+
+    if (s->data_end < 0) {
+        s->data_end = MAX(len,
+                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
+    }
+
+    if (offset + bytes <= len) {
+        return false;
+    }
+
+    start = write_zero ? MIN(offset, len) : len;
+    end = QEMU_ALIGN_UP(offset + bytes + s->prealloc_size, s->prealloc_align);
+
+    return !bdrv_co_pwrite_zeroes(bs->file, start, end - start,
+            BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
+}
+
+static int coroutine_fn preallocate_co_pwrite_zeroes(BlockDriverState *bs,
+        int64_t offset, int bytes, BdrvRequestFlags flags)
+{
+    if (do_preallocate(bs, offset, bytes, true)) {
+        return 0;
+    }
+
+    return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
+}
+
+static coroutine_fn int preallocate_co_pwritev_part(BlockDriverState *bs,
+                                                    uint64_t offset,
+                                                    uint64_t bytes,
+                                                    QEMUIOVector *qiov,
+                                                    size_t qiov_offset,
+                                                    int flags)
+{
+    do_preallocate(bs, offset, bytes, false);
+
+    return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
+                                flags);
+}
+
+static int coroutine_fn
+preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
+                        bool exact, PreallocMode prealloc,
+                        BdrvRequestFlags flags, Error **errp)
+{
+    BDRVPreallocateState *s = bs->opaque;
+    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc, flags, errp);
+
+    /* s->data_end may become negative here, which means unknown data end */
+    s->data_end = bdrv_getlength(bs->file->bs);
+
+    return ret;
+}
+
+static int coroutine_fn preallocate_co_flush(BlockDriverState *bs)
+{
+    return bdrv_co_flush(bs->file->bs);
+}
+
+static int64_t preallocate_getlength(BlockDriverState *bs)
+{
+    /*
+     * We probably can return s->data_end here, but seems safer to return real
+     * file length, not trying to hide the preallocation.
+     *
+     * Still, don't miss the chance to restore s->data_end if it is broken.
+     */
+    BDRVPreallocateState *s = bs->opaque;
+    int64_t ret = bdrv_getlength(bs->file->bs);
+
+    if (s->data_end < 0) {
+        s->data_end = ret;
+    }
+
+    return ret;
+}
+
+BlockDriver bdrv_preallocate_filter = {
+    .format_name = "preallocate",
+    .instance_size = sizeof(BDRVPreallocateState),
+
+    .bdrv_getlength = preallocate_getlength,
+    .bdrv_open = preallocate_open,
+    .bdrv_close = preallocate_close,
+
+    .bdrv_co_preadv_part = preallocate_co_preadv_part,
+    .bdrv_co_pwritev_part = preallocate_co_pwritev_part,
+    .bdrv_co_pwrite_zeroes = preallocate_co_pwrite_zeroes,
+    .bdrv_co_pdiscard = preallocate_co_pdiscard,
+    .bdrv_co_flush = preallocate_co_flush,
+    .bdrv_co_truncate = preallocate_co_truncate,
+
+    .bdrv_co_block_status = bdrv_co_block_status_from_file,
+
+    .bdrv_child_perm = preallocate_child_perm,
+
+    .has_variable_length = true,
+    .is_filter = true,
+};
+
+static void bdrv_preallocate_init(void)
+{
+    bdrv_register(&bdrv_preallocate_filter);
+}
+
+block_init(bdrv_preallocate_init);
diff --git a/block/Makefile.objs b/block/Makefile.objs
index 19c6f371c9..f8e6f16522 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -44,6 +44,7 @@ block-obj-y += crypto.o
 block-obj-y += aio_task.o
 block-obj-y += backup-top.o
 block-obj-y += filter-compress.o
+block-obj-y += preallocate.o
 common-obj-y += monitor/
 block-obj-y += monitor/
 
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v5 08/10] iotests.py: add verify_o_direct helper
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (6 preceding siblings ...)
  2020-08-21 14:11 ` [PATCH v5 07/10] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
@ 2020-08-21 14:11 ` Vladimir Sementsov-Ogievskiy
  2020-08-21 17:29   ` Nir Soffer
  2020-08-21 14:11 ` [PATCH v5 09/10] iotests.py: add filter_img_check Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, den

Add python notrun-helper similar to _check_o_direct for bash tests.
To be used in the following commit.

Suggested-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 tests/qemu-iotests/iotests.py | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 717b5b652c..7f1aa187a9 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -29,6 +29,7 @@ import struct
 import subprocess
 import sys
 import time
+import errno
 from typing import (Any, Callable, Dict, Iterable,
                     List, Optional, Sequence, Tuple, TypeVar)
 import unittest
@@ -1083,6 +1084,17 @@ def _verify_aio_mode(supported_aio_modes: Sequence[str] = ()) -> None:
     if supported_aio_modes and (aiomode not in supported_aio_modes):
         notrun('not suitable for this aio mode: %s' % aiomode)
 
+def verify_o_direct() -> None:
+    with FilePath('test_o_direct') as f:
+        try:
+            fd = os.open(f, os.O_DIRECT | os.O_CREAT | os.O_RDWR)
+        except OSError as e:
+            if e.errno != errno.EINVAL:
+                raise
+            notrun(f'file system at {test_dir} does not support O_DIRECT')
+        else:
+            os.close(fd)
+
 def supports_quorum():
     return 'quorum' in qemu_img_pipe('--help')
 
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v5 09/10] iotests.py: add filter_img_check
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (7 preceding siblings ...)
  2020-08-21 14:11 ` [PATCH v5 08/10] iotests.py: add verify_o_direct helper Vladimir Sementsov-Ogievskiy
@ 2020-08-21 14:11 ` Vladimir Sementsov-Ogievskiy
  2020-08-21 14:11 ` [PATCH v5 10/10] iotests: add 298 to test new preallocate filter driver Vladimir Sementsov-Ogievskiy
  2020-08-27 21:08 ` [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
  10 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, den

Add analog of bash _filter_qemu_img_check to python framework.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/qemu-iotests/iotests.py | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 7f1aa187a9..14f1d47d52 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -403,6 +403,10 @@ def filter_img_info(output, filename):
         lines.append(line)
     return '\n'.join(lines)
 
+def filter_img_check(msg):
+    msg = re.sub(r'.*allocated.*fragmented.*compressed clusters', '', msg)
+    return re.sub(r'Image end offset: [0-9]+', '', msg).strip()
+
 def filter_imgfmt(msg):
     return msg.replace(imgfmt, 'IMGFMT')
 
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v5 10/10] iotests: add 298 to test new preallocate filter driver
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (8 preceding siblings ...)
  2020-08-21 14:11 ` [PATCH v5 09/10] iotests.py: add filter_img_check Vladimir Sementsov-Ogievskiy
@ 2020-08-21 14:11 ` Vladimir Sementsov-Ogievskiy
  2020-08-27 21:08 ` [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
  10 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-21 14:11 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, armbru, qemu-devel, mreitz, nsoffer,
	stefanha, den

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/qemu-iotests/298     | 50 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/298.out |  6 +++++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 57 insertions(+)
 create mode 100644 tests/qemu-iotests/298
 create mode 100644 tests/qemu-iotests/298.out

diff --git a/tests/qemu-iotests/298 b/tests/qemu-iotests/298
new file mode 100644
index 0000000000..4f2087352a
--- /dev/null
+++ b/tests/qemu-iotests/298
@@ -0,0 +1,50 @@
+#!/usr/bin/env python3
+#
+# Test for preallocate filter
+#
+# Copyright (c) 2020 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import log
+
+iotests.script_initialize(supported_fmts=['qcow2'])
+iotests.verify_o_direct()
+
+size = 10 * 1024 * 1024
+disk = iotests.file_path('disk')
+
+iotests.qemu_img_create('-f', iotests.imgfmt, disk, str(size))
+
+opts = f'driver={iotests.imgfmt},' \
+    f'file.driver=preallocate,file.file.filename={disk}'
+p = iotests.QemuIoInteractive('--image-opts', '-t', 'none', opts)
+
+log(p.cmd('write 0 1M'), filters=[iotests.filter_qemu_io])
+p.cmd('flush')
+
+if os.path.getsize(disk) > 100 * 1024 * 1024:
+    log('file in progress is big, preallocation works')
+
+p.close()
+
+if os.path.getsize(disk) < 10 * 1024 * 1024:
+    log('file is small after close')
+
+# Check that there are no leaks.
+log(iotests.qemu_img_pipe('check', '-f', 'qcow2', disk),
+    filters=[iotests.filter_img_check])
diff --git a/tests/qemu-iotests/298.out b/tests/qemu-iotests/298.out
new file mode 100644
index 0000000000..baf8f8425c
--- /dev/null
+++ b/tests/qemu-iotests/298.out
@@ -0,0 +1,6 @@
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+file in progress is big, preallocation works
+file is small after close
+No errors were found on the image.
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 7f76066640..cdcde2fe48 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -306,6 +306,7 @@
 295 rw
 296 rw
 297 meta
+298 auto quick
 299 auto quick
 301 backing quick
 302 quick
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 08/10] iotests.py: add verify_o_direct helper
  2020-08-21 14:11 ` [PATCH v5 08/10] iotests.py: add verify_o_direct helper Vladimir Sementsov-Ogievskiy
@ 2020-08-21 17:29   ` Nir Soffer
  0 siblings, 0 replies; 30+ messages in thread
From: Nir Soffer @ 2020-08-21 17:29 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Fam Zheng, Kevin Wolf, qemu-block, QEMU Developers,
	Markus Armbruster, Stefan Hajnoczi, den, Max Reitz

On Fri, Aug 21, 2020 at 5:12 PM Vladimir Sementsov-Ogievskiy
<vsementsov@virtuozzo.com> wrote:
>
> Add python notrun-helper similar to _check_o_direct for bash tests.
> To be used in the following commit.
>
> Suggested-by: Nir Soffer <nsoffer@redhat.com>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  tests/qemu-iotests/iotests.py | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
> index 717b5b652c..7f1aa187a9 100644
> --- a/tests/qemu-iotests/iotests.py
> +++ b/tests/qemu-iotests/iotests.py
> @@ -29,6 +29,7 @@ import struct
>  import subprocess
>  import sys
>  import time
> +import errno
>  from typing import (Any, Callable, Dict, Iterable,
>                      List, Optional, Sequence, Tuple, TypeVar)
>  import unittest
> @@ -1083,6 +1084,17 @@ def _verify_aio_mode(supported_aio_modes: Sequence[str] = ()) -> None:
>      if supported_aio_modes and (aiomode not in supported_aio_modes):
>          notrun('not suitable for this aio mode: %s' % aiomode)
>
> +def verify_o_direct() -> None:
> +    with FilePath('test_o_direct') as f:
> +        try:
> +            fd = os.open(f, os.O_DIRECT | os.O_CREAT | os.O_RDWR)
> +        except OSError as e:
> +            if e.errno != errno.EINVAL:
> +                raise
> +            notrun(f'file system at {test_dir} does not support O_DIRECT')
> +        else:
> +            os.close(fd)
> +
>  def supports_quorum():
>      return 'quorum' in qemu_img_pipe('--help')
>
> --
> 2.21.3
>

Reviewed-by: Nir Soffer <nsoffer@redhat.com>



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-21 14:11 ` [PATCH v5 07/10] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
@ 2020-08-24 17:52   ` Vladimir Sementsov-Ogievskiy
  2020-08-26  7:06     ` Vladimir Sementsov-Ogievskiy
  2020-08-25 15:11   ` Max Reitz
  1 sibling, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-24 17:52 UTC (permalink / raw)
  To: qemu-block; +Cc: fam, kwolf, armbru, qemu-devel, mreitz, nsoffer, stefanha, den

21.08.2020 17:11, Vladimir Sementsov-Ogievskiy wrote:
> It's intended to be inserted between format and protocol nodes to
> preallocate additional space (expanding protocol file) on writes
> crossing EOF. It improves performance for file-systems with slow
> allocation.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   docs/system/qemu-block-drivers.rst.inc |  26 +++
>   qapi/block-core.json                   |  20 +-
>   block/preallocate.c                    | 291 +++++++++++++++++++++++++
>   block/Makefile.objs                    |   1 +
>   4 files changed, 337 insertions(+), 1 deletion(-)
>   create mode 100644 block/preallocate.c
> 
> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
> index b052a6d14e..5e8a35c571 100644
> --- a/docs/system/qemu-block-drivers.rst.inc
> +++ b/docs/system/qemu-block-drivers.rst.inc
> @@ -952,3 +952,29 @@ on host and see if there are locks held by the QEMU process on the image file.
>   More than one byte could be locked by the QEMU instance, each byte of which
>   reflects a particular permission that is acquired or protected by the running
>   block driver.
> +

[..]

> +
> +static coroutine_fn int preallocate_co_preadv_part(
> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
> +        QEMUIOVector *qiov, size_t qiov_offset, int flags)
> +{
> +    return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
> +                               flags);
> +}
> +
> +static int coroutine_fn preallocate_co_pdiscard(BlockDriverState *bs,
> +                                               int64_t offset, int bytes)
> +{
> +    return bdrv_co_pdiscard(bs->file, offset, bytes);
> +}
> +
> +static bool coroutine_fn do_preallocate(BlockDriverState *bs, int64_t offset,
> +                                       int64_t bytes, bool write_zero)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +    int64_t len, start, end;

int64_t old_data_end = s->data_end;

> +
> +    if (!s->active) {
> +        return false;
> +    }
> +
> +    if (s->data_end >= 0) {
> +        s->data_end = MAX(s->data_end,
> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
> +    }
> +
> +    len = bdrv_getlength(bs->file->bs);
> +    if (len < 0) {
> +        return false;

return old_data_end >= 0 && offset >= old_data_end;


And with this small improvement we make the following test 20% faster (ssd, ext4):

./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 150000 -d 64 -f qcow2  -s 16k -t none -n -w $img;

(assume additional patch which inserts the filter).

Great! So, preallocation is generally good idea, not only for vstorage.

What about inserting the filter by default?

I'll come tomorrow with more complete test results.



-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 05/10] block: bdrv_mark_request_serialising: split non-waiting function
  2020-08-21 14:11 ` [PATCH v5 05/10] block: bdrv_mark_request_serialising: split non-waiting function Vladimir Sementsov-Ogievskiy
@ 2020-08-25 12:43   ` Max Reitz
  0 siblings, 0 replies; 30+ messages in thread
From: Max Reitz @ 2020-08-25 12:43 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den


[-- Attachment #1.1: Type: text/plain, Size: 924 bytes --]

On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
> We'll need a separate function, which will only "mark" request
> serialising with specified align but not wait for conflicting
> requests. So, it will be like old bdrv_mark_request_serialising(),
> before merging bdrv_wait_serialising_requests_locked() into it.
> 
> To reduce the possible mess, let's do the following:
> 
> Public function that does both marking and waiting will be called
> bdrv_make_request_serialising, and private function which will only
> "mark" will be called tracked_request_set_serialising().
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/block/block_int.h |  3 ++-
>  block/file-posix.c        |  2 +-
>  block/io.c                | 35 +++++++++++++++++++++++------------
>  3 files changed, 26 insertions(+), 14 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 06/10] block: introduce BDRV_REQ_NO_WAIT flag
  2020-08-21 14:11 ` [PATCH v5 06/10] block: introduce BDRV_REQ_NO_WAIT flag Vladimir Sementsov-Ogievskiy
@ 2020-08-25 13:10   ` Max Reitz
  2020-08-26  6:26     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 30+ messages in thread
From: Max Reitz @ 2020-08-25 13:10 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den


[-- Attachment #1.1: Type: text/plain, Size: 2561 bytes --]

On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
> Add flag to make serialising request no wait: if there are conflicting
> requests, just return error immediately. It's will be used in upcoming
> preallocate filter.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/block/block.h |  9 ++++++++-
>  block/io.c            | 11 ++++++++++-
>  2 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index b8f4e86e8d..877fda06a4 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -67,8 +67,15 @@ typedef enum {
>       * written to qiov parameter which may be NULL.
>       */
>      BDRV_REQ_PREFETCH  = 0x200,
> +
> +    /*
> +     * If we need to wait for other requests, just fail immediately. Used
> +     * only together with BDRV_REQ_SERIALISING.
> +     */
> +    BDRV_REQ_NO_WAIT = 0x400,
> +
>      /* Mask of valid flags */
> -    BDRV_REQ_MASK               = 0x3ff,
> +    BDRV_REQ_MASK               = 0x7ff,
>  } BdrvRequestFlags;
>  
>  typedef struct BlockSizes {
> diff --git a/block/io.c b/block/io.c
> index dd28befb08..c93b1e98a3 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1912,9 +1912,18 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
>      assert(!(bs->open_flags & BDRV_O_INACTIVE));
>      assert((bs->open_flags & BDRV_O_NO_IO) == 0);
>      assert(!(flags & ~BDRV_REQ_MASK));
> +    assert(!((flags & BDRV_REQ_NO_WAIT) && !(flags & BDRV_REQ_SERIALISING)));
>  
>      if (flags & BDRV_REQ_SERIALISING) {
> -        bdrv_make_request_serialising(req, bdrv_get_cluster_size(bs));
> +        QEMU_LOCK_GUARD(&bs->reqs_lock);
> +
> +        tracked_request_set_serialising(req, bdrv_get_cluster_size(bs));
> +
> +        if ((flags & BDRV_REQ_NO_WAIT) && bdrv_find_conflicting_request(req)) {

bdrv_find_conflicting_request() will return NULL even if there are
conflicting requests, but those have a non-NULL waiting_for.  Is that
something to consider?

(I would like to think that will never have a real impact because then
we must find some other conflicting request; but isn’t is possible that
we find an overlapping request that waits for another request with which
it overlaps, but our request does not?)

Max

> +            return -EBUSY;
> +        }
> +
> +        bdrv_wait_serialising_requests_locked(req);
>      } else {
>          bdrv_wait_serialising_requests(req);
>      }
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-21 14:11 ` [PATCH v5 07/10] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
  2020-08-24 17:52   ` Vladimir Sementsov-Ogievskiy
@ 2020-08-25 15:11   ` Max Reitz
  2020-08-26  6:49     ` Vladimir Sementsov-Ogievskiy
  2020-08-26 13:51     ` David Edmondson
  1 sibling, 2 replies; 30+ messages in thread
From: Max Reitz @ 2020-08-25 15:11 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den


[-- Attachment #1.1: Type: text/plain, Size: 17545 bytes --]

On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
> It's intended to be inserted between format and protocol nodes to
> preallocate additional space (expanding protocol file) on writes
> crossing EOF. It improves performance for file-systems with slow
> allocation.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  docs/system/qemu-block-drivers.rst.inc |  26 +++
>  qapi/block-core.json                   |  20 +-
>  block/preallocate.c                    | 291 +++++++++++++++++++++++++
>  block/Makefile.objs                    |   1 +
>  4 files changed, 337 insertions(+), 1 deletion(-)
>  create mode 100644 block/preallocate.c

Looks good to me in essence.  Besides minor details, I wonder most about
whether truncating the file on close can be safe, but more about that below.

> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
> index b052a6d14e..5e8a35c571 100644
> --- a/docs/system/qemu-block-drivers.rst.inc
> +++ b/docs/system/qemu-block-drivers.rst.inc
> @@ -952,3 +952,29 @@ on host and see if there are locks held by the QEMU process on the image file.
>  More than one byte could be locked by the QEMU instance, each byte of which
>  reflects a particular permission that is acquired or protected by the running
>  block driver.
> +
> +Filter drivers
> +~~~~~~~~~~~~~~
> +
> +QEMU supports several filter drivers, which don't store any data, but do some

s/do/perform/

> +additional tasks, hooking io requests.
> +
> +.. program:: filter-drivers
> +.. option:: preallocate
> +
> +  Preallocate filter driver is intended to be inserted between format

*The preallocate filter driver

> +  and protocol nodes and does preallocation of some additional space

I’d simplify this as s/does preallocation of/preallocates/

> +  (expanding the protocol file) on write. It may be used for

I’d complicate this as s/on write/when writing past the file’s end/, or
“when data is written to the file after its end”, or at least “on
post-EOF writes”.

Maybe also s/It may be used for/This can be useful for/?

> +  file-systems with slow allocation.
> +
> +  Supported options:
> +
> +  .. program:: preallocate
> +  .. option:: prealloc-align
> +
> +    On preallocation, align file length to this number, default 1M.

*the file length

As for “number”...  Well, it is a number.  But “value” might fit better.
 Or “length (in bytes)”?

> +
> +  .. program:: preallocate
> +  .. option:: prealloc-size
> +
> +    How much to preallocate, default 128M.
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 197bdc1c36..b40448063b 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2805,7 +2805,7 @@
>              'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
>              'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
>              'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
> -            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
> +            'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>              { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
>              'sheepdog',
>              'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
> @@ -3074,6 +3074,23 @@
>    'data': { 'aes': 'QCryptoBlockOptionsQCow',
>              'luks': 'QCryptoBlockOptionsLUKS'} }
>  
> +##
> +# @BlockdevOptionsPreallocate:
> +#
> +# Filter driver intended to be inserted between format and protocol node
> +# and do preallocation in protocol node on write.
> +#
> +# @prealloc-align: on preallocation, align file length to this number,
> +#                 default 1048576 (1M)

Speaking of alignment, this second line isn’t properly aligned.

> +#
> +# @prealloc-size: how much to preallocate, default 134217728 (128M)
> +#
> +# Since: 5.2
> +##
> +{ 'struct': 'BlockdevOptionsPreallocate',
> +  'base': 'BlockdevOptionsGenericFormat',
> +  'data': { '*prealloc-align': 'int', '*prealloc-size': 'int' } }
> +
>  ##
>  # @BlockdevOptionsQcow2:
>  #
> @@ -3979,6 +3996,7 @@
>        'null-co':    'BlockdevOptionsNull',
>        'nvme':       'BlockdevOptionsNVMe',
>        'parallels':  'BlockdevOptionsGenericFormat',
> +      'preallocate':'BlockdevOptionsPreallocate',
>        'qcow2':      'BlockdevOptionsQcow2',
>        'qcow':       'BlockdevOptionsQcow',
>        'qed':        'BlockdevOptionsGenericCOWFormat',
> diff --git a/block/preallocate.c b/block/preallocate.c
> new file mode 100644
> index 0000000000..bdf54dbd2f
> --- /dev/null
> +++ b/block/preallocate.c
> @@ -0,0 +1,291 @@
> +/*
> + * preallocate filter driver
> + *
> + * The driver performs preallocate operation: it is injected above
> + * some node, and before each write over EOF it does additional preallocating
> + * write-zeroes request.
> + *
> + * Copyright (c) 2020 Virtuozzo International GmbH.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qapi/error.h"
> +#include "qemu/module.h"
> +#include "qemu/option.h"
> +#include "qemu/units.h"
> +#include "block/block_int.h"
> +
> +
> +typedef struct BDRVPreallocateState {
> +    int64_t prealloc_size;
> +    int64_t prealloc_align;
> +
> +    /*
> +     * Filter is started as not-active, so it doesn't do any preallocations nor
> +     * requires BLK_PERM_RESIZE on its child. This is needed to create filter
> +     * above another node-child and then do bdrv_replace_node to insert the
> +     * filter.

A bit weird, but seems fair.  It’s basically just a cache for whether
this node has a writer on it or not.

Apart from the weirdness, I don’t understand the “another node-child”.
Say you have “format” -> “proto”, and then you want to insert
“prealloc”.  Wouldn’t you blockdev-add prealloc,file=proto and then
blockdev-replace format.file=prealloc?  So what is that “other node-child”?

> +     *
> +     * Filter becomes active the first time it detects that its parents have
> +     * BLK_PERM_RESIZE on it.
> +     * Filter becomes active forever: it doesn't lose active status if parents
> +     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink the file on
> +     * filter close.

Oh, the file is shrunk?  That wasn’t clear from the documentation.

Hm.  Seems like useful behavior.  I just wonder if keeping the
permission around indefinitely makes sense.  Why not just truncate it
when the permission is revoked?

> +     */
> +    bool active;
> +
> +    /*
> +     * Track real data end, to crop preallocation on close  data_end may be

s/on close /when closed./

> +     * negative, which means that actual status is unknown (nothing cropped in
> +     * this case)
> +     */
> +    int64_t data_end;
> +} BDRVPreallocateState;
> +
> +#define PREALLOCATE_OPT_PREALLOC_ALIGN "prealloc-align"
> +#define PREALLOCATE_OPT_PREALLOC_SIZE "prealloc-size"
> +static QemuOptsList runtime_opts = {
> +    .name = "preallocate",
> +    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
> +    .desc = {
> +        {
> +            .name = PREALLOCATE_OPT_PREALLOC_ALIGN,
> +            .type = QEMU_OPT_SIZE,
> +            .help = "on preallocation, align file length to this number, "
> +                "default 1M",
> +        },
> +        {
> +            .name = PREALLOCATE_OPT_PREALLOC_SIZE,
> +            .type = QEMU_OPT_SIZE,
> +            .help = "how much to preallocate, default 128M",
> +        },
> +        { /* end of list */ }
> +    },
> +};
> +
> +static int preallocate_open(BlockDriverState *bs, QDict *options, int flags,
> +                            Error **errp)
> +{
> +    QemuOpts *opts;
> +    BDRVPreallocateState *s = bs->opaque;
> +
> +    /*
> +     * Parameters are hardcoded now. May need to add corresponding options in
> +     * future.
> +     */
> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> +    qemu_opts_absorb_qdict(opts, options, &error_abort);
> +    s->prealloc_align =
> +        qemu_opt_get_size(opts, PREALLOCATE_OPT_PREALLOC_ALIGN, 1 * MiB);
> +    s->prealloc_size =
> +        qemu_opt_get_size(opts, PREALLOCATE_OPT_PREALLOC_SIZE, 128 * MiB);
> +    qemu_opts_del(opts);

Should we have some validation on these parameters?  Like,
prealloc_align being at least 512, or maybe even file.request_alignment?

(I suppose anything for prealloc_size is fine, though 0 doesn’t make
much sense...)

> +
> +    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
> +                               BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
> +                               false, errp);
> +    if (!bs->file) {
> +        return -EINVAL;
> +    }
> +
> +    s->data_end = bdrv_getlength(bs->file->bs);
> +    if (s->data_end < 0) {
> +        return s->data_end;
> +    }
> +
> +    bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
> +        (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
> +
> +    bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
> +            bs->file->bs->supported_zero_flags);
> +
> +    return 0;
> +}
> +
> +static void preallocate_close(BlockDriverState *bs)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +
> +    if (s->active && s->data_end >= 0 &&
> +        bdrv_getlength(bs->file->bs) > s->data_end)
> +    {
> +        bdrv_truncate(bs->file, s->data_end, true, PREALLOC_MODE_OFF, 0, NULL);

Now that I think more about it...  What if there are other writers on
bs->file?  This may throw data away.  Maybe BDS.wr_highest_offset can
help to make a better call?

> +    }
> +}
> +
> +static void preallocate_child_perm(BlockDriverState *bs, BdrvChild *c,
> +                                   BdrvChildRole role,
> +                                   BlockReopenQueue *reopen_queue,
> +                                   uint64_t perm, uint64_t shared,
> +                                   uint64_t *nperm, uint64_t *nshared)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +
> +    bdrv_default_perms(bs, c, role, reopen_queue, perm, shared, nperm, nshared);
> +
> +    s->active = s->active || (perm & BLK_PERM_RESIZE);
> +
> +    if (s->active) {
> +        /* Force RESIZE permission, to be able to crop file on close() */
> +        *nperm |= BLK_PERM_RESIZE;
> +    }
> +}
> +
> +static coroutine_fn int preallocate_co_preadv_part(
> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
> +        QEMUIOVector *qiov, size_t qiov_offset, int flags)
> +{
> +    return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
> +                               flags);
> +}
> +
> +static int coroutine_fn preallocate_co_pdiscard(BlockDriverState *bs,
> +                                               int64_t offset, int bytes)
> +{
> +    return bdrv_co_pdiscard(bs->file, offset, bytes);
> +}
> +
> +static bool coroutine_fn do_preallocate(BlockDriverState *bs, int64_t offset,
> +                                       int64_t bytes, bool write_zero)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +    int64_t len, start, end;
> +
> +    if (!s->active) {
> +        return false;
> +    }
> +
> +    if (s->data_end >= 0) {
> +        s->data_end = MAX(s->data_end,
> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
> +    }
> +
> +    len = bdrv_getlength(bs->file->bs);

I’d rename @len so it’s clear that it has nothing to do with the request
length.  Like, file_len.

(Because...

> +    if (len < 0) {
> +        return false;
> +    }
> +
> +    if (s->data_end < 0) {
> +        s->data_end = MAX(len,
> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
> +    }
> +
> +    if (offset + bytes <= len) {
> +        return false;
> +    }
> +
> +    start = write_zero ? MIN(offset, len) : len;

...I got confused here for a bit)

> +    end = QEMU_ALIGN_UP(offset + bytes + s->prealloc_size, s->prealloc_align);

Ah, I expected offset + s->prealloc_size.  But OK.

> +    return !bdrv_co_pwrite_zeroes(bs->file, start, end - start,
> +            BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> +}
> +
> +static int coroutine_fn preallocate_co_pwrite_zeroes(BlockDriverState *bs,
> +        int64_t offset, int bytes, BdrvRequestFlags flags)
> +{
> +    if (do_preallocate(bs, offset, bytes, true)) {
> +        return 0;
> +    }
> +
> +    return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
> +}
> +
> +static coroutine_fn int preallocate_co_pwritev_part(BlockDriverState *bs,
> +                                                    uint64_t offset,
> +                                                    uint64_t bytes,
> +                                                    QEMUIOVector *qiov,
> +                                                    size_t qiov_offset,
> +                                                    int flags)
> +{
> +    do_preallocate(bs, offset, bytes, false);
> +
> +    return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
> +                                flags);
> +}
> +
> +static int coroutine_fn
> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
> +                        bool exact, PreallocMode prealloc,
> +                        BdrvRequestFlags flags, Error **errp)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc, flags, errp);
> +
> +    /* s->data_end may become negative here, which means unknown data end */
> +    s->data_end = bdrv_getlength(bs->file->bs);

What would be the problem with just setting data_end = offset?  We only
use data_end to truncate down on close, so basically repeat the
bdrv_co_truncate() call here, which seems correct.

> +
> +    return ret;
> +}
> +
> +static int coroutine_fn preallocate_co_flush(BlockDriverState *bs)
> +{
> +    return bdrv_co_flush(bs->file->bs);
> +}
> +
> +static int64_t preallocate_getlength(BlockDriverState *bs)
> +{
> +    /*
> +     * We probably can return s->data_end here, but seems safer to return real
> +     * file length, not trying to hide the preallocation.

I don’t know.  The auto-truncation on close makes that a bit weird, in
my opinion.  But since this filter is probably never directly below a
BlockBackend (i.e., the length is never exposed to anything outside of
the block layer), but always below a format driver, it should be fine.
(In fact, those drivers do probably rather care about the true file
length rather than whatever they may have truncated it to, so you’re
right, it should be safer.)

Max

> +     *
> +     * Still, don't miss the chance to restore s->data_end if it is broken.
> +     */
> +    BDRVPreallocateState *s = bs->opaque;
> +    int64_t ret = bdrv_getlength(bs->file->bs);
> +
> +    if (s->data_end < 0) {
> +        s->data_end = ret;
> +    }
> +
> +    return ret;
> +}
> +
> +BlockDriver bdrv_preallocate_filter = {
> +    .format_name = "preallocate",
> +    .instance_size = sizeof(BDRVPreallocateState),
> +
> +    .bdrv_getlength = preallocate_getlength,
> +    .bdrv_open = preallocate_open,
> +    .bdrv_close = preallocate_close,
> +
> +    .bdrv_co_preadv_part = preallocate_co_preadv_part,
> +    .bdrv_co_pwritev_part = preallocate_co_pwritev_part,
> +    .bdrv_co_pwrite_zeroes = preallocate_co_pwrite_zeroes,
> +    .bdrv_co_pdiscard = preallocate_co_pdiscard,
> +    .bdrv_co_flush = preallocate_co_flush,
> +    .bdrv_co_truncate = preallocate_co_truncate,
> +
> +    .bdrv_co_block_status = bdrv_co_block_status_from_file,
> +
> +    .bdrv_child_perm = preallocate_child_perm,
> +
> +    .has_variable_length = true,
> +    .is_filter = true,
> +};
> +
> +static void bdrv_preallocate_init(void)
> +{
> +    bdrv_register(&bdrv_preallocate_filter);
> +}
> +
> +block_init(bdrv_preallocate_init);
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 19c6f371c9..f8e6f16522 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -44,6 +44,7 @@ block-obj-y += crypto.o
>  block-obj-y += aio_task.o
>  block-obj-y += backup-top.o
>  block-obj-y += filter-compress.o
> +block-obj-y += preallocate.o
>  common-obj-y += monitor/
>  block-obj-y += monitor/
>  
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 06/10] block: introduce BDRV_REQ_NO_WAIT flag
  2020-08-25 13:10   ` Max Reitz
@ 2020-08-26  6:26     ` Vladimir Sementsov-Ogievskiy
  2020-08-26  8:36       ` Max Reitz
  0 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-26  6:26 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den

25.08.2020 16:10, Max Reitz wrote:
> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
>> Add flag to make serialising request no wait: if there are conflicting
>> requests, just return error immediately. It's will be used in upcoming
>> preallocate filter.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   include/block/block.h |  9 ++++++++-
>>   block/io.c            | 11 ++++++++++-
>>   2 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/block/block.h b/include/block/block.h
>> index b8f4e86e8d..877fda06a4 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -67,8 +67,15 @@ typedef enum {
>>        * written to qiov parameter which may be NULL.
>>        */
>>       BDRV_REQ_PREFETCH  = 0x200,
>> +
>> +    /*
>> +     * If we need to wait for other requests, just fail immediately. Used
>> +     * only together with BDRV_REQ_SERIALISING.
>> +     */
>> +    BDRV_REQ_NO_WAIT = 0x400,
>> +
>>       /* Mask of valid flags */
>> -    BDRV_REQ_MASK               = 0x3ff,
>> +    BDRV_REQ_MASK               = 0x7ff,
>>   } BdrvRequestFlags;
>>   
>>   typedef struct BlockSizes {
>> diff --git a/block/io.c b/block/io.c
>> index dd28befb08..c93b1e98a3 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -1912,9 +1912,18 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
>>       assert(!(bs->open_flags & BDRV_O_INACTIVE));
>>       assert((bs->open_flags & BDRV_O_NO_IO) == 0);
>>       assert(!(flags & ~BDRV_REQ_MASK));
>> +    assert(!((flags & BDRV_REQ_NO_WAIT) && !(flags & BDRV_REQ_SERIALISING)));
>>   
>>       if (flags & BDRV_REQ_SERIALISING) {
>> -        bdrv_make_request_serialising(req, bdrv_get_cluster_size(bs));
>> +        QEMU_LOCK_GUARD(&bs->reqs_lock);
>> +
>> +        tracked_request_set_serialising(req, bdrv_get_cluster_size(bs));
>> +
>> +        if ((flags & BDRV_REQ_NO_WAIT) && bdrv_find_conflicting_request(req)) {
> 
> bdrv_find_conflicting_request() will return NULL even if there are
> conflicting requests, but those have a non-NULL waiting_for.  Is that
> something to consider?
> 
> (I would like to think that will never have a real impact because then
> we must find some other conflicting request; but isn’t is possible that
> we find an overlapping request that waits for another request with which
> it overlaps, but our request does not?)
> 

Actually check in bdrv_find_conflicting_request() is the same like in the following
bdrv_wait_serialising_requests_locked(), so, if bdrv_find_conflicting_request() returns
NULL, it means that in bdrv_wait_serialising_requests_locked() it will return NULL
again (as there are no yield points) and we will not wait, so all is OK.

And, why is it OK to ignore already waiting requests in
bdrv_wait_serialising_requests_locked(): just because if we proceed now with our request,
these waiting requests will have to wait for us, when they wake and go to the next iteration
of waiting loop.

> 
>> +            return -EBUSY;
>> +        }
>> +
>> +        bdrv_wait_serialising_requests_locked(req);
>>       } else {
>>           bdrv_wait_serialising_requests(req);
>>       }
>>
> 
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-25 15:11   ` Max Reitz
@ 2020-08-26  6:49     ` Vladimir Sementsov-Ogievskiy
  2020-08-26  8:52       ` Max Reitz
  2020-08-26 13:51     ` David Edmondson
  1 sibling, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-26  6:49 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den

25.08.2020 18:11, Max Reitz wrote:
> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
>> It's intended to be inserted between format and protocol nodes to
>> preallocate additional space (expanding protocol file) on writes
>> crossing EOF. It improves performance for file-systems with slow
>> allocation.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   docs/system/qemu-block-drivers.rst.inc |  26 +++
>>   qapi/block-core.json                   |  20 +-
>>   block/preallocate.c                    | 291 +++++++++++++++++++++++++
>>   block/Makefile.objs                    |   1 +
>>   4 files changed, 337 insertions(+), 1 deletion(-)
>>   create mode 100644 block/preallocate.c
> 
> Looks good to me in essence.  Besides minor details, I wonder most about
> whether truncating the file on close can be safe, but more about that below.
> 
>> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
>> index b052a6d14e..5e8a35c571 100644
>> --- a/docs/system/qemu-block-drivers.rst.inc
>> +++ b/docs/system/qemu-block-drivers.rst.inc
>> @@ -952,3 +952,29 @@ on host and see if there are locks held by the QEMU process on the image file.
>>   More than one byte could be locked by the QEMU instance, each byte of which
>>   reflects a particular permission that is acquired or protected by the running
>>   block driver.
>> +
>> +Filter drivers
>> +~~~~~~~~~~~~~~
>> +
>> +QEMU supports several filter drivers, which don't store any data, but do some
> 
> s/do/perform/
> 
>> +additional tasks, hooking io requests.
>> +
>> +.. program:: filter-drivers
>> +.. option:: preallocate
>> +
>> +  Preallocate filter driver is intended to be inserted between format
> 
> *The preallocate filter driver
> 
>> +  and protocol nodes and does preallocation of some additional space
> 
> I’d simplify this as s/does preallocation of/preallocates/
> 
>> +  (expanding the protocol file) on write. It may be used for
> 
> I’d complicate this as s/on write/when writing past the file’s end/, or
> “when data is written to the file after its end”, or at least “on
> post-EOF writes”.
> 
> Maybe also s/It may be used for/This can be useful for/?
> 
>> +  file-systems with slow allocation.
>> +
>> +  Supported options:
>> +
>> +  .. program:: preallocate
>> +  .. option:: prealloc-align
>> +
>> +    On preallocation, align file length to this number, default 1M.
> 
> *the file length
> 
> As for “number”...  Well, it is a number.  But “value” might fit better.
>   Or “length (in bytes)”?
> 
>> +
>> +  .. program:: preallocate
>> +  .. option:: prealloc-size
>> +
>> +    How much to preallocate, default 128M.
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 197bdc1c36..b40448063b 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2805,7 +2805,7 @@
>>               'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
>>               'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
>>               'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
>> -            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>> +            'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>>               { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
>>               'sheepdog',
>>               'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
>> @@ -3074,6 +3074,23 @@
>>     'data': { 'aes': 'QCryptoBlockOptionsQCow',
>>               'luks': 'QCryptoBlockOptionsLUKS'} }
>>   
>> +##
>> +# @BlockdevOptionsPreallocate:
>> +#
>> +# Filter driver intended to be inserted between format and protocol node
>> +# and do preallocation in protocol node on write.
>> +#
>> +# @prealloc-align: on preallocation, align file length to this number,
>> +#                 default 1048576 (1M)
> 
> Speaking of alignment, this second line isn’t properly aligned.
> 
>> +#
>> +# @prealloc-size: how much to preallocate, default 134217728 (128M)
>> +#
>> +# Since: 5.2
>> +##
>> +{ 'struct': 'BlockdevOptionsPreallocate',
>> +  'base': 'BlockdevOptionsGenericFormat',
>> +  'data': { '*prealloc-align': 'int', '*prealloc-size': 'int' } }
>> +
>>   ##
>>   # @BlockdevOptionsQcow2:
>>   #
>> @@ -3979,6 +3996,7 @@
>>         'null-co':    'BlockdevOptionsNull',
>>         'nvme':       'BlockdevOptionsNVMe',
>>         'parallels':  'BlockdevOptionsGenericFormat',
>> +      'preallocate':'BlockdevOptionsPreallocate',
>>         'qcow2':      'BlockdevOptionsQcow2',
>>         'qcow':       'BlockdevOptionsQcow',
>>         'qed':        'BlockdevOptionsGenericCOWFormat',
>> diff --git a/block/preallocate.c b/block/preallocate.c
>> new file mode 100644
>> index 0000000000..bdf54dbd2f
>> --- /dev/null
>> +++ b/block/preallocate.c
>> @@ -0,0 +1,291 @@
>> +/*
>> + * preallocate filter driver
>> + *
>> + * The driver performs preallocate operation: it is injected above
>> + * some node, and before each write over EOF it does additional preallocating
>> + * write-zeroes request.
>> + *
>> + * Copyright (c) 2020 Virtuozzo International GmbH.
>> + *
>> + * Author:
>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +
>> +#include "qapi/error.h"
>> +#include "qemu/module.h"
>> +#include "qemu/option.h"
>> +#include "qemu/units.h"
>> +#include "block/block_int.h"
>> +
>> +
>> +typedef struct BDRVPreallocateState {
>> +    int64_t prealloc_size;
>> +    int64_t prealloc_align;
>> +
>> +    /*
>> +     * Filter is started as not-active, so it doesn't do any preallocations nor
>> +     * requires BLK_PERM_RESIZE on its child. This is needed to create filter
>> +     * above another node-child and then do bdrv_replace_node to insert the
>> +     * filter.
> 
> A bit weird, but seems fair.  It’s basically just a cache for whether
> this node has a writer on it or not.
> 
> Apart from the weirdness, I don’t understand the “another node-child”.
> Say you have “format” -> “proto”, and then you want to insert
> “prealloc”.  Wouldn’t you blockdev-add prealloc,file=proto and then
> blockdev-replace format.file=prealloc?

Yes something like this. Actually, I'm about inserting the filter automatically from block layer code, but your reasoning is about same thing and is better.

> So what is that “other node-child”?

Hmm, just my misleading wording. At least '-' in wrong place. Just "other node child", or "child of another node"..

> 
>> +     *
>> +     * Filter becomes active the first time it detects that its parents have
>> +     * BLK_PERM_RESIZE on it.
>> +     * Filter becomes active forever: it doesn't lose active status if parents
>> +     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink the file on
>> +     * filter close.
> 
> Oh, the file is shrunk?  That wasn’t clear from the documentation.
> 
> Hm.  Seems like useful behavior.  I just wonder if keeping the
> permission around indefinitely makes sense.  Why not just truncate it
> when the permission is revoked?

How? Parent is closed earlier, so on close we will not have the permission. So, we force-keep the permission up to our close().

> 
>> +     */
>> +    bool active;
>> +
>> +    /*
>> +     * Track real data end, to crop preallocation on close  data_end may be
> 
> s/on close /when closed./
> 
>> +     * negative, which means that actual status is unknown (nothing cropped in
>> +     * this case)
>> +     */
>> +    int64_t data_end;
>> +} BDRVPreallocateState;
>> +
>> +#define PREALLOCATE_OPT_PREALLOC_ALIGN "prealloc-align"
>> +#define PREALLOCATE_OPT_PREALLOC_SIZE "prealloc-size"
>> +static QemuOptsList runtime_opts = {
>> +    .name = "preallocate",
>> +    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
>> +    .desc = {
>> +        {
>> +            .name = PREALLOCATE_OPT_PREALLOC_ALIGN,
>> +            .type = QEMU_OPT_SIZE,
>> +            .help = "on preallocation, align file length to this number, "
>> +                "default 1M",
>> +        },
>> +        {
>> +            .name = PREALLOCATE_OPT_PREALLOC_SIZE,
>> +            .type = QEMU_OPT_SIZE,
>> +            .help = "how much to preallocate, default 128M",
>> +        },
>> +        { /* end of list */ }
>> +    },
>> +};
>> +
>> +static int preallocate_open(BlockDriverState *bs, QDict *options, int flags,
>> +                            Error **errp)
>> +{
>> +    QemuOpts *opts;
>> +    BDRVPreallocateState *s = bs->opaque;
>> +
>> +    /*
>> +     * Parameters are hardcoded now. May need to add corresponding options in
>> +     * future.
>> +     */
>> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>> +    qemu_opts_absorb_qdict(opts, options, &error_abort);
>> +    s->prealloc_align =
>> +        qemu_opt_get_size(opts, PREALLOCATE_OPT_PREALLOC_ALIGN, 1 * MiB);
>> +    s->prealloc_size =
>> +        qemu_opt_get_size(opts, PREALLOCATE_OPT_PREALLOC_SIZE, 128 * MiB);
>> +    qemu_opts_del(opts);
> 
> Should we have some validation on these parameters?  Like,
> prealloc_align being at least 512, or maybe even file.request_alignment?

why not

> 
> (I suppose anything for prealloc_size is fine, though 0 doesn’t make
> much sense...)
> 
>> +
>> +    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
>> +                               BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
>> +                               false, errp);
>> +    if (!bs->file) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    s->data_end = bdrv_getlength(bs->file->bs);
>> +    if (s->data_end < 0) {
>> +        return s->data_end;
>> +    }
>> +
>> +    bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
>> +        (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
>> +
>> +    bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
>> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
>> +            bs->file->bs->supported_zero_flags);
>> +
>> +    return 0;
>> +}
>> +
>> +static void preallocate_close(BlockDriverState *bs)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +
>> +    if (s->active && s->data_end >= 0 &&
>> +        bdrv_getlength(bs->file->bs) > s->data_end)
>> +    {
>> +        bdrv_truncate(bs->file, s->data_end, true, PREALLOC_MODE_OFF, 0, NULL);
> 
> Now that I think more about it...  What if there are other writers on
> bs->file?  This may throw data away.

Good point. Actually, if bs->file has other writing parents, the logic of the filter
around "data_end" is broken. So we must unshare WRITE and RESIZE permissions.

>  Maybe BDS.wr_highest_offset can
> help to make a better call?

Anyway, we'll have to use wr_highest_offset of the filter not the child node
(in the child wr_highest_offset will track preallocations as well), so we
want to unshare WRITE/RESIZE permissions.

> 
>> +    }
>> +}
>> +
>> +static void preallocate_child_perm(BlockDriverState *bs, BdrvChild *c,
>> +                                   BdrvChildRole role,
>> +                                   BlockReopenQueue *reopen_queue,
>> +                                   uint64_t perm, uint64_t shared,
>> +                                   uint64_t *nperm, uint64_t *nshared)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +
>> +    bdrv_default_perms(bs, c, role, reopen_queue, perm, shared, nperm, nshared);
>> +
>> +    s->active = s->active || (perm & BLK_PERM_RESIZE);
>> +
>> +    if (s->active) {
>> +        /* Force RESIZE permission, to be able to crop file on close() */
>> +        *nperm |= BLK_PERM_RESIZE;
>> +    }
>> +}
>> +
>> +static coroutine_fn int preallocate_co_preadv_part(
>> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
>> +        QEMUIOVector *qiov, size_t qiov_offset, int flags)
>> +{
>> +    return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
>> +                               flags);
>> +}
>> +
>> +static int coroutine_fn preallocate_co_pdiscard(BlockDriverState *bs,
>> +                                               int64_t offset, int bytes)
>> +{
>> +    return bdrv_co_pdiscard(bs->file, offset, bytes);
>> +}
>> +
>> +static bool coroutine_fn do_preallocate(BlockDriverState *bs, int64_t offset,
>> +                                       int64_t bytes, bool write_zero)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +    int64_t len, start, end;
>> +
>> +    if (!s->active) {
>> +        return false;
>> +    }
>> +
>> +    if (s->data_end >= 0) {
>> +        s->data_end = MAX(s->data_end,
>> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
>> +    }
>> +
>> +    len = bdrv_getlength(bs->file->bs);
> 
> I’d rename @len so it’s clear that it has nothing to do with the request
> length.  Like, file_len.
> 
> (Because...

Ok

> 
>> +    if (len < 0) {
>> +        return false;
>> +    }
>> +
>> +    if (s->data_end < 0) {
>> +        s->data_end = MAX(len,
>> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
>> +    }
>> +
>> +    if (offset + bytes <= len) {
>> +        return false;
>> +    }
>> +
>> +    start = write_zero ? MIN(offset, len) : len;
> 
> ...I got confused here for a bit)
> 
>> +    end = QEMU_ALIGN_UP(offset + bytes + s->prealloc_size, s->prealloc_align);
> 
> Ah, I expected offset + s->prealloc_size.  But OK.
> 
>> +    return !bdrv_co_pwrite_zeroes(bs->file, start, end - start,
>> +            BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>> +}
>> +
>> +static int coroutine_fn preallocate_co_pwrite_zeroes(BlockDriverState *bs,
>> +        int64_t offset, int bytes, BdrvRequestFlags flags)
>> +{
>> +    if (do_preallocate(bs, offset, bytes, true)) {
>> +        return 0;
>> +    }
>> +
>> +    return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
>> +}
>> +
>> +static coroutine_fn int preallocate_co_pwritev_part(BlockDriverState *bs,
>> +                                                    uint64_t offset,
>> +                                                    uint64_t bytes,
>> +                                                    QEMUIOVector *qiov,
>> +                                                    size_t qiov_offset,
>> +                                                    int flags)
>> +{
>> +    do_preallocate(bs, offset, bytes, false);
>> +
>> +    return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
>> +                                flags);
>> +}
>> +
>> +static int coroutine_fn
>> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
>> +                        bool exact, PreallocMode prealloc,
>> +                        BdrvRequestFlags flags, Error **errp)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc, flags, errp);
>> +
>> +    /* s->data_end may become negative here, which means unknown data end */
>> +    s->data_end = bdrv_getlength(bs->file->bs);
> 
> What would be the problem with just setting data_end = offset?  We only
> use data_end to truncate down on close, so basically repeat the
> bdrv_co_truncate() call here, which seems correct.

We can use offset only if ret >= 0 && exact is true.. But simpler is just call
bdrv_getlength() anyway.

> 
>> +
>> +    return ret;
>> +}
>> +
>> +static int coroutine_fn preallocate_co_flush(BlockDriverState *bs)
>> +{
>> +    return bdrv_co_flush(bs->file->bs);
>> +}
>> +
>> +static int64_t preallocate_getlength(BlockDriverState *bs)
>> +{
>> +    /*
>> +     * We probably can return s->data_end here, but seems safer to return real
>> +     * file length, not trying to hide the preallocation.
> 
> I don’t know.  The auto-truncation on close makes that a bit weird, in
> my opinion.  But since this filter is probably never directly below a
> BlockBackend (i.e., the length is never exposed to anything outside of
> the block layer), but always below a format driver, it should be fine.
> (In fact, those drivers do probably rather care about the true file
> length rather than whatever they may have truncated it to, so you’re
> right, it should be safer.)
> 
> Max
> 


Thanks for reviewing!


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-24 17:52   ` Vladimir Sementsov-Ogievskiy
@ 2020-08-26  7:06     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-26  7:06 UTC (permalink / raw)
  To: qemu-block; +Cc: fam, kwolf, armbru, qemu-devel, mreitz, nsoffer, stefanha, den

24.08.2020 20:52, Vladimir Sementsov-Ogievskiy wrote:
> 21.08.2020 17:11, Vladimir Sementsov-Ogievskiy wrote:
>> It's intended to be inserted between format and protocol nodes to
>> preallocate additional space (expanding protocol file) on writes
>> crossing EOF. It improves performance for file-systems with slow
>> allocation.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   docs/system/qemu-block-drivers.rst.inc |  26 +++
>>   qapi/block-core.json                   |  20 +-
>>   block/preallocate.c                    | 291 +++++++++++++++++++++++++
>>   block/Makefile.objs                    |   1 +
>>   4 files changed, 337 insertions(+), 1 deletion(-)
>>   create mode 100644 block/preallocate.c
>>
>> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
>> index b052a6d14e..5e8a35c571 100644
>> --- a/docs/system/qemu-block-drivers.rst.inc
>> +++ b/docs/system/qemu-block-drivers.rst.inc
>> @@ -952,3 +952,29 @@ on host and see if there are locks held by the QEMU process on the image file.
>>   More than one byte could be locked by the QEMU instance, each byte of which
>>   reflects a particular permission that is acquired or protected by the running
>>   block driver.
>> +
> 
> [..]
> 
>> +
>> +static coroutine_fn int preallocate_co_preadv_part(
>> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
>> +        QEMUIOVector *qiov, size_t qiov_offset, int flags)
>> +{
>> +    return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
>> +                               flags);
>> +}
>> +
>> +static int coroutine_fn preallocate_co_pdiscard(BlockDriverState *bs,
>> +                                               int64_t offset, int bytes)
>> +{
>> +    return bdrv_co_pdiscard(bs->file, offset, bytes);
>> +}
>> +
>> +static bool coroutine_fn do_preallocate(BlockDriverState *bs, int64_t offset,
>> +                                       int64_t bytes, bool write_zero)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +    int64_t len, start, end;
> 
> int64_t old_data_end = s->data_end;
> 
>> +
>> +    if (!s->active) {
>> +        return false;
>> +    }
>> +
>> +    if (s->data_end >= 0) {
>> +        s->data_end = MAX(s->data_end,
>> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
>> +    }
>> +
>> +    len = bdrv_getlength(bs->file->bs);
>> +    if (len < 0) {
>> +        return false;
> 
> return old_data_end >= 0 && offset >= old_data_end;
> 
> 
> And with this small improvement we make the following test 20% faster (ssd, ext4):
> 
> ./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 150000 -d 64 -f qcow2  -s 16k -t none -n -w $img;
> 
> (assume additional patch which inserts the filter).
> 
> Great! So, preallocation is generally good idea, not only for vstorage.
> 
> What about inserting the filter by default?
> 
> I'll come tomorrow with more complete test results.
> 

Some results:

the two commands to compare:
img=/ssd/x.qcow2; for i in {1..5}; do ./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 150000 -d 64 -s 16k -t none -n -w --image-opts driver=qcow2,file.filename=$img; done
img=/ssd/x.qcow2; for i in {1..5}; do ./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 150000 -d 64 -s 16k -t none -n -w --image-opts driver=qcow2,file.driver=preallocate,file.file.filename=$img; done


       no-filter  with-filter
ssd
   ext4: 6.94s      5.53s (-20%)
   xfs:  6.9s       25s (+262%)
hdd (with -c 45000)
   ext4: 8.23s      12.75s (+55%)
   xfs:  7.79s      25.4s (+226%)

vstorage (some custom distributed fs), with -c 4000 over ext4 over ssd: 42.9s ~> 0.27s, more than 150 times faster with filter!
same, with -c 2000 over ext4 over hdd: 22.8s ~> 0.53s, ~43 times faster.


So, we do have large improvement for the vstorage (some custom distributed fs), which is our main target. And there is a significant improvement for ext4 over ssd. And (a lot more) significant degradation in other cases. For sure, we can't insert the filter by default, but it is useful.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 06/10] block: introduce BDRV_REQ_NO_WAIT flag
  2020-08-26  6:26     ` Vladimir Sementsov-Ogievskiy
@ 2020-08-26  8:36       ` Max Reitz
  2020-08-26  8:57         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 30+ messages in thread
From: Max Reitz @ 2020-08-26  8:36 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den


[-- Attachment #1.1: Type: text/plain, Size: 3785 bytes --]

On 26.08.20 08:26, Vladimir Sementsov-Ogievskiy wrote:
> 25.08.2020 16:10, Max Reitz wrote:
>> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
>>> Add flag to make serialising request no wait: if there are conflicting
>>> requests, just return error immediately. It's will be used in upcoming
>>> preallocate filter.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   include/block/block.h |  9 ++++++++-
>>>   block/io.c            | 11 ++++++++++-
>>>   2 files changed, 18 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/block/block.h b/include/block/block.h
>>> index b8f4e86e8d..877fda06a4 100644
>>> --- a/include/block/block.h
>>> +++ b/include/block/block.h
>>> @@ -67,8 +67,15 @@ typedef enum {
>>>        * written to qiov parameter which may be NULL.
>>>        */
>>>       BDRV_REQ_PREFETCH  = 0x200,
>>> +
>>> +    /*
>>> +     * If we need to wait for other requests, just fail immediately.
>>> Used
>>> +     * only together with BDRV_REQ_SERIALISING.
>>> +     */
>>> +    BDRV_REQ_NO_WAIT = 0x400,
>>> +
>>>       /* Mask of valid flags */
>>> -    BDRV_REQ_MASK               = 0x3ff,
>>> +    BDRV_REQ_MASK               = 0x7ff,
>>>   } BdrvRequestFlags;
>>>     typedef struct BlockSizes {
>>> diff --git a/block/io.c b/block/io.c
>>> index dd28befb08..c93b1e98a3 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -1912,9 +1912,18 @@ bdrv_co_write_req_prepare(BdrvChild *child,
>>> int64_t offset, uint64_t bytes,
>>>       assert(!(bs->open_flags & BDRV_O_INACTIVE));
>>>       assert((bs->open_flags & BDRV_O_NO_IO) == 0);
>>>       assert(!(flags & ~BDRV_REQ_MASK));
>>> +    assert(!((flags & BDRV_REQ_NO_WAIT) && !(flags &
>>> BDRV_REQ_SERIALISING)));
>>>         if (flags & BDRV_REQ_SERIALISING) {
>>> -        bdrv_make_request_serialising(req, bdrv_get_cluster_size(bs));
>>> +        QEMU_LOCK_GUARD(&bs->reqs_lock);
>>> +
>>> +        tracked_request_set_serialising(req,
>>> bdrv_get_cluster_size(bs));
>>> +
>>> +        if ((flags & BDRV_REQ_NO_WAIT) &&
>>> bdrv_find_conflicting_request(req)) {
>>
>> bdrv_find_conflicting_request() will return NULL even if there are
>> conflicting requests, but those have a non-NULL waiting_for.  Is that
>> something to consider?
>>
>> (I would like to think that will never have a real impact because then
>> we must find some other conflicting request; but isn’t is possible that
>> we find an overlapping request that waits for another request with which
>> it overlaps, but our request does not?)
>>
> 
> Actually check in bdrv_find_conflicting_request() is the same like in
> the following
> bdrv_wait_serialising_requests_locked(), so, if
> bdrv_find_conflicting_request() returns
> NULL, it means that in bdrv_wait_serialising_requests_locked() it will
> return NULL
> again (as there are no yield points) and we will not wait, so all is OK.

OK.  I thought that maybe we would want to avoid that other requests
might have to wait for the preallocation write.  (Of course, we can’t
avoid that altogether, but if we already know of such requests at the
beginning of the request...)

Well, if the only thing to look out for is that preallocation writes
themselves do not wait:

Reviewed-by: Max Reitz <mreitz@redhat.com>

> And, why is it OK to ignore already waiting requests in
> bdrv_wait_serialising_requests_locked(): just because if we proceed now
> with our request,
> these waiting requests will have to wait for us, when they wake and go
> to the next iteration
> of waiting loop.

Sure.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-26  6:49     ` Vladimir Sementsov-Ogievskiy
@ 2020-08-26  8:52       ` Max Reitz
  2020-08-26  9:15         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 30+ messages in thread
From: Max Reitz @ 2020-08-26  8:52 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den


[-- Attachment #1.1: Type: text/plain, Size: 6740 bytes --]

On 26.08.20 08:49, Vladimir Sementsov-Ogievskiy wrote:
> 25.08.2020 18:11, Max Reitz wrote:
>> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
>>> It's intended to be inserted between format and protocol nodes to
>>> preallocate additional space (expanding protocol file) on writes
>>> crossing EOF. It improves performance for file-systems with slow
>>> allocation.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   docs/system/qemu-block-drivers.rst.inc |  26 +++
>>>   qapi/block-core.json                   |  20 +-
>>>   block/preallocate.c                    | 291 +++++++++++++++++++++++++
>>>   block/Makefile.objs                    |   1 +
>>>   4 files changed, 337 insertions(+), 1 deletion(-)
>>>   create mode 100644 block/preallocate.c

[...]

>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>> new file mode 100644
>>> index 0000000000..bdf54dbd2f
>>> --- /dev/null
>>> +++ b/block/preallocate.c
>>> @@ -0,0 +1,291 @@
>>> +/*
>>> + * preallocate filter driver
>>> + *
>>> + * The driver performs preallocate operation: it is injected above
>>> + * some node, and before each write over EOF it does additional
>>> preallocating
>>> + * write-zeroes request.
>>> + *
>>> + * Copyright (c) 2020 Virtuozzo International GmbH.
>>> + *
>>> + * Author:
>>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License as published by
>>> + * the Free Software Foundation; either version 2 of the License, or
>>> + * (at your option) any later version.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>> + * GNU General Public License for more details.
>>> + *
>>> + * You should have received a copy of the GNU General Public License
>>> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +
>>> +#include "qapi/error.h"
>>> +#include "qemu/module.h"
>>> +#include "qemu/option.h"
>>> +#include "qemu/units.h"
>>> +#include "block/block_int.h"
>>> +
>>> +
>>> +typedef struct BDRVPreallocateState {
>>> +    int64_t prealloc_size;
>>> +    int64_t prealloc_align;
>>> +
>>> +    /*
>>> +     * Filter is started as not-active, so it doesn't do any
>>> preallocations nor
>>> +     * requires BLK_PERM_RESIZE on its child. This is needed to
>>> create filter
>>> +     * above another node-child and then do bdrv_replace_node to
>>> insert the
>>> +     * filter.
>>
>> A bit weird, but seems fair.  It’s basically just a cache for whether
>> this node has a writer on it or not.
>>
>> Apart from the weirdness, I don’t understand the “another node-child”.
>> Say you have “format” -> “proto”, and then you want to insert
>> “prealloc”.  Wouldn’t you blockdev-add prealloc,file=proto and then
>> blockdev-replace format.file=prealloc?
> 
> Yes something like this. Actually, I'm about inserting the filter
> automatically from block layer code, but your reasoning is about same
> thing and is better.
> 
>> So what is that “other node-child”?
> 
> Hmm, just my misleading wording. At least '-' in wrong place. Just
> "other node child", or "child of another node"..

OK.

>>> +     *
>>> +     * Filter becomes active the first time it detects that its
>>> parents have
>>> +     * BLK_PERM_RESIZE on it.
>>> +     * Filter becomes active forever: it doesn't lose active status
>>> if parents
>>> +     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink
>>> the file on
>>> +     * filter close.
>>
>> Oh, the file is shrunk?  That wasn’t clear from the documentation.
>>
>> Hm.  Seems like useful behavior.  I just wonder if keeping the
>> permission around indefinitely makes sense.  Why not just truncate it
>> when the permission is revoked?
> 
> How? Parent is closed earlier, so on close we will not have the
> permission. So, we force-keep the permission up to our close().

I thought that preallocate_child_perm() would be invoked when the parent
is detached, so we could do the truncate there, before relinquishing
preallocate’s RESIZE permission.

[...]

>>> +static void preallocate_close(BlockDriverState *bs)
>>> +{
>>> +    BDRVPreallocateState *s = bs->opaque;
>>> +
>>> +    if (s->active && s->data_end >= 0 &&
>>> +        bdrv_getlength(bs->file->bs) > s->data_end)
>>> +    {
>>> +        bdrv_truncate(bs->file, s->data_end, true,
>>> PREALLOC_MODE_OFF, 0, NULL);
>>
>> Now that I think more about it...  What if there are other writers on
>> bs->file?  This may throw data away.
> 
> Good point. Actually, if bs->file has other writing parents, the logic
> of the filter
> around "data_end" is broken. So we must unshare WRITE and RESIZE
> permissions.

That’s certainly a heavy hammer, but it’d work.

>>  Maybe BDS.wr_highest_offset can
>> help to make a better call?
> 
> Anyway, we'll have to use wr_highest_offset of the filter not the child
> node
> (in the child wr_highest_offset will track preallocations as well),

That’s true.

> so we want to unshare WRITE/RESIZE permissions.

[...]

>>> +static int coroutine_fn
>>> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
>>> +                        bool exact, PreallocMode prealloc,
>>> +                        BdrvRequestFlags flags, Error **errp)
>>> +{
>>> +    BDRVPreallocateState *s = bs->opaque;
>>> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc,
>>> flags, errp);
>>> +
>>> +    /* s->data_end may become negative here, which means unknown
>>> data end */
>>> +    s->data_end = bdrv_getlength(bs->file->bs);
>>
>> What would be the problem with just setting data_end = offset?  We only
>> use data_end to truncate down on close, so basically repeat the
>> bdrv_co_truncate() call here, which seems correct.
> 
> We can use offset only if ret >= 0 && exact is true..

Well, on close, we ignore any errors from truncate() anyway.  So even if
we used exact=false here, it shouldn’t matter.

> But simpler is
> just call
> bdrv_getlength() anyway.

Certainly simpler than thinking about potential edge cases, true.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 06/10] block: introduce BDRV_REQ_NO_WAIT flag
  2020-08-26  8:36       ` Max Reitz
@ 2020-08-26  8:57         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-26  8:57 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den

26.08.2020 11:36, Max Reitz wrote:
> On 26.08.20 08:26, Vladimir Sementsov-Ogievskiy wrote:
>> 25.08.2020 16:10, Max Reitz wrote:
>>> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
>>>> Add flag to make serialising request no wait: if there are conflicting
>>>> requests, just return error immediately. It's will be used in upcoming
>>>> preallocate filter.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> ---
>>>>    include/block/block.h |  9 ++++++++-
>>>>    block/io.c            | 11 ++++++++++-
>>>>    2 files changed, 18 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/block/block.h b/include/block/block.h
>>>> index b8f4e86e8d..877fda06a4 100644
>>>> --- a/include/block/block.h
>>>> +++ b/include/block/block.h
>>>> @@ -67,8 +67,15 @@ typedef enum {
>>>>         * written to qiov parameter which may be NULL.
>>>>         */
>>>>        BDRV_REQ_PREFETCH  = 0x200,
>>>> +
>>>> +    /*
>>>> +     * If we need to wait for other requests, just fail immediately.
>>>> Used
>>>> +     * only together with BDRV_REQ_SERIALISING.
>>>> +     */
>>>> +    BDRV_REQ_NO_WAIT = 0x400,
>>>> +
>>>>        /* Mask of valid flags */
>>>> -    BDRV_REQ_MASK               = 0x3ff,
>>>> +    BDRV_REQ_MASK               = 0x7ff,
>>>>    } BdrvRequestFlags;
>>>>      typedef struct BlockSizes {
>>>> diff --git a/block/io.c b/block/io.c
>>>> index dd28befb08..c93b1e98a3 100644
>>>> --- a/block/io.c
>>>> +++ b/block/io.c
>>>> @@ -1912,9 +1912,18 @@ bdrv_co_write_req_prepare(BdrvChild *child,
>>>> int64_t offset, uint64_t bytes,
>>>>        assert(!(bs->open_flags & BDRV_O_INACTIVE));
>>>>        assert((bs->open_flags & BDRV_O_NO_IO) == 0);
>>>>        assert(!(flags & ~BDRV_REQ_MASK));
>>>> +    assert(!((flags & BDRV_REQ_NO_WAIT) && !(flags &
>>>> BDRV_REQ_SERIALISING)));
>>>>          if (flags & BDRV_REQ_SERIALISING) {
>>>> -        bdrv_make_request_serialising(req, bdrv_get_cluster_size(bs));
>>>> +        QEMU_LOCK_GUARD(&bs->reqs_lock);
>>>> +
>>>> +        tracked_request_set_serialising(req,
>>>> bdrv_get_cluster_size(bs));
>>>> +
>>>> +        if ((flags & BDRV_REQ_NO_WAIT) &&
>>>> bdrv_find_conflicting_request(req)) {
>>>
>>> bdrv_find_conflicting_request() will return NULL even if there are
>>> conflicting requests, but those have a non-NULL waiting_for.  Is that
>>> something to consider?
>>>
>>> (I would like to think that will never have a real impact because then
>>> we must find some other conflicting request; but isn’t is possible that
>>> we find an overlapping request that waits for another request with which
>>> it overlaps, but our request does not?)
>>>
>>
>> Actually check in bdrv_find_conflicting_request() is the same like in
>> the following
>> bdrv_wait_serialising_requests_locked(), so, if
>> bdrv_find_conflicting_request() returns
>> NULL, it means that in bdrv_wait_serialising_requests_locked() it will
>> return NULL
>> again (as there are no yield points) and we will not wait, so all is OK.
> 
> OK.  I thought that maybe we would want to avoid that other requests
> might have to wait for the preallocation write.  (Of course, we can’t
> avoid that altogether, but if we already know of such requests at the
> beginning of the request...)
> 

Hmm, I see your point now. Hmm actually, what we want:

1. make preallocation on write
2. serialize the request
3. if there are conflicting requests do nothing, as (most probably) the
conflicting request will do preallocation itself

So, what we can say about intersecting requests, which are waiting for something (and therefore are not "conflicting")?
Aha, for sure, they don't preallocate (otherwise they don't wait). So [3] is still satisfied..

There was not original aim to not make the other parallel request wait for preallocation. So, it may be done later.. Intuitively, I don't expect the benefit: if there is no concurrent preallocate request, who are these intersecting requests? For example, EOF-crossing READS, waiting for some serialized not preallocating (so fit into file-size) WRITE. But this shouldn't happen often :)

> Well, if the only thing to look out for is that preallocation writes
> themselves do not wait:

So, let's assume that yes, we care that they don't wait themselves (mostly to avoid concurrent preallocation requests) and don't care about other intersecting requests.

> 
> Reviewed-by: Max Reitz <mreitz@redhat.com>

Thanks!

> 
>> And, why is it OK to ignore already waiting requests in
>> bdrv_wait_serialising_requests_locked(): just because if we proceed now
>> with our request,
>> these waiting requests will have to wait for us, when they wake and go
>> to the next iteration
>> of waiting loop.
> 
> Sure.
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-26  8:52       ` Max Reitz
@ 2020-08-26  9:15         ` Vladimir Sementsov-Ogievskiy
  2020-08-26  9:58           ` Max Reitz
  0 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-26  9:15 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den

26.08.2020 11:52, Max Reitz wrote:
> On 26.08.20 08:49, Vladimir Sementsov-Ogievskiy wrote:
>> 25.08.2020 18:11, Max Reitz wrote:
>>> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
>>>> It's intended to be inserted between format and protocol nodes to
>>>> preallocate additional space (expanding protocol file) on writes
>>>> crossing EOF. It improves performance for file-systems with slow
>>>> allocation.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> ---
>>>>    docs/system/qemu-block-drivers.rst.inc |  26 +++
>>>>    qapi/block-core.json                   |  20 +-
>>>>    block/preallocate.c                    | 291 +++++++++++++++++++++++++
>>>>    block/Makefile.objs                    |   1 +
>>>>    4 files changed, 337 insertions(+), 1 deletion(-)
>>>>    create mode 100644 block/preallocate.c
> 
> [...]
> 
>>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>>> new file mode 100644
>>>> index 0000000000..bdf54dbd2f
>>>> --- /dev/null
>>>> +++ b/block/preallocate.c
>>>> @@ -0,0 +1,291 @@
>>>> +/*
>>>> + * preallocate filter driver
>>>> + *
>>>> + * The driver performs preallocate operation: it is injected above
>>>> + * some node, and before each write over EOF it does additional
>>>> preallocating
>>>> + * write-zeroes request.
>>>> + *
>>>> + * Copyright (c) 2020 Virtuozzo International GmbH.
>>>> + *
>>>> + * Author:
>>>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify
>>>> + * it under the terms of the GNU General Public License as published by
>>>> + * the Free Software Foundation; either version 2 of the License, or
>>>> + * (at your option) any later version.
>>>> + *
>>>> + * This program is distributed in the hope that it will be useful,
>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>> + * GNU General Public License for more details.
>>>> + *
>>>> + * You should have received a copy of the GNU General Public License
>>>> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +
>>>> +#include "qapi/error.h"
>>>> +#include "qemu/module.h"
>>>> +#include "qemu/option.h"
>>>> +#include "qemu/units.h"
>>>> +#include "block/block_int.h"
>>>> +
>>>> +
>>>> +typedef struct BDRVPreallocateState {
>>>> +    int64_t prealloc_size;
>>>> +    int64_t prealloc_align;
>>>> +
>>>> +    /*
>>>> +     * Filter is started as not-active, so it doesn't do any
>>>> preallocations nor
>>>> +     * requires BLK_PERM_RESIZE on its child. This is needed to
>>>> create filter
>>>> +     * above another node-child and then do bdrv_replace_node to
>>>> insert the
>>>> +     * filter.
>>>
>>> A bit weird, but seems fair.  It’s basically just a cache for whether
>>> this node has a writer on it or not.
>>>
>>> Apart from the weirdness, I don’t understand the “another node-child”.
>>> Say you have “format” -> “proto”, and then you want to insert
>>> “prealloc”.  Wouldn’t you blockdev-add prealloc,file=proto and then
>>> blockdev-replace format.file=prealloc?
>>
>> Yes something like this. Actually, I'm about inserting the filter
>> automatically from block layer code, but your reasoning is about same
>> thing and is better.
>>
>>> So what is that “other node-child”?
>>
>> Hmm, just my misleading wording. At least '-' in wrong place. Just
>> "other node child", or "child of another node"..
> 
> OK.
> 
>>>> +     *
>>>> +     * Filter becomes active the first time it detects that its
>>>> parents have
>>>> +     * BLK_PERM_RESIZE on it.
>>>> +     * Filter becomes active forever: it doesn't lose active status
>>>> if parents
>>>> +     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink
>>>> the file on
>>>> +     * filter close.
>>>
>>> Oh, the file is shrunk?  That wasn’t clear from the documentation.
>>>
>>> Hm.  Seems like useful behavior.  I just wonder if keeping the
>>> permission around indefinitely makes sense.  Why not just truncate it
>>> when the permission is revoked?
>>
>> How? Parent is closed earlier, so on close we will not have the
>> permission. So, we force-keep the permission up to our close().
> 
> I thought that preallocate_child_perm() would be invoked when the parent
> is detached, so we could do the truncate there, before relinquishing
> preallocate’s RESIZE permission.
> 

Hmm, I can check it. I just a bit afraid of doing something serious like truncation in .bdrv_child_perm() handler, which doesn't seem to imply such usage.

> 
>>>> +static void preallocate_close(BlockDriverState *bs)
>>>> +{
>>>> +    BDRVPreallocateState *s = bs->opaque;
>>>> +
>>>> +    if (s->active && s->data_end >= 0 &&
>>>> +        bdrv_getlength(bs->file->bs) > s->data_end)
>>>> +    {
>>>> +        bdrv_truncate(bs->file, s->data_end, true,
>>>> PREALLOC_MODE_OFF, 0, NULL);
>>>
>>> Now that I think more about it...  What if there are other writers on
>>> bs->file?  This may throw data away.
>>
>> Good point. Actually, if bs->file has other writing parents, the logic
>> of the filter
>> around "data_end" is broken. So we must unshare WRITE and RESIZE
>> permissions.
> 
> That’s certainly a heavy hammer, but it’d work.
> 
>>>   Maybe BDS.wr_highest_offset can
>>> help to make a better call?
>>
>> Anyway, we'll have to use wr_highest_offset of the filter not the child
>> node
>> (in the child wr_highest_offset will track preallocations as well),
> 
> That’s true.
> 
>> so we want to unshare WRITE/RESIZE permissions.
> 
> [...]
> 
>>>> +static int coroutine_fn
>>>> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
>>>> +                        bool exact, PreallocMode prealloc,
>>>> +                        BdrvRequestFlags flags, Error **errp)
>>>> +{
>>>> +    BDRVPreallocateState *s = bs->opaque;
>>>> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc,
>>>> flags, errp);
>>>> +
>>>> +    /* s->data_end may become negative here, which means unknown
>>>> data end */
>>>> +    s->data_end = bdrv_getlength(bs->file->bs);
>>>
>>> What would be the problem with just setting data_end = offset?  We only
>>> use data_end to truncate down on close, so basically repeat the
>>> bdrv_co_truncate() call here, which seems correct.
>>
>> We can use offset only if ret >= 0 && exact is true..
> 
> Well, on close, we ignore any errors from truncate() anyway.  So even if
> we used exact=false here, it shouldn’t matter.

Here we handle truncate from user. If I understand "exact" correctly the following is possible:

1. parent does truncate 1M -> 2M, exact=false
2. driver decides to truncate to 5M (exact condition is sutisfied)
3. but we remember s->data_end = 2M, and on close will shrink to 2M

Doesn't seem a real problem.. So, we just consider the preallocation done by driver as our preallocation to be dropped on close().

Could there be problems on user shrink?

1. parent does truncate 2M -> 1M, exact=false
2. driver decides to do notihng (why not)
3. on close() we'll shrink to 1M..

Again, seems no real problems.

But in case when bdrv_co_truncate failed, we should call bdrv_getlength anyway, as we don't know the result of failed truncation. Or we can just set s->data_end = -1, and not care too much about fail-scenarios.

So, probably we can do
s->data_end = ret < 0 : ret : offset;

But I just feel safer with additional bdrv_getlength().

> 
>> But simpler is
>> just call
>> bdrv_getlength() anyway.
> 
> Certainly simpler than thinking about potential edge cases, true.
> 
> Max
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-26  9:15         ` Vladimir Sementsov-Ogievskiy
@ 2020-08-26  9:58           ` Max Reitz
  2020-08-26 11:29             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 30+ messages in thread
From: Max Reitz @ 2020-08-26  9:58 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den


[-- Attachment #1.1: Type: text/plain, Size: 9071 bytes --]

On 26.08.20 11:15, Vladimir Sementsov-Ogievskiy wrote:
> 26.08.2020 11:52, Max Reitz wrote:
>> On 26.08.20 08:49, Vladimir Sementsov-Ogievskiy wrote:
>>> 25.08.2020 18:11, Max Reitz wrote:
>>>> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
>>>>> It's intended to be inserted between format and protocol nodes to
>>>>> preallocate additional space (expanding protocol file) on writes
>>>>> crossing EOF. It improves performance for file-systems with slow
>>>>> allocation.
>>>>>
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>> ---
>>>>>    docs/system/qemu-block-drivers.rst.inc |  26 +++
>>>>>    qapi/block-core.json                   |  20 +-
>>>>>    block/preallocate.c                    | 291
>>>>> +++++++++++++++++++++++++
>>>>>    block/Makefile.objs                    |   1 +
>>>>>    4 files changed, 337 insertions(+), 1 deletion(-)
>>>>>    create mode 100644 block/preallocate.c
>>
>> [...]
>>
>>>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>>>> new file mode 100644
>>>>> index 0000000000..bdf54dbd2f
>>>>> --- /dev/null
>>>>> +++ b/block/preallocate.c
>>>>> @@ -0,0 +1,291 @@
>>>>> +/*
>>>>> + * preallocate filter driver
>>>>> + *
>>>>> + * The driver performs preallocate operation: it is injected above
>>>>> + * some node, and before each write over EOF it does additional
>>>>> preallocating
>>>>> + * write-zeroes request.
>>>>> + *
>>>>> + * Copyright (c) 2020 Virtuozzo International GmbH.
>>>>> + *
>>>>> + * Author:
>>>>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>>>>> + *
>>>>> + * This program is free software; you can redistribute it and/or
>>>>> modify
>>>>> + * it under the terms of the GNU General Public License as
>>>>> published by
>>>>> + * the Free Software Foundation; either version 2 of the License, or
>>>>> + * (at your option) any later version.
>>>>> + *
>>>>> + * This program is distributed in the hope that it will be useful,
>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>>> + * GNU General Public License for more details.
>>>>> + *
>>>>> + * You should have received a copy of the GNU General Public License
>>>>> + * along with this program. If not, see
>>>>> <http://www.gnu.org/licenses/>.
>>>>> + */
>>>>> +
>>>>> +#include "qemu/osdep.h"
>>>>> +
>>>>> +#include "qapi/error.h"
>>>>> +#include "qemu/module.h"
>>>>> +#include "qemu/option.h"
>>>>> +#include "qemu/units.h"
>>>>> +#include "block/block_int.h"
>>>>> +
>>>>> +
>>>>> +typedef struct BDRVPreallocateState {
>>>>> +    int64_t prealloc_size;
>>>>> +    int64_t prealloc_align;
>>>>> +
>>>>> +    /*
>>>>> +     * Filter is started as not-active, so it doesn't do any
>>>>> preallocations nor
>>>>> +     * requires BLK_PERM_RESIZE on its child. This is needed to
>>>>> create filter
>>>>> +     * above another node-child and then do bdrv_replace_node to
>>>>> insert the
>>>>> +     * filter.
>>>>
>>>> A bit weird, but seems fair.  It’s basically just a cache for whether
>>>> this node has a writer on it or not.
>>>>
>>>> Apart from the weirdness, I don’t understand the “another node-child”.
>>>> Say you have “format” -> “proto”, and then you want to insert
>>>> “prealloc”.  Wouldn’t you blockdev-add prealloc,file=proto and then
>>>> blockdev-replace format.file=prealloc?
>>>
>>> Yes something like this. Actually, I'm about inserting the filter
>>> automatically from block layer code, but your reasoning is about same
>>> thing and is better.
>>>
>>>> So what is that “other node-child”?
>>>
>>> Hmm, just my misleading wording. At least '-' in wrong place. Just
>>> "other node child", or "child of another node"..
>>
>> OK.
>>
>>>>> +     *
>>>>> +     * Filter becomes active the first time it detects that its
>>>>> parents have
>>>>> +     * BLK_PERM_RESIZE on it.
>>>>> +     * Filter becomes active forever: it doesn't lose active status
>>>>> if parents
>>>>> +     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink
>>>>> the file on
>>>>> +     * filter close.
>>>>
>>>> Oh, the file is shrunk?  That wasn’t clear from the documentation.
>>>>
>>>> Hm.  Seems like useful behavior.  I just wonder if keeping the
>>>> permission around indefinitely makes sense.  Why not just truncate it
>>>> when the permission is revoked?
>>>
>>> How? Parent is closed earlier, so on close we will not have the
>>> permission. So, we force-keep the permission up to our close().
>>
>> I thought that preallocate_child_perm() would be invoked when the parent
>> is detached, so we could do the truncate there, before relinquishing
>> preallocate’s RESIZE permission.
>>
> 
> Hmm, I can check it. I just a bit afraid of doing something serious like
> truncation in .bdrv_child_perm() handler, which doesn't seem to imply
> such usage.

I’m a bit conflicted.  On one hand, I share your concern.  On the other,
I find it completely reasonable that drivers might want to do I/O when
permissions change.

Usually, this is done as part of reopen, like in qcow2 when a drive
changes from R/W to RO and caches need to be flushed.  But I actually
think it makes more sense as part of the permission system, because of
course a reopen is not the only time when permissions can change.

So that would be another alternative, to implement
.bdrv_reopen_prepare(), and to check reopen_state->perm there.  If
RESIZE is about to go away, then we truncate.  We could keep track of
whether we did any preallocations after the last such truncate, and if
we did, do another truncate on close.

This would cover reopen at least.  Seems better than nothing, but, well...

(One big problem I see with truncating in .bdrv_child_perm() is that
that function is in no way committing.  It’s just a request: “If your
parents need this of you, what do you need of your children?”
But I think that could be addressed by doing it in .bdrv_set_perm()
instead.  Of note is that file-posix actually does do I/O in its
raw_set_perm() function, in that it commits to file locks.)

[...]

>>>>> +static int coroutine_fn
>>>>> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
>>>>> +                        bool exact, PreallocMode prealloc,
>>>>> +                        BdrvRequestFlags flags, Error **errp)
>>>>> +{
>>>>> +    BDRVPreallocateState *s = bs->opaque;
>>>>> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc,
>>>>> flags, errp);
>>>>> +
>>>>> +    /* s->data_end may become negative here, which means unknown
>>>>> data end */
>>>>> +    s->data_end = bdrv_getlength(bs->file->bs);
>>>>
>>>> What would be the problem with just setting data_end = offset?  We only
>>>> use data_end to truncate down on close, so basically repeat the
>>>> bdrv_co_truncate() call here, which seems correct.
>>>
>>> We can use offset only if ret >= 0 && exact is true..
>>
>> Well, on close, we ignore any errors from truncate() anyway.  So even if
>> we used exact=false here, it shouldn’t matter.
> 
> Here we handle truncate from user. If I understand "exact" correctly the
> following is possible:
> 
> 1. parent does truncate 1M -> 2M, exact=false
> 2. driver decides to truncate to 5M (exact condition is sutisfied)
> 3. but we remember s->data_end = 2M, and on close will shrink to 2M

In theory, yes; in practice, exact=false just means to ignore failures
due to unsupported sizes.  So in this example, the driver would have
resized to 5M because 2M is an impossible size, and thus the shrink on
close would just fail.

> Doesn't seem a real problem.. So, we just consider the preallocation
> done by driver as our preallocation to be dropped on close().
> 
> Could there be problems on user shrink?
> 
> 1. parent does truncate 2M -> 1M, exact=false
> 2. driver decides to do notihng (why not)
> 3. on close() we'll shrink to 1M..
> 
> Again, seems no real problems.

Same as above, in practice the only difference between exact=false and
exact=true is how failures are reported.

> But in case when bdrv_co_truncate failed, we should call bdrv_getlength
> anyway, as we don't know the result of failed truncation. Or we can just
> set s->data_end = -1, and not care too much about fail-scenarios.
> 
> So, probably we can do
> s->data_end = ret < 0 : ret : offset;
> 
> But I just feel safer with additional bdrv_getlength().

Yes, let’s just leave it as you wrote it.

(It’s a bit silly of me to argue based on how drivers handle exact=false
in practice right now.  It can always change.  Also, I shouldn’t pretend
calling bdrv_getlength() would be an issue whatsoever, because it just
isn’t.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-26  9:58           ` Max Reitz
@ 2020-08-26 11:29             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-26 11:29 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: qemu-devel, eblake, armbru, fam, stefanha, kwolf, den, nsoffer

26.08.2020 12:58, Max Reitz wrote:
> On 26.08.20 11:15, Vladimir Sementsov-Ogievskiy wrote:
>> 26.08.2020 11:52, Max Reitz wrote:
>>> On 26.08.20 08:49, Vladimir Sementsov-Ogievskiy wrote:
>>>> 25.08.2020 18:11, Max Reitz wrote:
>>>>> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> It's intended to be inserted between format and protocol nodes to
>>>>>> preallocate additional space (expanding protocol file) on writes
>>>>>> crossing EOF. It improves performance for file-systems with slow
>>>>>> allocation.
>>>>>>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>> ---
>>>>>>     docs/system/qemu-block-drivers.rst.inc |  26 +++
>>>>>>     qapi/block-core.json                   |  20 +-
>>>>>>     block/preallocate.c                    | 291
>>>>>> +++++++++++++++++++++++++
>>>>>>     block/Makefile.objs                    |   1 +
>>>>>>     4 files changed, 337 insertions(+), 1 deletion(-)
>>>>>>     create mode 100644 block/preallocate.c
>>>
>>> [...]
>>>
>>>>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>>>>> new file mode 100644
>>>>>> index 0000000000..bdf54dbd2f
>>>>>> --- /dev/null
>>>>>> +++ b/block/preallocate.c
>>>>>> @@ -0,0 +1,291 @@
>>>>>> +/*
>>>>>> + * preallocate filter driver
>>>>>> + *
>>>>>> + * The driver performs preallocate operation: it is injected above
>>>>>> + * some node, and before each write over EOF it does additional
>>>>>> preallocating
>>>>>> + * write-zeroes request.
>>>>>> + *
>>>>>> + * Copyright (c) 2020 Virtuozzo International GmbH.
>>>>>> + *
>>>>>> + * Author:
>>>>>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>>>>>> + *
>>>>>> + * This program is free software; you can redistribute it and/or
>>>>>> modify
>>>>>> + * it under the terms of the GNU General Public License as
>>>>>> published by
>>>>>> + * the Free Software Foundation; either version 2 of the License, or
>>>>>> + * (at your option) any later version.
>>>>>> + *
>>>>>> + * This program is distributed in the hope that it will be useful,
>>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>>>> + * GNU General Public License for more details.
>>>>>> + *
>>>>>> + * You should have received a copy of the GNU General Public License
>>>>>> + * along with this program. If not, see
>>>>>> <http://www.gnu.org/licenses/>.
>>>>>> + */
>>>>>> +
>>>>>> +#include "qemu/osdep.h"
>>>>>> +
>>>>>> +#include "qapi/error.h"
>>>>>> +#include "qemu/module.h"
>>>>>> +#include "qemu/option.h"
>>>>>> +#include "qemu/units.h"
>>>>>> +#include "block/block_int.h"
>>>>>> +
>>>>>> +
>>>>>> +typedef struct BDRVPreallocateState {
>>>>>> +    int64_t prealloc_size;
>>>>>> +    int64_t prealloc_align;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Filter is started as not-active, so it doesn't do any
>>>>>> preallocations nor
>>>>>> +     * requires BLK_PERM_RESIZE on its child. This is needed to
>>>>>> create filter
>>>>>> +     * above another node-child and then do bdrv_replace_node to
>>>>>> insert the
>>>>>> +     * filter.
>>>>>
>>>>> A bit weird, but seems fair.  It’s basically just a cache for whether
>>>>> this node has a writer on it or not.
>>>>>
>>>>> Apart from the weirdness, I don’t understand the “another node-child”.
>>>>> Say you have “format” -> “proto”, and then you want to insert
>>>>> “prealloc”.  Wouldn’t you blockdev-add prealloc,file=proto and then
>>>>> blockdev-replace format.file=prealloc?
>>>>
>>>> Yes something like this. Actually, I'm about inserting the filter
>>>> automatically from block layer code, but your reasoning is about same
>>>> thing and is better.
>>>>
>>>>> So what is that “other node-child”?
>>>>
>>>> Hmm, just my misleading wording. At least '-' in wrong place. Just
>>>> "other node child", or "child of another node"..
>>>
>>> OK.
>>>
>>>>>> +     *
>>>>>> +     * Filter becomes active the first time it detects that its
>>>>>> parents have
>>>>>> +     * BLK_PERM_RESIZE on it.
>>>>>> +     * Filter becomes active forever: it doesn't lose active status
>>>>>> if parents
>>>>>> +     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink
>>>>>> the file on
>>>>>> +     * filter close.
>>>>>
>>>>> Oh, the file is shrunk?  That wasn’t clear from the documentation.
>>>>>
>>>>> Hm.  Seems like useful behavior.  I just wonder if keeping the
>>>>> permission around indefinitely makes sense.  Why not just truncate it
>>>>> when the permission is revoked?
>>>>
>>>> How? Parent is closed earlier, so on close we will not have the
>>>> permission. So, we force-keep the permission up to our close().
>>>
>>> I thought that preallocate_child_perm() would be invoked when the parent
>>> is detached, so we could do the truncate there, before relinquishing
>>> preallocate’s RESIZE permission.
>>>
>>
>> Hmm, I can check it. I just a bit afraid of doing something serious like
>> truncation in .bdrv_child_perm() handler, which doesn't seem to imply
>> such usage.
> 
> I’m a bit conflicted.  On one hand, I share your concern.  On the other,
> I find it completely reasonable that drivers might want to do I/O when
> permissions change.
> 
> Usually, this is done as part of reopen, like in qcow2 when a drive
> changes from R/W to RO and caches need to be flushed.  But I actually
> think it makes more sense as part of the permission system, because of
> course a reopen is not the only time when permissions can change.
> 
> So that would be another alternative, to implement
> .bdrv_reopen_prepare(), and to check reopen_state->perm there.  If
> RESIZE is about to go away, then we truncate.  We could keep track of
> whether we did any preallocations after the last such truncate, and if
> we did, do another truncate on close.
> 
> This would cover reopen at least.  Seems better than nothing, but, well...

Reopen will not cover the main case: bdrv_clase_all..

> 
> (One big problem I see with truncating in .bdrv_child_perm() is that
> that function is in no way committing.  It’s just a request: “If your
> parents need this of you, what do you need of your children?”
> But I think that could be addressed by doing it in .bdrv_set_perm()
> instead.  Of note is that file-posix actually does do I/O in its
> raw_set_perm() function, in that it commits to file locks.)

will try

> 
> [...]
> 
>>>>>> +static int coroutine_fn
>>>>>> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
>>>>>> +                        bool exact, PreallocMode prealloc,
>>>>>> +                        BdrvRequestFlags flags, Error **errp)
>>>>>> +{
>>>>>> +    BDRVPreallocateState *s = bs->opaque;
>>>>>> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc,
>>>>>> flags, errp);
>>>>>> +
>>>>>> +    /* s->data_end may become negative here, which means unknown
>>>>>> data end */
>>>>>> +    s->data_end = bdrv_getlength(bs->file->bs);
>>>>>
>>>>> What would be the problem with just setting data_end = offset?  We only
>>>>> use data_end to truncate down on close, so basically repeat the
>>>>> bdrv_co_truncate() call here, which seems correct.
>>>>
>>>> We can use offset only if ret >= 0 && exact is true..
>>>
>>> Well, on close, we ignore any errors from truncate() anyway.  So even if
>>> we used exact=false here, it shouldn’t matter.
>>
>> Here we handle truncate from user. If I understand "exact" correctly the
>> following is possible:
>>
>> 1. parent does truncate 1M -> 2M, exact=false
>> 2. driver decides to truncate to 5M (exact condition is sutisfied)
>> 3. but we remember s->data_end = 2M, and on close will shrink to 2M
> 
> In theory, yes; in practice, exact=false just means to ignore failures
> due to unsupported sizes.  So in this example, the driver would have
> resized to 5M because 2M is an impossible size, and thus the shrink on
> close would just fail.
> 
>> Doesn't seem a real problem.. So, we just consider the preallocation
>> done by driver as our preallocation to be dropped on close().
>>
>> Could there be problems on user shrink?
>>
>> 1. parent does truncate 2M -> 1M, exact=false
>> 2. driver decides to do notihng (why not)
>> 3. on close() we'll shrink to 1M..
>>
>> Again, seems no real problems.
> 
> Same as above, in practice the only difference between exact=false and
> exact=true is how failures are reported.
> 
>> But in case when bdrv_co_truncate failed, we should call bdrv_getlength
>> anyway, as we don't know the result of failed truncation. Or we can just
>> set s->data_end = -1, and not care too much about fail-scenarios.
>>
>> So, probably we can do
>> s->data_end = ret < 0 : ret : offset;
>>
>> But I just feel safer with additional bdrv_getlength().
> 
> Yes, let’s just leave it as you wrote it.
> 
> (It’s a bit silly of me to argue based on how drivers handle exact=false
> in practice right now.  It can always change.  Also, I shouldn’t pretend
> calling bdrv_getlength() would be an issue whatsoever, because it just
> isn’t.)
> 
> Max
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-25 15:11   ` Max Reitz
  2020-08-26  6:49     ` Vladimir Sementsov-Ogievskiy
@ 2020-08-26 13:51     ` David Edmondson
  2020-08-27  9:19       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 30+ messages in thread
From: David Edmondson @ 2020-08-26 13:51 UTC (permalink / raw)
  To: Max Reitz, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den

On Tuesday, 2020-08-25 at 17:11:34 +02, Max Reitz wrote:

> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
>> It's intended to be inserted between format and protocol nodes to
>> preallocate additional space (expanding protocol file) on writes
>> crossing EOF. It improves performance for file-systems with slow
>> allocation.
>> 
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>  docs/system/qemu-block-drivers.rst.inc |  26 +++
>>  qapi/block-core.json                   |  20 +-
>>  block/preallocate.c                    | 291 +++++++++++++++++++++++++
>>  block/Makefile.objs                    |   1 +
>>  4 files changed, 337 insertions(+), 1 deletion(-)
>>  create mode 100644 block/preallocate.c
>
> Looks good to me in essence.  Besides minor details, I wonder most about
> whether truncating the file on close can be safe, but more about that below.
>
>> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
>> index b052a6d14e..5e8a35c571 100644
>> --- a/docs/system/qemu-block-drivers.rst.inc
>> +++ b/docs/system/qemu-block-drivers.rst.inc
>> @@ -952,3 +952,29 @@ on host and see if there are locks held by the QEMU process on the image file.
>>  More than one byte could be locked by the QEMU instance, each byte of which
>>  reflects a particular permission that is acquired or protected by the running
>>  block driver.
>> +
>> +Filter drivers
>> +~~~~~~~~~~~~~~
>> +
>> +QEMU supports several filter drivers, which don't store any data, but do some
>
> s/do/perform/
>
>> +additional tasks, hooking io requests.
>> +
>> +.. program:: filter-drivers
>> +.. option:: preallocate
>> +
>> +  Preallocate filter driver is intended to be inserted between format
>
> *The preallocate filter driver
>
>> +  and protocol nodes and does preallocation of some additional space
>
> I’d simplify this as s/does preallocation of/preallocates/
>
>> +  (expanding the protocol file) on write. It may be used for
>
> I’d complicate this as s/on write/when writing past the file’s end/, or
> “when data is written to the file after its end”, or at least “on
> post-EOF writes”.
>
> Maybe also s/It may be used for/This can be useful for/?
>
>> +  file-systems with slow allocation.
>> +
>> +  Supported options:
>> +
>> +  .. program:: preallocate
>> +  .. option:: prealloc-align
>> +
>> +    On preallocation, align file length to this number, default 1M.
>
> *the file length
>
> As for “number”...  Well, it is a number.  But “value” might fit better.
>  Or “length (in bytes)”?

Isn't it really:

"On preallocation, ensure that the file length is aligned to a multiple
of this value, default 1M."

?

>> +
>> +  .. program:: preallocate
>> +  .. option:: prealloc-size
>> +
>> +    How much to preallocate, default 128M.
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 197bdc1c36..b40448063b 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2805,7 +2805,7 @@
>>              'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
>>              'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
>>              'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
>> -            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>> +            'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>>              { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
>>              'sheepdog',
>>              'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
>> @@ -3074,6 +3074,23 @@
>>    'data': { 'aes': 'QCryptoBlockOptionsQCow',
>>              'luks': 'QCryptoBlockOptionsLUKS'} }
>>  
>> +##
>> +# @BlockdevOptionsPreallocate:
>> +#
>> +# Filter driver intended to be inserted between format and protocol node
>> +# and do preallocation in protocol node on write.
>> +#
>> +# @prealloc-align: on preallocation, align file length to this number,
>> +#                 default 1048576 (1M)
>
> Speaking of alignment, this second line isn’t properly aligned.
>
>> +#
>> +# @prealloc-size: how much to preallocate, default 134217728 (128M)
>> +#
>> +# Since: 5.2
>> +##
>> +{ 'struct': 'BlockdevOptionsPreallocate',
>> +  'base': 'BlockdevOptionsGenericFormat',
>> +  'data': { '*prealloc-align': 'int', '*prealloc-size': 'int' } }
>> +
>>  ##
>>  # @BlockdevOptionsQcow2:
>>  #
>> @@ -3979,6 +3996,7 @@
>>        'null-co':    'BlockdevOptionsNull',
>>        'nvme':       'BlockdevOptionsNVMe',
>>        'parallels':  'BlockdevOptionsGenericFormat',
>> +      'preallocate':'BlockdevOptionsPreallocate',
>>        'qcow2':      'BlockdevOptionsQcow2',
>>        'qcow':       'BlockdevOptionsQcow',
>>        'qed':        'BlockdevOptionsGenericCOWFormat',
>> diff --git a/block/preallocate.c b/block/preallocate.c
>> new file mode 100644
>> index 0000000000..bdf54dbd2f
>> --- /dev/null
>> +++ b/block/preallocate.c
>> @@ -0,0 +1,291 @@
>> +/*
>> + * preallocate filter driver
>> + *
>> + * The driver performs preallocate operation: it is injected above
>> + * some node, and before each write over EOF it does additional preallocating
>> + * write-zeroes request.
>> + *
>> + * Copyright (c) 2020 Virtuozzo International GmbH.
>> + *
>> + * Author:
>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +
>> +#include "qapi/error.h"
>> +#include "qemu/module.h"
>> +#include "qemu/option.h"
>> +#include "qemu/units.h"
>> +#include "block/block_int.h"
>> +
>> +
>> +typedef struct BDRVPreallocateState {
>> +    int64_t prealloc_size;
>> +    int64_t prealloc_align;
>> +
>> +    /*
>> +     * Filter is started as not-active, so it doesn't do any preallocations nor
>> +     * requires BLK_PERM_RESIZE on its child. This is needed to create filter
>> +     * above another node-child and then do bdrv_replace_node to insert the
>> +     * filter.
>
> A bit weird, but seems fair.  It’s basically just a cache for whether
> this node has a writer on it or not.
>
> Apart from the weirdness, I don’t understand the “another node-child”.
> Say you have “format” -> “proto”, and then you want to insert
> “prealloc”.  Wouldn’t you blockdev-add prealloc,file=proto and then
> blockdev-replace format.file=prealloc?  So what is that “other node-child”?
>
>> +     *
>> +     * Filter becomes active the first time it detects that its parents have
>> +     * BLK_PERM_RESIZE on it.
>> +     * Filter becomes active forever: it doesn't lose active status if parents
>> +     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink the file on
>> +     * filter close.
>
> Oh, the file is shrunk?  That wasn’t clear from the documentation.
>
> Hm.  Seems like useful behavior.  I just wonder if keeping the
> permission around indefinitely makes sense.  Why not just truncate it
> when the permission is revoked?
>
>> +     */
>> +    bool active;
>> +
>> +    /*
>> +     * Track real data end, to crop preallocation on close  data_end may be
>
> s/on close /when closed./
>
>> +     * negative, which means that actual status is unknown (nothing cropped in
>> +     * this case)
>> +     */
>> +    int64_t data_end;
>> +} BDRVPreallocateState;
>> +
>> +#define PREALLOCATE_OPT_PREALLOC_ALIGN "prealloc-align"
>> +#define PREALLOCATE_OPT_PREALLOC_SIZE "prealloc-size"
>> +static QemuOptsList runtime_opts = {
>> +    .name = "preallocate",
>> +    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
>> +    .desc = {
>> +        {
>> +            .name = PREALLOCATE_OPT_PREALLOC_ALIGN,
>> +            .type = QEMU_OPT_SIZE,
>> +            .help = "on preallocation, align file length to this number, "
>> +                "default 1M",
>> +        },
>> +        {
>> +            .name = PREALLOCATE_OPT_PREALLOC_SIZE,
>> +            .type = QEMU_OPT_SIZE,
>> +            .help = "how much to preallocate, default 128M",
>> +        },
>> +        { /* end of list */ }
>> +    },
>> +};
>> +
>> +static int preallocate_open(BlockDriverState *bs, QDict *options, int flags,
>> +                            Error **errp)
>> +{
>> +    QemuOpts *opts;
>> +    BDRVPreallocateState *s = bs->opaque;
>> +
>> +    /*
>> +     * Parameters are hardcoded now. May need to add corresponding options in
>> +     * future.
>> +     */
>> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>> +    qemu_opts_absorb_qdict(opts, options, &error_abort);
>> +    s->prealloc_align =
>> +        qemu_opt_get_size(opts, PREALLOCATE_OPT_PREALLOC_ALIGN, 1 * MiB);
>> +    s->prealloc_size =
>> +        qemu_opt_get_size(opts, PREALLOCATE_OPT_PREALLOC_SIZE, 128 * MiB);
>> +    qemu_opts_del(opts);
>
> Should we have some validation on these parameters?  Like,
> prealloc_align being at least 512, or maybe even file.request_alignment?
>
> (I suppose anything for prealloc_size is fine, though 0 doesn’t make
> much sense...)
>
>> +
>> +    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
>> +                               BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
>> +                               false, errp);
>> +    if (!bs->file) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    s->data_end = bdrv_getlength(bs->file->bs);
>> +    if (s->data_end < 0) {
>> +        return s->data_end;
>> +    }
>> +
>> +    bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
>> +        (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
>> +
>> +    bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
>> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
>> +            bs->file->bs->supported_zero_flags);
>> +
>> +    return 0;
>> +}
>> +
>> +static void preallocate_close(BlockDriverState *bs)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +
>> +    if (s->active && s->data_end >= 0 &&
>> +        bdrv_getlength(bs->file->bs) > s->data_end)
>> +    {
>> +        bdrv_truncate(bs->file, s->data_end, true, PREALLOC_MODE_OFF, 0, NULL);
>
> Now that I think more about it...  What if there are other writers on
> bs->file?  This may throw data away.  Maybe BDS.wr_highest_offset can
> help to make a better call?
>
>> +    }
>> +}
>> +
>> +static void preallocate_child_perm(BlockDriverState *bs, BdrvChild *c,
>> +                                   BdrvChildRole role,
>> +                                   BlockReopenQueue *reopen_queue,
>> +                                   uint64_t perm, uint64_t shared,
>> +                                   uint64_t *nperm, uint64_t *nshared)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +
>> +    bdrv_default_perms(bs, c, role, reopen_queue, perm, shared, nperm, nshared);
>> +
>> +    s->active = s->active || (perm & BLK_PERM_RESIZE);
>> +
>> +    if (s->active) {
>> +        /* Force RESIZE permission, to be able to crop file on close() */
>> +        *nperm |= BLK_PERM_RESIZE;
>> +    }
>> +}
>> +
>> +static coroutine_fn int preallocate_co_preadv_part(
>> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
>> +        QEMUIOVector *qiov, size_t qiov_offset, int flags)
>> +{
>> +    return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
>> +                               flags);
>> +}
>> +
>> +static int coroutine_fn preallocate_co_pdiscard(BlockDriverState *bs,
>> +                                               int64_t offset, int bytes)
>> +{
>> +    return bdrv_co_pdiscard(bs->file, offset, bytes);
>> +}
>> +
>> +static bool coroutine_fn do_preallocate(BlockDriverState *bs, int64_t offset,
>> +                                       int64_t bytes, bool write_zero)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +    int64_t len, start, end;
>> +
>> +    if (!s->active) {
>> +        return false;
>> +    }
>> +
>> +    if (s->data_end >= 0) {
>> +        s->data_end = MAX(s->data_end,
>> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
>> +    }
>> +
>> +    len = bdrv_getlength(bs->file->bs);
>
> I’d rename @len so it’s clear that it has nothing to do with the request
> length.  Like, file_len.
>
> (Because...
>
>> +    if (len < 0) {
>> +        return false;
>> +    }
>> +
>> +    if (s->data_end < 0) {
>> +        s->data_end = MAX(len,
>> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
>> +    }
>> +
>> +    if (offset + bytes <= len) {
>> +        return false;
>> +    }
>> +
>> +    start = write_zero ? MIN(offset, len) : len;
>
> ...I got confused here for a bit)
>
>> +    end = QEMU_ALIGN_UP(offset + bytes + s->prealloc_size, s->prealloc_align);
>
> Ah, I expected offset + s->prealloc_size.  But OK.
>
>> +    return !bdrv_co_pwrite_zeroes(bs->file, start, end - start,
>> +            BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>> +}
>> +
>> +static int coroutine_fn preallocate_co_pwrite_zeroes(BlockDriverState *bs,
>> +        int64_t offset, int bytes, BdrvRequestFlags flags)
>> +{
>> +    if (do_preallocate(bs, offset, bytes, true)) {
>> +        return 0;
>> +    }
>> +
>> +    return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
>> +}
>> +
>> +static coroutine_fn int preallocate_co_pwritev_part(BlockDriverState *bs,
>> +                                                    uint64_t offset,
>> +                                                    uint64_t bytes,
>> +                                                    QEMUIOVector *qiov,
>> +                                                    size_t qiov_offset,
>> +                                                    int flags)
>> +{
>> +    do_preallocate(bs, offset, bytes, false);
>> +
>> +    return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
>> +                                flags);
>> +}
>> +
>> +static int coroutine_fn
>> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
>> +                        bool exact, PreallocMode prealloc,
>> +                        BdrvRequestFlags flags, Error **errp)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc, flags, errp);
>> +
>> +    /* s->data_end may become negative here, which means unknown data end */
>> +    s->data_end = bdrv_getlength(bs->file->bs);
>
> What would be the problem with just setting data_end = offset?  We only
> use data_end to truncate down on close, so basically repeat the
> bdrv_co_truncate() call here, which seems correct.
>
>> +
>> +    return ret;
>> +}
>> +
>> +static int coroutine_fn preallocate_co_flush(BlockDriverState *bs)
>> +{
>> +    return bdrv_co_flush(bs->file->bs);
>> +}
>> +
>> +static int64_t preallocate_getlength(BlockDriverState *bs)
>> +{
>> +    /*
>> +     * We probably can return s->data_end here, but seems safer to return real
>> +     * file length, not trying to hide the preallocation.
>
> I don’t know.  The auto-truncation on close makes that a bit weird, in
> my opinion.  But since this filter is probably never directly below a
> BlockBackend (i.e., the length is never exposed to anything outside of
> the block layer), but always below a format driver, it should be fine.
> (In fact, those drivers do probably rather care about the true file
> length rather than whatever they may have truncated it to, so you’re
> right, it should be safer.)
>
> Max
>
>> +     *
>> +     * Still, don't miss the chance to restore s->data_end if it is broken.
>> +     */
>> +    BDRVPreallocateState *s = bs->opaque;
>> +    int64_t ret = bdrv_getlength(bs->file->bs);
>> +
>> +    if (s->data_end < 0) {
>> +        s->data_end = ret;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +BlockDriver bdrv_preallocate_filter = {
>> +    .format_name = "preallocate",
>> +    .instance_size = sizeof(BDRVPreallocateState),
>> +
>> +    .bdrv_getlength = preallocate_getlength,
>> +    .bdrv_open = preallocate_open,
>> +    .bdrv_close = preallocate_close,
>> +
>> +    .bdrv_co_preadv_part = preallocate_co_preadv_part,
>> +    .bdrv_co_pwritev_part = preallocate_co_pwritev_part,
>> +    .bdrv_co_pwrite_zeroes = preallocate_co_pwrite_zeroes,
>> +    .bdrv_co_pdiscard = preallocate_co_pdiscard,
>> +    .bdrv_co_flush = preallocate_co_flush,
>> +    .bdrv_co_truncate = preallocate_co_truncate,
>> +
>> +    .bdrv_co_block_status = bdrv_co_block_status_from_file,
>> +
>> +    .bdrv_child_perm = preallocate_child_perm,
>> +
>> +    .has_variable_length = true,
>> +    .is_filter = true,
>> +};
>> +
>> +static void bdrv_preallocate_init(void)
>> +{
>> +    bdrv_register(&bdrv_preallocate_filter);
>> +}
>> +
>> +block_init(bdrv_preallocate_init);
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index 19c6f371c9..f8e6f16522 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -44,6 +44,7 @@ block-obj-y += crypto.o
>>  block-obj-y += aio_task.o
>>  block-obj-y += backup-top.o
>>  block-obj-y += filter-compress.o
>> +block-obj-y += preallocate.o
>>  common-obj-y += monitor/
>>  block-obj-y += monitor/
>>  
>> 

dme.
-- 
I think I waited too long, I'm moving into the dollhouse.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 07/10] block: introduce preallocate filter
  2020-08-26 13:51     ` David Edmondson
@ 2020-08-27  9:19       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-27  9:19 UTC (permalink / raw)
  To: David Edmondson, Max Reitz, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den

26.08.2020 16:51, David Edmondson wrote:
>>> +  file-systems with slow allocation.
>>> +
>>> +  Supported options:
>>> +
>>> +  .. program:: preallocate
>>> +  .. option:: prealloc-align
>>> +
>>> +    On preallocation, align file length to this number, default 1M.
>> *the file length
>>
>> As for “number”...  Well, it is a number.  But “value” might fit better.
>>   Or “length (in bytes)”?
> Isn't it really:
> 
> "On preallocation, ensure that the file length is aligned to a multiple
> of this value, default 1M."
> 

Sounds good, thanks!


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 00/10] preallocate filter
  2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (9 preceding siblings ...)
  2020-08-21 14:11 ` [PATCH v5 10/10] iotests: add 298 to test new preallocate filter driver Vladimir Sementsov-Ogievskiy
@ 2020-08-27 21:08 ` Vladimir Sementsov-Ogievskiy
  2020-09-01 15:07   ` Max Reitz
  10 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-27 21:08 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, eblake, armbru, fam, stefanha, mreitz, kwolf, den, nsoffer

21.08.2020 17:11, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> Here is a filter, which does preallocation on write.
> 
> In Virtuozzo we have to deal with some custom distributed storage
> solution, where allocation is relatively expensive operation. We have to
> workaround it in Qemu, so here is a new filter.

I have a problem now with this thing.

We need preallocation. But we don't want to explicitly specify it in all the management tools. So it should be inserted by default. It's OK for us to keep this default different from upstream... But there are problems with the implicitly inserted filter (actually iotests fail and I failed to fix them)

1. I have to set bs->inherits_from for filter and it's child by hand after bdrv_replace_node(), otherwise bdrv_check_perm doesn't work.

2. I have to set filter_bs->implicit and teach bdrv_refresh_filename() to ignore implicit filters when it checks for drv->bdrv_file_open, to avoid appearing of json in backing file names

3. And the real design problem, which seems impossible to fix: reopen is broken, just because user is not prepared to the fact that file child is a filter, not a file node and has another options, and don't support options of file-posix.

And seems all it (and mostly [3]) shows that implicitly inserting the filter is near to be impossible..

So, what are possible solutions?

In virtuozzo7 we have preallocation feature done inside qcow2 driver. This is very uncomfortable: we should to handle each possible over-EOF write to underlying node (to keep data_end in sync to be able to shrink preallocation on close()).. I don't like this way and don't want to port it..

Another option is implementing preallocation inside file-posix driver. Then, instead of BDRV_REQ_NO_WAIT flag I'll need to extend serialising requests API (bdrv_make_request_serialising() is already used in file-posix.c) to dupport no-wait behavior + expanding the serialising request bounds. This option seems feasible, so I'll try this way if no other ideas.

Filter is obviously the true way: we use generic block layer for native request serialising, don't need to catch every write in qcow2 driver, don't need to modify any other driver and get a universal thing. But how to insert it implicitly (or at least automatically in some cases) and avoid all the problems?

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 00/10] preallocate filter
  2020-08-27 21:08 ` [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
@ 2020-09-01 15:07   ` Max Reitz
  2020-09-17  9:00     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 30+ messages in thread
From: Max Reitz @ 2020-09-01 15:07 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: fam, kwolf, qemu-devel, armbru, nsoffer, stefanha, den


[-- Attachment #1.1: Type: text/plain, Size: 3345 bytes --]

On 27.08.20 23:08, Vladimir Sementsov-Ogievskiy wrote:
> 21.08.2020 17:11, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> Here is a filter, which does preallocation on write.
>>
>> In Virtuozzo we have to deal with some custom distributed storage
>> solution, where allocation is relatively expensive operation. We have to
>> workaround it in Qemu, so here is a new filter.
> 
> I have a problem now with this thing.
> 
> We need preallocation. But we don't want to explicitly specify it in all
> the management tools.

Why?

> So it should be inserted by default.

Why?  You mean without any option?  That seems...  Interesting?

(Also like a recipe for reports of performance regression in some cases.)

> It's OK for
> us to keep this default different from upstream... But there are
> problems with the implicitly inserted filter (actually iotests fail and
> I failed to fix them)

I would suspect even if the iotests passed we would end up with a heap
of problems that we would only notice at some later point.  I thought
you too weren’t too fond of the idea of implicit filters.

> 1. I have to set bs->inherits_from for filter and it's child by hand
> after bdrv_replace_node(), otherwise bdrv_check_perm doesn't work.
> 
> 2. I have to set filter_bs->implicit and teach bdrv_refresh_filename()
> to ignore implicit filters when it checks for drv->bdrv_file_open, to
> avoid appearing of json in backing file names
> 
> 3. And the real design problem, which seems impossible to fix: reopen is
> broken, just because user is not prepared to the fact that file child is
> a filter, not a file node and has another options, and don't support
> options of file-posix.

Well, what should I say.  I feel like we have made efforts in the past
years to make the block graph fully visible to users and yield the
responsibility of managing it to the users, too, so I’m not surprised if
a step backwards breaks that.

> And seems all it (and mostly [3]) shows that implicitly inserting the
> filter is near to be impossible..
> 
> So, what are possible solutions?
> 
> In virtuozzo7 we have preallocation feature done inside qcow2 driver.
> This is very uncomfortable: we should to handle each possible over-EOF
> write to underlying node (to keep data_end in sync to be able to shrink
> preallocation on close()).. I don't like this way and don't want to port
> it..
> 
> Another option is implementing preallocation inside file-posix driver.
> Then, instead of BDRV_REQ_NO_WAIT flag I'll need to extend serialising
> requests API (bdrv_make_request_serialising() is already used in
> file-posix.c) to dupport no-wait behavior + expanding the serialising
> request bounds. This option seems feasible, so I'll try this way if no
> other ideas.

Possible, but you haven’t yet explained what the problem with the
management layer inserting the preallocation filter is.

> Filter is obviously the true way: we use generic block layer for native
> request serialising, don't need to catch every write in qcow2 driver,
> don't need to modify any other driver and get a universal thing. But how
> to insert it implicitly (or at least automatically in some cases) and
> avoid all the problems?

I don’t understand why inserting it implicitly is important.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 00/10] preallocate filter
  2020-09-01 15:07   ` Max Reitz
@ 2020-09-17  9:00     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-09-17  9:00 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: qemu-devel, eblake, armbru, fam, stefanha, kwolf, den, nsoffer

01.09.2020 18:07, Max Reitz wrote:
> On 27.08.20 23:08, Vladimir Sementsov-Ogievskiy wrote:
>> 21.08.2020 17:11, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> Here is a filter, which does preallocation on write.
>>>
>>> In Virtuozzo we have to deal with some custom distributed storage
>>> solution, where allocation is relatively expensive operation. We have to
>>> workaround it in Qemu, so here is a new filter.
>>
>> I have a problem now with this thing.
>>
>> We need preallocation. But we don't want to explicitly specify it in all
>> the management tools.
> 
> Why?
> 
>> So it should be inserted by default.
> 
> Why?  You mean without any option?  That seems...  Interesting?
> 
> (Also like a recipe for reports of performance regression in some cases.)
> 
>> It's OK for
>> us to keep this default different from upstream... But there are
>> problems with the implicitly inserted filter (actually iotests fail and
>> I failed to fix them)
> 
> I would suspect even if the iotests passed we would end up with a heap
> of problems that we would only notice at some later point.  I thought
> you too weren’t too fond of the idea of implicit filters.
> 
>> 1. I have to set bs->inherits_from for filter and it's child by hand
>> after bdrv_replace_node(), otherwise bdrv_check_perm doesn't work.
>>
>> 2. I have to set filter_bs->implicit and teach bdrv_refresh_filename()
>> to ignore implicit filters when it checks for drv->bdrv_file_open, to
>> avoid appearing of json in backing file names
>>
>> 3. And the real design problem, which seems impossible to fix: reopen is
>> broken, just because user is not prepared to the fact that file child is
>> a filter, not a file node and has another options, and don't support
>> options of file-posix.
> 
> Well, what should I say.  I feel like we have made efforts in the past
> years to make the block graph fully visible to users and yield the
> responsibility of managing it to the users, too, so I’m not surprised if
> a step backwards breaks that.
> 
>> And seems all it (and mostly [3]) shows that implicitly inserting the
>> filter is near to be impossible..
>>
>> So, what are possible solutions?
>>
>> In virtuozzo7 we have preallocation feature done inside qcow2 driver.
>> This is very uncomfortable: we should to handle each possible over-EOF
>> write to underlying node (to keep data_end in sync to be able to shrink
>> preallocation on close()).. I don't like this way and don't want to port
>> it..
>>
>> Another option is implementing preallocation inside file-posix driver.
>> Then, instead of BDRV_REQ_NO_WAIT flag I'll need to extend serialising
>> requests API (bdrv_make_request_serialising() is already used in
>> file-posix.c) to dupport no-wait behavior + expanding the serialising
>> request bounds. This option seems feasible, so I'll try this way if no
>> other ideas.
> 
> Possible, but you haven’t yet explained what the problem with the
> management layer inserting the preallocation filter is.
> 
>> Filter is obviously the true way: we use generic block layer for native
>> request serialising, don't need to catch every write in qcow2 driver,
>> don't need to modify any other driver and get a universal thing. But how
>> to insert it implicitly (or at least automatically in some cases) and
>> avoid all the problems?
> 
> I don’t understand why inserting it implicitly is important.
> 

You are right. Thanks for strong point of view, this makes me to revise my own. Now I'm working on v6.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-09-17  9:02 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 01/10] block: simplify comment to BDRV_REQ_SERIALISING Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 02/10] block/io.c: drop assertion on double waiting for request serialisation Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 03/10] block/io: split out bdrv_find_conflicting_request Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 04/10] block/io: bdrv_wait_serialising_requests_locked: drop extra bs arg Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 05/10] block: bdrv_mark_request_serialising: split non-waiting function Vladimir Sementsov-Ogievskiy
2020-08-25 12:43   ` Max Reitz
2020-08-21 14:11 ` [PATCH v5 06/10] block: introduce BDRV_REQ_NO_WAIT flag Vladimir Sementsov-Ogievskiy
2020-08-25 13:10   ` Max Reitz
2020-08-26  6:26     ` Vladimir Sementsov-Ogievskiy
2020-08-26  8:36       ` Max Reitz
2020-08-26  8:57         ` Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 07/10] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
2020-08-24 17:52   ` Vladimir Sementsov-Ogievskiy
2020-08-26  7:06     ` Vladimir Sementsov-Ogievskiy
2020-08-25 15:11   ` Max Reitz
2020-08-26  6:49     ` Vladimir Sementsov-Ogievskiy
2020-08-26  8:52       ` Max Reitz
2020-08-26  9:15         ` Vladimir Sementsov-Ogievskiy
2020-08-26  9:58           ` Max Reitz
2020-08-26 11:29             ` Vladimir Sementsov-Ogievskiy
2020-08-26 13:51     ` David Edmondson
2020-08-27  9:19       ` Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 08/10] iotests.py: add verify_o_direct helper Vladimir Sementsov-Ogievskiy
2020-08-21 17:29   ` Nir Soffer
2020-08-21 14:11 ` [PATCH v5 09/10] iotests.py: add filter_img_check Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 10/10] iotests: add 298 to test new preallocate filter driver Vladimir Sementsov-Ogievskiy
2020-08-27 21:08 ` [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
2020-09-01 15:07   ` Max Reitz
2020-09-17  9:00     ` Vladimir Sementsov-Ogievskiy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.