All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] preallocate filter
@ 2020-06-20 14:36 Vladimir Sementsov-Ogievskiy
  2020-06-20 14:36 ` [PATCH 1/5] block/io: introduce bdrv_try_mark_request_serialising Vladimir Sementsov-Ogievskiy
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-20 14:36 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, Anton.Nefedov, armbru, qemu-devel,
	mreitz, stefanha, den

Hi all!

Here is a filter, which does preallocation on write.

In Virtuozzo we have to deal with some custom distributed storage solution,
where allocation is relatively expensive operation. We have to workaround it
in Qemu, so here is a new filter.

Performance results with the following test are very significant:

    IMG=/megassd/z; FILE_OPTS=file.filename=$IMG; COUNT=15000; CHUNK=64K; \
    CLUSTER=1M; rm -f $IMG; \
    ./qemu-img create -f qcow2 -o cluster_size=$CLUSTER $IMG 1G; \
    ./qemu-img bench -c $COUNT -d 1 -s $CHUNK -w -t none --image-opts driver=qcow2,$FILE_OPTS;

 - qemu-img bench reports ~44.3s, and if I switch FILE_OPTS to be

    FILE_OPTS=file.driver=preallocate,file.file.filename=$IMG

 qemu-img bench reports ~4.3s, 90% benefit!

(/megassd is a mount point of distributed storage)

Unfortunately, I don't know another real-world case where preallocation
would so helpful. If someone have running ceph instance and can check
it I'd be very grateful.

Still, I hope, it may be useful at least for some testing purposes.
And, small example I have: small writes with small sectors on usual ext4 over ssd gives
the following for me:

    IMG=/ssd/z; FILE_OPTS=file.filename=$IMG; COUNT=15000; CHUNK=512; \
    CLUSTER=512; rm -f $IMG; \
    ./qemu-img create -f qcow2 -o cluster_size=$CLUSTER $IMG 1G; \
    ./qemu-img bench -c $COUNT -d 1 -s $CHUNK -w -t none --image-opts driver=qcow2,$FILE_OPTS;

 ~17.1s

 and, if I switch FILE_OPTS to use new filter, the result is ~14.9s, i.e ~13% better.

=====

The series also introduces bdrv-lock-region interface, which may be reused to
implement copy-on-read operation directly inside copy-on-read filter, instead of
handling the special flag in generic code.

Vladimir Sementsov-Ogievskiy (5):
  block/io: introduce bdrv_try_mark_request_serialising
  block/io: introduce bdrv_co_range_try_lock
  block: introduce preallocate filter
  iotests: QemuIoInteractive: use qemu_io_args_no_fmt
  iotests: add 298 to test new preallocate filter driver

 qapi/block-core.json          |   3 +-
 include/block/block.h         |   9 ++
 include/block/block_int.h     |  16 +++
 include/qemu/typedefs.h       |   1 +
 block/io.c                    | 142 ++++++++++++++++---
 block/preallocate.c           | 255 ++++++++++++++++++++++++++++++++++
 block/Makefile.objs           |   1 +
 tests/qemu-iotests/298        |  45 ++++++
 tests/qemu-iotests/298.out    |   5 +
 tests/qemu-iotests/group      |   1 +
 tests/qemu-iotests/iotests.py |   2 +-
 11 files changed, 455 insertions(+), 25 deletions(-)
 create mode 100644 block/preallocate.c
 create mode 100644 tests/qemu-iotests/298
 create mode 100644 tests/qemu-iotests/298.out

-- 
2.18.0



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/5] block/io: introduce bdrv_try_mark_request_serialising
  2020-06-20 14:36 [PATCH 0/5] preallocate filter Vladimir Sementsov-Ogievskiy
@ 2020-06-20 14:36 ` Vladimir Sementsov-Ogievskiy
  2020-07-07 15:56   ` Stefan Hajnoczi
  2020-06-20 14:36 ` [PATCH 2/5] block/io: introduce bdrv_co_range_try_lock Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-20 14:36 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, Anton.Nefedov, armbru, qemu-devel,
	mreitz, stefanha, den

Introduce a function to mark the request serialising only if there are
no conflicting request to wait for.

The function is static, so mark it unused. The attribute is to be
dropped in the next commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 51 insertions(+), 7 deletions(-)

diff --git a/block/io.c b/block/io.c
index df8f2a98d4..60cee1d759 100644
--- a/block/io.c
+++ b/block/io.c
@@ -727,10 +727,11 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
     return true;
 }
 
-static bool coroutine_fn
-bdrv_wait_serialising_requests_locked(BlockDriverState *bs,
-                                      BdrvTrackedRequest *self)
+/* Called with self->bs->reqs_lock locked */
+static bool coroutine_fn wait_or_find_conflicts(BdrvTrackedRequest *self,
+                                                bool wait)
 {
+    BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
     bool retry;
     bool waited = false;
@@ -754,6 +755,9 @@ bdrv_wait_serialising_requests_locked(BlockDriverState *bs,
                  * will wait for us as soon as it wakes up, then just go on
                  * (instead of producing a deadlock in the former case). */
                 if (!req->waiting_for) {
+                    if (!wait) {
+                        return true;
+                    }
                     self->waiting_for = req;
                     qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
                     self->waiting_for = NULL;
@@ -767,13 +771,18 @@ bdrv_wait_serialising_requests_locked(BlockDriverState *bs,
     return waited;
 }
 
-bool bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
+/* Return true on success, false on fail. Always success with blocking = true */
+static bool bdrv_do_mark_request_serialising(BdrvTrackedRequest *req,
+        uint64_t align, bool blocking, bool *waited)
 {
     BlockDriverState *bs = req->bs;
     int64_t overlap_offset = req->offset & ~(align - 1);
     uint64_t overlap_bytes = ROUND_UP(req->offset + req->bytes, align)
                                - overlap_offset;
-    bool waited;
+    int64_t old_overlap_offset = req->overlap_offset;
+    uint64_t old_overlap_bytes = req->overlap_bytes;
+    bool old_serializing = req->serialising;
+    bool found;
 
     qemu_co_mutex_lock(&bs->reqs_lock);
     if (!req->serialising) {
@@ -783,11 +792,46 @@ bool bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
 
     req->overlap_offset = MIN(req->overlap_offset, overlap_offset);
     req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
-    waited = bdrv_wait_serialising_requests_locked(bs, req);
+    found = wait_or_find_conflicts(req, blocking);
+
+    if (!blocking && found) {
+        req->overlap_offset = old_overlap_offset;
+        req->overlap_bytes = old_overlap_bytes;
+        if (!old_serializing) {
+            atomic_dec(&req->bs->serialising_in_flight);
+            req->serialising = false;
+        }
+    }
+
+    *waited = found && blocking;
+
     qemu_co_mutex_unlock(&bs->reqs_lock);
+
+    return blocking || !found;
+}
+
+/* Return true if had to wait for conflicts */
+bool bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
+{
+    bool waited;
+    bool success = bdrv_do_mark_request_serialising(req, align, true, &waited);
+
+    assert(success);
     return waited;
 }
 
+/* Return true on success, false if there are some conflicts */
+__attribute__ ((unused))
+static bool bdrv_try_mark_request_serialising(BdrvTrackedRequest *req,
+                                              uint64_t align)
+{
+    bool waited;
+    bool success = bdrv_do_mark_request_serialising(req, align, false, &waited);
+
+    assert(!waited);
+    return success;
+}
+
 /**
  * Return the tracked request on @bs for the current coroutine, or
  * NULL if there is none.
@@ -865,7 +909,7 @@ static bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self
     }
 
     qemu_co_mutex_lock(&bs->reqs_lock);
-    waited = bdrv_wait_serialising_requests_locked(bs, self);
+    waited = wait_or_find_conflicts(self, true);
     qemu_co_mutex_unlock(&bs->reqs_lock);
 
     return waited;
-- 
2.18.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/5] block/io: introduce bdrv_co_range_try_lock
  2020-06-20 14:36 [PATCH 0/5] preallocate filter Vladimir Sementsov-Ogievskiy
  2020-06-20 14:36 ` [PATCH 1/5] block/io: introduce bdrv_try_mark_request_serialising Vladimir Sementsov-Ogievskiy
@ 2020-06-20 14:36 ` Vladimir Sementsov-Ogievskiy
  2020-07-07 16:10   ` Stefan Hajnoczi
  2020-06-20 14:36 ` [PATCH 3/5] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-20 14:36 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, Anton.Nefedov, armbru, qemu-devel,
	mreitz, stefanha, den

Introduce an interface to make a "critical section" on top of
serialising requests feature. This is needed to avoid intersecting with
other requests during a set of operations. Will be used in a next
commit to implement preallocate filter.

To keep assertions during intersecting requests handling, we need to
add _locked versions of read/write requests, to be used inside locked
section. The following commit will need only locked version of
write-zeroes, so add only it.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block.h     |  9 ++++
 include/block/block_int.h | 16 ++++++++
 include/qemu/typedefs.h   |  1 +
 block/io.c                | 86 +++++++++++++++++++++++++++++++--------
 4 files changed, 95 insertions(+), 17 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 25e299605e..c4b2a493b4 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -391,6 +391,15 @@ int bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
  */
 int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
                                        int bytes, BdrvRequestFlags flags);
+
+/*
+ * Version of bdrv_co_pwrite_zeroes to be used inside bdrv_co_range_try_lock()
+ * .. bdrv_co_range_unlock() pair.
+ */
+int coroutine_fn bdrv_co_pwrite_zeroes_locked(BdrvChild *child,
+    int64_t offset, int bytes, BdrvRequestFlags flags,
+    BdrvTrackedRequest *lock);
+
 BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
     const char *backing_file);
 void bdrv_refresh_filename(BlockDriverState *bs);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 791de6a59c..9ec56f1825 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -66,6 +66,7 @@ enum BdrvTrackedRequestType {
     BDRV_TRACKED_WRITE,
     BDRV_TRACKED_DISCARD,
     BDRV_TRACKED_TRUNCATE,
+    BDRV_TRACKED_LOCK,
 };
 
 typedef struct BdrvTrackedRequest {
@@ -83,6 +84,12 @@ typedef struct BdrvTrackedRequest {
     CoQueue wait_queue; /* coroutines blocked on this request */
 
     struct BdrvTrackedRequest *waiting_for;
+
+    /*
+     * If non-zero, the request is under lock, so it's allowed to intersect
+     * (actully it must be inside) the @lock request.
+     */
+    struct BdrvTrackedRequest *lock;
 } BdrvTrackedRequest;
 
 struct BlockDriver {
@@ -994,6 +1001,15 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
     int64_t offset, unsigned int bytes,
     QEMUIOVector *qiov, size_t qiov_offset, BdrvRequestFlags flags);
 
+/*
+ * Creates serializing BDRV_TRACKED_LOCK request if no conflicts, on success
+ * return pointer to it, to be used in bdrv_co_range_unlock().
+ */
+BdrvTrackedRequest *coroutine_fn bdrv_co_range_try_lock(BlockDriverState *bs,
+                                                        int64_t offset,
+                                                        int64_t bytes);
+void coroutine_fn bdrv_co_range_unlock(BdrvTrackedRequest *req);
+
 static inline int coroutine_fn bdrv_co_pread(BdrvChild *child,
     int64_t offset, unsigned int bytes, void *buf, BdrvRequestFlags flags)
 {
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index ecf3cde26c..e2bbe3af96 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -28,6 +28,7 @@ typedef struct Aml Aml;
 typedef struct AnnounceTimer AnnounceTimer;
 typedef struct BdrvDirtyBitmap BdrvDirtyBitmap;
 typedef struct BdrvDirtyBitmapIter BdrvDirtyBitmapIter;
+typedef struct BdrvTrackedRequest BdrvTrackedRequest;
 typedef struct BlockBackend BlockBackend;
 typedef struct BlockBackendRootState BlockBackendRootState;
 typedef struct BlockDriverState BlockDriverState;
diff --git a/block/io.c b/block/io.c
index 60cee1d759..0de72cdb4c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -691,7 +691,8 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
                                   BlockDriverState *bs,
                                   int64_t offset,
                                   uint64_t bytes,
-                                  enum BdrvTrackedRequestType type)
+                                  enum BdrvTrackedRequestType type,
+                                  BdrvTrackedRequest *lock)
 {
     assert(bytes <= INT64_MAX && offset <= INT64_MAX - bytes);
 
@@ -704,6 +705,7 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
         .serialising    = false,
         .overlap_offset = offset,
         .overlap_bytes  = bytes,
+        .lock           = lock,
     };
 
     qemu_co_queue_init(&req->wait_queue);
@@ -745,15 +747,26 @@ static bool coroutine_fn wait_or_find_conflicts(BdrvTrackedRequest *self,
             if (tracked_request_overlaps(req, self->overlap_offset,
                                          self->overlap_bytes))
             {
-                /* Hitting this means there was a reentrant request, for
-                 * example, a block driver issuing nested requests.  This must
-                 * never happen since it means deadlock.
+                if (self->lock == req) {
+                    /* This is atomic request under range_lock */
+                    assert(req->type == BDRV_TRACKED_LOCK);
+                    assert(self->offset >= req->offset);
+                    assert(self->bytes <= req->bytes);
+                    continue;
+                }
+                /*
+                 * Hitting this means there was a reentrant request (except for
+                 * range_lock, handled above), for example, a block driver
+                 * issuing nested requests.  This must never happen since it
+                 * means deadlock.
                  */
                 assert(qemu_coroutine_self() != req->co);
 
-                /* If the request is already (indirectly) waiting for us, or
+                /*
+                 * If the request is already (indirectly) waiting for us, or
                  * will wait for us as soon as it wakes up, then just go on
-                 * (instead of producing a deadlock in the former case). */
+                 * (instead of producing a deadlock in the former case).
+                 */
                 if (!req->waiting_for) {
                     if (!wait) {
                         return true;
@@ -821,7 +834,6 @@ bool bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
 }
 
 /* Return true on success, false if there are some conflicts */
-__attribute__ ((unused))
 static bool bdrv_try_mark_request_serialising(BdrvTrackedRequest *req,
                                               uint64_t align)
 {
@@ -1796,7 +1808,7 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
 
     bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad);
 
-    tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_READ);
+    tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_READ, NULL);
     ret = bdrv_aligned_preadv(child, &req, offset, bytes,
                               bs->bl.request_alignment,
                               qiov, qiov_offset, flags);
@@ -2169,9 +2181,9 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
     return bdrv_co_pwritev_part(child, offset, bytes, qiov, 0, flags);
 }
 
-int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
+static int coroutine_fn bdrv_co_pwritev_part_locked(BdrvChild *child,
     int64_t offset, unsigned int bytes, QEMUIOVector *qiov, size_t qiov_offset,
-    BdrvRequestFlags flags)
+    BdrvRequestFlags flags, BdrvTrackedRequest *lock)
 {
     BlockDriverState *bs = child->bs;
     BdrvTrackedRequest req;
@@ -2215,7 +2227,7 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
      * Pad qiov with the read parts and be sure to have a tracked request not
      * only for bdrv_aligned_pwritev, but also for the reads of the RMW cycle.
      */
-    tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_WRITE);
+    tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_WRITE, lock);
 
     if (flags & BDRV_REQ_ZERO_WRITE) {
         ret = bdrv_co_do_zero_pwritev(child, offset, bytes, flags, &req);
@@ -2239,8 +2251,23 @@ out:
     return ret;
 }
 
+int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, size_t qiov_offset,
+    BdrvRequestFlags flags)
+{
+    return bdrv_co_pwritev_part_locked(child, offset, bytes, qiov, qiov_offset,
+                                       flags, NULL);
+}
+
 int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
                                        int bytes, BdrvRequestFlags flags)
+{
+    return bdrv_co_pwrite_zeroes_locked(child, offset, bytes, flags, NULL);
+}
+
+int coroutine_fn bdrv_co_pwrite_zeroes_locked(BdrvChild *child, int64_t offset,
+                                              int bytes, BdrvRequestFlags flags,
+                                              BdrvTrackedRequest *lock)
 {
     trace_bdrv_co_pwrite_zeroes(child->bs, offset, bytes, flags);
 
@@ -2248,8 +2275,8 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
         flags &= ~BDRV_REQ_MAY_UNMAP;
     }
 
-    return bdrv_co_pwritev(child, offset, bytes, NULL,
-                           BDRV_REQ_ZERO_WRITE | flags);
+    return bdrv_co_pwritev_part_locked(child, offset, bytes, NULL, 0,
+                                       BDRV_REQ_ZERO_WRITE | flags, lock);
 }
 
 /*
@@ -2980,7 +3007,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
     tail = (offset + bytes) % align;
 
     bdrv_inc_in_flight(bs);
-    tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_DISCARD);
+    tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_DISCARD, NULL);
 
     ret = bdrv_co_write_req_prepare(child, offset, bytes, &req, 0);
     if (ret < 0) {
@@ -3252,7 +3279,7 @@ static int coroutine_fn bdrv_co_copy_range_internal(
     if (recurse_src) {
         bdrv_inc_in_flight(src->bs);
         tracked_request_begin(&req, src->bs, src_offset, bytes,
-                              BDRV_TRACKED_READ);
+                              BDRV_TRACKED_READ, NULL);
 
         /* BDRV_REQ_SERIALISING is only for write operation */
         assert(!(read_flags & BDRV_REQ_SERIALISING));
@@ -3269,7 +3296,7 @@ static int coroutine_fn bdrv_co_copy_range_internal(
     } else {
         bdrv_inc_in_flight(dst->bs);
         tracked_request_begin(&req, dst->bs, dst_offset, bytes,
-                              BDRV_TRACKED_WRITE);
+                              BDRV_TRACKED_WRITE, NULL);
         ret = bdrv_co_write_req_prepare(dst, dst_offset, bytes, &req,
                                         write_flags);
         if (!ret) {
@@ -3381,7 +3408,7 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
 
     bdrv_inc_in_flight(bs);
     tracked_request_begin(&req, bs, offset - new_bytes, new_bytes,
-                          BDRV_TRACKED_TRUNCATE);
+                          BDRV_TRACKED_TRUNCATE, NULL);
 
     /* If we are growing the image and potentially using preallocation for the
      * new area, we need to make sure that no write requests are made to it
@@ -3494,3 +3521,28 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
 
     return bdrv_run_co(child->bs, bdrv_truncate_co_entry, &tco);
 }
+
+BdrvTrackedRequest *coroutine_fn bdrv_co_range_try_lock(BlockDriverState *bs,
+                                                        int64_t offset,
+                                                        int64_t bytes)
+{
+    BdrvTrackedRequest *req = g_new(BdrvTrackedRequest, 1);
+    bool success;
+
+    tracked_request_begin(req, bs, offset, bytes, BDRV_TRACKED_LOCK, NULL);
+    success = bdrv_try_mark_request_serialising(req, bdrv_get_cluster_size(bs));
+    if (!success) {
+        g_free(req);
+        req = NULL;
+    }
+
+    return req;
+}
+
+void coroutine_fn bdrv_co_range_unlock(BdrvTrackedRequest *req)
+{
+    assert(req->type == BDRV_TRACKED_LOCK);
+
+    tracked_request_end(req);
+    g_free(req);
+}
-- 
2.18.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/5] block: introduce preallocate filter
  2020-06-20 14:36 [PATCH 0/5] preallocate filter Vladimir Sementsov-Ogievskiy
  2020-06-20 14:36 ` [PATCH 1/5] block/io: introduce bdrv_try_mark_request_serialising Vladimir Sementsov-Ogievskiy
  2020-06-20 14:36 ` [PATCH 2/5] block/io: introduce bdrv_co_range_try_lock Vladimir Sementsov-Ogievskiy
@ 2020-06-20 14:36 ` Vladimir Sementsov-Ogievskiy
  2020-07-08 12:07   ` Stefan Hajnoczi
  2020-06-20 14:36 ` [PATCH 4/5] iotests: QemuIoInteractive: use qemu_io_args_no_fmt Vladimir Sementsov-Ogievskiy
  2020-06-20 14:36 ` [PATCH 5/5] iotests: add 298 to test new preallocate filter driver Vladimir Sementsov-Ogievskiy
  4 siblings, 1 reply; 17+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-20 14:36 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, Anton.Nefedov, armbru, qemu-devel,
	mreitz, stefanha, den

It may be used for file-systems with slow allocation.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 qapi/block-core.json |   3 +-
 block/preallocate.c  | 255 +++++++++++++++++++++++++++++++++++++++++++
 block/Makefile.objs  |   1 +
 3 files changed, 258 insertions(+), 1 deletion(-)
 create mode 100644 block/preallocate.c

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 0e1c6a59f2..a0bda399d6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2805,7 +2805,7 @@
             'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
             'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
-            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
+            'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
             { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
             'sheepdog',
             'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
@@ -3995,6 +3995,7 @@
       'null-co':    'BlockdevOptionsNull',
       'nvme':       'BlockdevOptionsNVMe',
       'parallels':  'BlockdevOptionsGenericFormat',
+      'preallocate':'BlockdevOptionsGenericFormat',
       'qcow2':      'BlockdevOptionsQcow2',
       'qcow':       'BlockdevOptionsQcow',
       'qed':        'BlockdevOptionsGenericCOWFormat',
diff --git a/block/preallocate.c b/block/preallocate.c
new file mode 100644
index 0000000000..c272a6e41d
--- /dev/null
+++ b/block/preallocate.c
@@ -0,0 +1,255 @@
+/*
+ * preallocate filter driver
+ *
+ * The driver performs preallocate operation: it is injected above
+ * some node, and before each write over EOF it does additional preallocating
+ * write-zeroes request.
+ *
+ * Copyright (c) 2020 Virtuozzo International GmbH.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qemu/module.h"
+#include "qemu/units.h"
+#include "block/block_int.h"
+
+
+typedef struct BDRVPreallocateState {
+    int64_t prealloc_size;
+    int64_t prealloc_align;
+
+    /*
+     * Track real data end, to crop preallocation on close  data_end may be
+     * negative, which means that actual status is unknown (nothing cropped in
+     * this case)
+     */
+    int64_t data_end;
+} BDRVPreallocateState;
+
+
+static int preallocate_open(BlockDriverState *bs, QDict *options, int flags,
+                            Error **errp)
+{
+    BDRVPreallocateState *s = bs->opaque;
+
+    /*
+     * Parameters are hardcoded now. May need to add corresponding options in
+     * future.
+     */
+    s->prealloc_align = 1 * MiB;
+    s->prealloc_size = 128 * MiB;
+
+    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
+                               BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
+                               false, errp);
+    if (!bs->file) {
+        return -EINVAL;
+    }
+
+    s->data_end = bdrv_getlength(bs->file->bs);
+    if (s->data_end < 0) {
+        return s->data_end;
+    }
+
+    bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
+        (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
+
+    bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
+        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
+            bs->file->bs->supported_zero_flags);
+
+    return 0;
+}
+
+static void preallocate_close(BlockDriverState *bs)
+{
+    BDRVPreallocateState *s = bs->opaque;
+
+    if (s->data_end >= 0 && bdrv_getlength(bs->file->bs) > s->data_end) {
+        bdrv_truncate(bs->file, s->data_end, true, PREALLOC_MODE_OFF, 0, NULL);
+    }
+}
+
+static void preallocate_child_perm(BlockDriverState *bs, BdrvChild *c,
+                                   BdrvChildRole role,
+                                   BlockReopenQueue *reopen_queue,
+                                   uint64_t perm, uint64_t shared,
+                                   uint64_t *nperm, uint64_t *nshared)
+{
+    bdrv_default_perms(bs, c, role, reopen_queue, perm, shared, nperm, nshared);
+
+    /* Force RESIZE permission, to be able to crop file on close() */
+    *nperm |= BLK_PERM_RESIZE;
+}
+
+static coroutine_fn int preallocate_co_preadv_part(
+        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+        QEMUIOVector *qiov, size_t qiov_offset, int flags)
+{
+    return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
+                               flags);
+}
+
+static int coroutine_fn preallocate_co_pdiscard(BlockDriverState *bs,
+                                               int64_t offset, int bytes)
+{
+    return bdrv_co_pdiscard(bs->file, offset, bytes);
+}
+
+static bool coroutine_fn do_preallocate(BlockDriverState *bs, int64_t offset,
+                                       int64_t bytes, bool write_zero)
+{
+    BDRVPreallocateState *s = bs->opaque;
+    int64_t len, start, end;
+    BdrvTrackedRequest *lock;
+    int ret;
+
+    if (s->data_end >= 0) {
+        s->data_end = MAX(s->data_end,
+                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
+    }
+
+    len = bdrv_getlength(bs->file->bs);
+    if (len < 0) {
+        return false;
+    }
+
+    if (s->data_end < 0) {
+        s->data_end = MAX(len,
+                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
+    }
+
+    if (offset + bytes <= len) {
+        return false;
+    }
+
+    lock = bdrv_co_range_try_lock(bs->file->bs, len, INT64_MAX - len);
+    if (!lock) {
+        /* There are already preallocating requests in-fligth */
+        return false;
+    }
+
+    /* Length should not have changed */
+    assert(len == bdrv_getlength(bs->file->bs));
+
+    start = write_zero ? MIN(offset, len) : len;
+    end = QEMU_ALIGN_UP(offset + bytes + s->prealloc_size, s->prealloc_align);
+
+    ret = bdrv_co_pwrite_zeroes_locked(bs->file, start, end - start,
+                                       BDRV_REQ_NO_FALLBACK, lock);
+
+    bdrv_co_range_unlock(lock);
+
+    return !ret;
+}
+
+static int coroutine_fn preallocate_co_pwrite_zeroes(BlockDriverState *bs,
+        int64_t offset, int bytes, BdrvRequestFlags flags)
+{
+    if (do_preallocate(bs, offset, bytes, true)) {
+        return 0;
+    }
+
+    return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
+}
+
+static coroutine_fn int preallocate_co_pwritev_part(BlockDriverState *bs,
+                                                    uint64_t offset,
+                                                    uint64_t bytes,
+                                                    QEMUIOVector *qiov,
+                                                    size_t qiov_offset,
+                                                    int flags)
+{
+    do_preallocate(bs, offset, bytes, false);
+
+    return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
+                                flags);
+}
+
+static int coroutine_fn
+preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
+                        bool exact, PreallocMode prealloc,
+                        BdrvRequestFlags flags, Error **errp)
+{
+    BDRVPreallocateState *s = bs->opaque;
+    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc, flags, errp);
+
+    /* s->data_end may become negative here, which means unknown data end */
+    s->data_end = bdrv_getlength(bs->file->bs);
+
+    return ret;
+}
+
+static int coroutine_fn preallocate_co_flush(BlockDriverState *bs)
+{
+    if (!bs->file) {
+        return 0;
+    }
+
+    return bdrv_co_flush(bs->file->bs);
+}
+
+static int64_t preallocate_getlength(BlockDriverState *bs)
+{
+    /*
+     * We probably can return s->data_end here, but seems safer to return real
+     * file length, not trying to hide the preallocation.
+     *
+     * Still, don't miss the chance to restore s->data_end if it is broken.
+     */
+    BDRVPreallocateState *s = bs->opaque;
+    int64_t ret = bdrv_getlength(bs->file->bs);
+
+    if (s->data_end < 0) {
+        s->data_end = ret;
+    }
+
+    return ret;
+}
+
+BlockDriver bdrv_preallocate_filter = {
+    .format_name = "preallocate",
+    .instance_size = sizeof(BDRVPreallocateState),
+
+    .bdrv_getlength = preallocate_getlength,
+    .bdrv_open = preallocate_open,
+    .bdrv_close = preallocate_close,
+
+    .bdrv_co_preadv_part = preallocate_co_preadv_part,
+    .bdrv_co_pwritev_part = preallocate_co_pwritev_part,
+    .bdrv_co_pwrite_zeroes = preallocate_co_pwrite_zeroes,
+    .bdrv_co_pdiscard = preallocate_co_pdiscard,
+    .bdrv_co_flush = preallocate_co_flush,
+    .bdrv_co_truncate = preallocate_co_truncate,
+
+    .bdrv_co_block_status = bdrv_co_block_status_from_file,
+
+    .bdrv_child_perm = preallocate_child_perm,
+
+    .has_variable_length = true,
+    .is_filter = true,
+};
+
+static void bdrv_preallocate_init(void)
+{
+    bdrv_register(&bdrv_preallocate_filter);
+}
+
+block_init(bdrv_preallocate_init);
diff --git a/block/Makefile.objs b/block/Makefile.objs
index 3635b6b4c1..f46a353a35 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -45,6 +45,7 @@ block-obj-y += crypto.o
 block-obj-y += aio_task.o
 block-obj-y += backup-top.o
 block-obj-y += filter-compress.o
+block-obj-y += preallocate.o
 common-obj-y += monitor/
 
 block-obj-y += stream.o
-- 
2.18.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/5] iotests: QemuIoInteractive: use qemu_io_args_no_fmt
  2020-06-20 14:36 [PATCH 0/5] preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2020-06-20 14:36 ` [PATCH 3/5] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
@ 2020-06-20 14:36 ` Vladimir Sementsov-Ogievskiy
  2020-07-08 12:08   ` Stefan Hajnoczi
  2020-06-20 14:36 ` [PATCH 5/5] iotests: add 298 to test new preallocate filter driver Vladimir Sementsov-Ogievskiy
  4 siblings, 1 reply; 17+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-20 14:36 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, Anton.Nefedov, armbru, qemu-devel,
	mreitz, stefanha, den

All users of QemuIoInteractive provides -f argument, so it's incorrect
to use qemu_io_args, which contains -f too. Let's use
qemu_io_args_no_fmt, which also makes possible to use --image-opts with
QemuIoInteractive in the following patch.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 tests/qemu-iotests/iotests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index f20d90f969..a8f71df4eb 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -211,7 +211,7 @@ def get_virtio_scsi_device():
 
 class QemuIoInteractive:
     def __init__(self, *args):
-        self.args = qemu_io_args + list(args)
+        self.args = qemu_io_args_no_fmt + list(args)
         self._p = subprocess.Popen(self.args, stdin=subprocess.PIPE,
                                    stdout=subprocess.PIPE,
                                    stderr=subprocess.STDOUT,
-- 
2.18.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/5] iotests: add 298 to test new preallocate filter driver
  2020-06-20 14:36 [PATCH 0/5] preallocate filter Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2020-06-20 14:36 ` [PATCH 4/5] iotests: QemuIoInteractive: use qemu_io_args_no_fmt Vladimir Sementsov-Ogievskiy
@ 2020-06-20 14:36 ` Vladimir Sementsov-Ogievskiy
  2020-07-08 12:09   ` Stefan Hajnoczi
  4 siblings, 1 reply; 17+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-20 14:36 UTC (permalink / raw)
  To: qemu-block
  Cc: fam, kwolf, vsementsov, Anton.Nefedov, armbru, qemu-devel,
	mreitz, stefanha, den

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 tests/qemu-iotests/298     | 45 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/298.out |  5 +++++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 51 insertions(+)
 create mode 100644 tests/qemu-iotests/298
 create mode 100644 tests/qemu-iotests/298.out

diff --git a/tests/qemu-iotests/298 b/tests/qemu-iotests/298
new file mode 100644
index 0000000000..a6af14fc55
--- /dev/null
+++ b/tests/qemu-iotests/298
@@ -0,0 +1,45 @@
+#!/usr/bin/env python3
+#
+# Test for preallocate filter
+#
+# Copyright (c) 2020 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import log
+
+iotests.script_initialize(supported_fmts=['qcow2'])
+
+size = 10 * 1024 * 1024
+disk = iotests.file_path('disk')
+
+iotests.qemu_img_create('-f', iotests.imgfmt, disk, str(size))
+
+opts = f'driver={iotests.imgfmt},' \
+    f'file.driver=preallocate,file.file.filename={disk}'
+p = iotests.QemuIoInteractive('--image-opts', '-t', 'none', opts)
+
+log(p.cmd('write 0 1M'), filters=[iotests.filter_qemu_io])
+p.cmd('flush')
+
+if os.path.getsize(disk) > 100 * 1024 * 1024:
+    log('file in progress is big, preallocation works')
+
+p.close()
+
+if os.path.getsize(disk) < 10 * 1024 * 1024:
+    log('file is small after close')
diff --git a/tests/qemu-iotests/298.out b/tests/qemu-iotests/298.out
new file mode 100644
index 0000000000..6c34d172a4
--- /dev/null
+++ b/tests/qemu-iotests/298.out
@@ -0,0 +1,5 @@
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+file in progress is big, preallocation works
+file is small after close
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index d886fa0cb3..1bfb864841 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -302,3 +302,4 @@
 291 rw quick
 292 rw auto quick
 297 meta
+298 meta
-- 
2.18.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/5] block/io: introduce bdrv_try_mark_request_serialising
  2020-06-20 14:36 ` [PATCH 1/5] block/io: introduce bdrv_try_mark_request_serialising Vladimir Sementsov-Ogievskiy
@ 2020-07-07 15:56   ` Stefan Hajnoczi
  2020-07-08 15:51     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Hajnoczi @ 2020-07-07 15:56 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

[-- Attachment #1: Type: text/plain, Size: 1505 bytes --]

On Sat, Jun 20, 2020 at 05:36:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Introduce a function to mark the request serialising only if there are
> no conflicting request to wait for.
> 
> The function is static, so mark it unused. The attribute is to be
> dropped in the next commit.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/io.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 51 insertions(+), 7 deletions(-)

I found this patch difficult to understand because there are multiple
levels of functions passing flags to ultimiately do different things in
a common function.

Here are some ideas if you have time to rework this patch:

1. Introduce a bdrv_find_overlapping_request() function that does most
   of bdrv_wait_serialising_requests_locked() but does not wait. Then
   bdrv_wait_serialising_requests_locked() can call that function in a
   loop and wait if an overlapping request is found.

2. Pass overlap_offset/overlap_bytes arguments to
   bdrv_find_overlapping_request() instead of changing and restoring the
   value in bdrv_do_mark_request_serialising().

3. Use consistent names for flags: wait/blocking, found/success

I'm not sure if all these ideas will work, but I get the feeling this
code can be refactored to make it easier to understand. Since I don't
have a concrete suggestion and the code looks correct:

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] block/io: introduce bdrv_co_range_try_lock
  2020-06-20 14:36 ` [PATCH 2/5] block/io: introduce bdrv_co_range_try_lock Vladimir Sementsov-Ogievskiy
@ 2020-07-07 16:10   ` Stefan Hajnoczi
  2020-07-08 15:56     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Hajnoczi @ 2020-07-07 16:10 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

[-- Attachment #1: Type: text/plain, Size: 1997 bytes --]

On Sat, Jun 20, 2020 at 05:36:46PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> @@ -83,6 +84,12 @@ typedef struct BdrvTrackedRequest {
>      CoQueue wait_queue; /* coroutines blocked on this request */
>  
>      struct BdrvTrackedRequest *waiting_for;
> +
> +    /*
> +     * If non-zero, the request is under lock, so it's allowed to intersect
> +     * (actully it must be inside) the @lock request.

s/actully/actually/

> @@ -745,15 +747,26 @@ static bool coroutine_fn wait_or_find_conflicts(BdrvTrackedRequest *self,
>              if (tracked_request_overlaps(req, self->overlap_offset,
>                                           self->overlap_bytes))
>              {
> -                /* Hitting this means there was a reentrant request, for
> -                 * example, a block driver issuing nested requests.  This must
> -                 * never happen since it means deadlock.
> +                if (self->lock == req) {
> +                    /* This is atomic request under range_lock */
> +                    assert(req->type == BDRV_TRACKED_LOCK);
> +                    assert(self->offset >= req->offset);
> +                    assert(self->bytes <= req->bytes);

These assertions do not catch requests that start within the locked
region but span beyond the end of the region. How about:

  assert(self->offset + self->bytes - req->offset >= req->bytes);

> +int coroutine_fn bdrv_co_pwrite_zeroes_locked(BdrvChild *child, int64_t offset,
> +                                              int bytes, BdrvRequestFlags flags,
> +                                              BdrvTrackedRequest *lock)

The name is confusing because _locked() normally means that a mutex
should be held. Functions using that naming convention already exist in
block/io.c. It would be nice to distinguish between functions that need
BdrvTrackedRequest and functions that must be called with a mutex held.

How about bdrv_co_pwrite_zeroes_with_lock()?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] block: introduce preallocate filter
  2020-06-20 14:36 ` [PATCH 3/5] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
@ 2020-07-08 12:07   ` Stefan Hajnoczi
  2020-07-08 16:17     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Hajnoczi @ 2020-07-08 12:07 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

[-- Attachment #1: Type: text/plain, Size: 11905 bytes --]

On Sat, Jun 20, 2020 at 05:36:47PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> It may be used for file-systems with slow allocation.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  qapi/block-core.json |   3 +-
>  block/preallocate.c  | 255 +++++++++++++++++++++++++++++++++++++++++++
>  block/Makefile.objs  |   1 +
>  3 files changed, 258 insertions(+), 1 deletion(-)
>  create mode 100644 block/preallocate.c

Please add documentation to docs/system/qemu-block-drivers.rst.inc
describing the purpose of this block driver and how to use it.

Since this filter grows the file I guess it's intended to be below an
image format?

> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 0e1c6a59f2..a0bda399d6 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2805,7 +2805,7 @@
>              'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
>              'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
>              'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
> -            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
> +            'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>              { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
>              'sheepdog',
>              'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
> @@ -3995,6 +3995,7 @@
>        'null-co':    'BlockdevOptionsNull',
>        'nvme':       'BlockdevOptionsNVMe',
>        'parallels':  'BlockdevOptionsGenericFormat',
> +      'preallocate':'BlockdevOptionsGenericFormat',
>        'qcow2':      'BlockdevOptionsQcow2',
>        'qcow':       'BlockdevOptionsQcow',
>        'qed':        'BlockdevOptionsGenericCOWFormat',
> diff --git a/block/preallocate.c b/block/preallocate.c
> new file mode 100644
> index 0000000000..c272a6e41d
> --- /dev/null
> +++ b/block/preallocate.c
> @@ -0,0 +1,255 @@
> +/*
> + * preallocate filter driver
> + *
> + * The driver performs preallocate operation: it is injected above
> + * some node, and before each write over EOF it does additional preallocating
> + * write-zeroes request.
> + *
> + * Copyright (c) 2020 Virtuozzo International GmbH.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qemu/module.h"
> +#include "qemu/units.h"
> +#include "block/block_int.h"
> +
> +
> +typedef struct BDRVPreallocateState {
> +    int64_t prealloc_size;
> +    int64_t prealloc_align;
> +
> +    /*
> +     * Track real data end, to crop preallocation on close  data_end may be
> +     * negative, which means that actual status is unknown (nothing cropped in
> +     * this case)
> +     */
> +    int64_t data_end;
> +} BDRVPreallocateState;
> +
> +
> +static int preallocate_open(BlockDriverState *bs, QDict *options, int flags,
> +                            Error **errp)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +
> +    /*
> +     * Parameters are hardcoded now. May need to add corresponding options in
> +     * future.
> +     */

The code for .bdrv_open() options is quick to write. If you add the
options right away then it will be much easier for users who need to
tweak them in the future.

> +    s->prealloc_align = 1 * MiB;
> +    s->prealloc_size = 128 * MiB;
> +
> +    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
> +                               BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
> +                               false, errp);
> +    if (!bs->file) {
> +        return -EINVAL;
> +    }
> +
> +    s->data_end = bdrv_getlength(bs->file->bs);
> +    if (s->data_end < 0) {
> +        return s->data_end;
> +    }
> +
> +    bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
> +        (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
> +
> +    bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
> +            bs->file->bs->supported_zero_flags);
> +
> +    return 0;
> +}
> +
> +static void preallocate_close(BlockDriverState *bs)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +
> +    if (s->data_end >= 0 && bdrv_getlength(bs->file->bs) > s->data_end) {
> +        bdrv_truncate(bs->file, s->data_end, true, PREALLOC_MODE_OFF, 0, NULL);
> +    }
> +}
> +
> +static void preallocate_child_perm(BlockDriverState *bs, BdrvChild *c,
> +                                   BdrvChildRole role,
> +                                   BlockReopenQueue *reopen_queue,
> +                                   uint64_t perm, uint64_t shared,
> +                                   uint64_t *nperm, uint64_t *nshared)
> +{
> +    bdrv_default_perms(bs, c, role, reopen_queue, perm, shared, nperm, nshared);
> +
> +    /* Force RESIZE permission, to be able to crop file on close() */
> +    *nperm |= BLK_PERM_RESIZE;
> +}
> +
> +static coroutine_fn int preallocate_co_preadv_part(
> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
> +        QEMUIOVector *qiov, size_t qiov_offset, int flags)
> +{
> +    return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
> +                               flags);
> +}
> +
> +static int coroutine_fn preallocate_co_pdiscard(BlockDriverState *bs,
> +                                               int64_t offset, int bytes)
> +{
> +    return bdrv_co_pdiscard(bs->file, offset, bytes);
> +}
> +
> +static bool coroutine_fn do_preallocate(BlockDriverState *bs, int64_t offset,
> +                                       int64_t bytes, bool write_zero)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +    int64_t len, start, end;
> +    BdrvTrackedRequest *lock;
> +    int ret;
> +
> +    if (s->data_end >= 0) {
> +        s->data_end = MAX(s->data_end,
> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
> +    }
> +
> +    len = bdrv_getlength(bs->file->bs);
> +    if (len < 0) {
> +        return false;
> +    }
> +
> +    if (s->data_end < 0) {
> +        s->data_end = MAX(len,
> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
> +    }
> +
> +    if (offset + bytes <= len) {
> +        return false;
> +    }
> +
> +    lock = bdrv_co_range_try_lock(bs->file->bs, len, INT64_MAX - len);
> +    if (!lock) {
> +        /* There are already preallocating requests in-fligth */

s/fligth/flight/

> +        return false;
> +    }
> +
> +    /* Length should not have changed */
> +    assert(len == bdrv_getlength(bs->file->bs));
> +
> +    start = write_zero ? MIN(offset, len) : len;
> +    end = QEMU_ALIGN_UP(offset + bytes + s->prealloc_size, s->prealloc_align);
> +
> +    ret = bdrv_co_pwrite_zeroes_locked(bs->file, start, end - start,
> +                                       BDRV_REQ_NO_FALLBACK, lock);
> +
> +    bdrv_co_range_unlock(lock);

Hmm...if this piece of code is the only user of bdrv_co_range_try_lock()
then a BDRV_REQ_NO_WAIT flag might be a simpler API.

I thought the lock request would be used to perform multiple operations,
but if it's just for a single operation then I think it's less code and
easier to understand without the lock request.

> +
> +    return !ret;
> +}
> +
> +static int coroutine_fn preallocate_co_pwrite_zeroes(BlockDriverState *bs,
> +        int64_t offset, int bytes, BdrvRequestFlags flags)
> +{
> +    if (do_preallocate(bs, offset, bytes, true)) {
> +        return 0;
> +    }
> +
> +    return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
> +}
> +
> +static coroutine_fn int preallocate_co_pwritev_part(BlockDriverState *bs,
> +                                                    uint64_t offset,
> +                                                    uint64_t bytes,
> +                                                    QEMUIOVector *qiov,
> +                                                    size_t qiov_offset,
> +                                                    int flags)
> +{
> +    do_preallocate(bs, offset, bytes, false);
> +
> +    return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
> +                                flags);
> +}
> +
> +static int coroutine_fn
> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
> +                        bool exact, PreallocMode prealloc,
> +                        BdrvRequestFlags flags, Error **errp)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc, flags, errp);
> +
> +    /* s->data_end may become negative here, which means unknown data end */
> +    s->data_end = bdrv_getlength(bs->file->bs);
> +
> +    return ret;
> +}
> +
> +static int coroutine_fn preallocate_co_flush(BlockDriverState *bs)
> +{
> +    if (!bs->file) {
> +        return 0;
> +    }

When does this happen? It's surprising to see the !bs->file check here
but not in other functions.

> +
> +    return bdrv_co_flush(bs->file->bs);
> +}
> +
> +static int64_t preallocate_getlength(BlockDriverState *bs)
> +{
> +    /*
> +     * We probably can return s->data_end here, but seems safer to return real
> +     * file length, not trying to hide the preallocation.
> +     *
> +     * Still, don't miss the chance to restore s->data_end if it is broken.
> +     */
> +    BDRVPreallocateState *s = bs->opaque;
> +    int64_t ret = bdrv_getlength(bs->file->bs);
> +
> +    if (s->data_end < 0) {
> +        s->data_end = ret;
> +    }
> +
> +    return ret;
> +}
> +
> +BlockDriver bdrv_preallocate_filter = {
> +    .format_name = "preallocate",
> +    .instance_size = sizeof(BDRVPreallocateState),
> +
> +    .bdrv_getlength = preallocate_getlength,
> +    .bdrv_open = preallocate_open,
> +    .bdrv_close = preallocate_close,
> +
> +    .bdrv_co_preadv_part = preallocate_co_preadv_part,
> +    .bdrv_co_pwritev_part = preallocate_co_pwritev_part,
> +    .bdrv_co_pwrite_zeroes = preallocate_co_pwrite_zeroes,
> +    .bdrv_co_pdiscard = preallocate_co_pdiscard,
> +    .bdrv_co_flush = preallocate_co_flush,
> +    .bdrv_co_truncate = preallocate_co_truncate,
> +
> +    .bdrv_co_block_status = bdrv_co_block_status_from_file,
> +
> +    .bdrv_child_perm = preallocate_child_perm,
> +
> +    .has_variable_length = true,
> +    .is_filter = true,
> +};
> +
> +static void bdrv_preallocate_init(void)
> +{
> +    bdrv_register(&bdrv_preallocate_filter);
> +}
> +
> +block_init(bdrv_preallocate_init);
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 3635b6b4c1..f46a353a35 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -45,6 +45,7 @@ block-obj-y += crypto.o
>  block-obj-y += aio_task.o
>  block-obj-y += backup-top.o
>  block-obj-y += filter-compress.o
> +block-obj-y += preallocate.o
>  common-obj-y += monitor/
>  
>  block-obj-y += stream.o
> -- 
> 2.18.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/5] iotests: QemuIoInteractive: use qemu_io_args_no_fmt
  2020-06-20 14:36 ` [PATCH 4/5] iotests: QemuIoInteractive: use qemu_io_args_no_fmt Vladimir Sementsov-Ogievskiy
@ 2020-07-08 12:08   ` Stefan Hajnoczi
  0 siblings, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2020-07-08 12:08 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

[-- Attachment #1: Type: text/plain, Size: 560 bytes --]

On Sat, Jun 20, 2020 at 05:36:48PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> All users of QemuIoInteractive provides -f argument, so it's incorrect
> to use qemu_io_args, which contains -f too. Let's use
> qemu_io_args_no_fmt, which also makes possible to use --image-opts with
> QemuIoInteractive in the following patch.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  tests/qemu-iotests/iotests.py | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] iotests: add 298 to test new preallocate filter driver
  2020-06-20 14:36 ` [PATCH 5/5] iotests: add 298 to test new preallocate filter driver Vladimir Sementsov-Ogievskiy
@ 2020-07-08 12:09   ` Stefan Hajnoczi
  0 siblings, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2020-07-08 12:09 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

[-- Attachment #1: Type: text/plain, Size: 492 bytes --]

On Sat, Jun 20, 2020 at 05:36:49PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  tests/qemu-iotests/298     | 45 ++++++++++++++++++++++++++++++++++++++
>  tests/qemu-iotests/298.out |  5 +++++
>  tests/qemu-iotests/group   |  1 +
>  3 files changed, 51 insertions(+)
>  create mode 100644 tests/qemu-iotests/298
>  create mode 100644 tests/qemu-iotests/298.out

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/5] block/io: introduce bdrv_try_mark_request_serialising
  2020-07-07 15:56   ` Stefan Hajnoczi
@ 2020-07-08 15:51     ` Vladimir Sementsov-Ogievskiy
  2020-07-13 11:23       ` Stefan Hajnoczi
  0 siblings, 1 reply; 17+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-08 15:51 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

07.07.2020 18:56, Stefan Hajnoczi wrote:
> On Sat, Jun 20, 2020 at 05:36:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> Introduce a function to mark the request serialising only if there are
>> no conflicting request to wait for.
>>
>> The function is static, so mark it unused. The attribute is to be
>> dropped in the next commit.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/io.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++-------
>>   1 file changed, 51 insertions(+), 7 deletions(-)
> 
> I found this patch difficult to understand because there are multiple
> levels of functions passing flags to ultimiately do different things in
> a common function.
> 
> Here are some ideas if you have time to rework this patch:
> 
> 1. Introduce a bdrv_find_overlapping_request() function that does most
>     of bdrv_wait_serialising_requests_locked() but does not wait. Then
>     bdrv_wait_serialising_requests_locked() can call that function in a
>     loop and wait if an overlapping request is found.

I thought about it, but splitting bdrv_find_overlapping_request is not so clear:
it should include most of the logic inside "if (tracked_request_overlaps(..":
an assertion, and checking !req->waiting_for. So the semantics of new functions
becomes unclear, and it lead to splitting "->waiting_for" logic.. So, I decided
to keep the whole function as is, not splitted. I just can't imagine reasonable
split.

> 
> 2. Pass overlap_offset/overlap_bytes arguments to
>     bdrv_find_overlapping_request() instead of changing and restoring the
>     value in bdrv_do_mark_request_serialising().

I'm not sure that it would be safe to not add a request to the list during the
search.

> 
> 3. Use consistent names for flags: wait/blocking, found/success
> 
> I'm not sure if all these ideas will work, but I get the feeling this
> code can be refactored to make it easier to understand. Since I don't
> have a concrete suggestion and the code looks correct:

Hmm. Unfortunately I didn't record the problems I faced on the way to resulting
design, so I just don't remember now the details. So, I'll try to apply your
suggestions, and remember the problems (or we'll get better patch :)

> 
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> 

Thanks!

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] block/io: introduce bdrv_co_range_try_lock
  2020-07-07 16:10   ` Stefan Hajnoczi
@ 2020-07-08 15:56     ` Vladimir Sementsov-Ogievskiy
  2020-07-13 11:23       ` Stefan Hajnoczi
  0 siblings, 1 reply; 17+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-08 15:56 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

07.07.2020 19:10, Stefan Hajnoczi wrote:
> On Sat, Jun 20, 2020 at 05:36:46PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> @@ -83,6 +84,12 @@ typedef struct BdrvTrackedRequest {
>>       CoQueue wait_queue; /* coroutines blocked on this request */
>>   
>>       struct BdrvTrackedRequest *waiting_for;
>> +
>> +    /*
>> +     * If non-zero, the request is under lock, so it's allowed to intersect
>> +     * (actully it must be inside) the @lock request.
> 
> s/actully/actually/
> 
>> @@ -745,15 +747,26 @@ static bool coroutine_fn wait_or_find_conflicts(BdrvTrackedRequest *self,
>>               if (tracked_request_overlaps(req, self->overlap_offset,
>>                                            self->overlap_bytes))
>>               {
>> -                /* Hitting this means there was a reentrant request, for
>> -                 * example, a block driver issuing nested requests.  This must
>> -                 * never happen since it means deadlock.
>> +                if (self->lock == req) {
>> +                    /* This is atomic request under range_lock */
>> +                    assert(req->type == BDRV_TRACKED_LOCK);
>> +                    assert(self->offset >= req->offset);
>> +                    assert(self->bytes <= req->bytes);
> 
> These assertions do not catch requests that start within the locked
> region but span beyond the end of the region. How about:
> 
>    assert(self->offset + self->bytes - req->offset >= req->bytes);
> 
>> +int coroutine_fn bdrv_co_pwrite_zeroes_locked(BdrvChild *child, int64_t offset,
>> +                                              int bytes, BdrvRequestFlags flags,
>> +                                              BdrvTrackedRequest *lock)
> 
> The name is confusing because _locked() normally means that a mutex
> should be held. Functions using that naming convention already exist in
> block/io.c. It would be nice to distinguish between functions that need
> BdrvTrackedRequest and functions that must be called with a mutex held.
> 
> How about bdrv_co_pwrite_zeroes_with_lock()?
> 

OK.

May be _in_locked_range ? A bit longer, but more understandable.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] block: introduce preallocate filter
  2020-07-08 12:07   ` Stefan Hajnoczi
@ 2020-07-08 16:17     ` Vladimir Sementsov-Ogievskiy
  2020-07-13 11:27       ` Stefan Hajnoczi
  0 siblings, 1 reply; 17+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-08 16:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

08.07.2020 15:07, Stefan Hajnoczi wrote:
> On Sat, Jun 20, 2020 at 05:36:47PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> It may be used for file-systems with slow allocation.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   qapi/block-core.json |   3 +-
>>   block/preallocate.c  | 255 +++++++++++++++++++++++++++++++++++++++++++
>>   block/Makefile.objs  |   1 +
>>   3 files changed, 258 insertions(+), 1 deletion(-)
>>   create mode 100644 block/preallocate.c
> 
> Please add documentation to docs/system/qemu-block-drivers.rst.inc
> describing the purpose of this block driver and how to use it.

This implies adding new section "Filters", yes?

> 
> Since this filter grows the file I guess it's intended to be below an
> image format?

Yes, between format and protocol nodes.

> 
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 0e1c6a59f2..a0bda399d6 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2805,7 +2805,7 @@
>>               'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
>>               'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
>>               'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
>> -            'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>> +            'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>>               { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
>>               'sheepdog',
>>               'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
>> @@ -3995,6 +3995,7 @@
>>         'null-co':    'BlockdevOptionsNull',
>>         'nvme':       'BlockdevOptionsNVMe',
>>         'parallels':  'BlockdevOptionsGenericFormat',
>> +      'preallocate':'BlockdevOptionsGenericFormat',
>>         'qcow2':      'BlockdevOptionsQcow2',
>>         'qcow':       'BlockdevOptionsQcow',
>>         'qed':        'BlockdevOptionsGenericCOWFormat',
>> diff --git a/block/preallocate.c b/block/preallocate.c
>> new file mode 100644
>> index 0000000000..c272a6e41d
>> --- /dev/null
>> +++ b/block/preallocate.c
>> @@ -0,0 +1,255 @@
>> +/*
>> + * preallocate filter driver
>> + *
>> + * The driver performs preallocate operation: it is injected above
>> + * some node, and before each write over EOF it does additional preallocating
>> + * write-zeroes request.
>> + *
>> + * Copyright (c) 2020 Virtuozzo International GmbH.
>> + *
>> + * Author:
>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +
>> +#include "qemu/module.h"
>> +#include "qemu/units.h"
>> +#include "block/block_int.h"
>> +
>> +
>> +typedef struct BDRVPreallocateState {
>> +    int64_t prealloc_size;
>> +    int64_t prealloc_align;
>> +
>> +    /*
>> +     * Track real data end, to crop preallocation on close  data_end may be
>> +     * negative, which means that actual status is unknown (nothing cropped in
>> +     * this case)
>> +     */
>> +    int64_t data_end;
>> +} BDRVPreallocateState;
>> +
>> +
>> +static int preallocate_open(BlockDriverState *bs, QDict *options, int flags,
>> +                            Error **errp)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +
>> +    /*
>> +     * Parameters are hardcoded now. May need to add corresponding options in
>> +     * future.
>> +     */
> 
> The code for .bdrv_open() options is quick to write. If you add the
> options right away then it will be much easier for users who need to
> tweak them in the future.

OK

> 
>> +    s->prealloc_align = 1 * MiB;
>> +    s->prealloc_size = 128 * MiB;
>> +
>> +    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
>> +                               BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
>> +                               false, errp);
>> +    if (!bs->file) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    s->data_end = bdrv_getlength(bs->file->bs);
>> +    if (s->data_end < 0) {
>> +        return s->data_end;
>> +    }
>> +
>> +    bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
>> +        (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
>> +
>> +    bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
>> +        ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
>> +            bs->file->bs->supported_zero_flags);
>> +
>> +    return 0;
>> +}
>> +
>> +static void preallocate_close(BlockDriverState *bs)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +
>> +    if (s->data_end >= 0 && bdrv_getlength(bs->file->bs) > s->data_end) {
>> +        bdrv_truncate(bs->file, s->data_end, true, PREALLOC_MODE_OFF, 0, NULL);
>> +    }
>> +}
>> +
>> +static void preallocate_child_perm(BlockDriverState *bs, BdrvChild *c,
>> +                                   BdrvChildRole role,
>> +                                   BlockReopenQueue *reopen_queue,
>> +                                   uint64_t perm, uint64_t shared,
>> +                                   uint64_t *nperm, uint64_t *nshared)
>> +{
>> +    bdrv_default_perms(bs, c, role, reopen_queue, perm, shared, nperm, nshared);
>> +
>> +    /* Force RESIZE permission, to be able to crop file on close() */
>> +    *nperm |= BLK_PERM_RESIZE;
>> +}
>> +
>> +static coroutine_fn int preallocate_co_preadv_part(
>> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
>> +        QEMUIOVector *qiov, size_t qiov_offset, int flags)
>> +{
>> +    return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
>> +                               flags);
>> +}
>> +
>> +static int coroutine_fn preallocate_co_pdiscard(BlockDriverState *bs,
>> +                                               int64_t offset, int bytes)
>> +{
>> +    return bdrv_co_pdiscard(bs->file, offset, bytes);
>> +}
>> +
>> +static bool coroutine_fn do_preallocate(BlockDriverState *bs, int64_t offset,
>> +                                       int64_t bytes, bool write_zero)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +    int64_t len, start, end;
>> +    BdrvTrackedRequest *lock;
>> +    int ret;
>> +
>> +    if (s->data_end >= 0) {
>> +        s->data_end = MAX(s->data_end,
>> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
>> +    }
>> +
>> +    len = bdrv_getlength(bs->file->bs);
>> +    if (len < 0) {
>> +        return false;
>> +    }
>> +
>> +    if (s->data_end < 0) {
>> +        s->data_end = MAX(len,
>> +                          QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE));
>> +    }
>> +
>> +    if (offset + bytes <= len) {
>> +        return false;
>> +    }
>> +
>> +    lock = bdrv_co_range_try_lock(bs->file->bs, len, INT64_MAX - len);
>> +    if (!lock) {
>> +        /* There are already preallocating requests in-fligth */
> 
> s/fligth/flight/
> 
>> +        return false;
>> +    }
>> +
>> +    /* Length should not have changed */
>> +    assert(len == bdrv_getlength(bs->file->bs));
>> +
>> +    start = write_zero ? MIN(offset, len) : len;
>> +    end = QEMU_ALIGN_UP(offset + bytes + s->prealloc_size, s->prealloc_align);
>> +
>> +    ret = bdrv_co_pwrite_zeroes_locked(bs->file, start, end - start,
>> +                                       BDRV_REQ_NO_FALLBACK, lock);
>> +
>> +    bdrv_co_range_unlock(lock);
> 
> Hmm...if this piece of code is the only user of bdrv_co_range_try_lock()
> then a BDRV_REQ_NO_WAIT flag might be a simpler API.
> 
> I thought the lock request would be used to perform multiple operations,
> but if it's just for a single operation then I think it's less code and
> easier to understand without the lock request.

Hmm, again, I don't remember exact reasons. Firstly, I was afraid of length
change during try_lock and have a double check for bdrv_getlength(). Then
I decided that it's impossible and change the check to an assertion.
Probably, the only reason to leave locked range was "I already have the code,
it will help with copy-on-read, why not to use it".. OK, I'll try rewrite it
with help of new flag.

> 
>> +
>> +    return !ret;
>> +}
>> +
>> +static int coroutine_fn preallocate_co_pwrite_zeroes(BlockDriverState *bs,
>> +        int64_t offset, int bytes, BdrvRequestFlags flags)
>> +{
>> +    if (do_preallocate(bs, offset, bytes, true)) {
>> +        return 0;
>> +    }
>> +
>> +    return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
>> +}
>> +
>> +static coroutine_fn int preallocate_co_pwritev_part(BlockDriverState *bs,
>> +                                                    uint64_t offset,
>> +                                                    uint64_t bytes,
>> +                                                    QEMUIOVector *qiov,
>> +                                                    size_t qiov_offset,
>> +                                                    int flags)
>> +{
>> +    do_preallocate(bs, offset, bytes, false);
>> +
>> +    return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
>> +                                flags);
>> +}
>> +
>> +static int coroutine_fn
>> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
>> +                        bool exact, PreallocMode prealloc,
>> +                        BdrvRequestFlags flags, Error **errp)
>> +{
>> +    BDRVPreallocateState *s = bs->opaque;
>> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc, flags, errp);
>> +
>> +    /* s->data_end may become negative here, which means unknown data end */
>> +    s->data_end = bdrv_getlength(bs->file->bs);
>> +
>> +    return ret;
>> +}
>> +
>> +static int coroutine_fn preallocate_co_flush(BlockDriverState *bs)
>> +{
>> +    if (!bs->file) {
>> +        return 0;
>> +    }
> 
> When does this happen? It's surprising to see the !bs->file check here
> but not in other functions.

It's just done line in mirror-top and backup-top.. But seems there should not be such an issue. Will drop.

> 
>> +
>> +    return bdrv_co_flush(bs->file->bs);
>> +}
>> +
>> +static int64_t preallocate_getlength(BlockDriverState *bs)
>> +{
>> +    /*
>> +     * We probably can return s->data_end here, but seems safer to return real
>> +     * file length, not trying to hide the preallocation.
>> +     *
>> +     * Still, don't miss the chance to restore s->data_end if it is broken.
>> +     */
>> +    BDRVPreallocateState *s = bs->opaque;
>> +    int64_t ret = bdrv_getlength(bs->file->bs);
>> +
>> +    if (s->data_end < 0) {
>> +        s->data_end = ret;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +BlockDriver bdrv_preallocate_filter = {
>> +    .format_name = "preallocate",
>> +    .instance_size = sizeof(BDRVPreallocateState),
>> +
>> +    .bdrv_getlength = preallocate_getlength,
>> +    .bdrv_open = preallocate_open,
>> +    .bdrv_close = preallocate_close,
>> +
>> +    .bdrv_co_preadv_part = preallocate_co_preadv_part,
>> +    .bdrv_co_pwritev_part = preallocate_co_pwritev_part,
>> +    .bdrv_co_pwrite_zeroes = preallocate_co_pwrite_zeroes,
>> +    .bdrv_co_pdiscard = preallocate_co_pdiscard,
>> +    .bdrv_co_flush = preallocate_co_flush,
>> +    .bdrv_co_truncate = preallocate_co_truncate,
>> +
>> +    .bdrv_co_block_status = bdrv_co_block_status_from_file,
>> +
>> +    .bdrv_child_perm = preallocate_child_perm,
>> +
>> +    .has_variable_length = true,
>> +    .is_filter = true,
>> +};
>> +
>> +static void bdrv_preallocate_init(void)
>> +{
>> +    bdrv_register(&bdrv_preallocate_filter);
>> +}
>> +
>> +block_init(bdrv_preallocate_init);
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index 3635b6b4c1..f46a353a35 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -45,6 +45,7 @@ block-obj-y += crypto.o
>>   block-obj-y += aio_task.o
>>   block-obj-y += backup-top.o
>>   block-obj-y += filter-compress.o
>> +block-obj-y += preallocate.o
>>   common-obj-y += monitor/
>>   
>>   block-obj-y += stream.o
>> -- 
>> 2.18.0
>>


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/5] block/io: introduce bdrv_try_mark_request_serialising
  2020-07-08 15:51     ` Vladimir Sementsov-Ogievskiy
@ 2020-07-13 11:23       ` Stefan Hajnoczi
  0 siblings, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2020-07-13 11:23 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

[-- Attachment #1: Type: text/plain, Size: 2502 bytes --]

On Wed, Jul 08, 2020 at 06:51:18PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 07.07.2020 18:56, Stefan Hajnoczi wrote:
> > On Sat, Jun 20, 2020 at 05:36:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > Introduce a function to mark the request serialising only if there are
> > > no conflicting request to wait for.
> > > 
> > > The function is static, so mark it unused. The attribute is to be
> > > dropped in the next commit.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > ---
> > >   block/io.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++-------
> > >   1 file changed, 51 insertions(+), 7 deletions(-)
> > 
> > I found this patch difficult to understand because there are multiple
> > levels of functions passing flags to ultimiately do different things in
> > a common function.
> > 
> > Here are some ideas if you have time to rework this patch:
> > 
> > 1. Introduce a bdrv_find_overlapping_request() function that does most
> >     of bdrv_wait_serialising_requests_locked() but does not wait. Then
> >     bdrv_wait_serialising_requests_locked() can call that function in a
> >     loop and wait if an overlapping request is found.
> 
> I thought about it, but splitting bdrv_find_overlapping_request is not so clear:
> it should include most of the logic inside "if (tracked_request_overlaps(..":
> an assertion, and checking !req->waiting_for. So the semantics of new functions
> becomes unclear, and it lead to splitting "->waiting_for" logic.. So, I decided
> to keep the whole function as is, not splitted. I just can't imagine reasonable
> split.
> 
> > 
> > 2. Pass overlap_offset/overlap_bytes arguments to
> >     bdrv_find_overlapping_request() instead of changing and restoring the
> >     value in bdrv_do_mark_request_serialising().
> 
> I'm not sure that it would be safe to not add a request to the list during the
> search.
> 
> > 
> > 3. Use consistent names for flags: wait/blocking, found/success
> > 
> > I'm not sure if all these ideas will work, but I get the feeling this
> > code can be refactored to make it easier to understand. Since I don't
> > have a concrete suggestion and the code looks correct:
> 
> Hmm. Unfortunately I didn't record the problems I faced on the way to resulting
> design, so I just don't remember now the details. So, I'll try to apply your
> suggestions, and remember the problems (or we'll get better patch :)

Thanks!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] block/io: introduce bdrv_co_range_try_lock
  2020-07-08 15:56     ` Vladimir Sementsov-Ogievskiy
@ 2020-07-13 11:23       ` Stefan Hajnoczi
  0 siblings, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2020-07-13 11:23 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

[-- Attachment #1: Type: text/plain, Size: 944 bytes --]

On Wed, Jul 08, 2020 at 06:56:53PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 07.07.2020 19:10, Stefan Hajnoczi wrote:
> > On Sat, Jun 20, 2020 at 05:36:46PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > +int coroutine_fn bdrv_co_pwrite_zeroes_locked(BdrvChild *child, int64_t offset,
> > > +                                              int bytes, BdrvRequestFlags flags,
> > > +                                              BdrvTrackedRequest *lock)
> > 
> > The name is confusing because _locked() normally means that a mutex
> > should be held. Functions using that naming convention already exist in
> > block/io.c. It would be nice to distinguish between functions that need
> > BdrvTrackedRequest and functions that must be called with a mutex held.
> > 
> > How about bdrv_co_pwrite_zeroes_with_lock()?
> > 
> 
> OK.
> 
> May be _in_locked_range ? A bit longer, but more understandable.

Sounds good.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] block: introduce preallocate filter
  2020-07-08 16:17     ` Vladimir Sementsov-Ogievskiy
@ 2020-07-13 11:27       ` Stefan Hajnoczi
  0 siblings, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2020-07-13 11:27 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, kwolf, qemu-block, Anton.Nefedov, armbru, qemu-devel, mreitz, den

[-- Attachment #1: Type: text/plain, Size: 2482 bytes --]

On Wed, Jul 08, 2020 at 07:17:52PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 08.07.2020 15:07, Stefan Hajnoczi wrote:
> > On Sat, Jun 20, 2020 at 05:36:47PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > It may be used for file-systems with slow allocation.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > ---
> > >   qapi/block-core.json |   3 +-
> > >   block/preallocate.c  | 255 +++++++++++++++++++++++++++++++++++++++++++
> > >   block/Makefile.objs  |   1 +
> > >   3 files changed, 258 insertions(+), 1 deletion(-)
> > >   create mode 100644 block/preallocate.c
> > 
> > Please add documentation to docs/system/qemu-block-drivers.rst.inc
> > describing the purpose of this block driver and how to use it.
> 
> This implies adding new section "Filters", yes?

Yes, please.

> > 
> > Since this filter grows the file I guess it's intended to be below an
> > image format?
> 
> Yes, between format and protocol nodes.

Thanks for confirming. Please include this in the documentation. While
reading the code I was thinking about how this can work if the guest is
directly exposed to the filter. Users might wonder about the same thing.

> > > +    /* Length should not have changed */
> > > +    assert(len == bdrv_getlength(bs->file->bs));
> > > +
> > > +    start = write_zero ? MIN(offset, len) : len;
> > > +    end = QEMU_ALIGN_UP(offset + bytes + s->prealloc_size, s->prealloc_align);
> > > +
> > > +    ret = bdrv_co_pwrite_zeroes_locked(bs->file, start, end - start,
> > > +                                       BDRV_REQ_NO_FALLBACK, lock);
> > > +
> > > +    bdrv_co_range_unlock(lock);
> > 
> > Hmm...if this piece of code is the only user of bdrv_co_range_try_lock()
> > then a BDRV_REQ_NO_WAIT flag might be a simpler API.
> > 
> > I thought the lock request would be used to perform multiple operations,
> > but if it's just for a single operation then I think it's less code and
> > easier to understand without the lock request.
> 
> Hmm, again, I don't remember exact reasons. Firstly, I was afraid of length
> change during try_lock and have a double check for bdrv_getlength(). Then
> I decided that it's impossible and change the check to an assertion.
> Probably, the only reason to leave locked range was "I already have the code,
> it will help with copy-on-read, why not to use it".. OK, I'll try rewrite it
> with help of new flag.

Thanks!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-07-13 11:28 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-20 14:36 [PATCH 0/5] preallocate filter Vladimir Sementsov-Ogievskiy
2020-06-20 14:36 ` [PATCH 1/5] block/io: introduce bdrv_try_mark_request_serialising Vladimir Sementsov-Ogievskiy
2020-07-07 15:56   ` Stefan Hajnoczi
2020-07-08 15:51     ` Vladimir Sementsov-Ogievskiy
2020-07-13 11:23       ` Stefan Hajnoczi
2020-06-20 14:36 ` [PATCH 2/5] block/io: introduce bdrv_co_range_try_lock Vladimir Sementsov-Ogievskiy
2020-07-07 16:10   ` Stefan Hajnoczi
2020-07-08 15:56     ` Vladimir Sementsov-Ogievskiy
2020-07-13 11:23       ` Stefan Hajnoczi
2020-06-20 14:36 ` [PATCH 3/5] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
2020-07-08 12:07   ` Stefan Hajnoczi
2020-07-08 16:17     ` Vladimir Sementsov-Ogievskiy
2020-07-13 11:27       ` Stefan Hajnoczi
2020-06-20 14:36 ` [PATCH 4/5] iotests: QemuIoInteractive: use qemu_io_args_no_fmt Vladimir Sementsov-Ogievskiy
2020-07-08 12:08   ` Stefan Hajnoczi
2020-06-20 14:36 ` [PATCH 5/5] iotests: add 298 to test new preallocate filter driver Vladimir Sementsov-Ogievskiy
2020-07-08 12:09   ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.