qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] file-posix: Cache next hole
@ 2021-02-11 17:22 Max Reitz
  2021-02-11 17:22 ` [PATCH 1/2] block/mirror: Fix mirror_top's permissions Max Reitz
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Max Reitz @ 2021-02-11 17:22 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Hi,

On Launchpad, Alexandre has reported a performance problem with
SEEK_HOLE on filesystems with much fragmentation:

  https://bugs.launchpad.net/qemu/+bug/1912224

This isn’t the first report we’ve received on SEEK_HOLE being slow, and
we’ve already taken measures to address this issue.  For example,
block-status has a @want_zero parameter so that we won’t do such a seek
operation unless the caller needs that information.  qcow2 tries to
avoid falling through to the protocol level, because most of the time,
it itself knows well which clusters are zero and which aren’t.

But both of these measures don’t work e.g. when mirroring a raw file,
when we want to look up zero areas (because it may speed up the mirror
and save space), and where we don’t have a format layer that has
randomly accessible zero information.

Alexandre proposed caching the offset of the next hole, which we can do
in file-posix (unless the WRITE permission is shared), and which works
best for images that are (nearly) fully allocated.  Such images are also
the ones that profit the least off of the time lost on SEEK_HOLE
operations, i.e. where it’s most important to keep that time low.

I understand that caching filesystem information in file-posix is a bit
weird, but I’d like to hear what you think.  Alexandre has reported that
it alleviated his problem quite significantly (see comment 8 in the LP
report).


(Speaking of “unless the WRITE permission is shared”: mirror_top is a
bit broken in that it takes no permissions (but WRITE if necessary) and
shares everything.  That seems wrong.  Patch 1 addresses that, so that
patch 2 can actually do something when mirroring an image.)


Max Reitz (2):
  block/mirror: Fix mirror_top's permissions
  file-posix: Cache next hole

 block/file-posix.c | 81 +++++++++++++++++++++++++++++++++++++++++++++-
 block/mirror.c     | 32 ++++++++++++++----
 2 files changed, 105 insertions(+), 8 deletions(-)

-- 
2.29.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] block/mirror: Fix mirror_top's permissions
  2021-02-11 17:22 [PATCH 0/2] file-posix: Cache next hole Max Reitz
@ 2021-02-11 17:22 ` Max Reitz
  2021-02-11 19:33   ` Eric Blake
  2021-02-12  9:04   ` Vladimir Sementsov-Ogievskiy
  2021-02-11 17:22 ` [PATCH 2/2] file-posix: Cache next hole Max Reitz
  2021-03-29 16:21 ` [PATCH 0/2] " Max Reitz
  2 siblings, 2 replies; 14+ messages in thread
From: Max Reitz @ 2021-02-11 17:22 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

mirror_top currently shares all permissions, and takes only the WRITE
permission (if some parent has taken that permission, too).

That is wrong, though; mirror_top is a filter, so it should take
permissions like any other filter does.  For example, if the parent
needs CONSISTENT_READ, we need to take that, too, and if it cannot share
the WRITE permission, we cannot share it either.

The exception is when mirror_top is used for active commit, where we
cannot take CONSISTENT_READ (because it is deliberately unshared above
the base node) and where we must share WRITE (so that it is shared for
all images in the backing chain, so the mirror job can take it for the
target BB).

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 8e1ad6eceb..1edfc3cc14 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -89,6 +89,7 @@ typedef struct MirrorBlockJob {
 typedef struct MirrorBDSOpaque {
     MirrorBlockJob *job;
     bool stop;
+    bool is_commit;
 } MirrorBDSOpaque;
 
 struct MirrorOp {
@@ -1513,13 +1514,27 @@ static void bdrv_mirror_top_child_perm(BlockDriverState *bs, BdrvChild *c,
         return;
     }
 
-    /* Must be able to forward guest writes to the real image */
-    *nperm = 0;
-    if (perm & BLK_PERM_WRITE) {
-        *nperm |= BLK_PERM_WRITE;
-    }
+    bdrv_default_perms(bs, c, role, reopen_queue,
+                       perm, shared, nperm, nshared);
 
-    *nshared = BLK_PERM_ALL;
+    if (s->is_commit) {
+        /*
+         * For commit jobs, we cannot take CONSISTENT_READ, because
+         * that permission is unshared for everything above the base
+         * node (except for filters on the base node).
+         * We also have to force-share the WRITE permission, or
+         * otherwise we would block ourselves at the base node (if
+         * writes are blocked for a node, they are also blocked for
+         * its backing file).
+         * (We could also share RESIZE, because it may be needed for
+         * the target if its size is less than the top node's; but
+         * bdrv_default_perms_for_cow() automatically shares RESIZE
+         * for backing nodes if WRITE is shared, so there is no need
+         * to do it here.)
+         */
+        *nperm &= ~BLK_PERM_CONSISTENT_READ;
+        *nshared |= BLK_PERM_WRITE;
+    }
 }
 
 /* Dummy node that provides consistent read to its users without requiring it
@@ -1583,6 +1598,8 @@ static BlockJob *mirror_start_job(
         return NULL;
     }
 
+    target_is_backing = bdrv_chain_contains(bs, target);
+
     /* In the case of active commit, add dummy driver to provide consistent
      * reads on the top, while disabling it in the intermediate nodes, and make
      * the backing chain writable. */
@@ -1605,6 +1622,8 @@ static BlockJob *mirror_start_job(
     bs_opaque = g_new0(MirrorBDSOpaque, 1);
     mirror_top_bs->opaque = bs_opaque;
 
+    bs_opaque->is_commit = target_is_backing;
+
     /* bdrv_append takes ownership of the mirror_top_bs reference, need to keep
      * it alive until block_job_create() succeeds even if bs has no parent. */
     bdrv_ref(mirror_top_bs);
@@ -1646,7 +1665,6 @@ static BlockJob *mirror_start_job(
     target_perms = BLK_PERM_WRITE;
     target_shared_perms = BLK_PERM_WRITE_UNCHANGED;
 
-    target_is_backing = bdrv_chain_contains(bs, target);
     if (target_is_backing) {
         int64_t bs_size, target_size;
         bs_size = bdrv_getlength(bs);
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/2] file-posix: Cache next hole
  2021-02-11 17:22 [PATCH 0/2] file-posix: Cache next hole Max Reitz
  2021-02-11 17:22 ` [PATCH 1/2] block/mirror: Fix mirror_top's permissions Max Reitz
@ 2021-02-11 17:22 ` Max Reitz
  2021-02-11 20:00   ` Eric Blake
  2021-02-11 20:38   ` Vladimir Sementsov-Ogievskiy
  2021-03-29 16:21 ` [PATCH 0/2] " Max Reitz
  2 siblings, 2 replies; 14+ messages in thread
From: Max Reitz @ 2021-02-11 17:22 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

We have repeatedly received reports that SEEK_HOLE and SEEK_DATA are
slow on certain filesystems and/or under certain circumstances.  That is
why we generally try to avoid it (which is why bdrv_co_block_status()
has the @want_zero parameter, and which is why qcow2 has a metadata
preallocation detection, so we do not fall through to the protocol layer
to discover which blocks are zero, unless that is really necessary
(i.e., for metadata-preallocated images)).

In addition to those measures, we can also try to speed up zero
detection by letting file-posix cache some hole location information,
namely where the next hole after the most recently queried offset is.
This helps especially for images that are (nearly) fully allocated,
which is coincidentally also the case where querying for zero
information cannot gain us much.

Note that this of course only works so long as we have no concurrent
writers to the image, which is the case when the WRITE capability is not
shared.

Alternatively (or perhaps as an improvement in the future), we could let
file-posix keep track of what it knows is zero and what it knows is
non-zero with bitmaps, which would help images that actually have a
significant number of holes (where this implementation here cannot do
much).  But for such images, SEEK_HOLE/DATA are generally faster (they
do not need to seek through the whole file), and the performance lost by
querying the block status does not feel as bad because it is outweighed
by the performance that can be saved by special-cases zeroed areas, so
focussing on images that are (nearly) fully allocated is more important.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/file-posix.c | 81 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 05079b40ca..2ca0a2e05b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -172,6 +172,11 @@ typedef struct BDRVRawState {
     } stats;
 
     PRManager *pr_mgr;
+
+    bool can_cache_next_zero_offset;
+    bool next_zero_offset_valid;
+    uint64_t next_zero_offset_from;
+    uint64_t next_zero_offset;
 } BDRVRawState;
 
 typedef struct BDRVRawReopenState {
@@ -2049,7 +2054,25 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                        uint64_t bytes, QEMUIOVector *qiov,
                                        int flags)
 {
+    BDRVRawState *s = bs->opaque;
+
     assert(flags == 0);
+
+    /*
+     * If offset is just above s->next_zero_offset, the hole that was
+     * reportedly there might be removed from the file (because only
+     * whole filesystem clusters can be zeroed).  But that does not
+     * matter, because block-status does not care about whether there
+     * actually is a hole, but just about whether there are zeroes
+     * there - and this write will not make those zeroes non-zero.
+     */
+    if (s->next_zero_offset_valid &&
+        offset <= s->next_zero_offset &&
+        offset + bytes > s->next_zero_offset)
+    {
+        s->next_zero_offset_valid = false;
+    }
+
     return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
 }
 
@@ -2183,6 +2206,10 @@ static int coroutine_fn raw_co_truncate(BlockDriverState *bs, int64_t offset,
     struct stat st;
     int ret;
 
+    if (s->next_zero_offset_valid && offset < s->next_zero_offset) {
+        s->next_zero_offset_valid = false;
+    }
+
     if (fstat(s->fd, &st)) {
         ret = -errno;
         error_setg_errno(errp, -ret, "Failed to fstat() the file");
@@ -2616,8 +2643,17 @@ static int coroutine_fn raw_co_delete_file(BlockDriverState *bs,
 static int find_allocation(BlockDriverState *bs, off_t start,
                            off_t *data, off_t *hole)
 {
-#if defined SEEK_HOLE && defined SEEK_DATA
     BDRVRawState *s = bs->opaque;
+
+    if (s->next_zero_offset_valid) {
+        if (start >= s->next_zero_offset_from && start < s->next_zero_offset) {
+            *data = start;
+            *hole = s->next_zero_offset;
+            return 0;
+        }
+    }
+
+#if defined SEEK_HOLE && defined SEEK_DATA
     off_t offs;
 
     /*
@@ -2716,6 +2752,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
                                             int64_t *map,
                                             BlockDriverState **file)
 {
+    BDRVRawState *s = bs->opaque;
     off_t data = 0, hole = 0;
     int ret;
 
@@ -2734,6 +2771,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     }
 
     ret = find_allocation(bs, offset, &data, &hole);
+    s->next_zero_offset_valid = false;
     if (ret == -ENXIO) {
         /* Trailing hole */
         *pnum = bytes;
@@ -2761,6 +2799,12 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
         }
 
         ret = BDRV_BLOCK_DATA;
+
+        if (s->can_cache_next_zero_offset) {
+            s->next_zero_offset_valid = true;
+            s->next_zero_offset_from = offset;
+            s->next_zero_offset = hole;
+        }
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
@@ -2910,6 +2954,13 @@ raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, bool blkdev)
     RawPosixAIOData acb;
     int ret;
 
+    if (s->next_zero_offset_valid &&
+        offset <= s->next_zero_offset &&
+        offset + bytes > s->next_zero_offset_from)
+    {
+        s->next_zero_offset_valid = false;
+    }
+
     acb = (RawPosixAIOData) {
         .bs             = bs,
         .aio_fildes     = s->fd,
@@ -2941,6 +2992,17 @@ raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int bytes,
     RawPosixAIOData acb;
     ThreadPoolFunc *handler;
 
+    if (s->next_zero_offset_valid &&
+        offset < s->next_zero_offset &&
+        offset + bytes > s->next_zero_offset_from)
+    {
+        if (offset > s->next_zero_offset_from) {
+            s->next_zero_offset = offset;
+        } else {
+            s->next_zero_offset_valid = false;
+        }
+    }
+
 #ifdef CONFIG_FALLOCATE
     if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
         BdrvTrackedRequest *req;
@@ -3155,6 +3217,15 @@ static void raw_set_perm(BlockDriverState *bs, uint64_t perm, uint64_t shared)
     raw_handle_perm_lock(bs, RAW_PL_COMMIT, perm, shared, NULL);
     s->perm = perm;
     s->shared_perm = shared;
+
+    /*
+     * We can only cache anything if there are no external writers on
+     * the image.
+     */
+    s->can_cache_next_zero_offset = !(shared & BLK_PERM_WRITE);
+    if (!s->can_cache_next_zero_offset) {
+        s->next_zero_offset_valid = false;
+    }
 }
 
 static void raw_abort_perm_update(BlockDriverState *bs)
@@ -3203,6 +3274,14 @@ static int coroutine_fn raw_co_copy_range_to(BlockDriverState *bs,
         return -EIO;
     }
 
+    /* Same as in raw_co_pwritev() */
+    if (s->next_zero_offset_valid &&
+        dst_offset <= s->next_zero_offset &&
+        dst_offset + bytes > s->next_zero_offset_from)
+    {
+        s->next_zero_offset_valid = false;
+    }
+
     acb = (RawPosixAIOData) {
         .bs             = bs,
         .aio_type       = QEMU_AIO_COPY_RANGE,
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block/mirror: Fix mirror_top's permissions
  2021-02-11 17:22 ` [PATCH 1/2] block/mirror: Fix mirror_top's permissions Max Reitz
@ 2021-02-11 19:33   ` Eric Blake
  2021-02-12  9:04   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 14+ messages in thread
From: Eric Blake @ 2021-02-11 19:33 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel

On 2/11/21 11:22 AM, Max Reitz wrote:
> mirror_top currently shares all permissions, and takes only the WRITE
> permission (if some parent has taken that permission, too).
> 
> That is wrong, though; mirror_top is a filter, so it should take
> permissions like any other filter does.  For example, if the parent
> needs CONSISTENT_READ, we need to take that, too, and if it cannot share
> the WRITE permission, we cannot share it either.
> 
> The exception is when mirror_top is used for active commit, where we
> cannot take CONSISTENT_READ (because it is deliberately unshared above
> the base node) and where we must share WRITE (so that it is shared for
> all images in the backing chain, so the mirror job can take it for the
> target BB).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/mirror.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] file-posix: Cache next hole
  2021-02-11 17:22 ` [PATCH 2/2] file-posix: Cache next hole Max Reitz
@ 2021-02-11 20:00   ` Eric Blake
  2021-02-12  9:25     ` Max Reitz
  2021-02-11 20:38   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Blake @ 2021-02-11 20:00 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel

On 2/11/21 11:22 AM, Max Reitz wrote:
> We have repeatedly received reports that SEEK_HOLE and SEEK_DATA are
> slow on certain filesystems and/or under certain circumstances.  That is
> why we generally try to avoid it (which is why bdrv_co_block_status()
> has the @want_zero parameter, and which is why qcow2 has a metadata
> preallocation detection, so we do not fall through to the protocol layer
> to discover which blocks are zero, unless that is really necessary
> (i.e., for metadata-preallocated images)).
> 
> In addition to those measures, we can also try to speed up zero
> detection by letting file-posix cache some hole location information,
> namely where the next hole after the most recently queried offset is.
> This helps especially for images that are (nearly) fully allocated,
> which is coincidentally also the case where querying for zero
> information cannot gain us much.
> 
> Note that this of course only works so long as we have no concurrent
> writers to the image, which is the case when the WRITE capability is not
> shared.
> 
> Alternatively (or perhaps as an improvement in the future), we could let
> file-posix keep track of what it knows is zero and what it knows is
> non-zero with bitmaps, which would help images that actually have a
> significant number of holes (where this implementation here cannot do
> much).  But for such images, SEEK_HOLE/DATA are generally faster (they
> do not need to seek through the whole file), and the performance lost by
> querying the block status does not feel as bad because it is outweighed
> by the performance that can be saved by special-cases zeroed areas, so
> focussing on images that are (nearly) fully allocated is more important.

focusing

> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/file-posix.c | 81 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 80 insertions(+), 1 deletion(-)
> 

>  static int find_allocation(BlockDriverState *bs, off_t start,
>                             off_t *data, off_t *hole)
>  {
> -#if defined SEEK_HOLE && defined SEEK_DATA
>      BDRVRawState *s = bs->opaque;
> +
> +    if (s->next_zero_offset_valid) {
> +        if (start >= s->next_zero_offset_from && start < s->next_zero_offset) {
> +            *data = start;
> +            *hole = s->next_zero_offset;
> +            return 0;
> +        }
> +    }
> +
> +#if defined SEEK_HOLE && defined SEEK_DATA

Why move the #if? If SEEK_HOLE is not defined, s->next_zero_offset_valid
should never be set, because we'll treat the entire image as data.  But
at the same time, it doesn't hurt, so doesn't stop my review.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] file-posix: Cache next hole
  2021-02-11 17:22 ` [PATCH 2/2] file-posix: Cache next hole Max Reitz
  2021-02-11 20:00   ` Eric Blake
@ 2021-02-11 20:38   ` Vladimir Sementsov-Ogievskiy
  2021-02-12  9:14     ` Max Reitz
  1 sibling, 1 reply; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-11 20:38 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

11.02.2021 20:22, Max Reitz wrote:
> We have repeatedly received reports that SEEK_HOLE and SEEK_DATA are
> slow on certain filesystems and/or under certain circumstances.  That is
> why we generally try to avoid it (which is why bdrv_co_block_status()
> has the @want_zero parameter, and which is why qcow2 has a metadata
> preallocation detection, so we do not fall through to the protocol layer
> to discover which blocks are zero, unless that is really necessary
> (i.e., for metadata-preallocated images)).
> 
> In addition to those measures, we can also try to speed up zero
> detection by letting file-posix cache some hole location information,
> namely where the next hole after the most recently queried offset is.
> This helps especially for images that are (nearly) fully allocated,
> which is coincidentally also the case where querying for zero
> information cannot gain us much.
> 
> Note that this of course only works so long as we have no concurrent
> writers to the image, which is the case when the WRITE capability is not
> shared.
> 
> Alternatively (or perhaps as an improvement in the future), we could let
> file-posix keep track of what it knows is zero and what it knows is
> non-zero with bitmaps, which would help images that actually have a
> significant number of holes (where this implementation here cannot do
> much).  But for such images, SEEK_HOLE/DATA are generally faster (they
> do not need to seek through the whole file), and the performance lost by
> querying the block status does not feel as bad because it is outweighed
> by the performance that can be saved by special-cases zeroed areas, so
> focussing on images that are (nearly) fully allocated is more important.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

I'll look at it tomorrow... Just wanted to note that something similar was proposed by Kevin some time ago:

<20190124141731.21509-1-kwolf@redhat.com>
https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06271.html

(o_0 two years ago.. time passes so fast)

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block/mirror: Fix mirror_top's permissions
  2021-02-11 17:22 ` [PATCH 1/2] block/mirror: Fix mirror_top's permissions Max Reitz
  2021-02-11 19:33   ` Eric Blake
@ 2021-02-12  9:04   ` Vladimir Sementsov-Ogievskiy
  2021-02-12  9:23     ` Max Reitz
  1 sibling, 1 reply; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-12  9:04 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

11.02.2021 20:22, Max Reitz wrote:
> mirror_top currently shares all permissions, and takes only the WRITE
> permission (if some parent has taken that permission, too).
> 
> That is wrong, though; mirror_top is a filter, so it should take
> permissions like any other filter does.  For example, if the parent
> needs CONSISTENT_READ, we need to take that, too, and if it cannot share
> the WRITE permission, we cannot share it either.
> 
> The exception is when mirror_top is used for active commit, where we
> cannot take CONSISTENT_READ (because it is deliberately unshared above
> the base node) and where we must share WRITE (so that it is shared for
> all images in the backing chain, so the mirror job can take it for the
> target BB).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block/mirror.c | 32 +++++++++++++++++++++++++-------
>   1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 8e1ad6eceb..1edfc3cc14 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -89,6 +89,7 @@ typedef struct MirrorBlockJob {
>   typedef struct MirrorBDSOpaque {
>       MirrorBlockJob *job;
>       bool stop;
> +    bool is_commit;
>   } MirrorBDSOpaque;
>   
>   struct MirrorOp {
> @@ -1513,13 +1514,27 @@ static void bdrv_mirror_top_child_perm(BlockDriverState *bs, BdrvChild *c,
>           return;
>       }
>   
> -    /* Must be able to forward guest writes to the real image */
> -    *nperm = 0;
> -    if (perm & BLK_PERM_WRITE) {
> -        *nperm |= BLK_PERM_WRITE;
> -    }
> +    bdrv_default_perms(bs, c, role, reopen_queue,
> +                       perm, shared, nperm, nshared);
>   
> -    *nshared = BLK_PERM_ALL;
> +    if (s->is_commit) {
> +        /*
> +         * For commit jobs, we cannot take CONSISTENT_READ, because
> +         * that permission is unshared for everything above the base
> +         * node (except for filters on the base node).
> +         * We also have to force-share the WRITE permission, or
> +         * otherwise we would block ourselves at the base node (if
> +         * writes are blocked for a node, they are also blocked for
> +         * its backing file).
> +         * (We could also share RESIZE, because it may be needed for
> +         * the target if its size is less than the top node's; but
> +         * bdrv_default_perms_for_cow() automatically shares RESIZE
> +         * for backing nodes if WRITE is shared, so there is no need
> +         * to do it here.)
> +         */
> +        *nperm &= ~BLK_PERM_CONSISTENT_READ;

this works only because we don't assert READ permission on generic read path in block/io.c, like we do for WRITE..
but this is better than pre-patch anyway..

How block-jobs and filters works - definitely goes beyond our permissions architecture..

> +        *nshared |= BLK_PERM_WRITE;
> +    }
>   }
>   
>   /* Dummy node that provides consistent read to its users without requiring it
> @@ -1583,6 +1598,8 @@ static BlockJob *mirror_start_job(
>           return NULL;
>       }
>   
> +    target_is_backing = bdrv_chain_contains(bs, target);

may be initialized together with definition..

> +
>       /* In the case of active commit, add dummy driver to provide consistent
>        * reads on the top, while disabling it in the intermediate nodes, and make
>        * the backing chain writable. */
> @@ -1605,6 +1622,8 @@ static BlockJob *mirror_start_job(
>       bs_opaque = g_new0(MirrorBDSOpaque, 1);
>       mirror_top_bs->opaque = bs_opaque;
>   
> +    bs_opaque->is_commit = target_is_backing;
> +
>       /* bdrv_append takes ownership of the mirror_top_bs reference, need to keep
>        * it alive until block_job_create() succeeds even if bs has no parent. */
>       bdrv_ref(mirror_top_bs);
> @@ -1646,7 +1665,6 @@ static BlockJob *mirror_start_job(
>       target_perms = BLK_PERM_WRITE;
>       target_shared_perms = BLK_PERM_WRITE_UNCHANGED;
>   
> -    target_is_backing = bdrv_chain_contains(bs, target);
>       if (target_is_backing) {
>           int64_t bs_size, target_size;
>           bs_size = bdrv_getlength(bs);
> 

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] file-posix: Cache next hole
  2021-02-11 20:38   ` Vladimir Sementsov-Ogievskiy
@ 2021-02-12  9:14     ` Max Reitz
  2021-02-12 10:25       ` Kevin Wolf
  0 siblings, 1 reply; 14+ messages in thread
From: Max Reitz @ 2021-02-12  9:14 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 11.02.21 21:38, Vladimir Sementsov-Ogievskiy wrote:
> 11.02.2021 20:22, Max Reitz wrote:
>> We have repeatedly received reports that SEEK_HOLE and SEEK_DATA are
>> slow on certain filesystems and/or under certain circumstances.  That is
>> why we generally try to avoid it (which is why bdrv_co_block_status()
>> has the @want_zero parameter, and which is why qcow2 has a metadata
>> preallocation detection, so we do not fall through to the protocol layer
>> to discover which blocks are zero, unless that is really necessary
>> (i.e., for metadata-preallocated images)).
>>
>> In addition to those measures, we can also try to speed up zero
>> detection by letting file-posix cache some hole location information,
>> namely where the next hole after the most recently queried offset is.
>> This helps especially for images that are (nearly) fully allocated,
>> which is coincidentally also the case where querying for zero
>> information cannot gain us much.
>>
>> Note that this of course only works so long as we have no concurrent
>> writers to the image, which is the case when the WRITE capability is not
>> shared.
>>
>> Alternatively (or perhaps as an improvement in the future), we could let
>> file-posix keep track of what it knows is zero and what it knows is
>> non-zero with bitmaps, which would help images that actually have a
>> significant number of holes (where this implementation here cannot do
>> much).  But for such images, SEEK_HOLE/DATA are generally faster (they
>> do not need to seek through the whole file), and the performance lost by
>> querying the block status does not feel as bad because it is outweighed
>> by the performance that can be saved by special-cases zeroed areas, so
>> focussing on images that are (nearly) fully allocated is more important.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> 
> I'll look at it tomorrow... Just wanted to note that something similar 
> was proposed by Kevin some time ago:
> 
> <20190124141731.21509-1-kwolf@redhat.com>
> https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06271.html

Interesting.  The reasoning that it doesn’t matter whether anyone writes 
to the assumed-data regions makes sense.

I can’t see a real reason why it was kind of forgotten, apparently...

Max



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block/mirror: Fix mirror_top's permissions
  2021-02-12  9:04   ` Vladimir Sementsov-Ogievskiy
@ 2021-02-12  9:23     ` Max Reitz
  2021-02-12 10:48       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 14+ messages in thread
From: Max Reitz @ 2021-02-12  9:23 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 12.02.21 10:04, Vladimir Sementsov-Ogievskiy wrote:
> 11.02.2021 20:22, Max Reitz wrote:
>> mirror_top currently shares all permissions, and takes only the WRITE
>> permission (if some parent has taken that permission, too).
>>
>> That is wrong, though; mirror_top is a filter, so it should take
>> permissions like any other filter does.  For example, if the parent
>> needs CONSISTENT_READ, we need to take that, too, and if it cannot share
>> the WRITE permission, we cannot share it either.
>>
>> The exception is when mirror_top is used for active commit, where we
>> cannot take CONSISTENT_READ (because it is deliberately unshared above
>> the base node) and where we must share WRITE (so that it is shared for
>> all images in the backing chain, so the mirror job can take it for the
>> target BB).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/mirror.c | 32 +++++++++++++++++++++++++-------
>>   1 file changed, 25 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 8e1ad6eceb..1edfc3cc14 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -89,6 +89,7 @@ typedef struct MirrorBlockJob {
>>   typedef struct MirrorBDSOpaque {
>>       MirrorBlockJob *job;
>>       bool stop;
>> +    bool is_commit;
>>   } MirrorBDSOpaque;
>>   struct MirrorOp {
>> @@ -1513,13 +1514,27 @@ static void 
>> bdrv_mirror_top_child_perm(BlockDriverState *bs, BdrvChild *c,
>>           return;
>>       }
>> -    /* Must be able to forward guest writes to the real image */
>> -    *nperm = 0;
>> -    if (perm & BLK_PERM_WRITE) {
>> -        *nperm |= BLK_PERM_WRITE;
>> -    }
>> +    bdrv_default_perms(bs, c, role, reopen_queue,
>> +                       perm, shared, nperm, nshared);
>> -    *nshared = BLK_PERM_ALL;
>> +    if (s->is_commit) {
>> +        /*
>> +         * For commit jobs, we cannot take CONSISTENT_READ, because
>> +         * that permission is unshared for everything above the base
>> +         * node (except for filters on the base node).
>> +         * We also have to force-share the WRITE permission, or
>> +         * otherwise we would block ourselves at the base node (if
>> +         * writes are blocked for a node, they are also blocked for
>> +         * its backing file).
>> +         * (We could also share RESIZE, because it may be needed for
>> +         * the target if its size is less than the top node's; but
>> +         * bdrv_default_perms_for_cow() automatically shares RESIZE
>> +         * for backing nodes if WRITE is shared, so there is no need
>> +         * to do it here.)
>> +         */
>> +        *nperm &= ~BLK_PERM_CONSISTENT_READ;
> 
> this works only because we don't assert READ permission on generic read 
> path in block/io.c, like we do for WRITE..
> but this is better than pre-patch anyway..

Yes, because you can read regardless of CONSISTENT_READ, the question is 
just whether you’ll receive consistent data.  The caller needs to decide 
whether that’s the case.

Taking that permission kind of is deferring the question whether the 
data will be consistent to the block layer.

In case of commit, we unshare CONSISTENT_READ above the base, because 
the data between base and top will not be consistent.  Starting from 
top, we know it is, so we do not need to take the permission, because we 
do not need that guarantee from the block layer; the commit job can give 
that guarantee itself.

(I suppose we could add some ALLOW_INCONSISTENT flag to read requests, 
and the permission is checked when that flag is not present, but I don’t 
think we really need to.)

(Technically we have the problem that there could be something else 
between top and base that unshares CONSISTENT_READ, and we’ll never 
know, but addressing that would be complicated and it’s only a 
hypothetical problem, AFAIA.)

> How block-jobs and filters works - definitely goes beyond our 
> permissions architecture..

FWIW, AFAIR, the first job filter node was commit_top, whose purpose was 
precisely to allow unsharing CONSISTENT_READ on the base and then 
offering it again on the top.

>> +        *nshared |= BLK_PERM_WRITE;
>> +    }
>>   }
>>   /* Dummy node that provides consistent read to its users without 
>> requiring it
>> @@ -1583,6 +1598,8 @@ static BlockJob *mirror_start_job(
>>           return NULL;
>>       }
>> +    target_is_backing = bdrv_chain_contains(bs, target);
> 
> may be initialized together with definition..

Could be, but would that be better? :)

>> +
>>       /* In the case of active commit, add dummy driver to provide 
>> consistent
>>        * reads on the top, while disabling it in the intermediate 
>> nodes, and make
>>        * the backing chain writable. */
>> @@ -1605,6 +1622,8 @@ static BlockJob *mirror_start_job(
>>       bs_opaque = g_new0(MirrorBDSOpaque, 1);
>>       mirror_top_bs->opaque = bs_opaque;
>> +    bs_opaque->is_commit = target_is_backing;
>> +
>>       /* bdrv_append takes ownership of the mirror_top_bs reference, 
>> need to keep
>>        * it alive until block_job_create() succeeds even if bs has no 
>> parent. */
>>       bdrv_ref(mirror_top_bs);
>> @@ -1646,7 +1665,6 @@ static BlockJob *mirror_start_job(
>>       target_perms = BLK_PERM_WRITE;
>>       target_shared_perms = BLK_PERM_WRITE_UNCHANGED;
>> -    target_is_backing = bdrv_chain_contains(bs, target);
>>       if (target_is_backing) {
>>           int64_t bs_size, target_size;
>>           bs_size = bdrv_getlength(bs);
>>
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Thanks!

Max



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] file-posix: Cache next hole
  2021-02-11 20:00   ` Eric Blake
@ 2021-02-12  9:25     ` Max Reitz
  0 siblings, 0 replies; 14+ messages in thread
From: Max Reitz @ 2021-02-12  9:25 UTC (permalink / raw)
  To: Eric Blake, qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel

On 11.02.21 21:00, Eric Blake wrote:
> On 2/11/21 11:22 AM, Max Reitz wrote:
>> We have repeatedly received reports that SEEK_HOLE and SEEK_DATA are
>> slow on certain filesystems and/or under certain circumstances.  That is
>> why we generally try to avoid it (which is why bdrv_co_block_status()
>> has the @want_zero parameter, and which is why qcow2 has a metadata
>> preallocation detection, so we do not fall through to the protocol layer
>> to discover which blocks are zero, unless that is really necessary
>> (i.e., for metadata-preallocated images)).
>>
>> In addition to those measures, we can also try to speed up zero
>> detection by letting file-posix cache some hole location information,
>> namely where the next hole after the most recently queried offset is.
>> This helps especially for images that are (nearly) fully allocated,
>> which is coincidentally also the case where querying for zero
>> information cannot gain us much.
>>
>> Note that this of course only works so long as we have no concurrent
>> writers to the image, which is the case when the WRITE capability is not
>> shared.
>>
>> Alternatively (or perhaps as an improvement in the future), we could let
>> file-posix keep track of what it knows is zero and what it knows is
>> non-zero with bitmaps, which would help images that actually have a
>> significant number of holes (where this implementation here cannot do
>> much).  But for such images, SEEK_HOLE/DATA are generally faster (they
>> do not need to seek through the whole file), and the performance lost by
>> querying the block status does not feel as bad because it is outweighed
>> by the performance that can be saved by special-cases zeroed areas, so
>> focussing on images that are (nearly) fully allocated is more important.
> 
> focusing

Wiktionary lists both as valid. *shrug*

>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/file-posix.c | 81 +++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 80 insertions(+), 1 deletion(-)
>>
> 
>>   static int find_allocation(BlockDriverState *bs, off_t start,
>>                              off_t *data, off_t *hole)
>>   {
>> -#if defined SEEK_HOLE && defined SEEK_DATA
>>       BDRVRawState *s = bs->opaque;
>> +
>> +    if (s->next_zero_offset_valid) {
>> +        if (start >= s->next_zero_offset_from && start < s->next_zero_offset) {
>> +            *data = start;
>> +            *hole = s->next_zero_offset;
>> +            return 0;
>> +        }
>> +    }
>> +
>> +#if defined SEEK_HOLE && defined SEEK_DATA
> 
> Why move the #if? If SEEK_HOLE is not defined, s->next_zero_offset_valid
> should never be set, because we'll treat the entire image as data.  But
> at the same time, it doesn't hurt, so doesn't stop my review.

I found it better to put it outside the SEEK_HOLE/DATA section, because 
while it won’t work when neither are defined, the code is still valid.

I don’t know. :)

> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!

Max



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] file-posix: Cache next hole
  2021-02-12  9:14     ` Max Reitz
@ 2021-02-12 10:25       ` Kevin Wolf
  2021-02-12 12:11         ` Max Reitz
  0 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2021-02-12 10:25 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

Am 12.02.2021 um 10:14 hat Max Reitz geschrieben:
> On 11.02.21 21:38, Vladimir Sementsov-Ogievskiy wrote:
> > 11.02.2021 20:22, Max Reitz wrote:
> > > We have repeatedly received reports that SEEK_HOLE and SEEK_DATA are
> > > slow on certain filesystems and/or under certain circumstances.  That is
> > > why we generally try to avoid it (which is why bdrv_co_block_status()
> > > has the @want_zero parameter, and which is why qcow2 has a metadata
> > > preallocation detection, so we do not fall through to the protocol layer
> > > to discover which blocks are zero, unless that is really necessary
> > > (i.e., for metadata-preallocated images)).
> > > 
> > > In addition to those measures, we can also try to speed up zero
> > > detection by letting file-posix cache some hole location information,
> > > namely where the next hole after the most recently queried offset is.
> > > This helps especially for images that are (nearly) fully allocated,
> > > which is coincidentally also the case where querying for zero
> > > information cannot gain us much.
> > > 
> > > Note that this of course only works so long as we have no concurrent
> > > writers to the image, which is the case when the WRITE capability is not
> > > shared.
> > > 
> > > Alternatively (or perhaps as an improvement in the future), we could let
> > > file-posix keep track of what it knows is zero and what it knows is
> > > non-zero with bitmaps, which would help images that actually have a
> > > significant number of holes (where this implementation here cannot do
> > > much).  But for such images, SEEK_HOLE/DATA are generally faster (they
> > > do not need to seek through the whole file), and the performance lost by
> > > querying the block status does not feel as bad because it is outweighed
> > > by the performance that can be saved by special-cases zeroed areas, so
> > > focussing on images that are (nearly) fully allocated is more important.
> > > 
> > > Signed-off-by: Max Reitz <mreitz@redhat.com>
> > 
> > I'll look at it tomorrow... Just wanted to note that something similar
> > was proposed by Kevin some time ago:
> > 
> > <20190124141731.21509-1-kwolf@redhat.com>
> > https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06271.html
> 
> Interesting.  The reasoning that it doesn’t matter whether anyone
> writes to the assumed-data regions makes sense.
> 
> I can’t see a real reason why it was kind of forgotten, apparently...

After qcow2 stopped recursively querying the file-posix layer, the
relevant case under discussion was fixed anyway, so it didn't have the
highest priority any more...

I think the open question (and possibly work) in the old thread was
whether this should be moved out of file-posix into the generic block
layer.

With your patch, I guess the other open question is whether we want to
try and cache holes anyway. I assume that in the common case, you may
have many consecutive data extents, but probably rarely many holes (I
guess you can have more than one if some areas are unallocated and
others are allocated, but unwritten?) Then it's probably not worth
caching holes.

Kevin



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block/mirror: Fix mirror_top's permissions
  2021-02-12  9:23     ` Max Reitz
@ 2021-02-12 10:48       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-12 10:48 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

12.02.2021 12:23, Max Reitz wrote:
> On 12.02.21 10:04, Vladimir Sementsov-Ogievskiy wrote:
>> 11.02.2021 20:22, Max Reitz wrote:
>>> mirror_top currently shares all permissions, and takes only the WRITE
>>> permission (if some parent has taken that permission, too).
>>>
>>> That is wrong, though; mirror_top is a filter, so it should take
>>> permissions like any other filter does.  For example, if the parent
>>> needs CONSISTENT_READ, we need to take that, too, and if it cannot share
>>> the WRITE permission, we cannot share it either.
>>>
>>> The exception is when mirror_top is used for active commit, where we
>>> cannot take CONSISTENT_READ (because it is deliberately unshared above
>>> the base node) and where we must share WRITE (so that it is shared for
>>> all images in the backing chain, so the mirror job can take it for the
>>> target BB).
>>>
>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>   block/mirror.c | 32 +++++++++++++++++++++++++-------
>>>   1 file changed, 25 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/block/mirror.c b/block/mirror.c
>>> index 8e1ad6eceb..1edfc3cc14 100644
>>> --- a/block/mirror.c
>>> +++ b/block/mirror.c
>>> @@ -89,6 +89,7 @@ typedef struct MirrorBlockJob {
>>>   typedef struct MirrorBDSOpaque {
>>>       MirrorBlockJob *job;
>>>       bool stop;
>>> +    bool is_commit;
>>>   } MirrorBDSOpaque;
>>>   struct MirrorOp {
>>> @@ -1513,13 +1514,27 @@ static void bdrv_mirror_top_child_perm(BlockDriverState *bs, BdrvChild *c,
>>>           return;
>>>       }
>>> -    /* Must be able to forward guest writes to the real image */
>>> -    *nperm = 0;
>>> -    if (perm & BLK_PERM_WRITE) {
>>> -        *nperm |= BLK_PERM_WRITE;
>>> -    }
>>> +    bdrv_default_perms(bs, c, role, reopen_queue,
>>> +                       perm, shared, nperm, nshared);
>>> -    *nshared = BLK_PERM_ALL;
>>> +    if (s->is_commit) {
>>> +        /*
>>> +         * For commit jobs, we cannot take CONSISTENT_READ, because
>>> +         * that permission is unshared for everything above the base
>>> +         * node (except for filters on the base node).
>>> +         * We also have to force-share the WRITE permission, or
>>> +         * otherwise we would block ourselves at the base node (if
>>> +         * writes are blocked for a node, they are also blocked for
>>> +         * its backing file).
>>> +         * (We could also share RESIZE, because it may be needed for
>>> +         * the target if its size is less than the top node's; but
>>> +         * bdrv_default_perms_for_cow() automatically shares RESIZE
>>> +         * for backing nodes if WRITE is shared, so there is no need
>>> +         * to do it here.)
>>> +         */
>>> +        *nperm &= ~BLK_PERM_CONSISTENT_READ;
>>
>> this works only because we don't assert READ permission on generic read path in block/io.c, like we do for WRITE..
>> but this is better than pre-patch anyway..
> 
> Yes, because you can read regardless of CONSISTENT_READ, the question is just whether you’ll receive consistent data.  The caller needs to decide whether that’s the case.
> 
> Taking that permission kind of is deferring the question whether the data will be consistent to the block layer.
> 
> In case of commit, we unshare CONSISTENT_READ above the base, because the data between base and top will not be consistent.  Starting from top, we know it is, so we do not need to take the permission, because we do not need that guarantee from the block layer; the commit job can give that guarantee itself.
> 
> (I suppose we could add some ALLOW_INCONSISTENT flag to read requests, and the permission is checked when that flag is not present, but I don’t think we really need to.)
> 
> (Technically we have the problem that there could be something else between top and base that unshares CONSISTENT_READ, and we’ll never know, but addressing that would be complicated and it’s only a hypothetical problem, AFAIA.)
> 
>> How block-jobs and filters works - definitely goes beyond our permissions architecture..
> 
> FWIW, AFAIR, the first job filter node was commit_top, whose purpose was precisely to allow unsharing CONSISTENT_READ on the base and then offering it again on the top.

Hmm reasonable. But sharing writes though backing chain is not that safe

> 
>>> +        *nshared |= BLK_PERM_WRITE;
>>> +    }
>>>   }
>>>   /* Dummy node that provides consistent read to its users without requiring it
>>> @@ -1583,6 +1598,8 @@ static BlockJob *mirror_start_job(
>>>           return NULL;
>>>       }
>>> +    target_is_backing = bdrv_chain_contains(bs, target);
>>
>> may be initialized together with definition..
> 
> Could be, but would that be better? :)
> 
>>> +
>>>       /* In the case of active commit, add dummy driver to provide consistent
>>>        * reads on the top, while disabling it in the intermediate nodes, and make
>>>        * the backing chain writable. */
>>> @@ -1605,6 +1622,8 @@ static BlockJob *mirror_start_job(
>>>       bs_opaque = g_new0(MirrorBDSOpaque, 1);
>>>       mirror_top_bs->opaque = bs_opaque;
>>> +    bs_opaque->is_commit = target_is_backing;
>>> +
>>>       /* bdrv_append takes ownership of the mirror_top_bs reference, need to keep
>>>        * it alive until block_job_create() succeeds even if bs has no parent. */
>>>       bdrv_ref(mirror_top_bs);
>>> @@ -1646,7 +1665,6 @@ static BlockJob *mirror_start_job(
>>>       target_perms = BLK_PERM_WRITE;
>>>       target_shared_perms = BLK_PERM_WRITE_UNCHANGED;
>>> -    target_is_backing = bdrv_chain_contains(bs, target);
>>>       if (target_is_backing) {
>>>           int64_t bs_size, target_size;
>>>           bs_size = bdrv_getlength(bs);
>>>
>>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> Thanks!
> 
> Max
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] file-posix: Cache next hole
  2021-02-12 10:25       ` Kevin Wolf
@ 2021-02-12 12:11         ` Max Reitz
  0 siblings, 0 replies; 14+ messages in thread
From: Max Reitz @ 2021-02-12 12:11 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 12.02.21 11:25, Kevin Wolf wrote:
> Am 12.02.2021 um 10:14 hat Max Reitz geschrieben:
>> On 11.02.21 21:38, Vladimir Sementsov-Ogievskiy wrote:
>>> 11.02.2021 20:22, Max Reitz wrote:
>>>> We have repeatedly received reports that SEEK_HOLE and SEEK_DATA are
>>>> slow on certain filesystems and/or under certain circumstances.  That is
>>>> why we generally try to avoid it (which is why bdrv_co_block_status()
>>>> has the @want_zero parameter, and which is why qcow2 has a metadata
>>>> preallocation detection, so we do not fall through to the protocol layer
>>>> to discover which blocks are zero, unless that is really necessary
>>>> (i.e., for metadata-preallocated images)).
>>>>
>>>> In addition to those measures, we can also try to speed up zero
>>>> detection by letting file-posix cache some hole location information,
>>>> namely where the next hole after the most recently queried offset is.
>>>> This helps especially for images that are (nearly) fully allocated,
>>>> which is coincidentally also the case where querying for zero
>>>> information cannot gain us much.
>>>>
>>>> Note that this of course only works so long as we have no concurrent
>>>> writers to the image, which is the case when the WRITE capability is not
>>>> shared.
>>>>
>>>> Alternatively (or perhaps as an improvement in the future), we could let
>>>> file-posix keep track of what it knows is zero and what it knows is
>>>> non-zero with bitmaps, which would help images that actually have a
>>>> significant number of holes (where this implementation here cannot do
>>>> much).  But for such images, SEEK_HOLE/DATA are generally faster (they
>>>> do not need to seek through the whole file), and the performance lost by
>>>> querying the block status does not feel as bad because it is outweighed
>>>> by the performance that can be saved by special-cases zeroed areas, so
>>>> focussing on images that are (nearly) fully allocated is more important.
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>
>>> I'll look at it tomorrow... Just wanted to note that something similar
>>> was proposed by Kevin some time ago:
>>>
>>> <20190124141731.21509-1-kwolf@redhat.com>
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06271.html
>>
>> Interesting.  The reasoning that it doesn’t matter whether anyone
>> writes to the assumed-data regions makes sense.
>>
>> I can’t see a real reason why it was kind of forgotten, apparently...
> 
> After qcow2 stopped recursively querying the file-posix layer, the
> relevant case under discussion was fixed anyway, so it didn't have the
> highest priority any more...
> 
> I think the open question (and possibly work) in the old thread was
> whether this should be moved out of file-posix into the generic block
> layer.
> 
> With your patch, I guess the other open question is whether we want to
> try and cache holes anyway. I assume that in the common case, you may
> have many consecutive data extents, but probably rarely many holes (I
> guess you can have more than one if some areas are unallocated and
> others are allocated, but unwritten?) Then it's probably not worth
> caching holes.

Yes, that’s what I think.  If you have many small holes, caching won’t 
help much anyway (at least this way, where we’d only cache information 
about a single hole).  So caching holes would mainly help when you have 
several large holes.  But in that case, finding them will surely speed 
up whatever it is the block-status caller is doing (e.g. mirroring), so 
a slow SEEK_HOLE/DATA will be tolerable.

Max



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] file-posix: Cache next hole
  2021-02-11 17:22 [PATCH 0/2] file-posix: Cache next hole Max Reitz
  2021-02-11 17:22 ` [PATCH 1/2] block/mirror: Fix mirror_top's permissions Max Reitz
  2021-02-11 17:22 ` [PATCH 2/2] file-posix: Cache next hole Max Reitz
@ 2021-03-29 16:21 ` Max Reitz
  2 siblings, 0 replies; 14+ messages in thread
From: Max Reitz @ 2021-03-29 16:21 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel

On 11.02.21 18:22, Max Reitz wrote:
> Hi,

[...]

> (Speaking of “unless the WRITE permission is shared”: mirror_top is a
> bit broken in that it takes no permissions (but WRITE if necessary) and
> shares everything.  That seems wrong.  Patch 1 addresses that, so that
> patch 2 can actually do something when mirroring an image.)

I plan to send a v2 of patch 2 at some point, but for now I’ve applied 
patch 1 to my block branch:

https://git.xanclic.moe/XanClic/qemu/commits/branch/block

https://bugzilla.redhat.com/show_bug.cgi?id=1940118 reports an abort, 
which I think can be avoided with patch 1 of this series: The mirror job 
lifts all permissions on the source node, so you can freely take locks 
with some other process, and then mirror_exit_common() fails when it 
tries to take those permissions back (at least when cancelling the job).

I plan to send an iotest for this, but getting this into rc1 is more 
important than waiting for the test, I think.

Max



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-03-29 16:23 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-11 17:22 [PATCH 0/2] file-posix: Cache next hole Max Reitz
2021-02-11 17:22 ` [PATCH 1/2] block/mirror: Fix mirror_top's permissions Max Reitz
2021-02-11 19:33   ` Eric Blake
2021-02-12  9:04   ` Vladimir Sementsov-Ogievskiy
2021-02-12  9:23     ` Max Reitz
2021-02-12 10:48       ` Vladimir Sementsov-Ogievskiy
2021-02-11 17:22 ` [PATCH 2/2] file-posix: Cache next hole Max Reitz
2021-02-11 20:00   ` Eric Blake
2021-02-12  9:25     ` Max Reitz
2021-02-11 20:38   ` Vladimir Sementsov-Ogievskiy
2021-02-12  9:14     ` Max Reitz
2021-02-12 10:25       ` Kevin Wolf
2021-02-12 12:11         ` Max Reitz
2021-03-29 16:21 ` [PATCH 0/2] " Max Reitz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).