All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] block: block-status cache for data regions
@ 2021-06-17 15:52 Max Reitz
  2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz

Hi,

We’ve already had two attempts at introducing a block-status cache for
data regions
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), both of which were local to file-posix.

I have a gitlab issue assigned to me
(https://gitlab.com/qemu-project/qemu/-/issues/307), so I suppose it’s
time for round three.

This series tries to address what I gathered from the reviews for both:
(1) We should try to have this cache in the general block layer for all
protocol drivers; (2) Just to be sure, we should have a mutex for it;
(3) External writers don’t matter if we just cache data regions, because
(for a protocol node) reporting a zero region as data isn’t fatal.

Patch 1 is clean-up, patch 2 is the main one, patches 3 to 6 make it
more useful: Some protocol drivers force-clamp the returned *pnum based
on the @bytes parameter, but that’s completely unnecessary, because
bdrv_co_block_status() will clamp the value, too.  It’s beneficial to
return as large a *pnum value as possible to bdrv_co_block_status(), so
our cache can be more effective.


Max Reitz (6):
  block: Drop BDS comment regarding bdrv_append()
  block: block-status cache for data regions
  block/file-posix: Do not force-cap *pnum
  block/gluster: Do not force-cap *pnum
  block/nbd: Do not force-cap *pnum
  block/iscsi: Do not force-cap *pnum

 include/block/block_int.h | 21 ++++++++--
 block.c                   |  2 +
 block/file-posix.c        |  7 ++--
 block/gluster.c           |  7 ++--
 block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
 block/iscsi.c             |  3 --
 block/nbd.c               |  2 +-
 7 files changed, 105 insertions(+), 17 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/6] block: Drop BDS comment regarding bdrv_append()
  2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz
@ 2021-06-17 15:52 ` Max Reitz
  2021-06-18 17:42   ` Eric Blake
  2021-06-19  9:38   ` Vladimir Sementsov-Ogievskiy
  2021-06-17 15:52 ` [PATCH 2/6] block: block-status cache for data regions Max Reitz
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz

There is a comment above the BDS definition stating care must be taken
to consider handling newly added fields in bdrv_append().

Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac
(nine years ago), and in any case, bdrv_swap() was dropped in
8e419aefa (six years ago).  So no such care is necessary anymore.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 057d88b1fc..a8f9598102 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,12 +832,6 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
-/*
- * Note: the function bdrv_append() copies and swaps contents of
- * BlockDriverStates, so if you add new fields to this struct, please
- * inspect bdrv_append() to determine if the new fields need to be
- * copied as well.
- */
 struct BlockDriverState {
     /* Protected by big QEMU lock or read-only after opening.  No special
      * locking needed during I/O...
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/6] block: block-status cache for data regions
  2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz
  2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz
@ 2021-06-17 15:52 ` Max Reitz
  2021-06-18 18:51   ` Eric Blake
  2021-06-19 10:20   ` Vladimir Sementsov-Ogievskiy
  2021-06-17 15:52 ` [PATCH 3/6] block/file-posix: Do not force-cap *pnum Max Reitz
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h | 19 ++++++++++
 block.c                   |  2 +
 block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..c09512556a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,23 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @lock: Lock for accessing this object's fields
+ * @valid: Whether the cache is valid
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *            the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+    CoMutex lock;
+    bool valid;
+    int64_t data_start;
+    int64_t data_end;
+} BdrvBlockStatusCache;
+
 struct BlockDriverState {
     /* Protected by big QEMU lock or read-only after opening.  No special
      * locking needed during I/O...
@@ -997,6 +1014,8 @@ struct BlockDriverState {
 
     /* BdrvChild links to this node may never be frozen */
     bool never_freeze;
+
+    BdrvBlockStatusCache block_status_cache;
 };
 
 struct BlockBackendRootState {
diff --git a/block.c b/block.c
index 3f456892d0..bad64d5c4f 100644
--- a/block.c
+++ b/block.c
@@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void)
 
     qemu_co_queue_init(&bs->flush_queue);
 
+    qemu_co_mutex_init(&bs->block_status_cache.lock);
+
     for (i = 0; i < bdrv_drain_all_count; i++) {
         bdrv_drained_begin(bs);
     }
diff --git a/block/io.c b/block/io.c
index 323854d063..320638cc48 100644
--- a/block/io.c
+++ b/block/io.c
@@ -35,6 +35,7 @@
 #include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "qemu/range.h"
 #include "sysemu/replay.h"
 
 /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
@@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     bool need_flush = false;
     int head = 0;
     int tail = 0;
+    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
 
     int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
     int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
@@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
         return -ENOTSUP;
     }
 
+    /* Invalidate the cached block-status data range if this write overlaps */
+    qemu_co_mutex_lock(&bsc->lock);
+    if (bsc->valid &&
+        ranges_overlap(offset, bytes, bsc->data_start,
+                       bsc->data_end - bsc->data_start))
+    {
+        bsc->valid = false;
+    }
+    qemu_co_mutex_unlock(&bsc->lock);
+
     assert(alignment % bs->bl.request_alignment == 0);
     head = offset % alignment;
     tail = (offset + bytes) % alignment;
@@ -2394,6 +2406,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     int64_t aligned_offset, aligned_bytes;
     uint32_t align;
     bool has_filtered_child;
+    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
 
     assert(pnum);
     *pnum = 0;
@@ -2442,9 +2455,59 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
 
     if (bs->drv->bdrv_co_block_status) {
-        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
-                                            aligned_bytes, pnum, &local_map,
-                                            &local_file);
+        bool from_cache = false;
+
+        /*
+         * Use the block-status cache only for protocol nodes: Format
+         * drivers are generally quick to inquire the status, but protocol
+         * drivers often need to get information from outside of qemu, so
+         * we do not have control over the actual implementation.  There
+         * have been cases where inquiring the status took an unreasonably
+         * long time, and we can do nothing in qemu to fix it.
+         * This is especially problematic for images with large data areas,
+         * because finding the few holes in them and giving them special
+         * treatment does not gain much performance.  Therefore, we try to
+         * cache the last-identified data region.
+         *
+         * Second, limiting ourselves to protocol nodes allows us to assume
+         * the block status for data regions to be DATA | OFFSET_VALID, and
+         * that the host offset is the same as the guest offset.
+         *
+         * Note that it is possible that external writers zero parts of
+         * the cached regions without the cache being invalidated, and so
+         * we may report zeroes as data.  This is not catastrophic,
+         * however, because reporting zeroes as data is fine.
+         */
+        if (QLIST_EMPTY(&bs->children)) {
+            qemu_co_mutex_lock(&bsc->lock);
+            if (bsc->valid &&
+                aligned_offset >= bsc->data_start &&
+                aligned_offset < bsc->data_end)
+            {
+                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+                local_file = bs;
+                local_map = aligned_offset;
+
+                *pnum = bsc->data_end - aligned_offset;
+
+                from_cache = true;
+            }
+            qemu_co_mutex_unlock(&bsc->lock);
+        }
+
+        if (!from_cache) {
+            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+                                                aligned_bytes, pnum, &local_map,
+                                                &local_file);
+
+            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) {
+                qemu_co_mutex_lock(&bsc->lock);
+                bsc->data_start = aligned_offset;
+                bsc->data_end = aligned_offset + *pnum;
+                bsc->valid = true;
+                qemu_co_mutex_unlock(&bsc->lock);
+            }
+        }
     } else {
         /* Default code for filters */
 
@@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
     int max_pdiscard, ret;
     int head, tail, align;
     BlockDriverState *bs = child->bs;
+    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
 
     if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
         return -ENOMEDIUM;
@@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
         return 0;
     }
 
+    /* Invalidate the cached block-status data range if this discard overlaps */
+    qemu_co_mutex_lock(&bsc->lock);
+    if (bsc->valid &&
+        ranges_overlap(offset, bytes, bsc->data_start,
+                       bsc->data_end - bsc->data_start))
+    {
+        bsc->valid = false;
+    }
+    qemu_co_mutex_unlock(&bsc->lock);
+
     /* Discard is advisory, but some devices track and coalesce
      * unaligned requests, so we must pass everything down rather than
      * round here.  Still, most devices will just silently ignore
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/6] block/file-posix: Do not force-cap *pnum
  2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz
  2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz
  2021-06-17 15:52 ` [PATCH 2/6] block: block-status cache for data regions Max Reitz
@ 2021-06-17 15:52 ` Max Reitz
  2021-06-18 20:16   ` Eric Blake
  2021-06-19 10:32   ` Vladimir Sementsov-Ogievskiy
  2021-06-17 15:52 ` [PATCH 4/6] block/gluster: " Max Reitz
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/file-posix.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index b3fbb9bd63..aeb370d5bb 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2689,7 +2689,8 @@ static int find_allocation(BlockDriverState *bs, off_t start,
  * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
  */
 static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
                                             bool want_zero,
@@ -2727,7 +2728,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     } else if (data == offset) {
         /* On a data extent, compute bytes to the end of the extent,
          * possibly including a partial sector at EOF. */
-        *pnum = MIN(bytes, hole - offset);
+        *pnum = hole - offset;
 
         /*
          * We are not allowed to return partial sectors, though, so
@@ -2746,7 +2747,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
-        *pnum = MIN(bytes, data - offset);
+        *pnum = data - offset;
         ret = BDRV_BLOCK_ZERO;
     }
     *map = offset;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/6] block/gluster: Do not force-cap *pnum
  2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz
                   ` (2 preceding siblings ...)
  2021-06-17 15:52 ` [PATCH 3/6] block/file-posix: Do not force-cap *pnum Max Reitz
@ 2021-06-17 15:52 ` Max Reitz
  2021-06-18 20:17   ` Eric Blake
  2021-06-19 10:36   ` Vladimir Sementsov-Ogievskiy
  2021-06-17 15:52 ` [PATCH 5/6] block/nbd: " Max Reitz
  2021-06-17 15:52 ` [PATCH 6/6] block/iscsi: " Max Reitz
  5 siblings, 2 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/gluster.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index e8ee14c8e9..8ef7bb18d5 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1461,7 +1461,8 @@ exit:
  * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
  *
  * (Based on raw_co_block_status() from file-posix.c.)
  */
@@ -1500,12 +1501,12 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs,
     } else if (data == offset) {
         /* On a data extent, compute bytes to the end of the extent,
          * possibly including a partial sector at EOF. */
-        *pnum = MIN(bytes, hole - offset);
+        *pnum = hole - offset;
         ret = BDRV_BLOCK_DATA;
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
-        *pnum = MIN(bytes, data - offset);
+        *pnum = data - offset;
         ret = BDRV_BLOCK_ZERO;
     }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz
                   ` (3 preceding siblings ...)
  2021-06-17 15:52 ` [PATCH 4/6] block/gluster: " Max Reitz
@ 2021-06-17 15:52 ` Max Reitz
  2021-06-18 20:20   ` Eric Blake
  2021-06-19 10:53   ` Vladimir Sementsov-Ogievskiy
  2021-06-17 15:52 ` [PATCH 6/6] block/iscsi: " Max Reitz
  5 siblings, 2 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/nbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nbd.c b/block/nbd.c
index 616f9ae6c4..930bd234de 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
         .type = NBD_CMD_BLOCK_STATUS,
         .from = offset,
         .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
-                   MIN(bytes, s->info.size - offset)),
+                   s->info.size - offset),
         .flags = NBD_CMD_FLAG_REQ_ONE,
     };
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 6/6] block/iscsi: Do not force-cap *pnum
  2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz
                   ` (4 preceding siblings ...)
  2021-06-17 15:52 ` [PATCH 5/6] block/nbd: " Max Reitz
@ 2021-06-17 15:52 ` Max Reitz
  2021-06-18 20:20   ` Eric Blake
  2021-06-19 11:13   ` Vladimir Sementsov-Ogievskiy
  5 siblings, 2 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/iscsi.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 4d2a416ce7..852384086b 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -781,9 +781,6 @@ retry:
         iscsi_allocmap_set_allocated(iscsilun, offset, *pnum);
     }
 
-    if (*pnum > bytes) {
-        *pnum = bytes;
-    }
 out_unlock:
     qemu_mutex_unlock(&iscsilun->mutex);
     g_free(iTask.err_str);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] block: Drop BDS comment regarding bdrv_append()
  2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz
@ 2021-06-18 17:42   ` Eric Blake
  2021-06-19  9:38   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Blake @ 2021-06-18 17:42 UTC (permalink / raw)
  To: Max Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block

On Thu, Jun 17, 2021 at 05:52:42PM +0200, Max Reitz wrote:
> There is a comment above the BDS definition stating care must be taken
> to consider handling newly added fields in bdrv_append().
> 
> Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac
> (nine years ago), and in any case, bdrv_swap() was dropped in
> 8e419aefa (six years ago).  So no such care is necessary anymore.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  include/block/block_int.h | 6 ------
>  1 file changed, 6 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] block: block-status cache for data regions
  2021-06-17 15:52 ` [PATCH 2/6] block: block-status cache for data regions Max Reitz
@ 2021-06-18 18:51   ` Eric Blake
  2021-06-21  9:37     ` Max Reitz
  2021-06-19 10:20   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Blake @ 2021-06-18 18:51 UTC (permalink / raw)
  To: Max Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block

On Thu, Jun 17, 2021 at 05:52:43PM +0200, Max Reitz wrote:
> 
> To address this, we want to cache data regions.  Most of the time, when
> bad performance is reported, it is in places where the image is iterated
> over from start to end (qemu-img convert or the mirror job), so a simple
> yet effective solution is to cache only the current data region.

Here's hoping third time's the charm!

> 
> (Note that only caching data regions but not zero regions means that
> returning false information from the cache is not catastrophic: Treating
> zeroes as data is fine.  While we try to invalidate the cache on zero
> writes and discards, such incongruences may still occur when there are
> other processes writing to the image.)
> 
> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  include/block/block_int.h | 19 ++++++++++
>  block.c                   |  2 +
>  block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
>  3 files changed, 98 insertions(+), 3 deletions(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index a8f9598102..c09512556a 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -832,6 +832,23 @@ struct BdrvChild {
>      QLIST_ENTRY(BdrvChild) next_parent;
>  };
>  
> +/*
> + * Allows bdrv_co_block_status() to cache one data region for a
> + * protocol node.
> + *
> + * @lock: Lock for accessing this object's fields
> + * @valid: Whether the cache is valid
> + * @data_start: Offset where we know (or strongly assume) is data
> + * @data_end: Offset where the data region ends (which is not necessarily
> + *            the start of a zeroed region)
> + */
> +typedef struct BdrvBlockStatusCache {
> +    CoMutex lock;
> +    bool valid;
> +    int64_t data_start;
> +    int64_t data_end;
> +} BdrvBlockStatusCache;

Looks like the right bits of information, and I'm glad you documented
the need to be prepared for protocols that report split data sections
rather than consolidated.

> +++ b/block/io.c
> @@ -35,6 +35,7 @@
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
> +#include "qemu/range.h"
>  #include "sysemu/replay.h"
>  
>  /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
> @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>      bool need_flush = false;
>      int head = 0;
>      int tail = 0;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>  
>      int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>      int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
> @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>          return -ENOTSUP;
>      }
>  
> +    /* Invalidate the cached block-status data range if this write overlaps */
> +    qemu_co_mutex_lock(&bsc->lock);

Are we going to be suffering from a lot of lock contention performance
degradation?  Is there a way to take advantage of RCU access patterns
for any more performance without sacrificing correctness?

> +    if (bsc->valid &&
> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start))
> +    {
> +        bsc->valid = false;
> +    }

Do we want to invalidate the entire bsc, or can we be smart and leave
the prefix intact (if offset > bsc->data_start, then set bsc->data_end
to offset)?

> +    qemu_co_mutex_unlock(&bsc->lock);

Worth using WITH_QEMU_LOCK_GUARD?

> +
>      assert(alignment % bs->bl.request_alignment == 0);
>      head = offset % alignment;
>      tail = (offset + bytes) % alignment;
> @@ -2394,6 +2406,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>      int64_t aligned_offset, aligned_bytes;
>      uint32_t align;
>      bool has_filtered_child;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>  
>      assert(pnum);
>      *pnum = 0;
> @@ -2442,9 +2455,59 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>      aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>  
>      if (bs->drv->bdrv_co_block_status) {
> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> -                                            aligned_bytes, pnum, &local_map,
> -                                            &local_file);
> +        bool from_cache = false;
> +
> +        /*
> +         * Use the block-status cache only for protocol nodes: Format
> +         * drivers are generally quick to inquire the status, but protocol
> +         * drivers often need to get information from outside of qemu, so
> +         * we do not have control over the actual implementation.  There
> +         * have been cases where inquiring the status took an unreasonably
> +         * long time, and we can do nothing in qemu to fix it.
> +         * This is especially problematic for images with large data areas,
> +         * because finding the few holes in them and giving them special
> +         * treatment does not gain much performance.  Therefore, we try to
> +         * cache the last-identified data region.
> +         *
> +         * Second, limiting ourselves to protocol nodes allows us to assume
> +         * the block status for data regions to be DATA | OFFSET_VALID, and
> +         * that the host offset is the same as the guest offset.
> +         *
> +         * Note that it is possible that external writers zero parts of
> +         * the cached regions without the cache being invalidated, and so
> +         * we may report zeroes as data.  This is not catastrophic,
> +         * however, because reporting zeroes as data is fine.
> +         */

Useful comment.

> +        if (QLIST_EMPTY(&bs->children)) {
> +            qemu_co_mutex_lock(&bsc->lock);
> +            if (bsc->valid &&
> +                aligned_offset >= bsc->data_start &&
> +                aligned_offset < bsc->data_end)
> +            {
> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +                local_file = bs;
> +                local_map = aligned_offset;
> +
> +                *pnum = bsc->data_end - aligned_offset;
> +
> +                from_cache = true;
> +            }
> +            qemu_co_mutex_unlock(&bsc->lock);
> +        }
> +
> +        if (!from_cache) {
> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> +                                                aligned_bytes, pnum, &local_map,
> +                                                &local_file);
> +
> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) {
> +                qemu_co_mutex_lock(&bsc->lock);
> +                bsc->data_start = aligned_offset;
> +                bsc->data_end = aligned_offset + *pnum;
> +                bsc->valid = true;
> +                qemu_co_mutex_unlock(&bsc->lock);
> +            }
> +        }

Looks correct.

>      } else {
>          /* Default code for filters */
>  
> @@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>      int max_pdiscard, ret;
>      int head, tail, align;
>      BlockDriverState *bs = child->bs;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>  
>      if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
>          return -ENOMEDIUM;
> @@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>          return 0;
>      }
>  
> +    /* Invalidate the cached block-status data range if this discard overlaps */
> +    qemu_co_mutex_lock(&bsc->lock);
> +    if (bsc->valid &&
> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start))
> +    {
> +        bsc->valid = false;
> +    }

Same question as above about possibly shortening the cached range
in-place rather than discarding it altogether.

> +    qemu_co_mutex_unlock(&bsc->lock);
> +
>      /* Discard is advisory, but some devices track and coalesce
>       * unaligned requests, so we must pass everything down rather than
>       * round here.  Still, most devices will just silently ignore
> -- 
> 2.31.1
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] block/file-posix: Do not force-cap *pnum
  2021-06-17 15:52 ` [PATCH 3/6] block/file-posix: Do not force-cap *pnum Max Reitz
@ 2021-06-18 20:16   ` Eric Blake
  2021-06-21  9:38     ` Max Reitz
  2021-06-19 10:32   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Blake @ 2021-06-18 20:16 UTC (permalink / raw)
  To: Max Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block

On Thu, Jun 17, 2021 at 05:52:44PM +0200, Max Reitz wrote:
> bdrv_co_block_status() does it for us, we do not need to do it here.
> 
> The advantage of not capping *pnum is that bdrv_co_block_status() can
> cache larger data regions than requested by its caller.

We should update the documentation in include/block/block_int.h to
mention that the driver's block_status callback may treat *pnum as a
soft cap, and that returning a larger value is fine.

But I agree with this change in the individual drivers, as long as we
remember to make our global contract explicit that we can now rely on
it ;)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] block/gluster: Do not force-cap *pnum
  2021-06-17 15:52 ` [PATCH 4/6] block/gluster: " Max Reitz
@ 2021-06-18 20:17   ` Eric Blake
  2021-06-19 10:36   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Blake @ 2021-06-18 20:17 UTC (permalink / raw)
  To: Max Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block

On Thu, Jun 17, 2021 at 05:52:45PM +0200, Max Reitz wrote:
> bdrv_co_block_status() does it for us, we do not need to do it here.
> 
> The advantage of not capping *pnum is that bdrv_co_block_status() can
> cache larger data regions than requested by its caller.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/gluster.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-17 15:52 ` [PATCH 5/6] block/nbd: " Max Reitz
@ 2021-06-18 20:20   ` Eric Blake
  2021-06-19 11:12     ` Vladimir Sementsov-Ogievskiy
  2021-06-19 10:53   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Blake @ 2021-06-18 20:20 UTC (permalink / raw)
  To: Max Reitz; +Cc: Kevin Wolf, vsementsov, qemu-devel, qemu-block

On Thu, Jun 17, 2021 at 05:52:46PM +0200, Max Reitz wrote:
> bdrv_co_block_status() does it for us, we do not need to do it here.
> 
> The advantage of not capping *pnum is that bdrv_co_block_status() can
> cache larger data regions than requested by its caller.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/nbd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 616f9ae6c4..930bd234de 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
>          .type = NBD_CMD_BLOCK_STATUS,
>          .from = offset,
>          .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
> -                   MIN(bytes, s->info.size - offset)),
> +                   s->info.size - offset),
>          .flags = NBD_CMD_FLAG_REQ_ONE,

I'd love to someday get rid of using NBD_CMD_FLAG_REQ_ONE (so the
server can reply with more extents in one go), but that's a bigger
task and unrelated to your block-layer cache.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/6] block/iscsi: Do not force-cap *pnum
  2021-06-17 15:52 ` [PATCH 6/6] block/iscsi: " Max Reitz
@ 2021-06-18 20:20   ` Eric Blake
  2021-06-19 11:13   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Blake @ 2021-06-18 20:20 UTC (permalink / raw)
  To: Max Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block

On Thu, Jun 17, 2021 at 05:52:47PM +0200, Max Reitz wrote:
> bdrv_co_block_status() does it for us, we do not need to do it here.
> 
> The advantage of not capping *pnum is that bdrv_co_block_status() can
> cache larger data regions than requested by its caller.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/iscsi.c | 3 ---
>  1 file changed, 3 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] block: Drop BDS comment regarding bdrv_append()
  2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz
  2021-06-18 17:42   ` Eric Blake
@ 2021-06-19  9:38   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 28+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-19  9:38 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

17.06.2021 18:52, Max Reitz wrote:
> There is a comment above the BDS definition stating care must be taken
> to consider handling newly added fields in bdrv_append().
> 
> Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac
> (nine years ago), and in any case, bdrv_swap() was dropped in
> 8e419aefa (six years ago).  So no such care is necessary anymore.
> 
> Signed-off-by: Max Reitz<mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] block: block-status cache for data regions
  2021-06-17 15:52 ` [PATCH 2/6] block: block-status cache for data regions Max Reitz
  2021-06-18 18:51   ` Eric Blake
@ 2021-06-19 10:20   ` Vladimir Sementsov-Ogievskiy
  2021-06-21 10:05     ` Max Reitz
  1 sibling, 1 reply; 28+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 10:20 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

17.06.2021 18:52, Max Reitz wrote:
> As we have attempted before
> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
> "file-posix: Cache lseek result for data regions";
> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
> "file-posix: Cache next hole"), this patch seeks to reduce the number of
> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
> main difference is that this time it is implemented as part of the
> general block layer code.
> 
> The problem we face is that on some filesystems or in some
> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
> implementation is outside of qemu, there is little we can do about its
> performance.
> 
> We have already introduced the want_zero parameter to
> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
> unless we really want zero information; but sometimes we do want that
> information, because for files that consist largely of zero areas,
> special-casing those areas can give large performance boosts.  So the
> real problem is with files that consist largely of data, so that
> inquiring the block status does not gain us much performance, but where
> such an inquiry itself takes a lot of time.
> 
> To address this, we want to cache data regions.  Most of the time, when
> bad performance is reported, it is in places where the image is iterated
> over from start to end (qemu-img convert or the mirror job), so a simple
> yet effective solution is to cache only the current data region.
> 
> (Note that only caching data regions but not zero regions means that
> returning false information from the cache is not catastrophic: Treating
> zeroes as data is fine.  While we try to invalidate the cache on zero
> writes and discards, such incongruences may still occur when there are
> other processes writing to the image.)
> 
> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   include/block/block_int.h | 19 ++++++++++
>   block.c                   |  2 +
>   block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
>   3 files changed, 98 insertions(+), 3 deletions(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index a8f9598102..c09512556a 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -832,6 +832,23 @@ struct BdrvChild {
>       QLIST_ENTRY(BdrvChild) next_parent;
>   };
>   
> +/*
> + * Allows bdrv_co_block_status() to cache one data region for a
> + * protocol node.
> + *
> + * @lock: Lock for accessing this object's fields
> + * @valid: Whether the cache is valid
> + * @data_start: Offset where we know (or strongly assume) is data
> + * @data_end: Offset where the data region ends (which is not necessarily
> + *            the start of a zeroed region)
> + */
> +typedef struct BdrvBlockStatusCache {
> +    CoMutex lock;
> +    bool valid;
> +    int64_t data_start;
> +    int64_t data_end;
> +} BdrvBlockStatusCache;
> +
>   struct BlockDriverState {
>       /* Protected by big QEMU lock or read-only after opening.  No special
>        * locking needed during I/O...
> @@ -997,6 +1014,8 @@ struct BlockDriverState {
>   
>       /* BdrvChild links to this node may never be frozen */
>       bool never_freeze;
> +
> +    BdrvBlockStatusCache block_status_cache;
>   };
>   
>   struct BlockBackendRootState {
> diff --git a/block.c b/block.c
> index 3f456892d0..bad64d5c4f 100644
> --- a/block.c
> +++ b/block.c
> @@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void)
>   
>       qemu_co_queue_init(&bs->flush_queue);
>   
> +    qemu_co_mutex_init(&bs->block_status_cache.lock);
> +
>       for (i = 0; i < bdrv_drain_all_count; i++) {
>           bdrv_drained_begin(bs);
>       }
> diff --git a/block/io.c b/block/io.c
> index 323854d063..320638cc48 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -35,6 +35,7 @@
>   #include "qapi/error.h"
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
> +#include "qemu/range.h"
>   #include "sysemu/replay.h"
>   
>   /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
> @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>       bool need_flush = false;
>       int head = 0;
>       int tail = 0;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>   
>       int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>       int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
> @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>           return -ENOTSUP;
>       }
>   
> +    /* Invalidate the cached block-status data range if this write overlaps */
> +    qemu_co_mutex_lock(&bsc->lock);
> +    if (bsc->valid &&
> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start))
> +    {
> +        bsc->valid = false;
> +    }
> +    qemu_co_mutex_unlock(&bsc->lock);

I think this fragment used twice worth small function, like

block_status_cache_discard_region().

That would be clean and avoids duplication..

> +
>       assert(alignment % bs->bl.request_alignment == 0);
>       head = offset % alignment;
>       tail = (offset + bytes) % alignment;
> @@ -2394,6 +2406,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>       int64_t aligned_offset, aligned_bytes;
>       uint32_t align;
>       bool has_filtered_child;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>   
>       assert(pnum);
>       *pnum = 0;
> @@ -2442,9 +2455,59 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>       aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>   
>       if (bs->drv->bdrv_co_block_status) {
> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> -                                            aligned_bytes, pnum, &local_map,
> -                                            &local_file);
> +        bool from_cache = false;
> +
> +        /*
> +         * Use the block-status cache only for protocol nodes: Format
> +         * drivers are generally quick to inquire the status, but protocol
> +         * drivers often need to get information from outside of qemu, so
> +         * we do not have control over the actual implementation.  There
> +         * have been cases where inquiring the status took an unreasonably
> +         * long time, and we can do nothing in qemu to fix it.
> +         * This is especially problematic for images with large data areas,
> +         * because finding the few holes in them and giving them special
> +         * treatment does not gain much performance.  Therefore, we try to
> +         * cache the last-identified data region.
> +         *
> +         * Second, limiting ourselves to protocol nodes allows us to assume
> +         * the block status for data regions to be DATA | OFFSET_VALID, and
> +         * that the host offset is the same as the guest offset.
> +         *
> +         * Note that it is possible that external writers zero parts of
> +         * the cached regions without the cache being invalidated, and so
> +         * we may report zeroes as data.  This is not catastrophic,
> +         * however, because reporting zeroes as data is fine.
> +         */
> +        if (QLIST_EMPTY(&bs->children)) {
> +            qemu_co_mutex_lock(&bsc->lock);
> +            if (bsc->valid &&
> +                aligned_offset >= bsc->data_start &&
> +                aligned_offset < bsc->data_end)
> +            {
> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +                local_file = bs;
> +                local_map = aligned_offset;
> +
> +                *pnum = bsc->data_end - aligned_offset;
> +
> +                from_cache = true;
> +            }
> +            qemu_co_mutex_unlock(&bsc->lock);
> +        }
> +
> +        if (!from_cache) {
> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> +                                                aligned_bytes, pnum, &local_map,
> +                                                &local_file);
> +
> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) {

You do cache the data for any driver that returns DATA | OFFSET_VALID. But it's unused for format drivers (which may return OFFSET_VALID of course).

I think, here should be

if ((ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) && QLIST_EMPTY(&bs->children)) {

> +                qemu_co_mutex_lock(&bsc->lock);
> +                bsc->data_start = aligned_offset;
> +                bsc->data_end = aligned_offset + *pnum;
> +                bsc->valid = true;
> +                qemu_co_mutex_unlock(&bsc->lock);
> +            }
> +        }


Hmm, also, new additions may worth a separate small functions too, so that new object gets a clean API:

bsc_discard()
bsc_get()
bsc_set()

or something like this. I don't insist.

>       } else {
>           /* Default code for filters */
>   
> @@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>       int max_pdiscard, ret;
>       int head, tail, align;
>       BlockDriverState *bs = child->bs;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>   
>       if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
>           return -ENOMEDIUM;
> @@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>           return 0;
>       }
>   
> +    /* Invalidate the cached block-status data range if this discard overlaps */
> +    qemu_co_mutex_lock(&bsc->lock);
> +    if (bsc->valid &&
> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start))
> +    {
> +        bsc->valid = false;
> +    }
> +    qemu_co_mutex_unlock(&bsc->lock);
> +
>       /* Discard is advisory, but some devices track and coalesce
>        * unaligned requests, so we must pass everything down rather than
>        * round here.  Still, most devices will just silently ignore
> 

Overall, seems good to me.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] block/file-posix: Do not force-cap *pnum
  2021-06-17 15:52 ` [PATCH 3/6] block/file-posix: Do not force-cap *pnum Max Reitz
  2021-06-18 20:16   ` Eric Blake
@ 2021-06-19 10:32   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 28+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 10:32 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

17.06.2021 18:52, Max Reitz wrote:
> bdrv_co_block_status() does it for us, we do not need to do it here.
> 
> The advantage of not capping *pnum is that bdrv_co_block_status() can
> cache larger data regions than requested by its caller.
> 
> Signed-off-by: Max Reitz<mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] block/gluster: Do not force-cap *pnum
  2021-06-17 15:52 ` [PATCH 4/6] block/gluster: " Max Reitz
  2021-06-18 20:17   ` Eric Blake
@ 2021-06-19 10:36   ` Vladimir Sementsov-Ogievskiy
  2021-06-21  9:47     ` Max Reitz
  1 sibling, 1 reply; 28+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 10:36 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

17.06.2021 18:52, Max Reitz wrote:
> bdrv_co_block_status() does it for us, we do not need to do it here.
> 
> The advantage of not capping *pnum is that bdrv_co_block_status() can
> cache larger data regions than requested by its caller.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>


Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>


> ---
>   block/gluster.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/block/gluster.c b/block/gluster.c
> index e8ee14c8e9..8ef7bb18d5 100644
> --- a/block/gluster.c
> +++ b/block/gluster.c
> @@ -1461,7 +1461,8 @@ exit:
>    * the specified offset) that are known to be in the same
>    * allocated/unallocated state.
>    *
> - * 'bytes' is the max value 'pnum' should be set to.
> + * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
> + * well exceed it.
>    *
>    * (Based on raw_co_block_status() from file-posix.c.)
>    */
> @@ -1500,12 +1501,12 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs,
>       } else if (data == offset) {
>           /* On a data extent, compute bytes to the end of the extent,
>            * possibly including a partial sector at EOF. */
> -        *pnum = MIN(bytes, hole - offset);
> +        *pnum = hole - offset;
>           ret = BDRV_BLOCK_DATA;

Interesting, isn't it a bug that we don't ROUND_UP *pnum to request_alignment here like it is done in file-posix ?

>       } else {
>           /* On a hole, compute bytes to the beginning of the next extent.  */
>           assert(hole == offset);
> -        *pnum = MIN(bytes, data - offset);
> +        *pnum = data - offset;
>           ret = BDRV_BLOCK_ZERO;
>       }
>   
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-17 15:52 ` [PATCH 5/6] block/nbd: " Max Reitz
  2021-06-18 20:20   ` Eric Blake
@ 2021-06-19 10:53   ` Vladimir Sementsov-Ogievskiy
  2021-06-21  9:50     ` Max Reitz
  2021-06-21 18:53     ` Eric Blake
  1 sibling, 2 replies; 28+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 10:53 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

17.06.2021 18:52, Max Reitz wrote:
> bdrv_co_block_status() does it for us, we do not need to do it here.
> 
> The advantage of not capping *pnum is that bdrv_co_block_status() can
> cache larger data regions than requested by its caller.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block/nbd.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 616f9ae6c4..930bd234de 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
>           .type = NBD_CMD_BLOCK_STATUS,
>           .from = offset,
>           .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
> -                   MIN(bytes, s->info.size - offset)),
> +                   s->info.size - offset),
>           .flags = NBD_CMD_FLAG_REQ_ONE,
>       };
>   
> 

Hmm..

I don't that this change is correct. In contrast with file-posix you don't get extra information for free, you just make a larger request. This means that server will have to do more work.

(look at blockstatus_to_extents, it calls bdrv_block_status_above in a loop).

For example, assume that nbd export is a qcow2 image with all clusters allocated. With this change, nbd server will loop through the whole qcow2 image, load all L2 tables to return big allocated extent.

So, only server can decide, could it add some extra free information to request or not. But unfortunately NBD_CMD_FLAG_REQ_ONE doesn't allow it.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-18 20:20   ` Eric Blake
@ 2021-06-19 11:12     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 28+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 11:12 UTC (permalink / raw)
  To: Eric Blake, Max Reitz; +Cc: qemu-block, Kevin Wolf, qemu-devel

18.06.2021 23:20, Eric Blake wrote:
> On Thu, Jun 17, 2021 at 05:52:46PM +0200, Max Reitz wrote:
>> bdrv_co_block_status() does it for us, we do not need to do it here.
>>
>> The advantage of not capping *pnum is that bdrv_co_block_status() can
>> cache larger data regions than requested by its caller.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/nbd.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>
> 
>>
>> diff --git a/block/nbd.c b/block/nbd.c
>> index 616f9ae6c4..930bd234de 100644
>> --- a/block/nbd.c
>> +++ b/block/nbd.c
>> @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
>>           .type = NBD_CMD_BLOCK_STATUS,
>>           .from = offset,
>>           .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
>> -                   MIN(bytes, s->info.size - offset)),
>> +                   s->info.size - offset),
>>           .flags = NBD_CMD_FLAG_REQ_ONE,
> 
> I'd love to someday get rid of using NBD_CMD_FLAG_REQ_ONE (so the
> server can reply with more extents in one go), but that's a bigger
> task and unrelated to your block-layer cache.
> 

I think for this to work, the generic block_status should be updated so we can work with several extents in one go.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/6] block/iscsi: Do not force-cap *pnum
  2021-06-17 15:52 ` [PATCH 6/6] block/iscsi: " Max Reitz
  2021-06-18 20:20   ` Eric Blake
@ 2021-06-19 11:13   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 28+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 11:13 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

17.06.2021 18:52, Max Reitz wrote:
> bdrv_co_block_status() does it for us, we do not need to do it here.
> 
> The advantage of not capping *pnum is that bdrv_co_block_status() can
> cache larger data regions than requested by its caller.
> 
> Signed-off-by: Max Reitz<mreitz@redhat.com>


Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] block: block-status cache for data regions
  2021-06-18 18:51   ` Eric Blake
@ 2021-06-21  9:37     ` Max Reitz
  0 siblings, 0 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-21  9:37 UTC (permalink / raw)
  To: Eric Blake; +Cc: Kevin Wolf, qemu-devel, qemu-block

On 18.06.21 20:51, Eric Blake wrote:
> On Thu, Jun 17, 2021 at 05:52:43PM +0200, Max Reitz wrote:
>> To address this, we want to cache data regions.  Most of the time, when
>> bad performance is reported, it is in places where the image is iterated
>> over from start to end (qemu-img convert or the mirror job), so a simple
>> yet effective solution is to cache only the current data region.
> Here's hoping third time's the charm!

Indeed :)

>> (Note that only caching data regions but not zero regions means that
>> returning false information from the cache is not catastrophic: Treating
>> zeroes as data is fine.  While we try to invalidate the cache on zero
>> writes and discards, such incongruences may still occur when there are
>> other processes writing to the image.)
>>
>> We only use the cache for nodes without children (i.e. protocol nodes),
>> because that is where the problem is: Drivers that rely on block-status
>> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
>>
>> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   include/block/block_int.h | 19 ++++++++++
>>   block.c                   |  2 +
>>   block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
>>   3 files changed, 98 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index a8f9598102..c09512556a 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -832,6 +832,23 @@ struct BdrvChild {
>>       QLIST_ENTRY(BdrvChild) next_parent;
>>   };
>>   
>> +/*
>> + * Allows bdrv_co_block_status() to cache one data region for a
>> + * protocol node.
>> + *
>> + * @lock: Lock for accessing this object's fields
>> + * @valid: Whether the cache is valid
>> + * @data_start: Offset where we know (or strongly assume) is data
>> + * @data_end: Offset where the data region ends (which is not necessarily
>> + *            the start of a zeroed region)
>> + */
>> +typedef struct BdrvBlockStatusCache {
>> +    CoMutex lock;
>> +    bool valid;
>> +    int64_t data_start;
>> +    int64_t data_end;
>> +} BdrvBlockStatusCache;
> Looks like the right bits of information, and I'm glad you documented
> the need to be prepared for protocols that report split data sections
> rather than consolidated.
>
>> +++ b/block/io.c
>> @@ -35,6 +35,7 @@
>>   #include "qapi/error.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/main-loop.h"
>> +#include "qemu/range.h"
>>   #include "sysemu/replay.h"
>>   
>>   /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
>> @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>       bool need_flush = false;
>>       int head = 0;
>>       int tail = 0;
>> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>>   
>>       int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>>       int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
>> @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>           return -ENOTSUP;
>>       }
>>   
>> +    /* Invalidate the cached block-status data range if this write overlaps */
>> +    qemu_co_mutex_lock(&bsc->lock);
> Are we going to be suffering from a lot of lock contention performance
> degradation?  Is there a way to take advantage of RCU access patterns
> for any more performance without sacrificing correctness?

The critical section is so short that I considered it fine.  I wanted to 
use RW locks, but then realized that every RW lock operation is 
internally locked by another mutex, so it wouldn’t gain anything.

I’m not sure whether RCU is worth it here.

We could try something very crude, namely to just not take a lock and 
make `valid` an atomic.  After all, it doesn’t really matter whether 
`data_start` and `data_end` are consistent values, and resetting `valid` 
to false is always safe.

The worst that could happen is that a concurrent block-status call tries 
to set up an overlapping data area, which we thus fail to recognize 
here.  But if such a thing were to happen, it could just as well happen 
before said concurrent call took any lock on `bsc`.

>> +    if (bsc->valid &&
>> +        ranges_overlap(offset, bytes, bsc->data_start,
>> +                       bsc->data_end - bsc->data_start))
>> +    {
>> +        bsc->valid = false;
>> +    }
> Do we want to invalidate the entire bsc, or can we be smart and leave
> the prefix intact (if offset > bsc->data_start, then set bsc->data_end
> to offset)?

Perhaps we could be smart, but I don’t think it really makes a 
difference in practice, so I think keeping it simple is better.

>> +    qemu_co_mutex_unlock(&bsc->lock);
> Worth using WITH_QEMU_LOCK_GUARD?

I knew I forgot something, right.  Will use!

Max



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] block/file-posix: Do not force-cap *pnum
  2021-06-18 20:16   ` Eric Blake
@ 2021-06-21  9:38     ` Max Reitz
  0 siblings, 0 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-21  9:38 UTC (permalink / raw)
  To: Eric Blake; +Cc: Kevin Wolf, qemu-devel, qemu-block

On 18.06.21 22:16, Eric Blake wrote:
> On Thu, Jun 17, 2021 at 05:52:44PM +0200, Max Reitz wrote:
>> bdrv_co_block_status() does it for us, we do not need to do it here.
>>
>> The advantage of not capping *pnum is that bdrv_co_block_status() can
>> cache larger data regions than requested by its caller.
> We should update the documentation in include/block/block_int.h to
> mention that the driver's block_status callback may treat *pnum as a
> soft cap, and that returning a larger value is fine.

Oh, sure.

Max

> But I agree with this change in the individual drivers, as long as we
> remember to make our global contract explicit that we can now rely on
> it ;)
>
> Reviewed-by: Eric Blake <eblake@redhat.com>
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] block/gluster: Do not force-cap *pnum
  2021-06-19 10:36   ` Vladimir Sementsov-Ogievskiy
@ 2021-06-21  9:47     ` Max Reitz
  0 siblings, 0 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-21  9:47 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 19.06.21 12:36, Vladimir Sementsov-Ogievskiy wrote:
> 17.06.2021 18:52, Max Reitz wrote:
>> bdrv_co_block_status() does it for us, we do not need to do it here.
>>
>> The advantage of not capping *pnum is that bdrv_co_block_status() can
>> cache larger data regions than requested by its caller.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>
>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>
>
>> ---
>>   block/gluster.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/gluster.c b/block/gluster.c
>> index e8ee14c8e9..8ef7bb18d5 100644
>> --- a/block/gluster.c
>> +++ b/block/gluster.c
>> @@ -1461,7 +1461,8 @@ exit:
>>    * the specified offset) that are known to be in the same
>>    * allocated/unallocated state.
>>    *
>> - * 'bytes' is the max value 'pnum' should be set to.
>> + * 'bytes' is a soft cap for 'pnum'.  If the information is free, 
>> 'pnum' may
>> + * well exceed it.
>>    *
>>    * (Based on raw_co_block_status() from file-posix.c.)
>>    */
>> @@ -1500,12 +1501,12 @@ static int coroutine_fn 
>> qemu_gluster_co_block_status(BlockDriverState *bs,
>>       } else if (data == offset) {
>>           /* On a data extent, compute bytes to the end of the extent,
>>            * possibly including a partial sector at EOF. */
>> -        *pnum = MIN(bytes, hole - offset);
>> +        *pnum = hole - offset;
>>           ret = BDRV_BLOCK_DATA;
>
> Interesting, isn't it a bug that we don't ROUND_UP *pnum to 
> request_alignment here like it is done in file-posix ?

Guess I forgot gluster in 9c3db310ff0 O:)

I don’t think I’ll be able to reproduce it for gluster, but I suppose 
just doing the same thing for gluster should be fine...

Max



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-19 10:53   ` Vladimir Sementsov-Ogievskiy
@ 2021-06-21  9:50     ` Max Reitz
  2021-06-21 18:54       ` Eric Blake
  2021-06-21 18:53     ` Eric Blake
  1 sibling, 1 reply; 28+ messages in thread
From: Max Reitz @ 2021-06-21  9:50 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 19.06.21 12:53, Vladimir Sementsov-Ogievskiy wrote:
> 17.06.2021 18:52, Max Reitz wrote:
>> bdrv_co_block_status() does it for us, we do not need to do it here.
>>
>> The advantage of not capping *pnum is that bdrv_co_block_status() can
>> cache larger data regions than requested by its caller.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/nbd.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/block/nbd.c b/block/nbd.c
>> index 616f9ae6c4..930bd234de 100644
>> --- a/block/nbd.c
>> +++ b/block/nbd.c
>> @@ -1702,7 +1702,7 @@ static int coroutine_fn 
>> nbd_client_co_block_status(
>>           .type = NBD_CMD_BLOCK_STATUS,
>>           .from = offset,
>>           .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
>> -                   MIN(bytes, s->info.size - offset)),
>> +                   s->info.size - offset),
>>           .flags = NBD_CMD_FLAG_REQ_ONE,
>>       };
>>
>
> Hmm..
>
> I don't that this change is correct. In contrast with file-posix you 
> don't get extra information for free, you just make a larger request. 
> This means that server will have to do more work.

Oh, oops.  Seems I was blind in my rage to replace this MIN() pattern.

You’re absolutely right.  So this patch should be dropped.

Max

> (look at blockstatus_to_extents, it calls bdrv_block_status_above in a 
> loop).
>
> For example, assume that nbd export is a qcow2 image with all clusters 
> allocated. With this change, nbd server will loop through the whole 
> qcow2 image, load all L2 tables to return big allocated extent.
>
> So, only server can decide, could it add some extra free information 
> to request or not. But unfortunately NBD_CMD_FLAG_REQ_ONE doesn't 
> allow it.
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] block: block-status cache for data regions
  2021-06-19 10:20   ` Vladimir Sementsov-Ogievskiy
@ 2021-06-21 10:05     ` Max Reitz
  0 siblings, 0 replies; 28+ messages in thread
From: Max Reitz @ 2021-06-21 10:05 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 19.06.21 12:20, Vladimir Sementsov-Ogievskiy wrote:
> 17.06.2021 18:52, Max Reitz wrote:
>> As we have attempted before
>> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
>> "file-posix: Cache lseek result for data regions";
>> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
>> "file-posix: Cache next hole"), this patch seeks to reduce the number of
>> SEEK_DATA/HOLE operations the file-posix driver has to perform. The
>> main difference is that this time it is implemented as part of the
>> general block layer code.
>>
>> The problem we face is that on some filesystems or in some
>> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
>> implementation is outside of qemu, there is little we can do about its
>> performance.
>>
>> We have already introduced the want_zero parameter to
>> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
>> unless we really want zero information; but sometimes we do want that
>> information, because for files that consist largely of zero areas,
>> special-casing those areas can give large performance boosts. So the
>> real problem is with files that consist largely of data, so that
>> inquiring the block status does not gain us much performance, but where
>> such an inquiry itself takes a lot of time.
>>
>> To address this, we want to cache data regions.  Most of the time, when
>> bad performance is reported, it is in places where the image is iterated
>> over from start to end (qemu-img convert or the mirror job), so a simple
>> yet effective solution is to cache only the current data region.
>>
>> (Note that only caching data regions but not zero regions means that
>> returning false information from the cache is not catastrophic: Treating
>> zeroes as data is fine.  While we try to invalidate the cache on zero
>> writes and discards, such incongruences may still occur when there are
>> other processes writing to the image.)
>>
>> We only use the cache for nodes without children (i.e. protocol nodes),
>> because that is where the problem is: Drivers that rely on block-status
>> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
>>
>> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   include/block/block_int.h | 19 ++++++++++
>>   block.c                   |  2 +
>>   block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
>>   3 files changed, 98 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index a8f9598102..c09512556a 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -832,6 +832,23 @@ struct BdrvChild {
>>       QLIST_ENTRY(BdrvChild) next_parent;
>>   };
>>   +/*
>> + * Allows bdrv_co_block_status() to cache one data region for a
>> + * protocol node.
>> + *
>> + * @lock: Lock for accessing this object's fields
>> + * @valid: Whether the cache is valid
>> + * @data_start: Offset where we know (or strongly assume) is data
>> + * @data_end: Offset where the data region ends (which is not 
>> necessarily
>> + *            the start of a zeroed region)
>> + */
>> +typedef struct BdrvBlockStatusCache {
>> +    CoMutex lock;
>> +    bool valid;
>> +    int64_t data_start;
>> +    int64_t data_end;
>> +} BdrvBlockStatusCache;
>> +
>>   struct BlockDriverState {
>>       /* Protected by big QEMU lock or read-only after opening. No 
>> special
>>        * locking needed during I/O...
>> @@ -997,6 +1014,8 @@ struct BlockDriverState {
>>         /* BdrvChild links to this node may never be frozen */
>>       bool never_freeze;
>> +
>> +    BdrvBlockStatusCache block_status_cache;
>>   };
>>     struct BlockBackendRootState {
>> diff --git a/block.c b/block.c
>> index 3f456892d0..bad64d5c4f 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void)
>>         qemu_co_queue_init(&bs->flush_queue);
>>   +    qemu_co_mutex_init(&bs->block_status_cache.lock);
>> +
>>       for (i = 0; i < bdrv_drain_all_count; i++) {
>>           bdrv_drained_begin(bs);
>>       }
>> diff --git a/block/io.c b/block/io.c
>> index 323854d063..320638cc48 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -35,6 +35,7 @@
>>   #include "qapi/error.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/main-loop.h"
>> +#include "qemu/range.h"
>>   #include "sysemu/replay.h"
>>     /* Maximum bounce buffer for copy-on-read and write zeroes, in 
>> bytes */
>> @@ -1862,6 +1863,7 @@ static int coroutine_fn 
>> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>       bool need_flush = false;
>>       int head = 0;
>>       int tail = 0;
>> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>>         int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, 
>> INT_MAX);
>>       int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
>> @@ -1878,6 +1880,16 @@ static int coroutine_fn 
>> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>           return -ENOTSUP;
>>       }
>>   +    /* Invalidate the cached block-status data range if this write 
>> overlaps */
>> +    qemu_co_mutex_lock(&bsc->lock);
>> +    if (bsc->valid &&
>> +        ranges_overlap(offset, bytes, bsc->data_start,
>> +                       bsc->data_end - bsc->data_start))
>> +    {
>> +        bsc->valid = false;
>> +    }
>> +    qemu_co_mutex_unlock(&bsc->lock);
>
> I think this fragment used twice worth small function, like
>
> block_status_cache_discard_region().
>
> That would be clean and avoids duplication..

Sure, sounds good.

>> +
>>       assert(alignment % bs->bl.request_alignment == 0);
>>       head = offset % alignment;
>>       tail = (offset + bytes) % alignment;
>> @@ -2394,6 +2406,7 @@ static int coroutine_fn 
>> bdrv_co_block_status(BlockDriverState *bs,
>>       int64_t aligned_offset, aligned_bytes;
>>       uint32_t align;
>>       bool has_filtered_child;
>> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>>         assert(pnum);
>>       *pnum = 0;
>> @@ -2442,9 +2455,59 @@ static int coroutine_fn 
>> bdrv_co_block_status(BlockDriverState *bs,
>>       aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>>         if (bs->drv->bdrv_co_block_status) {
>> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, 
>> aligned_offset,
>> -                                            aligned_bytes, pnum, 
>> &local_map,
>> -                                            &local_file);
>> +        bool from_cache = false;
>> +
>> +        /*
>> +         * Use the block-status cache only for protocol nodes: Format
>> +         * drivers are generally quick to inquire the status, but 
>> protocol
>> +         * drivers often need to get information from outside of 
>> qemu, so
>> +         * we do not have control over the actual implementation.  
>> There
>> +         * have been cases where inquiring the status took an 
>> unreasonably
>> +         * long time, and we can do nothing in qemu to fix it.
>> +         * This is especially problematic for images with large data 
>> areas,
>> +         * because finding the few holes in them and giving them 
>> special
>> +         * treatment does not gain much performance. Therefore, we 
>> try to
>> +         * cache the last-identified data region.
>> +         *
>> +         * Second, limiting ourselves to protocol nodes allows us to 
>> assume
>> +         * the block status for data regions to be DATA | 
>> OFFSET_VALID, and
>> +         * that the host offset is the same as the guest offset.
>> +         *
>> +         * Note that it is possible that external writers zero parts of
>> +         * the cached regions without the cache being invalidated, 
>> and so
>> +         * we may report zeroes as data.  This is not catastrophic,
>> +         * however, because reporting zeroes as data is fine.
>> +         */
>> +        if (QLIST_EMPTY(&bs->children)) {
>> +            qemu_co_mutex_lock(&bsc->lock);
>> +            if (bsc->valid &&
>> +                aligned_offset >= bsc->data_start &&
>> +                aligned_offset < bsc->data_end)
>> +            {
>> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>> +                local_file = bs;
>> +                local_map = aligned_offset;
>> +
>> +                *pnum = bsc->data_end - aligned_offset;
>> +
>> +                from_cache = true;
>> +            }
>> +            qemu_co_mutex_unlock(&bsc->lock);
>> +        }
>> +
>> +        if (!from_cache) {
>> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, 
>> aligned_offset,
>> +                                                aligned_bytes, pnum, 
>> &local_map,
>> + &local_file);
>> +
>> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) {
>
> You do cache the data for any driver that returns DATA | OFFSET_VALID. 
> But it's unused for format drivers (which may return OFFSET_VALID of 
> course).
>
> I think, here should be
>
> if ((ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) && 
> QLIST_EMPTY(&bs->children)) {

Actually, I originally had it like this but then was afraid that a 
reviewer would ask me why I check the children list twice, once here and 
once when reading from the cache. :)

So I thought I should check it only once, and the better place to do so 
was where the cache is actually used.

I don’t think it makes much of a difference in practice (setting the 
cache doesn’t really matter if it isn’t used), but perhaps it might make 
the checks in write-zero/discard a bit quicker, because they can return 
early with `valid == false`.

>> + qemu_co_mutex_lock(&bsc->lock);
>> +                bsc->data_start = aligned_offset;
>> +                bsc->data_end = aligned_offset + *pnum;
>> +                bsc->valid = true;
>> +                qemu_co_mutex_unlock(&bsc->lock);
>> +            }
>> +        }
>
>
> Hmm, also, new additions may worth a separate small functions too, so 
> that new object gets a clean API:
>
> bsc_discard()
> bsc_get()
> bsc_set()
>
> or something like this. I don't insist.

Sounds good, will do.

>>       } else {
>>           /* Default code for filters */
>>   @@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild 
>> *child, int64_t offset,
>>       int max_pdiscard, ret;
>>       int head, tail, align;
>>       BlockDriverState *bs = child->bs;
>> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>>         if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
>>           return -ENOMEDIUM;
>> @@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild 
>> *child, int64_t offset,
>>           return 0;
>>       }
>>   +    /* Invalidate the cached block-status data range if this 
>> discard overlaps */
>> +    qemu_co_mutex_lock(&bsc->lock);
>> +    if (bsc->valid &&
>> +        ranges_overlap(offset, bytes, bsc->data_start,
>> +                       bsc->data_end - bsc->data_start))
>> +    {
>> +        bsc->valid = false;
>> +    }
>> +    qemu_co_mutex_unlock(&bsc->lock);
>> +
>>       /* Discard is advisory, but some devices track and coalesce
>>        * unaligned requests, so we must pass everything down rather than
>>        * round here.  Still, most devices will just silently ignore
>>
>
> Overall, seems good to me.

Great :)

Max



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-19 10:53   ` Vladimir Sementsov-Ogievskiy
  2021-06-21  9:50     ` Max Reitz
@ 2021-06-21 18:53     ` Eric Blake
  2021-06-22  9:07       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Blake @ 2021-06-21 18:53 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Kevin Wolf, qemu-devel, qemu-block, Max Reitz

On Sat, Jun 19, 2021 at 01:53:24PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > +++ b/block/nbd.c
> > @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
> >           .type = NBD_CMD_BLOCK_STATUS,
> >           .from = offset,
> >           .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
> > -                   MIN(bytes, s->info.size - offset)),
> > +                   s->info.size - offset),
> >           .flags = NBD_CMD_FLAG_REQ_ONE,
> >       };
> > 
> 
> Hmm..
> 
> I don't that this change is correct. In contrast with file-posix you don't get extra information for free, you just make a larger request. This means that server will have to do more work.

Not necessarily.  The fact that we have passed NBD_CMD_FLAG_REQ_ONE
means that the server is still only allowed to give us one extent in
its answer, and that it may not give us information beyond the length
we requested.  You are right that if we lose the REQ_ONE flag we may
result in the server doing more work to provide us additional extents
that we will then be ignoring because we aren't yet set up for
avoiding REQ_ONE.  Fixing that is a longer-term goal.  But in the
short term, I see no harm in giving a larger length to the server with
REQ_ONE.

> 
> (look at blockstatus_to_extents, it calls bdrv_block_status_above in a loop).
> 
> For example, assume that nbd export is a qcow2 image with all clusters allocated. With this change, nbd server will loop through the whole qcow2 image, load all L2 tables to return big allocated extent.

No, the server is allowed to reply with less length than our request,
and that is particularly true if the server does NOT have free access
to the full length of our request.  In the case of qcow2, since
bdrv_block_status is (by current design) clamped at cluster
boundaries, requesting a 4G length will NOT increase the amount of the
server response any further than the first cluster boundary (that is,
the point where the server no longer has free access to status without
loading another cluster of L2 entries).

> 
> So, only server can decide, could it add some extra free information to request or not. But unfortunately NBD_CMD_FLAG_REQ_ONE doesn't allow it.

What the flag prohibits is the server giving us more information than
the length we requested.  But this patch is increasing our request
length for the case where the server CAN give us more information than
we need locally, on the hopes that even though the server can only
reply with one extent, we aren't wasting as many network
back-and-forth trips when a larger request would have worked.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-21  9:50     ` Max Reitz
@ 2021-06-21 18:54       ` Eric Blake
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Blake @ 2021-06-21 18:54 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Mon, Jun 21, 2021 at 11:50:02AM +0200, Max Reitz wrote:
> > I don't that this change is correct. In contrast with file-posix you
> > don't get extra information for free, you just make a larger request.
> > This means that server will have to do more work.
> 
> Oh, oops.  Seems I was blind in my rage to replace this MIN() pattern.
> 
> You’re absolutely right.  So this patch should be dropped.

I disagree - I think ths patch is still correct, as written, _because_
we use the REQ_ONE flag.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-21 18:53     ` Eric Blake
@ 2021-06-22  9:07       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 28+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-22  9:07 UTC (permalink / raw)
  To: Eric Blake; +Cc: Max Reitz, qemu-block, Kevin Wolf, qemu-devel

21.06.2021 21:53, Eric Blake wrote:
> On Sat, Jun 19, 2021 at 01:53:24PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>> +++ b/block/nbd.c
>>> @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
>>>            .type = NBD_CMD_BLOCK_STATUS,
>>>            .from = offset,
>>>            .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
>>> -                   MIN(bytes, s->info.size - offset)),
>>> +                   s->info.size - offset),
>>>            .flags = NBD_CMD_FLAG_REQ_ONE,
>>>        };
>>>
>>
>> Hmm..
>>
>> I don't that this change is correct. In contrast with file-posix you don't get extra information for free, you just make a larger request. This means that server will have to do more work.
> 
> Not necessarily.  The fact that we have passed NBD_CMD_FLAG_REQ_ONE
> means that the server is still only allowed to give us one extent in
> its answer, and that it may not give us information beyond the length
> we requested.  You are right that if we lose the REQ_ONE flag we may
> result in the server doing more work to provide us additional extents
> that we will then be ignoring because we aren't yet set up for
> avoiding REQ_ONE.  Fixing that is a longer-term goal.  But in the
> short term, I see no harm in giving a larger length to the server with
> REQ_ONE.
> 
>>
>> (look at blockstatus_to_extents, it calls bdrv_block_status_above in a loop).
>>
>> For example, assume that nbd export is a qcow2 image with all clusters allocated. With this change, nbd server will loop through the whole qcow2 image, load all L2 tables to return big allocated extent.
> 
> No, the server is allowed to reply with less length than our request,
> and that is particularly true if the server does NOT have free access
> to the full length of our request.  In the case of qcow2, since
> bdrv_block_status is (by current design) clamped at cluster
> boundaries, requesting a 4G length will NOT increase the amount of the
> server response any further than the first cluster boundary (that is,
> the point where the server no longer has free access to status without
> loading another cluster of L2 entries).

No. No matter where bdrv_block_status_above is clamped. If the whole disk is allocated, blockstatus_to_extents() in nbd/server.c will loop through the whole requested range and merge all the information into one extent. This doesn't violate NBD_CMD_FLAG_REQ_ONE: we have one extent on output and don't go beyound the length. It's valid for the server to try to satisfy as much as possible of request, and blockstatus_to_extents works in this way currently.

Remember that nbd_extent_array_add() can merge new extent to the previous if it has the same type.

> 
>>
>> So, only server can decide, could it add some extra free information to request or not. But unfortunately NBD_CMD_FLAG_REQ_ONE doesn't allow it.
> 
> What the flag prohibits is the server giving us more information than
> the length we requested.  But this patch is increasing our request
> length for the case where the server CAN give us more information than
> we need locally, on the hopes that even though the server can only
> reply with one extent, we aren't wasting as many network
> back-and-forth trips when a larger request would have worked.
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-06-22  9:09 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz
2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz
2021-06-18 17:42   ` Eric Blake
2021-06-19  9:38   ` Vladimir Sementsov-Ogievskiy
2021-06-17 15:52 ` [PATCH 2/6] block: block-status cache for data regions Max Reitz
2021-06-18 18:51   ` Eric Blake
2021-06-21  9:37     ` Max Reitz
2021-06-19 10:20   ` Vladimir Sementsov-Ogievskiy
2021-06-21 10:05     ` Max Reitz
2021-06-17 15:52 ` [PATCH 3/6] block/file-posix: Do not force-cap *pnum Max Reitz
2021-06-18 20:16   ` Eric Blake
2021-06-21  9:38     ` Max Reitz
2021-06-19 10:32   ` Vladimir Sementsov-Ogievskiy
2021-06-17 15:52 ` [PATCH 4/6] block/gluster: " Max Reitz
2021-06-18 20:17   ` Eric Blake
2021-06-19 10:36   ` Vladimir Sementsov-Ogievskiy
2021-06-21  9:47     ` Max Reitz
2021-06-17 15:52 ` [PATCH 5/6] block/nbd: " Max Reitz
2021-06-18 20:20   ` Eric Blake
2021-06-19 11:12     ` Vladimir Sementsov-Ogievskiy
2021-06-19 10:53   ` Vladimir Sementsov-Ogievskiy
2021-06-21  9:50     ` Max Reitz
2021-06-21 18:54       ` Eric Blake
2021-06-21 18:53     ` Eric Blake
2021-06-22  9:07       ` Vladimir Sementsov-Ogievskiy
2021-06-17 15:52 ` [PATCH 6/6] block/iscsi: " Max Reitz
2021-06-18 20:20   ` Eric Blake
2021-06-19 11:13   ` Vladimir Sementsov-Ogievskiy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.