* [PATCH v2 0/6] block: block-status cache for data regions
@ 2021-06-23 15:01 Max Reitz
  2021-06-23 15:01 ` [PATCH v2 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Max Reitz @ 2021-06-23 15:01 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel,
	Max Reitz

Hi,

See the cover letter from v1 for the general idea:

https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg00843.html


The biggest change in v2 is that instead of having a common CoMutex
protect the block-status cache, we now use RCU.  Reading from the
cache, and even invalidating it, therefore needs no lock; a lock is
only taken to update the cache with new data.
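
For illustration, the pattern roughly looks like this (a condensed
sketch of what patch 2 does, not the literal code; offset_in_cache()
is just a placeholder for the actual overlap check):

    /* Readers and invalidation: RCU read guard only, no lock taken */
    WITH_RCU_READ_LOCK_GUARD() {
        BdrvBlockStatusCache *bsc = qatomic_rcu_read(&bs->block_status_cache);

        if (qatomic_read(&bsc->valid) && offset_in_cache(bsc, offset)) {
            /* cache hit; invalidation just does
             * qatomic_set(&bsc->valid, false) here */
        }
    }

    /* Updating: replace the cache object under the writer lock */
    WITH_QEMU_LOCK_GUARD(&bs->bsc_modify_lock) {
        old_bsc = bs->block_status_cache;
        qatomic_rcu_set(&bs->block_status_cache, new_bsc);
        synchronize_rcu();  /* wait until no reader can still see old_bsc */
    }
    g_free(old_bsc);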

Disclaimer: I have no practical experience with RCU so far, either in
qemu or anywhere else.  So I hope I’ve used it correctly...


Differences to v1 in detail:
- Patch 2:
  - Moved BdrvBlockStatusCache.lock up to BDS, it is now the RCU writer
    lock
  - BDS.block_status_cache is now a pointer, so it can be replaced with
    RCU
  - Moved all cache access functionality into helper functions
    (bdrv_bsc_is_data(), bdrv_bsc_invalidate_range(), bdrv_bsc_fill())
    in block.c
  - Guard BSC accesses with RCU
    (BSC.valid is to be accessed atomically, which allows resetting it
    without taking an RCU write lock)
  - Check QLIST_EMPTY(&bs->children) not just when reading from the
    cache, but when filling it, too (so we don’t need an RCU update when
    it won’t make sense)
- Patch 3: Added
- Dropped the block/nbd patch (it would have made NBD query a larger
  range; the patch’s intent was to get more information for free, and
  querying a larger range would not be free)


git-backport-diff against v1:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/6:[----] [--] 'block: Drop BDS comment regarding bdrv_append()'
002/6:[0169] [FC] 'block: block-status cache for data regions'
003/6:[down] 'block: Clarify that @bytes is no limit on *pnum'
004/6:[----] [--] 'block/file-posix: Do not force-cap *pnum'
005/6:[----] [--] 'block/gluster: Do not force-cap *pnum'
006/6:[----] [--] 'block/iscsi: Do not force-cap *pnum'


Max Reitz (6):
  block: Drop BDS comment regarding bdrv_append()
  block: block-status cache for data regions
  block: Clarify that @bytes is no limit on *pnum
  block/file-posix: Do not force-cap *pnum
  block/gluster: Do not force-cap *pnum
  block/iscsi: Do not force-cap *pnum

 include/block/block_int.h | 54 +++++++++++++++++++++++--
 block.c                   | 84 +++++++++++++++++++++++++++++++++++++++
 block/file-posix.c        |  7 ++--
 block/gluster.c           |  7 ++--
 block/io.c                | 61 ++++++++++++++++++++++++++--
 block/iscsi.c             |  3 --
 6 files changed, 200 insertions(+), 16 deletions(-)

-- 
2.31.1




* [PATCH v2 1/6] block: Drop BDS comment regarding bdrv_append()
  2021-06-23 15:01 [PATCH v2 0/6] block: block-status cache for data regions Max Reitz
@ 2021-06-23 15:01 ` Max Reitz
  2021-06-23 15:01 ` [PATCH v2 2/6] block: block-status cache for data regions Max Reitz
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2021-06-23 15:01 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel,
	Max Reitz

There is a comment above the BDS definition stating care must be taken
to consider handling newly added fields in bdrv_append().

Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac
(nine years ago), and in any case, bdrv_swap() was dropped in
8e419aefa (six years ago).  So no such care is necessary anymore.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block_int.h | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 057d88b1fc..a8f9598102 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,12 +832,6 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
-/*
- * Note: the function bdrv_append() copies and swaps contents of
- * BlockDriverStates, so if you add new fields to this struct, please
- * inspect bdrv_append() to determine if the new fields need to be
- * copied as well.
- */
 struct BlockDriverState {
     /* Protected by big QEMU lock or read-only after opening.  No special
      * locking needed during I/O...
-- 
2.31.1




* [PATCH v2 2/6] block: block-status cache for data regions
  2021-06-23 15:01 [PATCH v2 0/6] block: block-status cache for data regions Max Reitz
  2021-06-23 15:01 ` [PATCH v2 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz
@ 2021-06-23 15:01 ` Max Reitz
  2021-06-24 10:05   ` Vladimir Sementsov-Ogievskiy
  2021-07-06 17:04   ` Kevin Wolf
  2021-06-23 15:01 ` [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum Max Reitz
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 18+ messages in thread
From: Max Reitz @ 2021-06-23 15:01 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel,
	Max Reitz

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h | 47 ++++++++++++++++++++++
 block.c                   | 84 +++++++++++++++++++++++++++++++++++++++
 block/io.c                | 61 ++++++++++++++++++++++++++--
 3 files changed, 189 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..fcb599dd1c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,22 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @valid: Whether the cache is valid (should be accessed with atomic
+ *         functions so this can be reset by RCU readers)
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *            the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+    bool valid;
+    int64_t data_start;
+    int64_t data_end;
+} BdrvBlockStatusCache;
+
 struct BlockDriverState {
     /* Protected by big QEMU lock or read-only after opening.  No special
      * locking needed during I/O...
@@ -997,6 +1013,11 @@ struct BlockDriverState {
 
     /* BdrvChild links to this node may never be frozen */
     bool never_freeze;
+
+    /* Lock for block-status cache RCU writers */
+    CoMutex bsc_modify_lock;
+    /* Always non-NULL, but must only be dereferenced under an RCU read guard */
+    BdrvBlockStatusCache *block_status_cache;
 };
 
 struct BlockBackendRootState {
@@ -1422,4 +1443,30 @@ static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
  */
 void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
 
+/**
+ * Check whether the given offset is in the cached block-status data
+ * region.
+ *
+ * If it is, and @pnum is not NULL, *pnum is set to
+ * `bsc.data_end - offset`, i.e. how many bytes, starting from
+ * @offset, are data (according to the cache).
+ * Otherwise, *pnum is not touched.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
+
+/**
+ * If [offset, offset + bytes) overlaps with the currently cached
+ * block-status region, invalidate the cache.
+ *
+ * (To be used by I/O paths that cause data regions to be zero or
+ * holes.)
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+                               int64_t offset, int64_t bytes);
+
+/**
+ * Mark the range [offset, offset + bytes) as a data region.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
+
 #endif /* BLOCK_INT_H */
diff --git a/block.c b/block.c
index 3f456892d0..9ab9459f7a 100644
--- a/block.c
+++ b/block.c
@@ -49,6 +49,8 @@
 #include "qemu/timer.h"
 #include "qemu/cutils.h"
 #include "qemu/id.h"
+#include "qemu/range.h"
+#include "qemu/rcu.h"
 #include "block/coroutines.h"
 
 #ifdef CONFIG_BSD
@@ -398,6 +400,9 @@ BlockDriverState *bdrv_new(void)
 
     qemu_co_queue_init(&bs->flush_queue);
 
+    qemu_co_mutex_init(&bs->bsc_modify_lock);
+    bs->block_status_cache = g_new0(BdrvBlockStatusCache, 1);
+
     for (i = 0; i < bdrv_drain_all_count; i++) {
         bdrv_drained_begin(bs);
     }
@@ -4635,6 +4640,8 @@ static void bdrv_close(BlockDriverState *bs)
     bs->explicit_options = NULL;
     qobject_unref(bs->full_open_options);
     bs->full_open_options = NULL;
+    g_free(bs->block_status_cache);
+    bs->block_status_cache = NULL;
 
     bdrv_release_named_dirty_bitmaps(bs);
     assert(QLIST_EMPTY(&bs->dirty_bitmaps));
@@ -7590,3 +7597,80 @@ BlockDriverState *bdrv_backing_chain_next(BlockDriverState *bs)
 {
     return bdrv_skip_filters(bdrv_cow_bs(bdrv_skip_filters(bs)));
 }
+
+/**
+ * Check whether [offset, offset + bytes) overlaps with the cached
+ * block-status data region.
+ *
+ * If so, and @pnum is not NULL, set *pnum to `bsc.data_end - offset`,
+ * which is what bdrv_bsc_is_data()'s interface needs.
+ * Otherwise, *pnum is not touched.
+ */
+static bool bdrv_bsc_range_overlaps_locked(BlockDriverState *bs,
+                                           int64_t offset, int64_t bytes,
+                                           int64_t *pnum)
+{
+    BdrvBlockStatusCache *bsc = bs->block_status_cache;
+    bool overlaps;
+
+    overlaps =
+        qatomic_read(&bsc->valid) &&
+        ranges_overlap(offset, bytes, bsc->data_start,
+                       bsc->data_end - bsc->data_start);
+
+    if (overlaps && pnum) {
+        *pnum = bsc->data_end - offset;
+    }
+
+    return overlaps;
+}
+
+/**
+ * See block_int.h for this function's documentation.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum)
+{
+    bool overlaps;
+
+    WITH_RCU_READ_LOCK_GUARD() {
+        overlaps = bdrv_bsc_range_overlaps_locked(bs, offset, 1, pnum);
+    }
+
+    return overlaps;
+}
+
+/**
+ * See block_int.h for this function's documentation.
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+                               int64_t offset, int64_t bytes)
+{
+    WITH_RCU_READ_LOCK_GUARD() {
+        if (bdrv_bsc_range_overlaps_locked(bs, offset, bytes, NULL)) {
+            qatomic_set(&bs->block_status_cache->valid, false);
+        }
+    }
+}
+
+/**
+ * See block_int.h for this function's documentation.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+    BdrvBlockStatusCache *new_bsc = g_new(BdrvBlockStatusCache, 1);
+    BdrvBlockStatusCache *old_bsc;
+
+    *new_bsc = (BdrvBlockStatusCache) {
+        .valid = true,
+        .data_start = offset,
+        .data_end = offset + bytes,
+    };
+
+    WITH_QEMU_LOCK_GUARD(&bs->bsc_modify_lock) {
+        old_bsc = bs->block_status_cache;
+        qatomic_rcu_set(&bs->block_status_cache, new_bsc);
+        synchronize_rcu();
+    }
+
+    g_free(old_bsc);
+}
diff --git a/block/io.c b/block/io.c
index 323854d063..85fa449bf9 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1878,6 +1878,9 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
         return -ENOTSUP;
     }
 
+    /* Invalidate the cached block-status data range if this write overlaps */
+    bdrv_bsc_invalidate_range(bs, offset, bytes);
+
     assert(alignment % bs->bl.request_alignment == 0);
     head = offset % alignment;
     tail = (offset + bytes) % alignment;
@@ -2442,9 +2445,58 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
 
     if (bs->drv->bdrv_co_block_status) {
-        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
-                                            aligned_bytes, pnum, &local_map,
-                                            &local_file);
+        bool from_cache = false;
+
+        /*
+         * Use the block-status cache only for protocol nodes: Format
+         * drivers are generally quick to inquire the status, but protocol
+         * drivers often need to get information from outside of qemu, so
+         * we do not have control over the actual implementation.  There
+         * have been cases where inquiring the status took an unreasonably
+         * long time, and we can do nothing in qemu to fix it.
+         * This is especially problematic for images with large data areas,
+         * because finding the few holes in them and giving them special
+         * treatment does not gain much performance.  Therefore, we try to
+         * cache the last-identified data region.
+         *
+         * Second, limiting ourselves to protocol nodes allows us to assume
+         * the block status for data regions to be DATA | OFFSET_VALID, and
+         * that the host offset is the same as the guest offset.
+         *
+         * Note that it is possible that external writers zero parts of
+         * the cached regions without the cache being invalidated, and so
+         * we may report zeroes as data.  This is not catastrophic,
+         * however, because reporting zeroes as data is fine.
+         */
+        if (QLIST_EMPTY(&bs->children)) {
+            if (bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
+                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+                local_file = bs;
+                local_map = aligned_offset;
+
+                from_cache = true;
+            }
+        }
+
+        if (!from_cache) {
+            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+                                                aligned_bytes, pnum, &local_map,
+                                                &local_file);
+
+            /*
+             * Note that checking QLIST_EMPTY(&bs->children) is also done when
+             * the cache is queried above.  Technically, we do not need to check
+             * it here; the worst that can happen is that we fill the cache for
+             * non-protocol nodes, and then it is never used.  However, filling
+             * the cache requires an RCU update, so double check here to avoid
+             * such an update if possible.
+             */
+            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
+                QLIST_EMPTY(&bs->children))
+            {
+                bdrv_bsc_fill(bs, aligned_offset, *pnum);
+            }
+        }
     } else {
         /* Default code for filters */
 
@@ -2997,6 +3049,9 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
         return 0;
     }
 
+    /* Invalidate the cached block-status data range if this discard overlaps */
+    bdrv_bsc_invalidate_range(bs, offset, bytes);
+
     /* Discard is advisory, but some devices track and coalesce
      * unaligned requests, so we must pass everything down rather than
      * round here.  Still, most devices will just silently ignore
-- 
2.31.1




* [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum
  2021-06-23 15:01 [PATCH v2 0/6] block: block-status cache for data regions Max Reitz
  2021-06-23 15:01 ` [PATCH v2 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz
  2021-06-23 15:01 ` [PATCH v2 2/6] block: block-status cache for data regions Max Reitz
@ 2021-06-23 15:01 ` Max Reitz
  2021-06-24  9:15   ` Vladimir Sementsov-Ogievskiy
  2021-06-23 15:01 ` [PATCH v2 4/6] block/file-posix: Do not force-cap *pnum Max Reitz
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Max Reitz @ 2021-06-23 15:01 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel,
	Max Reitz

.bdrv_co_block_status() implementations are free to return a *pnum that
exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp
*pnum as necessary.

On the other hand, if drivers' implementations return values for *pnum
that are as large as possible, our recently introduced block-status
cache will become more effective.

So, make a note in block_int.h that @bytes is no upper limit for *pnum.
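
(Purely as an illustration, not part of the patch: a driver asked
about, say, 64 KiB may report the whole data extent it found, and it
is block/io.c that narrows the result for the caller.  Names like
data_end below are placeholders:)

    /* In a driver's .bdrv_co_block_status() (e.g. file-posix after
     * patch 4): report the full extent found, even beyond @bytes */
    *pnum = data_end - offset;   /* may be much larger than @bytes */

    /* In block/io.c's bdrv_co_block_status(): the unclamped value can
     * feed the block-status cache for protocol nodes ... */
    bdrv_bsc_fill(bs, aligned_offset, *pnum);

    /* ... while the caller only gets back a clamped result, roughly: */
    if (*pnum > bytes) {
        *pnum = bytes;
    }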

Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index fcb599dd1c..f85b96ed99 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -347,6 +347,11 @@ struct BlockDriver {
      * clamped to bdrv_getlength() and aligned to request_alignment,
      * as well as non-NULL pnum, map, and file; in turn, the driver
      * must return an error or set pnum to an aligned non-zero value.
+     *
+     * Note that @bytes is just a hint on how big of a region the
+     * caller wants to inspect.  It is not a limit on *pnum.
+     * Implementations are free to return larger values of *pnum if
+     * doing so does not incur a performance penalty.
      */
     int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs,
         bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
-- 
2.31.1




* [PATCH v2 4/6] block/file-posix: Do not force-cap *pnum
  2021-06-23 15:01 [PATCH v2 0/6] block: block-status cache for data regions Max Reitz
                   ` (2 preceding siblings ...)
  2021-06-23 15:01 ` [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum Max Reitz
@ 2021-06-23 15:01 ` Max Reitz
  2021-06-23 15:01 ` [PATCH v2 5/6] block/gluster: " Max Reitz
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2021-06-23 15:01 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel,
	Max Reitz

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/file-posix.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index b3fbb9bd63..aeb370d5bb 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2689,7 +2689,8 @@ static int find_allocation(BlockDriverState *bs, off_t start,
  * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
  */
 static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
                                             bool want_zero,
@@ -2727,7 +2728,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     } else if (data == offset) {
         /* On a data extent, compute bytes to the end of the extent,
          * possibly including a partial sector at EOF. */
-        *pnum = MIN(bytes, hole - offset);
+        *pnum = hole - offset;
 
         /*
          * We are not allowed to return partial sectors, though, so
@@ -2746,7 +2747,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
-        *pnum = MIN(bytes, data - offset);
+        *pnum = data - offset;
         ret = BDRV_BLOCK_ZERO;
     }
     *map = offset;
-- 
2.31.1




* [PATCH v2 5/6] block/gluster: Do not force-cap *pnum
  2021-06-23 15:01 [PATCH v2 0/6] block: block-status cache for data regions Max Reitz
                   ` (3 preceding siblings ...)
  2021-06-23 15:01 ` [PATCH v2 4/6] block/file-posix: Do not force-cap *pnum Max Reitz
@ 2021-06-23 15:01 ` Max Reitz
  2021-06-23 15:01 ` [PATCH v2 6/6] block/iscsi: " Max Reitz
  2021-07-06 17:06 ` [PATCH v2 0/6] block: block-status cache for data regions Kevin Wolf
  6 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2021-06-23 15:01 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel,
	Max Reitz

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/gluster.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index e8ee14c8e9..8ef7bb18d5 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1461,7 +1461,8 @@ exit:
  * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
  *
  * (Based on raw_co_block_status() from file-posix.c.)
  */
@@ -1500,12 +1501,12 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs,
     } else if (data == offset) {
         /* On a data extent, compute bytes to the end of the extent,
          * possibly including a partial sector at EOF. */
-        *pnum = MIN(bytes, hole - offset);
+        *pnum = hole - offset;
         ret = BDRV_BLOCK_DATA;
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
-        *pnum = MIN(bytes, data - offset);
+        *pnum = data - offset;
         ret = BDRV_BLOCK_ZERO;
     }
 
-- 
2.31.1




* [PATCH v2 6/6] block/iscsi: Do not force-cap *pnum
  2021-06-23 15:01 [PATCH v2 0/6] block: block-status cache for data regions Max Reitz
                   ` (4 preceding siblings ...)
  2021-06-23 15:01 ` [PATCH v2 5/6] block/gluster: " Max Reitz
@ 2021-06-23 15:01 ` Max Reitz
  2021-07-06 17:06 ` [PATCH v2 0/6] block: block-status cache for data regions Kevin Wolf
  6 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2021-06-23 15:01 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel,
	Max Reitz

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/iscsi.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 4d2a416ce7..852384086b 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -781,9 +781,6 @@ retry:
         iscsi_allocmap_set_allocated(iscsilun, offset, *pnum);
     }
 
-    if (*pnum > bytes) {
-        *pnum = bytes;
-    }
 out_unlock:
     qemu_mutex_unlock(&iscsilun->mutex);
     g_free(iTask.err_str);
-- 
2.31.1




* Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum
  2021-06-23 15:01 ` [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum Max Reitz
@ 2021-06-24  9:15   ` Vladimir Sementsov-Ogievskiy
  2021-06-24 10:16     ` Max Reitz
  0 siblings, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-24  9:15 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf, Eric Blake

23.06.2021 18:01, Max Reitz wrote:
> .bdrv_co_block_status() implementations are free to return a *pnum that
> exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp
> *pnum as necessary.
> 
> On the other hand, if drivers' implementations return values for *pnum
> that are as large as possible, our recently introduced block-status
> cache will become more effective.
> 
> So, make a note in block_int.h that @bytes is no upper limit for *pnum.
> 
> Suggested-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   include/block/block_int.h | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index fcb599dd1c..f85b96ed99 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -347,6 +347,11 @@ struct BlockDriver {
>        * clamped to bdrv_getlength() and aligned to request_alignment,
>        * as well as non-NULL pnum, map, and file; in turn, the driver
>        * must return an error or set pnum to an aligned non-zero value.
> +     *
> +     * Note that @bytes is just a hint on how big of a region the
> +     * caller wants to inspect.  It is not a limit on *pnum.
> +     * Implementations are free to return larger values of *pnum if
> +     * doing so does not incur a performance penalty.

Worth mentioning that the cache will benefit from it?

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

>        */
>       int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs,
>           bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
> 


-- 
Best regards,
Vladimir



* Re: [PATCH v2 2/6] block: block-status cache for data regions
  2021-06-23 15:01 ` [PATCH v2 2/6] block: block-status cache for data regions Max Reitz
@ 2021-06-24 10:05   ` Vladimir Sementsov-Ogievskiy
  2021-06-24 11:11     ` Max Reitz
  2021-07-06 17:04   ` Kevin Wolf
  1 sibling, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-24 10:05 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf, Eric Blake

23.06.2021 18:01, Max Reitz wrote:
> As we have attempted before
> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
> "file-posix: Cache lseek result for data regions";
> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
> "file-posix: Cache next hole"), this patch seeks to reduce the number of
> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
> main difference is that this time it is implemented as part of the
> general block layer code.
> 
> The problem we face is that on some filesystems or in some
> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
> implementation is outside of qemu, there is little we can do about its
> performance.
> 
> We have already introduced the want_zero parameter to
> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
> unless we really want zero information; but sometimes we do want that
> information, because for files that consist largely of zero areas,
> special-casing those areas can give large performance boosts.  So the
> real problem is with files that consist largely of data, so that
> inquiring the block status does not gain us much performance, but where
> such an inquiry itself takes a lot of time.
> 
> To address this, we want to cache data regions.  Most of the time, when
> bad performance is reported, it is in places where the image is iterated
> over from start to end (qemu-img convert or the mirror job), so a simple
> yet effective solution is to cache only the current data region.
> 
> (Note that only caching data regions but not zero regions means that
> returning false information from the cache is not catastrophic: Treating
> zeroes as data is fine.  While we try to invalidate the cache on zero
> writes and discards, such incongruences may still occur when there are
> other processes writing to the image.)
> 
> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Max Reitz <mreitz@redhat.com>

I'm new to RCU, so my review may not be reliable...

> ---
>   include/block/block_int.h | 47 ++++++++++++++++++++++
>   block.c                   | 84 +++++++++++++++++++++++++++++++++++++++
>   block/io.c                | 61 ++++++++++++++++++++++++++--
>   3 files changed, 189 insertions(+), 3 deletions(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index a8f9598102..fcb599dd1c 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -832,6 +832,22 @@ struct BdrvChild {
>       QLIST_ENTRY(BdrvChild) next_parent;
>   };
>   
> +/*
> + * Allows bdrv_co_block_status() to cache one data region for a
> + * protocol node.
> + *
> + * @valid: Whether the cache is valid (should be accessed with atomic
> + *         functions so this can be reset by RCU readers)
> + * @data_start: Offset where we know (or strongly assume) is data
> + * @data_end: Offset where the data region ends (which is not necessarily
> + *            the start of a zeroed region)
> + */
> +typedef struct BdrvBlockStatusCache {
> +    bool valid;
> +    int64_t data_start;
> +    int64_t data_end;
> +} BdrvBlockStatusCache;
> +
>   struct BlockDriverState {
>       /* Protected by big QEMU lock or read-only after opening.  No special
>        * locking needed during I/O...
> @@ -997,6 +1013,11 @@ struct BlockDriverState {
>   
>       /* BdrvChild links to this node may never be frozen */
>       bool never_freeze;
> +
> +    /* Lock for block-status cache RCU writers */
> +    CoMutex bsc_modify_lock;
> +    /* Always non-NULL, but must only be dereferenced under an RCU read guard */
> +    BdrvBlockStatusCache *block_status_cache;
>   };
>   
>   struct BlockBackendRootState {
> @@ -1422,4 +1443,30 @@ static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
>    */
>   void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
>   
> +/**
> + * Check whether the given offset is in the cached block-status data
> + * region.
> + *
> + * If it is, and @pnum is not NULL, *pnum is set to
> + * `bsc.data_end - offset`, i.e. how many bytes, starting from
> + * @offset, are data (according to the cache).
> + * Otherwise, *pnum is not touched.
> + */
> +bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
> +
> +/**
> + * If [offset, offset + bytes) overlaps with the currently cached
> + * block-status region, invalidate the cache.
> + *
> + * (To be used by I/O paths that cause data regions to be zero or
> + * holes.)
> + */
> +void bdrv_bsc_invalidate_range(BlockDriverState *bs,
> +                               int64_t offset, int64_t bytes);
> +
> +/**
> + * Mark the range [offset, offset + bytes) as a data region.
> + */
> +void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
> +
>   #endif /* BLOCK_INT_H */
> diff --git a/block.c b/block.c
> index 3f456892d0..9ab9459f7a 100644
> --- a/block.c
> +++ b/block.c
> @@ -49,6 +49,8 @@
>   #include "qemu/timer.h"
>   #include "qemu/cutils.h"
>   #include "qemu/id.h"
> +#include "qemu/range.h"
> +#include "qemu/rcu.h"
>   #include "block/coroutines.h"
>   
>   #ifdef CONFIG_BSD
> @@ -398,6 +400,9 @@ BlockDriverState *bdrv_new(void)
>   
>       qemu_co_queue_init(&bs->flush_queue);
>   
> +    qemu_co_mutex_init(&bs->bsc_modify_lock);
> +    bs->block_status_cache = g_new0(BdrvBlockStatusCache, 1);
> +
>       for (i = 0; i < bdrv_drain_all_count; i++) {
>           bdrv_drained_begin(bs);
>       }
> @@ -4635,6 +4640,8 @@ static void bdrv_close(BlockDriverState *bs)
>       bs->explicit_options = NULL;
>       qobject_unref(bs->full_open_options);
>       bs->full_open_options = NULL;
> +    g_free(bs->block_status_cache);
> +    bs->block_status_cache = NULL;
>   
>       bdrv_release_named_dirty_bitmaps(bs);
>       assert(QLIST_EMPTY(&bs->dirty_bitmaps));
> @@ -7590,3 +7597,80 @@ BlockDriverState *bdrv_backing_chain_next(BlockDriverState *bs)
>   {
>       return bdrv_skip_filters(bdrv_cow_bs(bdrv_skip_filters(bs)));
>   }
> +
> +/**
> + * Check whether [offset, offset + bytes) overlaps with the cached
> + * block-status data region.
> + *
> + * If so, and @pnum is not NULL, set *pnum to `bsc.data_end - offset`,
> + * which is what bdrv_bsc_is_data()'s interface needs.
> + * Otherwise, *pnum is not touched.
> + */
> +static bool bdrv_bsc_range_overlaps_locked(BlockDriverState *bs,
> +                                           int64_t offset, int64_t bytes,
> +                                           int64_t *pnum)
> +{
> +    BdrvBlockStatusCache *bsc = bs->block_status_cache;

Shouldn't this use qatomic_rcu_read()?

> +    bool overlaps;
> +
> +    overlaps =
> +        qatomic_read(&bsc->valid) &&

Hmm. Why do you need atomic access? I thought that after getting the RCU pointer, we are safe to read the fields.

Ah, I see, you also want to set it in RCU-reader code...

Isn't it better just to do a normal RCU write and set the pointer to NULL when the cache becomes invalid? I think keeping a heap-allocated structure with valid=false inside doesn't make much sense.

> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start);
> +
> +    if (overlaps && pnum) {
> +        *pnum = bsc->data_end - offset;
> +    }
> +
> +    return overlaps;
> +}
> +
> +/**
> + * See block_int.h for this function's documentation.
> + */
> +bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum)
> +{
> +    bool overlaps;
> +
> +    WITH_RCU_READ_LOCK_GUARD() {
> +        overlaps = bdrv_bsc_range_overlaps_locked(bs, offset, 1, pnum);
> +    }
> +
> +    return overlaps;
> +}

This could be written more simply, I think:

RCU_READ_LOCK_GUARD();
return bdrv_bsc_range_overlaps_locked(..);

> +
> +/**
> + * See block_int.h for this function's documentation.
> + */
> +void bdrv_bsc_invalidate_range(BlockDriverState *bs,
> +                               int64_t offset, int64_t bytes)
> +{
> +    WITH_RCU_READ_LOCK_GUARD() {
> +        if (bdrv_bsc_range_overlaps_locked(bs, offset, bytes, NULL)) {
> +            qatomic_set(&bs->block_status_cache->valid, false);
> +        }
> +    }
> +}

Same here: why not use RCU_READ_LOCK_GUARD()?

> +
> +/**
> + * See block_int.h for this function's documentation.
> + */
> +void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes)
> +{
> +    BdrvBlockStatusCache *new_bsc = g_new(BdrvBlockStatusCache, 1);
> +    BdrvBlockStatusCache *old_bsc;
> +
> +    *new_bsc = (BdrvBlockStatusCache) {
> +        .valid = true,
> +        .data_start = offset,
> +        .data_end = offset + bytes,
> +    };
> +
> +    WITH_QEMU_LOCK_GUARD(&bs->bsc_modify_lock) {
> +        old_bsc = bs->block_status_cache;
> +        qatomic_rcu_set(&bs->block_status_cache, new_bsc);
> +        synchronize_rcu();

Interesting that, until now, synchronize_rcu() has only been used in tests... (I tried to search for examples of RCU writing in the code.)

> +    }
> +
> +    g_free(old_bsc);
> +}
> diff --git a/block/io.c b/block/io.c
> index 323854d063..85fa449bf9 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1878,6 +1878,9 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>           return -ENOTSUP;
>       }
>   
> +    /* Invalidate the cached block-status data range if this write overlaps */
> +    bdrv_bsc_invalidate_range(bs, offset, bytes);
> +
>       assert(alignment % bs->bl.request_alignment == 0);
>       head = offset % alignment;
>       tail = (offset + bytes) % alignment;
> @@ -2442,9 +2445,58 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>       aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>   
>       if (bs->drv->bdrv_co_block_status) {
> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> -                                            aligned_bytes, pnum, &local_map,
> -                                            &local_file);
> +        bool from_cache = false;
> +
> +        /*
> +         * Use the block-status cache only for protocol nodes: Format
> +         * drivers are generally quick to inquire the status, but protocol
> +         * drivers often need to get information from outside of qemu, so
> +         * we do not have control over the actual implementation.  There
> +         * have been cases where inquiring the status took an unreasonably
> +         * long time, and we can do nothing in qemu to fix it.
> +         * This is especially problematic for images with large data areas,
> +         * because finding the few holes in them and giving them special
> +         * treatment does not gain much performance.  Therefore, we try to
> +         * cache the last-identified data region.
> +         *
> +         * Second, limiting ourselves to protocol nodes allows us to assume
> +         * the block status for data regions to be DATA | OFFSET_VALID, and
> +         * that the host offset is the same as the guest offset.
> +         *
> +         * Note that it is possible that external writers zero parts of
> +         * the cached regions without the cache being invalidated, and so
> +         * we may report zeroes as data.  This is not catastrophic,
> +         * however, because reporting zeroes as data is fine.
> +         */
> +        if (QLIST_EMPTY(&bs->children)) {
> +            if (bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +                local_file = bs;
> +                local_map = aligned_offset;
> +
> +                from_cache = true;
> +            }
> +        }
> +
> +        if (!from_cache) {
> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> +                                                aligned_bytes, pnum, &local_map,
> +                                                &local_file);
> +
> +            /*
> +             * Note that checking QLIST_EMPTY(&bs->children) is also done when
> +             * the cache is queried above.  Technically, we do not need to check
> +             * it here; the worst that can happen is that we fill the cache for
> +             * non-protocol nodes, and then it is never used.  However, filling
> +             * the cache requires an RCU update, so double check here to avoid
> +             * such an update if possible.
> +             */
> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
> +                QLIST_EMPTY(&bs->children))
> +            {
> +                bdrv_bsc_fill(bs, aligned_offset, *pnum);
> +            }
> +        }
>       } else {
>           /* Default code for filters */
>   
> @@ -2997,6 +3049,9 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>           return 0;
>       }
>   
> +    /* Invalidate the cached block-status data range if this discard overlaps */
> +    bdrv_bsc_invalidate_range(bs, offset, bytes);
> +
>       /* Discard is advisory, but some devices track and coalesce
>        * unaligned requests, so we must pass everything down rather than
>        * round here.  Still, most devices will just silently ignore
> 


-- 
Best regards,
Vladimir



* Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum
  2021-06-24  9:15   ` Vladimir Sementsov-Ogievskiy
@ 2021-06-24 10:16     ` Max Reitz
  2021-06-24 10:25       ` Vladimir Sementsov-Ogievskiy
  2021-06-28 19:10       ` Eric Blake
  0 siblings, 2 replies; 18+ messages in thread
From: Max Reitz @ 2021-06-24 10:16 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Eric Blake, qemu-devel

On 24.06.21 11:15, Vladimir Sementsov-Ogievskiy wrote:
> 23.06.2021 18:01, Max Reitz wrote:
>> .bdrv_co_block_status() implementations are free to return a *pnum that
>> exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp
>> *pnum as necessary.
>>
>> On the other hand, if drivers' implementations return values for *pnum
>> that are as large as possible, our recently introduced block-status
>> cache will become more effective.
>>
>> So, make a note in block_int.h that @bytes is no upper limit for *pnum.
>>
>> Suggested-by: Eric Blake <eblake@redhat.com>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   include/block/block_int.h | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index fcb599dd1c..f85b96ed99 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -347,6 +347,11 @@ struct BlockDriver {
>>        * clamped to bdrv_getlength() and aligned to request_alignment,
>>        * as well as non-NULL pnum, map, and file; in turn, the driver
>>        * must return an error or set pnum to an aligned non-zero value.
>> +     *
>> +     * Note that @bytes is just a hint on how big of a region the
>> +     * caller wants to inspect.  It is not a limit on *pnum.
>> +     * Implementations are free to return larger values of *pnum if
>> +     * doing so does not incur a performance penalty.
>
> Worth mention that the cache will benefit of it?

Oh, right, absolutely.  Like so:

"block/io.c's bdrv_co_block_status() will clamp *pnum before returning 
it to its caller, but it itself can still make use of the unclamped 
*pnum value.  Specifically, the block-status cache for protocol nodes 
will benefit from storing as large a region as possible."

?

Max




* Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum
  2021-06-24 10:16     ` Max Reitz
@ 2021-06-24 10:25       ` Vladimir Sementsov-Ogievskiy
  2021-06-24 11:12         ` Max Reitz
  2021-06-28 19:10       ` Eric Blake
  1 sibling, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-24 10:25 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf, Eric Blake

24.06.2021 13:16, Max Reitz wrote:
> On 24.06.21 11:15, Vladimir Sementsov-Ogievskiy wrote:
>> 23.06.2021 18:01, Max Reitz wrote:
>>> .bdrv_co_block_status() implementations are free to return a *pnum that
>>> exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp
>>> *pnum as necessary.
>>>
>>> On the other hand, if drivers' implementations return values for *pnum
>>> that are as large as possible, our recently introduced block-status
>>> cache will become more effective.
>>>
>>> So, make a note in block_int.h that @bytes is no upper limit for *pnum.
>>>
>>> Suggested-by: Eric Blake <eblake@redhat.com>
>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>   include/block/block_int.h | 5 +++++
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>>> index fcb599dd1c..f85b96ed99 100644
>>> --- a/include/block/block_int.h
>>> +++ b/include/block/block_int.h
>>> @@ -347,6 +347,11 @@ struct BlockDriver {
>>>        * clamped to bdrv_getlength() and aligned to request_alignment,
>>>        * as well as non-NULL pnum, map, and file; in turn, the driver
>>>        * must return an error or set pnum to an aligned non-zero value.
>>> +     *
>>> +     * Note that @bytes is just a hint on how big of a region the
>>> +     * caller wants to inspect.  It is not a limit on *pnum.
>>> +     * Implementations are free to return larger values of *pnum if
>>> +     * doing so does not incur a performance penalty.
>>
>> Worth mention that the cache will benefit of it?
> 
> Oh, right, absolutely.  Like so:
> 
> "block/io.c's bdrv_co_block_status() will clamp *pnum before returning it to its caller, but it itself can still make use of the unclamped *pnum value.  Specifically, the block-status cache for protocol nodes will benefit from storing as large a region as possible."
> 

Sounds good. Do you mean this as an addition or substitution? If the latter, I'd keep "if doing so does not incur a performance penalty"



-- 
Best regards,
Vladimir



* Re: [PATCH v2 2/6] block: block-status cache for data regions
  2021-06-24 10:05   ` Vladimir Sementsov-Ogievskiy
@ 2021-06-24 11:11     ` Max Reitz
  0 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2021-06-24 11:11 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Eric Blake, qemu-devel

On 24.06.21 12:05, Vladimir Sementsov-Ogievskiy wrote:
> 23.06.2021 18:01, Max Reitz wrote:
>> As we have attempted before
>> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
>> "file-posix: Cache lseek result for data regions";
>> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
>> "file-posix: Cache next hole"), this patch seeks to reduce the number of
>> SEEK_DATA/HOLE operations the file-posix driver has to perform. The
>> main difference is that this time it is implemented as part of the
>> general block layer code.
>>
>> The problem we face is that on some filesystems or in some
>> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
>> implementation is outside of qemu, there is little we can do about its
>> performance.
>>
>> We have already introduced the want_zero parameter to
>> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
>> unless we really want zero information; but sometimes we do want that
>> information, because for files that consist largely of zero areas,
>> special-casing those areas can give large performance boosts. So the
>> real problem is with files that consist largely of data, so that
>> inquiring the block status does not gain us much performance, but where
>> such an inquiry itself takes a lot of time.
>>
>> To address this, we want to cache data regions.  Most of the time, when
>> bad performance is reported, it is in places where the image is iterated
>> over from start to end (qemu-img convert or the mirror job), so a simple
>> yet effective solution is to cache only the current data region.
>>
>> (Note that only caching data regions but not zero regions means that
>> returning false information from the cache is not catastrophic: Treating
>> zeroes as data is fine.  While we try to invalidate the cache on zero
>> writes and discards, such incongruences may still occur when there are
>> other processes writing to the image.)
>>
>> We only use the cache for nodes without children (i.e. protocol nodes),
>> because that is where the problem is: Drivers that rely on block-status
>> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
>>
>> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>
> I'm new to RCU, so my review can't be reliable..

Yeah, well, same here. :)

>> ---
>>   include/block/block_int.h | 47 ++++++++++++++++++++++
>>   block.c                   | 84 +++++++++++++++++++++++++++++++++++++++
>>   block/io.c                | 61 ++++++++++++++++++++++++++--
>>   3 files changed, 189 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index a8f9598102..fcb599dd1c 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -832,6 +832,22 @@ struct BdrvChild {
>>       QLIST_ENTRY(BdrvChild) next_parent;
>>   };
>>   +/*
>> + * Allows bdrv_co_block_status() to cache one data region for a
>> + * protocol node.
>> + *
>> + * @valid: Whether the cache is valid (should be accessed with atomic
>> + *         functions so this can be reset by RCU readers)
>> + * @data_start: Offset where we know (or strongly assume) is data
>> + * @data_end: Offset where the data region ends (which is not 
>> necessarily
>> + *            the start of a zeroed region)
>> + */
>> +typedef struct BdrvBlockStatusCache {
>> +    bool valid;
>> +    int64_t data_start;
>> +    int64_t data_end;
>> +} BdrvBlockStatusCache;
>> +
>>   struct BlockDriverState {
>>       /* Protected by big QEMU lock or read-only after opening. No 
>> special
>>        * locking needed during I/O...
>> @@ -997,6 +1013,11 @@ struct BlockDriverState {
>>         /* BdrvChild links to this node may never be frozen */
>>       bool never_freeze;
>> +
>> +    /* Lock for block-status cache RCU writers */
>> +    CoMutex bsc_modify_lock;
>> +    /* Always non-NULL, but must only be dereferenced under an RCU 
>> read guard */
>> +    BdrvBlockStatusCache *block_status_cache;
>>   };
>>     struct BlockBackendRootState {
>> @@ -1422,4 +1443,30 @@ static inline BlockDriverState 
>> *bdrv_primary_bs(BlockDriverState *bs)
>>    */
>>   void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
>>   +/**
>> + * Check whether the given offset is in the cached block-status data
>> + * region.
>> + *
>> + * If it is, and @pnum is not NULL, *pnum is set to
>> + * `bsc.data_end - offset`, i.e. how many bytes, starting from
>> + * @offset, are data (according to the cache).
>> + * Otherwise, *pnum is not touched.
>> + */
>> +bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t 
>> *pnum);
>> +
>> +/**
>> + * If [offset, offset + bytes) overlaps with the currently cached
>> + * block-status region, invalidate the cache.
>> + *
>> + * (To be used by I/O paths that cause data regions to be zero or
>> + * holes.)
>> + */
>> +void bdrv_bsc_invalidate_range(BlockDriverState *bs,
>> +                               int64_t offset, int64_t bytes);
>> +
>> +/**
>> + * Mark the range [offset, offset + bytes) as a data region.
>> + */
>> +void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t 
>> bytes);
>> +
>>   #endif /* BLOCK_INT_H */
>> diff --git a/block.c b/block.c
>> index 3f456892d0..9ab9459f7a 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -49,6 +49,8 @@
>>   #include "qemu/timer.h"
>>   #include "qemu/cutils.h"
>>   #include "qemu/id.h"
>> +#include "qemu/range.h"
>> +#include "qemu/rcu.h"
>>   #include "block/coroutines.h"
>>     #ifdef CONFIG_BSD
>> @@ -398,6 +400,9 @@ BlockDriverState *bdrv_new(void)
>>         qemu_co_queue_init(&bs->flush_queue);
>>   +    qemu_co_mutex_init(&bs->bsc_modify_lock);
>> +    bs->block_status_cache = g_new0(BdrvBlockStatusCache, 1);
>> +
>>       for (i = 0; i < bdrv_drain_all_count; i++) {
>>           bdrv_drained_begin(bs);
>>       }
>> @@ -4635,6 +4640,8 @@ static void bdrv_close(BlockDriverState *bs)
>>       bs->explicit_options = NULL;
>>       qobject_unref(bs->full_open_options);
>>       bs->full_open_options = NULL;
>> +    g_free(bs->block_status_cache);
>> +    bs->block_status_cache = NULL;
>>         bdrv_release_named_dirty_bitmaps(bs);
>>       assert(QLIST_EMPTY(&bs->dirty_bitmaps));
>> @@ -7590,3 +7597,80 @@ BlockDriverState 
>> *bdrv_backing_chain_next(BlockDriverState *bs)
>>   {
>>       return bdrv_skip_filters(bdrv_cow_bs(bdrv_skip_filters(bs)));
>>   }
>> +
>> +/**
>> + * Check whether [offset, offset + bytes) overlaps with the cached
>> + * block-status data region.
>> + *
>> + * If so, and @pnum is not NULL, set *pnum to `bsc.data_end - offset`,
>> + * which is what bdrv_bsc_is_data()'s interface needs.
>> + * Otherwise, *pnum is not touched.
>> + */
>> +static bool bdrv_bsc_range_overlaps_locked(BlockDriverState *bs,
>> +                                           int64_t offset, int64_t 
>> bytes,
>> +                                           int64_t *pnum)
>> +{
>> +    BdrvBlockStatusCache *bsc = bs->block_status_cache;
>
> Shouldn't use qatomic_rcu_read() ?

Oh, right, probably so.

>
>> +    bool overlaps;
>> +
>> +    overlaps =
>> +        qatomic_read(&bsc->valid) &&
>
> Hmm. Why you need atomic access? I thought that after getting rcu 
> pointer, we are safe to read the fields.
>
> Ah, I see, you want to also set it in rcu-reader code..
>
> Isn't it better just to do normal rcu-write and set pointer to NULL, 
> when cache becomes invalid? I think keeping heap-allocated structure 
> with valid=false inside doesn't make much sense.

It does, because this way I don’t need an expensive RCU write.
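
(To illustrate the difference, a rough sketch; the second half is what
your suggestion would roughly require, not what the patch does:)

    /* Invalidation as done in the patch: cheap, readers can do it */
    WITH_RCU_READ_LOCK_GUARD() {
        BdrvBlockStatusCache *bsc = qatomic_rcu_read(&bs->block_status_cache);
        qatomic_set(&bsc->valid, false);
    }

    /* Setting the pointer to NULL instead would be a full RCU update
     * every time the cache is invalidated: */
    WITH_QEMU_LOCK_GUARD(&bs->bsc_modify_lock) {
        old_bsc = bs->block_status_cache;
        qatomic_rcu_set(&bs->block_status_cache, NULL);
        synchronize_rcu();
    }
    g_free(old_bsc);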

>> +        ranges_overlap(offset, bytes, bsc->data_start,
>> +                       bsc->data_end - bsc->data_start);
>> +
>> +    if (overlaps && pnum) {
>> +        *pnum = bsc->data_end - offset;
>> +    }
>> +
>> +    return overlaps;
>> +}
>> +
>> +/**
>> + * See block_int.h for this function's documentation.
>> + */
>> +bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t 
>> *pnum)
>> +{
>> +    bool overlaps;
>> +
>> +    WITH_RCU_READ_LOCK_GUARD() {
>> +        overlaps = bdrv_bsc_range_overlaps_locked(bs, offset, 1, pnum);
>> +    }
>> +
>> +    return overlaps;
>> +}
>
> this may be written simpler I think:
>
> RCU_READ_LOCK_GUARD();
> return bdrv_bsc_range_overlaps_locked(..);

Hm, I’ll see whether it grows on me.  I kind of like the explicit scope, 
even if it’s longer.

>> +
>> +/**
>> + * See block_int.h for this function's documentation.
>> + */
>> +void bdrv_bsc_invalidate_range(BlockDriverState *bs,
>> +                               int64_t offset, int64_t bytes)
>> +{
>> +    WITH_RCU_READ_LOCK_GUARD() {
>> +        if (bdrv_bsc_range_overlaps_locked(bs, offset, bytes, NULL)) {
>> + qatomic_set(&bs->block_status_cache->valid, false);
>> +        }
>> +    }
>> +}
>
> Same here, why not use RCU_READ_LOCK_GUARD() ?
>
>> +
>> +/**
>> + * See block_int.h for this function's documentation.
>> + */
>> +void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes)
>> +{
>> +    BdrvBlockStatusCache *new_bsc = g_new(BdrvBlockStatusCache, 1);
>> +    BdrvBlockStatusCache *old_bsc;
>> +
>> +    *new_bsc = (BdrvBlockStatusCache) {
>> +        .valid = true,
>> +        .data_start = offset,
>> +        .data_end = offset + bytes,
>> +    };
>> +
>> +    WITH_QEMU_LOCK_GUARD(&bs->bsc_modify_lock) {
>> +        old_bsc = bs->block_status_cache;
>> +        qatomic_rcu_set(&bs->block_status_cache, new_bsc);
>> +        synchronize_rcu();
>
> Interesting that, until this patch, synchronize_rcu() has been used only in
> tests.. (I tried to search for examples of RCU writing in the code.)

Well, as far as I understood the docs, synchronize_rcu() is a thing that 
can be used, besides call_rcu().  I didn’t want to use call_rcu(), 
because it requires adding an rcu_head struct to the protected object...

Now that I look closer at the docs, it says "it is better" to release 
all locks before synchronize_rcu(), including the BQL. Perhaps I should 
give call_rcu() a try after all.
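
For instance (rough sketch only, assuming BdrvBlockStatusCache grows a
struct rcu_head member, called "rcu" here just for illustration), the update
in bdrv_bsc_fill() could then avoid blocking:

    typedef struct BdrvBlockStatusCache {
        struct rcu_head rcu;  /* hypothetical member for g_free_rcu() */

        int64_t data_start;
        int64_t data_end;
        bool valid;
    } BdrvBlockStatusCache;

    void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes)
    {
        BdrvBlockStatusCache *new_bsc = g_new(BdrvBlockStatusCache, 1);
        BdrvBlockStatusCache *old_bsc;

        *new_bsc = (BdrvBlockStatusCache) {
            .valid = true,
            .data_start = offset,
            .data_end = offset + bytes,
        };

        WITH_QEMU_LOCK_GUARD(&bs->bsc_modify_lock) {
            old_bsc = bs->block_status_cache;
            qatomic_rcu_set(&bs->block_status_cache, new_bsc);
        }

        if (old_bsc) {
            /* Freed once all current readers are done, without blocking here */
            g_free_rcu(old_bsc, rcu);
        }
    }

(If g_free_rcu() turns out not to fit, the same could be done with call_rcu()
and a small reclaim function.)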

Max

>
>
>> +    }
>> +
>> +    g_free(old_bsc);
>> +}
>> diff --git a/block/io.c b/block/io.c
>> index 323854d063..85fa449bf9 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -1878,6 +1878,9 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>           return -ENOTSUP;
>>       }
>>   +    /* Invalidate the cached block-status data range if this write overlaps */
>> +    bdrv_bsc_invalidate_range(bs, offset, bytes);
>> +
>>       assert(alignment % bs->bl.request_alignment == 0);
>>       head = offset % alignment;
>>       tail = (offset + bytes) % alignment;
>> @@ -2442,9 +2445,58 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>>       aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>>         if (bs->drv->bdrv_co_block_status) {
>> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
>> -                                            aligned_bytes, pnum, &local_map,
>> -                                            &local_file);
>> +        bool from_cache = false;
>> +
>> +        /*
>> +         * Use the block-status cache only for protocol nodes: Format
>> +         * drivers are generally quick to inquire the status, but protocol
>> +         * drivers often need to get information from outside of qemu, so
>> +         * we do not have control over the actual implementation.  There
>> +         * have been cases where inquiring the status took an unreasonably
>> +         * long time, and we can do nothing in qemu to fix it.
>> +         * This is especially problematic for images with large data areas,
>> +         * because finding the few holes in them and giving them special
>> +         * treatment does not gain much performance.  Therefore, we try to
>> +         * cache the last-identified data region.
>> +         *
>> +         * Second, limiting ourselves to protocol nodes allows us to assume
>> +         * the block status for data regions to be DATA | OFFSET_VALID, and
>> +         * that the host offset is the same as the guest offset.
>> +         *
>> +         * Note that it is possible that external writers zero parts of
>> +         * the cached regions without the cache being invalidated, and so
>> +         * we may report zeroes as data.  This is not catastrophic,
>> +         * however, because reporting zeroes as data is fine.
>> +         */
>> +        if (QLIST_EMPTY(&bs->children)) {
>> +            if (bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
>> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>> +                local_file = bs;
>> +                local_map = aligned_offset;
>> +
>> +                from_cache = true;
>> +            }
>> +        }
>> +
>> +        if (!from_cache) {
>> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
>> +                                                aligned_bytes, pnum, &local_map,
>> +                                                &local_file);
>> +
>> +            /*
>> +             * Note that checking QLIST_EMPTY(&bs->children) is also done when
>> +             * the cache is queried above.  Technically, we do not need to check
>> +             * it here; the worst that can happen is that we fill the cache for
>> +             * non-protocol nodes, and then it is never used.  However, filling
>> +             * the cache requires an RCU update, so double check here to avoid
>> +             * such an update if possible.
>> +             */
>> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
>> +                QLIST_EMPTY(&bs->children))
>> +            {
>> +                bdrv_bsc_fill(bs, aligned_offset, *pnum);
>> +            }
>> +        }
>>       } else {
>>           /* Default code for filters */
>>   @@ -2997,6 +3049,9 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>>           return 0;
>>       }
>>   +    /* Invalidate the cached block-status data range if this discard overlaps */
>> +    bdrv_bsc_invalidate_range(bs, offset, bytes);
>> +
>>       /* Discard is advisory, but some devices track and coalesce
>>        * unaligned requests, so we must pass everything down rather than
>>        * round here.  Still, most devices will just silently ignore
>>
>
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum
  2021-06-24 10:25       ` Vladimir Sementsov-Ogievskiy
@ 2021-06-24 11:12         ` Max Reitz
  0 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2021-06-24 11:12 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Eric Blake, qemu-devel

On 24.06.21 12:25, Vladimir Sementsov-Ogievskiy wrote:
> 24.06.2021 13:16, Max Reitz wrote:
>> On 24.06.21 11:15, Vladimir Sementsov-Ogievskiy wrote:
>>> 23.06.2021 18:01, Max Reitz wrote:
>>>> .bdrv_co_block_status() implementations are free to return a *pnum 
>>>> that
>>>> exceeds @bytes, because bdrv_co_block_status() in block/io.c will 
>>>> clamp
>>>> *pnum as necessary.
>>>>
>>>> On the other hand, if drivers' implementations return values for *pnum
>>>> that are as large as possible, our recently introduced block-status
>>>> cache will become more effective.
>>>>
>>>> So, make a note in block_int.h that @bytes is no upper limit for 
>>>> *pnum.
>>>>
>>>> Suggested-by: Eric Blake <eblake@redhat.com>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>   include/block/block_int.h | 5 +++++
>>>>   1 file changed, 5 insertions(+)
>>>>
>>>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>>>> index fcb599dd1c..f85b96ed99 100644
>>>> --- a/include/block/block_int.h
>>>> +++ b/include/block/block_int.h
>>>> @@ -347,6 +347,11 @@ struct BlockDriver {
>>>>        * clamped to bdrv_getlength() and aligned to request_alignment,
>>>>        * as well as non-NULL pnum, map, and file; in turn, the driver
>>>>        * must return an error or set pnum to an aligned non-zero 
>>>> value.
>>>> +     *
>>>> +     * Note that @bytes is just a hint on how big of a region the
>>>> +     * caller wants to inspect.  It is not a limit on *pnum.
>>>> +     * Implementations are free to return larger values of *pnum if
>>>> +     * doing so does not incur a performance penalty.
>>>
>>> Worth mentioning that the cache will benefit from it?
>>
>> Oh, right, absolutely.  Like so:
>>
>> "block/io.c's bdrv_co_block_status() will clamp *pnum before 
>> returning it to its caller, but it itself can still make use of the 
>> unclamped *pnum value.  Specifically, the block-status cache for 
>> protocol nodes will benefit from storing as large a region as possible."
>>
>
> Sounds good. Do you mean this as an addition or a substitution? If the
> latter, I'd keep "if doing so does not incur a performance penalty".

I meant it as an addition, just a new paragraph.

Max



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum
  2021-06-24 10:16     ` Max Reitz
  2021-06-24 10:25       ` Vladimir Sementsov-Ogievskiy
@ 2021-06-28 19:10       ` Eric Blake
  2021-07-12  7:47         ` Max Reitz
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Blake @ 2021-06-28 19:10 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

> > > +++ b/include/block/block_int.h
> > > @@ -347,6 +347,11 @@ struct BlockDriver {
> > >        * clamped to bdrv_getlength() and aligned to request_alignment,
> > >        * as well as non-NULL pnum, map, and file; in turn, the driver
> > >        * must return an error or set pnum to an aligned non-zero value.
> > > +     *
> > > +     * Note that @bytes is just a hint on how big of a region the
> > > +     * caller wants to inspect.  It is not a limit on *pnum.
> > > +     * Implementations are free to return larger values of *pnum if
> > > +     * doing so does not incur a performance penalty.
> > 
> > Worth mentioning that the cache will benefit from it?
> 
> Oh, right, absolutely.  Like so:
> 
> "block/io.c's bdrv_co_block_status() will clamp *pnum before returning it to
> its caller, but it itself can still make use of the unclamped *pnum value. 
> Specifically, the block-status cache for protocol nodes will benefit from
> storing as large a region as possible."

How about this tweak to the wording to make it flow a little better:

block/io.c's bdrv_co_block_status() will utilize an unclamped *pnum
value for the block-status cache on protocol nodes, prior to clamping
*pnum for return to its caller.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/6] block: block-status cache for data regions
  2021-06-23 15:01 ` [PATCH v2 2/6] block: block-status cache for data regions Max Reitz
  2021-06-24 10:05   ` Vladimir Sementsov-Ogievskiy
@ 2021-07-06 17:04   ` Kevin Wolf
  2021-07-12  7:45     ` Max Reitz
  1 sibling, 1 reply; 18+ messages in thread
From: Kevin Wolf @ 2021-07-06 17:04 UTC (permalink / raw)
  To: Max Reitz
  Cc: Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel, qemu-block

Am 23.06.2021 um 17:01 hat Max Reitz geschrieben:
> As we have attempted before
> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
> "file-posix: Cache lseek result for data regions";
> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
> "file-posix: Cache next hole"), this patch seeks to reduce the number of
> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
> main difference is that this time it is implemented as part of the
> general block layer code.
> 
> The problem we face is that on some filesystems or in some
> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
> implementation is outside of qemu, there is little we can do about its
> performance.
> 
> We have already introduced the want_zero parameter to
> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
> unless we really want zero information; but sometimes we do want that
> information, because for files that consist largely of zero areas,
> special-casing those areas can give large performance boosts.  So the
> real problem is with files that consist largely of data, so that
> inquiring the block status does not gain us much performance, but where
> such an inquiry itself takes a lot of time.
> 
> To address this, we want to cache data regions.  Most of the time, when
> bad performance is reported, it is in places where the image is iterated
> over from start to end (qemu-img convert or the mirror job), so a simple
> yet effective solution is to cache only the current data region.
> 
> (Note that only caching data regions but not zero regions means that
> returning false information from the cache is not catastrophic: Treating
> zeroes as data is fine.  While we try to invalidate the cache on zero
> writes and discards, such incongruences may still occur when there are
> other processes writing to the image.)
> 
> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Since you indicated that you'll respin the patch, I'll add my minor
comments:

> @@ -2442,9 +2445,58 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>      aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>  
>      if (bs->drv->bdrv_co_block_status) {
> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> -                                            aligned_bytes, pnum, &local_map,
> -                                            &local_file);
> +        bool from_cache = false;
> +
> +        /*
> +         * Use the block-status cache only for protocol nodes: Format
> +         * drivers are generally quick to inquire the status, but protocol
> +         * drivers often need to get information from outside of qemu, so
> +         * we do not have control over the actual implementation.  There
> +         * have been cases where inquiring the status took an unreasonably
> +         * long time, and we can do nothing in qemu to fix it.
> +         * This is especially problematic for images with large data areas,
> +         * because finding the few holes in them and giving them special
> +         * treatment does not gain much performance.  Therefore, we try to
> +         * cache the last-identified data region.
> +         *
> +         * Second, limiting ourselves to protocol nodes allows us to assume
> +         * the block status for data regions to be DATA | OFFSET_VALID, and
> +         * that the host offset is the same as the guest offset.
> +         *
> +         * Note that it is possible that external writers zero parts of
> +         * the cached regions without the cache being invalidated, and so
> +         * we may report zeroes as data.  This is not catastrophic,
> +         * however, because reporting zeroes as data is fine.
> +         */
> +        if (QLIST_EMPTY(&bs->children)) {
> +            if (bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +                local_file = bs;
> +                local_map = aligned_offset;
> +
> +                from_cache = true;
> +            }
> +        }
> +
> +        if (!from_cache) {

Is having a separate variable from_cache really useful? This looks like
it could just be:

    if (QLIST_EMPTY() && bdrv_bsc_is_data()) {
        // The code above
    } else {
        // The code below
    }

> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> +                                                aligned_bytes, pnum, &local_map,
> +                                                &local_file);
> +
> +            /*
> +             * Note that checking QLIST_EMPTY(&bs->children) is also done when
> +             * the cache is queried above.  Technically, we do not need to check
> +             * it here; the worst that can happen is that we fill the cache for
> +             * non-protocol nodes, and then it is never used.  However, filling
> +             * the cache requires an RCU update, so double check here to avoid
> +             * such an update if possible.
> +             */
> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
> +                QLIST_EMPTY(&bs->children))
> +            {

Would it be worth asserting that local_map == aligned_offset, because
otherwise with a buggy protocol driver, the result from the cache could
be different from the first call without us noticing?

> +                bdrv_bsc_fill(bs, aligned_offset, *pnum);
> +            }
> +        }
>      } else {
>          /* Default code for filters */

Kevin



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/6] block: block-status cache for data regions
  2021-06-23 15:01 [PATCH v2 0/6] block: block-status cache for data regions Max Reitz
                   ` (5 preceding siblings ...)
  2021-06-23 15:01 ` [PATCH v2 6/6] block/iscsi: " Max Reitz
@ 2021-07-06 17:06 ` Kevin Wolf
  6 siblings, 0 replies; 18+ messages in thread
From: Kevin Wolf @ 2021-07-06 17:06 UTC (permalink / raw)
  To: Max Reitz
  Cc: Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel, qemu-block

Am 23.06.2021 um 17:01 hat Max Reitz geschrieben:
> Hi,
> 
> See the cover letter from v1 for the general idea:
> 
> https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg00843.html
> 
> 
> The biggest change here in v2 is that instead of having a common CoMutex
> protect the block-status cache, we’re using RCU now.  So to read from
> the cache, or even to invalidate it, no lock is needed, only to update
> it with new data.
> 
> Disclaimer: I have no experience with RCU in practice so far, neither in
> qemu nor anywhere else.  So I hope I’ve used it correctly...

This I hope as well. :-)

With the missing qatomic_rcu_read() added:

Reviewed-by: Kevin Wolf <kwolf@redhat.com>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/6] block: block-status cache for data regions
  2021-07-06 17:04   ` Kevin Wolf
@ 2021-07-12  7:45     ` Max Reitz
  0 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2021-07-12  7:45 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel, qemu-block

On 06.07.21 19:04, Kevin Wolf wrote:
> Am 23.06.2021 um 17:01 hat Max Reitz geschrieben:
>> As we have attempted before
>> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
>> "file-posix: Cache lseek result for data regions";
>> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
>> "file-posix: Cache next hole"), this patch seeks to reduce the number of
>> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
>> main difference is that this time it is implemented as part of the
>> general block layer code.
>>
>> The problem we face is that on some filesystems or in some
>> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
>> implementation is outside of qemu, there is little we can do about its
>> performance.
>>
>> We have already introduced the want_zero parameter to
>> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
>> unless we really want zero information; but sometimes we do want that
>> information, because for files that consist largely of zero areas,
>> special-casing those areas can give large performance boosts.  So the
>> real problem is with files that consist largely of data, so that
>> inquiring the block status does not gain us much performance, but where
>> such an inquiry itself takes a lot of time.
>>
>> To address this, we want to cache data regions.  Most of the time, when
>> bad performance is reported, it is in places where the image is iterated
>> over from start to end (qemu-img convert or the mirror job), so a simple
>> yet effective solution is to cache only the current data region.
>>
>> (Note that only caching data regions but not zero regions means that
>> returning false information from the cache is not catastrophic: Treating
>> zeroes as data is fine.  While we try to invalidate the cache on zero
>> writes and discards, such incongruences may still occur when there are
>> other processes writing to the image.)
>>
>> We only use the cache for nodes without children (i.e. protocol nodes),
>> because that is where the problem is: Drivers that rely on block-status
>> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
>>
>> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Since you indicated that you'll respin the patch, I'll add my minor
> comments:
>
>> @@ -2442,9 +2445,58 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>>       aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>>   
>>       if (bs->drv->bdrv_co_block_status) {
>> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
>> -                                            aligned_bytes, pnum, &local_map,
>> -                                            &local_file);
>> +        bool from_cache = false;
>> +
>> +        /*
>> +         * Use the block-status cache only for protocol nodes: Format
>> +         * drivers are generally quick to inquire the status, but protocol
>> +         * drivers often need to get information from outside of qemu, so
>> +         * we do not have control over the actual implementation.  There
>> +         * have been cases where inquiring the status took an unreasonably
>> +         * long time, and we can do nothing in qemu to fix it.
>> +         * This is especially problematic for images with large data areas,
>> +         * because finding the few holes in them and giving them special
>> +         * treatment does not gain much performance.  Therefore, we try to
>> +         * cache the last-identified data region.
>> +         *
>> +         * Second, limiting ourselves to protocol nodes allows us to assume
>> +         * the block status for data regions to be DATA | OFFSET_VALID, and
>> +         * that the host offset is the same as the guest offset.
>> +         *
>> +         * Note that it is possible that external writers zero parts of
>> +         * the cached regions without the cache being invalidated, and so
>> +         * we may report zeroes as data.  This is not catastrophic,
>> +         * however, because reporting zeroes as data is fine.
>> +         */
>> +        if (QLIST_EMPTY(&bs->children)) {
>> +            if (bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
>> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>> +                local_file = bs;
>> +                local_map = aligned_offset;
>> +
>> +                from_cache = true;
>> +            }
>> +        }
>> +
>> +        if (!from_cache) {
> Is having a separate variable from_cache really useful? This looks like
> it could just be:
>
>      if (QLIST_EMPTY() && bdrv_bsc_is_data()) {
>          // The code above
>      } else {
>          // The code below
>      }

Oh, yes.

(I guess this was mainly an artifact from v1 where there was a mutex 
around the bdrv_bsc_is_data() block.  Now it’s better to just roll both 
conditions into one, yes.)
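
So, something like this, perhaps (just a sketch of how the hunk could look
after dropping from_cache):

    if (QLIST_EMPTY(&bs->children) &&
        bdrv_bsc_is_data(bs, aligned_offset, pnum))
    {
        /* Fast path: the cached data region covers this offset */
        ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
        local_file = bs;
        local_map = aligned_offset;
    } else {
        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
                                            aligned_bytes, pnum, &local_map,
                                            &local_file);

        if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
            QLIST_EMPTY(&bs->children))
        {
            bdrv_bsc_fill(bs, aligned_offset, *pnum);
        }
    }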

>> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
>> +                                                aligned_bytes, pnum, &local_map,
>> +                                                &local_file);
>> +
>> +            /*
>> +             * Note that checking QLIST_EMPTY(&bs->children) is also done when
>> +             * the cache is queried above.  Technically, we do not need to check
>> +             * it here; the worst that can happen is that we fill the cache for
>> +             * non-protocol nodes, and then it is never used.  However, filling
>> +             * the cache requires an RCU update, so double check here to avoid
>> +             * such an update if possible.
>> +             */
>> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
>> +                QLIST_EMPTY(&bs->children))
>> +            {
> Would it be worth asserting that local_map == aligned_offset, because
> otherwise with a buggy protocol driver, the result from the cache could
> be different from the first call without us noticing?

I think it would indeed be worth it.
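
I.e. something like this (sketch only):

    if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
        QLIST_EMPTY(&bs->children))
    {
        /*
         * The cache assumes host offset == guest offset on protocol
         * nodes; catch protocol drivers that violate that assumption
         * before their result ends up in the cache.
         */
        assert(local_map == aligned_offset);
        bdrv_bsc_fill(bs, aligned_offset, *pnum);
    }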

Max

>> +                bdrv_bsc_fill(bs, aligned_offset, *pnum);
>> +            }
>> +        }
>>       } else {
>>           /* Default code for filters */
> Kevin
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum
  2021-06-28 19:10       ` Eric Blake
@ 2021-07-12  7:47         ` Max Reitz
  0 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2021-07-12  7:47 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 28.06.21 21:10, Eric Blake wrote:
>>>> +++ b/include/block/block_int.h
>>>> @@ -347,6 +347,11 @@ struct BlockDriver {
>>>>         * clamped to bdrv_getlength() and aligned to request_alignment,
>>>>         * as well as non-NULL pnum, map, and file; in turn, the driver
>>>>         * must return an error or set pnum to an aligned non-zero value.
>>>> +     *
>>>> +     * Note that @bytes is just a hint on how big of a region the
>>>> +     * caller wants to inspect.  It is not a limit on *pnum.
>>>> +     * Implementations are free to return larger values of *pnum if
>>>> +     * doing so does not incur a performance penalty.
>>> Worth mentioning that the cache will benefit from it?
>> Oh, right, absolutely.  Like so:
>>
>> "block/io.c's bdrv_co_block_status() will clamp *pnum before returning it to
>> its caller, but it itself can still make use of the unclamped *pnum value.
>> Specifically, the block-status cache for protocol nodes will benefit from
>> storing as large a region as possible."
> How about this tweak to the wording to make it flow a little better:
>
> block/io.c's bdrv_co_block_status() will utilize an unclamped *pnum
> value for the block-status cache on protocol nodes, prior to clamping
> *pnum for return to its caller.

Sure, thanks!

Max



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-07-12  7:48 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-23 15:01 [PATCH v2 0/6] block: block-status cache for data regions Max Reitz
2021-06-23 15:01 ` [PATCH v2 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz
2021-06-23 15:01 ` [PATCH v2 2/6] block: block-status cache for data regions Max Reitz
2021-06-24 10:05   ` Vladimir Sementsov-Ogievskiy
2021-06-24 11:11     ` Max Reitz
2021-07-06 17:04   ` Kevin Wolf
2021-07-12  7:45     ` Max Reitz
2021-06-23 15:01 ` [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum Max Reitz
2021-06-24  9:15   ` Vladimir Sementsov-Ogievskiy
2021-06-24 10:16     ` Max Reitz
2021-06-24 10:25       ` Vladimir Sementsov-Ogievskiy
2021-06-24 11:12         ` Max Reitz
2021-06-28 19:10       ` Eric Blake
2021-07-12  7:47         ` Max Reitz
2021-06-23 15:01 ` [PATCH v2 4/6] block/file-posix: Do not force-cap *pnum Max Reitz
2021-06-23 15:01 ` [PATCH v2 5/6] block/gluster: " Max Reitz
2021-06-23 15:01 ` [PATCH v2 6/6] block/iscsi: " Max Reitz
2021-07-06 17:06 ` [PATCH v2 0/6] block: block-status cache for data regions Kevin Wolf
