qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/6] block-status cache for data regions
@ 2021-08-12  8:41 Hanna Reitz
  2021-08-12  8:41 ` [PATCH v3 1/6] block: Drop BDS comment regarding bdrv_append() Hanna Reitz
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Hanna Reitz @ 2021-08-12  8:41 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Vladimir Sementsov-Ogievskiy, qemu-devel

Hi,

See the cover letter from v1 for the general idea:
https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg00843.html

Cover letter from v2, introducing RCU locking:
https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg01060.html


v3:
- Patch 2:
  - Add rcu_head object to BdrvBlockStatusCache, so we can use
    g_free_rcu() to free it instead of synchronize_rcu()+g_free()
  - Use qatomic_rcu_read() every time we read bs->block_status_cache
    (except in bdrv_close(), where no concurrency is possible)
  - Use RCU_READ_LOCK_GUARD() instead of WITH_RCU_READ_LOCK_GUARD() in
    functions where we lock the whole scope anyway
  - Same for QEMU_LOCK_GUARD() instead of WITH_QEMU_LOCK_GUARD() in
    bdrv_bsc_fill()
  - Drop from_cache variable in bdrv_co_block_status()
    (was an artifact from v1, which had a different control flow and
    needed this variable)
  - Assert that local_map returned from a protocol driver’s
    bdrv_co_block_status() implementation is equal to the offset we
    passed to it (see comment there for why we should do this)

- Patch 3:
  - Add note why block drivers should return larger *pnum values in
    addition to just saying that it’s allowed


git-backport-diff against v2:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/6:[----] [--] 'block: Drop BDS comment regarding bdrv_append()'
002/6:[0063] [FC] 'block: block-status cache for data regions'
003/6:[0004] [FC] 'block: Clarify that @bytes is no limit on *pnum'
004/6:[----] [--] 'block/file-posix: Do not force-cap *pnum'
005/6:[----] [--] 'block/gluster: Do not force-cap *pnum'
006/6:[----] [--] 'block/iscsi: Do not force-cap *pnum'


Hanna Reitz (6):
  block: Drop BDS comment regarding bdrv_append()
  block: block-status cache for data regions
  block: Clarify that @bytes is no limit on *pnum
  block/file-posix: Do not force-cap *pnum
  block/gluster: Do not force-cap *pnum
  block/iscsi: Do not force-cap *pnum

 include/block/block_int.h | 61 +++++++++++++++++++++++++++--
 block.c                   | 80 +++++++++++++++++++++++++++++++++++++++
 block/file-posix.c        |  7 ++--
 block/gluster.c           |  7 ++--
 block/io.c                | 65 +++++++++++++++++++++++++++++--
 block/iscsi.c             |  3 --
 6 files changed, 207 insertions(+), 16 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 1/6] block: Drop BDS comment regarding bdrv_append()
  2021-08-12  8:41 [PATCH v3 0/6] block-status cache for data regions Hanna Reitz
@ 2021-08-12  8:41 ` Hanna Reitz
  2021-08-12  8:41 ` [PATCH v3 2/6] block: block-status cache for data regions Hanna Reitz
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Hanna Reitz @ 2021-08-12  8:41 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Vladimir Sementsov-Ogievskiy, qemu-devel

There is a comment above the BDS definition stating care must be taken
to consider handling newly added fields in bdrv_append().

Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac
(nine years ago), and in any case, bdrv_swap() was dropped in
8e419aefa (six years ago).  So no such care is necessary anymore.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block_int.h | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index f1a54db0f8..12e5750fe8 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -839,12 +839,6 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
-/*
- * Note: the function bdrv_append() copies and swaps contents of
- * BlockDriverStates, so if you add new fields to this struct, please
- * inspect bdrv_append() to determine if the new fields need to be
- * copied as well.
- */
 struct BlockDriverState {
     /* Protected by big QEMU lock or read-only after opening.  No special
      * locking needed during I/O...
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 2/6] block: block-status cache for data regions
  2021-08-12  8:41 [PATCH v3 0/6] block-status cache for data regions Hanna Reitz
  2021-08-12  8:41 ` [PATCH v3 1/6] block: Drop BDS comment regarding bdrv_append() Hanna Reitz
@ 2021-08-12  8:41 ` Hanna Reitz
  2021-08-16 21:38   ` Eric Blake
  2021-08-25  9:54   ` Vladimir Sementsov-Ogievskiy
  2021-08-12  8:41 ` [PATCH v3 3/6] block: Clarify that @bytes is no limit on *pnum Hanna Reitz
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 12+ messages in thread
From: Hanna Reitz @ 2021-08-12  8:41 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Vladimir Sementsov-Ogievskiy, qemu-devel

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
---
 include/block/block_int.h | 50 ++++++++++++++++++++++++
 block.c                   | 80 +++++++++++++++++++++++++++++++++++++++
 block/io.c                | 65 +++++++++++++++++++++++++++++--
 3 files changed, 192 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 12e5750fe8..437d746733 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -34,6 +34,7 @@
 #include "qemu/hbitmap.h"
 #include "block/snapshot.h"
 #include "qemu/throttle.h"
+#include "qemu/rcu.h"
 
 #define BLOCK_FLAG_LAZY_REFCOUNTS   8
 
@@ -839,6 +840,24 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @valid: Whether the cache is valid (should be accessed with atomic
+ *         functions so this can be reset by RCU readers)
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *            the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+    struct rcu_head rcu;
+
+    bool valid;
+    int64_t data_start;
+    int64_t data_end;
+} BdrvBlockStatusCache;
+
 struct BlockDriverState {
     /* Protected by big QEMU lock or read-only after opening.  No special
      * locking needed during I/O...
@@ -1004,6 +1023,11 @@ struct BlockDriverState {
 
     /* BdrvChild links to this node may never be frozen */
     bool never_freeze;
+
+    /* Lock for block-status cache RCU writers */
+    CoMutex bsc_modify_lock;
+    /* Always non-NULL, but must only be dereferenced under an RCU read guard */
+    BdrvBlockStatusCache *block_status_cache;
 };
 
 struct BlockBackendRootState {
@@ -1429,4 +1453,30 @@ static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
  */
 void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
 
+/**
+ * Check whether the given offset is in the cached block-status data
+ * region.
+ *
+ * If it is, and @pnum is not NULL, *pnum is set to
+ * `bsc.data_end - offset`, i.e. how many bytes, starting from
+ * @offset, are data (according to the cache).
+ * Otherwise, *pnum is not touched.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
+
+/**
+ * If [offset, offset + bytes) overlaps with the currently cached
+ * block-status region, invalidate the cache.
+ *
+ * (To be used by I/O paths that cause data regions to be zero or
+ * holes.)
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+                               int64_t offset, int64_t bytes);
+
+/**
+ * Mark the range [offset, offset + bytes) as a data region.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
+
 #endif /* BLOCK_INT_H */
diff --git a/block.c b/block.c
index e97ce0b1c8..28b00d7239 100644
--- a/block.c
+++ b/block.c
@@ -49,6 +49,8 @@
 #include "qemu/timer.h"
 #include "qemu/cutils.h"
 #include "qemu/id.h"
+#include "qemu/range.h"
+#include "qemu/rcu.h"
 #include "block/coroutines.h"
 
 #ifdef CONFIG_BSD
@@ -401,6 +403,9 @@ BlockDriverState *bdrv_new(void)
 
     qemu_co_queue_init(&bs->flush_queue);
 
+    qemu_co_mutex_init(&bs->bsc_modify_lock);
+    bs->block_status_cache = g_new0(BdrvBlockStatusCache, 1);
+
     for (i = 0; i < bdrv_drain_all_count; i++) {
         bdrv_drained_begin(bs);
     }
@@ -4694,6 +4699,8 @@ static void bdrv_close(BlockDriverState *bs)
     bs->explicit_options = NULL;
     qobject_unref(bs->full_open_options);
     bs->full_open_options = NULL;
+    g_free(bs->block_status_cache);
+    bs->block_status_cache = NULL;
 
     bdrv_release_named_dirty_bitmaps(bs);
     assert(QLIST_EMPTY(&bs->dirty_bitmaps));
@@ -7653,3 +7660,76 @@ BlockDriverState *bdrv_backing_chain_next(BlockDriverState *bs)
 {
     return bdrv_skip_filters(bdrv_cow_bs(bdrv_skip_filters(bs)));
 }
+
+/**
+ * Check whether [offset, offset + bytes) overlaps with the cached
+ * block-status data region.
+ *
+ * If so, and @pnum is not NULL, set *pnum to `bsc.data_end - offset`,
+ * which is what bdrv_bsc_is_data()'s interface needs.
+ * Otherwise, *pnum is not touched.
+ */
+static bool bdrv_bsc_range_overlaps_locked(BlockDriverState *bs,
+                                           int64_t offset, int64_t bytes,
+                                           int64_t *pnum)
+{
+    BdrvBlockStatusCache *bsc = qatomic_rcu_read(&bs->block_status_cache);
+    bool overlaps;
+
+    overlaps =
+        qatomic_read(&bsc->valid) &&
+        ranges_overlap(offset, bytes, bsc->data_start,
+                       bsc->data_end - bsc->data_start);
+
+    if (overlaps && pnum) {
+        *pnum = bsc->data_end - offset;
+    }
+
+    return overlaps;
+}
+
+/**
+ * See block_int.h for this function's documentation.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum)
+{
+    RCU_READ_LOCK_GUARD();
+
+    return bdrv_bsc_range_overlaps_locked(bs, offset, 1, pnum);
+}
+
+/**
+ * See block_int.h for this function's documentation.
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+                               int64_t offset, int64_t bytes)
+{
+    RCU_READ_LOCK_GUARD();
+
+    if (bdrv_bsc_range_overlaps_locked(bs, offset, bytes, NULL)) {
+        qatomic_set(&bs->block_status_cache->valid, false);
+    }
+}
+
+/**
+ * See block_int.h for this function's documentation.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+    BdrvBlockStatusCache *new_bsc = g_new(BdrvBlockStatusCache, 1);
+    BdrvBlockStatusCache *old_bsc;
+
+    *new_bsc = (BdrvBlockStatusCache) {
+        .valid = true,
+        .data_start = offset,
+        .data_end = offset + bytes,
+    };
+
+    QEMU_LOCK_GUARD(&bs->bsc_modify_lock);
+
+    old_bsc = qatomic_rcu_read(&bs->block_status_cache);
+    qatomic_rcu_set(&bs->block_status_cache, new_bsc);
+    if (old_bsc) {
+        g_free_rcu(old_bsc, rcu);
+    }
+}
diff --git a/block/io.c b/block/io.c
index a19942718b..e62cdbda99 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1883,6 +1883,9 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
         return -ENOTSUP;
     }
 
+    /* Invalidate the cached block-status data range if this write overlaps */
+    bdrv_bsc_invalidate_range(bs, offset, bytes);
+
     assert(alignment % bs->bl.request_alignment == 0);
     head = offset % alignment;
     tail = (offset + bytes) % alignment;
@@ -2447,9 +2450,62 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
 
     if (bs->drv->bdrv_co_block_status) {
-        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
-                                            aligned_bytes, pnum, &local_map,
-                                            &local_file);
+        /*
+         * Use the block-status cache only for protocol nodes: Format
+         * drivers are generally quick to inquire the status, but protocol
+         * drivers often need to get information from outside of qemu, so
+         * we do not have control over the actual implementation.  There
+         * have been cases where inquiring the status took an unreasonably
+         * long time, and we can do nothing in qemu to fix it.
+         * This is especially problematic for images with large data areas,
+         * because finding the few holes in them and giving them special
+         * treatment does not gain much performance.  Therefore, we try to
+         * cache the last-identified data region.
+         *
+         * Second, limiting ourselves to protocol nodes allows us to assume
+         * the block status for data regions to be DATA | OFFSET_VALID, and
+         * that the host offset is the same as the guest offset.
+         *
+         * Note that it is possible that external writers zero parts of
+         * the cached regions without the cache being invalidated, and so
+         * we may report zeroes as data.  This is not catastrophic,
+         * however, because reporting zeroes as data is fine.
+         */
+        if (QLIST_EMPTY(&bs->children) &&
+            bdrv_bsc_is_data(bs, aligned_offset, pnum))
+        {
+            ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+            local_file = bs;
+            local_map = aligned_offset;
+        } else {
+            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+                                                aligned_bytes, pnum, &local_map,
+                                                &local_file);
+
+            /*
+             * Note that checking QLIST_EMPTY(&bs->children) is also done when
+             * the cache is queried above.  Technically, we do not need to check
+             * it here; the worst that can happen is that we fill the cache for
+             * non-protocol nodes, and then it is never used.  However, filling
+             * the cache requires an RCU update, so double check here to avoid
+             * such an update if possible.
+             */
+            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
+                QLIST_EMPTY(&bs->children))
+            {
+                /*
+                 * When a protocol driver reports BLOCK_OFFSET_VALID, the
+                 * returned local_map value must be the same as the offset we
+                 * have passed (aligned_offset).
+                 * Assert this, because we follow this rule when reading from
+                 * the cache (see the `local_map = aligned_offset` assignment
+                 * above), and the result the cache delivers must be the same
+                 * as the driver would deliver.
+                 */
+                assert(local_map == aligned_offset);
+                bdrv_bsc_fill(bs, aligned_offset, *pnum);
+            }
+        }
     } else {
         /* Default code for filters */
 
@@ -3002,6 +3058,9 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
         return 0;
     }
 
+    /* Invalidate the cached block-status data range if this discard overlaps */
+    bdrv_bsc_invalidate_range(bs, offset, bytes);
+
     /* Discard is advisory, but some devices track and coalesce
      * unaligned requests, so we must pass everything down rather than
      * round here.  Still, most devices will just silently ignore
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 3/6] block: Clarify that @bytes is no limit on *pnum
  2021-08-12  8:41 [PATCH v3 0/6] block-status cache for data regions Hanna Reitz
  2021-08-12  8:41 ` [PATCH v3 1/6] block: Drop BDS comment regarding bdrv_append() Hanna Reitz
  2021-08-12  8:41 ` [PATCH v3 2/6] block: block-status cache for data regions Hanna Reitz
@ 2021-08-12  8:41 ` Hanna Reitz
  2021-08-17 15:03   ` Eric Blake
  2021-08-12  8:41 ` [PATCH v3 4/6] block/file-posix: Do not force-cap *pnum Hanna Reitz
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Hanna Reitz @ 2021-08-12  8:41 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Vladimir Sementsov-Ogievskiy, qemu-devel

.bdrv_co_block_status() implementations are free to return a *pnum that
exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp
*pnum as necessary.

On the other hand, if drivers' implementations return values for *pnum
that are as large as possible, our recently introduced block-status
cache will become more effective.

So, make a note in block_int.h that @bytes is no upper limit for *pnum.

Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block_int.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 437d746733..5451f89b8d 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -348,6 +348,15 @@ struct BlockDriver {
      * clamped to bdrv_getlength() and aligned to request_alignment,
      * as well as non-NULL pnum, map, and file; in turn, the driver
      * must return an error or set pnum to an aligned non-zero value.
+     *
+     * Note that @bytes is just a hint on how big of a region the
+     * caller wants to inspect.  It is not a limit on *pnum.
+     * Implementations are free to return larger values of *pnum if
+     * doing so does not incur a performance penalty.
+     *
+     * block/io.c's bdrv_co_block_status() will utilize an unclamped
+     * *pnum value for the block-status cache on protocol nodes, prior
+     * to clamping *pnum for return to its caller.
      */
     int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs,
         bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 4/6] block/file-posix: Do not force-cap *pnum
  2021-08-12  8:41 [PATCH v3 0/6] block-status cache for data regions Hanna Reitz
                   ` (2 preceding siblings ...)
  2021-08-12  8:41 ` [PATCH v3 3/6] block: Clarify that @bytes is no limit on *pnum Hanna Reitz
@ 2021-08-12  8:41 ` Hanna Reitz
  2021-08-12  8:41 ` [PATCH v3 5/6] block/gluster: " Hanna Reitz
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Hanna Reitz @ 2021-08-12  8:41 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Vladimir Sementsov-Ogievskiy, qemu-devel

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/file-posix.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index cb9bffe047..9f35e5631a 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2744,7 +2744,8 @@ static int find_allocation(BlockDriverState *bs, off_t start,
  * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
  */
 static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
                                             bool want_zero,
@@ -2782,7 +2783,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     } else if (data == offset) {
         /* On a data extent, compute bytes to the end of the extent,
          * possibly including a partial sector at EOF. */
-        *pnum = MIN(bytes, hole - offset);
+        *pnum = hole - offset;
 
         /*
          * We are not allowed to return partial sectors, though, so
@@ -2801,7 +2802,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
-        *pnum = MIN(bytes, data - offset);
+        *pnum = data - offset;
         ret = BDRV_BLOCK_ZERO;
     }
     *map = offset;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 5/6] block/gluster: Do not force-cap *pnum
  2021-08-12  8:41 [PATCH v3 0/6] block-status cache for data regions Hanna Reitz
                   ` (3 preceding siblings ...)
  2021-08-12  8:41 ` [PATCH v3 4/6] block/file-posix: Do not force-cap *pnum Hanna Reitz
@ 2021-08-12  8:41 ` Hanna Reitz
  2021-08-12  8:41 ` [PATCH v3 6/6] block/iscsi: " Hanna Reitz
  2021-09-07  9:23 ` [PATCH v3 0/6] block-status cache for data regions Hanna Reitz
  6 siblings, 0 replies; 12+ messages in thread
From: Hanna Reitz @ 2021-08-12  8:41 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Vladimir Sementsov-Ogievskiy, qemu-devel

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/gluster.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index e8ee14c8e9..8ef7bb18d5 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1461,7 +1461,8 @@ exit:
  * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
  *
  * (Based on raw_co_block_status() from file-posix.c.)
  */
@@ -1500,12 +1501,12 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs,
     } else if (data == offset) {
         /* On a data extent, compute bytes to the end of the extent,
          * possibly including a partial sector at EOF. */
-        *pnum = MIN(bytes, hole - offset);
+        *pnum = hole - offset;
         ret = BDRV_BLOCK_DATA;
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
-        *pnum = MIN(bytes, data - offset);
+        *pnum = data - offset;
         ret = BDRV_BLOCK_ZERO;
     }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 6/6] block/iscsi: Do not force-cap *pnum
  2021-08-12  8:41 [PATCH v3 0/6] block-status cache for data regions Hanna Reitz
                   ` (4 preceding siblings ...)
  2021-08-12  8:41 ` [PATCH v3 5/6] block/gluster: " Hanna Reitz
@ 2021-08-12  8:41 ` Hanna Reitz
  2021-09-07  9:23 ` [PATCH v3 0/6] block-status cache for data regions Hanna Reitz
  6 siblings, 0 replies; 12+ messages in thread
From: Hanna Reitz @ 2021-08-12  8:41 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Vladimir Sementsov-Ogievskiy, qemu-devel

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/iscsi.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 4d2a416ce7..852384086b 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -781,9 +781,6 @@ retry:
         iscsi_allocmap_set_allocated(iscsilun, offset, *pnum);
     }
 
-    if (*pnum > bytes) {
-        *pnum = bytes;
-    }
 out_unlock:
     qemu_mutex_unlock(&iscsilun->mutex);
     g_free(iTask.err_str);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/6] block: block-status cache for data regions
  2021-08-12  8:41 ` [PATCH v3 2/6] block: block-status cache for data regions Hanna Reitz
@ 2021-08-16 21:38   ` Eric Blake
  2021-08-17  6:19     ` Hanna Reitz
  2021-08-25  9:54   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Blake @ 2021-08-16 21:38 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Thu, Aug 12, 2021 at 10:41:44AM +0200, Hanna Reitz wrote:
> As we have attempted before
> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
> "file-posix: Cache lseek result for data regions";
> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
> "file-posix: Cache next hole"), this patch seeks to reduce the number of
> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
> main difference is that this time it is implemented as part of the
> general block layer code.
> 

> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
> ---

> +++ b/block.c

> +/**
> + * Check whether [offset, offset + bytes) overlaps with the cached
> + * block-status data region.
> + *
> + * If so, and @pnum is not NULL, set *pnum to `bsc.data_end - offset`,
> + * which is what bdrv_bsc_is_data()'s interface needs.
> + * Otherwise, *pnum is not touched.

Why duplicate this comment,...

> + */
> +static bool bdrv_bsc_range_overlaps_locked(BlockDriverState *bs,
> +                                           int64_t offset, int64_t bytes,
> +                                           int64_t *pnum)
> +{
> +    BdrvBlockStatusCache *bsc = qatomic_rcu_read(&bs->block_status_cache);
> +    bool overlaps;
> +
> +    overlaps =
> +        qatomic_read(&bsc->valid) &&
> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start);
> +
> +    if (overlaps && pnum) {
> +        *pnum = bsc->data_end - offset;
> +    }
> +
> +    return overlaps;
> +}
> +
> +/**
> + * See block_int.h for this function's documentation.
> + */
> +bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum)
> +{
> +    RCU_READ_LOCK_GUARD();
> +
> +    return bdrv_bsc_range_overlaps_locked(bs, offset, 1, pnum);
> +}
> +
> +/**
> + * See block_int.h for this function's documentation.

...but not these?

But that's minor.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/6] block: block-status cache for data regions
  2021-08-16 21:38   ` Eric Blake
@ 2021-08-17  6:19     ` Hanna Reitz
  0 siblings, 0 replies; 12+ messages in thread
From: Hanna Reitz @ 2021-08-17  6:19 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 16.08.21 23:38, Eric Blake wrote:
> On Thu, Aug 12, 2021 at 10:41:44AM +0200, Hanna Reitz wrote:
>> As we have attempted before
>> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
>> "file-posix: Cache lseek result for data regions";
>> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
>> "file-posix: Cache next hole"), this patch seeks to reduce the number of
>> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
>> main difference is that this time it is implemented as part of the
>> general block layer code.
>>
>> We only use the cache for nodes without children (i.e. protocol nodes),
>> because that is where the problem is: Drivers that rely on block-status
>> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
>>
>> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
>> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
>> ---
>> +++ b/block.c
>> +/**
>> + * Check whether [offset, offset + bytes) overlaps with the cached
>> + * block-status data region.
>> + *
>> + * If so, and @pnum is not NULL, set *pnum to `bsc.data_end - offset`,
>> + * which is what bdrv_bsc_is_data()'s interface needs.
>> + * Otherwise, *pnum is not touched.
> Why duplicate this comment,...

I don’t think it can be duplicated, because this is a static function.  
It is very similar to bdrv_bsc_is_data()’s interface, I admit, but it 
isn’t exactly the same (besides the _locked suffix, the only difference 
is that bdrv_bsc_is_data() is for a single byte, while this function 
checks a range).

>> + */
>> +static bool bdrv_bsc_range_overlaps_locked(BlockDriverState *bs,
>> +                                           int64_t offset, int64_t bytes,
>> +                                           int64_t *pnum)
>> +{
>> +    BdrvBlockStatusCache *bsc = qatomic_rcu_read(&bs->block_status_cache);
>> +    bool overlaps;
>> +
>> +    overlaps =
>> +        qatomic_read(&bsc->valid) &&
>> +        ranges_overlap(offset, bytes, bsc->data_start,
>> +                       bsc->data_end - bsc->data_start);
>> +
>> +    if (overlaps && pnum) {
>> +        *pnum = bsc->data_end - offset;
>> +    }
>> +
>> +    return overlaps;
>> +}
>> +
>> +/**
>> + * See block_int.h for this function's documentation.
>> + */
>> +bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum)
>> +{
>> +    RCU_READ_LOCK_GUARD();
>> +
>> +    return bdrv_bsc_range_overlaps_locked(bs, offset, 1, pnum);
>> +}
>> +
>> +/**
>> + * See block_int.h for this function's documentation.
> ...but not these?

These are not static functions, and so I can point to the header where 
they’re declared.

(We have a wild mix of where functions are described in qemu, and it’s 
often in their C files.  I prefer having descriptions in the header, but 
because we have the precedent of explaining interfaces in C files, I 
thought I can’t get around adding at least pointers in the C file.)

> But that's minor.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!

Hanna



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 3/6] block: Clarify that @bytes is no limit on *pnum
  2021-08-12  8:41 ` [PATCH v3 3/6] block: Clarify that @bytes is no limit on *pnum Hanna Reitz
@ 2021-08-17 15:03   ` Eric Blake
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Blake @ 2021-08-17 15:03 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Thu, Aug 12, 2021 at 10:41:45AM +0200, Hanna Reitz wrote:
> .bdrv_co_block_status() implementations are free to return a *pnum that
> exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp
> *pnum as necessary.
> 
> On the other hand, if drivers' implementations return values for *pnum
> that are as large as possible, our recently introduced block-status
> cache will become more effective.
> 
> So, make a note in block_int.h that @bytes is no upper limit for *pnum.
> 
> Suggested-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/block/block_int.h | 9 +++++++++
>  1 file changed, 9 insertions(+)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/6] block: block-status cache for data regions
  2021-08-12  8:41 ` [PATCH v3 2/6] block: block-status cache for data regions Hanna Reitz
  2021-08-16 21:38   ` Eric Blake
@ 2021-08-25  9:54   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-08-25  9:54 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

12.08.2021 11:41, Hanna Reitz wrote:
> As we have attempted before
> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
> "file-posix: Cache lseek result for data regions";
> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
> "file-posix: Cache next hole"), this patch seeks to reduce the number of
> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
> main difference is that this time it is implemented as part of the
> general block layer code.
> 
> The problem we face is that on some filesystems or in some
> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
> implementation is outside of qemu, there is little we can do about its
> performance.
> 
> We have already introduced the want_zero parameter to
> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
> unless we really want zero information; but sometimes we do want that
> information, because for files that consist largely of zero areas,
> special-casing those areas can give large performance boosts.  So the
> real problem is with files that consist largely of data, so that
> inquiring the block status does not gain us much performance, but where
> such an inquiry itself takes a lot of time.
> 
> To address this, we want to cache data regions.  Most of the time, when
> bad performance is reported, it is in places where the image is iterated
> over from start to end (qemu-img convert or the mirror job), so a simple
> yet effective solution is to cache only the current data region.
> 
> (Note that only caching data regions but not zero regions means that
> returning false information from the cache is not catastrophic: Treating
> zeroes as data is fine.  While we try to invalidate the cache on zero
> writes and discards, such incongruences may still occur when there are
> other processes writing to the image.)
> 
> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
> ---
>   include/block/block_int.h | 50 ++++++++++++++++++++++++

[..]

> +            /*
> +             * Note that checking QLIST_EMPTY(&bs->children) is also done when
> +             * the cache is queried above.  Technically, we do not need to check
> +             * it here; the worst that can happen is that we fill the cache for
> +             * non-protocol nodes, and then it is never used.  However, filling
> +             * the cache requires an RCU update, so double check here to avoid
> +             * such an update if possible.
> +             */
> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
> +                QLIST_EMPTY(&bs->children))
> +            {
> +                /*
> +                 * When a protocol driver reports BLOCK_OFFSET_VALID, the
> +                 * returned local_map value must be the same as the offset we
> +                 * have passed (aligned_offset).
> +                 * Assert this, because we follow this rule when reading from
> +                 * the cache (see the `local_map = aligned_offset` assignment
> +                 * above), and the result the cache delivers must be the same
> +                 * as the driver would deliver.
> +                 */
> +                assert(local_map == aligned_offset);

maybe, also assert(local_file == bs);

as well, we may check only BDRV_BLOCK_DATA flag above and
assert BDRV_BLOCK_OFFSET_VALID here

anyway:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 0/6] block-status cache for data regions
  2021-08-12  8:41 [PATCH v3 0/6] block-status cache for data regions Hanna Reitz
                   ` (5 preceding siblings ...)
  2021-08-12  8:41 ` [PATCH v3 6/6] block/iscsi: " Hanna Reitz
@ 2021-09-07  9:23 ` Hanna Reitz
  6 siblings, 0 replies; 12+ messages in thread
From: Hanna Reitz @ 2021-09-07  9:23 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel

On 12.08.21 10:41, Hanna Reitz wrote:
> Hi,
>
> See the cover letter from v1 for the general idea:
> https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg00843.html
>
> Cover letter from v2, introducing RCU locking:
> https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg01060.html
>
>
> v3:
> - Patch 2:
>    - Add rcu_head object to BdrvBlockStatusCache, so we can use
>      g_free_rcu() to free it instead of synchronize_rcu()+g_free()
>    - Use qatomic_rcu_read() every time we read bs->block_status_cache
>      (except in bdrv_close(), where no concurrency is possible)
>    - Use RCU_READ_LOCK_GUARD() instead of WITH_RCU_READ_LOCK_GUARD() in
>      functions where we lock the whole scope anyway
>    - Same for QEMU_LOCK_GUARD() instead of WITH_QEMU_LOCK_GUARD() in
>      bdrv_bsc_fill()
>    - Drop from_cache variable in bdrv_co_block_status()
>      (was an artifact from v1, which had a different control flow and
>      needed this variable)
>    - Assert that local_map returned from a protocol driver’s
>      bdrv_co_block_status() implementation is equal to the offset we
>      passed to it (see comment there for why we should do this)
>
> - Patch 3:
>    - Add note why block drivers should return larger *pnum values in
>      addition to just saying that it’s allowed

Thanks for the reviews, I’ve added the `local_file == bs` assertion in 
patch 2 as suggested by Vladimir (and updated the comment to match) and 
applied the series to my block branch:

https://github.com/XanClic/qemu/commits/block

Hanna



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-09-07  9:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-12  8:41 [PATCH v3 0/6] block-status cache for data regions Hanna Reitz
2021-08-12  8:41 ` [PATCH v3 1/6] block: Drop BDS comment regarding bdrv_append() Hanna Reitz
2021-08-12  8:41 ` [PATCH v3 2/6] block: block-status cache for data regions Hanna Reitz
2021-08-16 21:38   ` Eric Blake
2021-08-17  6:19     ` Hanna Reitz
2021-08-25  9:54   ` Vladimir Sementsov-Ogievskiy
2021-08-12  8:41 ` [PATCH v3 3/6] block: Clarify that @bytes is no limit on *pnum Hanna Reitz
2021-08-17 15:03   ` Eric Blake
2021-08-12  8:41 ` [PATCH v3 4/6] block/file-posix: Do not force-cap *pnum Hanna Reitz
2021-08-12  8:41 ` [PATCH v3 5/6] block/gluster: " Hanna Reitz
2021-08-12  8:41 ` [PATCH v3 6/6] block/iscsi: " Hanna Reitz
2021-09-07  9:23 ` [PATCH v3 0/6] block-status cache for data regions Hanna Reitz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).