* [PATCH 0/6] block: block-status cache for data regions @ 2021-06-17 15:52 Max Reitz 2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz ` (5 more replies) 0 siblings, 6 replies; 28+ messages in thread From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw) To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz Hi, We’ve already had two attempts at introducing a block-status cache for data regions (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html, "file-posix: Cache lseek result for data regions"; https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html, "file-posix: Cache next hole"), both of which were local to file-posix. I have a gitlab issue assigned to me (https://gitlab.com/qemu-project/qemu/-/issues/307), so I suppose it’s time for round three. This series tries to address what I gathered from the reviews for both: (1) We should try to have this cache in the general block layer for all protocol drivers; (2) Just to be sure, we should have a mutex for it; (3) External writers don’t matter if we just cache data regions, because (for a protocol node) reporting a zero region as data isn’t fatal. Patch 1 is clean-up, patch 2 is the main one, patches 3 to 6 make it more useful: Some protocol drivers force-clamp the returned *pnum based on the @bytes parameter, but that’s completely unnecessary, because bdrv_co_block_status() will clamp the value, too. It’s beneficial to return as large a *pnum value as possible to bdrv_co_block_status(), so our cache can be more effective. 
Max Reitz (6): block: Drop BDS comment regarding bdrv_append() block: block-status cache for data regions block/file-posix: Do not force-cap *pnum block/gluster: Do not force-cap *pnum block/nbd: Do not force-cap *pnum block/iscsi: Do not force-cap *pnum include/block/block_int.h | 21 ++++++++-- block.c | 2 + block/file-posix.c | 7 ++-- block/gluster.c | 7 ++-- block/io.c | 80 +++++++++++++++++++++++++++++++++++++-- block/iscsi.c | 3 -- block/nbd.c | 2 +- 7 files changed, 105 insertions(+), 17 deletions(-) -- 2.31.1 ^ permalink raw reply [flat|nested] 28+ messages in thread
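The cover letter's point about force-capping *pnum (patches 3 to 6) can be illustrated with a toy model. This is a hypothetical sketch, not qemu code: `probe_extent`, `status_capped`, and `status_uncapped` are invented names. It shows that clamping in the driver discards information the generic layer could have cached, while the caller sees the same clamped answer either way.

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* A toy driver that knows everything from @offset to 1 MiB is data. */
static int64_t probe_extent(int64_t offset)
{
    return 1048576 - offset;
}

/* Force-capping variant (what patches 3-6 remove): the driver clamps
 * *pnum to the request size, discarding the rest of the extent. */
static void status_capped(int64_t offset, int64_t bytes, int64_t *pnum)
{
    *pnum = MIN(probe_extent(offset), bytes);
}

/* Uncapped variant: the driver reports the full extent; the generic
 * layer clamps what it returns to the caller, but can record the full
 * extent (here: *cacheable_end) before doing so. */
static void status_uncapped(int64_t offset, int64_t bytes, int64_t *pnum,
                            int64_t *cacheable_end)
{
    *pnum = probe_extent(offset);
    *cacheable_end = offset + *pnum;   /* this is what can feed a cache */
    *pnum = MIN(*pnum, bytes);         /* caller still sees a clamped value */
}
```

Both variants answer a 64 KiB query identically, but only the uncapped one leaves the 1 MiB extent available for caching.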
* [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() 2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz @ 2021-06-17 15:52 ` Max Reitz 2021-06-18 17:42 ` Eric Blake 2021-06-19 9:38 ` Vladimir Sementsov-Ogievskiy 2021-06-17 15:52 ` [PATCH 2/6] block: block-status cache for data regions Max Reitz ` (4 subsequent siblings) 5 siblings, 2 replies; 28+ messages in thread From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw) To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz There is a comment above the BDS definition stating care must be taken to consider handling newly added fields in bdrv_append(). Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac (nine years ago), and in any case, bdrv_swap() was dropped in 8e419aefa (six years ago). So no such care is necessary anymore. Signed-off-by: Max Reitz <mreitz@redhat.com> --- include/block/block_int.h | 6 ------ 1 file changed, 6 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index 057d88b1fc..a8f9598102 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -832,12 +832,6 @@ struct BdrvChild { QLIST_ENTRY(BdrvChild) next_parent; }; -/* - * Note: the function bdrv_append() copies and swaps contents of - * BlockDriverStates, so if you add new fields to this struct, please - * inspect bdrv_append() to determine if the new fields need to be - * copied as well. - */ struct BlockDriverState { /* Protected by big QEMU lock or read-only after opening. No special * locking needed during I/O... -- 2.31.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() 2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz @ 2021-06-18 17:42 ` Eric Blake 2021-06-19 9:38 ` Vladimir Sementsov-Ogievskiy 1 sibling, 0 replies; 28+ messages in thread From: Eric Blake @ 2021-06-18 17:42 UTC (permalink / raw) To: Max Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block On Thu, Jun 17, 2021 at 05:52:42PM +0200, Max Reitz wrote: > There is a comment above the BDS definition stating care must be taken > to consider handling newly added fields in bdrv_append(). > > Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac > (nine years ago), and in any case, bdrv_swap() was dropped in > 8e419aefa (six years ago). So no such care is necessary anymore. > > Signed-off-by: Max Reitz <mreitz@redhat.com> > --- > include/block/block_int.h | 6 ------ > 1 file changed, 6 deletions(-) Reviewed-by: Eric Blake <eblake@redhat.com> -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() 2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz 2021-06-18 17:42 ` Eric Blake @ 2021-06-19 9:38 ` Vladimir Sementsov-Ogievskiy 1 sibling, 0 replies; 28+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 9:38 UTC (permalink / raw) To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel 17.06.2021 18:52, Max Reitz wrote: > There is a comment above the BDS definition stating care must be taken > to consider handling newly added fields in bdrv_append(). > > Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac > (nine years ago), and in any case, bdrv_swap() was dropped in > 8e419aefa (six years ago). So no such care is necessary anymore. > > Signed-off-by: Max Reitz<mreitz@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 2/6] block: block-status cache for data regions 2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz 2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz @ 2021-06-17 15:52 ` Max Reitz 2021-06-18 18:51 ` Eric Blake 2021-06-19 10:20 ` Vladimir Sementsov-Ogievskiy 2021-06-17 15:52 ` [PATCH 3/6] block/file-posix: Do not force-cap *pnum Max Reitz ` (3 subsequent siblings) 5 siblings, 2 replies; 28+ messages in thread From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw) To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz As we have attempted before (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html, "file-posix: Cache lseek result for data regions"; https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html, "file-posix: Cache next hole"), this patch seeks to reduce the number of SEEK_DATA/HOLE operations the file-posix driver has to perform. The main difference is that this time it is implemented as part of the general block layer code. The problem we face is that on some filesystems or in some circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the implementation is outside of qemu, there is little we can do about its performance. We have already introduced the want_zero parameter to bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls unless we really want zero information; but sometimes we do want that information, because for files that consist largely of zero areas, special-casing those areas can give large performance boosts. So the real problem is with files that consist largely of data, so that inquiring the block status does not gain us much performance, but where such an inquiry itself takes a lot of time. To address this, we want to cache data regions. 
Most of the time, when bad performance is reported, it is in places where the image is iterated over from start to end (qemu-img convert or the mirror job), so a simple yet effective solution is to cache only the current data region. (Note that only caching data regions but not zero regions means that returning false information from the cache is not catastrophic: Treating zeroes as data is fine. While we try to invalidate the cache on zero writes and discards, such incongruences may still occur when there are other processes writing to the image.) We only use the cache for nodes without children (i.e. protocol nodes), because that is where the problem is: Drivers that rely on block-status implementations outside of qemu (e.g. SEEK_DATA/HOLE). Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307 Signed-off-by: Max Reitz <mreitz@redhat.com> --- include/block/block_int.h | 19 ++++++++++ block.c | 2 + block/io.c | 80 +++++++++++++++++++++++++++++++++++++-- 3 files changed, 98 insertions(+), 3 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index a8f9598102..c09512556a 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -832,6 +832,23 @@ struct BdrvChild { QLIST_ENTRY(BdrvChild) next_parent; }; +/* + * Allows bdrv_co_block_status() to cache one data region for a + * protocol node. + * + * @lock: Lock for accessing this object's fields + * @valid: Whether the cache is valid + * @data_start: Offset where we know (or strongly assume) is data + * @data_end: Offset where the data region ends (which is not necessarily + * the start of a zeroed region) + */ +typedef struct BdrvBlockStatusCache { + CoMutex lock; + bool valid; + int64_t data_start; + int64_t data_end; +} BdrvBlockStatusCache; + struct BlockDriverState { /* Protected by big QEMU lock or read-only after opening. No special * locking needed during I/O... 
@@ -997,6 +1014,8 @@ struct BlockDriverState { /* BdrvChild links to this node may never be frozen */ bool never_freeze; + + BdrvBlockStatusCache block_status_cache; }; struct BlockBackendRootState { diff --git a/block.c b/block.c index 3f456892d0..bad64d5c4f 100644 --- a/block.c +++ b/block.c @@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void) qemu_co_queue_init(&bs->flush_queue); + qemu_co_mutex_init(&bs->block_status_cache.lock); + for (i = 0; i < bdrv_drain_all_count; i++) { bdrv_drained_begin(bs); } diff --git a/block/io.c b/block/io.c index 323854d063..320638cc48 100644 --- a/block/io.c +++ b/block/io.c @@ -35,6 +35,7 @@ #include "qapi/error.h" #include "qemu/error-report.h" #include "qemu/main-loop.h" +#include "qemu/range.h" #include "sysemu/replay.h" /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */ @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, bool need_flush = false; int head = 0; int tail = 0; + BdrvBlockStatusCache *bsc = &bs->block_status_cache; int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX); int alignment = MAX(bs->bl.pwrite_zeroes_alignment, @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, return -ENOTSUP; } + /* Invalidate the cached block-status data range if this write overlaps */ + qemu_co_mutex_lock(&bsc->lock); + if (bsc->valid && + ranges_overlap(offset, bytes, bsc->data_start, + bsc->data_end - bsc->data_start)) + { + bsc->valid = false; + } + qemu_co_mutex_unlock(&bsc->lock); + assert(alignment % bs->bl.request_alignment == 0); head = offset % alignment; tail = (offset + bytes) % alignment; @@ -2394,6 +2406,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs, int64_t aligned_offset, aligned_bytes; uint32_t align; bool has_filtered_child; + BdrvBlockStatusCache *bsc = &bs->block_status_cache; assert(pnum); *pnum = 0; @@ -2442,9 +2455,59 @@ static int coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs, aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset; if (bs->drv->bdrv_co_block_status) { - ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset, - aligned_bytes, pnum, &local_map, - &local_file); + bool from_cache = false; + + /* + * Use the block-status cache only for protocol nodes: Format + * drivers are generally quick to inquire the status, but protocol + * drivers often need to get information from outside of qemu, so + * we do not have control over the actual implementation. There + * have been cases where inquiring the status took an unreasonably + * long time, and we can do nothing in qemu to fix it. + * This is especially problematic for images with large data areas, + * because finding the few holes in them and giving them special + * treatment does not gain much performance. Therefore, we try to + * cache the last-identified data region. + * + * Second, limiting ourselves to protocol nodes allows us to assume + * the block status for data regions to be DATA | OFFSET_VALID, and + * that the host offset is the same as the guest offset. + * + * Note that it is possible that external writers zero parts of + * the cached regions without the cache being invalidated, and so + * we may report zeroes as data. This is not catastrophic, + * however, because reporting zeroes as data is fine. 
+ */ + if (QLIST_EMPTY(&bs->children)) { + qemu_co_mutex_lock(&bsc->lock); + if (bsc->valid && + aligned_offset >= bsc->data_start && + aligned_offset < bsc->data_end) + { + ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; + local_file = bs; + local_map = aligned_offset; + + *pnum = bsc->data_end - aligned_offset; + + from_cache = true; + } + qemu_co_mutex_unlock(&bsc->lock); + } + + if (!from_cache) { + ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset, + aligned_bytes, pnum, &local_map, + &local_file); + + if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) { + qemu_co_mutex_lock(&bsc->lock); + bsc->data_start = aligned_offset; + bsc->data_end = aligned_offset + *pnum; + bsc->valid = true; + qemu_co_mutex_unlock(&bsc->lock); + } + } } else { /* Default code for filters */ @@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int max_pdiscard, ret; int head, tail, align; BlockDriverState *bs = child->bs; + BdrvBlockStatusCache *bsc = &bs->block_status_cache; if (!bs || !bs->drv || !bdrv_is_inserted(bs)) { return -ENOMEDIUM; @@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset, return 0; } + /* Invalidate the cached block-status data range if this discard overlaps */ + qemu_co_mutex_lock(&bsc->lock); + if (bsc->valid && + ranges_overlap(offset, bytes, bsc->data_start, + bsc->data_end - bsc->data_start)) + { + bsc->valid = false; + } + qemu_co_mutex_unlock(&bsc->lock); + /* Discard is advisory, but some devices track and coalesce * unaligned requests, so we must pass everything down rather than * round here. Still, most devices will just silently ignore -- 2.31.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
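The effect of the patch above on a sequential scan (qemu-img convert, the mirror job) can be modeled in a few lines. This is a simplified single-threaded sketch, not the patched qemu code: locking, want_zero, and the DATA | OFFSET_VALID flags are omitted, and `slow_driver_status` is an invented stand-in for a driver whose block-status query (e.g. SEEK_DATA/SEEK_HOLE) is expensive. On an all-data image, one driver call seeds the cache and every later query in the scan is a cache hit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FILE_SIZE (64 * 1024 * 1024)   /* 64 MiB all-data image */

static int driver_calls;

typedef struct BdrvBlockStatusCache {
    bool valid;
    int64_t data_start;
    int64_t data_end;
} BdrvBlockStatusCache;

/* Stand-in for bs->drv->bdrv_co_block_status(): expensive, reports the
 * whole remaining file as one data region. */
static void slow_driver_status(int64_t offset, int64_t *pnum)
{
    driver_calls++;
    *pnum = FILE_SIZE - offset;
}

/* Simplified fast path of the patched bdrv_co_block_status() */
static void block_status(BdrvBlockStatusCache *bsc, int64_t offset,
                         int64_t bytes, int64_t *pnum)
{
    if (bsc->valid && offset >= bsc->data_start && offset < bsc->data_end) {
        *pnum = bsc->data_end - offset;   /* cache hit: no driver call */
    } else {
        slow_driver_status(offset, pnum);
        bsc->data_start = offset;         /* cache the full data region */
        bsc->data_end = offset + *pnum;
        bsc->valid = true;
    }
    if (*pnum > bytes) {
        *pnum = bytes;   /* the generic layer clamps for the caller */
    }
}
```

Scanning the 64 MiB file in 1 MiB requests then invokes the driver exactly once instead of 64 times.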
* Re: [PATCH 2/6] block: block-status cache for data regions 2021-06-17 15:52 ` [PATCH 2/6] block: block-status cache for data regions Max Reitz @ 2021-06-18 18:51 ` Eric Blake 2021-06-21 9:37 ` Max Reitz 2021-06-19 10:20 ` Vladimir Sementsov-Ogievskiy 1 sibling, 1 reply; 28+ messages in thread From: Eric Blake @ 2021-06-18 18:51 UTC (permalink / raw) To: Max Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block On Thu, Jun 17, 2021 at 05:52:43PM +0200, Max Reitz wrote: > > To address this, we want to cache data regions. Most of the time, when > bad performance is reported, it is in places where the image is iterated > over from start to end (qemu-img convert or the mirror job), so a simple > yet effective solution is to cache only the current data region. Here's hoping third time's the charm! > > (Note that only caching data regions but not zero regions means that > returning false information from the cache is not catastrophic: Treating > zeroes as data is fine. While we try to invalidate the cache on zero > writes and discards, such incongruences may still occur when there are > other processes writing to the image.) > > We only use the cache for nodes without children (i.e. protocol nodes), > because that is where the problem is: Drivers that rely on block-status > implementations outside of qemu (e.g. SEEK_DATA/HOLE). > > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307 > Signed-off-by: Max Reitz <mreitz@redhat.com> > --- > include/block/block_int.h | 19 ++++++++++ > block.c | 2 + > block/io.c | 80 +++++++++++++++++++++++++++++++++++++-- > 3 files changed, 98 insertions(+), 3 deletions(-) > > diff --git a/include/block/block_int.h b/include/block/block_int.h > index a8f9598102..c09512556a 100644 > --- a/include/block/block_int.h > +++ b/include/block/block_int.h > @@ -832,6 +832,23 @@ struct BdrvChild { > QLIST_ENTRY(BdrvChild) next_parent; > }; > > +/* > + * Allows bdrv_co_block_status() to cache one data region for a > + * protocol node. 
> + * > + * @lock: Lock for accessing this object's fields > + * @valid: Whether the cache is valid > + * @data_start: Offset where we know (or strongly assume) is data > + * @data_end: Offset where the data region ends (which is not necessarily > + * the start of a zeroed region) > + */ > +typedef struct BdrvBlockStatusCache { > + CoMutex lock; > + bool valid; > + int64_t data_start; > + int64_t data_end; > +} BdrvBlockStatusCache; Looks like the right bits of information, and I'm glad you documented the need to be prepared for protocols that report split data sections rather than consolidated. > +++ b/block/io.c > @@ -35,6 +35,7 @@ > #include "qapi/error.h" > #include "qemu/error-report.h" > #include "qemu/main-loop.h" > +#include "qemu/range.h" > #include "sysemu/replay.h" > > /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */ > @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, > bool need_flush = false; > int head = 0; > int tail = 0; > + BdrvBlockStatusCache *bsc = &bs->block_status_cache; > > int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX); > int alignment = MAX(bs->bl.pwrite_zeroes_alignment, > @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, > return -ENOTSUP; > } > > + /* Invalidate the cached block-status data range if this write overlaps */ > + qemu_co_mutex_lock(&bsc->lock); Are we going to be suffering from a lot of lock contention performance degradation? Is there a way to take advantage of RCU access patterns for any more performance without sacrificing correctness? > + if (bsc->valid && > + ranges_overlap(offset, bytes, bsc->data_start, > + bsc->data_end - bsc->data_start)) > + { > + bsc->valid = false; > + } Do we want to invalidate the entire bsc, or can we be smart and leave the prefix intact (if offset > bsc->data_start, then set bsc->data_end to offset)? 
> + qemu_co_mutex_unlock(&bsc->lock); Worth using WITH_QEMU_LOCK_GUARD? > + > assert(alignment % bs->bl.request_alignment == 0); > head = offset % alignment; > tail = (offset + bytes) % alignment; > @@ -2394,6 +2406,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs, > int64_t aligned_offset, aligned_bytes; > uint32_t align; > bool has_filtered_child; > + BdrvBlockStatusCache *bsc = &bs->block_status_cache; > > assert(pnum); > *pnum = 0; > @@ -2442,9 +2455,59 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs, > aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset; > > if (bs->drv->bdrv_co_block_status) { > - ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset, > - aligned_bytes, pnum, &local_map, > - &local_file); > + bool from_cache = false; > + > + /* > + * Use the block-status cache only for protocol nodes: Format > + * drivers are generally quick to inquire the status, but protocol > + * drivers often need to get information from outside of qemu, so > + * we do not have control over the actual implementation. There > + * have been cases where inquiring the status took an unreasonably > + * long time, and we can do nothing in qemu to fix it. > + * This is especially problematic for images with large data areas, > + * because finding the few holes in them and giving them special > + * treatment does not gain much performance. Therefore, we try to > + * cache the last-identified data region. > + * > + * Second, limiting ourselves to protocol nodes allows us to assume > + * the block status for data regions to be DATA | OFFSET_VALID, and > + * that the host offset is the same as the guest offset. > + * > + * Note that it is possible that external writers zero parts of > + * the cached regions without the cache being invalidated, and so > + * we may report zeroes as data. This is not catastrophic, > + * however, because reporting zeroes as data is fine. > + */ Useful comment. 
> + if (QLIST_EMPTY(&bs->children)) { > + qemu_co_mutex_lock(&bsc->lock); > + if (bsc->valid && > + aligned_offset >= bsc->data_start && > + aligned_offset < bsc->data_end) > + { > + ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; > + local_file = bs; > + local_map = aligned_offset; > + > + *pnum = bsc->data_end - aligned_offset; > + > + from_cache = true; > + } > + qemu_co_mutex_unlock(&bsc->lock); > + } > + > + if (!from_cache) { > + ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset, > + aligned_bytes, pnum, &local_map, > + &local_file); > + > + if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) { > + qemu_co_mutex_lock(&bsc->lock); > + bsc->data_start = aligned_offset; > + bsc->data_end = aligned_offset + *pnum; > + bsc->valid = true; > + qemu_co_mutex_unlock(&bsc->lock); > + } > + } Looks correct. > } else { > /* Default code for filters */ > > @@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset, > int max_pdiscard, ret; > int head, tail, align; > BlockDriverState *bs = child->bs; > + BdrvBlockStatusCache *bsc = &bs->block_status_cache; > > if (!bs || !bs->drv || !bdrv_is_inserted(bs)) { > return -ENOMEDIUM; > @@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset, > return 0; > } > > + /* Invalidate the cached block-status data range if this discard overlaps */ > + qemu_co_mutex_lock(&bsc->lock); > + if (bsc->valid && > + ranges_overlap(offset, bytes, bsc->data_start, > + bsc->data_end - bsc->data_start)) > + { > + bsc->valid = false; > + } Same question as above about possibly shortening the cached range in-place rather than discarding it altogether. > + qemu_co_mutex_unlock(&bsc->lock); > + > /* Discard is advisory, but some devices track and coalesce > * unaligned requests, so we must pass everything down rather than > * round here. Still, most devices will just silently ignore > -- > 2.31.1 > > -- Eric Blake, Principal Software Engineer Red Hat, Inc. 
+1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 28+ messages in thread
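Eric's "shorten instead of drop" suggestion could look roughly like the sketch below. This is hypothetical (the series as posted simply clears `valid` on any overlap, and Max declines the idea in his reply as not worth the complexity); `bsc_clamp_on_overlap` is an invented name, and locking is omitted. When a zero write or discard overlaps only the tail of the cached region, the untouched data prefix can be kept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct BdrvBlockStatusCache {
    bool valid;
    int64_t data_start;
    int64_t data_end;
} BdrvBlockStatusCache;

/* On an overlapping zero-write/discard, trim the cache to the data
 * prefix that precedes the request instead of dropping it entirely. */
static void bsc_clamp_on_overlap(BdrvBlockStatusCache *bsc,
                                 int64_t offset, int64_t bytes)
{
    if (!bsc->valid) {
        return;
    }
    if (offset >= bsc->data_end || offset + bytes <= bsc->data_start) {
        return;   /* no overlap: cache unaffected */
    }
    if (offset > bsc->data_start) {
        bsc->data_end = offset;   /* keep the untouched prefix */
    } else {
        bsc->valid = false;       /* overlap reaches the region's start */
    }
}
```

A write into the middle of a cached [0, 1024) region thus leaves [0, offset) cached, while a write at the start still invalidates everything.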
* Re: [PATCH 2/6] block: block-status cache for data regions 2021-06-18 18:51 ` Eric Blake @ 2021-06-21 9:37 ` Max Reitz 0 siblings, 0 replies; 28+ messages in thread From: Max Reitz @ 2021-06-21 9:37 UTC (permalink / raw) To: Eric Blake; +Cc: Kevin Wolf, qemu-devel, qemu-block On 18.06.21 20:51, Eric Blake wrote: > On Thu, Jun 17, 2021 at 05:52:43PM +0200, Max Reitz wrote: >> To address this, we want to cache data regions. Most of the time, when >> bad performance is reported, it is in places where the image is iterated >> over from start to end (qemu-img convert or the mirror job), so a simple >> yet effective solution is to cache only the current data region. > Here's hoping third time's the charm! Indeed :) >> (Note that only caching data regions but not zero regions means that >> returning false information from the cache is not catastrophic: Treating >> zeroes as data is fine. While we try to invalidate the cache on zero >> writes and discards, such incongruences may still occur when there are >> other processes writing to the image.) >> >> We only use the cache for nodes without children (i.e. protocol nodes), >> because that is where the problem is: Drivers that rely on block-status >> implementations outside of qemu (e.g. SEEK_DATA/HOLE). >> >> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307 >> Signed-off-by: Max Reitz <mreitz@redhat.com> >> --- >> include/block/block_int.h | 19 ++++++++++ >> block.c | 2 + >> block/io.c | 80 +++++++++++++++++++++++++++++++++++++-- >> 3 files changed, 98 insertions(+), 3 deletions(-) >> >> diff --git a/include/block/block_int.h b/include/block/block_int.h >> index a8f9598102..c09512556a 100644 >> --- a/include/block/block_int.h >> +++ b/include/block/block_int.h >> @@ -832,6 +832,23 @@ struct BdrvChild { >> QLIST_ENTRY(BdrvChild) next_parent; >> }; >> >> +/* >> + * Allows bdrv_co_block_status() to cache one data region for a >> + * protocol node. 
>> + * >> + * @lock: Lock for accessing this object's fields >> + * @valid: Whether the cache is valid >> + * @data_start: Offset where we know (or strongly assume) is data >> + * @data_end: Offset where the data region ends (which is not necessarily >> + * the start of a zeroed region) >> + */ >> +typedef struct BdrvBlockStatusCache { >> + CoMutex lock; >> + bool valid; >> + int64_t data_start; >> + int64_t data_end; >> +} BdrvBlockStatusCache; > Looks like the right bits of information, and I'm glad you documented > the need to be prepared for protocols that report split data sections > rather than consolidated. > >> +++ b/block/io.c >> @@ -35,6 +35,7 @@ >> #include "qapi/error.h" >> #include "qemu/error-report.h" >> #include "qemu/main-loop.h" >> +#include "qemu/range.h" >> #include "sysemu/replay.h" >> >> /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */ >> @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, >> bool need_flush = false; >> int head = 0; >> int tail = 0; >> + BdrvBlockStatusCache *bsc = &bs->block_status_cache; >> >> int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX); >> int alignment = MAX(bs->bl.pwrite_zeroes_alignment, >> @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, >> return -ENOTSUP; >> } >> >> + /* Invalidate the cached block-status data range if this write overlaps */ >> + qemu_co_mutex_lock(&bsc->lock); > Are we going to be suffering from a lot of lock contention performance > degradation? Is there a way to take advantage of RCU access patterns > for any more performance without sacrificing correctness? The critical section is so short that I considered it fine. I wanted to use RW locks, but then realized that every RW lock operation is internally locked by another mutex, so it wouldn’t gain anything. I’m not sure whether RCU is worth it here. 
We could try something very crude, namely to just not take a lock and make `valid` an atomic. After all, it doesn’t really matter whether `data_start` and `data_end` are consistent values, and resetting `valid` to false is always safe. The worst that could happen is that a concurrent block-status call tries to set up an overlapping data area, which we thus fail to recognize here. But if such a thing were to happen, it could just as well happen before said concurrent call took any lock on `bsc`. >> + if (bsc->valid && >> + ranges_overlap(offset, bytes, bsc->data_start, >> + bsc->data_end - bsc->data_start)) >> + { >> + bsc->valid = false; >> + } > Do we want to invalidate the entire bsc, or can we be smart and leave > the prefix intact (if offset > bsc->data_start, then set bsc->data_end > to offset)? Perhaps we could be smart, but I don’t think it really makes a difference in practice, so I think keeping it simple is better. >> + qemu_co_mutex_unlock(&bsc->lock); > Worth using WITH_QEMU_LOCK_GUARD? I knew I forgot something, right. Will use! Max ^ permalink raw reply [flat|nested] 28+ messages in thread
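The "very crude" lock-free variant Max floats here might look like the following C11 sketch. This is an assumption on my part, not what the series implements: `valid` becomes an atomic that is cleared before and published after a field update, and invalidation just clears it without any lock. As the email argues, the remaining races are tolerable for this cache, because failing to recognize an overlap only means reporting zeroes as data; a fully consistent scheme would need RCU or a seqlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct BdrvBlockStatusCache {
    atomic_bool valid;
    int64_t data_start;
    int64_t data_end;
} BdrvBlockStatusCache;

/* Clearing the flag is always safe, even racing with an update:
 * the worst outcome is an unnecessary cache miss. */
static void bsc_invalidate(BdrvBlockStatusCache *bsc)
{
    atomic_store_explicit(&bsc->valid, false, memory_order_release);
}

static void bsc_update(BdrvBlockStatusCache *bsc, int64_t start, int64_t end)
{
    /* Hide the cache while the fields are in flux... */
    atomic_store_explicit(&bsc->valid, false, memory_order_relaxed);
    bsc->data_start = start;
    bsc->data_end = end;
    /* ...and publish only once both fields are written. */
    atomic_store_explicit(&bsc->valid, true, memory_order_release);
}

static bool bsc_lookup(BdrvBlockStatusCache *bsc, int64_t offset,
                       int64_t *pnum)
{
    if (!atomic_load_explicit(&bsc->valid, memory_order_acquire)) {
        return false;
    }
    if (offset < bsc->data_start || offset >= bsc->data_end) {
        return false;
    }
    *pnum = bsc->data_end - offset;
    return true;
}
```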
* Re: [PATCH 2/6] block: block-status cache for data regions 2021-06-17 15:52 ` [PATCH 2/6] block: block-status cache for data regions Max Reitz 2021-06-18 18:51 ` Eric Blake @ 2021-06-19 10:20 ` Vladimir Sementsov-Ogievskiy 2021-06-21 10:05 ` Max Reitz 1 sibling, 1 reply; 28+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 10:20 UTC (permalink / raw) To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel 17.06.2021 18:52, Max Reitz wrote: > As we have attempted before > (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html, > "file-posix: Cache lseek result for data regions"; > https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html, > "file-posix: Cache next hole"), this patch seeks to reduce the number of > SEEK_DATA/HOLE operations the file-posix driver has to perform. The > main difference is that this time it is implemented as part of the > general block layer code. > > The problem we face is that on some filesystems or in some > circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the > implementation is outside of qemu, there is little we can do about its > performance. > > We have already introduced the want_zero parameter to > bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls > unless we really want zero information; but sometimes we do want that > information, because for files that consist largely of zero areas, > special-casing those areas can give large performance boosts. So the > real problem is with files that consist largely of data, so that > inquiring the block status does not gain us much performance, but where > such an inquiry itself takes a lot of time. > > To address this, we want to cache data regions. Most of the time, when > bad performance is reported, it is in places where the image is iterated > over from start to end (qemu-img convert or the mirror job), so a simple > yet effective solution is to cache only the current data region. 
> > (Note that only caching data regions but not zero regions means that > returning false information from the cache is not catastrophic: Treating > zeroes as data is fine. While we try to invalidate the cache on zero > writes and discards, such incongruences may still occur when there are > other processes writing to the image.) > > We only use the cache for nodes without children (i.e. protocol nodes), > because that is where the problem is: Drivers that rely on block-status > implementations outside of qemu (e.g. SEEK_DATA/HOLE). > > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307 > Signed-off-by: Max Reitz <mreitz@redhat.com> > --- > include/block/block_int.h | 19 ++++++++++ > block.c | 2 + > block/io.c | 80 +++++++++++++++++++++++++++++++++++++-- > 3 files changed, 98 insertions(+), 3 deletions(-) > > diff --git a/include/block/block_int.h b/include/block/block_int.h > index a8f9598102..c09512556a 100644 > --- a/include/block/block_int.h > +++ b/include/block/block_int.h > @@ -832,6 +832,23 @@ struct BdrvChild { > QLIST_ENTRY(BdrvChild) next_parent; > }; > > +/* > + * Allows bdrv_co_block_status() to cache one data region for a > + * protocol node. > + * > + * @lock: Lock for accessing this object's fields > + * @valid: Whether the cache is valid > + * @data_start: Offset where we know (or strongly assume) is data > + * @data_end: Offset where the data region ends (which is not necessarily > + * the start of a zeroed region) > + */ > +typedef struct BdrvBlockStatusCache { > + CoMutex lock; > + bool valid; > + int64_t data_start; > + int64_t data_end; > +} BdrvBlockStatusCache; > + > struct BlockDriverState { > /* Protected by big QEMU lock or read-only after opening. No special > * locking needed during I/O... 
> @@ -997,6 +1014,8 @@ struct BlockDriverState { > > /* BdrvChild links to this node may never be frozen */ > bool never_freeze; > + > + BdrvBlockStatusCache block_status_cache; > }; > > struct BlockBackendRootState { > diff --git a/block.c b/block.c > index 3f456892d0..bad64d5c4f 100644 > --- a/block.c > +++ b/block.c > @@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void) > > qemu_co_queue_init(&bs->flush_queue); > > + qemu_co_mutex_init(&bs->block_status_cache.lock); > + > for (i = 0; i < bdrv_drain_all_count; i++) { > bdrv_drained_begin(bs); > } > diff --git a/block/io.c b/block/io.c > index 323854d063..320638cc48 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -35,6 +35,7 @@ > #include "qapi/error.h" > #include "qemu/error-report.h" > #include "qemu/main-loop.h" > +#include "qemu/range.h" > #include "sysemu/replay.h" > > /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */ > @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, > bool need_flush = false; > int head = 0; > int tail = 0; > + BdrvBlockStatusCache *bsc = &bs->block_status_cache; > > int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX); > int alignment = MAX(bs->bl.pwrite_zeroes_alignment, > @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, > return -ENOTSUP; > } > > + /* Invalidate the cached block-status data range if this write overlaps */ > + qemu_co_mutex_lock(&bsc->lock); > + if (bsc->valid && > + ranges_overlap(offset, bytes, bsc->data_start, > + bsc->data_end - bsc->data_start)) > + { > + bsc->valid = false; > + } > + qemu_co_mutex_unlock(&bsc->lock); I think this fragment used twice worth small function, like block_status_cache_discard_region(). That would be clean and avoids duplication.. 
> + > assert(alignment % bs->bl.request_alignment == 0); > head = offset % alignment; > tail = (offset + bytes) % alignment; > @@ -2394,6 +2406,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs, > int64_t aligned_offset, aligned_bytes; > uint32_t align; > bool has_filtered_child; > + BdrvBlockStatusCache *bsc = &bs->block_status_cache; > > assert(pnum); > *pnum = 0; > @@ -2442,9 +2455,59 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs, > aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset; > > if (bs->drv->bdrv_co_block_status) { > - ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset, > - aligned_bytes, pnum, &local_map, > - &local_file); > + bool from_cache = false; > + > + /* > + * Use the block-status cache only for protocol nodes: Format > + * drivers are generally quick to inquire the status, but protocol > + * drivers often need to get information from outside of qemu, so > + * we do not have control over the actual implementation. There > + * have been cases where inquiring the status took an unreasonably > + * long time, and we can do nothing in qemu to fix it. > + * This is especially problematic for images with large data areas, > + * because finding the few holes in them and giving them special > + * treatment does not gain much performance. Therefore, we try to > + * cache the last-identified data region. > + * > + * Second, limiting ourselves to protocol nodes allows us to assume > + * the block status for data regions to be DATA | OFFSET_VALID, and > + * that the host offset is the same as the guest offset. > + * > + * Note that it is possible that external writers zero parts of > + * the cached regions without the cache being invalidated, and so > + * we may report zeroes as data. This is not catastrophic, > + * however, because reporting zeroes as data is fine. 
> + */ > + if (QLIST_EMPTY(&bs->children)) { > + qemu_co_mutex_lock(&bsc->lock); > + if (bsc->valid && > + aligned_offset >= bsc->data_start && > + aligned_offset < bsc->data_end) > + { > + ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; > + local_file = bs; > + local_map = aligned_offset; > + > + *pnum = bsc->data_end - aligned_offset; > + > + from_cache = true; > + } > + qemu_co_mutex_unlock(&bsc->lock); > + } > + > + if (!from_cache) { > + ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset, > + aligned_bytes, pnum, &local_map, > + &local_file); > + > + if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) { You do cache the data for any driver that returns DATA | OFFSET_VALID. But it's unused for format drivers (which may return OFFSET_VALID of course). I think here it should be if ((ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) && QLIST_EMPTY(&bs->children)) { > + qemu_co_mutex_lock(&bsc->lock); > + bsc->data_start = aligned_offset; > + bsc->data_end = aligned_offset + *pnum; > + bsc->valid = true; > + qemu_co_mutex_unlock(&bsc->lock); > + } > + } Hmm, also, the new additions may be worth separate small functions too, so that the new object gets a clean API: bsc_discard() bsc_get() bsc_set() or something like this. I don't insist. 
> } else { > /* Default code for filters */ > > @@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset, > int max_pdiscard, ret; > int head, tail, align; > BlockDriverState *bs = child->bs; > + BdrvBlockStatusCache *bsc = &bs->block_status_cache; > > if (!bs || !bs->drv || !bdrv_is_inserted(bs)) { > return -ENOMEDIUM; > @@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset, > return 0; > } > > + /* Invalidate the cached block-status data range if this discard overlaps */ > + qemu_co_mutex_lock(&bsc->lock); > + if (bsc->valid && > + ranges_overlap(offset, bytes, bsc->data_start, > + bsc->data_end - bsc->data_start)) > + { > + bsc->valid = false; > + } > + qemu_co_mutex_unlock(&bsc->lock); > + > /* Discard is advisory, but some devices track and coalesce > * unaligned requests, so we must pass everything down rather than > * round here. Still, most devices will just silently ignore > Overall, seems good to me. -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 28+ messages in thread
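A stand-alone sketch of the helper suggested in the review above. This is plain C, not qemu code: the CoMutex is omitted (the real helper would take bsc->lock around the check), and qemu's ranges_overlap() is modeled inline for two half-open ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for BdrvBlockStatusCache (lock omitted). */
typedef struct BlockStatusCache {
    bool valid;
    int64_t data_start;
    int64_t data_end;
} BlockStatusCache;

/* Hypothetical block_status_cache_discard_region(): drop the cached data
 * region if the written/discarded range [offset, offset + bytes) overlaps
 * it -- the fragment duplicated in the pwrite-zeroes and pdiscard paths. */
static void block_status_cache_discard_region(BlockStatusCache *bsc,
                                              int64_t offset, int64_t bytes)
{
    /* same condition as ranges_overlap(offset, bytes, start, end - start) */
    if (bsc->valid &&
        offset < bsc->data_end &&
        bsc->data_start < offset + bytes)
    {
        bsc->valid = false;
    }
}
```

Note that with half-open ranges a write that merely touches the cached region's boundary does not invalidate it.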
* Re: [PATCH 2/6] block: block-status cache for data regions 2021-06-19 10:20 ` Vladimir Sementsov-Ogievskiy @ 2021-06-21 10:05 ` Max Reitz 0 siblings, 0 replies; 28+ messages in thread From: Max Reitz @ 2021-06-21 10:05 UTC (permalink / raw) To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel On 19.06.21 12:20, Vladimir Sementsov-Ogievskiy wrote: > 17.06.2021 18:52, Max Reitz wrote: >> As we have attempted before >> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html, >> "file-posix: Cache lseek result for data regions"; >> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html, >> "file-posix: Cache next hole"), this patch seeks to reduce the number of >> SEEK_DATA/HOLE operations the file-posix driver has to perform. The >> main difference is that this time it is implemented as part of the >> general block layer code. >> >> The problem we face is that on some filesystems or in some >> circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the >> implementation is outside of qemu, there is little we can do about its >> performance. >> >> We have already introduced the want_zero parameter to >> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls >> unless we really want zero information; but sometimes we do want that >> information, because for files that consist largely of zero areas, >> special-casing those areas can give large performance boosts. So the >> real problem is with files that consist largely of data, so that >> inquiring the block status does not gain us much performance, but where >> such an inquiry itself takes a lot of time. >> >> To address this, we want to cache data regions. Most of the time, when >> bad performance is reported, it is in places where the image is iterated >> over from start to end (qemu-img convert or the mirror job), so a simple >> yet effective solution is to cache only the current data region. 
>> >> (Note that only caching data regions but not zero regions means that >> returning false information from the cache is not catastrophic: Treating >> zeroes as data is fine. While we try to invalidate the cache on zero >> writes and discards, such incongruences may still occur when there are >> other processes writing to the image.) >> >> We only use the cache for nodes without children (i.e. protocol nodes), >> because that is where the problem is: Drivers that rely on block-status >> implementations outside of qemu (e.g. SEEK_DATA/HOLE). >> >> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307 >> Signed-off-by: Max Reitz <mreitz@redhat.com> >> --- >> include/block/block_int.h | 19 ++++++++++ >> block.c | 2 + >> block/io.c | 80 +++++++++++++++++++++++++++++++++++++-- >> 3 files changed, 98 insertions(+), 3 deletions(-) >> >> diff --git a/include/block/block_int.h b/include/block/block_int.h >> index a8f9598102..c09512556a 100644 >> --- a/include/block/block_int.h >> +++ b/include/block/block_int.h >> @@ -832,6 +832,23 @@ struct BdrvChild { >> QLIST_ENTRY(BdrvChild) next_parent; >> }; >> +/* >> + * Allows bdrv_co_block_status() to cache one data region for a >> + * protocol node. >> + * >> + * @lock: Lock for accessing this object's fields >> + * @valid: Whether the cache is valid >> + * @data_start: Offset where we know (or strongly assume) is data >> + * @data_end: Offset where the data region ends (which is not >> necessarily >> + * the start of a zeroed region) >> + */ >> +typedef struct BdrvBlockStatusCache { >> + CoMutex lock; >> + bool valid; >> + int64_t data_start; >> + int64_t data_end; >> +} BdrvBlockStatusCache; >> + >> struct BlockDriverState { >> /* Protected by big QEMU lock or read-only after opening. No >> special >> * locking needed during I/O... 
>> @@ -997,6 +1014,8 @@ struct BlockDriverState { >> /* BdrvChild links to this node may never be frozen */ >> bool never_freeze; >> + >> + BdrvBlockStatusCache block_status_cache; >> }; >> struct BlockBackendRootState { >> diff --git a/block.c b/block.c >> index 3f456892d0..bad64d5c4f 100644 >> --- a/block.c >> +++ b/block.c >> @@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void) >> qemu_co_queue_init(&bs->flush_queue); >> + qemu_co_mutex_init(&bs->block_status_cache.lock); >> + >> for (i = 0; i < bdrv_drain_all_count; i++) { >> bdrv_drained_begin(bs); >> } >> diff --git a/block/io.c b/block/io.c >> index 323854d063..320638cc48 100644 >> --- a/block/io.c >> +++ b/block/io.c >> @@ -35,6 +35,7 @@ >> #include "qapi/error.h" >> #include "qemu/error-report.h" >> #include "qemu/main-loop.h" >> +#include "qemu/range.h" >> #include "sysemu/replay.h" >> /* Maximum bounce buffer for copy-on-read and write zeroes, in >> bytes */ >> @@ -1862,6 +1863,7 @@ static int coroutine_fn >> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, >> bool need_flush = false; >> int head = 0; >> int tail = 0; >> + BdrvBlockStatusCache *bsc = &bs->block_status_cache; >> int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, >> INT_MAX); >> int alignment = MAX(bs->bl.pwrite_zeroes_alignment, >> @@ -1878,6 +1880,16 @@ static int coroutine_fn >> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, >> return -ENOTSUP; >> } >> + /* Invalidate the cached block-status data range if this write >> overlaps */ >> + qemu_co_mutex_lock(&bsc->lock); >> + if (bsc->valid && >> + ranges_overlap(offset, bytes, bsc->data_start, >> + bsc->data_end - bsc->data_start)) >> + { >> + bsc->valid = false; >> + } >> + qemu_co_mutex_unlock(&bsc->lock); > > I think this fragment used twice worth small function, like > > block_status_cache_discard_region(). > > That would be clean and avoids duplication.. Sure, sounds good. 
>> + >> assert(alignment % bs->bl.request_alignment == 0); >> head = offset % alignment; >> tail = (offset + bytes) % alignment; >> @@ -2394,6 +2406,7 @@ static int coroutine_fn >> bdrv_co_block_status(BlockDriverState *bs, >> int64_t aligned_offset, aligned_bytes; >> uint32_t align; >> bool has_filtered_child; >> + BdrvBlockStatusCache *bsc = &bs->block_status_cache; >> assert(pnum); >> *pnum = 0; >> @@ -2442,9 +2455,59 @@ static int coroutine_fn >> bdrv_co_block_status(BlockDriverState *bs, >> aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset; >> if (bs->drv->bdrv_co_block_status) { >> - ret = bs->drv->bdrv_co_block_status(bs, want_zero, >> aligned_offset, >> - aligned_bytes, pnum, >> &local_map, >> - &local_file); >> + bool from_cache = false; >> + >> + /* >> + * Use the block-status cache only for protocol nodes: Format >> + * drivers are generally quick to inquire the status, but >> protocol >> + * drivers often need to get information from outside of >> qemu, so >> + * we do not have control over the actual implementation. >> There >> + * have been cases where inquiring the status took an >> unreasonably >> + * long time, and we can do nothing in qemu to fix it. >> + * This is especially problematic for images with large data >> areas, >> + * because finding the few holes in them and giving them >> special >> + * treatment does not gain much performance. Therefore, we >> try to >> + * cache the last-identified data region. >> + * >> + * Second, limiting ourselves to protocol nodes allows us to >> assume >> + * the block status for data regions to be DATA | >> OFFSET_VALID, and >> + * that the host offset is the same as the guest offset. >> + * >> + * Note that it is possible that external writers zero parts of >> + * the cached regions without the cache being invalidated, >> and so >> + * we may report zeroes as data. This is not catastrophic, >> + * however, because reporting zeroes as data is fine. 
>> + */ >> + if (QLIST_EMPTY(&bs->children)) { >> + qemu_co_mutex_lock(&bsc->lock); >> + if (bsc->valid && >> + aligned_offset >= bsc->data_start && >> + aligned_offset < bsc->data_end) >> + { >> + ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; >> + local_file = bs; >> + local_map = aligned_offset; >> + >> + *pnum = bsc->data_end - aligned_offset; >> + >> + from_cache = true; >> + } >> + qemu_co_mutex_unlock(&bsc->lock); >> + } >> + >> + if (!from_cache) { >> + ret = bs->drv->bdrv_co_block_status(bs, want_zero, >> aligned_offset, >> + aligned_bytes, pnum, >> &local_map, >> + &local_file); >> + >> + if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) { > > You do cache the data for any driver that returns DATA | OFFSET_VALID. > But it's unused for format drivers (which may return OFFSET_VALID of > course). > > I think, here should be > > if ((ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) && > QLIST_EMPTY(&bs->children)) { Actually, I originally had it like this but then was afraid that a reviewer would ask me why I check the children list twice, once here and once when reading from the cache. :) So I thought I should check it only once, and the better place to do so was where the cache is actually used. I don’t think it makes much of a difference in practice (setting the cache doesn’t really matter if it isn’t used), but perhaps it might make the checks in write-zero/discard a bit quicker, because they can return early with `valid == false`. >> + qemu_co_mutex_lock(&bsc->lock); >> + bsc->data_start = aligned_offset; >> + bsc->data_end = aligned_offset + *pnum; >> + bsc->valid = true; >> + qemu_co_mutex_unlock(&bsc->lock); >> + } >> + } > > > Hmm, also, new additions may worth a separate small functions too, so > that new object gets a clean API: > > bsc_discard() > bsc_get() > bsc_set() > > or something like this. I don't insist. Sounds good, will do. 
>> } else { >> /* Default code for filters */ >> @@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild >> *child, int64_t offset, >> int max_pdiscard, ret; >> int head, tail, align; >> BlockDriverState *bs = child->bs; >> + BdrvBlockStatusCache *bsc = &bs->block_status_cache; >> if (!bs || !bs->drv || !bdrv_is_inserted(bs)) { >> return -ENOMEDIUM; >> @@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild >> *child, int64_t offset, >> return 0; >> } >> + /* Invalidate the cached block-status data range if this >> discard overlaps */ >> + qemu_co_mutex_lock(&bsc->lock); >> + if (bsc->valid && >> + ranges_overlap(offset, bytes, bsc->data_start, >> + bsc->data_end - bsc->data_start)) >> + { >> + bsc->valid = false; >> + } >> + qemu_co_mutex_unlock(&bsc->lock); >> + >> /* Discard is advisory, but some devices track and coalesce >> * unaligned requests, so we must pass everything down rather than >> * round here. Still, most devices will just silently ignore >> > > Overall, seems good to me. Great :) Max ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 3/6] block/file-posix: Do not force-cap *pnum 2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz 2021-06-17 15:52 ` [PATCH 1/6] block: Drop BDS comment regarding bdrv_append() Max Reitz 2021-06-17 15:52 ` [PATCH 2/6] block: block-status cache for data regions Max Reitz @ 2021-06-17 15:52 ` Max Reitz 2021-06-18 20:16 ` Eric Blake 2021-06-19 10:32 ` Vladimir Sementsov-Ogievskiy 2021-06-17 15:52 ` [PATCH 4/6] block/gluster: " Max Reitz ` (2 subsequent siblings) 5 siblings, 2 replies; 28+ messages in thread From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw) To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz <mreitz@redhat.com> --- block/file-posix.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index b3fbb9bd63..aeb370d5bb 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2689,7 +2689,8 @@ static int find_allocation(BlockDriverState *bs, off_t start, * the specified offset) that are known to be in the same * allocated/unallocated state. * - * 'bytes' is the max value 'pnum' should be set to. + * 'bytes' is a soft cap for 'pnum'. If the information is free, 'pnum' may + * well exceed it. */ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, bool want_zero, @@ -2727,7 +2728,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, } else if (data == offset) { /* On a data extent, compute bytes to the end of the extent, * possibly including a partial sector at EOF. 
*/ - *pnum = MIN(bytes, hole - offset); + *pnum = hole - offset; /* * We are not allowed to return partial sectors, though, so @@ -2746,7 +2747,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, } else { /* On a hole, compute bytes to the beginning of the next extent. */ assert(hole == offset); - *pnum = MIN(bytes, data - offset); + *pnum = data - offset; ret = BDRV_BLOCK_ZERO; } *map = offset; -- 2.31.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
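The division of labor this patch relies on can be shown with a toy sketch (not qemu's actual code): the driver may set *pnum beyond the requested byte count, and the generic bdrv_co_block_status() clamps the value for its caller after it has had the chance to record the larger region in the cache.

```c
#include <assert.h>
#include <stdint.h>

static int64_t cached_data_len; /* stands in for the block-status cache */

/* Toy generic layer: take the (possibly over-long) *pnum a driver
 * returned, record the full length, then clamp what the caller sees. */
static int64_t generic_block_status(int64_t requested_bytes,
                                    int64_t driver_pnum)
{
    /* the cache keeps the full extent length, uncapped */
    cached_data_len = driver_pnum;

    /* the caller still never sees more than it asked for */
    return driver_pnum < requested_bytes ? driver_pnum : requested_bytes;
}
```

This is why capping in the driver is pure loss: it cannot change what the caller sees, only what the cache can learn.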
* Re: [PATCH 3/6] block/file-posix: Do not force-cap *pnum 2021-06-17 15:52 ` [PATCH 3/6] block/file-posix: Do not force-cap *pnum Max Reitz @ 2021-06-18 20:16 ` Eric Blake 2021-06-21 9:38 ` Max Reitz 2021-06-19 10:32 ` Vladimir Sementsov-Ogievskiy 1 sibling, 1 reply; 28+ messages in thread From: Eric Blake @ 2021-06-18 20:16 UTC (permalink / raw) To: Max Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block On Thu, Jun 17, 2021 at 05:52:44PM +0200, Max Reitz wrote: > bdrv_co_block_status() does it for us, we do not need to do it here. > > The advantage of not capping *pnum is that bdrv_co_block_status() can > cache larger data regions than requested by its caller. We should update the documentation in include/block/block_int.h to mention that the driver's block_status callback may treat *pnum as a soft cap, and that returning a larger value is fine. But I agree with this change in the individual drivers, as long as we remember to make our global contract explicit that we can now rely on it ;) Reviewed-by: Eric Blake <eblake@redhat.com> -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/6] block/file-posix: Do not force-cap *pnum 2021-06-18 20:16 ` Eric Blake @ 2021-06-21 9:38 ` Max Reitz 0 siblings, 0 replies; 28+ messages in thread From: Max Reitz @ 2021-06-21 9:38 UTC (permalink / raw) To: Eric Blake; +Cc: Kevin Wolf, qemu-devel, qemu-block On 18.06.21 22:16, Eric Blake wrote: > On Thu, Jun 17, 2021 at 05:52:44PM +0200, Max Reitz wrote: >> bdrv_co_block_status() does it for us, we do not need to do it here. >> >> The advantage of not capping *pnum is that bdrv_co_block_status() can >> cache larger data regions than requested by its caller. > We should update the documentation in include/block/block_int.h to > mention that the driver's block_status callback may treat *pnum as a > soft cap, and that returning a larger value is fine. Oh, sure. Max > But I agree with this change in the individual drivers, as long as we > remember to make our global contract explicit that we can now rely on > it ;) > > Reviewed-by: Eric Blake <eblake@redhat.com> > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/6] block/file-posix: Do not force-cap *pnum 2021-06-17 15:52 ` [PATCH 3/6] block/file-posix: Do not force-cap *pnum Max Reitz 2021-06-18 20:16 ` Eric Blake @ 2021-06-19 10:32 ` Vladimir Sementsov-Ogievskiy 1 sibling, 0 replies; 28+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 10:32 UTC (permalink / raw) To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel 17.06.2021 18:52, Max Reitz wrote: > bdrv_co_block_status() does it for us, we do not need to do it here. > > The advantage of not capping *pnum is that bdrv_co_block_status() can > cache larger data regions than requested by its caller. > > Signed-off-by: Max Reitz<mreitz@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 4/6] block/gluster: Do not force-cap *pnum 2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz ` (2 preceding siblings ...) 2021-06-17 15:52 ` [PATCH 3/6] block/file-posix: Do not force-cap *pnum Max Reitz @ 2021-06-17 15:52 ` Max Reitz 2021-06-18 20:17 ` Eric Blake 2021-06-19 10:36 ` Vladimir Sementsov-Ogievskiy 2021-06-17 15:52 ` [PATCH 5/6] block/nbd: " Max Reitz 2021-06-17 15:52 ` [PATCH 6/6] block/iscsi: " Max Reitz 5 siblings, 2 replies; 28+ messages in thread From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw) To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz <mreitz@redhat.com> --- block/gluster.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/block/gluster.c b/block/gluster.c index e8ee14c8e9..8ef7bb18d5 100644 --- a/block/gluster.c +++ b/block/gluster.c @@ -1461,7 +1461,8 @@ exit: * the specified offset) that are known to be in the same * allocated/unallocated state. * - * 'bytes' is the max value 'pnum' should be set to. + * 'bytes' is a soft cap for 'pnum'. If the information is free, 'pnum' may + * well exceed it. * * (Based on raw_co_block_status() from file-posix.c.) */ @@ -1500,12 +1501,12 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs, } else if (data == offset) { /* On a data extent, compute bytes to the end of the extent, * possibly including a partial sector at EOF. */ - *pnum = MIN(bytes, hole - offset); + *pnum = hole - offset; ret = BDRV_BLOCK_DATA; } else { /* On a hole, compute bytes to the beginning of the next extent. */ assert(hole == offset); - *pnum = MIN(bytes, data - offset); + *pnum = data - offset; ret = BDRV_BLOCK_ZERO; } -- 2.31.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 4/6] block/gluster: Do not force-cap *pnum 2021-06-17 15:52 ` [PATCH 4/6] block/gluster: " Max Reitz @ 2021-06-18 20:17 ` Eric Blake 2021-06-19 10:36 ` Vladimir Sementsov-Ogievskiy 1 sibling, 0 replies; 28+ messages in thread From: Eric Blake @ 2021-06-18 20:17 UTC (permalink / raw) To: Max Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block On Thu, Jun 17, 2021 at 05:52:45PM +0200, Max Reitz wrote: > bdrv_co_block_status() does it for us, we do not need to do it here. > > The advantage of not capping *pnum is that bdrv_co_block_status() can > cache larger data regions than requested by its caller. > > Signed-off-by: Max Reitz <mreitz@redhat.com> > --- > block/gluster.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) Reviewed-by: Eric Blake <eblake@redhat.com> -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 4/6] block/gluster: Do not force-cap *pnum 2021-06-17 15:52 ` [PATCH 4/6] block/gluster: " Max Reitz 2021-06-18 20:17 ` Eric Blake @ 2021-06-19 10:36 ` Vladimir Sementsov-Ogievskiy 2021-06-21 9:47 ` Max Reitz 1 sibling, 1 reply; 28+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 10:36 UTC (permalink / raw) To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel 17.06.2021 18:52, Max Reitz wrote: > bdrv_co_block_status() does it for us, we do not need to do it here. > > The advantage of not capping *pnum is that bdrv_co_block_status() can > cache larger data regions than requested by its caller. > > Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> > --- > block/gluster.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/block/gluster.c b/block/gluster.c > index e8ee14c8e9..8ef7bb18d5 100644 > --- a/block/gluster.c > +++ b/block/gluster.c > @@ -1461,7 +1461,8 @@ exit: > * the specified offset) that are known to be in the same > * allocated/unallocated state. > * > - * 'bytes' is the max value 'pnum' should be set to. > + * 'bytes' is a soft cap for 'pnum'. If the information is free, 'pnum' may > + * well exceed it. > * > * (Based on raw_co_block_status() from file-posix.c.) > */ > @@ -1500,12 +1501,12 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs, > } else if (data == offset) { > /* On a data extent, compute bytes to the end of the extent, > * possibly including a partial sector at EOF. */ > - *pnum = MIN(bytes, hole - offset); > + *pnum = hole - offset; > ret = BDRV_BLOCK_DATA; Interesting, isn't it a bug that we don't ROUND_UP *pnum to request_alignment here like it is done in file-posix ? > } else { > /* On a hole, compute bytes to the beginning of the next extent. 
*/ > assert(hole == offset); > - *pnum = MIN(bytes, data - offset); > + *pnum = data - offset; > ret = BDRV_BLOCK_ZERO; > } > > -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 4/6] block/gluster: Do not force-cap *pnum 2021-06-19 10:36 ` Vladimir Sementsov-Ogievskiy @ 2021-06-21 9:47 ` Max Reitz 0 siblings, 0 replies; 28+ messages in thread From: Max Reitz @ 2021-06-21 9:47 UTC (permalink / raw) To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel On 19.06.21 12:36, Vladimir Sementsov-Ogievskiy wrote: > 17.06.2021 18:52, Max Reitz wrote: >> bdrv_co_block_status() does it for us, we do not need to do it here. >> >> The advantage of not capping *pnum is that bdrv_co_block_status() can >> cache larger data regions than requested by its caller. >> >> Signed-off-by: Max Reitz <mreitz@redhat.com> > > > Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> > > >> --- >> block/gluster.c | 7 ++++--- >> 1 file changed, 4 insertions(+), 3 deletions(-) >> >> diff --git a/block/gluster.c b/block/gluster.c >> index e8ee14c8e9..8ef7bb18d5 100644 >> --- a/block/gluster.c >> +++ b/block/gluster.c >> @@ -1461,7 +1461,8 @@ exit: >> * the specified offset) that are known to be in the same >> * allocated/unallocated state. >> * >> - * 'bytes' is the max value 'pnum' should be set to. >> + * 'bytes' is a soft cap for 'pnum'. If the information is free, >> 'pnum' may >> + * well exceed it. >> * >> * (Based on raw_co_block_status() from file-posix.c.) >> */ >> @@ -1500,12 +1501,12 @@ static int coroutine_fn >> qemu_gluster_co_block_status(BlockDriverState *bs, >> } else if (data == offset) { >> /* On a data extent, compute bytes to the end of the extent, >> * possibly including a partial sector at EOF. */ >> - *pnum = MIN(bytes, hole - offset); >> + *pnum = hole - offset; >> ret = BDRV_BLOCK_DATA; > > Interesting, isn't it a bug that we don't ROUND_UP *pnum to > request_alignment here like it is done in file-posix ? Guess I forgot gluster in 9c3db310ff0 O:) I don’t think I’ll be able to reproduce it for gluster, but I suppose just doing the same thing for gluster should be fine... 
Max ^ permalink raw reply [flat|nested] 28+ messages in thread
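The file-posix behaviour Max refers to (commit 9c3db310ff0, apparently missed for gluster) can be sketched as follows; the function name and shape here are illustrative, not qemu's actual code.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN_UP(n, m) (((n) + (m) - 1) / (m) * (m))

/* If the data extent ends with a partial sector at EOF, round *pnum up
 * to the request alignment so callers never see a partial sector. */
static int64_t pnum_round_up_at_eof(int64_t offset, int64_t pnum,
                                    int64_t file_size, int64_t alignment)
{
    if (offset + pnum == file_size) {
        pnum = ALIGN_UP(pnum, alignment);
    }
    return pnum;
}
```

Extents that end before EOF are left untouched, since SEEK_DATA/SEEK_HOLE boundaries in the middle of the file are already sector-aligned on sane filesystems.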
* [PATCH 5/6] block/nbd: Do not force-cap *pnum 2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz ` (3 preceding siblings ...) 2021-06-17 15:52 ` [PATCH 4/6] block/gluster: " Max Reitz @ 2021-06-17 15:52 ` Max Reitz 2021-06-18 20:20 ` Eric Blake 2021-06-19 10:53 ` Vladimir Sementsov-Ogievskiy 2021-06-17 15:52 ` [PATCH 6/6] block/iscsi: " Max Reitz 5 siblings, 2 replies; 28+ messages in thread From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw) To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz <mreitz@redhat.com> --- block/nbd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/nbd.c b/block/nbd.c index 616f9ae6c4..930bd234de 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status( .type = NBD_CMD_BLOCK_STATUS, .from = offset, .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment), - MIN(bytes, s->info.size - offset)), + s->info.size - offset), .flags = NBD_CMD_FLAG_REQ_ONE, }; -- 2.31.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum 2021-06-17 15:52 ` [PATCH 5/6] block/nbd: " Max Reitz @ 2021-06-18 20:20 ` Eric Blake 2021-06-19 11:12 ` Vladimir Sementsov-Ogievskiy 2021-06-19 10:53 ` Vladimir Sementsov-Ogievskiy 1 sibling, 1 reply; 28+ messages in thread From: Eric Blake @ 2021-06-18 20:20 UTC (permalink / raw) To: Max Reitz; +Cc: Kevin Wolf, vsementsov, qemu-devel, qemu-block On Thu, Jun 17, 2021 at 05:52:46PM +0200, Max Reitz wrote: > bdrv_co_block_status() does it for us, we do not need to do it here. > > The advantage of not capping *pnum is that bdrv_co_block_status() can > cache larger data regions than requested by its caller. > > Signed-off-by: Max Reitz <mreitz@redhat.com> > --- > block/nbd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Eric Blake <eblake@redhat.com> > > diff --git a/block/nbd.c b/block/nbd.c > index 616f9ae6c4..930bd234de 100644 > --- a/block/nbd.c > +++ b/block/nbd.c > @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status( > .type = NBD_CMD_BLOCK_STATUS, > .from = offset, > .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment), > - MIN(bytes, s->info.size - offset)), > + s->info.size - offset), > .flags = NBD_CMD_FLAG_REQ_ONE, I'd love to someday get rid of using NBD_CMD_FLAG_REQ_ONE (so the server can reply with more extents in one go), but that's a bigger task and unrelated to your block-layer cache. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum 2021-06-18 20:20 ` Eric Blake @ 2021-06-19 11:12 ` Vladimir Sementsov-Ogievskiy 0 siblings, 0 replies; 28+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 11:12 UTC (permalink / raw) To: Eric Blake, Max Reitz; +Cc: qemu-block, Kevin Wolf, qemu-devel 18.06.2021 23:20, Eric Blake wrote: > On Thu, Jun 17, 2021 at 05:52:46PM +0200, Max Reitz wrote: >> bdrv_co_block_status() does it for us, we do not need to do it here. >> >> The advantage of not capping *pnum is that bdrv_co_block_status() can >> cache larger data regions than requested by its caller. >> >> Signed-off-by: Max Reitz <mreitz@redhat.com> >> --- >> block/nbd.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) > > Reviewed-by: Eric Blake <eblake@redhat.com> > >> >> diff --git a/block/nbd.c b/block/nbd.c >> index 616f9ae6c4..930bd234de 100644 >> --- a/block/nbd.c >> +++ b/block/nbd.c >> @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status( >> .type = NBD_CMD_BLOCK_STATUS, >> .from = offset, >> .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment), >> - MIN(bytes, s->info.size - offset)), >> + s->info.size - offset), >> .flags = NBD_CMD_FLAG_REQ_ONE, > > I'd love to someday get rid of using NBD_CMD_FLAG_REQ_ONE (so the > server can reply with more extents in one go), but that's a bigger > task and unrelated to your block-layer cache. > I think for this to work, the generic block_status should be updated so we can work with several extents in one go. -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum 2021-06-17 15:52 ` [PATCH 5/6] block/nbd: " Max Reitz 2021-06-18 20:20 ` Eric Blake @ 2021-06-19 10:53 ` Vladimir Sementsov-Ogievskiy 2021-06-21 9:50 ` Max Reitz 2021-06-21 18:53 ` Eric Blake 1 sibling, 2 replies; 28+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 10:53 UTC (permalink / raw) To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel 17.06.2021 18:52, Max Reitz wrote: > bdrv_co_block_status() does it for us, we do not need to do it here. > > The advantage of not capping *pnum is that bdrv_co_block_status() can > cache larger data regions than requested by its caller. > > Signed-off-by: Max Reitz <mreitz@redhat.com> > --- > block/nbd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/block/nbd.c b/block/nbd.c > index 616f9ae6c4..930bd234de 100644 > --- a/block/nbd.c > +++ b/block/nbd.c > @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status( > .type = NBD_CMD_BLOCK_STATUS, > .from = offset, > .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment), > - MIN(bytes, s->info.size - offset)), > + s->info.size - offset), > .flags = NBD_CMD_FLAG_REQ_ONE, > }; > > Hmm... I don't think this change is correct. In contrast with file-posix, you don't get the extra information for free, you just make a larger request. This means that the server will have to do more work. (Look at blockstatus_to_extents, it calls bdrv_block_status_above in a loop.) For example, assume that the NBD export is a qcow2 image with all clusters allocated. With this change, the NBD server will loop through the whole qcow2 image, loading all L2 tables, to return one big allocated extent. So only the server can decide whether it can add some extra free information in its reply or not. But unfortunately NBD_CMD_FLAG_REQ_ONE doesn't allow it. -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 28+ messages in thread
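Vladimir's cost argument can be made concrete with a toy model (this is not NBD or qemu code): under NBD_CMD_FLAG_REQ_ONE the server must resolve the status of the whole requested range before replying, so its work grows with the request length, roughly one metadata (L2-table) lookup per cluster covered.

```c
#include <assert.h>
#include <stdint.h>

/* Toy cost model: one metadata lookup per cluster covered by a
 * block-status request of @req_len bytes. */
static int64_t server_lookups(int64_t req_len, int64_t cluster_size)
{
    return (req_len + cluster_size - 1) / cluster_size;
}
```

A client that asks for status up to EOF on a fully allocated 1 GiB image with 64 KiB clusters thus forces thousands of lookups, even if it only needed the status of a few bytes, which is why the uncapped request length is a regression here rather than free information.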
* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-19 10:53     ` Vladimir Sementsov-Ogievskiy
@ 2021-06-21  9:50       ` Max Reitz
  2021-06-21 18:54         ` Eric Blake
  1 sibling, 1 reply; 28+ messages in thread
From: Max Reitz @ 2021-06-21 9:50 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 19.06.21 12:53, Vladimir Sementsov-Ogievskiy wrote:
> 17.06.2021 18:52, Max Reitz wrote:
>> bdrv_co_block_status() does it for us, we do not need to do it here.
>>
>> The advantage of not capping *pnum is that bdrv_co_block_status() can
>> cache larger data regions than requested by its caller.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/nbd.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/block/nbd.c b/block/nbd.c
>> index 616f9ae6c4..930bd234de 100644
>> --- a/block/nbd.c
>> +++ b/block/nbd.c
>> @@ -1702,7 +1702,7 @@ static int coroutine_fn
>> nbd_client_co_block_status(
>>           .type = NBD_CMD_BLOCK_STATUS,
>>           .from = offset,
>>           .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
>> -                   MIN(bytes, s->info.size - offset)),
>> +                   s->info.size - offset),
>>           .flags = NBD_CMD_FLAG_REQ_ONE,
>>       };
>>
>
> Hmm..
>
> I don't think that this change is correct. In contrast with file-posix,
> you don't get extra information for free, you just make a larger
> request. This means that the server will have to do more work.

Oh, oops. Seems I was blind in my rage to replace this MIN() pattern.

You're absolutely right. So this patch should be dropped.

Max

> (look at blockstatus_to_extents, it calls bdrv_block_status_above in a
> loop).
>
> For example, assume that the NBD export is a qcow2 image with all
> clusters allocated. With this change, the NBD server will loop through
> the whole qcow2 image, loading all L2 tables, to return one big
> allocated extent.
>
> So, only the server can decide whether it can add some extra free
> information to the reply or not.
> But unfortunately NBD_CMD_FLAG_REQ_ONE doesn't
> allow it.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-21  9:50       ` Max Reitz
@ 2021-06-21 18:54         ` Eric Blake
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Blake @ 2021-06-21 18:54 UTC (permalink / raw)
To: Max Reitz
Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Mon, Jun 21, 2021 at 11:50:02AM +0200, Max Reitz wrote:
> > I don't think that this change is correct. In contrast with file-posix,
> > you don't get extra information for free, you just make a larger
> > request. This means that the server will have to do more work.
>
> Oh, oops. Seems I was blind in my rage to replace this MIN() pattern.
>
> You're absolutely right. So this patch should be dropped.

I disagree - I think this patch is still correct, as written, _because_
we use the REQ_ONE flag.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-19 10:53     ` Vladimir Sementsov-Ogievskiy
  2021-06-21  9:50       ` Max Reitz
@ 2021-06-21 18:53       ` Eric Blake
  2021-06-22  9:07         ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Blake @ 2021-06-21 18:53 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy
Cc: Kevin Wolf, qemu-devel, qemu-block, Max Reitz

On Sat, Jun 19, 2021 at 01:53:24PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > +++ b/block/nbd.c
> > @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
> >          .type = NBD_CMD_BLOCK_STATUS,
> >          .from = offset,
> >          .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
> > -                   MIN(bytes, s->info.size - offset)),
> > +                   s->info.size - offset),
> >          .flags = NBD_CMD_FLAG_REQ_ONE,
> >      };
> >
>
> Hmm..
>
> I don't think that this change is correct. In contrast with file-posix,
> you don't get extra information for free, you just make a larger
> request. This means that the server will have to do more work.

Not necessarily. The fact that we have passed NBD_CMD_FLAG_REQ_ONE
means that the server is still only allowed to give us one extent in
its answer, and that it may not give us information beyond the length
we requested. You are right that if we lose the REQ_ONE flag, we may
end up with the server doing more work to provide us additional extents
that we will then be ignoring, because we aren't yet set up for
avoiding REQ_ONE. Fixing that is a longer-term goal. But in the
short term, I see no harm in giving a larger length to the server with
REQ_ONE.

> (look at blockstatus_to_extents, it calls bdrv_block_status_above in a
> loop).
>
> For example, assume that the NBD export is a qcow2 image with all
> clusters allocated. With this change, the NBD server will loop through
> the whole qcow2 image, loading all L2 tables, to return one big
> allocated extent.
No, the server is allowed to reply with less length than our request, and that is particularly true if the server does NOT have free access to the full length of our request. In the case of qcow2, since bdrv_block_status is (by current design) clamped at cluster boundaries, requesting a 4G length will NOT increase the amount of the server response any further than the first cluster boundary (that is, the point where the server no longer has free access to status without loading another cluster of L2 entries). > > So, only server can decide, could it add some extra free information to request or not. But unfortunately NBD_CMD_FLAG_REQ_ONE doesn't allow it. What the flag prohibits is the server giving us more information than the length we requested. But this patch is increasing our request length for the case where the server CAN give us more information than we need locally, on the hopes that even though the server can only reply with one extent, we aren't wasting as many network back-and-forth trips when a larger request would have worked. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 28+ messages in thread
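The two request-length computations being compared can be reproduced outside QEMU. This is a sketch using the expressions from the diff: the `MIN` and `QEMU_ALIGN_DOWN` macros are re-declared locally, and `info_size` / `request_alignment` are stand-ins for `s->info.size` and `bs->bl.request_alignment`:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define MIN(a, b)             ((a) < (b) ? (a) : (b))
#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))

/* Length field of the NBD_CMD_BLOCK_STATUS request before the patch:
 * capped by INT_MAX (aligned down) and also by the caller's @bytes. */
static uint64_t req_len_old(uint64_t offset, uint64_t bytes,
                            uint64_t info_size, uint32_t request_alignment)
{
    return MIN(QEMU_ALIGN_DOWN(INT_MAX, request_alignment),
               MIN(bytes, info_size - offset));
}

/* After the patch: only the image size (and INT_MAX) caps the request,
 * so the server may describe a region larger than the caller asked for. */
static uint64_t req_len_new(uint64_t offset,
                            uint64_t info_size, uint32_t request_alignment)
{
    return MIN(QEMU_ALIGN_DOWN(INT_MAX, request_alignment),
               info_size - offset);
}
```

For a 4 KiB status query at offset 0 of a 1 GiB export, the old computation requests 4 KiB while the new one requests the full 1 GiB — exactly the difference the thread is debating.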
* Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
  2021-06-21 18:53       ` Eric Blake
@ 2021-06-22  9:07         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 28+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-22 9:07 UTC (permalink / raw)
To: Eric Blake; +Cc: Max Reitz, qemu-block, Kevin Wolf, qemu-devel

21.06.2021 21:53, Eric Blake wrote:
> On Sat, Jun 19, 2021 at 01:53:24PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>> +++ b/block/nbd.c
>>> @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
>>>           .type = NBD_CMD_BLOCK_STATUS,
>>>           .from = offset,
>>>           .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
>>> -                   MIN(bytes, s->info.size - offset)),
>>> +                   s->info.size - offset),
>>>           .flags = NBD_CMD_FLAG_REQ_ONE,
>>>       };
>>>
>>
>> Hmm..
>>
>> I don't think that this change is correct. In contrast with file-posix,
>> you don't get extra information for free, you just make a larger
>> request. This means that the server will have to do more work.
>
> Not necessarily. The fact that we have passed NBD_CMD_FLAG_REQ_ONE
> means that the server is still only allowed to give us one extent in
> its answer, and that it may not give us information beyond the length
> we requested. You are right that if we lose the REQ_ONE flag, we may
> end up with the server doing more work to provide us additional extents
> that we will then be ignoring, because we aren't yet set up for
> avoiding REQ_ONE. Fixing that is a longer-term goal. But in the
> short term, I see no harm in giving a larger length to the server with
> REQ_ONE.
>
>> (look at blockstatus_to_extents, it calls bdrv_block_status_above in a
>> loop).
>>
>> For example, assume that the NBD export is a qcow2 image with all
>> clusters allocated. With this change, the NBD server will loop through
>> the whole qcow2 image, loading all L2 tables, to return one big
>> allocated extent.
> No, the server is allowed to reply with less length than our request,
> and that is particularly true if the server does NOT have free access
> to the full length of our request. In the case of qcow2, since
> bdrv_block_status is (by current design) clamped at cluster
> boundaries, requesting a 4G length will NOT increase the amount of the
> server response any further than the first cluster boundary (that is,
> the point where the server no longer has free access to status without
> loading another cluster of L2 entries).

No. It doesn't matter where bdrv_block_status_above is clamped. If the
whole disk is allocated, blockstatus_to_extents() in nbd/server.c will
loop through the whole requested range and merge all the information
into one extent.

This doesn't violate NBD_CMD_FLAG_REQ_ONE: we have one extent on output
and don't go beyond the length. It's valid for the server to try to
satisfy as much of the request as possible, and blockstatus_to_extents
works that way currently. Remember that nbd_extent_array_add() can
merge a new extent into the previous one if it has the same type.

>
>> So, only the server can decide whether it can add some extra free
>> information to the reply or not. But unfortunately NBD_CMD_FLAG_REQ_ONE
>> doesn't allow it.
>
> What the flag prohibits is the server giving us more information than
> the length we requested. But this patch is increasing our request
> length for the case where the server CAN give us more information than
> we need locally, on the hopes that even though the server can only
> reply with one extent, we aren't wasting as many network
> back-and-forth trips when a larger request would have worked.

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 28+ messages in thread
* [PATCH 6/6] block/iscsi: Do not force-cap *pnum 2021-06-17 15:52 [PATCH 0/6] block: block-status cache for data regions Max Reitz ` (4 preceding siblings ...) 2021-06-17 15:52 ` [PATCH 5/6] block/nbd: " Max Reitz @ 2021-06-17 15:52 ` Max Reitz 2021-06-18 20:20 ` Eric Blake 2021-06-19 11:13 ` Vladimir Sementsov-Ogievskiy 5 siblings, 2 replies; 28+ messages in thread From: Max Reitz @ 2021-06-17 15:52 UTC (permalink / raw) To: qemu-block; +Cc: Kevin Wolf, qemu-devel, Max Reitz bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz <mreitz@redhat.com> --- block/iscsi.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/block/iscsi.c b/block/iscsi.c index 4d2a416ce7..852384086b 100644 --- a/block/iscsi.c +++ b/block/iscsi.c @@ -781,9 +781,6 @@ retry: iscsi_allocmap_set_allocated(iscsilun, offset, *pnum); } - if (*pnum > bytes) { - *pnum = bytes; - } out_unlock: qemu_mutex_unlock(&iscsilun->mutex); g_free(iTask.err_str); -- 2.31.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
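The commit message's claim — that the generic layer clamps `*pnum` after the driver returns, so a driver-side cap is redundant — can be modeled with a minimal sketch. These are hypothetical stand-in functions for illustration, not QEMU's actual `bdrv_co_block_status()` implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical driver callback: reports how many contiguous bytes share
 * one status, possibly MORE than the caller asked for.  Here we pretend
 * the first 1 MiB of the image is a single data region. */
static int64_t driver_block_status(int64_t offset, int64_t *pnum)
{
    *pnum = 1024 * 1024 - offset;
    return 1;   /* "data" */
}

/* Hypothetical generic layer: it could cache the driver's full extent
 * here, then clamps *pnum so the caller never sees more than @bytes.
 * This is why capping inside the driver would only shrink the cacheable
 * region without changing what the caller observes. */
static int64_t generic_block_status(int64_t offset, int64_t bytes,
                                    int64_t *pnum)
{
    int64_t ret = driver_block_status(offset, pnum);

    if (*pnum > bytes) {
        *pnum = bytes;
    }
    return ret;
}
```

A 4 KiB query still yields `*pnum == 4096` for the caller even though the driver reported the whole 1 MiB region, which is the behavior the series relies on.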
* Re: [PATCH 6/6] block/iscsi: Do not force-cap *pnum 2021-06-17 15:52 ` [PATCH 6/6] block/iscsi: " Max Reitz @ 2021-06-18 20:20 ` Eric Blake 2021-06-19 11:13 ` Vladimir Sementsov-Ogievskiy 1 sibling, 0 replies; 28+ messages in thread From: Eric Blake @ 2021-06-18 20:20 UTC (permalink / raw) To: Max Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block On Thu, Jun 17, 2021 at 05:52:47PM +0200, Max Reitz wrote: > bdrv_co_block_status() does it for us, we do not need to do it here. > > The advantage of not capping *pnum is that bdrv_co_block_status() can > cache larger data regions than requested by its caller. > > Signed-off-by: Max Reitz <mreitz@redhat.com> > --- > block/iscsi.c | 3 --- > 1 file changed, 3 deletions(-) Reviewed-by: Eric Blake <eblake@redhat.com> -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 6/6] block/iscsi: Do not force-cap *pnum 2021-06-17 15:52 ` [PATCH 6/6] block/iscsi: " Max Reitz 2021-06-18 20:20 ` Eric Blake @ 2021-06-19 11:13 ` Vladimir Sementsov-Ogievskiy 1 sibling, 0 replies; 28+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-19 11:13 UTC (permalink / raw) To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel 17.06.2021 18:52, Max Reitz wrote: > bdrv_co_block_status() does it for us, we do not need to do it here. > > The advantage of not capping *pnum is that bdrv_co_block_status() can > cache larger data regions than requested by its caller. > > Signed-off-by: Max Reitz<mreitz@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 28+ messages in thread