* [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image
@ 2015-11-06 10:22 Fam Zheng
  2015-11-06 18:36 ` [Qemu-devel] [Qemu-block] " Max Reitz
  2015-11-09 16:04 ` [Qemu-devel] " Kevin Wolf
  0 siblings, 2 replies; 10+ messages in thread

From: Fam Zheng @ 2015-11-06 10:22 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, pbonzini, Jeff Cody, qemu-block

The "pnum < nb_sectors" condition in deciding whether to actually copy
data is unnecessarily strict, and the qiov initialization is
unnecessary, too, for both the bdrv_aio_write_zeroes and
bdrv_aio_discard branches.

Reorganize the mirror_iteration flow so that we:

1) Find the contiguous zero/discarded sectors with
   bdrv_get_block_status_above() before deciding what to do. We query
   s->buf_size sized blocks at a time.

2) If the sectors in question are zeroed/discarded and aligned to the
   target cluster, issue a zero write or discard accordingly. It's done
   in mirror_do_zero_or_discard, where we don't add a buffer to the
   qiov.

3) Otherwise, do the same loop as before in mirror_do_read.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block/mirror.c | 161 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 127 insertions(+), 34 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index b1252a1..ac796b4 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -157,28 +157,21 @@ static void mirror_read_complete(void *opaque, int ret)
                     mirror_write_complete, op);
 }
 
-static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
+static uint64_t mirror_do_read(MirrorBlockJob *s)
 {
     BlockDriverState *source = s->common.bs;
-    int nb_sectors, sectors_per_chunk, nb_chunks;
-    int64_t end, sector_num, next_chunk, next_sector, hbitmap_next_sector;
+    int sectors_per_chunk, nb_sectors, nb_chunks;
+    int64_t end, next_chunk, next_sector, hbitmap_next_sector, sector_num;
     uint64_t delay_ns = 0;
     MirrorOp *op;
-    int pnum;
-    int64_t ret;
 
-    s->sector_num = hbitmap_iter_next(&s->hbi);
-    if (s->sector_num < 0) {
-        bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
-        s->sector_num = hbitmap_iter_next(&s->hbi);
-        trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
-        assert(s->sector_num >= 0);
-    }
+    sector_num = s->sector_num;
+    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
+    end = s->bdev_length / BDRV_SECTOR_SIZE;
 
+    next_sector = sector_num;
+    next_chunk = sector_num / sectors_per_chunk;
     hbitmap_next_sector = s->sector_num;
-    sector_num = s->sector_num;
-    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
-    end = s->bdev_length / BDRV_SECTOR_SIZE;
 
     /* Extend the QEMUIOVector to include all adjacent blocks that will
      * be copied in this operation.
@@ -198,14 +191,6 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
         next_sector = sector_num;
         next_chunk = sector_num / sectors_per_chunk;
 
-        /* Wait for I/O to this cluster (from a previous iteration) to be done. */
-        while (test_bit(next_chunk, s->in_flight_bitmap)) {
-            trace_mirror_yield_in_flight(s, sector_num, s->in_flight);
-            s->waiting_for_io = true;
-            qemu_coroutine_yield();
-            s->waiting_for_io = false;
-        }
-
         do {
             int added_sectors, added_chunks;
 
@@ -301,24 +286,132 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     s->sectors_in_flight += nb_sectors;
     trace_mirror_one_iteration(s, sector_num, nb_sectors);
 
-    ret = bdrv_get_block_status_above(source, NULL, sector_num,
-                                      nb_sectors, &pnum);
-    if (ret < 0 || pnum < nb_sectors ||
-        (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
-        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
-                       mirror_read_complete, op);
-    } else if (ret & BDRV_BLOCK_ZERO) {
+    bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
+                   mirror_read_complete, op);
+    return delay_ns;
+}
+
+static uint64_t mirror_do_zero_or_discard(MirrorBlockJob *s,
+                                          int64_t sector_num,
+                                          int nb_sectors,
+                                          bool is_discard)
+{
+    int sectors_per_chunk, nb_chunks;
+    int64_t next_chunk, next_sector, hbitmap_next_sector;
+    uint64_t delay_ns = 0;
+    MirrorOp *op;
+
+    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
+    assert(nb_sectors >= sectors_per_chunk);
+    next_chunk = sector_num / sectors_per_chunk;
+    nb_chunks = DIV_ROUND_UP(nb_sectors, sectors_per_chunk);
+    bitmap_set(s->in_flight_bitmap, next_chunk, nb_chunks);
+    delay_ns = ratelimit_calculate_delay(&s->limit, nb_sectors);
+
+    /* Allocate a MirrorOp that is used as an AIO callback. The qiov is zeroed
+     * out so the freeing in iteration is nop. */
+    op = g_new0(MirrorOp, 1);
+    op->s = s;
+    op->sector_num = sector_num;
+    op->nb_sectors = nb_sectors;
+
+    /* Advance the HBitmapIter in parallel, so that we do not examine
+     * the same sector twice.
+     */
+    hbitmap_next_sector = sector_num;
+    next_sector = sector_num + nb_sectors;
+    while (next_sector > hbitmap_next_sector) {
+        hbitmap_next_sector = hbitmap_iter_next(&s->hbi);
+        if (hbitmap_next_sector < 0) {
+            break;
+        }
+    }
+
+    bdrv_reset_dirty_bitmap(s->dirty_bitmap, sector_num, nb_sectors);
+    s->in_flight++;
+    s->sectors_in_flight += nb_sectors;
+    if (is_discard) {
+        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
+                         mirror_write_complete, op);
+    } else {
         bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
                               s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
                               mirror_write_complete, op);
-    } else {
-        assert(!(ret & BDRV_BLOCK_DATA));
-        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
-                         mirror_write_complete, op);
     }
+
     return delay_ns;
 }
 
+static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
+{
+    BlockDriverState *source = s->common.bs;
+    int sectors_per_chunk;
+    int64_t sector_num, next_chunk;
+    int ret;
+    int contiguous_sectors = s->buf_size >> BDRV_SECTOR_BITS;
+    enum MirrorMethod {
+        MIRROR_METHOD_COPY,
+        MIRROR_METHOD_ZERO,
+        MIRROR_METHOD_DISCARD
+    } mirror_method;
+
+    s->sector_num = hbitmap_iter_next(&s->hbi);
+    if (s->sector_num < 0) {
+        bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
+        s->sector_num = hbitmap_iter_next(&s->hbi);
+        trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
+        assert(s->sector_num >= 0);
+    }
+
+    sector_num = s->sector_num;
+    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
+    next_chunk = sector_num / sectors_per_chunk;
+
+    /* Wait for I/O to this cluster (from a previous iteration) to be done. */
+    while (test_bit(next_chunk, s->in_flight_bitmap)) {
+        trace_mirror_yield_in_flight(s, sector_num, s->in_flight);
+        s->waiting_for_io = true;
+        qemu_coroutine_yield();
+        s->waiting_for_io = false;
+    }
+
+    mirror_method = MIRROR_METHOD_COPY;
+    ret = bdrv_get_block_status_above(source, NULL, sector_num,
+                                      contiguous_sectors,
+                                      &contiguous_sectors);
+
+    contiguous_sectors -= contiguous_sectors % sectors_per_chunk;
+    if (ret < 0 || contiguous_sectors < sectors_per_chunk) {
+        contiguous_sectors = sectors_per_chunk;
+    } else if (!(ret & BDRV_BLOCK_DATA)) {
+        int64_t target_sector_num;
+        int target_nb_sectors;
+        bdrv_round_to_clusters(s->target, sector_num, contiguous_sectors,
+                               &target_sector_num, &target_nb_sectors);
+        if (target_sector_num == sector_num &&
+            target_nb_sectors == contiguous_sectors) {
+            mirror_method = ret & BDRV_BLOCK_ZERO ?
+                                MIRROR_METHOD_ZERO :
+                                MIRROR_METHOD_DISCARD;
+        }
+    }
+
+    switch (mirror_method) {
+    case MIRROR_METHOD_COPY:
+        return mirror_do_read(s);
+    case MIRROR_METHOD_ZERO:
+        return mirror_do_zero_or_discard(s, sector_num,
+                                         contiguous_sectors,
+                                         false);
+    case MIRROR_METHOD_DISCARD:
+        return mirror_do_zero_or_discard(s, sector_num,
+                                         contiguous_sectors,
+                                         false);
+    default:
+        abort();
+    }
+}
+
 static void mirror_free_init(MirrorBlockJob *s)
 {
     int granularity = s->granularity;
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [PATCH] mirror: Improve zero-write and discard with fragmented image
  2015-11-06 10:22 [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image Fam Zheng
@ 2015-11-06 18:36 ` Max Reitz
  2015-11-09  2:12   ` Fam Zheng
  2015-11-09 16:04 ` [Qemu-devel] " Kevin Wolf
  1 sibling, 1 reply; 10+ messages in thread

From: Max Reitz @ 2015-11-06 18:36 UTC (permalink / raw)
To: Fam Zheng, qemu-devel; +Cc: Kevin Wolf, pbonzini, qemu-block

[-- Attachment #1: Type: text/plain, Size: 13786 bytes --]

On 06.11.2015 11:22, Fam Zheng wrote:
> The "pnum < nb_sectors" condition in deciding whether to actually copy
> data is unnecessarily strict, and the qiov initialization is
> unnecessarily too, for both bdrv_aio_write_zeroes and bdrv_aio_discard
> branches.
> 
> Reorganize mirror_iteration flow so that we:
> 
> 1) Find the contiguous zero/discarded sectors with
>    bdrv_get_block_status_above() before deciding what to do. We query
>    s->buf_size sized blocks at a time.
> 
> 2) If the sectors in question are zeroed/discarded and aligned to
>    target cluster, issue zero write or discard accordingly. It's done
>    in mirror_do_zero_or_discard, where we don't add buffer to qiov.
> 
> 3) Otherwise, do the same loop as before in mirror_do_read.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/mirror.c | 161 +++++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 127 insertions(+), 34 deletions(-)

Looks good overall, some comments below.
Max

> diff --git a/block/mirror.c b/block/mirror.c
> index b1252a1..ac796b4 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -157,28 +157,21 @@ static void mirror_read_complete(void *opaque, int ret)
>                      mirror_write_complete, op);
>  }
>  
> -static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
> +static uint64_t mirror_do_read(MirrorBlockJob *s)
>  {
>      BlockDriverState *source = s->common.bs;
> -    int nb_sectors, sectors_per_chunk, nb_chunks;
> -    int64_t end, sector_num, next_chunk, next_sector, hbitmap_next_sector;
> +    int sectors_per_chunk, nb_sectors, nb_chunks;
> +    int64_t end, next_chunk, next_sector, hbitmap_next_sector, sector_num;
>      uint64_t delay_ns = 0;
>      MirrorOp *op;
> -    int pnum;
> -    int64_t ret;
>  
> -    s->sector_num = hbitmap_iter_next(&s->hbi);
> -    if (s->sector_num < 0) {
> -        bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
> -        s->sector_num = hbitmap_iter_next(&s->hbi);
> -        trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
> -        assert(s->sector_num >= 0);
> -    }
> +    sector_num = s->sector_num;
> +    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> +    end = s->bdev_length / BDRV_SECTOR_SIZE;
> 
> +    next_sector = sector_num;
> +    next_chunk = sector_num / sectors_per_chunk;

@next_sector and @next_chunk set here...

>      hbitmap_next_sector = s->sector_num;
> -    sector_num = s->sector_num;
> -    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> -    end = s->bdev_length / BDRV_SECTOR_SIZE;
>  
>      /* Extend the QEMUIOVector to include all adjacent blocks that will
>       * be copied in this operation.
> @@ -198,14 +191,6 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>          next_sector = sector_num;
>          next_chunk = sector_num / sectors_per_chunk;

...and here already. So the above seems superfluous, considering that
they are not read in between.

(If you keep hbitmap_next_sector = s->sector_num; above the sector_num =
... block, that may reduce conflicts further)

>  
> -        /* Wait for I/O to this cluster (from a previous iteration) to be done. */
> -        while (test_bit(next_chunk, s->in_flight_bitmap)) {
> -            trace_mirror_yield_in_flight(s, sector_num, s->in_flight);
> -            s->waiting_for_io = true;
> -            qemu_coroutine_yield();
> -            s->waiting_for_io = false;
> -        }
> -
>          do {
>              int added_sectors, added_chunks;
>  
> @@ -301,24 +286,132 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
>      s->sectors_in_flight += nb_sectors;
>      trace_mirror_one_iteration(s, sector_num, nb_sectors);
>  
> -    ret = bdrv_get_block_status_above(source, NULL, sector_num,
> -                                      nb_sectors, &pnum);
> -    if (ret < 0 || pnum < nb_sectors ||
> -        (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) {
> -        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> -                       mirror_read_complete, op);
> -    } else if (ret & BDRV_BLOCK_ZERO) {
> +    bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
> +                   mirror_read_complete, op);
> +    return delay_ns;
> +}
> +
> +static uint64_t mirror_do_zero_or_discard(MirrorBlockJob *s,
> +                                          int64_t sector_num,
> +                                          int nb_sectors,
> +                                          bool is_discard)
> +{
> +    int sectors_per_chunk, nb_chunks;
> +    int64_t next_chunk, next_sector, hbitmap_next_sector;
> +    uint64_t delay_ns = 0;
> +    MirrorOp *op;
> +
> +    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> +    assert(nb_sectors >= sectors_per_chunk);
> +    next_chunk = sector_num / sectors_per_chunk;
> +    nb_chunks = DIV_ROUND_UP(nb_sectors, sectors_per_chunk);
> +    bitmap_set(s->in_flight_bitmap, next_chunk, nb_chunks);
> +    delay_ns = ratelimit_calculate_delay(&s->limit, nb_sectors);
> +
> +    /* Allocate a MirrorOp that is used as an AIO callback. The qiov is zeroed
> +     * out so the freeing in iteration is nop. */
> +    op = g_new0(MirrorOp, 1);
> +    op->s = s;
> +    op->sector_num = sector_num;
> +    op->nb_sectors = nb_sectors;
> +
> +    /* Advance the HBitmapIter in parallel, so that we do not examine
> +     * the same sector twice.
> +     */
> +    hbitmap_next_sector = sector_num;
> +    next_sector = sector_num + nb_sectors;
> +    while (next_sector > hbitmap_next_sector) {
> +        hbitmap_next_sector = hbitmap_iter_next(&s->hbi);
> +        if (hbitmap_next_sector < 0) {
> +            break;
> +        }
> +    }
> +
> +    bdrv_reset_dirty_bitmap(s->dirty_bitmap, sector_num, nb_sectors);
> +    s->in_flight++;
> +    s->sectors_in_flight += nb_sectors;
> +    if (is_discard) {
> +        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> +                         mirror_write_complete, op);
> +    } else {
>          bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors,
>                                s->unmap ? BDRV_REQ_MAY_UNMAP : 0,
>                                mirror_write_complete, op);
> -    } else {
> -        assert(!(ret & BDRV_BLOCK_DATA));
> -        bdrv_aio_discard(s->target, sector_num, op->nb_sectors,
> -                         mirror_write_complete, op);
>      }
> +
>      return delay_ns;
>  }
>  
> +static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
> +{
> +    BlockDriverState *source = s->common.bs;
> +    int sectors_per_chunk;
> +    int64_t sector_num, next_chunk;
> +    int ret;
> +    int contiguous_sectors = s->buf_size >> BDRV_SECTOR_BITS;
> +    enum MirrorMethod {
> +        MIRROR_METHOD_COPY,
> +        MIRROR_METHOD_ZERO,
> +        MIRROR_METHOD_DISCARD
> +    } mirror_method;
> +
> +    s->sector_num = hbitmap_iter_next(&s->hbi);
> +    if (s->sector_num < 0) {
> +        bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
> +        s->sector_num = hbitmap_iter_next(&s->hbi);
> +        trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
> +        assert(s->sector_num >= 0);
> +    }
> +
> +    sector_num = s->sector_num;
> +    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> +    next_chunk = sector_num / sectors_per_chunk;
> +
> +    /* Wait for I/O to this cluster (from a previous iteration) to be done. */
> +    while (test_bit(next_chunk, s->in_flight_bitmap)) {
> +        trace_mirror_yield_in_flight(s, sector_num, s->in_flight);
> +        s->waiting_for_io = true;
> +        qemu_coroutine_yield();
> +        s->waiting_for_io = false;
> +    }
> +
> +    mirror_method = MIRROR_METHOD_COPY;
> +    ret = bdrv_get_block_status_above(source, NULL, sector_num,
> +                                      contiguous_sectors,
> +                                      &contiguous_sectors);

We did have NULL as the base before, but is that correct? Shouldn't we
use backing_bs(source) or s->base if @sync is none or top?

Examining this is fun, the short answer is probably "NULL is wrong, but
it's the best we can do here".

(1) with NULL here:

$ ./qemu-img create -f qcow2 -b base.qcow2 top.qcow2
$ ./qemu-io -c 'write -P 42 0 64k' base.qcow2
$ ./qemu-img create -f qcow2 -o compat=0.10 -b base.qcow2 top.qcow2
  # compat=0.10 is important because otherwise discard will create a
  # zero cluster instead of truly discarding
$ (echo "{'execute':'qmp_capabilities'}
         {'execute':'drive-mirror','arguments':
          {'device':'drive0','target':'out.qcow2',
           'format':'qcow2','sync':'top'}}
         {'execute':'human-monitor-command','arguments':
          {'command-line':'qemu-io drive0 \"discard 0 64k\"'}}
         {'execute':'block-job-complete','arguments':
          {'device':'drive0'}}
         {'execute':'quit'}") | \
  x86_64-softmmu/qemu-system-x86_64 -qmp stdio \
      -drive if=none,file=top.qcow2,id=drive0,discard=unmap
{"QMP": {"version": {"qemu": {"micro": 50, "minor": 4, "major": 2}, "package": ""}, "capabilities": []}}
{"return": {}}
Formatting 'out.qcow2', fmt=qcow2 size=67108864 backing_file=base.qcow2 backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
{"return": {}}
discard 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0000 sec (3.815 GiB/sec and 62500.0000 ops/sec)
{"return": ""}
{"timestamp": {"seconds": 1446830775, "microseconds": 198114}, "event": "BLOCK_JOB_READY", "data": {"device": "drive0", "len": 0, "offset": 0, "speed": 0, "type": "mirror"}}
{"return": {}}
{"return": {}}
{"timestamp": {"seconds": 1446830775, "microseconds": 198371}, "event": "SHUTDOWN"}
{"timestamp": {"seconds": 1446830775, "microseconds": 205506}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive0", "len": 65536, "offset": 65536, "speed": 0, "type": "mirror"}}

(Note that the discard must have been executed before BLOCK_JOB_READY,
it's a bit racy)

Now we check:

$ ./qemu-io -c 'read -P 0 0 64k' out.qcow2
Pattern verification failed at offset 0, 65536 bytes
read 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0000 sec (856.164 MiB/sec and 13698.6301 ops/sec)
$ ./qemu-io -c 'read -P 42 0 64k' out.qcow2
read 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0000 sec (868.056 MiB/sec and 13888.8889 ops/sec)

Well, that seems reasonable, considering that out.qcow2 has base.qcow2
as the backing file, but:

$ ./qemu-img map out.qcow2
Offset          Length          Mapped to       File
0               0x10000         0x50000         out.qcow2

Hm... Well, at least out.qcow2 and top.qcow2 return the same values when
read, but we don't need to copy that from the backing file. Just leaving
the cluster unallocated would have been enough.

(2) So with s->base instead of NULL we get:

$ ./qemu-io -c 'read -P 0 0 64k' out.qcow2
read 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0000 sec (856.164 MiB/sec and 13698.6301 ops/sec)
$ ./qemu-io -c 'read -P 42 0 64k' out.qcow2
Pattern verification failed at offset 0, 65536 bytes
read 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0000 sec (892.857 MiB/sec and 14285.7143 ops/sec)

And:

$ ./qemu-img map out.qcow2
Offset          Length          Mapped to       File

So the map output looks right, but now we don't read the backing file
anymore. That's because for qcow2 v3, qemu writes zero clusters when
asked to discard (that's why I had to use compat=0.10 for top.qcow2).
Using a compat=0.10 image with @mode = existing fixes that. But well...

So the issue is: We don't have a reliable function to punch a hole into
an overlay file. discard could do that (and does so for qcow2 v2), but
we considered returning zeros the better option if possible because
people might find that more reasonable.

So while it would be best to copy holes in the overlay file created by
discards to the mirrored file, there is no way for us to do that.
Therefore, the best we can do is copy from the backing file.

So, OK, NULL it is.

> +
> +    contiguous_sectors -= contiguous_sectors % sectors_per_chunk;
> +    if (ret < 0 || contiguous_sectors < sectors_per_chunk) {
> +        contiguous_sectors = sectors_per_chunk;
> +    } else if (!(ret & BDRV_BLOCK_DATA)) {
> +        int64_t target_sector_num;
> +        int target_nb_sectors;
> +        bdrv_round_to_clusters(s->target, sector_num, contiguous_sectors,
> +                               &target_sector_num, &target_nb_sectors);
> +        if (target_sector_num == sector_num &&
> +            target_nb_sectors == contiguous_sectors) {
> +            mirror_method = ret & BDRV_BLOCK_ZERO ?
> +                                MIRROR_METHOD_ZERO :
> +                                MIRROR_METHOD_DISCARD;
> +        }
> +    }
> +
> +    switch (mirror_method) {
> +    case MIRROR_METHOD_COPY:
> +        return mirror_do_read(s);
> +    case MIRROR_METHOD_ZERO:
> +        return mirror_do_zero_or_discard(s, sector_num,
> +                                         contiguous_sectors,
> +                                         false);
> +    case MIRROR_METHOD_DISCARD:
> +        return mirror_do_zero_or_discard(s, sector_num,
> +                                         contiguous_sectors,
> +                                         false);

s/false/true/?

> +    default:
> +        abort();
> +    }
> +}
> +
>  static void mirror_free_init(MirrorBlockJob *s)
>  {
>      int granularity = s->granularity;
> 

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [PATCH] mirror: Improve zero-write and discard with fragmented image
  2015-11-06 18:36 ` [Qemu-devel] [Qemu-block] " Max Reitz
@ 2015-11-09  2:12   ` Fam Zheng
  0 siblings, 0 replies; 10+ messages in thread

From: Fam Zheng @ 2015-11-09 2:12 UTC (permalink / raw)
To: Max Reitz; +Cc: Kevin Wolf, pbonzini, qemu-devel, qemu-block

On Fri, 11/06 19:36, Max Reitz wrote:
> > +    next_sector = sector_num;
> > +    next_chunk = sector_num / sectors_per_chunk;
> 
> @next_sector and @next_chunk set here...
> 
> >      hbitmap_next_sector = s->sector_num;
> > -    sector_num = s->sector_num;
> > -    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> > -    end = s->bdev_length / BDRV_SECTOR_SIZE;
> > 
> >      /* Extend the QEMUIOVector to include all adjacent blocks that will
> >       * be copied in this operation.
> > @@ -198,14 +191,6 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
> >          next_sector = sector_num;
> >          next_chunk = sector_num / sectors_per_chunk;
> 
> ...and here already. So the above seems superfluous, considering that
> they are not read in between.
> 
> (If you keep hbitmap_next_sector = s->sector_num; above the sector_num =
> ... block, that may reduce conflicts further)

Indeed, thanks for noticing this.

> > +    case MIRROR_METHOD_DISCARD:
> > +        return mirror_do_zero_or_discard(s, sector_num,
> > +                                         contiguous_sectors,
> > +                                         false);
> 
> s/false/true/?

Yes, thanks.

Fam

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image
  2015-11-06 10:22 [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image Fam Zheng
  2015-11-06 18:36 ` [Qemu-devel] [Qemu-block] " Max Reitz
@ 2015-11-09 16:04 ` Kevin Wolf
  2015-11-09 16:18   ` Paolo Bonzini
  1 sibling, 1 reply; 10+ messages in thread

From: Kevin Wolf @ 2015-11-09 16:04 UTC (permalink / raw)
To: Fam Zheng; +Cc: pbonzini, Jeff Cody, qemu-devel, qemu-block

Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben:
> The "pnum < nb_sectors" condition in deciding whether to actually copy
> data is unnecessarily strict, and the qiov initialization is
> unnecessarily too, for both bdrv_aio_write_zeroes and bdrv_aio_discard
> branches.
> 
> Reorganize mirror_iteration flow so that we:
> 
> 1) Find the contiguous zero/discarded sectors with
>    bdrv_get_block_status_above() before deciding what to do. We query
>    s->buf_size sized blocks at a time.
> 
> 2) If the sectors in question are zeroed/discarded and aligned to
>    target cluster, issue zero write or discard accordingly. It's done
>    in mirror_do_zero_or_discard, where we don't add buffer to qiov.
> 
> 3) Otherwise, do the same loop as before in mirror_do_read.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>

I'm not sure where in the patch to comment on this, so I'll just do it
here right in the beginning.

I'm concerned that we need to be more careful about races in this patch,
in particular regarding the bitmaps. I think the conditions for the two
bitmaps are:

* Dirty bitmap: We must clear the bit after finding the next piece of
  data to be mirrored, but before we yield after getting information
  that we use for the decision which kind of operation we need.

  In other words, we need to clear the dirty bitmap bit before calling
  bdrv_get_block_status_above(), because that's both the function that
  retrieves information about the next chunk and also a function that
  can yield.

  If after this point the data is written to, we need to mirror it
  again.

* In-flight bitmap: We need to make sure that we never mirror the same
  data twice at the same time as older data could overwrite newer data
  then.

  Strictly speaking, it looks to me as if this meant we can delay
  setting the bit until before we issue an AIO operation. It might be
  more obviously correct to set it at the same time as the dirty bitmap
  is cleared.

Otherwise, this looks good to me.

Kevin

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image
  2015-11-09 16:04 ` [Qemu-devel] " Kevin Wolf
@ 2015-11-09 16:18   ` Paolo Bonzini
  2015-11-09 16:29     ` Kevin Wolf
  0 siblings, 1 reply; 10+ messages in thread

From: Paolo Bonzini @ 2015-11-09 16:18 UTC (permalink / raw)
To: Kevin Wolf, Fam Zheng; +Cc: Jeff Cody, qemu-devel, qemu-block

On 09/11/2015 17:04, Kevin Wolf wrote:
> Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben:
>> The "pnum < nb_sectors" condition in deciding whether to actually copy
>> data is unnecessarily strict, and the qiov initialization is
>> unnecessarily too, for both bdrv_aio_write_zeroes and bdrv_aio_discard
>> branches.
>>
>> Reorganize mirror_iteration flow so that we:
>>
>> 1) Find the contiguous zero/discarded sectors with
>>    bdrv_get_block_status_above() before deciding what to do. We query
>>    s->buf_size sized blocks at a time.
>>
>> 2) If the sectors in question are zeroed/discarded and aligned to
>>    target cluster, issue zero write or discard accordingly. It's done
>>    in mirror_do_zero_or_discard, where we don't add buffer to qiov.
>>
>> 3) Otherwise, do the same loop as before in mirror_do_read.
>>
>> Signed-off-by: Fam Zheng <famz@redhat.com>
> 
> I'm not sure where in the patch to comment on this, so I'll just do it
> here right in the beginning.
> 
> I'm concerned that we need to be more careful about races in this patch,
> in particular regarding the bitmaps. I think the conditions for the two
> bitmaps are:
> 
> * Dirty bitmap: We must clear the bit after finding the next piece of
>   data to be mirrored, but before we yield after getting information
>   that we use for the decision which kind of operation we need.
> 
>   In other words, we need to clear the dirty bitmap bit before calling
>   bdrv_get_block_status_above(), because that's both the function that
>   retrieves information about the next chunk and also a function that
>   can yield.
> 
>   If after this point the data is written to, we need to mirror it
>   again.

With Fam's patch, that's not trivial for two reasons:

1) bdrv_get_block_status_above() can return a smaller amount than what
is asked.

2) the "read and write" case can handle s->granularity sectors per
iteration (many of them can be coalesced, but still that's how the
iteration works).

The simplest solution is to perform the query with s->granularity size
rather than s->buf_size.

Paolo

> * In-flight bitmap: We need to make sure that we never mirror the same
>   data twice at the same time as older data could overwrite newer data
>   then.
> 
>   Strictly speaking, it looks to me as if this meant we can delay
>   setting the bit until before we issue an AIO operation. It might be
>   more obviously correct to set it at the same time as the dirty bitmap
>   is cleared.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image
  2015-11-09 16:18 ` Paolo Bonzini
@ 2015-11-09 16:29   ` Kevin Wolf
  2015-11-10  6:14     ` Fam Zheng
  0 siblings, 1 reply; 10+ messages in thread

From: Kevin Wolf @ 2015-11-09 16:29 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Jeff Cody, Fam Zheng, qemu-devel, qemu-block

Am 09.11.2015 um 17:18 hat Paolo Bonzini geschrieben:
> On 09/11/2015 17:04, Kevin Wolf wrote:
> > Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben:
> >> The "pnum < nb_sectors" condition in deciding whether to actually copy
> >> data is unnecessarily strict, and the qiov initialization is
> >> unnecessarily too, for both bdrv_aio_write_zeroes and bdrv_aio_discard
> >> branches.
> >>
> >> Reorganize mirror_iteration flow so that we:
> >>
> >> 1) Find the contiguous zero/discarded sectors with
> >>    bdrv_get_block_status_above() before deciding what to do. We query
> >>    s->buf_size sized blocks at a time.
> >>
> >> 2) If the sectors in question are zeroed/discarded and aligned to
> >>    target cluster, issue zero write or discard accordingly. It's done
> >>    in mirror_do_zero_or_discard, where we don't add buffer to qiov.
> >>
> >> 3) Otherwise, do the same loop as before in mirror_do_read.
> >>
> >> Signed-off-by: Fam Zheng <famz@redhat.com>
> > 
> > I'm not sure where in the patch to comment on this, so I'll just do it
> > here right in the beginning.
> > 
> > I'm concerned that we need to be more careful about races in this patch,
> > in particular regarding the bitmaps. I think the conditions for the two
> > bitmaps are:
> > 
> > * Dirty bitmap: We must clear the bit after finding the next piece of
> >   data to be mirrored, but before we yield after getting information
> >   that we use for the decision which kind of operation we need.
> > 
> >   In other words, we need to clear the dirty bitmap bit before calling
> >   bdrv_get_block_status_above(), because that's both the function that
> >   retrieves information about the next chunk and also a function that
> >   can yield.
> > 
> >   If after this point the data is written to, we need to mirror it
> >   again.
> 
> With Fam's patch, that's not trivial for two reasons:
> 
> 1) bdrv_get_block_status_above() can return a smaller amount than what
> is asked.
> 
> 2) the "read and write" case can handle s->granularity sectors per
> iteration (many of them can be coalesced, but still that's how the
> iteration works).
> 
> The simplest solution is to perform the query with s->granularity size
> rather than s->buf_size.

Then we end up with many small operations, that's not what we want.

Why can't we mark up to s->buf_size dirty clusters as clean first, then
query the status, and mark all of those that we can't handle dirty
again?

Kevin

> > * In-flight bitmap: We need to make sure that we never mirror the same
> >   data twice at the same time as older data could overwrite newer data
> >   then.
> > 
> >   Strictly speaking, it looks to me as if this meant we can delay
> >   setting the bit until before we issue an AIO operation. It might be
> >   more obviously correct to set it at the same time as the dirty bitmap
> >   is cleared.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image
  2015-11-09 16:29 ` Kevin Wolf
@ 2015-11-10  6:14 ` Fam Zheng
  2015-11-10  9:01 ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread

From: Fam Zheng @ 2015-11-10 6:14 UTC
To: Kevin Wolf; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, qemu-block

On Mon, 11/09 17:29, Kevin Wolf wrote:
> Am 09.11.2015 um 17:18 hat Paolo Bonzini geschrieben:
> > On 09/11/2015 17:04, Kevin Wolf wrote:
> > > Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben:
> > > > The "pnum < nb_sectors" condition in deciding whether to actually
> > > > copy data is unnecessarily strict, and the qiov initialization is
> > > > unnecessary too, for both bdrv_aio_write_zeroes and
> > > > bdrv_aio_discard branches.
> > > >
> > > > Reorganize mirror_iteration flow so that we:
> > > >
> > > > 1) Find the contiguous zero/discarded sectors with
> > > >    bdrv_get_block_status_above() before deciding what to do. We
> > > >    query s->buf_size sized blocks at a time.
> > > >
> > > > 2) If the sectors in question are zeroed/discarded and aligned to
> > > >    target cluster, issue zero write or discard accordingly. It's
> > > >    done in mirror_do_zero_or_discard, where we don't add buffer
> > > >    to qiov.
> > > >
> > > > 3) Otherwise, do the same loop as before in mirror_do_read.
> > > >
> > > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > >
> > > I'm not sure where in the patch to comment on this, so I'll just do
> > > it here right in the beginning.
> > >
> > > I'm concerned that we need to be more careful about races in this
> > > patch, in particular regarding the bitmaps. I think the conditions
> > > for the two bitmaps are:
> > >
> > > * Dirty bitmap: We must clear the bit after finding the next piece
> > >   of data to be mirrored, but before we yield after getting
> > >   information that we use for the decision which kind of operation
> > >   we need.
> > >
> > >   In other words, we need to clear the dirty bitmap bit before
> > >   calling bdrv_get_block_status_above(), because that's both the
> > >   function that retrieves information about the next chunk and also
> > >   a function that can yield.
> > >
> > >   If after this point the data is written to, we need to mirror it
> > >   again.
> >
> > With Fam's patch, that's not trivial for two reasons:
> >
> > 1) bdrv_get_block_status_above() can return a smaller amount than
> >    what is asked.
> >
> > 2) the "read and write" case can handle s->granularity sectors per
> >    iteration (many of them can be coalesced, but still that's how the
> >    iteration works).
> >
> > The simplest solution is to perform the query with s->granularity
> > size rather than s->buf_size.
>
> Then we end up with many small operations, that's not what we want.
>
> Why can't we mark up to s->buf_size dirty clusters as clean first, then
> query the status, and mark all of those that we can't handle dirty
> again?

Then we may end up marking more clusters as dirty than we should.

Because all bdrv_set_dirty() and bdrv_set_dirty_bitmap() callers are
coroutines, we can introduce a CoMutex to let the bitmap reader block
bdrv_set_dirty() and bdrv_set_dirty_bitmap().

Fam

> Kevin
>
> > > * In-flight bitmap: We need to make sure that we never mirror the
> > >   same data twice at the same time, as older data could overwrite
> > >   newer data then.
> > >
> > >   Strictly speaking, it looks to me as if this meant we can delay
> > >   setting the bit until before we issue an AIO operation. It might
> > >   be more obviously correct to set it at the same time as the dirty
> > >   bitmap is cleared.
* Re: [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image
  2015-11-10  6:14 ` Fam Zheng
@ 2015-11-10  9:01 ` Paolo Bonzini
  2015-11-10 10:12 ` Kevin Wolf
  0 siblings, 1 reply; 10+ messages in thread

From: Paolo Bonzini @ 2015-11-10 9:01 UTC
To: Fam Zheng, Kevin Wolf; +Cc: qemu-devel, qemu-block

On 10/11/2015 07:14, Fam Zheng wrote:
> On Mon, 11/09 17:29, Kevin Wolf wrote:
> > Am 09.11.2015 um 17:18 hat Paolo Bonzini geschrieben:
> > > On 09/11/2015 17:04, Kevin Wolf wrote:
> > > > Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben:
> > > > > The "pnum < nb_sectors" condition in deciding whether to
> > > > > actually copy data is unnecessarily strict, and the qiov
> > > > > initialization is unnecessary too, for both
> > > > > bdrv_aio_write_zeroes and bdrv_aio_discard branches.
> > > > >
> > > > > Reorganize mirror_iteration flow so that we:
> > > > >
> > > > > 1) Find the contiguous zero/discarded sectors with
> > > > >    bdrv_get_block_status_above() before deciding what to do. We
> > > > >    query s->buf_size sized blocks at a time.
> > > > >
> > > > > 2) If the sectors in question are zeroed/discarded and aligned
> > > > >    to target cluster, issue zero write or discard accordingly.
> > > > >    It's done in mirror_do_zero_or_discard, where we don't add
> > > > >    buffer to qiov.
> > > > >
> > > > > 3) Otherwise, do the same loop as before in mirror_do_read.
> > > > >
> > > > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > >
> > > > I'm not sure where in the patch to comment on this, so I'll just
> > > > do it here right in the beginning.
> > > >
> > > > I'm concerned that we need to be more careful about races in this
> > > > patch, in particular regarding the bitmaps. I think the
> > > > conditions for the two bitmaps are:
> > > >
> > > > * Dirty bitmap: We must clear the bit after finding the next
> > > >   piece of data to be mirrored, but before we yield after getting
> > > >   information that we use for the decision which kind of
> > > >   operation we need.
> > > >
> > > >   In other words, we need to clear the dirty bitmap bit before
> > > >   calling bdrv_get_block_status_above(), because that's both the
> > > >   function that retrieves information about the next chunk and
> > > >   also a function that can yield.
> > > >
> > > >   If after this point the data is written to, we need to mirror
> > > >   it again.
> > >
> > > With Fam's patch, that's not trivial for two reasons:
> > >
> > > 1) bdrv_get_block_status_above() can return a smaller amount than
> > >    what is asked.
> > >
> > > 2) the "read and write" case can handle s->granularity sectors per
> > >    iteration (many of them can be coalesced, but still that's how
> > >    the iteration works).
> > >
> > > The simplest solution is to perform the query with s->granularity
> > > size rather than s->buf_size.
> >
> > Then we end up with many small operations, that's not what we want.
> >
> > Why can't we mark up to s->buf_size dirty clusters as clean first,
> > then query the status, and mark all of those that we can't handle
> > dirty again?
>
> Then we may end up marking more clusters as dirty than we should.

You're both right.

> Because all bdrv_set_dirty() and bdrv_set_dirty_bitmap() callers are
> coroutines, we can introduce a CoMutex to let the bitmap reader block
> bdrv_set_dirty() and bdrv_set_dirty_bitmap().

I think this is not necessary. I think the following is safe:

1) before calling bdrv_get_block_status_above(), find out how many
   consecutive bits in the dirty bitmap are 1

2) zero all those bits in the dirty bitmap

3) call bdrv_get_block_status_above() with a size equivalent to the
   number of dirty bits

4) if bdrv_get_block_status_above() only returns a partial result, loop
   step (3) until all the dirty bits are processed

For full mirroring, this strategy will probably make the first
incremental iteration more expensive.

Paolo
* Re: [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image
  2015-11-10  9:01 ` Paolo Bonzini
@ 2015-11-10 10:12 ` Kevin Wolf
  2015-11-10 10:30 ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread

From: Kevin Wolf @ 2015-11-10 10:12 UTC
To: Paolo Bonzini; +Cc: Fam Zheng, qemu-devel, qemu-block

Am 10.11.2015 um 10:01 hat Paolo Bonzini geschrieben:
> On 10/11/2015 07:14, Fam Zheng wrote:
> > On Mon, 11/09 17:29, Kevin Wolf wrote:
> > > Am 09.11.2015 um 17:18 hat Paolo Bonzini geschrieben:
> > > > On 09/11/2015 17:04, Kevin Wolf wrote:
> > > > > Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben:
> > > > > > The "pnum < nb_sectors" condition in deciding whether to
> > > > > > actually copy data is unnecessarily strict, and the qiov
> > > > > > initialization is unnecessary too, for both
> > > > > > bdrv_aio_write_zeroes and bdrv_aio_discard branches.
> > > > > >
> > > > > > Reorganize mirror_iteration flow so that we:
> > > > > >
> > > > > > 1) Find the contiguous zero/discarded sectors with
> > > > > >    bdrv_get_block_status_above() before deciding what to do.
> > > > > >    We query s->buf_size sized blocks at a time.
> > > > > >
> > > > > > 2) If the sectors in question are zeroed/discarded and
> > > > > >    aligned to target cluster, issue zero write or discard
> > > > > >    accordingly. It's done in mirror_do_zero_or_discard, where
> > > > > >    we don't add buffer to qiov.
> > > > > >
> > > > > > 3) Otherwise, do the same loop as before in mirror_do_read.
> > > > > >
> > > > > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > > >
> > > > > I'm not sure where in the patch to comment on this, so I'll
> > > > > just do it here right in the beginning.
> > > > >
> > > > > I'm concerned that we need to be more careful about races in
> > > > > this patch, in particular regarding the bitmaps. I think the
> > > > > conditions for the two bitmaps are:
> > > > >
> > > > > * Dirty bitmap: We must clear the bit after finding the next
> > > > >   piece of data to be mirrored, but before we yield after
> > > > >   getting information that we use for the decision which kind
> > > > >   of operation we need.
> > > > >
> > > > >   In other words, we need to clear the dirty bitmap bit before
> > > > >   calling bdrv_get_block_status_above(), because that's both
> > > > >   the function that retrieves information about the next chunk
> > > > >   and also a function that can yield.
> > > > >
> > > > >   If after this point the data is written to, we need to
> > > > >   mirror it again.
> > > >
> > > > With Fam's patch, that's not trivial for two reasons:
> > > >
> > > > 1) bdrv_get_block_status_above() can return a smaller amount
> > > >    than what is asked.
> > > >
> > > > 2) the "read and write" case can handle s->granularity sectors
> > > >    per iteration (many of them can be coalesced, but still
> > > >    that's how the iteration works).
> > > >
> > > > The simplest solution is to perform the query with
> > > > s->granularity size rather than s->buf_size.
> > >
> > > Then we end up with many small operations, that's not what we want.
> > >
> > > Why can't we mark up to s->buf_size dirty clusters as clean first,
> > > then query the status, and mark all of those that we can't handle
> > > dirty again?
> >
> > Then we may end up marking more clusters as dirty than we should.
>
> You're both right.
>
> > Because all bdrv_set_dirty() and bdrv_set_dirty_bitmap() callers are
> > coroutines, we can introduce a CoMutex to let the bitmap reader
> > block bdrv_set_dirty() and bdrv_set_dirty_bitmap().
>
> I think this is not necessary. I think the following is safe:
>
> 1) before calling bdrv_get_block_status_above(), find out how many
>    consecutive bits in the dirty bitmap are 1
>
> 2) zero all those bits in the dirty bitmap
>
> 3) call bdrv_get_block_status_above() with a size equivalent to the
>    number of dirty bits
>
> 4) if bdrv_get_block_status_above() only returns a partial result,
>    loop step (3) until all the dirty bits are processed

Right, you can always implement one iteration with more than one I/O
request. And maybe that would be the time to start a coroutine for the
requests already in the mirror code instead of complicating the AIO
state machine and letting block.c start coroutines.

> For full mirroring, this strategy will probably make the first
> incremental iteration more expensive.

You mean because we issue smaller, interleaved write and write_zeroes
requests now instead of only large writes? That's probably right, but
getting the right result should be more important than speed. :-)

Kevin
* Re: [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image
  2015-11-10 10:12 ` Kevin Wolf
@ 2015-11-10 10:30 ` Paolo Bonzini
  0 siblings, 0 replies; 10+ messages in thread

From: Paolo Bonzini @ 2015-11-10 10:30 UTC
To: Kevin Wolf; +Cc: Fam Zheng, qemu-devel, qemu-block

On 10/11/2015 11:12, Kevin Wolf wrote:
> > For full mirroring, this strategy will probably make the first
> > incremental iteration more expensive.
>
> You mean because we issue smaller, interleaved write and write_zeroes
> requests now instead of only large writes? That's probably right, but
> getting the right result should be more important than speed. :-)

No, because you might end up clearing the whole dirty bitmap before
issuing the first bdrv_get_block_status_above(). Blocks are actually
read much later; if someone sets the dirty bitmap in between, you will
re-write those blocks unnecessarily during the first incremental
iteration. It's not specific to the first iteration, it's just more
likely then.

However, it may be enough to clamp the number of dirty bitmap bits that
you process in one go (e.g. to 100 MB or so).

Paolo
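Paolo's clamp is a one-line bound on step 1 of his earlier scheme. A sketch with illustrative names (the 100 MB figure is his example; the helper is invented):

```python
# Hypothetical clamp: bound how many clusters one iteration may collect
# and clear, so the window between clearing a dirty bit and actually
# reading the block stays small.
MAX_CHUNK_BYTES = 100 * 1024 * 1024  # Paolo's "100 MB or so" example

def clamp_run(run_len, cluster_bytes, limit=MAX_CHUNK_BYTES):
    max_clusters = max(1, limit // cluster_bytes)
    return min(run_len, max_clusters)
```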
end of thread, other threads: [~2015-11-10 10:30 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-06 10:22 [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image Fam Zheng
2015-11-06 18:36 ` [Qemu-devel] [Qemu-block] " Max Reitz
2015-11-09  2:12   ` Fam Zheng
2015-11-09 16:04 ` [Qemu-devel] " Kevin Wolf
2015-11-09 16:18   ` Paolo Bonzini
2015-11-09 16:29     ` Kevin Wolf
2015-11-10  6:14       ` Fam Zheng
2015-11-10  9:01         ` Paolo Bonzini
2015-11-10 10:12           ` Kevin Wolf
2015-11-10 10:30             ` Paolo Bonzini