Re: [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image

From: Kevin Wolf <kwolf@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Fam Zheng <famz@redhat.com>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org
Subject: Re: [Qemu-devel] [PATCH] mirror: Improve zero-write and discard with fragmented image
Date: Tue, 10 Nov 2015 11:12:59 +0100	[thread overview]
Message-ID: <20151110101259.GA4163@noname.redhat.com> (raw)
In-Reply-To: <5641B282.10706@redhat.com>

Am 10.11.2015 um 10:01 hat Paolo Bonzini geschrieben:
> 
> 
> On 10/11/2015 07:14, Fam Zheng wrote:
> > On Mon, 11/09 17:29, Kevin Wolf wrote:
> >> Am 09.11.2015 um 17:18 hat Paolo Bonzini geschrieben:
> >>>
> >>>
> >>> On 09/11/2015 17:04, Kevin Wolf wrote:
> >>>> Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben:
> >>>>> The "pnum < nb_sectors" condition in deciding whether to actually copy
> >>>>> data is unnecessarily strict, and the qiov initialization is
> >>>>> unnecessarily too, for both bdrv_aio_write_zeroes and bdrv_aio_discard
> >>>>> branches.
> >>>>>
> >>>>> Reorganize mirror_iteration flow so that we:
> >>>>>
> >>>>>     1) Find the contiguous zero/discarded sectors with
> >>>>>     bdrv_get_block_status_above() before deciding what to do. We query
> >>>>>     s->buf_size sized blocks at a time.
> >>>>>
> >>>>>     2) If the sectors in question are zeroed/discarded and aligned to
> >>>>>     target cluster, issue zero write or discard accordingly. It's done
> >>>>>     in mirror_do_zero_or_discard, where we don't add buffer to qiov.
> >>>>>
> >>>>>     3) Otherwise, do the same loop as before in mirror_do_read.
> >>>>>
> >>>>> Signed-off-by: Fam Zheng <famz@redhat.com>
> >>>>
> >>>> I'm not sure where in the patch to comment on this, so I'll just do it
> >>>> here right in the beginning.
> >>>>
> >>>> I'm concerned that we need to be more careful about races in this patch,
> >>>> in particular regarding the bitmaps. I think the conditions for the two
> >>>> bitmaps are:
> >>>>
> >>>> * Dirty bitmap: We must clear the bit after finding the next piece of
> >>>>   data to be mirrored, but before we yield after getting information
> >>>>   that we use for the decision which kind of operation we need.
> >>>>
> >>>>   In other words, we need to clear the dirty bitmap bit before calling
> >>>>   bdrv_get_block_status_above(), because that's both the function that
> >>>>   retrieves information about the next chunk and also a function that
> >>>>   can yield.
> >>>>
> >>>>   If after this point the data is written to, we need to mirror it
> >>>>   again.
> >>>
> >>> With Fam's patch, that's not trivial for two reasons:
> >>>
> >>> 1) bdrv_get_block_status_above() can return a smaller amount than what
> >>> is asked.
> >>>
> >>> 2) the "read and write" case can handle s->granularity sectors per
> >>> iteration (many of them can be coalesced, but still that's how the
> >>> iteration works).
> >>>
> >>> The simplest solution is to perform the query with s->granularity size
> >>> rather than s->buf_size.
> >>
> >> Then we end up with many small operations, that's not what we want.
> >>
> >> Why can't we mark up to s->buf_size dirty clusters as clean first, then
> >> query the status, and mark all of those that we can't handle dirty
> >> again?
> > 
> > Then we may end up marking more clusters as dirty than it should be.
> 
> You're both right.
> 
> > Because all bdrv_set_dirty() and bdrv_set_dirty_bitmap() callers are coroutine,
> > we can introduce a CoMutex to let bitmap reader block bdrv_set_dirty and
> > bdrv_set_dirty_bitmap.
> 
> I think this is not necessary.
> 
> I think the following is safe:
> 
> 1) before calling bdrv_get_block_status_above(), find out how many
> consecutive bits in the dirty bitmap are 1
> 
> 2) zero all those bits in the dirty bitmap
> 
> 3) call bdrv_get_block_status_above() with a size equivalent to the
> number of dirty bits
> 
> 4) if bdrv_get_block_status_above() only returns a partial result, loop
> step (3) until all the dirty bits are processed

Right, you can always implement one iteration with more than one I/O
request. And maybe that would be the time to start a coroutine for the
requests already in the mirror code instead of complicating the AIO
state machine and letting block.c start coroutines.

> For full mirroring, this strategy will probably make the first
> incremental iteration more expensive.

You mean because we issue smaller, interleaved write and write_zeroes
requests now instead of only large writes? That's probably right, but
getting the right result should be more important than speed. :-)

Kevin