From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:40842)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1Zvp9s-00078s-1G
	for qemu-devel@nongnu.org; Mon, 09 Nov 2015 11:19:04 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1Zvp9r-0002dK-2M
	for qemu-devel@nongnu.org; Mon, 09 Nov 2015 11:19:04 -0500
References: <1446805353-3047-1-git-send-email-famz@redhat.com>
	<20151109160409.GC3621@noname.redhat.com>
From: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <5640C768.50704@redhat.com>
Date: Mon, 9 Nov 2015 17:18:48 +0100
MIME-Version: 1.0
In-Reply-To: <20151109160409.GC3621@noname.redhat.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [PATCH] mirror: Improve zero-write and discard
 with fragmented image
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>, Fam Zheng <famz@redhat.com>
Cc: Jeff Cody <jcody@redhat.com>, qemu-devel@nongnu.org, qemu-block@nongnu.org


On 09/11/2015 17:04, Kevin Wolf wrote:
> Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben:
>> The "pnum < nb_sectors" condition in deciding whether to actually copy
>> data is unnecessarily strict, and the qiov initialization is
>> unnecessarily too, for both bdrv_aio_write_zeroes and bdrv_aio_discard
>> branches.
>>
>> Reorganize mirror_iteration flow so that we:
>>
>>     1) Find the contiguous zero/discarded sectors with
>>     bdrv_get_block_status_above() before deciding what to do. We query
>>     s->buf_size sized blocks at a time.
>>
>>     2) If the sectors in question are zeroed/discarded and aligned to
>>     target cluster, issue zero write or discard accordingly. It's done
>>     in mirror_do_zero_or_discard, where we don't add buffer to qiov.
>>
>>     3) Otherwise, do the same loop as before in mirror_do_read.
>>
>> Signed-off-by: Fam Zheng <famz@redhat.com>
>=20
> I'm not sure where in the patch to comment on this, so I'll just do it
> here right in the beginning.
>=20
> I'm concerned that we need to be more careful about races in this patch=
,
> in particular regarding the bitmaps. I think the conditions for the two
> bitmaps are:
>=20
> * Dirty bitmap: We must clear the bit after finding the next piece of
>   data to be mirrored, but before we yield after getting information
>   that we use for the decision which kind of operation we need.
>=20
>   In other words, we need to clear the dirty bitmap bit before calling
>   bdrv_get_block_status_above(), because that's both the function that
>   retrieves information about the next chunk and also a function that
>   can yield.
>=20
>   If after this point the data is written to, we need to mirror it
>   again.

With Fam's patch, that's not trivial for two reasons:

1) bdrv_get_block_status_above() can return a smaller amount than what
is asked.

2) the "read and write" case can handle s->granularity sectors per
iteration (many of them can be coalesced, but still that's how the
iteration works).

The simplest solution is to perform the query with s->granularity size
rather than s->buf_size.

Paolo

> * In-flight bitmap: We need to make sure that we never mirror the same
>   data twice at the same time as older data could overwrite newer data
>   then.
>=20
>   Strictly speaking, it looks to me as if this meant we can delay
>   setting the bit until before we issue an AIO operation. It might be
>   more obviously correct to set it at the same time as the dirty bitmap
>   is cleared.