Re: [Qemu-devel] Assertion failure on qcow2 disk with cluster_size != 64k

From: Ed Swierk <eswierk@skyportsystems.com>
To: Eric Blake <eblake@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
	Kevin Wolf <kwolf@redhat.com>, "Denis V. Lunev" <den@openvz.org>
Subject: Re: [Qemu-devel] Assertion failure on qcow2 disk with cluster_size != 64k
Date: Mon, 24 Oct 2016 16:06:26 -0700	[thread overview]
Message-ID: <CAO_EM_=N03RGb=HZwnFHwHfwR5z01TT6_BZ3asscePnKZMmWUw@mail.gmail.com> (raw)
In-Reply-To: <8a8853e4-d1ce-66b2-5214-b508fb897386@redhat.com>

On Mon, Oct 24, 2016 at 2:21 PM, Eric Blake <eblake@redhat.com> wrote:
> How are you getting max_transfer == 65536?  I can't reproduce it with
> the following setup:
>
> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> $ qemu-io -f qcow2 -c 'w 7m 1k' file
> $ qemu-io -f qcow2 -c 'w -z 8003584 2093056' file
>
> although I did confirm that the above sequence was enough to get the
> -ENOTSUP failure and fall into the code calculating max_transfer.
>
> I'm guessing that you are using something other than a file system as
> the backing protocol for your qcow2 image.  But do you really have a
> protocol that takes AT MOST 64k per transaction, while still trying to a
> cluster size of 1M in the qcow2 format?  That's rather awkward, as it
> means that you are required to do 16 transactions per cluster (the whole
> point of using larger clusters is usually to get fewer transactions).  I
> think we need to get to a root cause of why you are seeing such a small
> max_transfer, before I can propose the right patch, since I haven't been
> able to reproduce it locally yet (although I admit I haven't tried to
> see if blkdebug could reliably introduce artificial limits to simulate
> your setup).  And it may turn out that I just have to fix the
> bdrv_co_do_pwrite_zeroes() code to loop multiple times if the size of
> the unaligned head really does exceed the max_transfer size that the
> underlying protocol is able to support, rather than assuming that the
> unaligned head/tail always fit in a single fallback write.

In this case I'm using a qcow2 image that's stored directly in a raw
dm-crypt/LUKS container, which is itself a loop device on an ext4
filesystem.

It appears loop devices (with or without dm-crypt/LUKS) report a
255-sector maximum per request via the BLKSECTGET ioctl, which qemu
rounds down to 64k in raw_refresh_limits(). However this maximum
appears to be just a hint: bdrv_driver_pwritev() succeeds even with a
385024-byte buffer of zeroes.

As for the 1M cluster size, this is a temporary workaround for another
qemu issue (the default qcow2 L2 table cache size performs well with
random reads covering only up to 8 GB of image data with 64k clusters;
beyond that the L2 table cache thrashes). I agree this is not an
optimal configuration for writes.

> Can you also try this patch? If I'm right, you'll still fail, but the
> assertion will be slightly different.  (Again, I'm passing locally, but
> that's because I'm using the file protocol, and my file system does not
> impose a puny 64k max transfer).
>
> diff --git i/block/io.c w/block/io.c
> index b136c89..8757063 100644
> --- i/block/io.c
> +++ w/block/io.c
> @@ -1179,6 +1179,8 @@ static int coroutine_fn
> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>      int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>      int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
>                          bs->bl.request_alignment);
> +    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> +                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);
>
>      assert(alignment % bs->bl.request_alignment == 0);
>      head = offset % alignment;
> @@ -1197,6 +1199,8 @@ static int coroutine_fn
> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>              /* Make a small request up to the first aligned sector.  */
>              num = MIN(count, alignment - head);
>              head = 0;
> +            assert(num < max_write_zeroes);
> +            assert(num < max_transfer);
>          } else if (tail && num > alignment) {
>              /* Shorten the request to the last aligned sector.  */
>              num -= tail;
> @@ -1222,8 +1226,6 @@ static int coroutine_fn
> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>
>          if (ret == -ENOTSUP) {
>              /* Fall back to bounce buffer if write zeroes is unsupported */
> -            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> -
> MAX_WRITE_ZEROES_BOUNCE_BUFFER);
>              BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>
>              if ((flags & BDRV_REQ_FUA) &&

With this change, the num < max_transfer assertion fails on the first
iteration (with num=385024 and max_transfer=65536).

--Ed