* [Qemu-devel] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer
@ 2016-11-08 22:52 Eric Blake
  2016-11-09  2:35 ` Fam Zheng
  2016-11-09 13:49 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  0 siblings, 2 replies; 8+ messages in thread
From: Eric Blake @ 2016-11-08 22:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, qemu-stable, Ed Swierk, Denis V . Lunev,
	Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz

Commit 443668ca rewrote the write_zeroes logic to guarantee that
an unaligned request never crosses a cluster boundary.  But
in the rewrite, the new code assumed that at most one iteration
would be needed to get to an alignment boundary.

However, it is easy to trigger an assertion failure: the Linux
kernel makes loopback devices advertise a max_transfer of only
64k.  Any operation that must fall back to writes rather than
more efficient zeroing has to obey max_transfer during that
fallback.  When a format with clusters larger than 64k is layered
atop file access to a loopback device, an unaligned head may
therefore need multiple iterations of the write fallback before
reaching an aligned boundary.

Test case:

$ qemu-img create -f qcow2 -o cluster_size=1M file 10M
$ losetup /dev/loop2 /path/to/file
$ qemu-io -f qcow2 /dev/loop2
qemu-io> w 7m 1k
qemu-io> w -z 8003584 2093056

In fairness to Denis (as the original listed author of the culprit
commit), the faulty logic for at most one iteration is probably all
my fault in reworking his idea.  But the solution is to restore what
was in place prior to that commit: when dealing with an unaligned
head or tail, iterate as many times as necessary while fragmenting
the operation at max_transfer boundaries.
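
For illustration only, here is a minimal standalone sketch of the head
arithmetic described above. It is not the code in block/io.c: the
function name head_iterations() and the helper min_i64() are
hypothetical, and it assumes the alignment equals the 1M cluster size
and max_transfer is 64k, as in the test case. It counts how many
fragments are needed before the offset reaches an aligned boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of QEMU. */
static inline int64_t min_i64(int64_t a, int64_t b)
{
    return a < b ? a : b;
}

/*
 * Count the fragments needed to consume an unaligned head when each
 * fragment is capped at max_transfer and must not cross an alignment
 * boundary.  Mirrors the idea of the patched "if (head)" branch only.
 */
static int head_iterations(int64_t offset, int64_t count,
                           int64_t alignment, int64_t max_transfer)
{
    int64_t head;
    int iterations = 0;

    assert(alignment > 0 && max_transfer > 0);
    head = offset % alignment;

    while (head && count > 0) {
        /* Cap at max_transfer, and stop at the next aligned offset. */
        int64_t num = min_i64(min_i64(count, max_transfer),
                              alignment - head);

        head = (head + num) % alignment;
        count -= num;
        iterations++;
    }
    return iterations;
}
```

With the values from the test case, head_iterations(8003584, 2093056,
1048576, 65536) starts with head = 663552, so 385024 bytes remain
before the next 1M boundary; at 64k per fragment that takes five full
fragments plus one of 57344 bytes, six iterations in total rather than
the single one the old code assumed.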

CC: qemu-stable@nongnu.org
CC: Ed Swierk <eswierk@skyportsystems.com>
CC: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/io.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index aa532a5..085ac34 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1214,6 +1214,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
     int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
                         bs->bl.request_alignment);
+    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);

     assert(alignment % bs->bl.request_alignment == 0);
     head = offset % alignment;
@@ -1229,9 +1231,12 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
          * boundaries.
          */
         if (head) {
-            /* Make a small request up to the first aligned sector.  */
-            num = MIN(count, alignment - head);
-            head = 0;
+            /* Make a small request up to the first aligned sector. For
+             * convenience, limit this request to max_transfer even if
+             * we don't need to fall back to writes.  */
+            num = MIN(MIN(count, max_transfer), alignment - head);
+            head = (head + num) % alignment;
+            assert(num < max_write_zeroes);
         } else if (tail && num > alignment) {
             /* Shorten the request to the last aligned sector.  */
             num -= tail;
@@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,

         if (ret == -ENOTSUP) {
             /* Fall back to bounce buffer if write zeroes is unsupported */
-            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
-                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
             BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;

             if ((flags & BDRV_REQ_FUA) &&
-- 
2.7.4


* Re: [Qemu-devel] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer
  2016-11-08 22:52 [Qemu-devel] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer Eric Blake
@ 2016-11-09  2:35 ` Fam Zheng
  2016-11-09 13:49 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  1 sibling, 0 replies; 8+ messages in thread
From: Fam Zheng @ 2016-11-09  2:35 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, qemu-block, qemu-stable, Ed Swierk, Denis V . Lunev,
	Stefan Hajnoczi, Kevin Wolf, Max Reitz

On Tue, 11/08 16:52, Eric Blake wrote:
> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> an unaligned request never crosses a cluster boundary.  But
> in the rewrite, the new code assumed that at most one iteration
> would be needed to get to an alignment boundary.
> 
> However, it is easy to trigger an assertion failure: the Linux
> kernel limits loopback devices to advertise a max_transfer of
> only 64k.  Any operation that requires falling back to writes
> rather than more efficient zeroing must obey max_transfer during
> that fallback, which means an unaligned head may require multiple
> iterations of the write fallbacks before reaching the aligned
> boundaries, when layering a format with clusters larger than 64k
> atop the protocol of file access to a loopback device.
> 
> Test case:
> 
> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> $ losetup /dev/loop2 /path/to/file
> $ qemu-io -f qcow2 /dev/loop2
> qemu-io> w 7m 1k
> qemu-io> w -z 8003584 2093056
> 
> In fairness to Denis (as the original listed author of the culprit
> commit), the faulty logic for at most one iteration is probably all
> my fault in reworking his idea.  But the solution is to restore what
> was in place prior to that commit: when dealing with an unaligned
> head or tail, iterate as many times as necessary while fragmenting
> the operation at max_transfer boundaries.
> 
> CC: qemu-stable@nongnu.org
> CC: Ed Swierk <eswierk@skyportsystems.com>
> CC: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/io.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index aa532a5..085ac34 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1214,6 +1214,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>      int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>      int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
>                          bs->bl.request_alignment);
> +    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> +                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);
> 
>      assert(alignment % bs->bl.request_alignment == 0);
>      head = offset % alignment;
> @@ -1229,9 +1231,12 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>           * boundaries.
>           */
>          if (head) {
> -            /* Make a small request up to the first aligned sector.  */
> -            num = MIN(count, alignment - head);
> -            head = 0;
> +            /* Make a small request up to the first aligned sector. For
> +             * convenience, limit this request to max_transfer even if
> +             * we don't need to fall back to writes.  */
> +            num = MIN(MIN(count, max_transfer), alignment - head);
> +            head = (head + num) % alignment;
> +            assert(num < max_write_zeroes);
>          } else if (tail && num > alignment) {
>              /* Shorten the request to the last aligned sector.  */
>              num -= tail;
> @@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
> 
>          if (ret == -ENOTSUP) {
>              /* Fall back to bounce buffer if write zeroes is unsupported */
> -            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> -                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
>              BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
> 
>              if ((flags & BDRV_REQ_FUA) &&
> -- 
> 2.7.4
> 

Reviewed-by: Fam Zheng <famz@redhat.com>


* Re: [Qemu-devel] [Qemu-block] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer
  2016-11-08 22:52 [Qemu-devel] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer Eric Blake
  2016-11-09  2:35 ` Fam Zheng
@ 2016-11-09 13:49 ` Stefan Hajnoczi
  2016-11-09 20:06   ` Eric Blake
  1 sibling, 1 reply; 8+ messages in thread
From: Stefan Hajnoczi @ 2016-11-09 13:49 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, Kevin Wolf, Fam Zheng, qemu-block, qemu-stable,
	Max Reitz, Ed Swierk, Stefan Hajnoczi, Denis V . Lunev

On Tue, Nov 08, 2016 at 04:52:15PM -0600, Eric Blake wrote:
> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> an unaligned request never crosses a cluster boundary.  But
> in the rewrite, the new code assumed that at most one iteration
> would be needed to get to an alignment boundary.
> 
> However, it is easy to trigger an assertion failure: the Linux
> kernel limits loopback devices to advertise a max_transfer of
> only 64k.  Any operation that requires falling back to writes
> rather than more efficient zeroing must obey max_transfer during
> that fallback, which means an unaligned head may require multiple
> iterations of the write fallbacks before reaching the aligned
> boundaries, when layering a format with clusters larger than 64k
> atop the protocol of file access to a loopback device.
> 
> Test case:
> 
> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> $ losetup /dev/loop2 /path/to/file
> $ qemu-io -f qcow2 /dev/loop2
> qemu-io> w 7m 1k
> qemu-io> w -z 8003584 2093056

Please include a qemu-iotests test case to protect against regressions.

Thanks,
Stefan


* Re: [Qemu-devel] [Qemu-block] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer
  2016-11-09 13:49 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2016-11-09 20:06   ` Eric Blake
  2016-11-10  2:11     ` Fam Zheng
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Blake @ 2016-11-09 20:06 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Kevin Wolf, Fam Zheng, qemu-block, qemu-stable,
	Max Reitz, Ed Swierk, Stefan Hajnoczi, Denis V . Lunev

On 11/09/2016 07:49 AM, Stefan Hajnoczi wrote:
> On Tue, Nov 08, 2016 at 04:52:15PM -0600, Eric Blake wrote:
>> Commit 443668ca rewrote the write_zeroes logic to guarantee that
>> an unaligned request never crosses a cluster boundary.  But
>> in the rewrite, the new code assumed that at most one iteration
>> would be needed to get to an alignment boundary.
>>
>> However, it is easy to trigger an assertion failure: the Linux
>> kernel limits loopback devices to advertise a max_transfer of
>> only 64k.  Any operation that requires falling back to writes
>> rather than more efficient zeroing must obey max_transfer during
>> that fallback, which means an unaligned head may require multiple
>> iterations of the write fallbacks before reaching the aligned
>> boundaries, when layering a format with clusters larger than 64k
>> atop the protocol of file access to a loopback device.
>>
>> Test case:
>>
>> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
>> $ losetup /dev/loop2 /path/to/file
>> $ qemu-io -f qcow2 /dev/loop2
>> qemu-io> w 7m 1k
>> qemu-io> w -z 8003584 2093056
> 
> Please include a qemu-iotests test case to protect against regressions.

None of the existing qemu-iotests use losetup; I guess the closest thing
to do is crib from a test that uses passwordless sudo?

It will certainly be a separate commit, but I'll give it my best shot to
post something soon.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [Qemu-block] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer
  2016-11-09 20:06   ` Eric Blake
@ 2016-11-10  2:11     ` Fam Zheng
  2016-11-10  8:03       ` Kevin Wolf
  0 siblings, 1 reply; 8+ messages in thread
From: Fam Zheng @ 2016-11-10  2:11 UTC (permalink / raw)
  To: Eric Blake
  Cc: Stefan Hajnoczi, qemu-devel, Kevin Wolf, qemu-block, qemu-stable,
	Max Reitz, Ed Swierk, Stefan Hajnoczi, Denis V . Lunev

On Wed, 11/09 14:06, Eric Blake wrote:
> On 11/09/2016 07:49 AM, Stefan Hajnoczi wrote:
> > On Tue, Nov 08, 2016 at 04:52:15PM -0600, Eric Blake wrote:
> >> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> >> an unaligned request never crosses a cluster boundary.  But
> >> in the rewrite, the new code assumed that at most one iteration
> >> would be needed to get to an alignment boundary.
> >>
> >> However, it is easy to trigger an assertion failure: the Linux
> >> kernel limits loopback devices to advertise a max_transfer of
> >> only 64k.  Any operation that requires falling back to writes
> >> rather than more efficient zeroing must obey max_transfer during
> >> that fallback, which means an unaligned head may require multiple
> >> iterations of the write fallbacks before reaching the aligned
> >> boundaries, when layering a format with clusters larger than 64k
> >> atop the protocol of file access to a loopback device.
> >>
> >> Test case:
> >>
> >> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> >> $ losetup /dev/loop2 /path/to/file
> >> $ qemu-io -f qcow2 /dev/loop2
> >> qemu-io> w 7m 1k
> >> qemu-io> w -z 8003584 2093056
> > 
> > Please include a qemu-iotests test case to protect against regressions.
> 
> None of the existing qemu-iotests use losetup; I guess the closest thing
> to do is crib from a test that uses passwordless sudo?
> 
> It will certainly be a separate commit, but I'll give it my best shot to
> post something soon.

Alternatively, maybe add a blkdebug option to emulate a small max_transfer at
the protocol layer?

Fam


* Re: [Qemu-devel] [Qemu-block] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer
  2016-11-10  2:11     ` Fam Zheng
@ 2016-11-10  8:03       ` Kevin Wolf
  2016-11-14 15:50         ` Eric Blake
  0 siblings, 1 reply; 8+ messages in thread
From: Kevin Wolf @ 2016-11-10  8:03 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Eric Blake, Stefan Hajnoczi, qemu-devel, qemu-block, qemu-stable,
	Max Reitz, Ed Swierk, Stefan Hajnoczi, Denis V . Lunev

Am 10.11.2016 um 03:11 hat Fam Zheng geschrieben:
> On Wed, 11/09 14:06, Eric Blake wrote:
> > On 11/09/2016 07:49 AM, Stefan Hajnoczi wrote:
> > > On Tue, Nov 08, 2016 at 04:52:15PM -0600, Eric Blake wrote:
> > >> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> > >> an unaligned request never crosses a cluster boundary.  But
> > >> in the rewrite, the new code assumed that at most one iteration
> > >> would be needed to get to an alignment boundary.
> > >>
> > >> However, it is easy to trigger an assertion failure: the Linux
> > >> kernel limits loopback devices to advertise a max_transfer of
> > >> only 64k.  Any operation that requires falling back to writes
> > >> rather than more efficient zeroing must obey max_transfer during
> > >> that fallback, which means an unaligned head may require multiple
> > >> iterations of the write fallbacks before reaching the aligned
> > >> boundaries, when layering a format with clusters larger than 64k
> > >> atop the protocol of file access to a loopback device.
> > >>
> > >> Test case:
> > >>
> > >> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> > >> $ losetup /dev/loop2 /path/to/file
> > >> $ qemu-io -f qcow2 /dev/loop2
> > >> qemu-io> w 7m 1k
> > >> qemu-io> w -z 8003584 2093056
> > > 
> > > Please include a qemu-iotests test case to protect against regressions.
> > 
> > None of the existing qemu-iotests use losetup; I guess the closest thing
> > to do is crib from a test that uses passwordless sudo?
> > 
> > It will certainly be a separate commit, but I'll give it my best shot to
> > post something soon.
> 
> Alternatively, maybe add a blkdebug option to emulate a small max_transfer at
> the protocol layer?

This sounds like the better option indeed.

Kevin


* Re: [Qemu-devel] [Qemu-block] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer
  2016-11-10  8:03       ` Kevin Wolf
@ 2016-11-14 15:50         ` Eric Blake
  2016-11-15 12:57           ` Stefan Hajnoczi
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Blake @ 2016-11-14 15:50 UTC (permalink / raw)
  To: Kevin Wolf, Fam Zheng
  Cc: Stefan Hajnoczi, qemu-devel, qemu-block, qemu-stable, Max Reitz,
	Ed Swierk, Stefan Hajnoczi, Denis V . Lunev

On 11/10/2016 02:03 AM, Kevin Wolf wrote:

>>>>> Test case:
>>>>>
>>>>> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
>>>>> $ losetup /dev/loop2 /path/to/file
>>>>> $ qemu-io -f qcow2 /dev/loop2
>>>>> qemu-io> w 7m 1k
>>>>> qemu-io> w -z 8003584 2093056
>>>>
>>>> Please include a qemu-iotests test case to protect against regressions.
>>>
>>> None of the existing qemu-iotests use losetup; I guess the closest thing
>>> to do is crib from a test that uses passwordless sudo?
>>>
>>> It will certainly be a separate commit, but I'll give it my best shot to
>>> post something soon.
>>
>> Alternatively, maybe add a blkdebug option to emulate a small max_transfer at
>> the protocol layer?
> 
> This sounds like the better option indeed.

I'm working on this, but found that blkdebug doesn't yet support discard
or write zeroes. While I do plan on adding that support, it is a new
feature for blkdebug, and therefore probably belongs in 2.9.  That said,
I'm still hoping to post an entire series with improved blkdebug and
qemu-iotests coverage of the two tangentially related patches (this one
for write zeroes, and another for discard support).  We can then pick
the first half of the series (basically v2 of my pending patches) into
2.8, while the second half (the blkdebug and testsuite additions) can
wait for 2.9.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [Qemu-block] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer
  2016-11-14 15:50         ` Eric Blake
@ 2016-11-15 12:57           ` Stefan Hajnoczi
  0 siblings, 0 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2016-11-15 12:57 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Fam Zheng, Stefan Hajnoczi, qemu-devel, qemu-block,
	qemu-stable, Max Reitz, Ed Swierk, Denis V . Lunev

On Mon, Nov 14, 2016 at 09:50:33AM -0600, Eric Blake wrote:
> On 11/10/2016 02:03 AM, Kevin Wolf wrote:
> 
> >>>>> Test case:
> >>>>>
> >>>>> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> >>>>> $ losetup /dev/loop2 /path/to/file
> >>>>> $ qemu-io -f qcow2 /dev/loop2
> >>>>> qemu-io> w 7m 1k
> >>>>> qemu-io> w -z 8003584 2093056
> >>>>
> >>>> Please include a qemu-iotests test case to protect against regressions.
> >>>
> >>> None of the existing qemu-iotests use losetup; I guess the closest thing
> >>> to do is crib from a test that uses passwordless sudo?
> >>>
> >>> It will certainly be a separate commit, but I'll give it my best shot to
> >>> post something soon.
> >>
> >> Alternatively, maybe add a blkdebug option to emulate a small max_transfer at
> >> the protocol layer?
> > 
> > This sounds like the better option indeed.
> 
> I'm working on this, but found that blkdebug doesn't yet support discard
> or write zero. While I do plan on adding that support, it is a new
> feature to blkdebug, and therefore probably belongs in 2.9.  That said,
> I'm still hoping to post an entire series with improved blkdebug and
> qemu-iotest coverage of the two tangentially related patches (this one
> for write zeroes, and another for discard support), where we can pick
> the first half of the series (basically v2 of my pending patches) into
> 2.8 while feeling more confident that the second half (the blkdebug and
> testsuite additions) wait for 2.9.

We've reached hard freeze so I like the idea of taking just small fixes
now.

Please send them for -rc0 or -rc1.

Stefan
