* [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
@ 2016-07-04 13:49 Peter Lieven
  2016-07-05  1:53 ` Eric Blake
  2016-07-20 23:35 ` Eric Blake
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Lieven @ 2016-07-04 13:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Eric Blake, Kevin Wolf

Hi,

the above commit:

commit d05aa8bb4a8b6aa9a915ec5074fb12ae632d2323
Author: Eric Blake <eblake@redhat.com>
Date:   Wed Jun 1 15:10:03 2016 -0600

     block: Add .bdrv_co_pwrite_zeroes()

introduces a regression (at least for me).

The limits from the iSCSI Block Limits VPD page are not required to be powers of two.
We use Dell EqualLogic iSCSI SANs, for instance. They have an internal page size of 15MB, and
they advertise this page size as max_ws_len, opt_transfer_len and opt_discard_alignment.

I think we cannot assert that these alignments are a power of 2.

Peter

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-04 13:49 [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes() Peter Lieven
@ 2016-07-05  1:53 ` Eric Blake
  2016-07-05  7:30   ` Peter Lieven
  2016-07-05 13:37   ` Paolo Bonzini
  2016-07-20 23:35 ` Eric Blake
  1 sibling, 2 replies; 19+ messages in thread
From: Eric Blake @ 2016-07-05  1:53 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel; +Cc: Kevin Wolf

[-- Attachment #1: Type: text/plain, Size: 1207 bytes --]

On 07/04/2016 07:49 AM, Peter Lieven wrote:
> Hi,
> 
> the above commit:
> 
> commit d05aa8bb4a8b6aa9a915ec5074fb12ae632d2323
> Author: Eric Blake <eblake@redhat.com>
> Date:   Wed Jun 1 15:10:03 2016 -0600
> 
>     block: Add .bdrv_co_pwrite_zeroes()
> 
> introduces a regression (at least for me).
> 
> The Limits from the iSCSI Block Limits VPD have no requirement of being
> a power of two.
> We use Dell Equallogic iSCSI SANs for instance. They have an internal
> page size of 15MB. And
> they advertise this page size as max_ws_len, opt_transfer_len and
> opt_discard_alignment.

A non-power-of-2 max_ws_len shouldn't be a problem, but opt_transfer_len
and opt_discard_alignment not being a power of 2 impacts other code.
15MB is a rather odd page size.

> 
> I think we cannot assert that these alignments are a power of 2.

Perhaps that means we should just fix our code to round things down to
the nearest power of 2 (8MB) for the opt_transfer_len and
opt_discard_alignment values.  Can you post a stack-trace of the actual
assertion you are hitting?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-05  1:53 ` Eric Blake
@ 2016-07-05  7:30   ` Peter Lieven
  2016-07-05 13:03     ` Eric Blake
  2016-07-05 13:37   ` Paolo Bonzini
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Lieven @ 2016-07-05  7:30 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: Kevin Wolf

On 05.07.2016 at 03:53, Eric Blake wrote:
> On 07/04/2016 07:49 AM, Peter Lieven wrote:
>> Hi,
>>
>> the above commit:
>>
>> commit d05aa8bb4a8b6aa9a915ec5074fb12ae632d2323
>> Author: Eric Blake <eblake@redhat.com>
>> Date:   Wed Jun 1 15:10:03 2016 -0600
>>
>>      block: Add .bdrv_co_pwrite_zeroes()
>>
>> introduces a regression (at least for me).
>>
>> The Limits from the iSCSI Block Limits VPD have no requirement of being
>> a power of two.
>> We use Dell Equallogic iSCSI SANs for instance. They have an internal
>> page size of 15MB. And
>> they advertise this page size as max_ws_len, opt_transfer_len and
>> opt_discard_alignment.
> A non-power-of-2 max_ws_len shouldn't be a problem, but opt_transfer_len
> and opt_discard_alignment not being a power of 2 impacts other code.
> 15MB is a rather odd page size.

I know, not my idea ;-) I think at least opt_discard_alignment of 15MB used to work
before.

>
>> I think we cannot assert that these alignments are a power of 2.
> Perhaps that means we should just fix our code to round things down to
> the nearest power of 2 (8MB) for the opt_transfer_len and
> opt_discard_alignment values.  Can you post a stack-trace of the actual
> assertion you are hitting?
>

Sure:

qemu-system-x86_64: block/io.c:1165: bdrv_co_do_pwrite_zeroes: Assertion `is_power_of_2(alignment)' failed.

Program received signal SIGABRT, Aborted.
0x00007ffff5222c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56    ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt full
#0  0x00007ffff5222c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
         resultvar = 0
         pid = 9610
         selftid = 9610
#1  0x00007ffff5226028 in __GI_abort () at abort.c:89
         save_stage = 2
         act = {__sigaction_handler = {sa_handler = 0x7fffffffe1fe, sa_sigaction = 0x7fffffffe1fe}, sa_mask = {__val = {140737307379972,
               93824998987040, 1165, 93823560581120, 140737306023251, 0, 93825220359792, 47244640260, 93825009612448, 256, 0, 0, 0, 21474836480,
               140737354027008, 140737307395000}}, sa_flags = 1438406118, sa_restorer = 0x555555bc5ca0 <__PRETTY_FUNCTION__.34924>}
         sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007ffff521bbf6 in __assert_fail_base (fmt=0x7ffff536c3b8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
     assertion=assertion@entry=0x555555bc55e6 "is_power_of_2(alignment)", file=file@entry=0x555555bc5520 "block/io.c", line=line@entry=1165,
     function=function@entry=0x555555bc5ca0 <__PRETTY_FUNCTION__.34924> "bdrv_co_do_pwrite_zeroes") at assert.c:92
         str = 0x5555586f5780 ""
         total = 4096
#3  0x00007ffff521bca2 in __GI___assert_fail (assertion=assertion@entry=0x555555bc55e6 "is_power_of_2(alignment)",
     file=file@entry=0x555555bc5520 "block/io.c", line=line@entry=1165,
     function=function@entry=0x555555bc5ca0 <__PRETTY_FUNCTION__.34924> "bdrv_co_do_pwrite_zeroes") at assert.c:101
No locals.
#4  0x0000555555a8a968 in bdrv_co_do_pwrite_zeroes (bs=bs@entry=0x5555565b2df0, offset=offset@entry=1359998976, count=count@entry=4096,
     flags=flags@entry=6) at block/io.c:1165
         drv = 0x5555560b3580 <bdrv_raw>
         qiov = {iov = 0x100000000, niov = 4096, nalloc = 0, size = 140737148473344}
         iov = {iov_base = 0x0, iov_len = 0}
         ret = 0
         need_flush = false
         head = 0
         tail = 0
         max_write_zeroes = <optimized out>
         alignment = 15728640
         __PRETTY_FUNCTION__ = "bdrv_co_do_pwrite_zeroes"
#5  0x0000555555a8ae3b in bdrv_aligned_pwritev (bs=bs@entry=0x5555565b2df0, req=req@entry=0x555562ee3970, offset=offset@entry=1359998976,
     bytes=bytes@entry=4096, qiov=0x555558909e58, flags=<optimized out>, flags@entry=0) at block/io.c:1290
---Type <return> to continue, or q <return> to quit---
         drv = 0x5555560b3580 <bdrv_raw>
         waited = <optimized out>
         ret = <optimized out>
         start_sector = 2656248
         end_sector = 2656256
         __PRETTY_FUNCTION__ = "bdrv_aligned_pwritev"
#6  0x0000555555a8b95f in bdrv_co_pwritev (bs=0x5555565b2df0, offset=offset@entry=1359998976, bytes=bytes@entry=4096,
     qiov=qiov@entry=0x555558909e58, flags=flags@entry=0) at block/io.c:1514
         req = {bs = 0x5555565b2df0, offset = 1359998976, bytes = 4096, type = BDRV_TRACKED_WRITE, serialising = false,
           overlap_offset = 1359998976, overlap_bytes = 4096, list = {le_next = 0x55555a3c9cc0, le_prev = 0x5555565b5f28}, co = 0x555557ccc100,
           wait_queue = {entries = {tqh_first = 0x0, tqh_last = 0x555562ee39b8}}, waiting_for = 0x0}
         align = 512
         head_buf = 0x0
         tail_buf = <optimized out>
         local_qiov = {iov = 0x0, niov = 1448841360, nalloc = 21845, size = 0}
         use_local_qiov = <optimized out>
         ret = <optimized out>
         __PRETTY_FUNCTION__ = "bdrv_co_pwritev"
#7  0x0000555555a7cae3 in blk_co_pwritev (blk=0x5555565b2c30, offset=1359998976, bytes=4096, qiov=0x555558909e58, flags=0)
     at block/block-backend.c:788
         ret = <optimized out>
#8  0x0000555555a7cc2e in blk_aio_write_entry (opaque=0x555557da5200) at block/block-backend.c:977
         acb = 0x555557da5200
         rwco = 0x555557da5228
#9  0x0000555555afafda in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:78
         self = 0x555557ccc100
         co = 0x555557ccc100
#10 0x00007ffff5235800 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#11 0x00007fffffffcf20 in ?? ()
No symbol table info available.
#12 0x0000000000000000 in ?? ()
No symbol table info available.
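
Frame #4 above shows the value that trips the assertion: alignment = 15728640, i.e. 15 MiB.
A minimal standalone check (a sketch using the usual (v & (v - 1)) test, not QEMU's actual
helper) confirms that this is not a power of two:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors the usual power-of-two test; not QEMU's actual helper. */
static bool is_pow2(uint64_t v)
{
    return v && !(v & (v - 1));
}

int main(void)
{
    uint64_t alignment = 15728640;          /* 15 * 1024 * 1024, from frame #4 above */
    printf("%d\n", is_pow2(alignment));     /* prints 0 */
    return 0;
}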

Peter

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-05  7:30   ` Peter Lieven
@ 2016-07-05 13:03     ` Eric Blake
  2016-07-05 13:39       ` Paolo Bonzini
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Blake @ 2016-07-05 13:03 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel; +Cc: Kevin Wolf

[-- Attachment #1: Type: text/plain, Size: 2216 bytes --]

On 07/05/2016 01:30 AM, Peter Lieven wrote:
> On 05.07.2016 at 03:53, Eric Blake wrote:
>> On 07/04/2016 07:49 AM, Peter Lieven wrote:
>>> Hi,
>>>
>>> the above commit:
>>>
>>> commit d05aa8bb4a8b6aa9a915ec5074fb12ae632d2323
>>> Author: Eric Blake <eblake@redhat.com>
>>> Date:   Wed Jun 1 15:10:03 2016 -0600
>>>
>>>      block: Add .bdrv_co_pwrite_zeroes()
>>>
>>> introduces a regression (at least for me).
>>>
>>> The Limits from the iSCSI Block Limits VPD have no requirement of being
>>> a power of two.
>>> We use Dell Equallogic iSCSI SANs for instance. They have an internal
>>> page size of 15MB. And
>>> they advertise this page size as max_ws_len, opt_transfer_len and
>>> opt_discard_alignment.
>> A non-power-of-2 max_ws_len shouldn't be a problem, but opt_transfer_len
>> and opt_discard_alignment not being a power of 2 impacts other code.
>> 15MB is a rather odd page size.
> 
> I know, not my idea ;-) I think at least opt_discard_alignment of 15MB
> used to work
> before.

Does this fix it for you?

diff --git i/block/iscsi.c w/block/iscsi.c
index 9bb5ff6..556486c 100644
--- i/block/iscsi.c
+++ w/block/iscsi.c
@@ -1732,12 +1732,12 @@ static void iscsi_refresh_limits(BlockDriverState *bs, Error **errp)
     }
     if (iscsilun->lbp.lbpws) {
         bs->bl.pwrite_zeroes_alignment =
-            iscsilun->bl.opt_unmap_gran * iscsilun->block_size;
+            pow2floor(iscsilun->bl.opt_unmap_gran * iscsilun->block_size);
     } else {
         bs->bl.pwrite_zeroes_alignment = iscsilun->block_size;
     }
     bs->bl.opt_transfer_length =
-        sector_limits_lun2qemu(iscsilun->bl.opt_xfer_len, iscsilun);
+        pow2floor(sector_limits_lun2qemu(iscsilun->bl.opt_xfer_len, iscsilun));
 }

 /* Note that this will not re-establish a connection with an iSCSI target - it


One of those two hunks is covered by my pending 'block: Switch discard
length bounds to byte-based' making its way through the block queue
review, so if it works for you, I just need to formalize the patch for
the pwrite_zeroes_alignment.
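
For the 15 MB value reported by this device, rounding down lands on 8 MiB. A small
standalone sketch of the effect, using a local stand-in for pow2floor() (largest power
of two <= value):

#include <inttypes.h>
#include <stdio.h>

/* Local stand-in for pow2floor(): the largest power of two <= value. */
static uint64_t pow2floor_sketch(uint64_t value)
{
    uint64_t result = 1;

    while (result <= value / 2) {
        result *= 2;
    }
    return result;
}

int main(void)
{
    uint64_t opt = 15 * 1024 * 1024;                    /* 15 MiB from the VPD data */
    printf("%" PRIu64 "\n", pow2floor_sketch(opt));     /* 8388608, i.e. 8 MiB */
    return 0;
}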


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-05  1:53 ` Eric Blake
  2016-07-05  7:30   ` Peter Lieven
@ 2016-07-05 13:37   ` Paolo Bonzini
  2016-07-05 13:40     ` Peter Lieven
  2016-07-05 14:59     ` Eric Blake
  1 sibling, 2 replies; 19+ messages in thread
From: Paolo Bonzini @ 2016-07-05 13:37 UTC (permalink / raw)
  To: Eric Blake, Peter Lieven, qemu-devel; +Cc: Kevin Wolf



On 05/07/2016 03:53, Eric Blake wrote:
>> > I think we cannot assert that these alignments are a power of 2.
> Perhaps that means we should just fix our code to round things down to
> the nearest power of 2 (8MB) for the opt_transfer_len and
> opt_discard_alignment values.  Can you post a stack-trace of the actual
> assertion you are hitting?

It doesn't work for opt_discard_alignment.  Neither 8MB nor 16MB is
aligned to 15MB for example.

Paolo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-05 13:03     ` Eric Blake
@ 2016-07-05 13:39       ` Paolo Bonzini
  0 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2016-07-05 13:39 UTC (permalink / raw)
  To: Eric Blake, Peter Lieven, qemu-devel; +Cc: Kevin Wolf



On 05/07/2016 15:03, Eric Blake wrote:
>          bs->bl.pwrite_zeroes_alignment =
> -            iscsilun->bl.opt_unmap_gran * iscsilun->block_size;
> +            pow2floor(iscsilun->bl.opt_unmap_gran * iscsilun->block_size);
>      } else {
>          bs->bl.pwrite_zeroes_alignment = iscsilun->block_size;
>      }
>      bs->bl.opt_transfer_length =
> -        sector_limits_lun2qemu(iscsilun->bl.opt_xfer_len, iscsilun);
> +        pow2floor(sector_limits_lun2qemu(iscsilun->bl.opt_xfer_len, iscsilun));

I see no reason why the alignment needs to be a power of two in
block/io.c, if you use / % * instead of &.
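
A minimal sketch of the two styles (hypothetical helpers, not the actual block/io.c
code), where only the division-based form works for an arbitrary alignment:

#include <stdint.h>

/* Hypothetical helpers for non-negative offsets; alignment may be any
 * non-zero value, not just a power of two. */
static inline int64_t align_down(int64_t offset, int64_t alignment)
{
    return offset / alignment * alignment;
}

static inline int64_t align_up(int64_t offset, int64_t alignment)
{
    return align_down(offset + alignment - 1, alignment);
}

/* The bit-twiddling forms, offset & ~(alignment - 1) and
 * (offset + alignment - 1) & ~(alignment - 1), are only equivalent when
 * alignment is a power of two. */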

Paolo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-05 13:37   ` Paolo Bonzini
@ 2016-07-05 13:40     ` Peter Lieven
  2016-07-05 14:59     ` Eric Blake
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Lieven @ 2016-07-05 13:40 UTC (permalink / raw)
  To: Paolo Bonzini, Eric Blake, qemu-devel; +Cc: Kevin Wolf

On 05.07.2016 at 15:37, Paolo Bonzini wrote:
>
> On 05/07/2016 03:53, Eric Blake wrote:
>>>> I think we cannot assert that these alignments are a power of 2.
>> Perhaps that means we should just fix our code to round things down to
>> the nearest power of 2 (8MB) for the opt_transfer_len and
>> opt_discard_alignment values.  Can you post a stack-trace of the actual
>> assertion you are hitting?
> It doesn't work for opt_discard_alignment.  Neither 8MB nor 16MB is
> aligned to 15MB for example.

Right, we should keep these limits. And if we need to round down for any reason
we should use the gcd of both values.
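
A sketch of that idea (gcd64() is a hypothetical helper, and exactly which two limits
to combine is left open here); any value aligned to the gcd stays aligned to both inputs:

#include <inttypes.h>
#include <stdio.h>

/* Hypothetical helper: greatest common divisor. */
static uint64_t gcd64(uint64_t a, uint64_t b)
{
    while (b) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

int main(void)
{
    uint64_t dev_gran = 15 * 1024 * 1024;   /* device-reported 15 MiB granularity */
    uint64_t pow2     = 8 * 1024 * 1024;    /* the proposed 8 MiB round-down */
    /* prints 1048576 (1 MiB) */
    printf("%" PRIu64 "\n", gcd64(dev_gran, pow2));
    return 0;
}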

Peter

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-05 13:37   ` Paolo Bonzini
  2016-07-05 13:40     ` Peter Lieven
@ 2016-07-05 14:59     ` Eric Blake
  2016-07-05 15:09       ` Paolo Bonzini
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Blake @ 2016-07-05 14:59 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Lieven, qemu-devel; +Cc: Kevin Wolf

[-- Attachment #1: Type: text/plain, Size: 1806 bytes --]

On 07/05/2016 07:37 AM, Paolo Bonzini wrote:
> 
> 
> On 05/07/2016 03:53, Eric Blake wrote:
>>>> I think we cannot assert that these alignments are a power of 2.
>> Perhaps that means we should just fix our code to round things down to
>> the nearest power of 2 (8MB) for the opt_transfer_len and
>> opt_discard_alignment values.  Can you post a stack-trace of the actual
>> assertion you are hitting?
> 
> It doesn't work for opt_discard_alignment.  Neither 8MB nor 16MB is
> aligned to 15MB for example.

The largest power-of-2 alignment that will align with every 15M page is
1M.  Is there a measurable performance difference between doing lots of
1M accesses (14 out of 15 are unaligned, but none of them cross page
boundaries), vs. doing 8M accesses (14 out of 15 are unaligned, and 7
out of 15 cross page boundaries, but there are fewer accesses overall)?
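
As a quick sketch of where the 1M figure comes from: the largest power of two that
divides 15 MiB is its lowest set bit:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t page = 15 * 1024 * 1024;   /* 15 MiB == 0xF00000 */
    uint64_t div  = page & -page;       /* lowest set bit == largest power-of-2 divisor */
    printf("%" PRIu64 "\n", div);       /* 1048576, i.e. 1 MiB */
    return 0;
}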

The optimal alignments are advisory - they should be a hint that says
that accesses smaller than the alignment may require RMW and are
therefore slower.  I agree that at a certain point, we will definitely
see slowdowns (if we do all 64k accesses, we could probably notice it),
but I'm having a hard time seeing how hardware that advertises a
non-power-of-2 can behave less efficiently for 1M than it would for 8M,
particularly if the smallest addressable block size is indeed smaller
than 1M on that device.

And yes, we could probably switch to (potentially slower) / % * instead
of bit operations in block/io.c to accommodate a non-power-of-2 optimal
size, but it would require a careful audit to make sure we don't have
even more bit-wise operations lurking that were assuming a power of 2.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-05 14:59     ` Eric Blake
@ 2016-07-05 15:09       ` Paolo Bonzini
  2016-07-15 10:09         ` Peter Lieven
  0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2016-07-05 15:09 UTC (permalink / raw)
  To: Eric Blake, Peter Lieven, qemu-devel; +Cc: Kevin Wolf



On 05/07/2016 16:59, Eric Blake wrote:
> And yes, we could probably switch to (potentially slower) / % * instead
> of bit operations in block/io.c to accommodate a non-power-of-2 optimal
> size, but it would require a careful audit to make sure we don't have
> even more bit-wise operations lurking that were assuming a power of 2.

Yes, it would require some auditing but it has always worked.  Regarding
speed, I'm sure that / % * are not slower than a system call! :)

Paolo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-05 15:09       ` Paolo Bonzini
@ 2016-07-15 10:09         ` Peter Lieven
  2016-07-15 15:40           ` Eric Blake
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Lieven @ 2016-07-15 10:09 UTC (permalink / raw)
  To: Paolo Bonzini, Eric Blake, qemu-devel; +Cc: Kevin Wolf

On 05.07.2016 at 17:09, Paolo Bonzini wrote:
>
> On 05/07/2016 16:59, Eric Blake wrote:
>> And yes, we could probably switch to (potentially slower) / % * instead
>> of bit operations in block/io.c to accommodate a non-power-of-2 optimal
>> size, but it would require a careful audit to make sure we don't have
>> even more bit-wise operations lurking that were assuming a power of 2.
> Yes, it would require some auditing but it has always worked.  Regarding
> speed, I'm sure that / % * are not slower than a system call! :)

Eric, what's the status of this? Will you send a follow-up to your series
restoring the old behaviour?

Thanks,
Peter

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-15 10:09         ` Peter Lieven
@ 2016-07-15 15:40           ` Eric Blake
  2016-07-18  7:06             ` Peter Lieven
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Blake @ 2016-07-15 15:40 UTC (permalink / raw)
  To: Peter Lieven, Paolo Bonzini, qemu-devel; +Cc: Kevin Wolf

[-- Attachment #1: Type: text/plain, Size: 1201 bytes --]

On 07/15/2016 04:09 AM, Peter Lieven wrote:
> On 05.07.2016 at 17:09, Paolo Bonzini wrote:
>>
>> On 05/07/2016 16:59, Eric Blake wrote:
>>> And yes, we could probably switch to (potentially slower) / % * instead
>>> of bit operations in block/io.c to accommodate a non-power-of-2 optimal
>>> size, but it would require a careful audit to make sure we don't have
>>> even more bit-wise operations lurking that were assuming a power of 2.
>> Yes, it would require some auditing but it has always worked.  Regarding
>> speed, I'm sure that / % * are not slower than a system call! :)
> 
> Eric, what's the status of this? Will you send a follow-up to your series
> restoring the old behaviour?

Still on my plate.  As it is a bug-fix, the patch can go in even after
hard freeze, so at the moment I've unfortunately been more focused on
other patches (such as NBD_CMD_WRITE_ZEROES support) that were posted
before soft freeze, but which must be committed prior to hard freeze or
else they miss 2.7.  But rest assured that I'll make sure 2.7 doesn't
regress on your setup.


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-15 15:40           ` Eric Blake
@ 2016-07-18  7:06             ` Peter Lieven
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Lieven @ 2016-07-18  7:06 UTC (permalink / raw)
  To: Eric Blake, Paolo Bonzini, qemu-devel; +Cc: Kevin Wolf

On 15.07.2016 at 17:40, Eric Blake wrote:
> On 07/15/2016 04:09 AM, Peter Lieven wrote:
>> On 05.07.2016 at 17:09, Paolo Bonzini wrote:
>>> On 05/07/2016 16:59, Eric Blake wrote:
>>>> And yes, we could probably switch to (potentially slower) / % * instead
>>>> of bit operations in block/io.c to accommodate a non-power-of-2 optimal
>>>> size, but it would require a careful audit to make sure we don't have
>>>> even more bit-wise operations lurking that were assuming a power of 2.
>>> Yes, it would require some auditing but it has always worked.  Regarding
>>> speed, I'm sure that / % * are not slower than a system call! :)
>> Eric, what's the status of this? Will you send a follow-up to your series
>> restoring the old behaviour?
> Still on my plate.  As it is a bug-fix, the patch can go in even after
> hard freeze, so at the moment I've unfortunately been more focused on
> other patches (such as NBD_CMD_WRITE_ZEROES support) that were posted
> before soft freeze, but which must be committed prior to hard freeze or
> else they miss 2.7.  But rest assured that I'll make sure 2.7 doesn't
> regress on your setup.

Okay, thanks for the update.

Peter

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-04 13:49 [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes() Peter Lieven
  2016-07-05  1:53 ` Eric Blake
@ 2016-07-20 23:35 ` Eric Blake
  2016-07-21  7:01   ` Peter Lieven
                     ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Eric Blake @ 2016-07-20 23:35 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel; +Cc: Kevin Wolf

[-- Attachment #1: Type: text/plain, Size: 2659 bytes --]

On 07/04/2016 07:49 AM, Peter Lieven wrote:
> Hi,
> 
> the above commit:
> 
> commit d05aa8bb4a8b6aa9a915ec5074fb12ae632d2323
> Author: Eric Blake <eblake@redhat.com>
> Date:   Wed Jun 1 15:10:03 2016 -0600
> 
>     block: Add .bdrv_co_pwrite_zeroes()
> 
> introduces a regression (at least for me).
> 
> The Limits from the iSCSI Block Limits VPD have no requirement of being
> a power of two.
> We use Dell Equallogic iSCSI SANs for instance. They have an internal
> page size of 15MB. And
> they advertise this page size as max_ws_len, opt_transfer_len and
> opt_discard_alignment.

Since I don't have access to this device, let me double check: if you
put a breakpoint in iscsi.c:iscsi_refresh_limits(), can you dump the
contents of the struct iscsilun->bl?  What is the block size of this
device (512, 4096, something else)?

Also, while the device is advertising that the optimal discard alignment
is 15M, that does not tell me the minimum granularity that it can
actually discard.  Can you determine that value?  That is, if I try to
discard only 1M, does that actually result in a 1M allocation hole, or
is it ignored?  It sounds like qemu should be tracking 2 separate
values: the minimum discard granularity (I suspect this number is a
power of 2, at least the block size, and perhaps precisely equal to the
block size), and the maximum discard granularity that results in the
fewest/fastest discard of the entire device (not necessarily a power of
2).  Or, maybe that merely means that qemu's pdiscard_alignment should
be the MINIMUM granularity, and NOT the non-power-of-2
iscsilun->bl.opt_unmap_gran.

Or put another way, I get that I can't discard more than 15M at a time.
 But I highly suspect that I do not have to align my discard requests to
15M boundaries.  That is, if the discard granularity is 1M, then in
qemu-io, 'discard 1M 15M' should result in a 15M hole, and should be no
different from the result of 'discard 1M 14M; discard 15M 1M'.  But if
qemu sticks to pdiscard_alignment == iscsilun->bl.opt_unmap_gran of 15M,
then both operations mistakenly discard nothing (because it is not
aligned to a 15M boundary).

> 
> I think we cannot assert that these alignments are a power of 2.

Optimal size not being a power of 2 is not a problem, but I still
suspect MINIMUM alignment is a power of 2, and I need to know how much
head and tail to discard in the new byte-based discard routines in order
to align requests up to the minimal discard alignment boundaries.


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-20 23:35 ` Eric Blake
@ 2016-07-21  7:01   ` Peter Lieven
  2016-07-21  9:10     ` Paolo Bonzini
  2016-07-21  9:08   ` Paolo Bonzini
  2016-07-21 13:38   ` wangweiwei
  2 siblings, 1 reply; 19+ messages in thread
From: Peter Lieven @ 2016-07-21  7:01 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: Kevin Wolf

Hi Eric,

Am 21.07.2016 um 01:35 schrieb Eric Blake:
> On 07/04/2016 07:49 AM, Peter Lieven wrote:
>> Hi,
>>
>> the above commit:
>>
>> commit d05aa8bb4a8b6aa9a915ec5074fb12ae632d2323
>> Author: Eric Blake <eblake@redhat.com>
>> Date:   Wed Jun 1 15:10:03 2016 -0600
>>
>>      block: Add .bdrv_co_pwrite_zeroes()
>>
>> introduces a regression (at least for me).
>>
>> The Limits from the iSCSI Block Limits VPD have no requirement of being
>> a power of two.
>> We use Dell Equallogic iSCSI SANs for instance. They have an internal
>> page size of 15MB. And
>> they advertise this page size as max_ws_len, opt_transfer_len and
>> opt_discard_alignment.
> Since I don't have access to this device, let me double check: if you
> put a breakpoint in iscsi.c:iscsi_refresh_limits(), can you dump the
> contents of the struct iscsilun->bl?  What is the block size of this
> device (512, 4096, something else)?

I can choose between 512 and 4096. 512 is the default.

Here are the advertised limits in the Block Limits VPD:

$ iscsi-inq -e 1 -c $((0xb0)) iscsi://XXX/0
wsnz:0
maximum compare and write length:1
optimal transfer length granularity:0
maximum transfer length:0
optimal transfer length:0
maximum prefetch xdread xdwrite transfer length:0
maximum unmap lba count:30720
maximum unmap block descriptor count:2
optimal unmap granularity:30720
ugavalid:1
unmap granularity alignment:0
maximum write same length:30720
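
(These counts are in logical blocks, so with the default 512-byte block size above,
30720 blocks works out to 30720 * 512 = 15,728,640 bytes, i.e. the 15 MB internal
page size mentioned at the start of the thread.)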



>
> Also, while the device is advertising that the optimal discard alignment
> is 15M, that does not tell me the minimum granularity that it can
> actually discard.  Can you determine that value?  That is, if I try to
> discard only 1M, does that actually result in a 1M allocation hole, or
> is it ignored?  It sounds like qemu should be tracking 2 separate
> values: the minimum discard granularity (I suspect this number is a
> power of 2, at least the block size, and perhaps precisely equal to the
> block size), and the maximum discard granularity that results in the
> fewest/fastest discard of the entire device (not necessarily a power of
> 2).  Or, maybe that merely means that qemu's pdiscard_alignment should
> be the MINIMUM granularity, and NOT the non-power-of-2
> iscsilun->bl.opt_unmap_gran.

As far as I know there is no minimum discard granularity. Only optimum
and maximum.


>
> Or put another way, I get that I can't discard more than 15M at a time.
>   But I highly suspect that I do not have to align my discard requests to
> 15M boundaries.  That is, if the discard granularity is 1M, then in
> qemu-io, 'discard 1M 15M' should result in a 15M hole, and should be no
> different from the result of 'discard 1M 14M; discard 15M 1M'.  But if
> qemu sticks to pdiscard_alignment == iscsilun->bl.opt_unmap_gran of 15M,
> then both operations mistakenly discard nothing (because it is not
> aligned to a 15M boundary).

I do not know what the storage does internally. But I agree the block
provisioning info will not change. However, if you issue a discard 1M 15M
and later a discard 0 1M, it still might report the first block as unallocated later.


Peter

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-20 23:35 ` Eric Blake
  2016-07-21  7:01   ` Peter Lieven
@ 2016-07-21  9:08   ` Paolo Bonzini
  2016-07-21 15:12     ` Eric Blake
  2016-07-21 13:38   ` wangweiwei
  2 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2016-07-21  9:08 UTC (permalink / raw)
  To: Eric Blake, Peter Lieven, qemu-devel; +Cc: Kevin Wolf



On 21/07/2016 01:35, Eric Blake wrote:
> Also, while the device is advertising that the optimal discard alignment
> is 15M, that does not tell me the minimum granularity that it can
> actually discard.  Can you determine that value?  That is, if I try to
> discard only 1M, does that actually result in a 1M allocation hole, or
> is it ignored?  It sounds like qemu should be tracking 2 separate
> values: the minimum discard granularity (I suspect this number is a
> power of 2, at least the block size, and perhaps precisely equal to the
> block size)

It's very unlikely to be equal to the block size.  The block size is
probably 512, perhaps 4096, while the optimal discard alignment is
usually at least 64K.

> Or put another way, I get that I can't discard more than 15M at a time.

I don't think so; optimal discard alignment is 15M but maximum discard
size is most likely _not_ 15M.

> Optimal size not being a power of 2 is not a problem, but I still
> suspect MINIMUM alignment is a power of 2, and I need to know how much
> head and tail to discard in the new byte-based discard routines in order
> to align requests up to the minimal discard alignment boundaries.

But why does it matter if it is a power of 2?  Can't you just use
DIV_ROUND_UP?

Paolo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-21  7:01   ` Peter Lieven
@ 2016-07-21  9:10     ` Paolo Bonzini
  0 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2016-07-21  9:10 UTC (permalink / raw)
  To: Peter Lieven, Eric Blake, qemu-devel; +Cc: Kevin Wolf



On 21/07/2016 09:01, Peter Lieven wrote:
> 
> maximum unmap lba count:30720
> maximum unmap block descriptor count:2
> optimal unmap granularity:30720
> ugavalid:1
> unmap granularity alignment:0
> maximum write same length:30720

Uhm, that's weird.  The optimal unmap granularity should really be the
minimum discard size.  But since this is what the storage gives us, I
still prefer to do the division and support non-power-of-two optimal
unmap granularity (aka optimal discard alignment in QEMU parlance).

Paolo

> 
> 
> 
>>
>> Also, while the device is advertising that the optimal discard alignment
>> is 15M, that does not tell me the minimum granularity that it can
>> actually discard.  Can you determine that value?  That is, if I try to
>> discard only 1M, does that actually result in a 1M allocation hole, or
>> is it ignored?  It sounds like qemu should be tracking 2 separate
>> values: the minimum discard granularity (I suspect this number is a
>> power of 2, at least the block size, and perhaps precisely equal to the
>> block size), and the maximum discard granularity that results in the
>> fewest/fastest discard of the entire device (not necessarily a power of
>> 2).  Or, maybe that merely means that qemu's pdiscard_alignment should
>> be the MINIMUM granularity, and NOT the non-power-of-2
>> iscsilun->bl.opt_unmap_gran.
> 
> As far as I know there is no minimum discard granularity. Only optimum
> and maximum.
> 
> 
>>
>> Or put another way, I get that I can't discard more than 15M at a time.
>>   But I highly suspect that I do not have to align my discard requests to
>> 15M boundaries.  That is, if the discard granularity is 1M, then in
>> qemu-io, 'discard 1M 15M' should result in a 15M hole, and should be no
>> different from the result of 'discard 1M 14M; discard 15M 1M'.  But if
>> qemu sticks to pdiscard_alignment == iscsilun->bl.opt_unmap_gran of 15M,
>> then both operations mistakenly discard nothing (because it is not
>> aligned to a 15M boundary).
> 
> I do not know what the storage does internally. But I agree the block
> provisioning info will not change. However, if you issue a discard 1M 15M
> and later a discard 0 1M, it still might report the first block as unallocated later.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-20 23:35 ` Eric Blake
  2016-07-21  7:01   ` Peter Lieven
  2016-07-21  9:08   ` Paolo Bonzini
@ 2016-07-21 13:38   ` wangweiwei
  2016-07-21 13:45     ` wangweiwei
  2 siblings, 1 reply; 19+ messages in thread
From: wangweiwei @ 2016-07-21 13:38 UTC (permalink / raw)
  To: Eric Blake, Peter Lieven, qemu-devel; +Cc: Kevin Wolf

[-- Attachment #1: Type: text/plain, Size: 2625 bytes --]

On 2016-07-20 at 19:35, Eric Blake wrote:
> On 07/04/2016 07:49 AM, Peter Lieven wrote:
>> Hi,
>>
>> the above commit:
>>
>> commit d05aa8bb4a8b6aa9a915ec5074fb12ae632d2323
>> Author: Eric Blake <eblake@redhat.com>
>> Date:   Wed Jun 1 15:10:03 2016 -0600
>>
>>      block: Add .bdrv_co_pwrite_zeroes()
>>
>> introduces a regression (at least for me).
>>
>> The Limits from the iSCSI Block Limits VPD have no requirement of being
>> a power of two.
>> We use Dell Equallogic iSCSI SANs for instance. They have an internal
>> page size of 15MB. And
>> they advertise this page size as max_ws_len, opt_transfer_len and
>> opt_discard_alignment.
>
> Since I don't have access to this device, let me double check: if you
> put a breakpoint in iscsi.c:iscsi_refresh_limits(), can you dump the
> contents of the struct iscsilun->bl?  What is the block size of this
> device (512, 4096, something else)?
>
> Also, while the device is advertising that the optimal discard alignment
> is 15M, that does not tell me the minimum granularity that it can
> actually discard.  Can you determine that value?  That is, if I try to
> discard only 1M, does that actually result in a 1M allocation hole, or
> is it ignored?  It sounds like qemu should be tracking 2 separate
> values: the minimum discard granularity (I suspect this number is a
> power of 2, at least the block size, and perhaps precisely equal to the
> block size), and the maximum discard granularity that results in the
> fewest/fastest discard of the entire device (not necessarily a power of
> 2).  Or, maybe that merely means that qemu's pdiscard_alignment should
> be the MINIMUM granularity, and NOT the non-power-of-2
> iscsilun->bl.opt_unmap_gran.
>
> Or put another way, I get that I can't discard more than 15M at a time.
>   But I highly suspect that I do not have to align my discard requests to
> 15M boundaries.  That is, if the discard granularity is 1M, then in
> qemu-io, 'discard 1M 15M' should result in a 15M hole, and should be no
> different from the result of 'discard 1M 14M; discard 15M 1M'.  But if
> qemu sticks to pdiscard_alignment == iscsilun->bl.opt_unmap_gran of 15M,
> then both operations mistakenly discard nothing (because it is not
> aligned to a 15M boundary).
>
>>
>> I think we cannot assert that these alignments are a power of 2.
>
> Optimal size not being a power of 2 is not a problem, but I still
> suspect MINIMUM alignment is a power of 2, and I need to know how much
> head and tail to discard in the new byte-based discard routines in order
> to align requests up to the minimal discard alignment boundaries.
>
>



[-- Attachment #2: outgoing_v22.tar --]
[-- Type: application/x-tar, Size: 102400 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-21 13:38   ` wangweiwei
@ 2016-07-21 13:45     ` wangweiwei
  0 siblings, 0 replies; 19+ messages in thread
From: wangweiwei @ 2016-07-21 13:45 UTC (permalink / raw)
  To: Eric Blake, Peter Lieven, qemu-devel; +Cc: Kevin Wolf

Sorry, the previous reply was sent by mistake.
On 2016-07-21 at 09:38, wangweiwei wrote:
> On 2016-07-20 at 19:35, Eric Blake wrote:
>> On 07/04/2016 07:49 AM, Peter Lieven wrote:
>>> Hi,
>>>
>>> the above commit:
>>>
>>> commit d05aa8bb4a8b6aa9a915ec5074fb12ae632d2323
>>> Author: Eric Blake <eblake@redhat.com>
>>> Date:   Wed Jun 1 15:10:03 2016 -0600
>>>
>>>      block: Add .bdrv_co_pwrite_zeroes()
>>>
>>> introduces a regression (at least for me).
>>>
>>> The Limits from the iSCSI Block Limits VPD have no requirement of being
>>> a power of two.
>>> We use Dell Equallogic iSCSI SANs for instance. They have an internal
>>> page size of 15MB. And
>>> they advertise this page size as max_ws_len, opt_transfer_len and
>>> opt_discard_alignment.
>>
>> Since I don't have access to this device, let me double check: if you
>> put a breakpoint in iscsi.c:iscsi_refresh_limits(), can you dump the
>> contents of the struct iscsilun->bl?  What is the block size of this
>> device (512, 4096, something else)?
>>
>> Also, while the device is advertising that the optimal discard alignment
>> is 15M, that does not tell me the minimum granularity that it can
>> actually discard.  Can you determine that value?  That is, if I try to
>> discard only 1M, does that actually result in a 1M allocation hole, or
>> is it ignored?  It sounds like qemu should be tracking 2 separate
>> values: the minimum discard granularity (I suspect this number is a
>> power of 2, at least the block size, and perhaps precisely equal to the
>> block size), and the maximum discard granularity that results in the
>> fewest/fastest discard of the entire device (not necessarily a power of
>> 2).  Or, maybe that merely means that qemu's pdiscard_alignment should
>> be the MINIMUM granularity, and NOT the non-power-of-2
>> iscsilun->bl.opt_unmap_gran.
>>
>> Or put another way, I get that I can't discard more than 15M at a time.
>>   But I highly suspect that I do not have to align my discard requests to
>> 15M boundaries.  That is, if the discard granularity is 1M, then in
>> qemu-io, 'discard 1M 15M' should result in a 15M hole, and should be no
>> different from the result of 'discard 1M 14M; discard 15M 1M'.  But if
>> qemu sticks to pdiscard_alignment == iscsilun->bl.opt_unmap_gran of 15M,
>> then both operations mistakenly discard nothing (because it is not
>> aligned to a 15M boundary).
>>
>>>
>>> I think we cannot assert that these alignments are a power of 2.
>>
>> Optimal size not being a power of 2 is not a problem, but I still
>> suspect MINIMUM alignment is a power of 2, and I need to know how much
>> head and tail to discard in the new byte-based discard routines in order
>> to align requests up to the minimal discard alignment boundaries.
>>
>>
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes()
  2016-07-21  9:08   ` Paolo Bonzini
@ 2016-07-21 15:12     ` Eric Blake
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Blake @ 2016-07-21 15:12 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Lieven, qemu-devel; +Cc: Kevin Wolf

[-- Attachment #1: Type: text/plain, Size: 2373 bytes --]

On 07/21/2016 03:08 AM, Paolo Bonzini wrote:
> 
> 
> On 21/07/2016 01:35, Eric Blake wrote:
>> Also, while the device is advertising that the optimal discard alignment
>> is 15M, that does not tell me the minimum granularity that it can
>> actually discard.  Can you determine that value?  That is, if I try to
>> discard only 1M, does that actually result in a 1M allocation hole, or
>> is it ignored?  It sounds like qemu should be tracking 2 separate
>> values: the minimum discard granularity (I suspect this number is a
>> power of 2, at least the block size, and perhaps precisely equal to the
>> block size)
> 
> It's very unlikely to be equal to the block size.  The block size is
> probably 512, perhaps 4096, while the optimal discard alignment is
> usually at least 64K.

It would be nice to determine what the true minimum is, whether it is a
block size, or 64k, or some other number.

> 
>> Or put another way, I get that I can't discard more than 15M at a time.
> 
> I don't think so; optimal discard alignment is 15M but maximum discard
> size is most likely _not_ 15M.

Peter proved that the device itself reports that its optimal size and
maximum size are equal, at 15M each.

> 
>> Optimal size not being a power of 2 is not a problem, but I still
>> suspect MINIMUM alignment is a power of 2, and I need to know how much
>> head and tail to discard in the new byte-based discard routines in order
>> to align requests up to the minimal discard alignment boundaries.
> 
> But why does it matter if it is a power of 2?  Can't you just use
> DIV_ROUND_UP?

Oddly enough, ROUND_UP() requires a power of 2, but is otherwise
identical to QEMU_ALIGN_UP() which does not require a power of 2.
DIV_ROUND_UP() scales down, so I'd have to scale back up, basically
open-coding QEMU_ALIGN_UP() which keeps the scale unchanged.

I'm seriously debating about adding comments to osdep.h on which macros
require a power of 2 or not, and/or a cleanup patch that unifies
ROUND_UP() and QEMU_ALIGN_UP() to be the same functionality (since they
are already the same for power of 2).  A decent compiler should be able
to optimize * and / back into bit twiddling if it knows that the
rounding amount is a power of 2.
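
Roughly, the distinction looks like this (a paraphrased sketch with renamed macros,
not the actual osdep.h definitions):

#include <inttypes.h>
#include <stdio.h>

/* Paraphrased sketch; see osdep.h for the real macros. */
#define SK_DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))            /* any d; result counts units of d */
#define SK_ALIGN_DOWN(n, m)    (((n) / (m)) * (m))                /* any m; result keeps the scale of n */
#define SK_ALIGN_UP(n, m)      SK_ALIGN_DOWN((n) + (m) - 1, (m))  /* any m */
#define SK_ROUND_UP(n, p)      (((n) + (p) - 1) & ~((p) - 1))     /* correct only when p is a power of 2 */

int main(void)
{
    uint64_t gran = 15 * 1024 * 1024;                     /* 15 MiB */
    printf("%" PRIu64 "\n", SK_ALIGN_UP(4096, gran));     /* 15728640: correct */
    printf("%" PRIu64 "\n", SK_ROUND_UP(4096, gran));     /* 1048576: wrong for non-power-of-2 */
    return 0;
}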

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-07-21 15:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-04 13:49 [Qemu-devel] Regression: block: Add .bdrv_co_pwrite_zeroes() Peter Lieven
2016-07-05  1:53 ` Eric Blake
2016-07-05  7:30   ` Peter Lieven
2016-07-05 13:03     ` Eric Blake
2016-07-05 13:39       ` Paolo Bonzini
2016-07-05 13:37   ` Paolo Bonzini
2016-07-05 13:40     ` Peter Lieven
2016-07-05 14:59     ` Eric Blake
2016-07-05 15:09       ` Paolo Bonzini
2016-07-15 10:09         ` Peter Lieven
2016-07-15 15:40           ` Eric Blake
2016-07-18  7:06             ` Peter Lieven
2016-07-20 23:35 ` Eric Blake
2016-07-21  7:01   ` Peter Lieven
2016-07-21  9:10     ` Paolo Bonzini
2016-07-21  9:08   ` Paolo Bonzini
2016-07-21 15:12     ` Eric Blake
2016-07-21 13:38   ` wangweiwei
2016-07-21 13:45     ` wangweiwei
