qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block
@ 2021-03-22  9:25 ChangLimin
  2021-03-22 17:50 ` John Snow
  2021-03-24 14:50 ` Max Reitz
  0 siblings, 2 replies; 8+ messages in thread
From: ChangLimin @ 2021-03-22  9:25 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel, mreitz

[-- Attachment #1: Type: text/plain, Size: 1343 bytes --]

For Linux 5.10/5.11, qemu write zeros to a multipath device using
ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
permanently. Fallback to pwritev instead of exit for -EBUSY error.

The issue was introduced in Linux 5.10:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02

Fixed in Linux 5.12:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d

Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
---
 block/file-posix.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 20e14f8e96..d4054ac9cb 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1624,8 +1624,12 @@ static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
         } while (errno == EINTR);

         ret = translate_err(-errno);
-        if (ret == -ENOTSUP) {
-            s->has_write_zeroes = false;
+        switch (ret) {
+        case -ENOTSUP:
+            s->has_write_zeroes = false; /* fall through */
+        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath devices */
+            return -ENOTSUP;
+            break;
         }
     }
 #endif
--
2.27.0


[-- Attachment #2: Type: text/html, Size: 2364 bytes --]

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block
  2021-03-22  9:25 [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block ChangLimin
@ 2021-03-22 17:50 ` John Snow
  2021-03-24 14:50 ` Max Reitz
  1 sibling, 0 replies; 8+ messages in thread
From: John Snow @ 2021-03-22 17:50 UTC (permalink / raw)
  To: ChangLimin, qemu-block
  Cc: kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel, mreitz

On 3/22/21 5:25 AM, ChangLimin wrote:
> For Linux 5.10/5.11, qemu write zeros to a multipath device using
> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
> permanently. Fallback to pwritev instead of exit for -EBUSY error.
> 
> The issue was introduced in Linux 5.10:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> 
> Fixed in Linux 5.12:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> 
> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>

To be clear, when I asked "When do we get -EINVAL?" it wasn't because I 
doubted that we would ever get it, I was just unclear of the 
circumstances in which we might receive EINVAL and was hoping you would 
explain it to me.

> ---
>   block/file-posix.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 20e14f8e96..d4054ac9cb 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1624,8 +1624,12 @@ static ssize_t 
> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>           } while (errno == EINTR);
> 
>           ret = translate_err(-errno);
> -        if (ret == -ENOTSUP) {
> -            s->has_write_zeroes = false;
> +        switch (ret) {
> +        case -ENOTSUP:
> +            s->has_write_zeroes = false; /* fall through */
> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath 
> devices */
> +            return -ENOTSUP;
> +            break;

What effect does this have, now?

We'll return ENOTSUP but we won't disable trying it again in the future, 
is that right?

Kevin, is this what you had in mind?

--js

>           }
>       }
>   #endif
> --
> 2.27.0
> 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block
  2021-03-22  9:25 [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block ChangLimin
  2021-03-22 17:50 ` John Snow
@ 2021-03-24 14:50 ` Max Reitz
  2021-03-24 15:49   ` Nir Soffer
  1 sibling, 1 reply; 8+ messages in thread
From: Max Reitz @ 2021-03-24 14:50 UTC (permalink / raw)
  To: ChangLimin, qemu-block; +Cc: kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel

On 22.03.21 10:25, ChangLimin wrote:
> For Linux 5.10/5.11, qemu write zeros to a multipath device using
> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
> permanently.

So as far as I can track back the discussion, Kevin asked on v1 why we’d 
set has_write_zeroes to false, i.e. whether the EBUSY might not go away 
at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
You haven’t explicitly replied to that question (as far as I can see), 
so it kind of still stands.

Implicitly, there are two conflicting answers in this patch: On one 
hand, the commit message says “permanently”, and this is what you told 
Nir as a realistic case where this can occur.  So that to me implies 
that we actually should not retry BLKZEROOUT, because the EBUSY will 
remain, and that condition won’t change while the block device is in use 
by qemu.

On the other hand, in the code, you have decided not to reset 
has_write_zeroes to false, so the implementation will retry.

So I don’t quite understand.  Should we keep trying BLKZEROOUT or is 
there no chance of it working after it has at one point failed with 
EBUSY?  (Are there other cases besides what’s described in this commit 
message where EBUSY might be returned and it is only temporary?)

> Fallback to pwritev instead of exit for -EBUSY error.
> 
> The issue was introduced in Linux 5.10:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> 
> Fixed in Linux 5.12:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> 
> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
> ---
>   block/file-posix.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 20e14f8e96..d4054ac9cb 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1624,8 +1624,12 @@ static ssize_t 
> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>           } while (errno == EINTR);
> 
>           ret = translate_err(-errno);
> -        if (ret == -ENOTSUP) {
> -            s->has_write_zeroes = false;
> +        switch (ret) {
> +        case -ENOTSUP:
> +            s->has_write_zeroes = false; /* fall through */
> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath 
> devices */
> +            return -ENOTSUP;
> +            break;

(Not sure why this break is here.)

Max

>           }
>       }
>   #endif
> --
> 2.27.0
> 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block
  2021-03-24 14:50 ` Max Reitz
@ 2021-03-24 15:49   ` Nir Soffer
  2021-03-25  6:06     ` ChangLimin
  0 siblings, 1 reply; 8+ messages in thread
From: Nir Soffer @ 2021-03-24 15:49 UTC (permalink / raw)
  To: Max Reitz
  Cc: ChangLimin, kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 3663 bytes --]

On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote:

> On 22.03.21 10:25, ChangLimin wrote:
> > For Linux 5.10/5.11, qemu write zeros to a multipath device using
> > ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
> > permanently.
>
> So as far as I can track back the discussion, Kevin asked on v1 why we’d
> set has_write_zeroes to false, i.e. whether the EBUSY might not go away
> at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
> You haven’t explicitly replied to that question (as far as I can see),
> so it kind of still stands.
>
> Implicitly, there are two conflicting answers in this patch: On one
> hand, the commit message says “permanently”, and this is what you told
> Nir as a realistic case where this can occur.


I'm afraid ChangLimin did not answer my question. I'm looking for real
world used case when qemu cannot write zeros to multipath device, when
nobody else is using the device.

I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
once with a multipath device, and once with logical volume on a vg created
on the multipath device, and I could not reproduce this issue.

If I understand the kernel change correctly, this can happen when there is
a mounted file system on top of the multipath device. I don't think we have
a use case when qemu accesses a multipath device when the device is used
by a file system, but maybe I missed something.


> So that to me implies
> that we actually should not retry BLKZEROOUT, because the EBUSY will
> remain, and that condition won’t change while the block device is in use
> by qemu.
>
> On the other hand, in the code, you have decided not to reset
> has_write_zeroes to false, so the implementation will retry.
>

EBUSY is usually a temporary error, so retrying makes sense. The question
is if we really can write zeroes manually in this case?


> So I don’t quite understand.  Should we keep trying BLKZEROOUT or is
> there no chance of it working after it has at one point failed with
> EBUSY?  (Are there other cases besides what’s described in this commit
> message where EBUSY might be returned and it is only temporary?)
>
> > Fallback to pwritev instead of exit for -EBUSY error.
> >
> > The issue was introduced in Linux 5.10:
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> >
> > Fixed in Linux 5.12:
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> >
> > Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
> > ---
> >   block/file-posix.c | 8 ++++++--
> >   1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 20e14f8e96..d4054ac9cb 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1624,8 +1624,12 @@ static ssize_t
> > handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
> >           } while (errno == EINTR);
> >
> >           ret = translate_err(-errno);
> > -        if (ret == -ENOTSUP) {
> > -            s->has_write_zeroes = false;
> > +        switch (ret) {
> > +        case -ENOTSUP:
> > +            s->has_write_zeroes = false; /* fall through */
> > +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath
> > devices */
> > +            return -ENOTSUP;
> > +            break;
>
> (Not sure why this break is here.)
>
> Max
>
> >           }
> >       }
> >   #endif
> > --
> > 2.27.0
> >
>
>
>

[-- Attachment #2: Type: text/html, Size: 5122 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block
  2021-03-24 15:49   ` Nir Soffer
@ 2021-03-25  6:06     ` ChangLimin
  2021-03-25 15:48       ` Nir Soffer
  0 siblings, 1 reply; 8+ messages in thread
From: ChangLimin @ 2021-03-25  6:06 UTC (permalink / raw)
  To: Nir Soffer, mreitz
  Cc: kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 6000 bytes --]

>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote:
>On 22.03.21 10:25, ChangLimin wrote:
>> For Linux 5.10/5.11, qemu write zeros to a multipath device using
>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
>> permanently.
>
>So as far as I can track back the discussion, Kevin asked on v1 why we’d 
>set has_write_zeroes to false, i.e. whether the EBUSY might not go away 
>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
>You haven’t explicitly replied to that question (as far as I can see), 
>so it kind of still stands.
>
>Implicitly, there are two conflicting answers in this patch: On one 
>hand, the commit message says “permanently”, and this is what you told 
>Nir as a realistic case where this can occur. 

For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is below. 
For other Linux version, the EBUSY may be temporary. 
Because  Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes to false.

>I'm afraid ChangLimin did not answer my question. I'm looking for real
>world used case when qemu cannot write zeros to multipath device, when
>nobody else is using the device.
>
>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
>once with a multipath device, and once with logical volume on a vg created
>on the multipath device, and I could not reproduce this issue.

The following is steps to reproduct the issue on Fedora 34.

# uname -a
Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

# qemu-img -V
qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1)

1.  Login in an ISCSI LUN created using targetcli on ubuntu 20.04
# iscsiadm -m discovery -t st -p 192.169.1.109
192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100

# iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100
# iscsiadm -m session
tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 (non-flash)

2. start multipathd service
# mpathconf --enable
# systemctl start multipathd

3.  add multipath path
# multipath -a `/lib/udev/scsi_id -g /dev/sdb`   # sdb means the ISCSI LUN
wwid '36001405b76856e4816b48b99c6a77de3' added

# multipathd add path /dev/sdb
ok

# multipath -ll     # /dev/dm-1 is the multipath device based on /dev/sdb
mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102
size=1.0G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  `- 5:0:0:0  sdd  8:48   active ready running

4. qemu-img return EBUSY both to dm-1 and sdb
# wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
# qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1
qemu-img: error while writing at byte 0: Device or resource busy

# qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb
qemu-img: error while writing at byte 0: Device or resource busy

5. blkdiscard also return EBUSY  both to dm-1 and sdb
# blkdiscard -o 0 -l 4096 /dev/dm-1
blkdiscard: cannot open /dev/dm-1: Device or resource busy

# blkdiscard -o 0 -l 4096 /dev/sdb
blkdiscard: cannot open /dev/sdb: No such file or directory

6. dd write zero is good, because it does not use blkdiscard
# dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct 
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s

7. The LUN should support blkdiscard feature, otherwise it will not write zero 
with  ioctl(fd, BLKZEROOUT, range) 

>If I understand the kernel change correctly, this can happen when there is
>a mounted file system on top of the multipath device. I don't think we have 
>a use case when qemu accesses a multipath device when the device is used
>by a file system, but maybe I missed something.
> 
>So that to me implies 
>that we actually should not retry BLKZEROOUT, because the EBUSY will 
>remain, and that condition won’t change while the block device is in use 
>by qemu.
>
>On the other hand, in the code, you have decided not to reset 
>has_write_zeroes to false, so the implementation will retry.
>
>EBUSY is usually a temporary error, so retrying makes sense. The question
>is if we really can write zeroes manually in this case?
> 
>So I don’t quite understand.  Should we keep trying BLKZEROOUT or is 
>there no chance of it working after it has at one point failed with 
>EBUSY?  (Are there other cases besides what’s described in this commit 
>message where EBUSY might be returned and it is only temporary?)
>
>> Fallback to pwritev instead of exit for -EBUSY error.
>> 
>> The issue was introduced in Linux 5.10:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
>> 
>> Fixed in Linux 5.12:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
>> 
>> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
>> ---
>>   block/file-posix.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>> 
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index 20e14f8e96..d4054ac9cb 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -1624,8 +1624,12 @@ static ssize_t 
>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>>           } while (errno == EINTR);
>> 
>>           ret = translate_err(-errno);
>> -        if (ret == -ENOTSUP) {
>> -            s->has_write_zeroes = false;
>> +        switch (ret) {
>> +        case -ENOTSUP:
>> +            s->has_write_zeroes = false; /* fall through */
>> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath 
>> devices */
>> +            return -ENOTSUP;
>> +            break;
>
>(Not sure why this break is here.)
>
>Max
>
>>           }
>>       }
>>   #endif
>> --
>> 2.27.0
>> 


[-- Attachment #2: Type: text/html, Size: 10393 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block
  2021-03-25  6:06     ` ChangLimin
@ 2021-03-25 15:48       ` Nir Soffer
  2021-03-26  0:20         ` ChangLimin
  0 siblings, 1 reply; 8+ messages in thread
From: Nir Soffer @ 2021-03-25 15:48 UTC (permalink / raw)
  To: ChangLimin, Benjamin Marzinski
  Cc: kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block, mreitz

[-- Attachment #1: Type: text/plain, Size: 11838 bytes --]

On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <changlm@chinatelecom.cn> wrote:

> >On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote:
> >On 22.03.21 10:25, ChangLimin wrote:
> >> For Linux 5.10/5.11, qemu write zeros to a multipath device using
> >> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
> >> permanently.
> >
> >So as far as I can track back the discussion, Kevin asked on v1 why we’d
> >set has_write_zeroes to false, i.e. whether the EBUSY might not go away
> >at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
> >You haven’t explicitly replied to that question (as far as I can see),
> >so it kind of still stands.
> >
> >Implicitly, there are two conflicting answers in this patch: On one
> >hand, the commit message says “permanently”, and this is what you told
> >Nir as a realistic case where this can occur.
>
> For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is
> below.
> For other Linux version, the EBUSY may be temporary.
> Because  Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes
> to false.
>
> >I'm afraid ChangLimin did not answer my question. I'm looking for real
> >world used case when qemu cannot write zeros to multipath device, when
> >nobody else is using the device.
> >
> >I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
> >once with a multipath device, and once with logical volume on a vg created
> >on the multipath device, and I could not reproduce this issue.
>
> The following is steps to reproduct the issue on Fedora 34.
>
> # uname -a
> Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021
> x86_64 x86_64 x86_64 GNU/Linux
>

Is this the most recent kernel? I have 5.11.7 in fedora 32.


>
> # qemu-img -V
> qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1)
>
> 1.  Login in an ISCSI LUN created using targetcli on ubuntu 20.04
> # iscsiadm -m discovery -t st -p 192.169.1.109
> 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
>
> # iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100
> # iscsiadm -m session
> tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
> (non-flash)
>
> 2. start multipathd service
> # mpathconf --enable
> # systemctl start multipathd
>
> 3.  add multipath path
> # multipath -a `/lib/udev/scsi_id -g /dev/sdb`   # sdb means the ISCSI LUN
> wwid '36001405b76856e4816b48b99c6a77de3' added
>
> # multipathd add path /dev/sdb
> ok
>
> # multipath -ll     # /dev/dm-1 is the multipath device based on /dev/sdb
> mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102
> size=1.0G features='0' hwhandler='1 alua' wp=rw
> `-+- policy='service-time 0' prio=50 status=active
>   `- 5:0:0:0  sdd  8:48   active ready running
>

You are using user_friendly_names which is (sadly) the default.
But I don't think it should matter.

4. qemu-img return EBUSY both to dm-1 and sdb
> # wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
> # qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1
> qemu-img: error while writing at byte 0: Device or resource busy
>
> # qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb
> qemu-img: error while writing at byte 0: Device or resource busy
>
> 5. blkdiscard also return EBUSY  both to dm-1 and sdb
> # blkdiscard -o 0 -l 4096 /dev/dm-1
> blkdiscard: cannot open /dev/dm-1: Device or resource busy
>
> # blkdiscard -o 0 -l 4096 /dev/sdb
> blkdiscard: cannot open /dev/sdb: No such file or directory
>
> 6. dd write zero is good, because it does not use blkdiscard
> # dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s
>
> 7. The LUN should support blkdiscard feature, otherwise it will not write
> zero
> with  ioctl(fd, BLKZEROOUT, range)
>

Thanks!

I could not reproduce this with kernel 5.10, but now I'm no 5.11:
# uname -r
5.11.7-100.fc32.x86_64

# qemu-img --version
qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1)
Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers

# cat /etc/multipath.conf
defaults {
user_friendly_names no
find_multipaths no
}

blacklist_exceptions {
        property "(SCSI_IDENT_|ID_WWN)"
}

blacklist {
}

# multipath -ll 36001405e884ab8ff4b44fdba6901099c
36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09
size=6.0G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  `- 1:0:0:9 sdk     8:160 active ready running

$ lsblk /dev/sdk
NAME                                MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
sdk                                   8:160  0   6G  0 disk
└─36001405e884ab8ff4b44fdba6901099c 253:13   0   6G  0 mpath

$ virt-builder fedora-32 -o disk.img
[   2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz
[   3.8] Planning how to build this image
[   3.8] Uncompressing
[  11.1] Opening the new disk
[  16.1] Setting a random seed
[  16.1] Setting passwords
virt-builder: Setting random password of root to TcikQqRxAaIqS1kF
[  17.0] Finishing off
                   Output file: disk.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 5.4G
                    Free space: 4.0G (74%)

$ qemu-img info disk.img
image: disk.img
file format: raw
virtual size: 6 GiB (6442450944 bytes)
disk size: 1.29 GiB

# qemu-img convert -p -f raw -O raw -t none -T none disk.img
/dev/mapper/36001405e884ab8ff4b44fdba6901099c
    (100.00/100%)

Works.

# lsblk /dev/sdk
NAME                                   MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sdk                                      8:160  0    6G  0 disk
└─36001405e884ab8ff4b44fdba6901099c    253:13   0    6G  0 mpath
  ├─36001405e884ab8ff4b44fdba6901099c1 253:14   0    1M  0 part
  ├─36001405e884ab8ff4b44fdba6901099c2 253:15   0    1G  0 part
  ├─36001405e884ab8ff4b44fdba6901099c3 253:16   0  615M  0 part
  └─36001405e884ab8ff4b44fdba6901099c4 253:17   0  4.4G  0 part

# qemu-img convert -p -f raw -O raw -t none -T none disk.img
/dev/mapper/36001405e884ab8ff4b44fdba6901099c
    (100.00/100%)

Works, wiping the partitions.

# mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt

# mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4
/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

This is invalid use, converting to device with mounted file system, but
let's try:

# qemu-img convert -p -f raw -O raw -t none -T none disk.img
/dev/mapper/36001405e884ab8ff4b44fdba6901099c
    (100.00/100%)

Still works?!

# wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c
wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing
initialization failed: Device or resource busy

# blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c

Works.

This the configuration of the LUN on the server side:

   {

      "aio": false,

      "alua_tpgs": [

        {

          "alua_access_state": 0,

          "alua_access_status": 0,

          "alua_access_type": 3,

          "alua_support_active_nonoptimized": 1,

          "alua_support_active_optimized": 1,

          "alua_support_offline": 1,

          "alua_support_standby": 1,

          "alua_support_transitioning": 1,

          "alua_support_unavailable": 1,

          "alua_write_metadata": 0,

          "implicit_trans_secs": 0,

          "name": "default_tg_pt_gp",

          "nonop_delay_msecs": 100,

          "preferred": 0,

          "tg_pt_gp_id": 0,

          "trans_delay_msecs": 0

        }

      ],

      "attributes": {

        "block_size": 512,

        "emulate_3pc": 1,

        "emulate_caw": 1,

        "emulate_dpo": 1,

        "emulate_fua_read": 1,

        "emulate_fua_write": 1,

        "emulate_model_alias": 1,

        "emulate_pr": 1,

        "emulate_rest_reord": 0,

        "emulate_tas": 1,

        "emulate_tpu": 1,

        "emulate_tpws": 1,

        "emulate_ua_intlck_ctrl": 0,

        "emulate_write_cache": 1,

        "enforce_pr_isids": 1,

        "force_pr_aptpl": 0,

        "is_nonrot": 0,

        "max_unmap_block_desc_count": 1,

        "max_unmap_lba_count": 8192,

        "max_write_same_len": 65335,

        "optimal_sectors": 16384,

        "pi_prot_format": 0,

        "pi_prot_type": 0,

        "pi_prot_verify": 0,

        "queue_depth": 128,

        "unmap_granularity": 1,

        "unmap_granularity_alignment": 0,

        "unmap_zeroes_data": 0

      },

      "dev": "/target/3/09",

      "name": "3-09",

      "plugin": "fileio",

      "size": 6442450944,

      "write_back": true,

      "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d"

    },


Maybe this upstream change is not in all downstream 5.11 kernels, or 5.11.7
already includes the fix?

Adding Ben, maybe he had more insight on the multipath side.

>If I understand the kernel change correctly, this can happen when there is
> >a mounted file system on top of the multipath device. I don't think we
> have
> >a use case when qemu accesses a multipath device when the device is used
> >by a file system, but maybe I missed something.
> >
> >So that to me implies
> >that we actually should not retry BLKZEROOUT, because the EBUSY will
> >remain, and that condition won’t change while the block device is in use
> >by qemu.
> >
> >On the other hand, in the code, you have decided not to reset
> >has_write_zeroes to false, so the implementation will retry.
> >
> >EBUSY is usually a temporary error, so retrying makes sense. The question
> >is if we really can write zeroes manually in this case?
> >
> >So I don’t quite understand.  Should we keep trying BLKZEROOUT or is
> >there no chance of it working after it has at one point failed with
> >EBUSY?  (Are there other cases besides what’s described in this commit
> >message where EBUSY might be returned and it is only temporary?)
> >
> >> Fallback to pwritev instead of exit for -EBUSY error.
> >>
> >> The issue was introduced in Linux 5.10:
> >>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> >>
> >> Fixed in Linux 5.12:
> >>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> >>
> >> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
> >> ---
> >>   block/file-posix.c | 8 ++++++--
> >>   1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/block/file-posix.c b/block/file-posix.c
> >> index 20e14f8e96..d4054ac9cb 100644
> >> --- a/block/file-posix.c
> >> +++ b/block/file-posix.c
> >> @@ -1624,8 +1624,12 @@ static ssize_t
> >> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
> >>           } while (errno == EINTR);
> >>
> >>           ret = translate_err(-errno);
> >> -        if (ret == -ENOTSUP) {
> >> -            s->has_write_zeroes = false;
> >> +        switch (ret) {
> >> +        case -ENOTSUP:
> >> +            s->has_write_zeroes = false; /* fall through */
> >> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for
> multipath
> >> devices */
> >> +            return -ENOTSUP;
> >> +            break;
> >
> >(Not sure why this break is here.)
> >
> >Max
> >
> >>           }
> >>       }
> >>   #endif
> >> --
> >> 2.27.0
> >>
>
>

[-- Attachment #2: Type: text/html, Size: 27065 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block
  2021-03-25 15:48       ` Nir Soffer
@ 2021-03-26  0:20         ` ChangLimin
  2021-03-26 19:39           ` Nir Soffer
  0 siblings, 1 reply; 8+ messages in thread
From: ChangLimin @ 2021-03-26  0:20 UTC (permalink / raw)
  To: Nir Soffer, Benjamin Marzinski
  Cc: kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block, mreitz

[-- Attachment #1: Type: text/plain, Size: 18030 bytes --]

>On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <changlm@chinatelecom.cn> wrote:
>>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote:
>>On 22.03.21 10:25, ChangLimin wrote:
>>> For Linux 5.10/5.11, qemu write zeros to a multipath device using
>>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
>>> permanently.
>>
>>So as far as I can track back the discussion, Kevin asked on v1 why we’d 
>>set has_write_zeroes to false, i.e. whether the EBUSY might not go away 
>>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
>>You haven’t explicitly replied to that question (as far as I can see), 
>>so it kind of still stands.
>>
>>Implicitly, there are two conflicting answers in this patch: On one 
>>hand, the commit message says “permanently”, and this is what you told 
>>Nir as a realistic case where this can occur. 
>
>For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is below. 
>For other Linux version, the EBUSY may be temporary. 
>Because  Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes to false.
>
>>I'm afraid ChangLimin did not answer my question. I'm looking for real
>>world used case when qemu cannot write zeros to multipath device, when
>>nobody else is using the device.
>>
>>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
>>once with a multipath device, and once with logical volume on a vg created
>>on the multipath device, and I could not reproduce this issue.
>
>The following is steps to reproduct the issue on Fedora 34.
>
># uname -a
>Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>
>Is this the most recent kernel? I have 5.11.7 in fedora 32.
> 
>
># qemu-img -V
>qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1)
>
>1.  Login in an ISCSI LUN created using targetcli on ubuntu 20.04
># iscsiadm -m discovery -t st -p 192.169.1.109
>192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
>
># iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100
># iscsiadm -m session
>tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 (non-flash)
>
>2. start multipathd service
># mpathconf --enable
># systemctl start multipathd
>
>3.  add multipath path
># multipath -a `/lib/udev/scsi_id -g /dev/sdb`   # sdb means the ISCSI LUN
>wwid '36001405b76856e4816b48b99c6a77de3' added
>
># multipathd add path /dev/sdb
>ok
>
># multipath -ll     # /dev/dm-1 is the multipath device based on /dev/sdb
>mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102
>size=1.0G features='0' hwhandler='1 alua' wp=rw
>`-+- policy='service-time 0' prio=50 status=active
>  `- 5:0:0:0  sdd  8:48   active ready running
>
>You are using user_friendly_names which is (sadly) the default.
>But I don't think it should matter.
>
>4. qemu-img return EBUSY both to dm-1 and sdb
># wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1
>qemu-img: error while writing at byte 0: Device or resource busy
>
># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb
>qemu-img: error while writing at byte 0: Device or resource busy
>
>5. blkdiscard also return EBUSY  both to dm-1 and sdb
># blkdiscard -o 0 -l 4096 /dev/dm-1
>blkdiscard: cannot open /dev/dm-1: Device or resource busy
>
># blkdiscard -o 0 -l 4096 /dev/sdb
>blkdiscard: cannot open /dev/sdb: No such file or directory
>
>6. dd write zero is good, because it does not use blkdiscard
># dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct 
>100+0 records in
>100+0 records out
>104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s
>
>7. The LUN should support blkdiscard feature, otherwise it will not write zero 
>with  ioctl(fd, BLKZEROOUT, range) 
>
>Thanks!
>
>I could not reproduce this with kernel 5.10, but now I'm no 5.11:
># uname -r
>5.11.7-100.fc32.x86_64
>
># qemu-img --version
>qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1)
>Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
>
># cat /etc/multipath.conf
>defaults {
>user_friendly_names no
>find_multipaths no
>}
>
>blacklist_exceptions {
>        property "(SCSI_IDENT_|ID_WWN)"
>}
>
>blacklist {
>}
>
># multipath -ll 36001405e884ab8ff4b44fdba6901099c
>36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09
>size=6.0G features='0' hwhandler='1 alua' wp=rw
>`-+- policy='service-time 0' prio=50 status=active
>  `- 1:0:0:9 sdk     8:160 active ready running
>
>$ lsblk /dev/sdk
>NAME                                MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
>sdk                                   8:160  0   6G  0 disk  
>└─36001405e884ab8ff4b44fdba6901099c 253:13   0   6G  0 mpath 
>
>$ virt-builder fedora-32 -o disk.img
>[   2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz
>[   3.8] Planning how to build this image
>[   3.8] Uncompressing
>[  11.1] Opening the new disk
>[  16.1] Setting a random seed
>[  16.1] Setting passwords
>virt-builder: Setting random password of root to TcikQqRxAaIqS1kF
>[  17.0] Finishing off
>                   Output file: disk.img
>                   Output size: 6.0G
>                 Output format: raw
>            Total usable space: 5.4G
>                    Free space: 4.0G (74%)
>
>$ qemu-img info disk.img
>image: disk.img
>file format: raw
>virtual size: 6 GiB (6442450944 bytes)
>disk size: 1.29 GiB
>
># qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c
>    (100.00/100%)
>
>Works.
>
># lsblk /dev/sdk
>NAME                                   MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
>sdk                                      8:160  0    6G  0 disk  
>└─36001405e884ab8ff4b44fdba6901099c    253:13   0    6G  0 mpath 
>  ├─36001405e884ab8ff4b44fdba6901099c1 253:14   0    1M  0 part  
>  ├─36001405e884ab8ff4b44fdba6901099c2 253:15   0    1G  0 part  
>  ├─36001405e884ab8ff4b44fdba6901099c3 253:16   0  615M  0 part  
>  └─36001405e884ab8ff4b44fdba6901099c4 253:17   0  4.4G  0 part  
>
># qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c
>    (100.00/100%)
>
>Works, wiping the partitions.
>
># mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt
>
># mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4
>/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
>
>This is invalid use, converting to device with mounted file system, but let's try:
>
># qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c
>    (100.00/100%)
>
>Still works?!
>
># wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c
>wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing initialization failed: Device or resource busy
>
># blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c
>
>Works.
>
>This the configuration of the LUN on the server side:
>
>   {                                                                                                                         
>      "aio": false,                                                                                                           
>      "alua_tpgs": [                                                                                                          
>        {                                                                                                                     
>          "alua_access_state": 0,                                                                                             
>          "alua_access_status": 0,                                                                                            
>          "alua_access_type": 3,                                                                                              
>          "alua_support_active_nonoptimized": 1,                                                                              
>          "alua_support_active_optimized": 1,                                                                                 
>          "alua_support_offline": 1,                                                                                          
>          "alua_support_standby": 1,                                                                                          
>          "alua_support_transitioning": 1,                                                                                    
>          "alua_support_unavailable": 1,                                                                                      
>          "alua_write_metadata": 0,                                                                                           
>          "implicit_trans_secs": 0,                                                                                           
>          "name": "default_tg_pt_gp",                                                                                         
>          "nonop_delay_msecs": 100,                                                                                           
>          "preferred": 0,                                                                                                     
>          "tg_pt_gp_id": 0,                                                                                                   
>          "trans_delay_msecs": 0                                                                                              
>        }                                                                                                                     
>      ],                                                                                                                      
>      "attributes": {                                                                                                         
>        "block_size": 512,                                                                                                    
>        "emulate_3pc": 1,                                                                                                     
>        "emulate_caw": 1,                                                                                                     
>        "emulate_dpo": 1,                                                                                                     
>        "emulate_fua_read": 1,                                                                                                
>        "emulate_fua_write": 1,                                                                                               
>        "emulate_model_alias": 1,                                                                                             
>        "emulate_pr": 1,                                                                                                      
>        "emulate_rest_reord": 0,                                                                                              
>        "emulate_tas": 1,                                                                                                     
>        "emulate_tpu": 1,                                                                                                     
>        "emulate_tpws": 1,                                                                                                    
>        "emulate_ua_intlck_ctrl": 0,                                                                                          
>        "emulate_write_cache": 1,                                                                                             
>        "enforce_pr_isids": 1,                                                                                                
>        "force_pr_aptpl": 0,                                                                                                  
>        "is_nonrot": 0,                                                                                                       
>        "max_unmap_block_desc_count": 1,                                                                                      
>        "max_unmap_lba_count": 8192,                                                                                          
>        "max_write_same_len": 65335,                                                                                          
>        "optimal_sectors": 16384,                                                                                             
>        "pi_prot_format": 0,                                                                                                  
>        "pi_prot_type": 0,                                                                                                    
>        "pi_prot_verify": 0,                                                                                                  
>        "queue_depth": 128,                                                                                                   
>        "unmap_granularity": 1,                                                                                               
>        "unmap_granularity_alignment": 0,                                                                                     
>        "unmap_zeroes_data": 0                                                                                                
>      },                                                                                                                      
>      "dev": "/target/3/09",                                                                                                  
>      "name": "3-09",                                                                                                         
>      "plugin": "fileio",                                                                                                     
>      "size": 6442450944,                                                                                                     
>      "write_back": true,                                                                                                     
>      "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d"                                                                           
>    },                                                                                                                        
>
>Maybe this upstream change is not in all downstream 5.11 kernels, or 5.11.7
>already includes the fix?

Linux 5.10.24/5.11.7 already merged the fix on 2021-03-17 by Greg Kroah-Hartman.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.11.y&id=7e0815797656f029fab2edc309406cddf931b9d8
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=d44c9780ed40db88626c9354868eab72159c7a7f

# git describe d44c9780ed40db88626c9354868eab72159c7a7f
v5.10.23-184-gd44c9780ed40

# git describe 7e0815797656f029fab2edc309406cddf931b9d8
v5.11.6-178-g7e0815797656

>Adding Ben, maybe he had more insight on the multipath side.
>
>>If I understand the kernel change correctly, this can happen when there is
>>a mounted file system on top of the multipath device. I don't think we have 
>>a use case when qemu accesses a multipath device when the device is used
>>by a file system, but maybe I missed something.
>> 
>>So that to me implies 
>>that we actually should not retry BLKZEROOUT, because the EBUSY will 
>>remain, and that condition won’t change while the block device is in use 
>>by qemu.
>>
>>On the other hand, in the code, you have decided not to reset 
>>has_write_zeroes to false, so the implementation will retry.
>>
>>EBUSY is usually a temporary error, so retrying makes sense. The question
>>is if we really can write zeroes manually in this case?
>> 
>>So I don’t quite understand.  Should we keep trying BLKZEROOUT or is 
>>there no chance of it working after it has at one point failed with 
>>EBUSY?  (Are there other cases besides what’s described in this commit 
>>message where EBUSY might be returned and it is only temporary?)
>>
>>> Fallback to pwritev instead of exit for -EBUSY error.
>>> 
>>> The issue was introduced in Linux 5.10:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
>>> 
>>> Fixed in Linux 5.12:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
>>> 
>>> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
>>> ---
>>>   block/file-posix.c | 8 ++++++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>> index 20e14f8e96..d4054ac9cb 100644
>>> --- a/block/file-posix.c
>>> +++ b/block/file-posix.c
>>> @@ -1624,8 +1624,12 @@ static ssize_t 
>>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>>>           } while (errno == EINTR);
>>> 
>>>           ret = translate_err(-errno);
>>> -        if (ret == -ENOTSUP) {
>>> -            s->has_write_zeroes = false;
>>> +        switch (ret) {
>>> +        case -ENOTSUP:
>>> +            s->has_write_zeroes = false; /* fall through */
>>> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath 
>>> devices */
>>> +            return -ENOTSUP;
>>> +            break;
>>
>>(Not sure why this break is here.)
>>
>>Max
>>
>>>           }
>>>       }
>>>   #endif
>>> --
>>> 2.27.0
>>> 


[-- Attachment #2: Type: text/html, Size: 39627 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block
  2021-03-26  0:20         ` ChangLimin
@ 2021-03-26 19:39           ` Nir Soffer
  0 siblings, 0 replies; 8+ messages in thread
From: Nir Soffer @ 2021-03-26 19:39 UTC (permalink / raw)
  To: ChangLimin
  Cc: kwolf, Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel,
	mreitz, Benjamin Marzinski

[-- Attachment #1: Type: text/plain, Size: 13323 bytes --]

On Fri, Mar 26, 2021 at 3:21 AM ChangLimin <changlm@chinatelecom.cn> wrote:

> >On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <changlm@chinatelecom.cn>
> wrote:
> >>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote:
> >>On 22.03.21 10:25, ChangLimin wrote:
> >>> For Linux 5.10/5.11, qemu write zeros to a multipath device using
> >>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return
> -EBUSY
> >>> permanently.
> >>
> >>So as far as I can track back the discussion, Kevin asked on v1 why we’d
> >>set has_write_zeroes to false, i.e. whether the EBUSY might not go away
> >>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
> >>You haven’t explicitly replied to that question (as far as I can see),
> >>so it kind of still stands.
> >>
> >>Implicitly, there are two conflicting answers in this patch: On one
> >>hand, the commit message says “permanently”, and this is what you told
> >>Nir as a realistic case where this can occur.
> >
> >For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is
> below.
> >For other Linux version, the EBUSY may be temporary.
> >Because  Linux 5.10/5.11 is not used widely, so do not set
> has_write_zeroes to false.
> >
> >>I'm afraid ChangLimin did not answer my question. I'm looking for real
> >>world used case when qemu cannot write zeros to multipath device, when
> >>nobody else is using the device.
> >>
> >>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
> >>once with a multipath device, and once with logical volume on a vg
> created
> >>on the multipath device, and I could not reproduce this issue.
> >
> >The following is steps to reproduct the issue on Fedora 34.
> >
> ># uname -a
> >Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021
> x86_64 x86_64 x86_64 GNU/Linux
> >
> >Is this the most recent kernel? I have 5.11.7 in fedora 32.
> >
> >
> ># qemu-img -V
> >qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1)
> >
> >1.  Login in an ISCSI LUN created using targetcli on ubuntu 20.04
> ># iscsiadm -m discovery -t st -p 192.169.1.109
> >192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
> >
> ># iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100
> ># iscsiadm -m session
> >tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
> (non-flash)
> >
> >2. start multipathd service
> ># mpathconf --enable
> ># systemctl start multipathd
> >
> >3.  add multipath path
> ># multipath -a `/lib/udev/scsi_id -g /dev/sdb`   # sdb means the ISCSI LUN
> >wwid '36001405b76856e4816b48b99c6a77de3' added
> >
> ># multipathd add path /dev/sdb
> >ok
> >
> ># multipath -ll     # /dev/dm-1 is the multipath device based on /dev/sdb
> >mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102
> >size=1.0G features='0' hwhandler='1 alua' wp=rw
> >`-+- policy='service-time 0' prio=50 status=active
> >  `- 5:0:0:0  sdd  8:48   active ready running
> >
> >You are using user_friendly_names which is (sadly) the default.
> >But I don't think it should matter.
> >
> >4. qemu-img return EBUSY both to dm-1 and sdb
> ># wget
> http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
> ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1
> >qemu-img: error while writing at byte 0: Device or resource busy
> >
> ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb
> >qemu-img: error while writing at byte 0: Device or resource busy
> >
> >5. blkdiscard also return EBUSY  both to dm-1 and sdb
> ># blkdiscard -o 0 -l 4096 /dev/dm-1
> >blkdiscard: cannot open /dev/dm-1: Device or resource busy
> >
> ># blkdiscard -o 0 -l 4096 /dev/sdb
> >blkdiscard: cannot open /dev/sdb: No such file or directory
> >
> >6. dd write zero is good, because it does not use blkdiscard
> ># dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct
> >100+0 records in
> >100+0 records out
> >104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s
> >
> >7. The LUN should support blkdiscard feature, otherwise it will not write
> zero
> >with  ioctl(fd, BLKZEROOUT, range)
> >
> >Thanks!
> >
> >I could not reproduce this with kernel 5.10, but now I'm no 5.11:
> ># uname -r
> >5.11.7-100.fc32.x86_64
> >
> ># qemu-img --version
> >qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1)
> >Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
> >
> ># cat /etc/multipath.conf
> >defaults {
> >user_friendly_names no
> >find_multipaths no
> >}
> >
> >blacklist_exceptions {
> >        property "(SCSI_IDENT_|ID_WWN)"
> >}
> >
> >blacklist {
> >}
> >
> ># multipath -ll 36001405e884ab8ff4b44fdba6901099c
> >36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09
> >size=6.0G features='0' hwhandler='1 alua' wp=rw
> >`-+- policy='service-time 0' prio=50 status=active
> >  `- 1:0:0:9 sdk     8:160 active ready running
> >
> >$ lsblk /dev/sdk
> >NAME                                MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
> >sdk                                   8:160  0   6G  0 disk
> >└─36001405e884ab8ff4b44fdba6901099c 253:13   0   6G  0 mpath
> >
> >$ virt-builder fedora-32 -o disk.img
> >[   2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz
> >[   3.8] Planning how to build this image
> >[   3.8] Uncompressing
> >[  11.1] Opening the new disk
> >[  16.1] Setting a random seed
> >[  16.1] Setting passwords
> >virt-builder: Setting random password of root to TcikQqRxAaIqS1kF
> >[  17.0] Finishing off
> >                   Output file: disk.img
> >                   Output size: 6.0G
> >                 Output format: raw
> >            Total usable space: 5.4G
> >                    Free space: 4.0G (74%)
> >
> >$ qemu-img info disk.img
> >image: disk.img
> >file format: raw
> >virtual size: 6 GiB (6442450944 bytes)
> >disk size: 1.29 GiB
> >
> ># qemu-img convert -p -f raw -O raw -t none -T none disk.img
> /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >    (100.00/100%)
> >
> >Works.
> >
> ># lsblk /dev/sdk
> >NAME                                   MAJ:MIN RM  SIZE RO TYPE
>  MOUNTPOINT
> >sdk                                      8:160  0    6G  0 disk
> >└─36001405e884ab8ff4b44fdba6901099c    253:13   0    6G  0 mpath
> >  ├─36001405e884ab8ff4b44fdba6901099c1 253:14   0    1M  0 part
> >  ├─36001405e884ab8ff4b44fdba6901099c2 253:15   0    1G  0 part
> >  ├─36001405e884ab8ff4b44fdba6901099c3 253:16   0  615M  0 part
> >  └─36001405e884ab8ff4b44fdba6901099c4 253:17   0  4.4G  0 part
> >
> ># qemu-img convert -p -f raw -O raw -t none -T none disk.img
> /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >    (100.00/100%)
> >
> >Works, wiping the partitions.
> >
> ># mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt
> >
> ># mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4
> >/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs
> (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> >
> >This is invalid use, converting to device with mounted file system, but
> let's try:
> >
> ># qemu-img convert -p -f raw -O raw -t none -T none disk.img
> /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >    (100.00/100%)
> >
> >Still works?!
> >
> ># wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing
> initialization failed: Device or resource busy
> >
> ># blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >
> >Works.
> >
> >This the configuration of the LUN on the server side:
> >
> >   {
>
> >      "aio": false,
>
> >      "alua_tpgs": [
>
> >        {
>
> >          "alua_access_state": 0,
>
> >          "alua_access_status": 0,
>
> >          "alua_access_type": 3,
>
> >          "alua_support_active_nonoptimized": 1,
>
> >          "alua_support_active_optimized": 1,
>
> >          "alua_support_offline": 1,
>
> >          "alua_support_standby": 1,
>
> >          "alua_support_transitioning": 1,
>
> >          "alua_support_unavailable": 1,
>
> >          "alua_write_metadata": 0,
>
> >          "implicit_trans_secs": 0,
>
> >          "name": "default_tg_pt_gp",
>
> >          "nonop_delay_msecs": 100,
>
> >          "preferred": 0,
>
> >          "tg_pt_gp_id": 0,
>
> >          "trans_delay_msecs": 0
>
> >        }
>
> >      ],
>
> >      "attributes": {
>
> >        "block_size": 512,
>
> >        "emulate_3pc": 1,
>
> >        "emulate_caw": 1,
>
> >        "emulate_dpo": 1,
>
> >        "emulate_fua_read": 1,
>
> >        "emulate_fua_write": 1,
>
> >        "emulate_model_alias": 1,
>
> >        "emulate_pr": 1,
>
> >        "emulate_rest_reord": 0,
>
> >        "emulate_tas": 1,
>
> >        "emulate_tpu": 1,
>
> >        "emulate_tpws": 1,
>
> >        "emulate_ua_intlck_ctrl": 0,
>
> >        "emulate_write_cache": 1,
>
> >        "enforce_pr_isids": 1,
>
> >        "force_pr_aptpl": 0,
>
> >        "is_nonrot": 0,
>
> >        "max_unmap_block_desc_count": 1,
>
> >        "max_unmap_lba_count": 8192,
>
> >        "max_write_same_len": 65335,
>
> >        "optimal_sectors": 16384,
>
> >        "pi_prot_format": 0,
>
> >        "pi_prot_type": 0,
>
> >        "pi_prot_verify": 0,
>
> >        "queue_depth": 128,
>
> >        "unmap_granularity": 1,
>
> >        "unmap_granularity_alignment": 0,
>
> >        "unmap_zeroes_data": 0
>
> >      },
>
> >      "dev": "/target/3/09",
>
> >      "name": "3-09",
>
> >      "plugin": "fileio",
>
> >      "size": 6442450944,
>
> >      "write_back": true,
>
> >      "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d"
>
> >    },
>
> >
> >Maybe this upstream change is not in all downstream 5.11 kernels, or
> 5.11.7
> >already includes the fix?
>
> Linux 5.10.24/5.11.7 already merged the fix on 2021-03-17 by Greg
> Kroah-Hartman.
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.11.y&id=7e0815797656f029fab2edc309406cddf931b9d8
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=d44c9780ed40db88626c9354868eab72159c7a7f
>
> # git describe d44c9780ed40db88626c9354868eab72159c7a7f
> v5.10.23-184-gd44c9780ed40
>
> # git describe 7e0815797656f029fab2edc309406cddf931b9d8
> v5.11.6-178-g7e0815797656
>

So this is practically fixed.

I don't think keeping a workaround for this in qemu is a good idea.

>Adding Ben, maybe he had more insight on the multipath side.
> >
> >>If I understand the kernel change correctly, this can happen when there
> is
> >>a mounted file system on top of the multipath device. I don't think we
> have
> >>a use case when qemu accesses a multipath device when the device is used
> >>by a file system, but maybe I missed something.
> >>
> >>So that to me implies
> >>that we actually should not retry BLKZEROOUT, because the EBUSY will
> >>remain, and that condition won’t change while the block device is in use
> >>by qemu.
> >>
> >>On the other hand, in the code, you have decided not to reset
> >>has_write_zeroes to false, so the implementation will retry.
> >>
> >>EBUSY is usually a temporary error, so retrying makes sense. The question
> >>is if we really can write zeroes manually in this case?
> >>
> >>So I don’t quite understand.  Should we keep trying BLKZEROOUT or is
> >>there no chance of it working after it has at one point failed with
> >>EBUSY?  (Are there other cases besides what’s described in this commit
> >>message where EBUSY might be returned and it is only temporary?)
> >>
> >>> Fallback to pwritev instead of exit for -EBUSY error.
> >>>
> >>> The issue was introduced in Linux 5.10:
> >>>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> >>>
> >>> Fixed in Linux 5.12:
> >>>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> >>>
> >>> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
> >>> ---
> >>>   block/file-posix.c | 8 ++++++--
> >>>   1 file changed, 6 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/block/file-posix.c b/block/file-posix.c
> >>> index 20e14f8e96..d4054ac9cb 100644
> >>> --- a/block/file-posix.c
> >>> +++ b/block/file-posix.c
> >>> @@ -1624,8 +1624,12 @@ static ssize_t
> >>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
> >>>           } while (errno == EINTR);
> >>>
> >>>           ret = translate_err(-errno);
> >>> -        if (ret == -ENOTSUP) {
> >>> -            s->has_write_zeroes = false;
> >>> +        switch (ret) {
> >>> +        case -ENOTSUP:
> >>> +            s->has_write_zeroes = false; /* fall through */
> >>> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for
> multipath
> >>> devices */
> >>> +            return -ENOTSUP;
> >>> +            break;
> >>
> >>(Not sure why this break is here.)
> >>
> >>Max
> >>
> >>>           }
> >>>       }
> >>>   #endif
> >>> --
> >>> 2.27.0
> >>>
>
>

[-- Attachment #2: Type: text/html, Size: 28158 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-03-26 20:19 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-22  9:25 [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block ChangLimin
2021-03-22 17:50 ` John Snow
2021-03-24 14:50 ` Max Reitz
2021-03-24 15:49   ` Nir Soffer
2021-03-25  6:06     ` ChangLimin
2021-03-25 15:48       ` Nir Soffer
2021-03-26  0:20         ` ChangLimin
2021-03-26 19:39           ` Nir Soffer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).