All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: Communication issues between NBD driver and NBDKit server
       [not found] <04a0a06b-e6dc-48b7-bc29-105dab888a56@www.fastmail.com>
@ 2022-05-15 15:51 ` Josef Bacik
  2022-05-15 16:39   ` Nikolaus Rath
  2022-05-15 18:05 ` [Libguestfs] " Richard W.M. Jones
  1 sibling, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2022-05-15 15:51 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: libguestfs, linux-block

On Sun, May 15, 2022 at 11:45 AM Nikolaus Rath <nikolaus@rath.org> wrote:
>
> Hi,
>
> I am observing some strange errors when using the Kernel's NBD driver with NBDkit.
>
> On the kernel side, I see:
>
> May 15 16:16:11 vostro.rath.org kernel: nbd0: detected capacity change from 0 to 104857600
> May 15 16:16:11 vostro.rath.org kernel: nbd1: detected capacity change from 0 to 104857600
> May 15 16:18:23 vostro.rath.org kernel: block nbd0: Possible stuck request 00000000ae5feee7: control (write@4836316160,32768B). Runtime 30 seconds
> May 15 16:18:25 vostro.rath.org kernel: block nbd0: Possible stuck request 000000007094eddc: control (write@5372947456,10240B). Runtime 30 seconds

The server isn't responding to the request fast enough, and you don't
have a timeout set so it'll just hang until you disconnect.

> May 15 16:18:27 vostro.rath.org kernel: block nbd0: Suspicious reply 89 (status 0 flags 0)

The server double replied to a request.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Communication issues between NBD driver and NBDKit server
  2022-05-15 15:51 ` Communication issues between NBD driver and NBDKit server Josef Bacik
@ 2022-05-15 16:39   ` Nikolaus Rath
  0 siblings, 0 replies; 7+ messages in thread
From: Nikolaus Rath @ 2022-05-15 16:39 UTC (permalink / raw)
  To: Josef Bacik; +Cc: libguestfs, linux-block

On May 15 2022, Josef Bacik <josef@toxicpanda.com> wrote:
> On Sun, May 15, 2022 at 11:45 AM Nikolaus Rath <nikolaus@rath.org> wrote:
>>
>> Hi,
>>
>> I am observing some strange errors when using the Kernel's NBD driver with NBDkit.
>>
>> On the kernel side, I see:
>>
>> May 15 16:16:11 vostro.rath.org kernel: nbd0: detected capacity change from 0 to 104857600
>> May 15 16:16:11 vostro.rath.org kernel: nbd1: detected capacity change from 0 to 104857600
>> May 15 16:18:23 vostro.rath.org kernel: block nbd0: Possible stuck request
>> 00000000ae5feee7: control (write@4836316160,32768B). Runtime 30 seconds
>> May 15 16:18:25 vostro.rath.org kernel: block nbd0: Possible stuck request
>> 000000007094eddc: control (write@5372947456,10240B). Runtime 30 seconds
>
> The server isn't responding to the request fast enough, and you don't
> have a timeout set so it'll just hang until you disconnect.

Are you saying it will continue to hang even after the request has been
responded to? Because NBDkit claims to have sent a response (I'm not
sure if within 30 seconds, but definitely at some point).


Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Libguestfs] Communication issues between NBD driver and NBDKit server
       [not found] <04a0a06b-e6dc-48b7-bc29-105dab888a56@www.fastmail.com>
  2022-05-15 15:51 ` Communication issues between NBD driver and NBDKit server Josef Bacik
@ 2022-05-15 18:05 ` Richard W.M. Jones
  2022-05-15 19:12   ` Nikolaus Rath
  1 sibling, 1 reply; 7+ messages in thread
From: Richard W.M. Jones @ 2022-05-15 18:05 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: libguestfs, linux-block, Josef Bacik

On Sun, May 15, 2022 at 04:45:11PM +0100, Nikolaus Rath wrote:
> Hi,
> 
> I am observing some strange errors when using the Kernel's NBD driver with
> NBDkit.
> 
> On the kernel side, I see:
> 
> May 15 16:16:11 vostro.rath.org kernel: nbd0: detected capacity change from 0
> to 104857600
> May 15 16:16:11 vostro.rath.org kernel: nbd1: detected capacity change from 0
> to 104857600
> May 15 16:18:23 vostro.rath.org kernel: block nbd0: Possible stuck request
> 00000000ae5feee7: control (write@4836316160,32768B). Runtime 30 seconds
> May 15 16:18:25 vostro.rath.org kernel: block nbd0: Possible stuck request
> 000000007094eddc: control (write@5372947456,10240B). Runtime 30 seconds
> May 15 16:18:27 vostro.rath.org kernel: block nbd0: Suspicious reply 89 (status
> 0 flags 0)
> May 15 16:18:31 vostro.rath.org kernel: block nbd0: Possible stuck request
> 0000000075f8b9bc: control (write@8057764864,32768B). Runtime 30 seconds
> May 15 16:18:41 vostro.rath.org kernel: block nbd0: Possible stuck request
> 000000002d1b3e8b: control (write@14499979264,32768B). Runtime 30 seconds
> [...]

Does it really take over 30 seconds for nbdkit to respond?  You might
want to insert some debugging into the S3 plugin to see what stage of
the request cycle is taking so long, although I'm going to guess it's
the remote Amazon server itself.

It seems like you can adjust this timeout using the nbd-client -t flag
(it calls ioctl(NBD_SET_TIMEOUT) in the kernel).  If I understand the
logic correctly, the nbd timeout is currently set to 0, which causes
the default socket timeout to be used.  Using the -t flag overrides this.
So I guess try setting it larger and see if the problem goes away.

Rich.

> And userspace ('zfs snapshot" in this instance) is stuck afterwards.
> 
> On the NBDkit side, there seemingly are write errors when replying back to the
> kernel:
> 
> $ nbdkit --unix /tmp/tmpi5o59_y_/nbd_socket_sb --foreground --filter=exitlast
> --filter=stats --threads 16 S3 size=50G bucket=nikratio-backup key=sb statsfile
> =/tmp/tmpi5o59_y_/stats_sb.txt object-size=32K &
> $ nbd-client -unix /tmp/tmpi5o59_y_/nbd_socket_sb /dev/nbd2
> Warning: the oldstyle protocol is no longer supported.
> This method now uses the newstyle protocol with a default export
> Negotiation: ..size = 51200MB
> Connected /dev/nbd0
> [....]
> nbdkit: python.10: error: write reply: NBD_CMD_WRITE: Broken pipe
> 
> 
> What's the best way to narrow down who's the culprit here (kernel vs NBD
> server)?
> 
> Best,
> -Nikolaus
> 
> --
> GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
> 
>              »Time flies like an arrow, fruit flies like a Banana.«
> 
> 

> _______________________________________________
> Libguestfs mailing list
> Libguestfs@redhat.com
> https://listman.redhat.com/mailman/listinfo/libguestfs


-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Libguestfs] Communication issues between NBD driver and NBDKit server
  2022-05-15 18:05 ` [Libguestfs] " Richard W.M. Jones
@ 2022-05-15 19:12   ` Nikolaus Rath
  2022-05-15 19:25     ` Richard W.M. Jones
  0 siblings, 1 reply; 7+ messages in thread
From: Nikolaus Rath @ 2022-05-15 19:12 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: libguestfs, linux-block, Josef Bacik

On May 15 2022, "Richard W.M. Jones" <rjones@redhat.com> wrote:
> On Sun, May 15, 2022 at 04:45:11PM +0100, Nikolaus Rath wrote:
>> Hi,
>> 
>> I am observing some strange errors when using the Kernel's NBD driver with
>> NBDkit.
>> 
>> On the kernel side, I see:
>> 
>> May 15 16:16:11 vostro.rath.org kernel: nbd0: detected capacity change from 0
>> to 104857600
>> May 15 16:16:11 vostro.rath.org kernel: nbd1: detected capacity change from 0
>> to 104857600
>> May 15 16:18:23 vostro.rath.org kernel: block nbd0: Possible stuck request
>> 00000000ae5feee7: control (write@4836316160,32768B). Runtime 30 seconds
>> May 15 16:18:25 vostro.rath.org kernel: block nbd0: Possible stuck request
>> 000000007094eddc: control (write@5372947456,10240B). Runtime 30 seconds
>> May 15 16:18:27 vostro.rath.org kernel: block nbd0: Suspicious reply 89 (status
>> 0 flags 0)
>> May 15 16:18:31 vostro.rath.org kernel: block nbd0: Possible stuck request
>> 0000000075f8b9bc: control (write@8057764864,32768B). Runtime 30 seconds
>> May 15 16:18:41 vostro.rath.org kernel: block nbd0: Possible stuck request
>> 000000002d1b3e8b: control (write@14499979264,32768B). Runtime 30 seconds
>> [...]
>
> Does it really take over 30 seconds for nbdkit to respond?  You might
> want to insert some debugging into the S3 plugin to see what stage of
> the request cycle is taking so long, although I'm going to guess it's
> the remote Amazon server itself.

It seems unlikely, but it is possible - especially since I'm serializing
requests for debugging.


> It seems like you can adjust this timeout using the nbd-client -t flag
> (it calls ioctl(NBD_SET_TIMEOUT) in the kernel).  If I understand the
> logic correctly, the nbd timeout is currently set to 0, which causes
> the default socket timeout to be used.  Using the -t flag overrides this.
> So I guess try setting it larger and see if the problem goes away.

Well, my concern is more about the "suspicious reply" message which -
according to Josef - means that NBDkit replied twice to the same
request. If that is the case, that might explain why another request
seemingly remained unanswered.

Do you see any way for this to happen?


Best,
Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Libguestfs] Communication issues between NBD driver and NBDKit server
  2022-05-15 19:12   ` Nikolaus Rath
@ 2022-05-15 19:25     ` Richard W.M. Jones
  2022-05-16  1:08       ` yukuai (C)
  0 siblings, 1 reply; 7+ messages in thread
From: Richard W.M. Jones @ 2022-05-15 19:25 UTC (permalink / raw)
  To: libguestfs, linux-block, Josef Bacik

On Sun, May 15, 2022 at 08:12:59PM +0100, Nikolaus Rath wrote:
> Do you see any way for this to happen?

I think it's impossible.  A more likely explanation follows.

If you look at the kernel code, the NBD_CMD_INFLIGHT command flag is
cleared when a command times out:

  https://github.com/torvalds/linux/blob/0cdd776ec92c0fec768c7079331804d3e52d4b27/drivers/block/nbd.c#L407

That's the place where it would have printed the "Possible stuck
request" message.

Some time later, nbdkit actually replies to the message (for the first
and only time) and in that code the flag is checked and found to be
clear already, causing the "Suspicious reply" message to be printed:

  https://github.com/torvalds/linux/blob/0cdd776ec92c0fec768c7079331804d3e52d4b27/drivers/block/nbd.c#L749

I'd say you need to increase the timeout and/or work out why the S3
plugin is taking so long to respond.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Libguestfs] Communication issues between NBD driver and NBDKit server
  2022-05-15 19:25     ` Richard W.M. Jones
@ 2022-05-16  1:08       ` yukuai (C)
  2022-05-17 21:16         ` Nikolaus Rath
  0 siblings, 1 reply; 7+ messages in thread
From: yukuai (C) @ 2022-05-16  1:08 UTC (permalink / raw)
  To: Richard W.M. Jones, libguestfs, linux-block, Josef Bacik

On 2022/05/16 3:25, Richard W.M. Jones wrote:
> On Sun, May 15, 2022 at 08:12:59PM +0100, Nikolaus Rath wrote:
>> Do you see any way for this to happen?
> 
> I think it's impossible.  A more likely explanation follows.
> 
> If you look at the kernel code, the NBD_CMD_INFLIGHT command flag is
> cleared when a command times out:
> 
>    https://github.com/torvalds/linux/blob/0cdd776ec92c0fec768c7079331804d3e52d4b27/drivers/block/nbd.c#L407
> 
> That's the place where it would have printed the "Possible stuck
> request" message.
> 
> Some time later, nbdkit actually replies to the message (for the first
> and only time) and in that code the flag is checked and found to be
> clear already, causing the "Suspicious reply" message to be printed:
Hi,

You are right, can you try the following patch?

https://lists.debian.org/nbd/2022/04/msg00212.html

Thanks,
Kuai
> 
>    https://github.com/torvalds/linux/blob/0cdd776ec92c0fec768c7079331804d3e52d4b27/drivers/block/nbd.c#L749
> 
> I'd say you need to increase the timeout and/or work out why the S3
> plugin is taking so long to respond.
> 
> Rich.
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Libguestfs] Communication issues between NBD driver and NBDKit server
  2022-05-16  1:08       ` yukuai (C)
@ 2022-05-17 21:16         ` Nikolaus Rath
  0 siblings, 0 replies; 7+ messages in thread
From: Nikolaus Rath @ 2022-05-17 21:16 UTC (permalink / raw)
  To: yukuai (C); +Cc: Richard W.M. Jones, libguestfs, linux-block, Josef Bacik

On May 16 2022, "yukuai (C)" <yukuai3@huawei.com> wrote:
> On 2022/05/16 3:25, Richard W.M. Jones wrote:
>> On Sun, May 15, 2022 at 08:12:59PM +0100, Nikolaus Rath wrote:
>>> Do you see any way for this to happen?
>> I think it's impossible.  A more likely explanation follows.
>> If you look at the kernel code, the NBD_CMD_INFLIGHT command flag is
>> cleared when a command times out:
>>    https://github.com/torvalds/linux/blob/0cdd776ec92c0fec768c7079331804d3e52d4b27/drivers/block/nbd.c#L407
>> That's the place where it would have printed the "Possible stuck
>> request" message.
>> Some time later, nbdkit actually replies to the message (for the first
>> and only time) and in that code the flag is checked and found to be
>> clear already, causing the "Suspicious reply" message to be printed:
> Hi,
>
> You are right, can you try the following patch?
>
> https://lists.debian.org/nbd/2022/04/msg00212.html

Ah, I guess that makes sense. I thought that not specifying a timeout
meant "never time out".

I'll set this to a large value and report back if the problem recurs.

Thanks!
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-05-17 21:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <04a0a06b-e6dc-48b7-bc29-105dab888a56@www.fastmail.com>
2022-05-15 15:51 ` Communication issues between NBD driver and NBDKit server Josef Bacik
2022-05-15 16:39   ` Nikolaus Rath
2022-05-15 18:05 ` [Libguestfs] " Richard W.M. Jones
2022-05-15 19:12   ` Nikolaus Rath
2022-05-15 19:25     ` Richard W.M. Jones
2022-05-16  1:08       ` yukuai (C)
2022-05-17 21:16         ` Nikolaus Rath

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.