All of lore.kernel.org
 help / color / mirror / Atom feed
* Virtio-net - add timeouts to control commands
@ 2022-08-24  7:51 Alvaro Karsz
  2022-08-24  9:06 ` Jason Wang
  0 siblings, 1 reply; 11+ messages in thread
From: Alvaro Karsz @ 2022-08-24  7:51 UTC (permalink / raw)
  To: virtualization; +Cc: Michael S. Tsirkin

I think that we should add a timeout to the control virtqueue commands.
If the hypervisor crashes while handling a control command, the guest
will spin forever.
This may not be necessary for a virtual environment, when both the
hypervisor and the guest OS run in the same bare metal, but this
is needed for a physical network device compatible with VirtIO.

(In these cases, the network device acts as the hypervisor, and the
server acts as
the guest OS).

The network device may fail to answer a control command, or may crash, leading
to a stall in the server.

My idea is to add a big enough timeout, to allow the slow devices to
complete the command.

I wrote a simple patch that returns false from virtnet_send_command in
case of timeouts.

The timeout approach introduces some serious problems in cases when
the network device does answer the control command, but after the
timeout.

* The device will think that the command succeeded, while the server won't.
   This may be serious with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command.
   The server may receive packets in an unexpected queue.

* virtqueue_get_buf will return the previous response for the next
control command.

Addressing this case by adding a timeout  to the spec won't be easy,
since the network device and the server have different clocks, and the
server won't know when exactly the network device noticed the kick.

So maybe we should call virtnet_remove if we reach a timeout.

Or maybe we can just assume that the network device crashed after a
long timeout, and nothing should be done.

What do you guys think?

Alvaro
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Virtio-net - add timeouts to control commands
  2022-08-24  7:51 Virtio-net - add timeouts to control commands Alvaro Karsz
@ 2022-08-24  9:06 ` Jason Wang
  2022-08-24  9:16   ` Alvaro Karsz
  2022-08-24  9:21   ` Hannes Reinecke
  0 siblings, 2 replies; 11+ messages in thread
From: Jason Wang @ 2022-08-24  9:06 UTC (permalink / raw)
  To: Alvaro Karsz; +Cc: Michael S. Tsirkin, virtualization

On Wed, Aug 24, 2022 at 3:52 PM Alvaro Karsz <alvaro.karsz@solid-run.com> wrote:
>
> I think that we should add a timeout to the control virtqueue commands.
> If the hypervisor crashes while handling a control command, the guest
> will spin forever.
> This may not be necessary for a virtual environment, when both the
> hypervisor and the guest OS run in the same bare metal, but this
> is needed for a physical network device compatible with VirtIO.
>
> (In these cases, the network device acts as the hypervisor, and the
> server acts as
> the guest OS).
>
> The network device may fail to answer a control command, or may crash, leading
> to a stall in the server.
>
> My idea is to add a big enough timeout, to allow the slow devices to
> complete the command.
>
> I wrote a simple patch that returns false from virtnet_send_command in
> case of timeouts.
>
> The timeout approach introduces some serious problems in cases when
> the network device does answer the control command, but after the
> timeout.
>
> * The device will think that the command succeeded, while the server won't.
>    This may be serious with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command.
>    The server may receive packets in an unexpected queue.
>
> * virtqueue_get_buf will return the previous response for the next
> control command.
>
> Addressing this case by adding a timeout  to the spec won't be easy,
> since the network device and the server have different clocks, and the
> server won't know when exactly the network device noticed the kick.
>
> So maybe we should call virtnet_remove if we reach a timeout.

Or reset but can we simply use interrupt instead of the busy waiting?

Thanks

>
> Or maybe we can just assume that the network device crashed after a
> long timeout, and nothing should be done.
>
> What do you guys think?
>
> Alvaro
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Virtio-net - add timeouts to control commands
  2022-08-24  9:06 ` Jason Wang
@ 2022-08-24  9:16   ` Alvaro Karsz
  2022-08-24  9:24     ` Hannes Reinecke
  2022-08-25  2:27     ` Jason Wang
  2022-08-24  9:21   ` Hannes Reinecke
  1 sibling, 2 replies; 11+ messages in thread
From: Alvaro Karsz @ 2022-08-24  9:16 UTC (permalink / raw)
  To: Jason Wang; +Cc: Michael S. Tsirkin, virtualization

Hi Jason,

> Or reset but can we simply use interrupt instead of the busy waiting?

I agree that timeouts are not needed using interrupts.
This can be done, but seems like a big change.
All functions calling virtnet_send_command expect to get a response immediately.
This may be difficult, since some of the commands are sent in the
probe function.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Virtio-net - add timeouts to control commands
  2022-08-24  9:06 ` Jason Wang
  2022-08-24  9:16   ` Alvaro Karsz
@ 2022-08-24  9:21   ` Hannes Reinecke
  2022-08-24  9:42     ` Alvaro Karsz
  1 sibling, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2022-08-24  9:21 UTC (permalink / raw)
  To: Jason Wang, Alvaro Karsz; +Cc: virtualization, Michael S. Tsirkin

On 8/24/22 11:06, Jason Wang wrote:
> On Wed, Aug 24, 2022 at 3:52 PM Alvaro Karsz <alvaro.karsz@solid-run.com> wrote:
>>
>> I think that we should add a timeout to the control virtqueue commands.
>> If the hypervisor crashes while handling a control command, the guest
>> will spin forever.
>> This may not be necessary for a virtual environment, when both the
>> hypervisor and the guest OS run in the same bare metal, but this
>> is needed for a physical network device compatible with VirtIO.
>>
>> (In these cases, the network device acts as the hypervisor, and the
>> server acts as
>> the guest OS).
>>
>> The network device may fail to answer a control command, or may crash, leading
>> to a stall in the server.
>>
>> My idea is to add a big enough timeout, to allow the slow devices to
>> complete the command.
>>
>> I wrote a simple patch that returns false from virtnet_send_command in
>> case of timeouts.
>>
>> The timeout approach introduces some serious problems in cases when
>> the network device does answer the control command, but after the
>> timeout.
>>
>> * The device will think that the command succeeded, while the server won't.
>>     This may be serious with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command.
>>     The server may receive packets in an unexpected queue.
>>
>> * virtqueue_get_buf will return the previous response for the next
>> control command.
>>
>> Addressing this case by adding a timeout  to the spec won't be easy,
>> since the network device and the server have different clocks, and the
>> server won't know when exactly the network device noticed the kick.
>>
>> So maybe we should call virtnet_remove if we reach a timeout.
> 
> Or reset but can we simply use interrupt instead of the busy waiting?
> 

There are two possible ways of handling this:
a) let the device do the timeout: pass in a timeout value with the 
command, and allow the device to return an ETIMEDOUT error when the 
timeout expires. Then it's up to the device to do the necessary timeout 
handling; the server won't be involved at all (except for handling an 
ETIMEDOUT error)
b) implement an 'abort' command. With that the server controls the 
timeout, and is allowed to send an 'abort' command when the timeout 
expires. That requires the device to be able to abort commands (which 
not all devices are able to), but avoids having to implement a timeout 
handling in the device.

We can actually specify both methods, and have configuration bits 
indicating which method is supported by the device.

I am very much in favour of having timeouts for virtio commands; we've 
had several massive customer escalations which could have been solved if 
we were able to set the command timeout in the VM.
As this was for virtio-scsi/virtio-block I would advocate to have a 
generic virtio command timeout, not a protocol-specific one.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Virtio-net - add timeouts to control commands
  2022-08-24  9:16   ` Alvaro Karsz
@ 2022-08-24  9:24     ` Hannes Reinecke
  2022-08-24  9:48       ` Alvaro Karsz
  2022-08-25  2:27     ` Jason Wang
  1 sibling, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2022-08-24  9:24 UTC (permalink / raw)
  To: Alvaro Karsz, Jason Wang; +Cc: virtualization, Michael S. Tsirkin

On 8/24/22 11:16, Alvaro Karsz wrote:
> Hi Jason,
> 
>> Or reset but can we simply use interrupt instead of the busy waiting?
> 
> I agree that timeouts are not needed using interrupts.

Quite the contrary.
There is no guarantee that a completion for a given command will be 
signaled at all; if the completion is never posted to the virtio queue 
the VM will spin forever.

> This can be done, but seems like a big change.
> All functions calling virtnet_send_command expect to get a response immediately.
> This may be difficult, since some of the commands are sent in the
> probe function.

This is a simplistic approach. _Any_ command sent via virtio will have 
to have some sort of handling in the host, so an immediate response is 
not guaranteed.
Especially virtio-block command will _not_ be handled immediately.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Virtio-net - add timeouts to control commands
  2022-08-24  9:21   ` Hannes Reinecke
@ 2022-08-24  9:42     ` Alvaro Karsz
  2022-08-24 10:19       ` Hannes Reinecke
  2022-08-25  3:01       ` Jason Wang
  0 siblings, 2 replies; 11+ messages in thread
From: Alvaro Karsz @ 2022-08-24  9:42 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: virtualization, Michael S. Tsirkin

Hi Hannes,

> a) let the device do the timeout: pass in a timeout value with the
> command, and allow the device to return an ETIMEDOUT error when the
> timeout expires. Then it's up to the device to do the necessary timeout
> handling; the server won't be involved at all (except for handling an
> ETIMEDOUT error)


This won't work if the device crashes.

>
> b) implement an 'abort' command. With that the server controls the
> timeout, and is allowed to send an 'abort' command when the timeout
> expires. That requires the device to be able to abort commands (which
> not all devices are able to), but avoids having to implement a timeout
> handling in the device.


I actually thought about this idea.
This may work, but you'll still have a few moments when the server
assumes that the command failed, and the network device assumes that
it succeeded.
So the server may still receive packets in an unexpected queue.


>
> I am very much in favour of having timeouts for virtio commands; we've
> had several massive customer escalations which could have been solved if
> we were able to set the command timeout in the VM.
> As this was for virtio-scsi/virtio-block I would advocate to have a
> generic virtio command timeout, not a protocol-specific one.

This may be difficult to implement.
Especially when multiple commands may be queued at the same time, and
the device can handle the commands in any order.
We'll need to add identifiers for every command.

I'm actually referring here to the Linux kernel implementation of
virtnet control commands, in which the server spins for a response.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Virtio-net - add timeouts to control commands
  2022-08-24  9:24     ` Hannes Reinecke
@ 2022-08-24  9:48       ` Alvaro Karsz
  0 siblings, 0 replies; 11+ messages in thread
From: Alvaro Karsz @ 2022-08-24  9:48 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: virtualization, Michael S. Tsirkin

> Quite the contrary.
> There is no guarantee that a completion for a given command will be
> signaled at all; if the completion is never posted to the virtio queue
> the VM will spin forever.
>
> This is a simplistic approach. _Any_ command sent via virtio will have
> to have some sort of handling in the host, so an immediate response is
> not guaranteed.
> Especially virtio-block command will _not_ be handled immediately.


I'm referring to virtnet control commands.
virtnet_send_command is a blocking function, and returns once a
response is received for the sent control command.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Virtio-net - add timeouts to control commands
  2022-08-24  9:42     ` Alvaro Karsz
@ 2022-08-24 10:19       ` Hannes Reinecke
  2022-08-24 10:31         ` Alvaro Karsz
  2022-08-25  3:01       ` Jason Wang
  1 sibling, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2022-08-24 10:19 UTC (permalink / raw)
  To: Alvaro Karsz; +Cc: virtualization, Michael S. Tsirkin

On 8/24/22 11:42, Alvaro Karsz wrote:
> Hi Hannes,
> 
>> a) let the device do the timeout: pass in a timeout value with the
>> command, and allow the device to return an ETIMEDOUT error when the
>> timeout expires. Then it's up to the device to do the necessary timeout
>> handling; the server won't be involved at all (except for handling an
>> ETIMEDOUT error)
> 
> 
> This won't work if the device crashes.
> 
>>
>> b) implement an 'abort' command. With that the server controls the
>> timeout, and is allowed to send an 'abort' command when the timeout
>> expires. That requires the device to be able to abort commands (which
>> not all devices are able to), but avoids having to implement a timeout
>> handling in the device.
> 
> 
> I actually thought about this idea.
> This may work, but you'll still have a few moments when the server
> assumes that the command failed, and the network device assumes that
> it succeeded.
> So the server may still receive packets in an unexpected queue.
> 
> 
No. The server may only assume that the command failed until it gets the 
response for the abort command.
Before that the state of the command is undefined, and the server may 
not assume anything here.

And then we get into the fun topic of timing out aborts, which really 
can only be resolved if you have a fool-proof way of resetting the queue 
itself. But I guess virtio can do that (right?).

>>
>> I am very much in favour of having timeouts for virtio commands; we've
>> had several massive customer escalations which could have been solved if
>> we were able to set the command timeout in the VM.
>> As this was for virtio-scsi/virtio-block I would advocate to have a
>> generic virtio command timeout, not a protocol-specific one.
> 
> This may be difficult to implement.
> Especially when multiple commands may be queued at the same time, and
> the device can handle the commands in any order.
> We'll need to add identifiers for every command.
> 
Why, but of course. You cannot assume in-order delivery of the 
completions; in fact, that's the whole _point_ of having a queue-based 
I/O command delivery method.

> I'm actually referring here to the Linux kernel implementation of
> virtnet control commands, in which the server spins for a response.

Sheesh. Spinning for a response is never a good idea, as this means 
you'll end up with a non-interruptible command in the guest (essentially 
an ioctl into the hypervisor).

And really, identifying the command isn't hard. Each command already has 
an identifier (namely the virtio ring index), so if in doubt you can 
always use that.
To be foolproof, though, you might want to add a 'real' identifier (like 
a 32 or 64 bit command tag), which would even allow you to catch 
uninitialized / completed commands. Tends to be quite important when 
implementing an 'abort' command, as the command referred to by the 
'abort' command might have been completed by the time the hypervisor 
processes the abort command.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Virtio-net - add timeouts to control commands
  2022-08-24 10:19       ` Hannes Reinecke
@ 2022-08-24 10:31         ` Alvaro Karsz
  0 siblings, 0 replies; 11+ messages in thread
From: Alvaro Karsz @ 2022-08-24 10:31 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: virtualization, Michael S. Tsirkin

> No. The server may only assume that the command failed until it gets the
> response for the abort command.
> Before that the state of the command is undefined, and the server may
> not assume anything here.


You're right, but the device will think that the command succeeded, so
it may send packets in a queue that is not expected by the server.
I'm referring to the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command.

Don't you think that as a quick fix, adding a big enough timeout is an
adequate solution (for virtio net control commands)?
The big timeout ensures that when the control command is not answered,
it is because of some crash/bug in the network device (which acts as
the hypervisor).
This way, we can detect the crash/bug and reset/remove the device.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Virtio-net - add timeouts to control commands
  2022-08-24  9:16   ` Alvaro Karsz
  2022-08-24  9:24     ` Hannes Reinecke
@ 2022-08-25  2:27     ` Jason Wang
  1 sibling, 0 replies; 11+ messages in thread
From: Jason Wang @ 2022-08-25  2:27 UTC (permalink / raw)
  To: Alvaro Karsz; +Cc: Michael S. Tsirkin, virtualization

On Wed, Aug 24, 2022 at 5:16 PM Alvaro Karsz <alvaro.karsz@solid-run.com> wrote:
>
> Hi Jason,
>
> > Or reset but can we simply use interrupt instead of the busy waiting?
>
> I agree that timeouts are not needed using interrupts.
> This can be done, but seems like a big change.
> All functions calling virtnet_send_command expect to get a response immediately.
> This may be difficult, since some of the commands are sent in the
> probe function.

We can put the process into interruptible sleep, then it should be fine?

(FYI, some transport specific methods may sleep e.g ccw).

Thanks

>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Virtio-net - add timeouts to control commands
  2022-08-24  9:42     ` Alvaro Karsz
  2022-08-24 10:19       ` Hannes Reinecke
@ 2022-08-25  3:01       ` Jason Wang
  1 sibling, 0 replies; 11+ messages in thread
From: Jason Wang @ 2022-08-25  3:01 UTC (permalink / raw)
  To: Alvaro Karsz; +Cc: virtualization, Michael S. Tsirkin

On Wed, Aug 24, 2022 at 5:43 PM Alvaro Karsz <alvaro.karsz@solid-run.com> wrote:
>
> Hi Hannes,
>
> > a) let the device do the timeout: pass in a timeout value with the
> > command, and allow the device to return an ETIMEDOUT error when the
> > timeout expires. Then it's up to the device to do the necessary timeout
> > handling; the server won't be involved at all (except for handling an
> > ETIMEDOUT error)
>
>
> This won't work if the device crashes.

Yes, from the view of the hardening. Driver should not trust/depend on
device behaviour.

>
> >
> > b) implement an 'abort' command. With that the server controls the
> > timeout, and is allowed to send an 'abort' command when the timeout
> > expires. That requires the device to be able to abort commands (which
> > not all devices are able to), but avoids having to implement a timeout
> > handling in the device.
>
>
> I actually thought about this idea.
> This may work, but you'll still have a few moments when the server
> assumes that the command failed, and the network device assumes that
> it succeeded.
> So the server may still receive packets in an unexpected queue.

Similar to the previous case. Driver should not trust the device to
execute any command correctly.

>
>
> >
> > I am very much in favour of having timeouts for virtio commands; we've
> > had several massive customer escalations which could have been solved if
> > we were able to set the command timeout in the VM.
> > As this was for virtio-scsi/virtio-block I would advocate to have a
> > generic virtio command timeout, not a protocol-specific one.
>
> This may be difficult to implement.
> Especially when multiple commands may be queued at the same time, and
> the device can handle the commands in any order.
> We'll need to add identifiers for every command.

Having a timeout that is under the control of the driver might be
possible. Anyhow this needs to be discussed in the virtio-dev.

Thanks

>
> I'm actually referring here to the Linux kernel implementation of
> virtnet control commands, in which the server spins for a response.
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-08-25  3:01 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-24  7:51 Virtio-net - add timeouts to control commands Alvaro Karsz
2022-08-24  9:06 ` Jason Wang
2022-08-24  9:16   ` Alvaro Karsz
2022-08-24  9:24     ` Hannes Reinecke
2022-08-24  9:48       ` Alvaro Karsz
2022-08-25  2:27     ` Jason Wang
2022-08-24  9:21   ` Hannes Reinecke
2022-08-24  9:42     ` Alvaro Karsz
2022-08-24 10:19       ` Hannes Reinecke
2022-08-24 10:31         ` Alvaro Karsz
2022-08-25  3:01       ` Jason Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.