* [PATCH v3 0/2] virtio: introduce STOP status bit
@ 2021-11-11 18:58 Eugenio Pérez
  2021-11-11 18:58 ` [PATCH v3 1/2] content: Explain better the status clearing bits Eugenio Pérez
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Eugenio Pérez @ 2021-11-11 18:58 UTC (permalink / raw)
  To: virtio-dev, virtio-comment, mst, jasowang
  Cc: amikheev, stefanha, shahafs, oren, pasic, cohuck, bodong,
	Dr . David Alan Gilbert, parav, mgurtovoy

This patch introduces a new status bit STOP. This can be used by the
driver to stop the device in order to safely fetch used descriptors
status, making sure the device will not fetch new available ones.

Its main use case is live migration, although it has other orthogonal
use cases. It can be used to safely discard requests that have not been
used: in other words, to rewind available descriptors.

Stopping the device in the live migration context is done by per-device
operations in vhost backends, but the introduction of STOP as a basic
virtio facility comes with advantages:
* All the device virtio-specific state is summarized in a single entity,
  making it easier to reason about.
* VMM does not need to implement device specific operations in the
  driver part.
* Works out of the box for devices that use pure virtio backends in some
  part of the device emulation chain (virtio_pci_vdpa or virtio_vdpa),
  in any transport the device can use.
* It's totally self-contained, solving the nested virtualization case
  straightforwardly.

To fully understand its position in the live migration case, note that
the VMM acts as part (or the whole) of the virtio device from the
guest's point of view, and it can act as part of the driver from an
external virtio device's point of view. This is already the case
when using vhost-net, for example, where the VMM exposes a combination of
backend and VMM features, and can mask them if needed.

To migrate an external device the VMM needs to retrieve its (guest
visible) status and make sure the device does not modify it or
communicate with the guest anymore. The STOP status bit achieves the
last part, and even the first one in case of a pure stateless device
using the split vring.

In its simplest mode of operation, the VMM masks the VIRTIO_F_STOP feature
from the guest, and also masks the STOP and STOP_FAILED status bits. This
way the VMM can stop and resume operation unilaterally, totally
transparently to the guest.

If we don't need the STOP status bit in the hypervisor but we want the
guest to be able to stop the device, the status can be passed through.

If we want the guest to be able to stop and resume the device by itself
and the VMM does not need live migration (LM), the flag and status must
not be masked. If we want both, the VMM needs to be able to override
them, taking into account the device status as seen by the guest.
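
As a rough illustration of the masking case described above (not part of
the patch), a VMM could filter what the guest sees along these lines. The
numeric values are the ones proposed in this series (feature bit 41,
status bits 16 and 32); the function names are hypothetical.

```python
# Values proposed in this series; they are not part of a released spec.
VIRTIO_F_STOP = 41        # feature bit number
STATUS_STOP = 16          # device status bit
STATUS_STOP_FAILED = 32   # device status bit


def guest_visible_features(backend_features: int) -> int:
    """Hide VIRTIO_F_STOP so that only the VMM can negotiate it."""
    return backend_features & ~(1 << VIRTIO_F_STOP)


def guest_visible_status(device_status: int) -> int:
    """Hide STOP and STOP_FAILED from guest reads of device status."""
    return device_status & ~(STATUS_STOP | STATUS_STOP_FAILED)
```

With this filtering the guest never observes the stop, so the VMM can stop
and resume the backend on its own during migration.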

v3:
* Delete all virtqueue state saving and restoring, not needed at the
  moment.
* Add STOP_FAILED bit so device can fail the operation
* Add config interrupt to notify the driver that the stop bit is set, so
  it can avoid busy-polling the status.
* Expand device's required treatment to in-flight descriptors before
  setting the STOP bit.
* Add rewind capabilities.
* Add resume operation, clearing the STOP bit.
* Reword status clear bit / PCI reset contradictions, already present in
  the spec
* Specify device behavior if STOP status bit is set before DRIVER_OK

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

Eugenio Pérez (1):
  content: Explain better the status clearing bits

Jason Wang (1):
  virtio: introduce STOP status bit

 content.tex | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 87 insertions(+), 3 deletions(-)

-- 
2.27.0




* [PATCH v3 1/2] content: Explain better the status clearing bits
  2021-11-11 18:58 [PATCH v3 0/2] virtio: introduce STOP status bit Eugenio Pérez
@ 2021-11-11 18:58 ` Eugenio Pérez
  2021-11-12  3:46   ` Jason Wang
  2021-11-12 10:34   ` [virtio-dev] " Cornelia Huck
  2021-11-11 18:58 ` [PATCH v3 2/2] virtio: introduce STOP status bit Eugenio Pérez
  2021-11-18 14:45 ` [PATCH v3 0/2] " Stefan Hajnoczi
  2 siblings, 2 replies; 43+ messages in thread
From: Eugenio Pérez @ 2021-11-11 18:58 UTC (permalink / raw)
  To: virtio-dev, virtio-comment, mst, jasowang
  Cc: amikheev, stefanha, shahafs, oren, pasic, cohuck, bodong,
	Dr . David Alan Gilbert, parav, mgurtovoy

The spec says that "The driver MUST NOT clear a device status bit", but
a driver using the PCI transport resets a virtio device by writing 0 to
device status, which in effect clears all of its bits.

Instead of adding an exception, explicitly list the status bits that
the driver cannot clear during normal operation, so conformant
devices and drivers remain conformant.
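
To make the proposed wording concrete, here is a small sketch (not part of
the patch) of a check for whether a driver's status write respects it. The
bit values are the ones from the virtio spec; the function name is
hypothetical, and treating a write of 0 as "resetting the whole device"
assumes a transport such as PCI where reset is done that way.

```python
ACKNOWLEDGE = 1
DRIVER = 2
DRIVER_OK = 4
FEATURES_OK = 8
DEVICE_NEEDS_RESET = 64

# Bits the driver must not clear, per the reworded sentence.
PROTECTED = ACKNOWLEDGE | DRIVER | DRIVER_OK | FEATURES_OK | DEVICE_NEEDS_RESET


def status_write_is_conformant(old_status: int, new_status: int) -> bool:
    """Return True if writing new_status over old_status is allowed."""
    if new_status == 0:
        # Assumption: writing 0 is the transport's way to reset the device.
        return True
    cleared = old_status & ~new_status
    return (cleared & PROTECTED) == 0
```

Under this reading, clearing a bit outside the listed set (for example the
STOP bit, value 16, from patch 2) is permitted, while clearing DRIVER_OK
alone is not.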

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 content.tex | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/content.tex b/content.tex
index 5d112af..2aa3006 100644
--- a/content.tex
+++ b/content.tex
@@ -60,9 +60,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
 initialization sequence specified in
 \ref{sec:General Initialization And Device Operation / Device
 Initialization}.
-The driver MUST NOT clear a
-\field{device status} bit.  If the driver sets the FAILED bit,
-the driver MUST later reset the device before attempting to re-initialize.
+The driver MUST NOT clear ACKNOWLEDGE, DRIVER, DRIVER_OK, FEATURES_OK or
+DEVICE_NEEDS_RESET bits of \field{device status}, except if resetting the whole
+device.  If the driver sets the FAILED bit, the driver MUST later reset the
+device before attempting to re-initialize.
 
 The driver SHOULD NOT rely on completion of operations of a
 device if DEVICE_NEEDS_RESET is set.
-- 
2.27.0



* [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-11 18:58 [PATCH v3 0/2] virtio: introduce STOP status bit Eugenio Pérez
  2021-11-11 18:58 ` [PATCH v3 1/2] content: Explain better the status clearing bits Eugenio Pérez
@ 2021-11-11 18:58 ` Eugenio Pérez
  2021-11-12  4:18   ` Jason Wang
  2021-11-18 15:59   ` Stefan Hajnoczi
  2021-11-18 14:45 ` [PATCH v3 0/2] " Stefan Hajnoczi
  2 siblings, 2 replies; 43+ messages in thread
From: Eugenio Pérez @ 2021-11-11 18:58 UTC (permalink / raw)
  To: virtio-dev, virtio-comment, mst, jasowang
  Cc: amikheev, stefanha, shahafs, oren, pasic, cohuck, bodong,
	Dr . David Alan Gilbert, parav, mgurtovoy

From: Jason Wang <jasowang@redhat.com>

This patch introduces a new status bit STOP. This can be used by the
driver to stop the device in order to safely fetch used descriptors
status, making sure the device will not fetch new available ones.

Its main use case is live migration, although it has other orthogonal
use cases. It can be used to safely discard requests that have not been
used: in other words, to rewind available descriptors.
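
The stop handshake in the diff below (write STOP, then re-read the status
until STOP or STOP_FAILED shows up) can be sketched as follows. This is
only an illustration: ToyDevice is a made-up model that acknowledges after
a few status reads, and stop_device is a hypothetical driver helper that
polls instead of waiting for the configuration change notification.

```python
DRIVER_OK = 4
STOP = 16
STOP_FAILED = 32


class ToyDevice:
    """Made-up device that needs a few status reads before it answers."""

    def __init__(self, polls_until_answer: int = 3, can_stop: bool = True):
        self.status = DRIVER_OK
        self.can_stop = can_stop
        self._polls_until_answer = polls_until_answer
        self._countdown = 0
        self._stop_requested = False

    def write_status(self, value: int) -> None:
        # A write that sets STOP only starts the request; the device
        # exposes STOP (or STOP_FAILED) later, once it has finished.
        if (value & STOP) and not (self.status & STOP):
            self._stop_requested = True
            self._countdown = self._polls_until_answer

    def read_status(self) -> int:
        if self._stop_requested:
            if self._countdown > 0:
                self._countdown -= 1
            else:
                self.status |= STOP if self.can_stop else STOP_FAILED
                self._stop_requested = False
        return self.status


def stop_device(dev: ToyDevice, max_polls: int = 100) -> bool:
    """Write STOP, then re-read status until the device answers."""
    dev.write_status(dev.read_status() | STOP)
    for _ in range(max_polls):
        status = dev.read_status()
        if status & STOP:
            return True
        if status & STOP_FAILED:
            return False
    raise TimeoutError("device did not answer the STOP request")
```

A real driver would likely sleep between reads or wait for the
notification; the loop here only shows the order of operations.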

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/content.tex b/content.tex
index 2aa3006..9ed0d09 100644
--- a/content.tex
+++ b/content.tex
@@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
 \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
   drive the device.
 
+\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
+  device has been stopped by the driver. This status bit is different
+  from the reset since the device state is preserved.
+
+\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
+  device could not stop the STOP request.
+
 \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
   an error from which it can't recover.
 \end{description}
@@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
 recover by issuing a reset.
 \end{note}
 
+If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set or clear STOP if
+DRIVER_OK is not set.
+
+If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
+to ensure the STOP or STOP_FAILED bit is set after the write. The device
+acknowledges the new paused status setting the first, or the failure setting
+the last. Since this change may not be instantaneous, the driver MAY wait for
+the configuration change notification that the device must send after the
+change. If the device sets the STOP_FAILED bit, the driver MUST clear it before
+try new STOP attempts.
+
+If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
+the driver MAY change avail_idx in the case of split virtqueue, but the new
+avail_idx MUST be within used_idx and used_idx plus virtqueue size.
+
+If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
+the driver MAY change any descriptor.
+
+If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
+the driver can resume it clearing the STOP status bit. It MUST re-read the
+device status to ensure the STOP bit is clear after the write. The device
+acknowledges the new status clearing it. Since this change may not be
+instantaneous, the driver MAY wait for the configuration change notification
+that the device must send after the change.
+
 \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
 
 The device MUST NOT consume buffers or send any used buffer
 notifications to the driver before DRIVER_OK.
 
+If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
+STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
+or clear of STOP.
+
+If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
+operations after the driver writes STOP.  Depending on the device, it can do it
+in many ways as long as the driver can recover its normal operation if it
+resumes the device without the need of resetting it:
+\begin{itemize}
+\item Drain and wait for the completion of all pending requests until a
+convenient avail descriptor. Ignore any other posterior descriptor.
+\item Return a device-specific failure for these descriptors, so the driver
+can choose to retry or to cancel them.
+\item Mark them as done even if they are not, if the kind of device can
+assume to lose them.
+\end{itemize}
+
+If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
+a guest's request, the device MUST set the STOP_FAILED bit for the guest to
+read it. The device MUST ignore new writes to the STOP bit until the guest
+clears STOP_FAILED.
+
+If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
+and the device can pause its operation, the device MUST set the descriptors
+that it has done with them as used before exposing the STOP status bit as set.
+
+If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
+after exposing the STOP bit set:
+\begin{itemize}
+\item Read updates on the descriptor or driver area, or consume more buffers.
+\item Send any used buffer notifications to the driver.
+\end{itemize}
+
+The device MUST send a configuration space change right after exposing the STOP
+or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
+send another configuration space change notification to the driver afterwards
+until the guest clears it.
+
+If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
+the device MUST resume operation when the driver clears the STOP bit. The
+device MUST continue reading available descriptors as if an available buffer
+notification has reach it, starting from the last descriptor it marked as used,
+and continue the regular operation after that. The device MUST read again
+descriptor and driver area beyond the last descriptor it marked as used when it
+stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
+if for some reason it cannot continue.
+
 \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
 that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
 MUST send a device configuration change notification to the driver.
@@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
   transport specific.
   For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
 
+\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
+  stop the device.
+  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
+
 \end{description}
 
 \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
-- 
2.27.0



* Re: [PATCH v3 1/2] content: Explain better the status clearing bits
  2021-11-11 18:58 ` [PATCH v3 1/2] content: Explain better the status clearing bits Eugenio Pérez
@ 2021-11-12  3:46   ` Jason Wang
  2021-11-12 11:41     ` Eugenio Perez Martin
  2021-11-12 10:34   ` [virtio-dev] " Cornelia Huck
  1 sibling, 1 reply; 43+ messages in thread
From: Jason Wang @ 2021-11-12  3:46 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> The spec tells that "The driver MUST NOT clear a device status bit", but
> a device using PCI transport reset a virtio device writing 0 to device
> status. In some way, that is to clear all its bits.
>
> Instead of add an exception, tell explicitely the status bits that
> the driver cannot clear anytime in a normal operation, so conformant
> device and drivers keeps being conformant.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  content.tex | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/content.tex b/content.tex
> index 5d112af..2aa3006 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -60,9 +60,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  initialization sequence specified in
>  \ref{sec:General Initialization And Device Operation / Device
>  Initialization}.
> -The driver MUST NOT clear a
> -\field{device status} bit.  If the driver sets the FAILED bit,
> -the driver MUST later reset the device before attempting to re-initialize.
> +The driver MUST NOT clear ACKNOWLEDGE, DRIVER, DRIVER_OK, FEATURES_OK or
> +DEVICE_NEEDS_RESET bits of \field{device status}, except if resetting the whole
> +device.

Any reason for using blacklist here? I guess it is used for patch 2
(introduce the bit that can be cleared?).

Thanks

> If the driver sets the FAILED bit, the driver MUST later reset the
> +device before attempting to re-initialize.
>
>  The driver SHOULD NOT rely on completion of operations of a
>  device if DEVICE_NEEDS_RESET is set.
> --
> 2.27.0
>



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-11 18:58 ` [PATCH v3 2/2] virtio: introduce STOP status bit Eugenio Pérez
@ 2021-11-12  4:18   ` Jason Wang
  2021-11-12 10:50     ` Eugenio Perez Martin
  2021-11-18 15:59   ` Stefan Hajnoczi
  1 sibling, 1 reply; 43+ messages in thread
From: Jason Wang @ 2021-11-12  4:18 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> From: Jason Wang <jasowang@redhat.com>
>
> This patch introduces a new status bit STOP. This can be used by the
> driver to stop the device in order to safely fetch used descriptors
> status, making sure the device will not fetch new available ones.
>
> Its main use case is live migration, although it has other orthogonal
> use cases. It can be used to safely discard requests that have not been
> used: in other words, to rewind available descriptors.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

So this is much more complicated, see below.

> ---
>  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 83 insertions(+)
>
> diff --git a/content.tex b/content.tex
> index 2aa3006..9ed0d09 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
>    drive the device.
>
> +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> +  device has been stopped by the driver. This status bit is different
> +  from the reset since the device state is preserved.
> +
> +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> +  device could not stop the STOP request.
> +
>  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>    an error from which it can't recover.
>  \end{description}
> @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  recover by issuing a reset.
>  \end{note}
>
> +If VIRTIO_F_STOP has been negotiated,

"has not been" actually?

> the driver MUST NOT set or clear STOP if
> +DRIVER_OK is not set.
> +
> +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> +acknowledges the new paused status setting the first, or the failure setting
> +the last. Since this change may not be instantaneous, the driver MAY wait for
> +the configuration change notification that the device must send after the
> +change.

This is kind of tricky: it means the device can send a notification
after it has been stopped. As discussed in the previous versions, the
driver is free to use a timer or whatever other mechanism if it
doesn't like busy polling. I wonder how much value we can gain
from a dedicated config interrupt, especially considering that some
transports can use a transport-specific interrupt (not a virtio-specific
interrupt) to report whether or not setting the status succeeded.

>If the device sets the STOP_FAILED bit, the driver MUST clear it before
> +try new STOP attempts.

Does the device need to re-read STOP_FAILED for synchronization? I
wonder how much we can gain from STOP_FAILED; the patch is unclear on
when the device needs to set this bit. And the driver can choose to
reset after a specific timeout anyhow.

> +
> +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> +the driver MAY change avail_idx in the case of split virtqueue, but the new
> +avail_idx MUST be within used_idx and used_idx plus virtqueue size.

Any motivation for this? it looks to me it makes the feature coupled
with the virtqueue state proposal? It seems odd to allow avail change
but not the last_avail_idx change.
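
For reference, one way to read the quoted bound on the new avail_idx is
sketched below. It assumes the usual free-running 16-bit split-ring
indices and treats both ends of the range as inclusive; the quoted text
does not say whether the bounds are inclusive, and the function name is
hypothetical.

```python
def rewind_is_valid(new_avail_idx: int, used_idx: int, queue_size: int) -> bool:
    """Check used_idx <= new_avail_idx <= used_idx + queue_size, modulo 2**16."""
    # Distance from used_idx to new_avail_idx with 16-bit wraparound.
    distance = (new_avail_idx - used_idx) & 0xFFFF
    return distance <= queue_size
```

For example, with used_idx = 65530 and a queue size of 256, a new avail_idx
of 4 is 10 entries ahead after wraparound and so falls inside the range.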

> +
> +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> +the driver MAY change any descriptor.
> +
> +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> +the driver can resume it clearing the STOP status bit. It MUST re-read the
> +device status to ensure the STOP bit is clear after the write. The device
> +acknowledges the new status clearing it. Since this change may not be
> +instantaneous, the driver MAY wait for the configuration change notification
> +that the device must send after the change.

Do we really need resuming? It's kind of:

1) STOP -> clear STOP

vs

2) STOP -> RESET -> DRIVER_OK

Using 2) preserves the semantics that the driver can't clear a status bit.

> +
>  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>
>  The device MUST NOT consume buffers or send any used buffer
>  notifications to the driver before DRIVER_OK.
>
> +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> +or clear of STOP.
> +
> +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> +operations after the driver writes STOP.

I wonder if it's better to leave this to the device to decide. E.g. some
block devices may require a very long time to finish the in-flight
operations.

> Depending on the device, it can do it
> +in many ways as long as the driver can recover its normal operation if it
> +resumes the device without the need of resetting it:
> +\begin{itemize}
> +\item Drain and wait for the completion of all pending requests until a
> +convenient avail descriptor. Ignore any other posterior descriptor.
> +\item Return a device-specific failure for these descriptors, so the driver
> +can choose to retry or to cancel them.

If we allow the driver to retry, we need a way to report inflight
buffers which is not supported by the spec. A way to solve this is to
make it device specific.

> +\item Mark them as done even if they are not, if the kind of device can
> +assume to lose them.

I think "make buffer used" is better than "mark them as done". And we
need an accurate definition of who "them" is.

> +\end{itemize}
> +
> +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> +a guest's request,

It's not clear what "a guest's request" means.

> the device MUST set the STOP_FAILED bit for the guest to
> +read it. The device MUST ignore new writes to the STOP bit until the guest
> +clears STOP_FAILED.
> +
> +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> +and the device can pause its operation, the device MUST set the descriptors
> +that it has done with them as used before exposing the STOP status bit as set.
> +
> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> +after exposing the STOP bit set:
> +\begin{itemize}
> +\item Read updates on the descriptor or driver area, or consume more buffers.
> +\item Send any used buffer notifications to the driver.
> +\end{itemize}
> +
> +The device MUST send a configuration space change right after exposing the STOP
> +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> +send another configuration space change notification to the driver afterwards
> +until the guest clears it.
> +
> +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> +the device MUST resume operation when the driver clears the STOP bit. The
> +device MUST continue reading available descriptors as if an available buffer
> +notification has reach it, starting from the last descriptor it marked as used,

So I still tend to define virtqueue state as a basic facility before
defining STOP. It can make things easier.

> +and continue the regular operation after that. The device MUST read again
> +descriptor and driver area beyond the last descriptor it marked as used when it
> +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> +if for some reason it cannot continue.
> +
>  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
>  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>  MUST send a device configuration change notification to the driver.
> @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>    transport specific.
>    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
>
> +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> +  stop the device.
> +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> +
>  \end{description}

So I think the patch complicates things in various ways:

1) STOP_FAILED status bit, which seems unnecessary or even duplicated
with NEEDS_RESET
2) configuration change interrupt, which looks to conflict with the
semantics of STOP
3) status bit clearing (resuming), a functional duplication of RESET
+ DRIVER_OK

I think we'd better stick to the minimal set of functions to
reduce the complexity: virtqueue state + STOP bit (without clearing
and no config interrupt).

Thanks

>
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> --
> 2.27.0
>



* [virtio-dev] Re: [PATCH v3 1/2] content: Explain better the status clearing bits
  2021-11-11 18:58 ` [PATCH v3 1/2] content: Explain better the status clearing bits Eugenio Pérez
  2021-11-12  3:46   ` Jason Wang
@ 2021-11-12 10:34   ` Cornelia Huck
  2021-11-12 11:41     ` Eugenio Perez Martin
  1 sibling, 1 reply; 43+ messages in thread
From: Cornelia Huck @ 2021-11-12 10:34 UTC (permalink / raw)
  To: Eugenio Pérez, virtio-dev, virtio-comment, mst, jasowang
  Cc: amikheev, stefanha, shahafs, oren, pasic, bodong,
	Dr . David Alan Gilbert, parav, mgurtovoy

On Thu, Nov 11 2021, Eugenio Pérez <eperezma@redhat.com> wrote:

> The spec tells that "The driver MUST NOT clear a device status bit", but
> a device using PCI transport reset a virtio device writing 0 to device

I think MMIO uses that mechanism as well?

> status. In some way, that is to clear all its bits.
>
> Instead of add an exception, tell explicitely the status bits that
> the driver cannot clear anytime in a normal operation, so conformant
> device and drivers keeps being conformant.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  content.tex | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/content.tex b/content.tex
> index 5d112af..2aa3006 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -60,9 +60,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  initialization sequence specified in
>  \ref{sec:General Initialization And Device Operation / Device
>  Initialization}.
> -The driver MUST NOT clear a
> -\field{device status} bit.  If the driver sets the FAILED bit,
> -the driver MUST later reset the device before attempting to re-initialize.
> +The driver MUST NOT clear ACKNOWLEDGE, DRIVER, DRIVER_OK, FEATURES_OK or
> +DEVICE_NEEDS_RESET bits of \field{device status}, except if resetting the whole
> +device.  If the driver sets the FAILED bit, the driver MUST later reset the
> +device before attempting to re-initialize.

I think we need to distinguish "driver wants to clear a status bit" from
"driver is initiating a reset, and that transport implements that by
writing 0 to the device status". So, what about

"The driver MUST NOT clear a \field{device status} bit, except when
setting \field{device status} to 0 as a transport-specific way to
initiate a reset."

If we introduce driver-clearable bits later, we can simply make that
"The driver MUST NOT clear a \field{device status} bit other than
NEW_BIT, ..."





* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-12  4:18   ` Jason Wang
@ 2021-11-12 10:50     ` Eugenio Perez Martin
  2021-11-15  4:08       ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-12 10:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Fri, Nov 12, 2021 at 5:18 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > From: Jason Wang <jasowang@redhat.com>
> >
> > This patch introduces a new status bit STOP. This can be used by the
> > driver to stop the device in order to safely fetch used descriptors
> > status, making sure the device will not fetch new available ones.
> >
> > Its main use case is live migration, although it has other orthogonal
> > use cases. It can be used to safely discard requests that have not been
> > used: in other words, to rewind available descriptors.
> >
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
> So this is much more complicated, see below.
>

I agree it's more complicated, but it addresses some concerns raised
on previous patches sent to the list. Not saying that all of them must
be addressed, or addressed this way though :).

> > ---
> >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 83 insertions(+)
> >
> > diff --git a/content.tex b/content.tex
> > index 2aa3006..9ed0d09 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> >    drive the device.
> >
> > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > +  device has been stopped by the driver. This status bit is different
> > +  from the reset since the device state is preserved.
> > +
> > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > +  device could not stop the STOP request.
> > +
> >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> >    an error from which it can't recover.
> >  \end{description}
> > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> >  recover by issuing a reset.
> >  \end{note}
> >
> > +If VIRTIO_F_STOP has been negotiated,
>
> "has not been" actually?
>

I think the sentence is ok. In other words, "Even when VIRTIO_F_STOP
*has been* negotiated (in other words, driver sent FEATURES_OK), the
driver must not set or clear the STOP bit before setting DRIVER_OK".
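
Put as a sketch, my reading of the rule on the device side is the
following. The constants are the values proposed in this patch, and the
function name is hypothetical.

```python
VIRTIO_F_STOP = 41  # feature bit number proposed in this patch
DRIVER_OK = 4


def device_honors_stop_write(negotiated_features: int, status: int) -> bool:
    """Return True if a set or clear of STOP should be acted upon."""
    stop_negotiated = bool((negotiated_features >> VIRTIO_F_STOP) & 1)
    driver_ok = bool(status & DRIVER_OK)
    # Ignore the write unless the feature was negotiated and DRIVER_OK is set.
    return stop_negotiated and driver_ok
```

So negotiating the feature is necessary but not sufficient: the write is
only meaningful once DRIVER_OK is set as well.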

> > the driver MUST NOT set or clear STOP if
> > +DRIVER_OK is not set.
> > +
> > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > +acknowledges the new paused status setting the first, or the failure setting
> > +the last. Since this change may not be instantaneous, the driver MAY wait for
> > +the configuration change notification that the device must send after the
> > +change.
>
> This is kind of tricky, it means the device can send notification
> after it has been stopped.

I don't think this part is so tricky. That notification is also sent
when the DEVICE_NEEDS_RESET bit is set, and (as I read it) for much the
same reason: to avoid status polling:
* "The driver SHOULD NOT rely on completion of operations of a device
if DEVICE_NEEDS_RESET is set." (copied from the standard)
* The reading of the status field could be expensive / inconvenient in
each operation.
* Solution: Instead of polling, add a device facility that notifies the
driver when it can no longer trust the device to behave properly, or in
the same way as before.

We can add another exception to the "device configuration space
change" in "Notification of Device Configuration Changes", like the
one already present:
"In addition, this notification is triggered by the device setting
DEVICE_NEEDS_RESET".

I understand it sounds tricky that the device sends a notification
when it's stopped, but in my opinion it's aligned with previous
behavior (DEVICE_NEEDS_RESET), it's explicitly stated that it will be
the last one, and it exists because polling the device status is
inconvenient, even if the driver can use other mechanisms.

If the community still has concerns about it, another option is to
actually extract the way the device notifies it from the general
facilities, and make it transport specific. But to use the device
configuration change notification for this makes sense to me. The
device configuration has changed.
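
For comparison, the polling alternative can be sketched as below. This
is only an illustration under assumptions: STOP (16) and STOP_FAILED
(32) are the values proposed in this series, and struct status_ops,
driver_request_stop() and the toy fake_dev model are hypothetical names
that do not come from the spec or from any existing driver.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_STATUS_STOP        16u /* proposed in this patch */
#define VIRTIO_STATUS_STOP_FAILED 32u /* proposed in this patch */

enum stop_result { STOP_OK, STOP_REJECTED, STOP_TIMED_OUT };

/* Hypothetical status accessors standing in for a real transport. */
struct status_ops {
    uint8_t (*read)(void *dev);
    void (*write)(void *dev, uint8_t val);
};

/* Request a stop, then re-read the status until the device exposes
 * STOP or STOP_FAILED, giving up after max_polls reads. */
static enum stop_result driver_request_stop(const struct status_ops *ops,
                                            void *dev, unsigned max_polls)
{
    assert(ops && ops->read && ops->write);
    ops->write(dev, (uint8_t)(ops->read(dev) | VIRTIO_STATUS_STOP));

    for (unsigned i = 0; i < max_polls; i++) {
        uint8_t s = ops->read(dev);
        if (s & VIRTIO_STATUS_STOP_FAILED)
            return STOP_REJECTED;
        if (s & VIRTIO_STATUS_STOP)
            return STOP_OK;
        /* A real driver would sleep here, or wait for a notification. */
    }
    return STOP_TIMED_OUT;
}

/* Toy device, only to exercise the loop: it answers a pending STOP
 * request after `delay` further status reads. */
struct fake_dev {
    uint8_t status;
    int stop_pending;
    unsigned delay;
    int fail;
};

static uint8_t fake_read(void *p)
{
    struct fake_dev *d = p;
    if (d->stop_pending) {
        if (d->delay > 0) {
            d->delay--;
        } else {
            d->status |= d->fail ? VIRTIO_STATUS_STOP_FAILED
                                 : VIRTIO_STATUS_STOP;
            d->stop_pending = 0;
        }
    }
    return d->status;
}

static void fake_write(void *p, uint8_t val)
{
    struct fake_dev *d = p;
    if ((val & VIRTIO_STATUS_STOP) && !(d->status & VIRTIO_STATUS_STOP))
        d->stop_pending = 1;
}

static const struct status_ops fake_ops = { fake_read, fake_write };
```

The loop makes the cost of polling visible: with a device whose stop
time is not bounded, the driver has to choose max_polls (or a timeout)
on its own, which is the case a notification would avoid.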

> As discussed in the previous versions,
> driver is freed to use timer or what ever other mechanism if it
> doesn't like the busy polling. I wonder how much value we can gain
> from a dedicated config interrupt. Especially consider some transport
> can use transport specific interrupt (not virtio specific interrupt)
> for reporting whether or not set status succeed.
>

In my opinion, *if* we agree that a stop is a virtio facility and not
a per-device one, and *if* we agree that a notification is required
for the device to notify the stop, it makes sense to use a
transport-independent mechanism that the device must already
implement.

> >If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > +try new STOP attempts.
>
> Does the device need to re-read the STOP_FAILED for synchronization?

I tend to see the status as something that belongs to the device and
is exposed to the driver. In that sense, the write from the guest
triggers an event on the device, and the device decides what will be
exposed on that field (MMIO?) on the next driver read. If it's not
that way, we couldn't use the STOP bit that way, right?

> I
> wonder how much we can gain from STOP_FAILED, the patch is unclear on
> when that the device needs to set this bit. And driver can choose to
> reset after a specific timeout anyhow.
>

The conditions under which the device needs to set this bit are
unspecified because they depend on the device: not only on the kind of
device, but also on the device backend.

The same condition (regarding the possibility of handling the pending
buffers) could cause different devices to react differently. A network
device could decide it's fine to drop pending tx, let the guest think
that "the network lost them", and mark them as done, whereas a
persistent storage device cannot do that for write requests. Just as an
example, not saying that networking devices must do that :).

> > +
> > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
>
> Any motivation for this? it looks to me it makes the feature coupled
> with the virtqueue state proposal? It seems odd to allow avail change
> but not the last_avail_idx change.
>

On second thought, I think you are right and this overlaps with the
state proposal.
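
For reference, the range constraint quoted above ("within used_idx and
used_idx plus virtqueue size") has to be evaluated modulo 2^16, since
split virtqueue indexes are free-running 16-bit counters. A minimal
sketch, where avail_idx_in_range() is a hypothetical helper name and
the inclusive upper bound is one reading of the patch wording:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: is new_avail within
 * [used_idx, used_idx + qsize], computed modulo 2^16? */
static bool avail_idx_in_range(uint16_t used_idx, uint16_t new_avail,
                               uint16_t qsize)
{
    /* Split virtqueue sizes are nonzero and at most 32768. */
    assert(qsize != 0 && qsize <= 32768);
    /* The cast makes the subtraction wrap like the ring indexes do. */
    return (uint16_t)(new_avail - used_idx) <= qsize;
}
```

For example, with used_idx = 65530 and a queue size of 256, a new
avail_idx of 4 is accepted because the distance is 10 after wraparound.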

> > +
> > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > +the driver MAY change any descriptor.
> > +
> > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> > +the driver can resume it clearing the STOP status bit. It MUST re-read the
> > +device status to ensure the STOP bit is clear after the write. The device
> > +acknowledges the new status clearing it. Since this change may not be
> > +instantaneous, the driver MAY wait for the configuration change notification
> > +that the device must send after the change.
>
> Do we really needs resuming? it's kind of:
>
> 1) STOP -> clear STOP
>
> vs
>
> 2) STOP -> RESET -> DRIVER_OK
>
> Using 2) preserve the semantic that the driver can't clear the status bit.
>

You are totally right in that regard. But resuming simplifies the
operation when the driver only wants to take back some available
descriptors that are still not used, in the range
last_avail_idx..avail_idx. Doing that through a reset could be a big
burden for drivers, which would need to re-send every status. MST
proposed that use case at [1].

In that regard, the straightforward thing to do is to modify avail_idx /
descriptors from that range and let the device resume. However, the
RESET path makes it easier to implement the device part of course, and
the guest can also achieve the rewind that way.

> > +
> >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> >
> >  The device MUST NOT consume buffers or send any used buffer
> >  notifications to the driver before DRIVER_OK.
> >
> > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > +or clear of STOP.
> > +
> > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > +operations after the driver writes STOP.
>
> I wonder if it's better to leave this to device to decide. E.g some
> block devices may requires a very log time to finish the inflight
> operations.
>

(Leaving out SVQ + inflight descriptors for this part of the response,
I will come back to it later)

But if virtqueue is not valid anymore, how can it report them when
finished? In that sense, I would say it's better to report failure and
let the guest handle it as if the disk is unavailable (timeout?
temporary faulty sector? I'm not sure what is the most suitable way).

*If* we are not going to allow the guest to resume operation, where it
knows all the status of the device, then there is no value in letting
the device delay the operation: from the guest point of view, either
the request was successfully sent to the device backend and somebody
else caused a failure (the external network lost the tx packet, or bit
rot made a read return a different value than previously written), or
it failed at the stop moment.

This is different from the resume possibility, where the device can
decide to hold the descriptors, stop operating, and then resume
operation.

> > Depending on the device, it can do it
> > +in many ways as long as the driver can recover its normal operation if it
> > +resumes the device without the need of resetting it:
> > +\begin{itemize}
> > +\item Drain and wait for the completion of all pending requests until a
> > +convenient avail descriptor. Ignore any other posterior descriptor.
> > +\item Return a device-specific failure for these descriptors, so the driver
> > +can choose to retry or to cancel them.
>
> If we allow the driver to retry, we need a way to report inflight
> buffers which is not supported by the spec. A way to solve this is to
> make it device specific.
>

Regarding the retry, I don't get you here. Re-reading the patch, I
think that "driver retry" is very ambiguous: I meant for the device to
mark the descriptor as used, but with a communication specific error
code, so the application, guest kernel, etc (driver in the standard)
can decide to retry.

Regarding the in-flight descriptor report, it's interesting but I
cannot see a way where it does not complicate the solution a lot or
add new dependencies. I have the following thoughts:
1) If it works as inflight_fd, "a region of shared memory"
1.1) This region must be in the guest's AS so the device has access to
it. This invalidates the use of STOP from the driver point of
view as "let me know when you are not going to modify the guest's
memory anymore".
1.2) This region is on the hypervisor's AS. If the device supports it,
it is possible to implement the SVQ without the need of STOP bit. This
is equivalent to "I have a PF that also supports VF dirty memory
tracking".
2) If it works as the config space, where the driver can ask for its
status, STOP means "STOP writing used and report via config space". No
need for reset.

Did you have something different in mind?

> > +\item Mark them as done even if they are not, if the kind of device can
> > +assume to lose them.
>
> I think "make buffer used" is better than "mark them as done". And we
> need a accurate definition on who is "them".
>

All items include other operations, like the ones that the device must
do internally to process the control virtqueue. But I cannot find an
example where telling the driver they are done when it's not is valid
for this particular item.

But I agree it needs better wording.

And I will s/them/operations/ for the next version.

> > +\end{itemize}
> > +
> > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > +a guest's request,
>
> It's not clear what did "a guest's request" means.
>

Right. Would "operation" fit better here?

> > the device MUST set the STOP_FAILED bit for the guest to
> > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > +clears STOP_FAILED.
> > +
> > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > +and the device can pause its operation, the device MUST set the descriptors
> > +that it has done with them as used before exposing the STOP status bit as set.
> > +
> > +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> > +after exposing the STOP bit set:
> > +\begin{itemize}
> > +\item Read updates on the descriptor or driver area, or consume more buffers.
> > +\item Send any used buffer notifications to the driver.
> > +\end{itemize}
> > +
> > +The device MUST send a configuration space change right after exposing the STOP
> > +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> > +send another configuration space change notification to the driver afterwards
> > +until the guest clears it.
> > +
> > +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> > +the device MUST resume operation when the driver clears the STOP bit. The
> > +device MUST continue reading available descriptors as if an available buffer
> > +notification has reach it, starting from the last descriptor it marked as used,
>
> So I still tend to define virtqueue state as basic facility before
> defining STOP. It can makes thing easier.
>

Yes, coming back to that approach can simplify the whole proposal.

> > +and continue the regular operation after that. The device MUST read again
> > +descriptor and driver area beyond the last descriptor it marked as used when it
> > +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> > +if for some reason it cannot continue.
> > +
> >  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> >  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
> >  MUST send a device configuration change notification to the driver.
> > @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> >    transport specific.
> >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> >
> > +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> > +  stop the device.
> > +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> > +
> >  \end{description}
>
> So I think the patch complicate thing is various ways:
>
> 1) STOP_FAILED status bit, which seems unnecessary or even duplicated
> with NEEDS_RESET
> 2) configuration change interrupt, looks conflict with the semantic of STOP

I'm not sure about those two. I expect we will have devices with
unbounded stop time, where both can be useful if we agree on making
this a general facility. Resetting the whole device because of this
leaves the driver with no possibility of knowing the state of the sent
descriptors.

Of course, if these use cases are not interesting, it's easier to
leave them out for sure.

> 3) status bit clearing (resuming), a functional duplication with RESET
> + DRIVER_OK
>

I agree it can be obtained with a whole reset, so it can be left out
and kept for the future if needed. However, a reset seems overkill if
we just want to rewind some descriptors, and there is no standard
way to recover the device status beyond vq_state.

Thanks!

> I think we'd better to stick to the minimal set of the function to
> reduce the complexity: virtqueue state + STOP bit (without clearing
> and no config interrupt).
>

[1] https://lists.oasis-open.org/archives/virtio-comment/202107/msg00043.html

> Thanks
>
> >
> >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > --
> > 2.27.0
> >
>



* Re: [PATCH v3 1/2] content: Explain better the status clearing bits
  2021-11-12  3:46   ` Jason Wang
@ 2021-11-12 11:41     ` Eugenio Perez Martin
  0 siblings, 0 replies; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-12 11:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Fri, Nov 12, 2021 at 4:46 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > The spec tells that "The driver MUST NOT clear a device status bit", but
> > a device using PCI transport reset a virtio device writing 0 to device
> > status. In some way, that is to clear all its bits.
> >
> > Instead of add an exception, tell explicitely the status bits that
> > the driver cannot clear anytime in a normal operation, so conformant
> > device and drivers keeps being conformant.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  content.tex | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/content.tex b/content.tex
> > index 5d112af..2aa3006 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -60,9 +60,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> >  initialization sequence specified in
> >  \ref{sec:General Initialization And Device Operation / Device
> >  Initialization}.
> > -The driver MUST NOT clear a
> > -\field{device status} bit.  If the driver sets the FAILED bit,
> > -the driver MUST later reset the device before attempting to re-initialize.
> > +The driver MUST NOT clear ACKNOWLEDGE, DRIVER, DRIVER_OK, FEATURES_OK or
> > +DEVICE_NEEDS_RESET bits of \field{device status}, except if resetting the whole
> > +device.
>
> Any reason for using blacklist here? I guess it is used for patch 2
> (introduce the bit that can be cleared?).
>

That's it. I should have stated it better in the patch message. Thanks
for pointing it out!

> Thanks
>
> > If the driver sets the FAILED bit, the driver MUST later reset the
> > +device before attempting to re-initialize.
> >
> >  The driver SHOULD NOT rely on completion of operations of a
> >  device if DEVICE_NEEDS_RESET is set.
> > --
> > 2.27.0
> >
>



* [virtio-dev] Re: [PATCH v3 1/2] content: Explain better the status clearing bits
  2021-11-12 10:34   ` [virtio-dev] " Cornelia Huck
@ 2021-11-12 11:41     ` Eugenio Perez Martin
  0 siblings, 0 replies; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-12 11:41 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Stefan Hajnoczi, Shahaf Shuler, Oren Duer,
	Halil Pasic, Bodong Wang, Dr . David Alan Gilbert, Parav Pandit,
	Max Gurtovoy

On Fri, Nov 12, 2021 at 11:35 AM Cornelia Huck <cohuck@redhat.com> wrote:
>
> On Thu, Nov 11 2021, Eugenio Pérez <eperezma@redhat.com> wrote:
>
> > The spec tells that "The driver MUST NOT clear a device status bit", but
> > a device using PCI transport reset a virtio device writing 0 to device
>
> I think MMIO uses that mechanism as well?
>
> > status. In some way, that is to clear all its bits.
> >
> > Instead of add an exception, tell explicitely the status bits that
> > the driver cannot clear anytime in a normal operation, so conformant
> > device and drivers keeps being conformant.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  content.tex | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/content.tex b/content.tex
> > index 5d112af..2aa3006 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -60,9 +60,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> >  initialization sequence specified in
> >  \ref{sec:General Initialization And Device Operation / Device
> >  Initialization}.
> > -The driver MUST NOT clear a
> > -\field{device status} bit.  If the driver sets the FAILED bit,
> > -the driver MUST later reset the device before attempting to re-initialize.
> > +The driver MUST NOT clear ACKNOWLEDGE, DRIVER, DRIVER_OK, FEATURES_OK or
> > +DEVICE_NEEDS_RESET bits of \field{device status}, except if resetting the whole
> > +device.  If the driver sets the FAILED bit, the driver MUST later reset the
> > +device before attempting to re-initialize.
>
> I think we need to distinguish "driver wants to clear a status bit" from
> "driver is initiating a reset, and that transport implements that by
> writing 0 to the device status". So, what about
>
> "The driver MUST NOT clear a \field{device status} bit, except when
> setting \field{device status} to 0 as a transport-specific way to
> intitiate a reset."
>
> If we introduce driver-clearable bits later, we can simply make that
> "The driver MUST NOT clear a \field{device status} bit other than
> NEW_BIT, ..."
>

The intention is to introduce a driver-clearable bit in the next
patch, but I should definitely have stated it better in the patch
message.

Thanks!
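
To make the two readings concrete, here is a minimal sketch of the rule
as the patch words it: a write of 0 is treated as a whole-device reset,
and any other write must not clear one of the five listed bits. The
function name status_write_allowed() and the mask name are hypothetical,
and treating 0 as reset is an assumption that only holds for transports
that reset this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_STATUS_ACKNOWLEDGE        1u
#define VIRTIO_STATUS_DRIVER             2u
#define VIRTIO_STATUS_DRIVER_OK          4u
#define VIRTIO_STATUS_FEATURES_OK        8u
#define VIRTIO_STATUS_DEVICE_NEEDS_RESET 64u

/* The bits the patch text says the driver must not clear. */
#define VIRTIO_STATUS_NO_CLEAR_MASK                          \
    (VIRTIO_STATUS_ACKNOWLEDGE | VIRTIO_STATUS_DRIVER |      \
     VIRTIO_STATUS_DRIVER_OK | VIRTIO_STATUS_FEATURES_OK |   \
     VIRTIO_STATUS_DEVICE_NEEDS_RESET)

static_assert(VIRTIO_STATUS_NO_CLEAR_MASK == 0x4f,
              "mask covers exactly the five listed bits");

/* Hypothetical check of a driver status write against that rule. */
static bool status_write_allowed(uint8_t old_status, uint8_t new_status)
{
    if (new_status == 0)
        return true; /* whole-device reset on transports that write 0 */

    /* Bits that are set now and would be clear after the write. */
    uint8_t cleared = (uint8_t)(old_status & ~new_status);
    return (cleared & VIRTIO_STATUS_NO_CLEAR_MASK) == 0;
}
```

Under this sketch, clearing a bit outside the mask (such as the STOP
bit proposed in patch 2, value 16) is accepted, while clearing
DRIVER_OK on its own is not.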





* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-12 10:50     ` Eugenio Perez Martin
@ 2021-11-15  4:08       ` Jason Wang
  2021-11-15 18:16         ` Eugenio Perez Martin
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2021-11-15  4:08 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Fri, Nov 12, 2021 at 6:51 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Fri, Nov 12, 2021 at 5:18 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > From: Jason Wang <jasowang@redhat.com>
> > >
> > > This patch introduces a new status bit STOP. This can be used by the
> > > driver to stop the device in order to safely fetch used descriptors
> > > status, making sure the device will not fetch new available ones.
> > >
> > > Its main use case is live migration, although it has other orthogonal
> > > use cases. It can be used to safely discard requests that have not been
> > > used: in other words, to rewind available descriptors.
> > >
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >
> > So this is much more complicated, see below.
> >
>
> I agree it's more complicated, but it addresses some concerns raised
> on previous patches sent to the list. Not saying that all of them must
> be addressed, or addressed this way though :).
>
> > > ---
> > >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 83 insertions(+)
> > >
> > > diff --git a/content.tex b/content.tex
> > > index 2aa3006..9ed0d09 100644
> > > --- a/content.tex
> > > +++ b/content.tex
> > > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> > >    drive the device.
> > >
> > > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > +  device has been stopped by the driver. This status bit is different
> > > +  from the reset since the device state is preserved.
> > > +
> > > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > +  device could not stop the STOP request.
> > > +
> > >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> > >    an error from which it can't recover.
> > >  \end{description}
> > > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > >  recover by issuing a reset.
> > >  \end{note}
> > >
> > > +If VIRTIO_F_STOP has been negotiated,
> >
> > "has not been" actually?
> >
>
> I think the sentence is ok. In other words, "Even when VIRTIO_F_STOP
> *has been* negotiated (in other words, driver sent FEATURES_OK), the
> driver must not set or clear the STOP bit before setting DRIVER_OK".

Ok, but what happens if we simply allow STOP to be set while
DRIVER_OK is not set? It looks to me that DRIVER_OK doesn't
conflict with STOP.

(We allow STOP to be set after DRIVER_OK anyhow.)

>
> > > the driver MUST NOT set or clear STOP if
> > > +DRIVER_OK is not set.
> > > +
> > > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > > +acknowledges the new paused status setting the first, or the failure setting
> > > +the last. Since this change may not be instantaneous, the driver MAY wait for
> > > +the configuration change notification that the device must send after the
> > > +change.
> >
> > This is kind of tricky, it means the device can send notification
> > after it has been stopped.
>
> I don't think this part it's so tricky. That notification is also sent
> when the DEVICE_NEEDS_RESET bit is set,

I think they are different, DEVICE_NEEDS_RESET doesn't mean the device
is stopped. But what we want to achieve is to make sure there won't be
any interaction between device and driver after STOP is set by device.

> and (as I read) is for the
> same reason somehow: To avoid the status polling:
> * "The driver SHOULD NOT rely on completion of operations of a device
> if DEVICE_NEEDS_RESET is set." (copied from the standard)
> * The reading of the status field could be expensive / inconvenient in
> each operation.

It makes sense for a device-initiated event to use an interrupt. But
a stop is driver initiated; in this case the driver won't start
the work (for example the cleanup) until it makes sure the device is
stopped. Polling the status should be fine as this is how the rest
works. Does anything make stop differ from reset here? Or what worries
you without the interrupt?

> * Solution: Instead of polling, make a device facility to notify the
> driver that it cannot trust the device is going to behave properly /
> same as before anymore via notification.
>
> We can add another exception to the "device configuration space
> change" in "Notification of Device Configuration Changes", like the
> one already present:
> "In addition, this notification is triggered by the device setting
> DEVICE_NEEDS_RESET".
>
> I understand it sounds tricky that the device sends a notification
> when it's stopped, but in my opinion it's aligned with previous
> behavior (DEVICE_NEEDS_RESET),

I think not, e.g. DEVICE_NEEDS_RESET doesn't (or can't) mean the
device won't process the buffer or send an interrupt.

> it's explicitly stated that it will be
> the last one, and it's caused because of the inconvenience of polling
> device status. Even if the driver can use other mechanisms.

I think STOP works much more like reset than NEEDS_RESET. The
only difference from reset is that STOP needs to preserve the device
state, and we don't (or can't) use an interrupt to signal the
completion of reset.

>
> If the community still has concerns about it, another option is to
> actually extract the way the device notifies it from the general
> facilities, and make it transport specific. But to use the device
> configuration change notification for this makes sense to me. The
> device configuration has changed.

See above, I think we should have a consistent way to handle reset and stop.

>
> > As discussed in the previous versions,
> > driver is freed to use timer or what ever other mechanism if it
> > doesn't like the busy polling. I wonder how much value we can gain
> > from a dedicated config interrupt. Especially consider some transport
> > can use transport specific interrupt (not virtio specific interrupt)
> > for reporting whether or not set status succeed.
> >
>
> In my opinion, *if* we agree that a stop is a virtio facility and not
> a per-device one, and *if* we agree that a notification is required
> for the device to notify the stop, it makes sense to use a
> transport-independent mechanism that the device must already
> implement.

So the major question is: why is a notification a must? And just to be
clear, there could be transport specific mechanisms for error
reporting.

E.g.

1) PCI can have non-posted writes; if we use a non-posted write to carry
the stop command, the device can return whether or not the device is
stopped successfully.

or

2) Some other transport can convert the setting of the stop status bit
into a command and queue it to a device specific queue; the device can
then use its own specific interrupt to report when the stop is handled
(success or fail)

>
> > >If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > > +try new STOP attempts.
> >
> > Does the device need to re-read the STOP_FAILED for synchronization?
>
> I tend to see the status as something that belongs to the device and
> is exposed to the driver. In that sense, the write from the guest
> triggers an event on the device, and the device decides what will be
> exposed on that field (MMIO?) on the next driver read. If it's not
> that way, we couldn't use the STOP bit that way, right?

Yes, but this is not an answer to my question. It's about the
ordering: when the write returns, it doesn't mean the write has arrived
at the device; this is the case for PCI at least. So we need a mechanism
to make sure the write arrives at the device (a PCI read will flush
previous writes).

>
> > I
> > wonder how much we can gain from STOP_FAILED, the patch is unclear on
> > when that the device needs to set this bit. And driver can choose to
> > reset after a specific timeout anyhow.
> >
>
> The conditions where the device needs to set this bit are unspecified
> because it depends on the device: Not only to the kind of device, but
> also on the device backend.
>
> The same condition (regarding the possibility of handling the pending
> buffers) could cause different devices to react differently. A network
> device could decide it's fine to drop pending tx, let the guest think
> that "the network lost them", and mark them as done,

We may meet a similar issue during reset.

> where a
> persistent storage cannot do that for write requests. Just as an
> example, not saying that networking devices must do that :).

So I think this brings extra complexity that we probably don't need to
worry about now. The reason is that the spec doesn't allow the reset
to fail.

>
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> >
> > Any motivation for this? it looks to me it makes the feature coupled
> > with the virtqueue state proposal? It seems odd to allow avail change
> > but not the last_avail_idx change.
> >
>
> On second thought, I think you are right and this overlaps with the
> state proposal.
>
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > +the driver MAY change any descriptor.
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> > > +the driver can resume it clearing the STOP status bit. It MUST re-read the
> > > +device status to ensure the STOP bit is clear after the write. The device
> > > +acknowledges the new status clearing it. Since this change may not be
> > > +instantaneous, the driver MAY wait for the configuration change notification
> > > +that the device must send after the change.
> >
> > Do we really needs resuming? it's kind of:
> >
> > 1) STOP -> clear STOP
> >
> > vs
> >
> > 2) STOP -> RESET -> DRIVER_OK
> >
> > Using 2) preserve the semantic that the driver can't clear the status bit.
> >
>
> You are totally right in that regard. But the use case simplifies the
> operation when the driver only wants to take back some available
> descriptors still not used, in the range last_avail_idx..avail_idx.
> Doing that could be a big burden for drivers, who would need to
> re-send every status. MST proposed that use case at [1].

Yes, but it looks to me this doesn't require the resuming? And the per
virtqueue reset is being proposed here.

https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07818.html

Actually, there's a subtle difference between 1) and 2). That is, using
2) doesn't guarantee that we can "resume" from the index where we
stopped. But this won't be an issue considering we know that we need to
support setting the device virtqueue state (index). So if we want to
resume from the exact index it could be:

STOP -> RESET -> setting index -> DRIVER_OK
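
That sequence can be sketched with a toy model. Everything here is
illustrative and assumed: struct toy_dev, toy_reset(),
toy_set_vq_index() and rewind_via_reset() are hypothetical names, the
index-setting step stands in for the separate virtqueue state proposal
(it is not part of the released spec), STOP (16) is the value proposed
in this series, and the toy device acknowledges STOP immediately.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_STATUS_ACKNOWLEDGE 1u
#define VIRTIO_STATUS_DRIVER      2u
#define VIRTIO_STATUS_DRIVER_OK   4u
#define VIRTIO_STATUS_FEATURES_OK 8u
#define VIRTIO_STATUS_STOP        16u /* proposed in this patch */

/* Toy device: a status byte plus the index it will consume next. */
struct toy_dev {
    uint8_t status;
    uint16_t last_avail_idx;
};

/* Reset drops all device state, including the virtqueue index. */
static void toy_reset(struct toy_dev *d)
{
    d->status = 0;
    d->last_avail_idx = 0;
}

/* Stand-in for the virtqueue state proposal: only before DRIVER_OK. */
static void toy_set_vq_index(struct toy_dev *d, uint16_t idx)
{
    assert(!(d->status & VIRTIO_STATUS_DRIVER_OK));
    d->last_avail_idx = idx;
}

/* STOP -> RESET -> setting index -> DRIVER_OK */
static void rewind_via_reset(struct toy_dev *d, uint16_t resume_idx)
{
    d->status |= VIRTIO_STATUS_STOP;   /* toy device stops at once */
    toy_reset(d);
    /* Abbreviated re-initialization; feature negotiation omitted. */
    d->status |= VIRTIO_STATUS_ACKNOWLEDGE | VIRTIO_STATUS_DRIVER |
                 VIRTIO_STATUS_FEATURES_OK;
    toy_set_vq_index(d, resume_idx);
    d->status |= VIRTIO_STATUS_DRIVER_OK;
}
```

In this model the driver never clears a single status bit: STOP
disappears only as a side effect of the reset.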

>
> In that regard, the straightforward thing to do is modify avail_idx /
> descriptors from that range and let resume. However, the RESET path
> makes it easier to implement the device part of course, and the guest
> can also achieve the rewind that way.
>
> > > +
> > >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> > >
> > >  The device MUST NOT consume buffers or send any used buffer
> > >  notifications to the driver before DRIVER_OK.
> > >
> > > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > > +or clear of STOP.
> > > +
> > > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > > +operations after the driver writes STOP.
> >
> > I wonder if it's better to leave this to device to decide. E.g some
> > block devices may requires a very log time to finish the inflight
> > operations.
> >
>
> (Letting out SVQ + inflight descriptors for this part of the response,
> I will come back to it later)
>
> But if virtqueue is not valid anymore, how can it report them when
> finished?

It's still valid since the STOP bit is not set by the device.

> In that sense, I would say it's better to report failure and
> let the guest handle it as if the disk is unavailable (timeout?
> temporary faulty sector? I'm not sure what is the most suitable way).

This could be addressed by leaving the following choices to the devices:

1) complete the inflight requests
2) a device or virtio specific mechanism for reporting inflight
descriptors

>
> *If* we are not going to allow the guest to resume operation, where it
> knows all the status of the device, then there is no value on let the
> device delay the operation: From the guest point of view it either
> succeed to send to the device backend and somebody else caused a
> failure (external network lose the tx packet, bit rotting caused I'm
> reading a different value than previously written), or it failed at
> the stop moment.

So it's highly device specific: e.g. for ethernet we can afford the
loss of packets, but not for block devices, so reporting inflight
descriptors may help to resubmit those after "resuming".

>
> This is different with the resume possibility, where the device can
> decide to hold the descriptors, stop operating, and then resume
> operation.
>
> > > Depending on the device, it can do it
> > > +in many ways as long as the driver can recover its normal operation if it
> > > +resumes the device without the need of resetting it:
> > > +\begin{itemize}
> > > +\item Drain and wait for the completion of all pending requests until a
> > > +convenient avail descriptor. Ignore any other posterior descriptor.
> > > +\item Return a device-specific failure for these descriptors, so the driver
> > > +can choose to retry or to cancel them.
> >
> > If we allow the driver to retry, we need a way to report inflight
> > buffers which is not supported by the spec. A way to solve this is to
> > make it device specific.
> >
>
> Regarding the retry, I don't get you here. Re-reading the patch, I
> think that "driver retry" is very ambiguous: I meant for the device to
> mark the descriptor as used, but with a communication specific error
> code, so the application, guest kernel, etc (driver in the standard)
> can decide to retry.

That's why I think introducing the virtqueue state is a must for stop.
With all the indexes defined, it would be much easier to describe what
the device or driver is expected to do.

>
> Regarding the in-flight descriptor report, it's interesting but I
> cannot see a way where it does not complicate the solution a lot or
> adds new dependencies. I have the next thoughts:
> 1) If it works as inflight_fd, "a region of shared memory"
> 1.1) This region must be in the guest's AS so the device has access to
> it. This either invalidates the use of STOP from the driver point of
> view as "let me know where you are not going to modify the guest's
> memory anymore".
> 1.2) This region is on the hypervisor's AS. If the device supports it,
> it is possible to implement the SVQ without the need of STOP bit. This
> is equivalent to "I have a PF that also supports VF dirty memory
> tracking".
> 2) If it works as the config space, where the driver can ask for its
> status, STOP means "STOP writing used and report via config space". No
> need for reset.
>
> Did you have something different in mind?

Not sure, maybe config space is better. What I want is to make the
feature as small as possible while leaving room for future extension.

E.g. we start from a feature that is sufficient for networking
devices (but doesn't prevent future work to extend it to block
devices). I'm not familiar with the block device, but mandating the
completion of inflight descriptors may cause trouble, e.g. unexpected
downtime during live migration.

>
> > > +\item Mark them as done even if they are not, if the kind of device can
> > > +assume to lose them.
> >
> > I think "make buffer used" is better than "mark them as done". And we
> > need a accurate definition on who is "them".
> >
>
> All items include other operations, like the ones that the device must
> do internally to process the control virtqueue. But I cannot find an
> example where telling the driver they are done when it's not is valid
> for this particular item.
>
> But I agree it needs better wording.
>
> And I will s/them/operations/. for the next one.
>
> > > +\end{itemize}
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > > +a guest's request,
> >
> > It's not clear what did "a guest's request" means.
> >
>
> Right. Would "operation" fit better here?

Still unclear, I guess this sentence tries to define when the device
can fail the stop?

>
> > > the device MUST set the STOP_FAILED bit for the guest to
> > > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > > +clears STOP_FAILED.
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > > +and the device can pause its operation, the device MUST set the descriptors
> > > +that it has done with them as used before exposing the STOP status bit as set.
> > > +
> > > +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> > > +after exposing the STOP bit set:
> > > +\begin{itemize}
> > > +\item Read updates on the descriptor or driver area, or consume more buffers.
> > > +\item Send any used buffer notifications to the driver.
> > > +\end{itemize}
> > > +
> > > +The device MUST send a configuration space change right after exposing the STOP
> > > +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> > > +send another configuration space change notification to the driver afterwards
> > > +until the guest clears it.
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> > > +the device MUST resume operation when the driver clears the STOP bit. The
> > > +device MUST continue reading available descriptors as if an available buffer
> > > +notification has reach it, starting from the last descriptor it marked as used,
> >
> > So I still tend to define virtqueue state as basic facility before
> > defining STOP. It can makes thing easier.
> >
>
> Yes, coming back to that approach can simplify the whole proposal.
>
> > > +and continue the regular operation after that. The device MUST read again
> > > +descriptor and driver area beyond the last descriptor it marked as used when it
> > > +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> > > +if for some reason it cannot continue.
> > > +
> > >  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> > >  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
> > >  MUST send a device configuration change notification to the driver.
> > > @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > >    transport specific.
> > >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> > >
> > > +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> > > +  stop the device.
> > > +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> > > +
> > >  \end{description}
> >
> > So I think the patch complicate thing is various ways:
> >
> > 1) STOP_FAILED status bit, which seems unnecessary or even duplicated
> > with NEEDS_RESET
> > 2) configuration change interrupt, looks conflict with the semantic of STOP
>
> I'm not sure about those two, I find we will have devices with unbound
> stop time where both can be useful if we agree on making this a
> general facility.

If the unbounded stop time is the only worry, a way to report inflight
descriptors looks like a better solution. And STOP_FAILED is actually
not accurate, since it means the stop did not finish within a bounded
time (but then we need to define how long a bounded time is).

> Resetting the whole device because of this leaves
> the driver with no possibility of knowing the state of the sent
> descriptors.
>
> Of course, if these use cases are not interesting, it's easier to
> leave them out for sure.
>
> > 3) status bit clearing (resuming), a functional duplication with RESET
> > + DRIVER_OK
> >
>
> I agree it can be obtained with a whole reset, so it can be out and
> leave it for the future if needed. However it seems overkill if we
> just want to rewind some descriptors back, and there is no standard
> way to recover the device status beyond vq_state.

It's more about the minimal self-contained set of the new features. If
it's just rewind, device or virtqueue reset is sufficient. If we want
to obtain the state, virtqueue state is a must and with virtqueue
state, resuming (clearing STOP) is not a must.

Thanks

>
> Thanks!
>
> > I think we'd better to stick to the minimal set of the function to
> > reduce the complexity: virtqueue state + STOP bit (without clearing
> > and no config interrupt).
> >
>
> [1] https://lists.oasis-open.org/archives/virtio-comment/202107/msg00043.html
>
> > Thanks
> >
> > >
> > >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > > --
> > > 2.27.0
> > >
> >
>



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-15  4:08       ` Jason Wang
@ 2021-11-15 18:16         ` Eugenio Perez Martin
  2021-11-16  6:56           ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-15 18:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Mon, Nov 15, 2021 at 5:08 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Nov 12, 2021 at 6:51 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Fri, Nov 12, 2021 at 5:18 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > >
> > > > This patch introduces a new status bit STOP. This can be used by the
> > > > driver to stop the device in order to safely fetch used descriptors
> > > > status, making sure the device will not fetch new available ones.
> > > >
> > > > Its main use case is live migration, although it has other orthogonal
> > > > use cases. It can be used to safely discard requests that have not been
> > > > used: in other words, to rewind available descriptors.
> > > >
> > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > >
> > > So this is much more complicated, see below.
> > >
> >
> > I agree it's more complicated, but it addresses some concerns raised
> > on previous patches sent to the list. Not saying that all of them must
> > be addressed, or addressed this way though :).
> >
> > > > ---
> > > >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 83 insertions(+)
> > > >
> > > > diff --git a/content.tex b/content.tex
> > > > index 2aa3006..9ed0d09 100644
> > > > --- a/content.tex
> > > > +++ b/content.tex
> > > > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> > > >    drive the device.
> > > >
> > > > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > +  device has been stopped by the driver. This status bit is different
> > > > +  from the reset since the device state is preserved.
> > > > +
> > > > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > +  device could not stop the STOP request.
> > > > +
> > > >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> > > >    an error from which it can't recover.
> > > >  \end{description}
> > > > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > >  recover by issuing a reset.
> > > >  \end{note}
> > > >
> > > > +If VIRTIO_F_STOP has been negotiated,
> > >
> > > "has not been" actually?
> > >
> >
> > I think the sentence is ok. In other words, "Even when VIRTIO_F_STOP
> > *has been* negotiated (in other words, driver sent FEATURES_OK), the
> > driver must not set or clear the STOP bit before setting DRIVER_OK".
>
> Ok, but what happens if we simply allow the STOP to be set if
> DRIVER_OK is not set? It looks to me that the DRIVER_OK doesn't
> conflict with STOP.
>
> (Anyhow we allow to set STOP after DRIVER_OK)
>

We could change it to "the driver MUST NOT set or clear STOP if
FEATURES_OK is not set", which would allow the driver to start a
device in stop mode. Setting or clearing STOP before FEATURES_OK is
definitely something a well-behaved driver should not do.

But if we don't allow the resume, it makes little sense to allow the
driver to start (as "set DRIVER_OK bit") in stop mode anyhow. I would
say that it is better to limit that now, and allow it in the future if
we find a valid use case, enabling a specific feature flag for it.

I'm also fine if we decide to leave this unspecified, but limiting it
now could enable us to make something useful with it in the future.

> >
> > > > the driver MUST NOT set or clear STOP if
> > > > +DRIVER_OK is not set.
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > > > +acknowledges the new paused status setting the first, or the failure setting
> > > > +the last. Since this change may not be instantaneous, the driver MAY wait for
> > > > +the configuration change notification that the device must send after the
> > > > +change.
> > >
> > > This is kind of tricky, it means the device can send notification
> > > after it has been stopped.
> >
> > I don't think this part it's so tricky. That notification is also sent
> > when the DEVICE_NEEDS_RESET bit is set,
>
> I think they are different, DEVICE_NEEDS_RESET doesn't mean the device
> is stopped.

To clarify, what I meant is that there are situations where this
notification is raised even though it is the device status, not the
device configuration, that has changed.

DEVICE_NEEDS_RESET does not mean the device is stopped, but it (should)
signal the driver that further interaction with the device will
certainly be invalid. I may be wrong about this, but this way of
notifying the driver relieves it of the need to check the status on
every interaction.

> But what we want to achieve is to make sure there won't be
> any interaction between device and driver after STOP is set by device.
>

If I understand you correctly, what you mean is that a driver
could (and I think it often will) observe the status change in
this order:
1) STOP bit is set
2) Configuration change notification arrives

And 2) is weird, since the device has in some sense promised no more
interaction.

I agree to some extent, but it can also be read from the opposite
angle: from the moment the driver sets DRIVER_OK, every change on the
device (status or config) is notified using the configuration change
interrupt.

a) Regarding the standard, I don't see it as so different from
DEVICE_NEEDS_RESET: the config change remains an out-of-band
notification system the driver can rely on to learn of an (expected)
status change.
b) I don't see a big deal in changing the semantics from "no more
interaction from the device" to "no more interaction except the
expected config change interrupt".
c) It's easy to ignore the interrupt, or even not to treat it
specially after the stop: the driver should already scan the config to
look for changes in configuration and status, and it will simply find
none. Although this is not widely implemented as far as I can see.

In that regard, I feel that interaction is very innocuous, and to me
it is the straightforward solution to avoid active polling.
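
For comparison, the polling alternative could look roughly like the
sketch below. It is only a toy model: the `toy_*` names are
hypothetical, STOP (16) is the value proposed in this series, and the
device is assumed to expose STOP only after a number of status reads,
standing in for in-flight work that has to drain first.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_STATUS_DRIVER_OK 4u
#define VIRTIO_STATUS_STOP      16u  /* value proposed in this series */

/* Hypothetical device model: STOP becomes visible only after `pending`
 * status reads, standing in for in-flight work that must drain first. */
struct toy_dev {
    uint8_t  status;
    unsigned pending;
    int      stop_requested;
};

/* The device ignores a STOP write unless DRIVER_OK is already set. */
static void toy_write_status(struct toy_dev *d, uint8_t val)
{
    if ((val & VIRTIO_STATUS_STOP) && (d->status & VIRTIO_STATUS_DRIVER_OK))
        d->stop_requested = 1;
}

static uint8_t toy_read_status(struct toy_dev *d)
{
    if (d->stop_requested) {
        if (d->pending == 0)
            d->status |= VIRTIO_STATUS_STOP;
        else
            d->pending--;
    }
    return d->status;
}

/* Returns the number of status reads needed, or -1 if max_polls ran out. */
static int toy_stop_by_polling(struct toy_dev *d, int max_polls)
{
    int polls;

    assert(max_polls > 0);
    toy_write_status(d, d->status | VIRTIO_STATUS_STOP);
    for (polls = 1; polls <= max_polls; polls++) {
        if (toy_read_status(d) & VIRTIO_STATUS_STOP)
            return polls;
        /* A real driver would sleep or back off here. */
    }
    return -1;
}
```

The trade-off is visible in the loop: a device with unbounded stop time
forces the driver to choose between many status reads and a longer
back-off, which is the cost the notification is meant to avoid.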

> > and (as I read) is for the
> > same reason somehow: To avoid the status polling:
> > * "The driver SHOULD NOT rely on completion of operations of a device
> > if DEVICE_NEEDS_RESET is set." (copied from the standard)
> > * The reading of the status field could be expensive / inconvenient in
> > each operation.
>
> It makes sense for the device initiated event to use interrupt. But
> for a stop, it's driver initiated, in this case the driver won't start
> the work (for example the cleanup) after it makes sure the device is
> stopped. Polling the status should be fine as this is how the rest
> works. Anything makes stop differ from reset here? Or what worries you
> without the interrupt?
>

This is proposed only in the scope of the concerns I saw raised in
previous series: the time to stop a device could be unbounded, and
tricks to poll less frequently will increase migration time.

I will fully agree if this is left for the future: it is easy to
implement this chunk of the proposal under a separate feature flag if
the need arises. Sorry if that part was not clear enough.

> > * Solution: Instead of polling, make a device facility to notify the
> > driver that it cannot trust the device is going to behave properly /
> > same as before anymore via notification.
> >
> > We can add another exception to the "device configuration space
> > change" in "Notification of Device Configuration Changes", like the
> > one already present:
> > "In addition, this notification is triggered by the device setting
> > DEVICE_NEEDS_RESET".
> >
> > I understand it sounds tricky that the device sends a notification
> > when it's stopped, but in my opinion it's aligned with previous
> > behavior (DEVICE_NEEDS_RESET),
>
> I think not,  e.g DEVICE_NEEDS_RESET doesn't (or it can't) mean the
> device won't process the buffer or send an interrupt.
>

From the driver point of view, it means that the driver cannot trust
the device anymore until the reset, so the driver actions are similar:

""
the driver can’t assume requests in flight will be completed if
DEVICE_NEEDS_RESET is set, nor can it assume that they have not been
completed
""

(Sorry for being circular here, but I think it applies here too.) What
I meant is that the device sends an out-of-band notification when the
device status changes. The driver could check the status field before
processing every used buffer, and also with a timer just in case, and
then DEVICE_NEEDS_RESET would not need the configuration change
interrupt. But the interrupt makes the whole operation more convenient.

Every time the driver gets that interrupt, it must re-check all the
device configuration and status anyway. It can still make buffers
available while processing it, but that's the meaning of the interrupt
to me. And a status change after DRIVER_OK fits that, from my point
of view.

> > it's explicitly stated that it will be
> > the last one, and it's caused because of the inconvenience of polling
> > device status. Even if the driver can use other mechanisms.
>
> I think STOP works much more similarly to reset not NEEDS_RESET. The
> only difference with reset is that STOP needs to preserve the device
> states and we don't (or can't) use interrupt to signal the completion
> of reset.
>

From the semantic point of view, yes. But in practical terms we can
face unbounded time. I mean, both operations have unbounded time for
sure, but I would say that any device should handle reset way faster
than STOP.

I fully agree with your point, but I can also see it the other way
around: it would be convenient to have a configuration interrupt for
the reset too, but that is impossible since we cannot configure any
interrupt before the reset.

> >
> > If the community still has concerns about it, another option is to
> > actually extract the way the device notifies it from the general
> > facilities, and make it transport specific. But to use the device
> > configuration change notification for this makes sense to me. The
> > device configuration has changed.
>
> See above, I think we should have a consistent way to handle reset and stop.
>
> >
> > > As discussed in the previous versions,
> > > driver is freed to use timer or what ever other mechanism if it
> > > doesn't like the busy polling. I wonder how much value we can gain
> > > from a dedicated config interrupt. Especially consider some transport
> > > can use transport specific interrupt (not virtio specific interrupt)
> > > for reporting whether or not set status succeed.
> > >
> >
> > In my opinion, *if* we agree that a stop is a virtio facility and not
> > a per-device one, and *if* we agree that a notification is required
> > for the device to notify the stop, it makes sense to use a
> > transport-independent mechanism that the device must already
> > implement.
>
> So the major question is why a notification is a must? And Just to be
> clear, there could be transport specific mechanisms for error
> reporting.
>
> E,g
>
> 1) PCI can have non-posted write, if we use non-posted write to carry
> the stop command, the device can return whether or not the device is
> stopped successfully.
>
> or
>
> 2) Some other transport can convert the stop status bit set into a
> command and queue it to device specific queue, device can then use
> it's own specific interrupt to report the when the stop is handled
> (success or fail)
>

I would be totally fine with that too.

> >
> > > >If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > > > +try new STOP attempts.
> > >
> > > Does the device need to re-read the STOP_FAILED for synchronization?
> >
> > I tend to see the status as something that belongs to the device and
> > is exposed to the driver. In that sense, the write from the guest
> > triggers an event on the device, and the device decides what will be
> > exposed on that field (MMIO?) on the next driver read. If it's not
> > that way, we couldn't use the STOP bit that way, right?
>
> Yes, but this is not an answer to my question. It's about the
> ordering, when write returns it doesn't mean the write arrives at the
> device, this is the case of PCI at least. So we need a mechanism to
> make sure the write arrives at the device (PCI read will flush
> previous write).
>

I didn't see that in your original question, sorry. But the PCI read
that flushes the write is the driver's, isn't it?

In that case I would say that "the read" is part of "the write". It's
an issue of the PCI protocol, which I would say doesn't belong in this
section (or even this document?): to implement virtio over PCI, you
know that virtio needs a write and, in particular, you know that PCI
needs a subsequent read to make sure that write takes effect.

Either that, or the driver must use non-posted writes if it wants
the device to notice it.

Or am I still missing something?

> >
> > > I
> > > wonder how much we can gain from STOP_FAILED, the patch is unclear on
> > > when that the device needs to set this bit. And driver can choose to
> > > reset after a specific timeout anyhow.
> > >
> >
> > The conditions where the device needs to set this bit are unspecified
> > because it depends on the device: Not only to the kind of device, but
> > also on the device backend.
> >
> > The same condition (regarding the possibility of handling the pending
> > buffers) could cause different devices to react differently. A network
> > device could decide it's fine to drop pending tx, let the guest think
> > that "the network lost them", and mark them as done,
>
> We may meet the similar issue during reset.
>

Yes, but the driver should be able to tolerate a failed reset: either
it does not want to use the device anymore or it wants to completely
override the device state. If a stop fails, the driver would expect the
device to continue operating in my opinion, because it will be
impossible to recover the device state.

This is again something that we could leave out if we decide it is not
necessary at this moment: it just shows how a concern of previous
proposals can be solved, at least technically.

> > where a
> > persistent storage cannot do that for write requests. Just as an
> > example, not saying that networking devices must do that :).
>
> So I think this brings extra complexity that we probably don't need to
> worry about now. The reason is that the spec doesn't allow the reset
> to fail.
>

It can be left for the future for sure.

> >
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > >
> > > Any motivation for this? it looks to me it makes the feature coupled
> > > with the virtqueue state proposal? It seems odd to allow avail change
> > > but not the last_avail_idx change.
> > >
> >
> > On second thought, I think you are right and this overlaps with the
> > state proposal.
> >
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > +the driver MAY change any descriptor.
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> > > > +the driver can resume it clearing the STOP status bit. It MUST re-read the
> > > > +device status to ensure the STOP bit is clear after the write. The device
> > > > +acknowledges the new status clearing it. Since this change may not be
> > > > +instantaneous, the driver MAY wait for the configuration change notification
> > > > +that the device must send after the change.
> > >
> > > Do we really needs resuming? it's kind of:
> > >
> > > 1) STOP -> clear STOP
> > >
> > > vs
> > >
> > > 2) STOP -> RESET -> DRIVER_OK
> > >
> > > Using 2) preserve the semantic that the driver can't clear the status bit.
> > >
> >
> > You are totally right in that regard. But the use case simplifies the
> > operation when the driver only wants to take back some available
> > descriptors still not used, in the range last_avail_idx..avail_idx.
> > Doing that could be a big burden for drivers, who would need to
> > re-send every status. MST proposed that use case at [1].
>
> Yes, but it looks to me this doesn't require the resuming? And the per
> virtqueue reset is being proposed here.
>
> https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07818.html
>
> Actually, there's a subtle difference between 1) and 2). That is using
> 2) doesn't make sure we can "resume" from the index where we stopped.
> But this won't be an issue considering we know that we need to support
> setting device virtqueue state(index). So if we want to resume from
> the exact index it could be:
>
> STOP -> RESET -> setting index -> DRIVER_OK
>

By "state" I meant more than the VQ state: the device state in
general. For example, for the network, you must also send all the
needed control commands to recover mac, rx filters, etc.

That's what I meant with "if you just want to rewind some descriptors,
resetting the whole device is overkill".

The example may be wrong; I can also think of virtiofs and the need to
keep files opened:
* If we go through a full reset cycle, the files opened may not be
the same as the closed ones, like deleted files with open handles.
* If we go through a full reset cycle, watchers may skip a change.

Of course, this complexity may be left for the future, simply stating
that, if that is the case, the device cannot offer the stop feature.
Virtiofs already has other complexities that make its migration
hard, but I think the point is explained.
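
For the plain rewind case, the constraint from the patch (the new
avail_idx must lie within used_idx and used_idx plus the virtqueue
size) could be checked as in the sketch below. This is only an
illustration: the function name is hypothetical, and it relies on the
split virtqueue indexes being free-running 16-bit counters, so the
distance is computed modulo 2^16.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Split-ring indexes are free-running 16-bit counters, so the distance
 * between them is computed modulo 2^16 and survives wraparound. */
static bool avail_idx_in_range(uint16_t used_idx, uint16_t new_avail_idx,
                               uint16_t queue_size)
{
    assert(queue_size != 0);
    return (uint16_t)(new_avail_idx - used_idx) <= queue_size;
}
```

For example, with a queue size of 256 and used_idx at 65530, a new
avail_idx of 4 is accepted (distance 10), while an avail_idx one below
used_idx is rejected (distance 65535).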

> >
> > In that regard, the straightforward thing to do is modify avail_idx /
> > descriptors from that range and let resume. However, the RESET path
> > makes it easier to implement the device part of course, and the guest
> > can also achieve the rewind that way.
> >
> > > > +
> > > >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> > > >
> > > >  The device MUST NOT consume buffers or send any used buffer
> > > >  notifications to the driver before DRIVER_OK.
> > > >
> > > > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > > > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > > > +or clear of STOP.
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > > > +operations after the driver writes STOP.
> > >
> > > I wonder if it's better to leave this to the device to decide. E.g. some
> > > block devices may require a very long time to finish the inflight
> > > operations.
> > >
> >
> > (Letting out SVQ + inflight descriptors for this part of the response,
> > I will come back to it later)
> >
> > But if virtqueue is not valid anymore, how can it report them when
> > finished?
>
> It's still valid since the STOP bit is not set by the device.
>

Then I don't understand your answer. To my proposal of:

"If VIRTIO_F_STOP has been negotiated, the device MUST finish any
in-flight operations after the driver writes STOP."

You answered:

"I wonder if it's better to leave this to device to decide. E.g some
block devices may require a very long time to finish the inflight
operations."

The device must finish all requests before it shows the STOP bit as
set to the driver. Maybe it is better to rephrase it like:

If VIRTIO_F_STOP has been negotiated, the device MUST finish any
in-flight operations after the driver writes STOP and before it
exposes the STOP status bit as set.

?

> > In that sense, I would say it's better to report failure and
> > let the guest handle it as if the disk is unavailable (timeout?
> > temporary faulty sector? I'm not sure what is the most suitable way).
>
> This could be addressed by leaving the following choices to the devices:
>
> 1) complete the inflight requests
> 2) device or virtio specific for reporting inflight descriptors
>

As before, I'm not sure how this relates to "the stop bit is not
set by the device", so my answer may be completely wrong here.

Even assuming the device can report in-flight descriptors, it needs to
wait for the backend before reporting them anyhow. And we would need
another indication. What is the use of separating these states
(waiting for the stop bit, waiting for the inflight descriptors to be
valid)?

The only possibility I can come up with is to actually stop the
request right in the middle of an operation. For example, to allow a
big block read to stop and then, when the device is informed about
these inflight descriptors and their progress, to continue. I would
say this is very much out of scope; more about this later ([1]).

> >
> > *If* we are not going to allow the guest to resume operation, where it
> > knows all the status of the device, then there is no value on let the
> > device delay the operation: From the guest point of view it either
> > succeed to send to the device backend and somebody else caused a
> > failure (external network lose the tx packet, bit rotting caused I'm
> > reading a different value than previously written), or it failed at
> > the stop moment.
>
> So it's highly device specific: e.g. for ethernet we can afford the
> loss of packets, but not for block devices, so reporting inflight
> descriptors may help to resubmit those after "resuming".
>

Right.

> >
> > This is different with the resume possibility, where the device can
> > decide to hold the descriptors, stop operating, and then resume
> > operation.
> >
> > > > Depending on the device, it can do it
> > > > +in many ways as long as the driver can recover its normal operation if it
> > > > +resumes the device without the need of resetting it:
> > > > +\begin{itemize}
> > > > +\item Drain and wait for the completion of all pending requests until a
> > > > +convenient avail descriptor. Ignore any other posterior descriptor.
> > > > +\item Return a device-specific failure for these descriptors, so the driver
> > > > +can choose to retry or to cancel them.
> > >
> > > If we allow the driver to retry, we need a way to report inflight
> > > buffers which is not supported by the spec. A way to solve this is to
> > > make it device specific.
> > >
> >
> > Regarding the retry, I don't get you here. Re-reading the patch, I
> > think that "driver retry" is very ambiguous: I meant for the device to
> > mark the descriptor as used, but with a communication specific error
> > code, so the application, guest kernel, etc (driver in the standard)
> > can decide to retry.
>
> That's why I think introducing the virtqueue state is a must for stop,
> With all the indexes defined, it would be much easier to describe what
> the device or driver is expected to work.
>

I still don't see the relationship, sorry.

What I intended to say in the patch is that the device can choose to
just return a device / communication error to indicate that the
transaction has failed at device level while, as far as virtio is
concerned, the buffer is marked as used.

Maybe a good example of this is for the device to choose to return
VIRTIO_BLK_S_IOERR, even if the transaction is still going on in the
block backend, but I don't know the blk device well so I may be
wrong. I guess that with that error code the guest cannot know the
value being written / read, and it is forced to re-read it.
But the virtqueue will be in a good state, and the device can be reset
and can recover its state. It's totally up to the device to choose to
do so.

Virtqueue state is still needed, but not because the device chooses to
return VIRTIO_BLK_S_IOERR, but because it needs a way to recover the
status after the reset.
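
To make the VIRTIO_BLK_S_IOERR idea concrete, here is a rough sketch in
Python. It is purely illustrative: the Request class and
complete_on_stop() are names I made up for this mail, they are not in
the spec or in any real device, and only the two status values come
from the virtio-blk section of the standard. The point is that every
fetched request ends up marked as used before the stop is acknowledged,
with the unfinished ones carrying an error status.

```python
from dataclasses import dataclass
from typing import List, Optional

# Status codes from the virtio-blk section of the spec.
VIRTIO_BLK_S_OK = 0
VIRTIO_BLK_S_IOERR = 1


@dataclass
class Request:
    """Hypothetical model of one block request the device has fetched."""
    desc_id: int                  # head descriptor index of the request
    finished: bool                # True if the backend already completed it
    status: Optional[int] = None  # status byte written back to the driver


def complete_on_stop(inflight: List[Request]) -> List[int]:
    """Mark every fetched request as used before acknowledging the stop.

    Requests the backend finished report OK; the rest report IOERR, so
    the driver can decide to retry or cancel them. Returns the ids in
    the order they would be pushed to the used ring.
    """
    used = []
    for req in inflight:
        req.status = VIRTIO_BLK_S_OK if req.finished else VIRTIO_BLK_S_IOERR
        used.append(req.desc_id)
    return used
```

After this, no descriptor is left "in the middle" from the virtqueue
point of view, which is all I meant by "the virtqueue will be in a good
state".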

> >
> > Regarding the in-flight descriptor report, it's interesting but I
> > cannot see a way where it does not complicate the solution a lot or
> > adds new dependencies. I have the next thoughts:
> > 1) If it works as inflight_fd, "a region of shared memory"
> > 1.1) This region must be in the guest's AS so the device has access to
> > it. This either invalidates the use of STOP from the driver point of
> > view as "let me know where you are not going to modify the guest's
> > memory anymore".

Long shot here, but might this work in combination with the
balloon device? That moves this further and further from simplicity, though...

> > 1.2) This region is on the hypervisor's AS. If the device supports it,
> > it is possible to implement the SVQ without the need of STOP bit. This
> > is equivalent to "I have a PF that also supports VF dirty memory
> > tracking".
> > 2) If it works as the config space, where the driver can ask for its
> > status, STOP means "STOP writing used and report via config space". No
> > need for reset.
> >
> > Did you have something different in mind?
>
> Not sure, maybe config space is better. What I want is to make the
> feature as small as possible but leaving spaces for future extension.
>
> E.g we start from the feature that is sufficient for networking
> devices, (but doesn't prevent the future work to extend it to block
> devices). I'm not familiar with the block device, but mandating the
> completion of inflight descriptors may cause trouble, e.g unexpected
> downtime during live migration.
>

[1] I agree with that, but I feel that "device or virtio specific for
reporting inflight descriptors" is way too broad to be useful at
the moment.

Maybe the best thing to do is to put all the restrictions in place for
now and, when we figure out a good format for the inflight, add
"\item report inflight descriptors". Then, the device and the driver
are free to not accept any combination. Does it make sense?

> >
> > > > +\item Mark them as done even if they are not, if the kind of device can
> > > > +assume to lose them.
> > >
> > > I think "make buffer used" is better than "mark them as done". And we
> > > need a accurate definition on who is "them".
> > >
> >
> > All items include other operations, like the ones that the device must
> > do internally to process the control virtqueue. But I cannot find an
> > example where telling the driver they are done when it's not is valid
> > for this particular item.
> >
> > But I agree it needs better wording.
> >
> > And I will s/them/operations/. for the next one.
> >
> > > > +\end{itemize}
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > > > +a guest's request,
> > >
> > > It's not clear what did "a guest's request" means.
> > >
> >
> > Right. Would "operation" fit better here?
>
> Still unclear, I guess this sentence tries to define when the device
> can fail the stop?
>

Not really, my intention was to add a MUST requirement for when the
device fails. The first is needed for the second though, so maybe we
can rephrase.

If we agree that a device can fail the stop, I think we should not
restrict the circumstances where the device can fail. "If the device
can find external circumstances where it cannot satisfy STOP, it MUST
NOT offer the STOP feature" works for me too, actually.

> >
> > > > the device MUST set the STOP_FAILED bit for the guest to
> > > > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > > > +clears STOP_FAILED.
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > > > +and the device can pause its operation, the device MUST set the descriptors
> > > > +that it has done with them as used before exposing the STOP status bit as set.
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> > > > +after exposing the STOP bit set:
> > > > +\begin{itemize}
> > > > +\item Read updates on the descriptor or driver area, or consume more buffers.
> > > > +\item Send any used buffer notifications to the driver.
> > > > +\end{itemize}
> > > > +
> > > > +The device MUST send a configuration space change right after exposing the STOP
> > > > +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> > > > +send another configuration space change notification to the driver afterwards
> > > > +until the guest clears it.
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> > > > +the device MUST resume operation when the driver clears the STOP bit. The
> > > > +device MUST continue reading available descriptors as if an available buffer
> > > > +notification has reach it, starting from the last descriptor it marked as used,
> > >
> > > So I still tend to define virtqueue state as basic facility before
> > > > defining STOP. It can make things easier.
> > >
> >
> > Yes, coming back to that approach can simplify the whole proposal.
> >
> > > > +and continue the regular operation after that. The device MUST read again
> > > > +descriptor and driver area beyond the last descriptor it marked as used when it
> > > > +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> > > > +if for some reason it cannot continue.
> > > > +
> > > >  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> > > >  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
> > > >  MUST send a device configuration change notification to the driver.
> > > > @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > > >    transport specific.
> > > >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> > > >
> > > > +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> > > > +  stop the device.
> > > > +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> > > > +
> > > >  \end{description}
> > >
> > > > So I think the patch complicates things in various ways:
> > >
> > > 1) STOP_FAILED status bit, which seems unnecessary or even duplicated
> > > with NEEDS_RESET
> > > 2) configuration change interrupt, looks conflict with the semantic of STOP
> >
> > I'm not sure about those two, I find we will have devices with unbound
> > stop time where both can be useful if we agree on making this a
> > general facility.
>
> If the unbound stop time is the only worry, the way to report inflight
> descriptors looks like a better solution.

I'm not sure if that's the only condition under which a device can
fail to stop, but if we agree on that we could prepare a format for
block devices to report them, for example. They are needed in some form
in the networking case too, with packed virtqueues, if buffers are used
out of order.

> And STOP_FAILED is actually
> not accurate since it means the stop is not finished in bound time
> (but we need to define how long should be a bound time?)
>
> > Resetting the whole device because of this leaves
> > the driver with no possibility of knowing the state of the sent
> > descriptors.
> >
> > Of course, if these use cases are not interesting, it's easier to
> > leave them out for sure.
> >
> > > 3) status bit clearing (resuming), a functional duplication with RESET
> > > + DRIVER_OK
> > >
> >
> > I agree it can be obtained with a whole reset, so it can be out and
> > leave it for the future if needed. However it seems overkill if we
> > just want to rewind some descriptors back, and there is no standard
> > way to recover the device status beyond vq_state.
>
> It's more about the minimal self-contained set of the new features. If
> it's just rewind, device or virtqueue reset is sufficient.

I'm not sure if that is true for all devices with the features the
standard offers at the moment, but it might be right for serial.

> If we want
> to obtain the state, virtqueue state is a must and with virtqueue
> state, resuming (clearing STOP) is not a must.
>

Right.

Thanks!

> Thanks
>
> >
> > Thanks!
> >
> > > I think we'd better to stick to the minimal set of the function to
> > > reduce the complexity: virtqueue state + STOP bit (without clearing
> > > and no config interrupt).
> > >
> >
> > [1] https://lists.oasis-open.org/archives/virtio-comment/202107/msg00043.html
> >
> > > Thanks
> > >
> > > >
> > > >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > > > --
> > > > 2.27.0
> > > >
> > >
> >
>



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-15 18:16         ` Eugenio Perez Martin
@ 2021-11-16  6:56           ` Jason Wang
  2021-11-16 14:50             ` Eugenio Perez Martin
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2021-11-16  6:56 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Tue, Nov 16, 2021 at 2:17 AM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Nov 15, 2021 at 5:08 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Nov 12, 2021 at 6:51 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Fri, Nov 12, 2021 at 5:18 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > >
> > > > > This patch introduces a new status bit STOP. This can be used by the
> > > > > driver to stop the device in order to safely fetch used descriptors
> > > > > status, making sure the device will not fetch new available ones.
> > > > >
> > > > > Its main use case is live migration, although it has other orthogonal
> > > > > use cases. It can be used to safely discard requests that have not been
> > > > > used: in other words, to rewind available descriptors.
> > > > >
> > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > >
> > > > So this is much more complicated, see below.
> > > >
> > >
> > > I agree it's more complicated, but it addresses some concerns raised
> > > on previous patches sent to the list. Not saying that all of them must
> > > be addressed, or addressed this way though :).
> > >
> > > > > ---
> > > > >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 83 insertions(+)
> > > > >
> > > > > diff --git a/content.tex b/content.tex
> > > > > index 2aa3006..9ed0d09 100644
> > > > > --- a/content.tex
> > > > > +++ b/content.tex
> > > > > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> > > > >    drive the device.
> > > > >
> > > > > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > > +  device has been stopped by the driver. This status bit is different
> > > > > +  from the reset since the device state is preserved.
> > > > > +
> > > > > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > > +  device could not stop the STOP request.
> > > > > +
> > > > >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> > > > >    an error from which it can't recover.
> > > > >  \end{description}
> > > > > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > >  recover by issuing a reset.
> > > > >  \end{note}
> > > > >
> > > > > +If VIRTIO_F_STOP has been negotiated,
> > > >
> > > > "has not been" actually?
> > > >
> > >
> > > I think the sentence is ok. In other words, "Even when VIRTIO_F_STOP
> > > *has been* negotiated (in other words, driver sent FEATURES_OK), the
> > > driver must not set or clear the STOP bit before setting DRIVER_OK".
> >
> > Ok, but what happens if we simply allow the STOP to be set if
> > DRIVER_OK is not set? It looks to me that the DRIVER_OK doesn't
> > conflict with STOP.
> >
> > (Anyhow we allow to set STOP after DRIVER_OK)
> >
>
> We could change it to "the driver MUST NOT set or clear STOP if
> FEATURES_OK is not set", which would allow the driver to start a
> device in stop mode. Before that should be definitely not done by a
> good driver.

Yes, limiting it before FEATURES_OK is a must.

>
> But if we don't allow the resume, it makes little sense to allow the
> driver to start (as "set DRIVER_OK bit") in stop mode anyhow.

Yes.

> I would
> say that it is better to limit that now, and allow it in the future if
> we find a valid use case, enabling a specific feature flag for it.
>
> I'm also fine if we decide to leave this unspecified, but limiting it
> now could enable us to make something useful with it in the future.

Leaving it unspecified seems better, since adding the limitation
introduces extra effort for future extensions.  But I'm fine with
either.
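
Just to spell out the two variants being compared, a minimal sketch in
Python follows. The function name and the require_driver_ok switch are
made up for illustration; the bit values are the ones in the spec plus
STOP (16) as proposed in this patch.

```python
# Device status bits; STOP (16) is the value proposed in this patch.
DRIVER_OK = 4
FEATURES_OK = 8
STOP = 16


def driver_may_toggle_stop(status: int, require_driver_ok: bool = False) -> bool:
    """Return True if the driver is allowed to set or clear STOP.

    With require_driver_ok=False only FEATURES_OK is required (the
    relaxed wording); with True, DRIVER_OK is required as well (the
    wording in the current patch).
    """
    needed = FEATURES_OK | (DRIVER_OK if require_driver_ok else 0)
    return (status & needed) == needed
```

Either way the check before FEATURES_OK stays the same; the only
difference is whether a device can be started in stop mode.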

>
> > >
> > > > > the driver MUST NOT set or clear STOP if
> > > > > +DRIVER_OK is not set.
> > > > > +
> > > > > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > > > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > > > > +acknowledges the new paused status setting the first, or the failure setting
> > > > > +the last. Since this change may not be instantaneous, the driver MAY wait for
> > > > > +the configuration change notification that the device must send after the
> > > > > +change.
> > > >
> > > > This is kind of tricky, it means the device can send notification
> > > > after it has been stopped.
> > >
> > > I don't think this part is so tricky. That notification is also sent
> > > when the DEVICE_NEEDS_RESET bit is set,
> >
> > I think they are different, DEVICE_NEEDS_RESET doesn't mean the device
> > is stopped.
>
> To clarify, what I meant is that there are situations where this
> notification is raised even if device configuration is not changed,
> but its status.

Right, but again the notification is not a must for status changes
(e.g. reset).

>
> NEED_RESET does not mean the device is stopped, but it (should) signal
> the driver that further interaction with the device will be for sure
> invalid. I may be wrong with this, but this way of notifying the
> driver relieves it to the need for check status in every interaction.

You are right. But it's still different from stop: we don't need to
check it for every interaction. We only need to check it after
the driver tries to stop the device (as with reset).

>
> > But what we want to achieve is to make sure there won't be
> > any interaction between device and driver after STOP is set by device.
> >
>
> If I understand you correctly, what you meant is that a driver
> could (and I think it will a lot of times) read the status change in
> this order:
> 1) STOP bit is set
> 2) Notification change arrives
>
> And 2) is weird since the device promised no more interaction somehow.
>
> I agree to some extent, but it can be read even from the opposite
> angle: From the moment the driver sets DRIVER_OK, every change on the
> device (status or config) is notified using configuration change
> interrupt.

For status, it depends on what you mean by "change". If it's the value
that is read by the driver:

1) The only status that needs to be notified is the one that can
only be set by the device, that is NEEDS_RESET.
2) For status bits that are set by the driver and that the device can
refuse (forever or temporarily), there's no change notification:
DRIVER_OK, FEATURES_OK

STOP belongs to 2). STOP_FAILED belongs to 1), but:

1) STOP status bit 0 means the device is not stopped
2) We don't have DRIVER_FAILED and FEATURE_FAILED, instead, we just
check whether or not the bit is set by the device.

So whether or not we need STOP_FAILED is still questionable.

>
> a) Regarding the standard, I don't see it so different from the
> NEED_RESET: the config change keeps being an out of band notification
> system the driver can rely on to learn of an (expected) status change.
> b) I don't see a big deal with changing the semantic from "no more
> interaction from the device" with "no more interaction but the
> expected config change interrupt".
> c) It's easy to ignore the interrupt, or even not to treat it
> specially after the stop: The driver already should scan config to
> look for changes in configuration and status, it will simply find
> none. Although this is not implemented widely as far as I see.
>
> In that regard, I feel that interaction is very innocuous, and to me
> is the straightforward solution to avoid the active polling.

Well, the driver can certainly choose not to busy poll, even without
the interrupt.

>
> > > and (as I read) is for the
> > > same reason somehow: To avoid the status polling:
> > > * "The driver SHOULD NOT rely on completion of operations of a device
> > > if DEVICE_NEEDS_RESET is set." (copied from the standard)
> > > * The reading of the status field could be expensive / inconvenient in
> > > each operation.
> >
> > It makes sense for the device initiated event to use interrupt. But
> > for a stop, it's driver initiated, in this case the driver won't start
> > the work (for example the cleanup) after it makes sure the device is
> > stopped. Polling the status should be fine as this is how the rest
> > works. Anything makes stop differ from reset here? Or what worries you
> > without the interrupt?
> >
>
> This is proposed only in the scope of the concerns I saw raised in
> previous series: the time to stop a device could be unbound, and
> tricks to poll less frequently will increase migration time.

But I don't see how an interrupt can help to reduce the time spent on
the stop. The downtime is usually a user policy, so the VMM can choose
to time out the stop and perform the resume.

As discussed, the way to advertise the inflight buffers might be a
solution for this.
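
As a rough illustration of "timeout in the driver" without any
interrupt, something like the Python sketch below would do. All the
names here (stop_with_timeout, the read_status / write_status
callbacks, the timeout and poll interval) are assumptions for the
example, not spec text; STOP (16) is the bit value proposed in this
patch.

```python
import time
from typing import Callable

STOP = 16  # status bit value proposed in this patch


def stop_with_timeout(read_status: Callable[[], int],
                      write_status: Callable[[int], None],
                      timeout_s: float,
                      poll_interval_s: float = 0.001,
                      now: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Ask the device to stop and poll the status until it acknowledges.

    Returns True if the device exposed STOP before the deadline, False
    otherwise; on False the caller applies its own policy (resume or
    reset). Sleeping between reads keeps this from being a busy poll.
    """
    write_status(read_status() | STOP)
    deadline = now() + timeout_s
    while True:
        if read_status() & STOP:
            return True
        if now() >= deadline:
            return False
        sleep(poll_interval_s)
```

The timeout value is the user's downtime policy, so it lives in the
VMM and not in the device.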

>
> I will fully agree if these are left to the future: it is easy to
> implement this chunk of the proposal under a separated feature flag if
> this need arises. Sorry if that part was not clear enough.

That's fine.

>
> > > * Solution: Instead of polling, make a device facility to notify the
> > > driver that it cannot trust the device is going to behave properly /
> > > same as before anymore via notification.
> > >
> > > We can add another exception to the "device configuration space
> > > change" in "Notification of Device Configuration Changes", like the
> > > one already present:
> > > "In addition, this notification is triggered by the device setting
> > > DEVICE_NEEDS_RESET".
> > >
> > > I understand it sounds tricky that the device sends a notification
> > > when it's stopped, but in my opinion it's aligned with previous
> > > behavior (DEVICE_NEEDS_RESET),
> >
> > I think not,  e.g DEVICE_NEEDS_RESET doesn't (or it can't) mean the
> > device won't process the buffer or send an interrupt.
> >
>
> From the driver point of view, it means that the driver cannot trust
> the device anymore until the reset, so the driver actions are similar:

I think we need to clarify the exact semantics of STOP_FAILED, and it
will be very hard to distinguish

1) The device cannot be stopped in a short time

from

2) The device can never be stopped

At least in case 1) the driver can still trust the device.

>
> ""
> the driver can’t assume requests in flight will be completed if
> DEVICE_NEEDS_RESET is set, nor can it assume that they have not been
> completed
> ""
>
> (Sorry for being circular here, I think it proceeds here too) What I
> meant is that the device sent an out of band notification when the
> device status changed. The driver could check the status field before
> processing every used buffer and also with a timer just in case, and
> DEVICE_NEEDS_RESET would not need the config interrupt change. But the
> interrupt gives convenience to the whole operation.
>
> Every time the driver gets that interrupt, it must re-check all the
> device configuration and status anyway. It can still make buffers
> available while processing it, but that's the meaning of the interrupt
> to me. And a status change after DRIVER_OK fits to it, from my point
> of view.

I fully agree, but it's different:

1) The driver doesn't know when there will be a NEEDS_RESET, so without
an interrupt, it must check the status for each operation
2) The driver knows when there will be STOP/STOP_FAILED, it only needs
to check the status after it tries to stop the device

>
> > > it's explicitly stated that it will be
> > > the last one, and it's caused because of the inconvenience of polling
> > > device status. Even if the driver can use other mechanisms.
> >
> > I think STOP works much more similarly to reset not NEEDS_RESET. The
> > only difference with reset is that STOP needs to preserve the device
> > states and we don't (or can't) use interrupt to signal the completion
> > of reset.
> >
>
> From the semantic point of view, yes. But in practical terms we can
> face unbounded time.

The problem is:

1) how to define the "unbounded time": if we don't define it exactly,
the device may abuse the status bit, which may cause a lot of trouble
2) there are other approaches to deal with the unbounded time, e.g. a
timeout in the driver + reset

> I mean, both operations have unbound time for
> sure, but I would say that any device should handle reset way faster
> than the STOP.

Any reason that reset is faster? I'd expect STOP to be a subset of reset,
that is, in order to do reset, we must first stop.

>
> I fully agree on your point, but I can also see the other way around:
> It would be convenient to have a configuration interrupt for the reset
> too, but it is impossible since we cannot configure any before the
> reset.

I think it's more about the question:

1) Why an interrupt is a must for STOP

than

2) If we can use an interrupt

My questions are all for 1).

>
> > >
> > > If the community still has concerns about it, another option is to
> > > actually extract the way the device notifies it from the general
> > > facilities, and make it transport specific. But to use the device
> > > configuration change notification for this makes sense to me. The
> > > device configuration has changed.
> >
> > See above, I think we should have a consistent way to handle reset and stop.
> >
> > >
> > > > As discussed in the previous versions,
> > > > driver is free to use a timer or whatever other mechanism if it
> > > > doesn't like the busy polling. I wonder how much value we can gain
> > > > from a dedicated config interrupt. Especially consider some transport
> > > > can use transport specific interrupt (not virtio specific interrupt)
> > > > for reporting whether or not set status succeed.
> > > >
> > >
> > > In my opinion, *if* we agree that a stop is a virtio facility and not
> > > a per-device one, and *if* we agree that a notification is required
> > > for the device to notify the stop, it makes sense to use a
> > > transport-independent mechanism that the device must already
> > > implement.
> >
> > So the major question is why a notification is a must? And Just to be
> > clear, there could be transport specific mechanisms for error
> > reporting.
> >
> > E.g.
> >
> > 1) PCI can have non-posted write, if we use non-posted write to carry
> > the stop command, the device can return whether or not the device is
> > stopped successfully.
> >
> > or
> >
> > 2) Some other transport can convert the stop status bit set into a
> > command and queue it to device specific queue, device can then use
> > its own specific interrupt to report when the stop is handled
> > (success or fail)
> >
>
> I would be totally fine with that too.
>
> > >
> > > > >If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > > > > +try new STOP attempts.
> > > >
> > > > Does the device need to re-read the STOP_FAILED for synchronization?
> > >
> > > I tend to see the status as something that belongs to the device and
> > > is exposed to the driver. In that sense, the write from the guest
> > > triggers an event on the device, and the device decides what will be
> > > exposed on that field (MMIO?) on the next driver read. If it's not
> > > that way, we couldn't use the STOP bit that way, right?
> >
> > Yes, but this is not an answer to my question. It's about the
> > ordering, when write returns it doesn't mean the write arrives at the
> > device, this is the case of PCI at least. So we need a mechanism to
> > make sure the write arrives at the device (PCI read will flush
> > previous write).
> >
>
> I didn't see that in your original question, sorry. But the PCI read
> that flush the write is the driver one, isn't it?
>
> In that case I would say that "the read" is part of "the write". It's
> an issue of the PCI protocol, which I would say doesn't belong to this
> section (or even this document?): To implement virtio over PCI, you
> know that virtio needs a write, and, in particular, you know that PCI
> needs a posterior read to make sure that write is effective.
>
> Either that, or that the driver must use non-posted ones if it wants
> the device to note it.
>
> Or am I still missing something?

Just to make sure we are on the same page: in this paragraph, you
said this at the beginning:

"If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
to ensure the STOP or STOP_FAILED bit is set after the write."

And at the end of the paragraph:

"If the device sets the STOP_FAILED bit, the driver MUST clear it
before try new STOP attempts."

But you don't define whether we need to re-read to make sure
STOP_FAILED is clear if the driver tries to clear it. Is this intended?

>
> > >
> > > > I
> > > > wonder how much we can gain from STOP_FAILED, the patch is unclear on
> > > > when that the device needs to set this bit. And driver can choose to
> > > > reset after a specific timeout anyhow.
> > > >
> > >
> > > The conditions where the device needs to set this bit are unspecified
> > > because it depends on the device: Not only to the kind of device, but
> > > also on the device backend.
> > >
> > > The same condition (regarding the possibility of handling the pending
> > > buffers) could cause different devices to react differently. A network
> > > device could decide it's fine to drop pending tx, let the guest think
> > > that "the network lost them", and mark them as done,
> >
> > We may meet the similar issue during reset.
> >
>
> Yes, but the driver should be fine to fail a reset, it does not want
> to use the device anymore or it wants to totally override the device
> state. If a stop fails, the driver would expect the device to continue
> operating in my opinion, because it will be impossible to recover the
> device state.

This is only true if we allow the stop to fail. It would be an
issue if the driver fails to stop a device, since that can fail the stop
of the entire VM, which is not something that the VMM is expecting.

If we don't allow the stop to fail and we allow the device to expose
the inflight buffers, we are all fine:

1) VM is guaranteed to be stopped
2) stop can be finished in time

Devices are free to choose to wait for the short-running requests and
tag the long-running requests as inflight.
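
A small Python sketch of that device-side choice is below. Everything
in it is an assumption made for illustration: the function name, the
per-request "remaining time" estimate, and the budget are not defined
anywhere in the spec, and it treats requests as running in parallel so
that each one is compared against the budget on its own.

```python
from typing import Dict, List, Tuple


def split_on_stop(remaining_time: Dict[int, float],
                  budget_s: float) -> Tuple[List[int], List[int]]:
    """Split fetched requests at stop time.

    remaining_time maps a descriptor id to the device's (hypothetical)
    estimate of how long the request still needs. Requests that fit in
    budget_s are drained and marked used; the others are reported as
    inflight so they can be re-submitted later.
    """
    drained: List[int] = []
    inflight: List[int] = []
    for desc_id, remaining in remaining_time.items():
        (drained if remaining <= budget_s else inflight).append(desc_id)
    return drained, inflight
```

With this the stop always completes within the budget, and the format
used to report the inflight list can be defined later as a separate
feature.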

>
> This is again something that we could leave if we decide it is not
> necessary at this moment: It just shows how a concern of previous
> proposals can be solved, at least technically.

I think we can start from a set of functions that makes e.g.
virtio-net work, to unblock:

1) live migration work
2) extensions to other devices (e.g inflight could be done on top as
new features)

>
> > > where a
> > > persistent storage cannot do that for write requests. Just as an
> > > example, not saying that networking devices must do that :).
> >
> > So I think this brings extra complexity that we probably don't need to
> > worry about now. The reason is that the spec doesn't allow the reset
> > to fail.
> >
>
> It can be left for the future for sure.
>
> > >
> > > > > +
> > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > >
> > > > Any motivation for this? it looks to me it makes the feature coupled
> > > > with the virtqueue state proposal? It seems odd to allow avail change
> > > > but not the last_avail_idx change.
> > > >
> > >
> > > On second thought, I think you are right and this overlaps with the
> > > state proposal.
> > >
> > > > > +
> > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > > +the driver MAY change any descriptor.
> > > > > +
> > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> > > > > +the driver can resume it clearing the STOP status bit. It MUST re-read the
> > > > > +device status to ensure the STOP bit is clear after the write. The device
> > > > > +acknowledges the new status clearing it. Since this change may not be
> > > > > +instantaneous, the driver MAY wait for the configuration change notification
> > > > > +that the device must send after the change.
> > > >
> > > > Do we really needs resuming? it's kind of:
> > > >
> > > > 1) STOP -> clear STOP
> > > >
> > > > vs
> > > >
> > > > 2) STOP -> RESET -> DRIVER_OK
> > > >
> > > > Using 2) preserve the semantic that the driver can't clear the status bit.
> > > >
> > >
> > > You are totally right in that regard. But the use case simplifies the
> > > operation when the driver only wants to take back some available
> > > descriptors still not used, in the range last_avail_idx..avail_idx.
> > > Doing that could be a big burden for drivers, who would need to
> > > re-send every status. MST proposed that use case at [1].
> >
> > Yes, but it looks to me this doesn't require the resuming? And the per
> > virtqueue reset is being proposed here.
> >
> > https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07818.html
> >
> > Actually, there's a subtle difference between 1) and 2). That is using
> > 2) doesn't make sure we can "resume" from the index where we stopped.
> > But this won't be an issue considering we know that we need to support
> > setting device virtqueue state(index). So if we want to resume from
> > the exact index it could be:
> >
> > STOP -> RESET -> setting index -> DRIVER_OK
> >
>
> With the state I meant more than VQ state, but the device state in
> general. For example, for the network, you must also send all the
> needed control commands to recover mac, rx filters, etc.

I'm not sure I get this. For the cvq stuff, with the help of the
shadow virtqueue, we don't need any spec extensions. What did I miss
here?

>
> That's what I meant with "if you just want to rewind some descriptors,
> resetting the whole device is overkill".
>
> The example may be wrong, I can think of virtiofs and the need to keep
> files opened:
> * If we go through a full reset circle, the files opened may not be
> the same as the closed ones, like deleted files with open handles.
> * If we go through a full reset circle, watchers may skip a change.
>
> Of course, this complexity may be left for the future and simply state
> that, if that is the case, the device cannot offer stop feature.
> Virtiofs have already other complexities that makes its migration
> hard, but I think the point is explained.

There are long discussions about virtiofs migration, but that is out
of scope for the discussion of device stop, since it's mainly about
how to define and expose device state. For stop, it's more than
sufficient to say that the device state needs to be preserved after
the device is stopped.

I'd rather go with something simple to work for a simple type of
device like ethernet. Otherwise there will be endless discussion. For
any features that are not needed by the ethernet device, I would leave
it for future investigation.

>
> > >
> > > In that regard, the straightforward thing to do is modify avail_idx /
> > > descriptors from that range and let resume. However, the RESET path
> > > makes it easier to implement the device part of course, and the guest
> > > can also achieve the rewind that way.
> > >
> > > > > +
> > > > >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> > > > >
> > > > >  The device MUST NOT consume buffers or send any used buffer
> > > > >  notifications to the driver before DRIVER_OK.
> > > > >
> > > > > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > > > > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > > > > +or clear of STOP.
> > > > > +
> > > > > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > > > > +operations after the driver writes STOP.
> > > >
> > > > I wonder if it's better to leave this to device to decide. E.g some
> > > > block devices may requires a very log time to finish the inflight
> > > > operations.
> > > >
> > >
> > > (Letting out SVQ + inflight descriptors for this part of the response,
> > > I will come back to it later)
> > >
> > > But if virtqueue is not valid anymore, how can it report them when
> > > finished?
> >
> > It's still valid since the STOP bit is not set by the device.
> >
>
> Then I don't understand your answer. To my proposal of:
>
> "If VIRTIO_F_STOP has been negotiated, the device MUST finish any
> in-flight operations after the driver writes STOP."
>
> You answered:
>
> "I wonder if it's better to leave this to device to decide. E.g some
> block devices may requires a very log time to finish the inflight
> operations."
>
> The device must finish all requests before it shows the STOP bit as
> set to the device. Maybe it is better to rephrase it like:
>
> If VIRTIO_F_STOP has been negotiated, the device MUST finish any
> in-flight operations after the driver writes STOP and before it sets
> its status bit STOP as set.
>
> ?

I meant that when to set STOP is highly device specific. E.g. for
virtio block devices which allow in-flight requests to be re-submitted
after resume, the device can choose not to wait for the completion of
the in-flight operations and expose them to the driver instead. This
helps to reduce the time spent on the stop.

>
> > > In that sense, I would say it's better to report failure and
> > > let the guest handle it as if the disk is unavailable (timeout?
> > > temporary faulty sector? I'm not sure what is the most suitable way).
> >
> > This could be addressed by leaving the following choices to the devices:
> >
> > 1) complete the inflight requests
> > 2) device or virtio specific for reporting inflight descriptors
> >
>
> As previously, I'm not sure how this relates with "the stop bit is not
> set by the device", so my answer may be completely wrong here,

It's related to time spent on the stop. E.g for block devices, it can
simply show all the inflight buffers to guests and set the stop bit.
Then STOP should be very fast.

>
> Even assuming the device can report in-flight descriptors, it needs to
> wait for the backend before reporting them anyhow.

Why does it need to wait for the backend? I mean, for a device that
supports in-flight descriptors, the semantics of the device should
allow requests to be processed twice.

> And we would need
> another indication. What is the use of separating these status?
> (waiting for stop bit, waiting for inflight descriptors to be valid).
>
> The only possibility I can come up with is to actually stop the
> request right in the middle of an operation. For example, to allow a
> big block read to stop and then when the device is informed about
> these inflight descriptors and its progress, it can continue. I would
> say this is very out of scope, more about this later ([1]).
>
> > >
> > > *If* we are not going to allow the guest to resume operation, where it
> > > knows all the status of the device, then there is no value on let the
> > > device delay the operation: From the guest point of view it either
> > > succeed to send to the device backend and somebody else caused a
> > > failure (external network lose the tx packet, bit rotting caused I'm
> > > reading a different value than previously written), or it failed at
> > > the stop moment.
> >
> > So it's highly device specific, e.g for ethernet, we can afford the
> > loss of packets but not for the block devices so reporting inflight
> > descriptors may help to res-submit those after "resuming".
> >
>
> Right.
>
> > >
> > > This is different with the resume possibility, where the device can
> > > decide to hold the descriptors, stop operating, and then resume
> > > operation.
> > >
> > > > > Depending on the device, it can do it
> > > > > +in many ways as long as the driver can recover its normal operation if it
> > > > > +resumes the device without the need of resetting it:
> > > > > +\begin{itemize}
> > > > > +\item Drain and wait for the completion of all pending requests until a
> > > > > +convenient avail descriptor. Ignore any other posterior descriptor.
> > > > > +\item Return a device-specific failure for these descriptors, so the driver
> > > > > +can choose to retry or to cancel them.
> > > >
> > > > If we allow the driver to retry, we need a way to report inflight
> > > > buffers which is not supported by the spec. A way to solve this is to
> > > > make it device specific.
> > > >
> > >
> > > Regarding the retry, I don't get you here. Re-reading the patch, I
> > > think that "driver retry" is very ambiguous: I meant for the device to
> > > mark the descriptor as used, but with a communication specific error
> > > code, so the application, guest kernel, etc (driver in the standard)
> > > can decide to retry.
> >
> > That's why I think introducing the virtqueue state is a must for stop,
> > With all the indexes defined, it would be much easier to describe what
> > the device or driver is expected to work.
> >
>
> I still don't see the relationship, sorry.

E.g how do you define the in-flight buffers accurately?
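
To make the question concrete, here is a minimal sketch of what the
indexes give us. It assumes a split virtqueue whose buffers are used
in order; the function names are made up for illustration and are not
part of the spec or of this patch. All indexes are free-running 16-bit
counters, so the subtraction has to wrap.

```c
#include <stdint.h>

/*
 * Assumption: split virtqueue, buffers used in order.
 * Buffers the device has fetched but not yet marked as used.
 */
static uint16_t inflight_count(uint16_t last_avail_idx, uint16_t used_idx)
{
    return (uint16_t)(last_avail_idx - used_idx);
}

/*
 * Buffers the driver made available that the device has not fetched
 * yet, i.e. the range last_avail_idx..avail_idx that could be rewound.
 */
static uint16_t pending_count(uint16_t avail_idx, uint16_t last_avail_idx)
{
    return (uint16_t)(avail_idx - last_avail_idx);
}

/* Check that avail_idx lies within used_idx and used_idx + queue_size. */
static int avail_idx_valid(uint16_t avail_idx, uint16_t used_idx,
                           uint16_t queue_size)
{
    return (uint16_t)(avail_idx - used_idx) <= queue_size;
}
```

Once buffers can be used out of order, a count like this no longer
says which buffers are in flight, which is why the definition needs
more care.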

>
> What I intended to say in the patch is that the device can choose to
> just return a device / communication error to indicate that the
> transaction has failed at device level, but related to virtio, the
> buffer would be marked as used.
>
> Maybe a good example of this is for the device to choose to return
> VIRTIO_BLK_S_IOERR, even if the transaction is still going in the
> block backend, but I don't know a lot of the blk device so I may be
> wrong. I guess that the guest cannot know about the value being
> written / read with that error code, and it is forced to re-read that.
> But the virtqueue will be in a good state, and the device can be reset
> and can recover its state. It's totally up to the device to choose to
> do so.

I think not. If we tie STOP to some device errors, that could be even
more complicated.

>
> Virtqueue state is still needed, but not because the device chooses to
> return VIRTIO_BLK_S_IOERR, but because it needs a way to recover the
> status after the reset.
>
> > >
> > > Regarding the in-flight descriptor report, it's interesting but I
> > > cannot see a way where it does not complicate the solution a lot or
> > > adds new dependencies. I have the next thoughts:
> > > 1) If it works as inflight_fd, "a region of shared memory"
> > > 1.1) This region must be in the guest's AS so the device has access to
> > > it. This either invalidates the use of STOP from the driver point of
> > > view as "let me know where you are not going to modify the guest's
> > > memory anymore".
>
> Long shot here, but might this work with the combination of the
> balloon device? Making this far and far from the simplicity though...
>
> > > 1.2) This region is on the hypervisor's AS. If the device supports it,
> > > it is possible to implement the SVQ without the need of STOP bit. This
> > > is equivalent to "I have a PF that also supports VF dirty memory
> > > tracking".
> > > 2) If it works as the config space, where the driver can ask for its
> > > status, STOP means "STOP writing used and report via config space". No
> > > need for reset.
> > >
> > > Did you have something different in mind?
> >
> > Not sure, maybe config space is better. What I want is to make the
> > feature as small as possible but leaving spaces for future extension.
> >
> > E.g we start from the feature that is sufficient for networking
> > devices, (but doesn't prevent the future work to extend it to block
> > devices). I'm not familiar with the block device, but mandating the
> > completion of inflight descriptor make have troubles, e.g unexpected
> > downtime during live migration.
> >
>
> [1] I agree with that, but I feel that "device or virtio specific for
> reporting inflight descriptors" is way too broad to make it useful at
> the moment.

Yes and that's not a must for an ethernet device.

>
> Maybe the best thing to do is to put all the restrictions at this
> moment, and when we figure out a good format for the inflight, add
> "\item report inflight descriptors". Then, the device and the driver
> are free to not accept any combination. Does it make sense?

Somehow. The idea is to start from a version that works for networking
devices, where we know we don't need to care about:

1) stop fail
2) unbounded time of stop, so we don't need an interrupt
3) inflight buffers
4) new facility querying device states (shadow CVQ can do this)

This will make it easier for both of us, as I feel the discussion
might not converge easily if we care about other types of devices with
too many things. With networking done, we can start to support block
devices and we can ask for help from the block gurus.

>
> > >
> > > > > +\item Mark them as done even if they are not, if the kind of device can
> > > > > +assume to lose them.
> > > >
> > > > I think "make buffer used" is better than "mark them as done". And we
> > > > need a accurate definition on who is "them".
> > > >
> > >
> > > All items include other operations, like the ones that the device must
> > > do internally to process the control virtqueue. But I cannot find an
> > > example where telling the driver they are done when it's not is valid
> > > for this particular item.
> > >
> > > But I agree it needs better wording.
> > >
> > > And I will s/them/operations/. for the next one.
> > >
> > > > > +\end{itemize}
> > > > > +
> > > > > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > > > > +a guest's request,
> > > >
> > > > It's not clear what did "a guest's request" means.
> > > >
> > >
> > > Right. Would "operation" fit better here?
> >
> > Still unclear, I guess this sentence tries to define when the device
> > can fail the stop?
> >
>
> Not really, my intentions were to add a MUST operation for when the
> device fails. The first is needed for the second though, so maybe we
> can rephrase.
>
> If we agree that a device can fail the stop, I think we should not
> restrict the circumstances where the device can fail. "If the device
> can find external circumstances where it cannot satisfy STOP must not
> offer STOP feature" works for me too, actually.

I'd leave the STOP_FAILED for future investigation.

>
> > >
> > > > > the device MUST set the STOP_FAILED bit for the guest to
> > > > > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > > > > +clears STOP_FAILED.
> > > > > +
> > > > > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > > > > +and the device can pause its operation, the device MUST set the descriptors
> > > > > +that it has done with them as used before exposing the STOP status bit as set.
> > > > > +
> > > > > +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> > > > > +after exposing the STOP bit set:
> > > > > +\begin{itemize}
> > > > > +\item Read updates on the descriptor or driver area, or consume more buffers.
> > > > > +\item Send any used buffer notifications to the driver.
> > > > > +\end{itemize}
> > > > > +
> > > > > +The device MUST send a configuration space change right after exposing the STOP
> > > > > +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> > > > > +send another configuration space change notification to the driver afterwards
> > > > > +until the guest clears it.
> > > > > +
> > > > > +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> > > > > +the device MUST resume operation when the driver clears the STOP bit. The
> > > > > +device MUST continue reading available descriptors as if an available buffer
> > > > > +notification has reach it, starting from the last descriptor it marked as used,
> > > >
> > > > So I still tend to define virtqueue state as basic facility before
> > > > defining STOP. It can makes thing easier.
> > > >
> > >
> > > Yes, coming back to that approach can simplify the whole proposal.
> > >
> > > > > +and continue the regular operation after that. The device MUST read again
> > > > > +descriptor and driver area beyond the last descriptor it marked as used when it
> > > > > +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> > > > > +if for some reason it cannot continue.
> > > > > +
> > > > >  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> > > > >  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
> > > > >  MUST send a device configuration change notification to the driver.
> > > > > @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > > > >    transport specific.
> > > > >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> > > > >
> > > > > +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> > > > > +  stop the device.
> > > > > +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> > > > > +
> > > > >  \end{description}
> > > >
> > > > So I think the patch complicate thing is various ways:
> > > >
> > > > 1) STOP_FAILED status bit, which seems unnecessary or even duplicated
> > > > with NEEDS_RESET
> > > > 2) configuration change interrupt, looks conflict with the semantic of STOP
> > >
> > > I'm not sure about those two, I find we will have devices with unbound
> > > stop time where both can be useful if we agree on making this a
> > > general facility.
> >
> > If the unbound stop time is the only worry, the way to report inflight
> > descriptors looks like a better solution.
>
> I'm not sure if that's the only condition under which a device can
> fail to stop, but if we agree on that we could prepare a format for
> block devices to report them, for example. They are needed somehow in
> the networking case of packed if buffers are used out of order.

It can, but let's leave it for the future now.

Actually, for inflight buffers, a better idea is to support it at the
virtqueue level without an extra data structure. But it's for sure not
a short term solution.

>
> > And STOP_FAILED is actually
> > not accurate since it means the stop is not finished in bound time
> > (but we need to define how long should be a bound time?)
> >
> > > Resetting the whole device because of this leaves
> > > the driver with no possibility of knowing the state of the sent
> > > descriptors.
> > >
> > > Of course, if these use cases are not interesting, it's easier to
> > > leave them out for sure.
> > >
> > > > 3) status bit clearing (resuming), a functional duplication with RESET
> > > > + DRIVER_OK
> > > >
> > >
> > > I agree it can be obtained with a whole reset, so it can be out and
> > > leave it for the future if needed. However it seems overkill if we
> > > just want to rewind some descriptors back, and there is no standard
> > > way to recover the device status beyond vq_state.
> >
> > It's more about the minimal self-contained set of the new features. If
> > it's just rewind, device or virtqueue reset is sufficient.
>
> I'm not sure if that is true for all devices with the features the
> standard offers at the moment, but it might be right for serial.

Thanks

>
> > If we want
> > to obtain the state, virtqueue state is a must and with virtqueue
> > state, resuming (clearing STOP) is not a must.
> >
>
> Right.
>
> Thanks!
>
> > Thanks
> >
> > >
> > > Thanks!
> > >
> > > > I think we'd better to stick to the minimal set of the function to
> > > > reduce the complexity: virtqueue state + STOP bit (without clearing
> > > > and no config interrupt).
> > > >
> > >
> > > [1] https://lists.oasis-open.org/archives/virtio-comment/202107/msg00043.html
> > >
> > > > Thanks
> > > >
> > > > >
> > > > >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > > > > --
> > > > > 2.27.0
> > > > >
> > > >
> > >
> >
>



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-16  6:56           ` Jason Wang
@ 2021-11-16 14:50             ` Eugenio Perez Martin
  2021-11-17  3:27               ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-16 14:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Tue, Nov 16, 2021 at 7:56 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Nov 16, 2021 at 2:17 AM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Mon, Nov 15, 2021 at 5:08 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, Nov 12, 2021 at 6:51 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Fri, Nov 12, 2021 at 5:18 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > >
> > > > > > This patch introduces a new status bit STOP. This can be used by the
> > > > > > driver to stop the device in order to safely fetch used descriptors
> > > > > > status, making sure the device will not fetch new available ones.
> > > > > >
> > > > > > Its main use case is live migration, although it has other orthogonal
> > > > > > use cases. It can be used to safely discard requests that have not been
> > > > > > used: in other words, to rewind available descriptors.
> > > > > >
> > > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > >
> > > > > So this is much more complicated, see below.
> > > > >
> > > >
> > > > I agree it's more complicated, but it addresses some concerns raised
> > > > on previous patches sent to the list. Not saying that all of them must
> > > > be addressed, or addressed this way though :).
> > > >
> > > > > > ---
> > > > > >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > >  1 file changed, 83 insertions(+)
> > > > > >
> > > > > > diff --git a/content.tex b/content.tex
> > > > > > index 2aa3006..9ed0d09 100644
> > > > > > --- a/content.tex
> > > > > > +++ b/content.tex
> > > > > > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > > >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> > > > > >    drive the device.
> > > > > >
> > > > > > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > > > +  device has been stopped by the driver. This status bit is different
> > > > > > +  from the reset since the device state is preserved.
> > > > > > +
> > > > > > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > > > +  device could not stop the STOP request.
> > > > > > +
> > > > > >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> > > > > >    an error from which it can't recover.
> > > > > >  \end{description}
> > > > > > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > > >  recover by issuing a reset.
> > > > > >  \end{note}
> > > > > >
> > > > > > +If VIRTIO_F_STOP has been negotiated,
> > > > >
> > > > > "has not been" actually?
> > > > >
> > > >
> > > > I think the sentence is ok. In other words, "Even when VIRTIO_F_STOP
> > > > *has been* negotiated (in other words, driver sent FEATURES_OK), the
> > > > driver must not set or clear the STOP bit before setting DRIVER_OK".
> > >
> > > Ok, but what happens if we simply allow the STOP to be set if
> > > DRIVER_OK is not set? It looks to me that the DRIVER_OK doesn't
> > > conflict with STOP.
> > >
> > > (Anyhow we allow to set STOP after DRIVER_OK)
> > >
> >
> > We could change it to "the driver MUST NOT set or clear STOP if
> > FEATURES_OK is not set", which would allow the driver to start a
> > device in stop mode. Before that should be definitely not done by a
> > good driver.
>
> Yes, limiting it before FEATURES_OK is a must.
>
> >
> > But if we don't allow the resume, it makes little sense to allow the
> > driver to start (as "set DRIVER_OK bit") in stop mode anyhow.
>
> Yes.
>
> > I would
> > say that it is better to limit that now, and allow it in the future if
> > we find a valid use case, enabling a specific feature flag for it.
> >
> > I'm also fine if we decide to leave this unspecified, but limiting it
> > now could enable us to make something useful with it in the future.
>
> Leaving is unspecified seems better since if we do the limitation, it
> introduces extra efforts for future extension.  But I'm fine with
> either.
>
> >
> > > >
> > > > > > the driver MUST NOT set or clear STOP if
> > > > > > +DRIVER_OK is not set.
> > > > > > +
> > > > > > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > > > > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > > > > > +acknowledges the new paused status setting the first, or the failure setting
> > > > > > +the last. Since this change may not be instantaneous, the driver MAY wait for
> > > > > > +the configuration change notification that the device must send after the
> > > > > > +change.
> > > > >
> > > > > This is kind of tricky, it means the device can send notification
> > > > > after it has been stopped.
> > > >
> > > > I don't think this part it's so tricky. That notification is also sent
> > > > when the DEVICE_NEEDS_RESET bit is set,
> > >
> > > I think they are different, DEVICE_NEEDS_RESET doesn't mean the device
> > > is stopped.
> >
> > To clarify, what I meant is that there are situations where this
> > notification is raised even if device configuration is not changed,
> > but its status.
>
> Right, but again the notification is not a must for the status changed
> (e.g reset).
>
> >
> > NEED_RESET does not mean the device is stopped, but it (should) signal
> > the driver that further interaction with the device will be for sure
> > invalid. I may be wrong with this, but this way of notifying the
> > driver relieves it to the need for check status in every interaction.
>
> You are right. But it's still different with stop, we don't need to
> check if for every interaction. What we need is to check if only after
> the driver tries to stop the device (as reset).
>
> >
> > > But what we want to achieve is to make sure there won't be
> > > any interaction between device and driver after STOP is set by device.
> > >
> >
> > If I understand you correctly, what you meant is that that a driver
> > could (and I think it will a lot of times) read the status change in
> > this order:
> > 1) STOP bit is set
> > 2) Notification change arrives
> >
> > And 2) is weird since the device promised no more interaction somehow.
> >
> > I agree to some extent, but it can be read even from the opposite
> > angle: From the moment the driver sets DRIVER_OK, every change on the
> > device (status or config) is notified using configuration change
> > interrupt.
>
> For status, It depends on what you mean by "change". If it's the value
> that read from the driver:
>
> 1) The only thing that needs to be notified is the status that can
> only be set by device, that is NEEDS_RESET.
> 2) For the status that set by the driver and device can refuse
> (forever or temporarily), there's no notification change: DRIVER_OK,
> FEATURE_OK
>
> STOP belongs to 2). STOP_FAILED belongs to 1), but:
>
> 1) STOP status bit 0 means the device is not stopped
> 2) We don't have DRIVER_FAILED and FEATURE_FAILED, instead, we just
> check whether or not the bit is set by the device.
>
> So whether or not we need STOP_FAILED is still questionable.
>

I don't split them into "bits set by the device" and "bits set by the
driver". The way I see it, the driver sends a request, and the device
is going to change its status in the future. The driver can learn the
status by reading the two-bit bitfield:

STOP_FAILED
|STOP
||
00 - Running normally
01 - Device has stopped successfully
10 - Device could not stop, but is running normally
11 - Cannot find this combination at this moment.
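
As a minimal sketch of how a driver could decode that bitfield, using
the bit values from this patch (STOP = 16, STOP_FAILED = 32). The enum
and function names are only illustrative, they are not in the patch:

```c
#include <stdint.h>

/* Bit values as proposed in this patch; not part of the released spec. */
#define VIRTIO_STATUS_STOP        16u
#define VIRTIO_STATUS_STOP_FAILED 32u

/* Hypothetical names for the four combinations listed above. */
enum stop_state {
    STOP_STATE_RUNNING, /* 00 - running normally */
    STOP_STATE_STOPPED, /* 01 - device has stopped successfully */
    STOP_STATE_FAILED,  /* 10 - could not stop, still running normally */
    STOP_STATE_INVALID  /* 11 - not expected at this moment */
};

static enum stop_state decode_stop_state(uint8_t status)
{
    int stop = (status & VIRTIO_STATUS_STOP) != 0;
    int failed = (status & VIRTIO_STATUS_STOP_FAILED) != 0;

    if (stop && failed)
        return STOP_STATE_INVALID;
    if (failed)
        return STOP_STATE_FAILED;
    if (stop)
        return STOP_STATE_STOPPED;
    return STOP_STATE_RUNNING;
}
```

Other status bits such as DRIVER_OK (4) are ignored by the decoder, so
it can be fed the raw status byte.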

> >
> > a) Regarding the standard, I don't see it so different from the
> > NEED_RESET: the config change keeps being an out of band notification
> > system the driver can relay to know a (expected) status change.
> > b) I don't see a big deal with changing the semantic from "no more
> > interaction from the device" with "no more interaction but the
> > expected config change interrupt".
> > c) It's easy to ignore the interrupt, or even not to treat it
> > specially after the stop: The driver already should scan config to
> > look for changes in configuration and status, it will simply find
> > none. Although this is not implemented widely as far as I see.
> >
> > In that regard, I feel that interaction is very innocuous, and to me
> > is the straightforward solution to avoid the active polling.
>
> Well, the driver can choose not to do busy polling for sure without
> the interrupt for sure.
>

You mean using the transport specific method you describe in previous mails?

> >
> > > > and (as I read) is for the
> > > > same reason somehow: To avoid the status polling:
> > > > * "The driver SHOULD NOT rely on completion of operations of a device
> > > > if DEVICE_NEEDS_RESET is set." (copied from the standard)
> > > > * The reading of the status field could be expensive / inconvenient in
> > > > each operation.
> > >
> > > It makes sense for the device initiated event to use interrupt. But
> > > for a stop, it's driver initiated, in this case the driver won't start
> > > the work (for example the cleanup) after it makes sure the device is
> > > stopped. Polling the status should be fine as this is how the rest
> > > works. Anything makes stop differ from reset here? Or what worries you
> > > without the interrupt?
> > >
> >
> > This is proposed only in the scope of the concerns I saw raised in
> > previous series: the time to stop a device could be unbound, and
> > tricks to poll less frequently will increase migration time.
>
> But I don't see how an interrupt can help to reduce the time spent on
> the stop.

If a host has many pass-through devices, it needs to burn CPU to ask
all of them periodically. To reduce that burden, it polls less
frequently. Because of that, some devices are stopped but the
hypervisor is not aware of it.

To solve it, the device must be able to tell the hypervisor / driver
when it has stopped. Of course, interrupt may not be the only way, and
actively polling will always be a choice.
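
A minimal sketch of the polling alternative, to show where the cost
comes from. Everything here is hypothetical: the status accessor is a
callback because the real read is transport specific, and fake_dev is
only a stand-in device that reports STOP after a given number of
reads.

```c
#include <stdint.h>

#define VIRTIO_STATUS_STOP 16u /* value proposed in this patch */

/* Hypothetical accessor: reads the device status via the transport. */
typedef uint8_t (*read_status_fn)(void *dev);

/*
 * Poll until the device exposes STOP or max_polls reads were done.
 * Returns the number of reads performed, or -1 if the budget ran out.
 */
static int poll_for_stop(read_status_fn read_status, void *dev,
                         int max_polls)
{
    for (int i = 1; i <= max_polls; i++) {
        if (read_status(dev) & VIRTIO_STATUS_STOP)
            return i;
        /*
         * A real VMM would sleep or yield here. The longer it sleeps,
         * the later it notices that the device has already stopped.
         */
    }
    return -1;
}

/* Stand-in device for illustration: stops after N status reads. */
struct fake_dev {
    int reads_until_stop;
};

static uint8_t fake_read_status(void *opaque)
{
    struct fake_dev *d = opaque;

    if (d->reads_until_stop > 0)
        d->reads_until_stop--;
    return d->reads_until_stop == 0 ? VIRTIO_STATUS_STOP : 0;
}
```

With many devices the VMM runs one such loop per device, so it has to
trade CPU time against how quickly it sees the stop.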

> The downtime is usually a user policy, so the VMM can choose
> to timeout the stop and perform the resume.
>

With "resume" do you mean to actually reset the device if the STOP bit
is not set in time? If that is a possibility then it is sure we will
need to handle the inflight descriptors.

> As discussed, the way to advertise the inflight buffers might be a
> solution for this.
>

If these transactions are idempotent, as you say later, then yes.

> >
> > I will fully agree if these are left to the future: it is easy to
> > implement this chunk of the proposal under a separated feature flag if
> > this need arises. Sorry if that part was not clear enough.
>
> That's fine.
>
> >
> > > > * Solution: Instead of polling, make a device facility to notify the
> > > > driver that it cannot trust the device is going to behave properly /
> > > > same as before anymore via notification.
> > > >
> > > > We can add another exception to the "device configuration space
> > > > change" in "Notification of Device Configuration Changes", like the
> > > > one already present:
> > > > "In addition, this notification is triggered by the device setting
> > > > DEVICE_NEEDS_RESET".
> > > >
> > > > I understand it sounds tricky that the device sends a notification
> > > > when it's stopped, but in my opinion it's aligned with previous
> > > > behavior (DEVICE_NEEDS_RESET),
> > >
> > > I think not,  e.g DEVICE_NEEDS_RESET doesn't (or it can't) mean the
> > > device won't process the buffer or send an interrupt.
> > >
> >
> > From the driver point of view, it means that the driver cannot trust
> > the device anymore until the reset, so the driver actions are similar:
>
> I think we need clarify the exact semanic of STOP_FAILED, and it will
> be very hard to differ
>
> 1) The device can not be stopped in short time
>
> from
>
> 2) The device can not be stopped forever
>
> At least in case 1) the driver still can trust the device.
>

I think that from the driver POV it is the same. If the device cannot
stop, it will continue operating normally.

> >
> > ""
> > the driver can’t assume requests in flight will be completed if
> > DEVICE_NEEDS_RESET is set, nor can it assume that they have not been
> > completed
> > ""
> >
> > (Sorry for being circular here, I think it proceeds here too) What I
> > meant is that the device sent an out of band notification when the
> > device status changed. The driver could check the status field before
> > processing every used buffer and also with a timer just in case, and
> > DEVICE_NEEDS_RESET would not need the config interrupt change. But the
> > interrupt gives convenience to the whole operation.
> >
> > Every time the driver gets that interrupt, it must re-check all the
> > device configuration and status anyway. It can still make buffers
> > available while processing it, but that's the meaning of the interrupt
> > to me. And a status change after DRIVER_OK fits to it, from my point
> > of view.
>
> I fully agree, but it's different:
>
> 1) The driver doesn't know when there will be a NEEDS_RESET, so without
> an interrupt, it must check the status for each operation
> 2) The driver knows when there will be a STOP/STOP_FAILED, so it only needs
> to check the status after it tries to stop the device
>

You know that the device will set it at some point in the future after
that, within an unbounded time. That may not be a problem, but concerns
about this were raised on previous patches.

> >
> > > > it's explicitly stated that it will be
> > > > the last one, and it's caused because of the inconvenience of polling
> > > > device status. Even if the driver can use other mechanisms.
> > >
> > > I think STOP works much more similarly to reset not NEEDS_RESET. The
> > > only difference with reset is that STOP needs to preserve the device
> > > states and we don't (or can't) use interrupt to signal the completion
> > > of reset.
> > >
> >
> > From the semantic point of view, yes. But in practical terms we can
> > face unbounded time.
>
> The problem is:
>
> 1) how to define the "unbounded time", if we don't define it exactly,
> the device may abuse the status bit which may cause a lot of troubles

To me, the unbounded time is where the device does not guarantee that
you will have the STOP status set the next time you read it. In the case
of an untrusted device, this may include every one of them, for sure.

You care a little bit less about it if the device notifies you about
the condition: You simply have the driver waiting for it, but
operating normally in other regards.

That notification can be a basic facility, a transport device one, or
whichever other method.

Mandating that the device is stopped the next time the status bit is
read is another possibility, for sure, but if I understood correctly
that would leave other devices out. If vq descriptors are idempotent,
inflight_fd may be a way to get rid of that unbounded time.
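
As a rough illustration of the polling side of this, here is a minimal
sketch of a driver that writes STOP and then re-reads the status a
bounded number of times. It is a toy model, not spec text: the STOP
value (32), FakeDevice and stop_device() are names and values assumed
only for this example, while DRIVER_OK (4) and DEVICE_NEEDS_RESET (64)
are the values already in the spec.

```python
# Sketch only: models a driver polling the device status after writing STOP.
DRIVER_OK = 4            # value from the virtio spec
DEVICE_NEEDS_RESET = 64  # value from the virtio spec
STOP = 32                # hypothetical value, for illustration only


class FakeDevice:
    """Toy device that exposes STOP only after `reads_needed` status reads."""

    def __init__(self, reads_needed):
        self.status = DRIVER_OK
        self.stop_requested = False
        self.reads_left = reads_needed

    def write_status(self, value):
        # The driver asks for the stop; the device decides when to expose it.
        if value & STOP:
            self.stop_requested = True

    def read_status(self):
        if self.stop_requested and not (self.status & STOP):
            if self.reads_left > 0:
                self.reads_left -= 1  # still draining in-flight requests
            else:
                self.status |= STOP
        return self.status


def stop_device(dev, max_polls):
    """Return "stopped" if STOP shows up within max_polls reads, else "timeout".

    On "timeout" the driver only knows that the device has not stopped yet.
    """
    dev.write_status(dev.read_status() | STOP)
    for _ in range(max_polls):
        status = dev.read_status()
        if status & DEVICE_NEEDS_RESET:
            return "needs_reset"
        if status & STOP:
            return "stopped"
    return "timeout"
```

With a notification the loop would be replaced by waiting for the event,
but the "timeout" outcome stays ambiguous either way: the driver cannot
tell a slow device from one that will never stop.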

> 2) there are other approaches that we can deal with the unbound time,
> timeout in driver + reset
>
> > I mean, both operations have unbound time for
> > sure, but I would say that any device should handle reset way faster
> > than the STOP.
>
> Any reason that STOP is faster? I'd expect STOP is a subset of reset,
> that is, in order to do reset, we must first stop.
>

I'm not sure if that is the way hardware can work, but with reset the
device can reclaim the needed resources for the communication in an
asynchronous way, since they are not needed for the driver to ask.
With the STOP, they need to be communicated to the driver somehow in a
virtio way.

> >
> > I fully agree on your point, but I can also see the other way around:
> > It would be convenient to have a configuration interrupt for the reset
> > too, but it is impossible since we cannot configure any before the
> > reset.
>
> I think it's a more about the question:
>
> 1) Why an interrupt is a must for STOP
>
> than
>
> 2) If we can use an interrupt
>
> My questions are all for 1).
>

It's not a must.

> >
> > > >
> > > > If the community still has concerns about it, another option is to
> > > > actually extract the way the device notifies it from the general
> > > > facilities, and make it transport specific. But to use the device
> > > > configuration change notification for this makes sense to me. The
> > > > device configuration has changed.
> > >
> > > See above, I think we should have a consistent way to handle reset and stop.
> > >
> > > >
> > > > > As discussed in the previous versions,
> > > > > driver is free to use a timer or whatever other mechanism if it
> > > > > doesn't like the busy polling. I wonder how much value we can gain
> > > > > from a dedicated config interrupt. Especially consider some transport
> > > > > can use transport specific interrupt (not virtio specific interrupt)
> > > > > for reporting whether or not set status succeed.
> > > > >
> > > >
> > > > In my opinion, *if* we agree that a stop is a virtio facility and not
> > > > a per-device one, and *if* we agree that a notification is required
> > > > for the device to notify the stop, it makes sense to use a
> > > > transport-independent mechanism that the device must already
> > > > implement.
> > >
> > > So the major question is why a notification is a must? And Just to be
> > > clear, there could be transport specific mechanisms for error
> > > reporting.
> > >
> > > E.g.
> > >
> > > 1) PCI can have non-posted write, if we use non-posted write to carry
> > > the stop command, the device can return whether or not the device is
> > > stopped successfully.
> > >
> > > or
> > >
> > > 2) Some other transport can convert the stop status bit set into a
> > > command and queue it to a device-specific queue; the device can then use
> > > its own specific interrupt to report when the stop is handled
> > > (success or fail)
> > >
> >
> > I would be totally fine with that too.
> >
> > > >
> > > > > >If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > > > > > +try new STOP attempts.
> > > > >
> > > > > Does the device need to re-read the STOP_FAILED for synchronization?
> > > >
> > > > I tend to see the status as something that belongs to the device and
> > > > is exposed to the driver. In that sense, the write from the guest
> > > > triggers an event on the device, and the device decides what will be
> > > > exposed on that field (MMIO?) on the next driver read. If it's not
> > > > that way, we couldn't use the STOP bit that way, right?
> > >
> > > Yes, but this is not an answer to my question. It's about the
> > > ordering, when write returns it doesn't mean the write arrives at the
> > > device, this is the case of PCI at least. So we need a mechanism to
> > > make sure the write arrives at the device (PCI read will flush
> > > previous write).
> > >
> >
> > I didn't see that in your original question, sorry. But the PCI read
> > that flush the write is the driver one, isn't it?
> >
> > In that case I would say that "the read" is part of "the write". It's
> > an issue of the PCI protocol, which I would say doesn't belong to this
> > section (or even this document?): To implement virtio over PCI, you
> > know that virtio needs a write, and, in particular, you know that PCI
> > needs a posterior read to make sure that write is effective.
> >
> > Either that, or that the driver must use non-posted ones if it wants
> > the device to note it.
> >
> > Or am I still missing something?
>
> Just to make sure we are on the same page, in this paragraph, you
> said this at the beginning:
>
> "If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> to ensure the STOP or STOP_FAILED bit is set after the write."
>
> And in the end of the paragraph:
>
> If the device sets the STOP_FAILED bit, the driver MUST clear it
> before try new STOP attempts. But you don't define whether we need to
> re-read to make sure STOP_FAILED is clear if the driver tries to clear
> it. Is this intended?
>

Not really, I assumed that the write of clearing STOP_FAILED and the
write of STOP would be ordered, and that the device should have no
problem clearing that status bit, since I don't see any kind of
resource reclamation or anything like that there.
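
To spell out the ordering assumption, a toy model follows. It assumes
only what was said above for PCI: posted writes reach the device in
order, and a read flushes the writes posted before it. The bit values
for STOP and STOP_FAILED, the PostedBus class and its toy device
behaviour are made up for illustration.

```python
# Sketch only: posted writes are queued in order; a read flushes them first.
DRIVER_OK = 4     # value from the virtio spec
STOP_FAILED = 16  # hypothetical value, for illustration only
STOP = 32         # hypothetical value, for illustration only


class PostedBus:
    def __init__(self, device_status):
        self.device_status = device_status
        self.pending = []  # posted writes not yet seen by the device

    def write_status(self, value):
        # A posted write returns before it reaches the device.
        self.pending.append(value)

    def read_status(self):
        # A read cannot pass earlier posted writes: they land first, in order.
        for value in self.pending:
            self.device_status = self._device_handle(value)
        self.pending.clear()
        return self.device_status

    def _device_handle(self, value):
        if self.device_status & STOP_FAILED:
            if value & STOP_FAILED:
                return self.device_status  # not cleared: STOP is ignored
            return DRIVER_OK               # driver cleared STOP_FAILED
        if value & STOP:
            return DRIVER_OK | STOP        # toy device stops at once
        return DRIVER_OK


bus = PostedBus(DRIVER_OK | STOP_FAILED)
bus.write_status(DRIVER_OK)          # clear STOP_FAILED
bus.write_status(DRIVER_OK | STOP)   # new STOP attempt, posted right after
final_status = bus.read_status()     # single read flushes both writes
```

In this model the two writes need no read in between: one re-read after
the STOP write is enough to observe the result of both.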

> >
> > > >
> > > > > I
> > > > > wonder how much we can gain from STOP_FAILED, the patch is unclear on
> > > > > when that the device needs to set this bit. And driver can choose to
> > > > > reset after a specific timeout anyhow.
> > > > >
> > > >
> > > > The conditions where the device needs to set this bit are unspecified
> > > > because it depends on the device: Not only to the kind of device, but
> > > > also on the device backend.
> > > >
> > > > The same condition (regarding the possibility of handling the pending
> > > > buffers) could cause different devices to react differently. A network
> > > > device could decide it's fine to drop pending tx, let the guest think
> > > > that "the network lost them", and mark them as done,
> > >
> > > We may meet the similar issue during reset.
> > >
> >
> > Yes, but the driver should be fine to fail a reset, it does not want
> > to use the device anymore or it wants to totally override the device
> > state. If a stop fails, the driver would expect the device to continue
> > operating in my opinion, because it will be impossible to recover the
> > device state.
>
> This is only true if we allow the stop to be failed. It would be an
> issue if the driver fails to stop a device since it can fail the stop
> of the entire VM which is not something that the VMM is expecting.
>
> If we don't allow the stop to fail and we allow the device to expose
> the inflight buffers, we are all fine:
>
> 1) VM is guaranteed to be stopped
> 2) stop can be finished in time
>
> Devices are free to choose to wait for the short time request and tag
> the long time request as inflight.
>
> >
> > This is again something that we could leave if we decide it is not
> > necessary at this moment: It just shows how a concern of previous
> > proposals can be solved, at least technically.
>
> To me, I think we can start from a set of functions that can make e.g
> the virtio-net to work to unblock:
>
> 1) live migration work
> 2) extensions to other devices (e.g inflight could be done on top as
> new features)
>
> >
> > > > where a
> > > > persistent storage cannot do that for write requests. Just as an
> > > > example, not saying that networking devices must do that :).
> > >
> > > So I think this brings extra complexity that we probably don't need to
> > > worry about now. The reason is that the spec doesn't allow the reset
> > > to fail.
> > >
> >
> > It can be left for the future for sure.
> >
> > > >
> > > > > > +
> > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > >
> > > > > Any motivation for this? it looks to me it makes the feature coupled
> > > > > with the virtqueue state proposal? It seems odd to allow avail change
> > > > > but not the last_avail_idx change.
> > > > >
> > > >
> > > > On second thought, I think you are right and this overlaps with the
> > > > state proposal.
> > > >
> > > > > > +
> > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > > > +the driver MAY change any descriptor.
> > > > > > +
> > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> > > > > > +the driver can resume it clearing the STOP status bit. It MUST re-read the
> > > > > > +device status to ensure the STOP bit is clear after the write. The device
> > > > > > +acknowledges the new status clearing it. Since this change may not be
> > > > > > +instantaneous, the driver MAY wait for the configuration change notification
> > > > > > +that the device must send after the change.
> > > > >
> > > > > Do we really need resuming? It's kind of:
> > > > >
> > > > > 1) STOP -> clear STOP
> > > > >
> > > > > vs
> > > > >
> > > > > 2) STOP -> RESET -> DRIVER_OK
> > > > >
> > > > > Using 2) preserve the semantic that the driver can't clear the status bit.
> > > > >
> > > >
> > > > You are totally right in that regard. But the use case simplifies the
> > > > operation when the driver only wants to take back some available
> > > > descriptors still not used, in the range last_avail_idx..avail_idx.
> > > > Doing that could be a big burden for drivers, who would need to
> > > > re-send every status. MST proposed that use case at [1].
> > >
> > > Yes, but it looks to me this doesn't require the resuming? And the per
> > > virtqueue reset is being proposed here.
> > >
> > > https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07818.html
> > >
> > > Actually, there's a subtle difference between 1) and 2). That is using
> > > 2) doesn't make sure we can "resume" from the index where we stopped.
> > > But this won't be an issue considering we know that we need to support
> > > setting device virtqueue state(index). So if we want to resume from
> > > the exact index it could be:
> > >
> > > STOP -> RESET -> setting index -> DRIVER_OK
> > >
> >
> > With the state I meant more than VQ state, but the device state in
> > general. For example, for the network, you must also send all the
> > needed control commands to recover mac, rx filters, etc.
>
> I'm not sure I get this. For those cvq stuff, with the help of the
> shadow virtqueue, we don't need any spec extensions. What did I miss
> here?
>

Right, that was not the example I intended to put actually. I did a
lot of back and forth for the answer and I put the wrong one here,
sorry :).

But my concern is solved if we treat the inflight descriptors as
idempotent. All the points left unanswered above are solved with that too.

> >
> > That's what I meant with "if you just want to rewind some descriptors,
> > resetting the whole device is overkill".
> >
> > The example may be wrong, I can think of virtiofs and the need to keep
> > files opened:
> > * If we go through a full reset circle, the files opened may not be
> > the same as the closed ones, like deleted files with open handles.
> > * If we go through a full reset circle, watchers may skip a change.
> >
> > Of course, this complexity may be left for the future and simply state
> > that, if that is the case, the device cannot offer stop feature.
> > Virtiofs have already other complexities that makes its migration
> > hard, but I think the point is explained.
>
> There are long discussions about the virtiofs migration. But it's out
> of the scope for the discussion of device stop since it's mainly about
> how to define and expose device states. For stop, it's more than
> sufficient to say the device states needs to be preserved after the
> device is stopped.
>
> I'd rather go with something simple to work for a simple type of
> device like ethernet. Otherwise there will be endless discussion. For
> any features that are not needed by the ethernet device, I would leave
> it for future investigation.
>
> >
> > > >
> > > > In that regard, the straightforward thing to do is modify avail_idx /
> > > > descriptors from that range and let resume. However, the RESET path
> > > > makes it easier to implement the device part of course, and the guest
> > > > can also achieve the rewind that way.
> > > >
> > > > > > +
> > > > > >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> > > > > >
> > > > > >  The device MUST NOT consume buffers or send any used buffer
> > > > > >  notifications to the driver before DRIVER_OK.
> > > > > >
> > > > > > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > > > > > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > > > > > +or clear of STOP.
> > > > > > +
> > > > > > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > > > > > +operations after the driver writes STOP.
> > > > >
> > > > > I wonder if it's better to leave this to device to decide. E.g some
> > > > > block devices may require a very long time to finish the inflight
> > > > > operations.
> > > > >
> > > >
> > > > (Letting out SVQ + inflight descriptors for this part of the response,
> > > > I will come back to it later)
> > > >
> > > > But if virtqueue is not valid anymore, how can it report them when
> > > > finished?
> > >
> > > It's still valid since the STOP bit is not set by the device.
> > >
> >
> > Then I don't understand your answer. To my proposal of:
> >
> > "If VIRTIO_F_STOP has been negotiated, the device MUST finish any
> > in-flight operations after the driver writes STOP."
> >
> > You answered:
> >
> > "I wonder if it's better to leave this to device to decide. E.g some
> > block devices may require a very long time to finish the inflight
> > operations."
> >
> > The device must finish all requests before it shows the STOP bit as
> > set to the driver. Maybe it is better to rephrase it like:
> >
> > If VIRTIO_F_STOP has been negotiated, the device MUST finish any
> > in-flight operations after the driver writes STOP and before it sets
> > its status bit STOP as set.
> >
> > ?
>
> I meant when to set STOP is highly device specific. E.g for the virtio
> block devices which allows the in-flight requests to be re-submitted
> after resume, the device can choose to not wait for the completion of
> the inflight operations and expose them to the driver. This helps to
> reduce the time spent on the stop.
>
> >
> > > > In that sense, I would say it's better to report failure and
> > > > let the guest handle it as if the disk is unavailable (timeout?
> > > > temporary faulty sector? I'm not sure what is the most suitable way).
> > >
> > > This could be addressed by leaving the following choices to the devices:
> > >
> > > 1) complete the inflight requests
> > > 2) device or virtio specific for reporting inflight descriptors
> > >
> >
> > As previously, I'm not sure how this relates with "the stop bit is not
> > set by the device", so my answer may be completely wrong here,
>
> It's related to time spent on the stop. E.g for block devices, it can
> simply show all the inflight buffers to guests and set the stop bit.
> Then STOP should be very fast.
>
> >
> > Even assuming the device can report in-flight descriptors, it needs to
> > wait for the backend before reporting them anyhow.
>
> Why does it need to wait for the backend? I mean for the device that
> supports in-flight descriptors, the semantic of the device should
> allow the requests to be processed twice.
>

If that is true the proposal would be greatly simplified of course.
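
A small sketch of what "processed twice" means here. A block-style
write to a fixed sector gives the same result when replayed, while an
append-style request (a hypothetical contrast, not taken from any
device in the spec) does not. The function names are illustrative only.

```python
# Sketch only: replaying an idempotent request versus a non-idempotent one.
def apply_write(disk, sector, data):
    """Block-style write: replaying it leaves the disk unchanged."""
    new_disk = dict(disk)
    new_disk[sector] = data
    return new_disk


def apply_append(log, data):
    """Append-style request: replaying it changes the result."""
    return log + [data]


once = apply_write({}, 7, b"abc")
twice = apply_write(once, 7, b"abc")   # same request replayed after resume
write_is_idempotent = once == twice

log_once = apply_append([], "x")
log_twice = apply_append(log_once, "x")
append_is_idempotent = log_once == log_twice
```

Only for requests of the first kind can the device expose them as
inflight without waiting for the backend and let them be resubmitted.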

> > And we would need
> > another indication. What is the use of separating these status?
> > (waiting for stop bit, waiting for inflight descriptors to be valid).
> >
> > The only possibility I can come up with is to actually stop the
> > request right in the middle of an operation. For example, to allow a
> > big block read to stop and then when the device is informed about
> > these inflight descriptors and its progress, it can continue. I would
> > say this is very out of scope, more about this later ([1]).
> >
> > > >
> > > > *If* we are not going to allow the guest to resume operation, where it
> > > > knows all the status of the device, then there is no value on let the
> > > > device delay the operation: From the guest point of view it either
> > > > succeed to send to the device backend and somebody else caused a
> > > > failure (external network lose the tx packet, bit rotting caused I'm
> > > > reading a different value than previously written), or it failed at
> > > > the stop moment.
> > >
> > > So it's highly device specific, e.g for ethernet, we can afford the
> > > loss of packets but not for the block devices so reporting inflight
> > > descriptors may help to resubmit those after "resuming".
> > >
> >
> > Right.
> >
> > > >
> > > > This is different with the resume possibility, where the device can
> > > > decide to hold the descriptors, stop operating, and then resume
> > > > operation.
> > > >
> > > > > > Depending on the device, it can do it
> > > > > > +in many ways as long as the driver can recover its normal operation if it
> > > > > > +resumes the device without the need of resetting it:
> > > > > > +\begin{itemize}
> > > > > > +\item Drain and wait for the completion of all pending requests until a
> > > > > > +convenient avail descriptor. Ignore any other posterior descriptor.
> > > > > > +\item Return a device-specific failure for these descriptors, so the driver
> > > > > > +can choose to retry or to cancel them.
> > > > >
> > > > > If we allow the driver to retry, we need a way to report inflight
> > > > > buffers which is not supported by the spec. A way to solve this is to
> > > > > make it device specific.
> > > > >
> > > >
> > > > Regarding the retry, I don't get you here. Re-reading the patch, I
> > > > think that "driver retry" is very ambiguous: I meant for the device to
> > > > mark the descriptor as used, but with a communication specific error
> > > > code, so the application, guest kernel, etc (driver in the standard)
> > > > can decide to retry.
> > >
> > > That's why I think introducing the virtqueue state is a must for stop,
> > > With all the indexes defined, it would be much easier to describe what
> > > the device or driver is expected to work.
> > >
> >
> > I still don't see the relationship, sorry.
>
> E.g how do you define the in-flight buffers accurately?
>

If we can make them idempotent, one descriptor is in flight if it is
available and the device is aware of that.

> >
> > What I intended to say in the patch is that the device can choose to
> > just return a device / communication error to indicate that the
> > transaction has failed at device level, but related to virtio, the
> > buffer would be marked as used.
> >
> > Maybe a good example of this is for the device to choose to return
> > VIRTIO_BLK_S_IOERR, even if the transaction is still going in the
> > block backend, but I don't know a lot of the blk device so I may be
> > wrong. I guess that the guest cannot know about the value being
> > written / read with that error code, and it is forced to re-read that.
> > But the virtqueue will be in a good state, and the device can be reset
> > and can recover its state. It's totally up to the device to choose to
> > do so.
>
> I think not, if we tie STOP to some device errors that could be even
> more complicated.
>

It's up to the device to implement it that way, but I understand your point.

> >
> > Virtqueue state is still needed, but not because the device chooses to
> > return VIRTIO_BLK_S_IOERR, but because it needs a way to recover the
> > status after the reset.
> >
> > > >
> > > > Regarding the in-flight descriptor report, it's interesting but I
> > > > cannot see a way where it does not complicate the solution a lot or
> > > > adds new dependencies. I have the next thoughts:
> > > > 1) If it works as inflight_fd, "a region of shared memory"
> > > > 1.1) This region must be in the guest's AS so the device has access to
> > > > it. This either invalidates the use of STOP from the driver point of
> > > > view as "let me know where you are not going to modify the guest's
> > > > memory anymore".
> >
> > Long shot here, but might this work with the combination of the
> > balloon device? Making this far and far from the simplicity though...
> >
> > > > 1.2) This region is on the hypervisor's AS. If the device supports it,
> > > > it is possible to implement the SVQ without the need of STOP bit. This
> > > > is equivalent to "I have a PF that also supports VF dirty memory
> > > > tracking".
> > > > 2) If it works as the config space, where the driver can ask for its
> > > > status, STOP means "STOP writing used and report via config space". No
> > > > need for reset.
> > > >
> > > > Did you have something different in mind?
> > >
> > > Not sure, maybe config space is better. What I want is to make the
> > > feature as small as possible but leaving spaces for future extension.
> > >
> > > E.g we start from the feature that is sufficient for networking
> > > devices, (but doesn't prevent the future work to extend it to block
> > > devices). I'm not familiar with the block device, but mandating the
> > > completion of inflight descriptors may cause trouble, e.g. unexpected
> > > downtime during live migration.
> > >
> >
> > [1] I agree with that, but I feel that "device or virtio specific for
> > reporting inflight descriptors" is way too broad to make it useful at
> > the moment.
>
> Yes and that's not a must for an ethernet device.
>

This is not really true given how this proposal is specified, if we use
it in the hypervisor. Just saving the virtqueue index is not enough if
the device does not use the descriptors in order, since the available
buffers may not be recoverable just by looking at the guest memory.

In that regard, we must either flush them (as this proposal does, with
the unbounded time problem), or use the inflight descriptors.
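
A minimal sketch of why a single index is not enough with out-of-order
use. It models a split virtqueue with plain lists, ignores index
wrap-around, and all names (inflight_heads, resume_from_index) are
assumptions for the example.

```python
# Sketch only: split-ring style bookkeeping with plain Python lists.
def inflight_heads(avail_ring, last_avail_idx, used_heads):
    """Descriptor heads the device fetched but has not marked as used."""
    fetched = avail_ring[:last_avail_idx]
    used = set(used_heads)
    return [head for head in fetched if head not in used]


def resume_from_index(avail_ring, index):
    """Heads a destination device would process if only an index is restored."""
    return avail_ring[index:]


avail_ring = [0, 1, 2, 3]   # heads made available by the driver
last_avail_idx = 4          # the device has fetched all four
used_heads = [1, 3]         # and completed two of them, out of order

pending = inflight_heads(avail_ring, last_avail_idx, used_heads)  # [0, 2]

# No single starting index reproduces exactly the pending set: each choice
# either skips a pending head or replays one that was already used.
candidates = [resume_from_index(avail_ring, i) for i in range(len(avail_ring) + 1)]
```

Here heads 0 and 2 are still pending, and every entry in `candidates`
differs from that set, so either the device flushes before exposing
STOP or the pending heads have to be reported explicitly.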

> >
> > Maybe the best thing to do is to put all the restrictions at this
> > moment, and when we figure out a good format for the inflight, add
> > "\item report inflight descriptors". Then, the device and the driver
> > are free to not accept any combination. Does it make sense?
>
> Somehow, to start from a version that works for networking devices.
> Where we know we don't need to care:
>
> 1) stop fail
> 2) unbound time of stop, so we don't need an interrupt
> 3) inflight buffers
> 4) new facility querying device states (shadow CVQ can do this)
>
> This will ease both of us as I feel the discussion might not be easily
> converged if we care about other types of devices with too many
> things. With networking done, we can start to support block devices
> and we can ask help from block gurus.
>

I'm fine with that, except for the claim that we don't need 3).

> >
> > > >
> > > > > > +\item Mark them as done even if they are not, if the kind of device can
> > > > > > +assume to lose them.
> > > > >
> > > > > I think "make buffer used" is better than "mark them as done". And we
> > > > > need an accurate definition of who "them" is.
> > > > >
> > > >
> > > > All items include other operations, like the ones that the device must
> > > > do internally to process the control virtqueue. But I cannot find an
> > > > example where telling the driver they are done when it's not is valid
> > > > for this particular item.
> > > >
> > > > But I agree it needs better wording.
> > > >
> > > > And I will s/them/operations/. for the next one.
> > > >
> > > > > > +\end{itemize}
> > > > > > +
> > > > > > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > > > > > +a guest's request,
> > > > >
> > > > > It's not clear what "a guest's request" means.
> > > > >
> > > >
> > > > Right. Would "operation" fit better here?
> > >
> > > Still unclear, I guess this sentence tries to define when the device
> > > can fail the stop?
> > >
> >
> > Not really, my intentions were to add a MUST operation for when the
> > device fails. The first is needed for the second though, so maybe we
> > can rephrase.
> >
> > If we agree that a device can fail the stop, I think we should not
> > restrict the circumstances where the device can fail. "If the device
> > can find external circumstances where it cannot satisfy STOP must not
> > offer STOP feature" works for me too, actually.
>
> I'd leave the STOP_FAILED for future investigation.
>
> >
> > > >
> > > > > > the device MUST set the STOP_FAILED bit for the guest to
> > > > > > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > > > > > +clears STOP_FAILED.
> > > > > > +
> > > > > > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > > > > > +and the device can pause its operation, the device MUST set the descriptors
> > > > > > +that it has done with them as used before exposing the STOP status bit as set.
> > > > > > +
> > > > > > +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> > > > > > +after exposing the STOP bit set:
> > > > > > +\begin{itemize}
> > > > > > +\item Read updates on the descriptor or driver area, or consume more buffers.
> > > > > > +\item Send any used buffer notifications to the driver.
> > > > > > +\end{itemize}
> > > > > > +
> > > > > > +The device MUST send a configuration space change right after exposing the STOP
> > > > > > +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> > > > > > +send another configuration space change notification to the driver afterwards
> > > > > > +until the guest clears it.
> > > > > > +
> > > > > > +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> > > > > > +the device MUST resume operation when the driver clears the STOP bit. The
> > > > > > +device MUST continue reading available descriptors as if an available buffer
> > > > > > +notification has reach it, starting from the last descriptor it marked as used,
> > > > >
> > > > > So I still tend to define virtqueue state as basic facility before
> > > > > defining STOP. It can make things easier.
> > > > >
> > > >
> > > > Yes, coming back to that approach can simplify the whole proposal.
> > > >
> > > > > > +and continue the regular operation after that. The device MUST read again
> > > > > > +descriptor and driver area beyond the last descriptor it marked as used when it
> > > > > > +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> > > > > > +if for some reason it cannot continue.
> > > > > > +
> > > > > >  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> > > > > >  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
> > > > > >  MUST send a device configuration change notification to the driver.
> > > > > > @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > > > > >    transport specific.
> > > > > >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> > > > > >
> > > > > > +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> > > > > > +  stop the device.
> > > > > > +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> > > > > > +
> > > > > >  \end{description}
> > > > >
> > > > > So I think the patch complicates things in various ways:
> > > > >
> > > > > 1) STOP_FAILED status bit, which seems unnecessary or even duplicated
> > > > > with NEEDS_RESET
> > > > > 2) configuration change interrupt, looks conflict with the semantic of STOP
> > > >
> > > > I'm not sure about those two, I find we will have devices with unbound
> > > > stop time where both can be useful if we agree on making this a
> > > > general facility.
> > >
> > > If the unbound stop time is the only worry, the way to report inflight
> > > descriptors looks like a better solution.
> >
> > I'm not sure if that's the only condition under which a device can
> > fail to stop, but if we agree on that we could prepare a format for
> > block devices to report them, for example. They are needed somehow in
> > the networking case of packed if buffers are used out of order.
>
> It can, but let's leave it for the future now.
>
> Actually, for inflight buffers, a better idea is to support it at the
> virtqueue level without extra data structure. But it's for sure not a
> short term solution.
>

Can you expand on this? Why do you think it is not a short term solution?

> >
> > > And STOP_FAILED is actually
> > > not accurate since it means the stop is not finished in bound time
> > > (but we need to define how long should be a bound time?)
> > >
> > > > Resetting the whole device because of this leaves
> > > > the driver with no possibility of knowing the state of the sent
> > > > descriptors.
> > > >
> > > > Of course, if these use cases are not interesting, it's easier to
> > > > leave them out for sure.
> > > >
> > > > > 3) status bit clearing (resuming), a functional duplication with RESET
> > > > > + DRIVER_OK
> > > > >
> > > >
> > > > I agree it can be obtained with a whole reset, so it can be out and
> > > > leave it for the future if needed. However it seems overkill if we
> > > > just want to rewind some descriptors back, and there is no standard
> > > > way to recover the device status beyond vq_state.
> > >
> > > It's more about the minimal self-contained set of the new features. If
> > > it's just rewind, device or virtqueue reset is sufficient.
> >
> > I'm not sure if that is true for all devices with the features the
> > standard offers at the moment, but it might be right for serial.
>
> Thanks
>
> >
> > > If we want
> > > to obtain the state, virtqueue state is a must and with virtqueue
> > > state, resuming (clearing STOP) is not a must.
> > >
> >
> > Right.
> >
> > Thanks!
> >
> > > Thanks
> > >
> > > >
> > > > Thanks!
> > > >
> > > > > > I think we'd better stick to the minimal set of functions to
> > > > > > reduce the complexity: virtqueue state + STOP bit (without clearing
> > > > > > and no config interrupt).
> > > > >
> > > >
> > > > [1] https://lists.oasis-open.org/archives/virtio-comment/202107/msg00043.html
> > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > > > > > --
> > > > > > 2.27.0
> > > > > >
> > > > >
> > > >
> > >
> >
>



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-16 14:50             ` Eugenio Perez Martin
@ 2021-11-17  3:27               ` Jason Wang
  2021-11-17  8:08                 ` Eugenio Perez Martin
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2021-11-17  3:27 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Tue, Nov 16, 2021 at 10:50 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Tue, Nov 16, 2021 at 7:56 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Tue, Nov 16, 2021 at 2:17 AM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Nov 15, 2021 at 5:08 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Fri, Nov 12, 2021 at 6:51 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Fri, Nov 12, 2021 at 5:18 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > >
> > > > > > > This patch introduces a new status bit STOP. This can be used by the
> > > > > > > driver to stop the device in order to safely fetch used descriptors
> > > > > > > status, making sure the device will not fetch new available ones.
> > > > > > >
> > > > > > > Its main use case is live migration, although it has other orthogonal
> > > > > > > use cases. It can be used to safely discard requests that have not been
> > > > > > > used: in other words, to rewind available descriptors.
> > > > > > >
> > > > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > >
> > > > > > So this is much more complicated, see below.
> > > > > >
> > > > >
> > > > > I agree it's more complicated, but it addresses some concerns raised
> > > > > on previous patches sent to the list. Not saying that all of them must
> > > > > be addressed, or addressed this way though :).
> > > > >
> > > > > > > ---
> > > > > > >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  1 file changed, 83 insertions(+)
> > > > > > >
> > > > > > > diff --git a/content.tex b/content.tex
> > > > > > > index 2aa3006..9ed0d09 100644
> > > > > > > --- a/content.tex
> > > > > > > +++ b/content.tex
> > > > > > > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > > > >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> > > > > > >    drive the device.
> > > > > > >
> > > > > > > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > > > > +  device has been stopped by the driver. This status bit is different
> > > > > > > +  from the reset since the device state is preserved.
> > > > > > > +
> > > > > > > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > > > > +  device could not stop the STOP request.
> > > > > > > +
> > > > > > >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> > > > > > >    an error from which it can't recover.
> > > > > > >  \end{description}
> > > > > > > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > > > >  recover by issuing a reset.
> > > > > > >  \end{note}
> > > > > > >
> > > > > > > +If VIRTIO_F_STOP has been negotiated,
> > > > > >
> > > > > > "has not been" actually?
> > > > > >
> > > > >
> > > > > I think the sentence is ok. In other words, "Even when VIRTIO_F_STOP
> > > > > *has been* negotiated (in other words, driver sent FEATURES_OK), the
> > > > > driver must not set or clear the STOP bit before setting DRIVER_OK".
> > > >
> > > > Ok, but what happens if we simply allow the STOP to be set if
> > > > DRIVER_OK is not set? It looks to me that the DRIVER_OK doesn't
> > > > conflict with STOP.
> > > >
> > > > (Anyhow we allow to set STOP after DRIVER_OK)
> > > >
> > >
> > > We could change it to "the driver MUST NOT set or clear STOP if
> > > FEATURES_OK is not set", which would allow the driver to start a
> > > device in stop mode. Before that should be definitely not done by a
> > > good driver.
> >
> > Yes, limiting it before FEATURES_OK is a must.
> >
> > >
> > > But if we don't allow the resume, it makes little sense to allow the
> > > driver to start (as "set DRIVER_OK bit") in stop mode anyhow.
> >
> > Yes.
> >
> > > I would
> > > say that it is better to limit that now, and allow it in the future if
> > > we find a valid use case, enabling a specific feature flag for it.
> > >
> > > I'm also fine if we decide to leave this unspecified, but limiting it
> > > now could enable us to make something useful with it in the future.
> >
> > Leaving it unspecified seems better since, if we add the limitation, it
> > introduces extra effort for future extensions.  But I'm fine with
> > either.
> >
> > >
> > > > >
> > > > > > > the driver MUST NOT set or clear STOP if
> > > > > > > +DRIVER_OK is not set.
> > > > > > > +
> > > > > > > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > > > > > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > > > > > > +acknowledges the new paused status setting the first, or the failure setting
> > > > > > > +the last. Since this change may not be instantaneous, the driver MAY wait for
> > > > > > > +the configuration change notification that the device must send after the
> > > > > > > +change.
> > > > > >
> > > > > > This is kind of tricky, it means the device can send notification
> > > > > > after it has been stopped.
> > > > >
> > > > > I don't think this part is so tricky. That notification is also sent
> > > > > when the DEVICE_NEEDS_RESET bit is set,
> > > >
> > > > I think they are different, DEVICE_NEEDS_RESET doesn't mean the device
> > > > is stopped.
> > >
> > > To clarify, what I meant is that there are situations where this
> > > notification is raised even if device configuration is not changed,
> > > but its status.
> >
> > Right, but again the notification is not a must for the status changed
> > (e.g reset).
> >
> > >
> > > NEED_RESET does not mean the device is stopped, but it (should) signal
> > > the driver that further interaction with the device will be for sure
> > > invalid. I may be wrong with this, but this way of notifying the
> > > driver relieves it of the need to check the status in every interaction.
> >
> > You are right. But it's still different with stop, we don't need to
> > check it for every interaction. What we need is to check it only after
> > the driver tries to stop the device (as reset).
> >
> > >
> > > > But what we want to achieve is to make sure there won't be
> > > > any interaction between device and driver after STOP is set by device.
> > > >
> > >
> > > If I understand you correctly, what you meant is that a driver
> > > could (and I think it will a lot of times) read the status change in
> > > this order:
> > > 1) STOP bit is set
> > > 2) Notification change arrives
> > >
> > > And 2) is weird since the device promised no more interaction somehow.
> > >
> > > I agree to some extent, but it can be read even from the opposite
> > > angle: From the moment the driver sets DRIVER_OK, every change on the
> > > device (status or config) is notified using configuration change
> > > interrupt.
> >
> > For status, it depends on what you mean by "change". If it's the value
> > that is read by the driver:
> >
> > 1) The only thing that needs to be notified is the status that can
> > only be set by device, that is NEEDS_RESET.
> > 2) For the status that is set by the driver and that the device can
> > refuse (forever or temporarily), there's no change notification:
> > DRIVER_OK, FEATURES_OK
> >
> > STOP belongs to 2). STOP_FAILED belongs to 1), but:
> >
> > 1) STOP status bit 0 means the device is not stopped
> > 2) We don't have DRIVER_FAILED and FEATURE_FAILED, instead, we just
> > check whether or not the bit is set by the device.
> >
> > So whether or not we need STOP_FAILED is still questionable.
> >
>
> I don't split in "bits set by device" or "set by the driver". The way
> I see it, the driver sends a request, and the device is going to change
> its status in the future. The driver can read the status by reading the
> two-bit bitfield:
>
> STOP_FAILED
> |STOP
> ||
> 00 - Running normally
> 01 - Device has stopped successfully
> 10 - Device could not stop, but is running normally

I wonder what's the advantage of distinguishing 10 from 00?

> 11 - Cannot find this combination at this moment.
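
For illustration only (not part of the patch): a minimal C sketch of how a
driver might decode the two-bit field listed above. It assumes the values
STOP (16) and STOP_FAILED (32) proposed in the patch; the enum and the
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as proposed in the patch under discussion. */
#define VIRTIO_STATUS_STOP        16u
#define VIRTIO_STATUS_STOP_FAILED 32u

/* The two bits must be distinct for the decoding below to make sense. */
static_assert((VIRTIO_STATUS_STOP & VIRTIO_STATUS_STOP_FAILED) == 0,
              "STOP and STOP_FAILED must not overlap");

/* Hypothetical names for the four combinations in the table. */
enum stop_state {
    STOP_STATE_RUNNING,  /* 00: running normally */
    STOP_STATE_STOPPED,  /* 01: device has stopped successfully */
    STOP_STATE_FAILED,   /* 10: could not stop, still running */
    STOP_STATE_INVALID   /* 11: not expected at this moment */
};

static enum stop_state decode_stop_state(uint8_t status)
{
    int stopped = (status & VIRTIO_STATUS_STOP) != 0;
    int failed = (status & VIRTIO_STATUS_STOP_FAILED) != 0;

    if (stopped && failed)
        return STOP_STATE_INVALID;
    if (stopped)
        return STOP_STATE_STOPPED;
    if (failed)
        return STOP_STATE_FAILED;
    return STOP_STATE_RUNNING;
}

/* For "is the device still processing buffers?", 00 and 10 answer the same. */
static int device_is_running(uint8_t status)
{
    enum stop_state s = decode_stop_state(status);

    return s == STOP_STATE_RUNNING || s == STOP_STATE_FAILED;
}
```

In this sketch 00 and 10 differ only in whether a previous stop request was
refused; device_is_running() returns the same value for both.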
>
> > >
> > > a) Regarding the standard, I don't see it so different from the
> > > NEED_RESET: the config change keeps being an out of band notification
> > > system the driver can rely on to learn of an (expected) status change.
> > > b) I don't see a big deal with changing the semantic from "no more
> > > interaction from the device" with "no more interaction but the
> > > expected config change interrupt".
> > > c) It's easy to ignore the interrupt, or even not to treat it
> > > specially after the stop: The driver already should scan config to
> > > look for changes in configuration and status, it will simply find
> > > none. Although this is not implemented widely as far as I see.
> > >
> > > In that regard, I feel that interaction is very innocuous, and to me
> > > is the straightforward solution to avoid the active polling.
> >
> > Well, the driver can choose not to do busy polling for sure, even
> > without the interrupt.
> >
>
> You mean using the transport specific method you describe in previous mails?

Yes and no, it could be:

1) transport specific method
2) any other method that doesn't do busy polling (timer, sleep etc)

>
> > >
> > > > > and (as I read) is for the
> > > > > same reason somehow: To avoid the status polling:
> > > > > * "The driver SHOULD NOT rely on completion of operations of a device
> > > > > if DEVICE_NEEDS_RESET is set." (copied from the standard)
> > > > > * The reading of the status field could be expensive / inconvenient in
> > > > > each operation.
> > > >
> > > > It makes sense for the device initiated event to use interrupt. But
> > > > for a stop, it's driver initiated, in this case the driver won't start
> > > > the work (for example the cleanup) after it makes sure the device is
> > > > stopped. Polling the status should be fine as this is how the rest
> > > > works. Anything makes stop differ from reset here? Or what worries you
> > > > without the interrupt?
> > > >
> > >
> > > This is proposed only in the scope of the concerns I saw raised in
> > > previous series: the time to stop a device could be unbound, and
> > > tricks to poll less frequently will increase migration time.
> >
> > But I don't see how an interrupt can help to reduce the time spent on
> > the stop.
>
> If a host has many pass-through devices, it needs to burn CPU to ask
> all of them periodically. To reduce that burden, it polls less
> frequently. Because of that, some devices are stopped but hypervisor
> is not aware of that.
>
> To solve it, the device must be able to tell the hypervisor / driver
> when it has stopped. Of course, interrupt may not be the only way, and
> actively polling will always be a choice.

Yes, usually Linux drivers will not do busy polling; instead they can
use cpu_relax() or msleep() during the polling.
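
For illustration only: a userspace C sketch of such a polling loop, which
sleeps between status reads and gives up after a bounded number of attempts.
nanosleep() stands in for the kernel's msleep(); the read_status_fn accessor,
wait_for_stop() and the simulated device are hypothetical, and the STOP value
(16) is the one proposed in the patch.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define VIRTIO_STATUS_STOP 16u  /* value proposed in the patch */

/* Hypothetical accessor: returns the current device status byte. */
typedef uint8_t (*read_status_fn)(void *dev);

/*
 * Read the status up to max_polls times, sleeping interval_ms between
 * reads instead of spinning.
 * Returns 0 once STOP is observed, -1 if it never shows up (timeout).
 */
static int wait_for_stop(read_status_fn read_status, void *dev,
                         unsigned int max_polls, unsigned int interval_ms)
{
    struct timespec ts = {
        .tv_sec = interval_ms / 1000,
        .tv_nsec = (long)(interval_ms % 1000) * 1000000L,
    };

    assert(read_status != NULL);
    for (unsigned int i = 0; i < max_polls; i++) {
        if (read_status(dev) & VIRTIO_STATUS_STOP)
            return 0;
        nanosleep(&ts, NULL);  /* userspace stand-in for msleep() */
    }
    return -1;
}

/* Simulated device: reports STOP only after a number of status reads. */
struct fake_dev {
    unsigned int reads_until_stop;
    uint8_t status;
};

static uint8_t fake_read_status(void *opaque)
{
    struct fake_dev *d = opaque;

    if (d->reads_until_stop == 0)
        d->status |= VIRTIO_STATUS_STOP;
    else
        d->reads_until_stop--;
    return d->status;
}
```

What to do on the -1 (timeout) path is a policy decision for the caller, for
example the driver-side timeout plus reset mentioned elsewhere in this thread.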

>
> > The downtime is usually a user policy, so the VMM can choose
> > to timeout the stop and perform the resume.
> >
>
> With "resume" do you mean to actually reset the device if the STOP bit
> is not set in time? If that is a possibility then it is sure we will
> need to handle the inflight descriptors.

Yes, but I wonder whether or not we can leave the inflight descriptors
for the future. (If it's not a must for networking). The idea is to
have something that can work quickly. But I'm also fine to propose it
now. With that I believe most devices can be stopped in a short
time (we don't need to wait for the completion of inflight requests).

>
> > As discussed, the way to advertise the inflight buffers might be a
> > solution for this.
> >
>
> If these transactions are idempotent, as you say later, then yes.
>
> > >
> > > I will fully agree if these are left to the future: it is easy to
> > > implement this chunk of the proposal under a separated feature flag if
> > > this need arises. Sorry if that part was not clear enough.
> >
> > That's fine.
> >
> > >
> > > > > * Solution: Instead of polling, make a device facility to notify the
> > > > > driver that it cannot trust the device is going to behave properly /
> > > > > same as before anymore via notification.
> > > > >
> > > > > We can add another exception to the "device configuration space
> > > > > change" in "Notification of Device Configuration Changes", like the
> > > > > one already present:
> > > > > "In addition, this notification is triggered by the device setting
> > > > > DEVICE_NEEDS_RESET".
> > > > >
> > > > > I understand it sounds tricky that the device sends a notification
> > > > > when it's stopped, but in my opinion it's aligned with previous
> > > > > behavior (DEVICE_NEEDS_RESET),
> > > >
> > > > I think not,  e.g DEVICE_NEEDS_RESET doesn't (or it can't) mean the
> > > > device won't process the buffer or send an interrupt.
> > > >
> > >
> > > From the driver point of view, it means that the driver cannot trust
> > > the device anymore until the reset, so the driver actions are similar:
> >
> > I think we need to clarify the exact semantics of STOP_FAILED, and it
> > will be very hard to differentiate
> >
> > 1) The device can not be stopped in short time
> >
> > from
> >
> > 2) The device can not be stopped forever
> >
> > At least in case 1) the driver still can trust the device.
> >
>
> I think that from the driver POV is the same. If device cannot stop,
> it will continue operating normally.
>
> > >
> > > ""
> > > the driver can’t assume requests in flight will be completed if
> > > DEVICE_NEEDS_RESET is set, nor can it assume that they have not been
> > > completed
> > > ""
> > >
> > > (Sorry for being circular here, I think it proceeds here too) What I
> > > meant is that the device sent an out of band notification when the
> > > device status changed. The driver could check the status field before
> > > processing every used buffer and also with a timer just in case, and
> > > DEVICE_NEEDS_RESET would not need the config interrupt change. But the
> > > interrupt gives convenience to the whole operation.
> > >
> > > Every time the driver gets that interrupt, it must re-check all the
> > > device configuration and status anyway. It can still make buffers
> > > available while processing it, but that's the meaning of the interrupt
> > > to me. And a status change after DRIVER_OK fits to it, from my point
> > > of view.
> >
> > I fully agree, but it's different:
> >
> > 1) The driver doesn't know when there will be a NEEDS_RESET, so without
> > an interrupt, it must check the status for each operation
> > 2) The driver knows when there will be STOP/STOP_FAILED, it only needs
> > to check the status after it tries to stop the device
> >
>
> You know that the device will set from that point in the future, with
> an unbound time.

So my point is introducing facilities to avoid the unbound time. I
think both of us know dealing with it is not easy.

> That could not be a problem, but there are concerns
> raised on previous patches about this.
>
> > >
> > > > > it's explicitly stated that it will be
> > > > > the last one, and it's caused because of the inconvenience of polling
> > > > > device status. Even if the driver can use other mechanisms.
> > > >
> > > > I think STOP works much more similarly to reset not NEEDS_RESET. The
> > > > only difference with reset is that STOP needs to preserve the device
> > > > states and we don't (or can't) use interrupt to signal the completion
> > > > of reset.
> > > >
> > >
> > > From the semantic point of view, yes. But in practical terms we can
> > > face unbounded time.
> >
> > The problem is:
> >
> > 1) how to define the "unbounded time", if we don't define it exactly,
> > the device may abuse the status bit which may cause a lot of troubles
>
> To me, the unbounded time is where the device does not guarantee that
> you will have the STOP status set the next time you read it. In case
> of an untrusted device, this may include every one of them, for sure.

Then the device behaviour is tied to the driver's behaviour. The
problem is that there's no bound time between the write and the
following read. There could be arbitrary time in the middle.

>
> You care a little bit less about it if the device notifies you about
> the condition: You simply have the driver waiting for it, but
> operating normally in other regards.

This can be achieved even without an interrupt.

>
> That notification can be a basic facility, a transport device one, or
> whichever other method.
>
> To mandate that the device is stopped the next time the status bit is
> read is another possibility for sure but if I understood correctly
> that would let other devices out. If vqs descriptors are idempotent,
> inflight_fd may be a way to get rid of that unbound time for sure.

To be clear, inflight_fd should be part of the device area?

>
> > 2) there are other approaches that we can deal with the unbound time,
> > timeout in driver + reset
> >
> > > I mean, both operations have unbound time for
> > > sure, but I would say that any device should handle reset way faster
> > > than the STOP.
> >
> > Any reason that STOP is faster? I'd expect STOP is a subset of reset,
> > that is, in order to do reset, we must first stop.
> >
>
> I'm not sure if that is the way hardware can work, but with reset the
> device can reclaim the needed resources for the communication in an
> asynchronous way, since they are not needed for the driver to ask.

Yes and I think we can do the same for the stop.

> With the STOP, they need to be communicated to the driver somehow in a
> virtio way.

So I think if it's device specific, we'd better not assume which is
faster. Instead, as discussed, it's better to try our best to avoid
the unbound stop (as we did for reset). If we fail, it's not too late
to consider how to recover from that.

>
> > >
> > > I fully agree on your point, but I can also see the other way around:
> > > It would be convenient to have a configuration interrupt for the reset
> > > too, but it is impossible since we cannot configure any before the
> > > reset.
> >
> > I think it's more about the question:
> >
> > 1) Why an interrupt is a must for STOP
> >
> > than
> >
> > 2) If we can use an interrupt
> >
> > My questions are all for 1).
> >
>
> It's not a must.
>
> > >
> > > > >
> > > > > If the community still has concerns about it, another option is to
> > > > > actually extract the way the device notifies it from the general
> > > > > facilities, and make it transport specific. But to use the device
> > > > > configuration change notification for this makes sense to me. The
> > > > > device configuration has changed.
> > > >
> > > > See above, I think we should have a consistent way to handle reset and stop.
> > > >
> > > > >
> > > > > > As discussed in the previous versions,
> > > > > > the driver is free to use a timer or whatever other mechanism if it
> > > > > > doesn't like the busy polling. I wonder how much value we can gain
> > > > > > from a dedicated config interrupt. Especially consider some transport
> > > > > > can use transport specific interrupt (not virtio specific interrupt)
> > > > > > for reporting whether or not set status succeed.
> > > > > >
> > > > >
> > > > > In my opinion, *if* we agree that a stop is a virtio facility and not
> > > > > a per-device one, and *if* we agree that a notification is required
> > > > > for the device to notify the stop, it makes sense to use a
> > > > > transport-independent mechanism that the device must already
> > > > > implement.
> > > >
> > > > So the major question is why a notification is a must? And Just to be
> > > > clear, there could be transport specific mechanisms for error
> > > > reporting.
> > > >
> > > > E.g.
> > > >
> > > > 1) PCI can have non-posted write, if we use non-posted write to carry
> > > > the stop command, the device can return whether or not the device is
> > > > stopped successfully.
> > > >
> > > > or
> > > >
> > > > 2) Some other transport can convert the stop status bit set into a
> > > > command and queue it to device specific queue, device can then use
> > > > its own specific interrupt to report when the stop is handled
> > > > (success or fail)
> > > >
> > >
> > > I would be totally fine with that too.
> > >
> > > > >
> > > > > > >If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > > > > > > +try new STOP attempts.
> > > > > >
> > > > > > Does the device need to re-read the STOP_FAILED for synchronization?
> > > > >
> > > > > I tend to see the status as something that belongs to the device and
> > > > > is exposed to the driver. In that sense, the write from the guest
> > > > > triggers an event on the device, and the device decides what will be
> > > > > exposed on that field (MMIO?) on the next driver read. If it's not
> > > > > that way, we couldn't use the STOP bit that way, right?
> > > >
> > > > Yes, but this is not an answer to my question. It's about the
> > > > ordering, when write returns it doesn't mean the write arrives at the
> > > > device, this is the case of PCI at least. So we need a mechanism to
> > > > make sure the write arrives at the device (PCI read will flush
> > > > previous write).
> > > >
> > >
> > > I didn't see that in your original question, sorry. But the PCI read
> > > that flushes the write is the driver one, isn't it?
> > >
> > > In that case I would say that "the read" is part of "the write". It's
> > > an issue of the PCI protocol, which I would say doesn't belong to this
> > > section (or even this document?): To implement virtio over PCI, you
> > > know that virtio needs a write, and, in particular, you know that PCI
> > > needs a posterior read to make sure that write is effective.
> > >
> > > Either that, or that the driver must use non-posted ones if it wants
> > > the device to note it.
> > >
> > > Or am I still missing something?
> >
> > Just to make sure we are on the same page, in this paragraph, you
> > said this at the beginning:
> >
> > "If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > to ensure the STOP or STOP_FAILED bit is set after the write."
> >
> > And in the end of the paragraph:
> >
> > If the device sets the STOP_FAILED bit, the driver MUST clear it
> > before try new STOP attempts. But you don't define whether we need to
> > re-read to make sure STOP_FAILED is clear if the driver tries to clear
> > it. Is this intended?
> >
>
> Not really, I assumed that the write of clearing STOP_FAILED and the
> write of STOP would be ordered, and that the device should have no
> problem clearing that status bit, since I don't see any kind of
> resource reclamation or anything like that there.

Ok.

>
> > >
> > > > >
> > > > > > I
> > > > > > wonder how much we can gain from STOP_FAILED, the patch is unclear on
> > > > > > when that the device needs to set this bit. And driver can choose to
> > > > > > reset after a specific timeout anyhow.
> > > > > >
> > > > >
> > > > > The conditions where the device needs to set this bit are unspecified
> > > > > because it depends on the device: Not only to the kind of device, but
> > > > > also on the device backend.
> > > > >
> > > > > The same condition (regarding the possibility of handling the pending
> > > > > buffers) could cause different devices to react differently. A network
> > > > > device could decide it's fine to drop pending tx, let the guest think
> > > > > that "the network lost them", and mark them as done,
> > > >
> > > > We may meet the similar issue during reset.
> > > >
> > >
> > > Yes, but the driver should be fine to fail a reset, it does not want
> > > to use the device anymore or it wants to totally override the device
> > > state. If a stop fails, the driver would expect the device to continue
> > > operating in my opinion, because it will be impossible to recover the
> > > device state.
> >
> > This is only true if we allow the stop to fail. It would be an
> > issue if the driver fails to stop a device since it can fail the stop
> > of the entire VM which is not something that the VMM is expecting.
> >
> > If we don't allow the stop to fail and we allow the device to expose
> > the inflight buffers, we are all fine:
> >
> > 1) VM is guaranteed to be stopped
> > 2) stop can be finished in time
> >
> > Devices are free to choose to wait for the short time request and tag
> > the long time request as inflight.
> >
> > >
> > > This is again something that we could leave if we decide it is not
> > > necessary at this moment: It just shows how a concern of previous
> > > proposals can be solved, at least technically.
> >
> > To me, I think we can start from a set of functions that can make e.g
> > the virtio-net to work to unblock:
> >
> > 1) live migration work
> > 2) extensions to other devices (e.g inflight could be done on top as
> > new features)
> >
> > >
> > > > > where a
> > > > > persistent storage cannot do that for write requests. Just as an
> > > > > example, not saying that networking devices must do that :).
> > > >
> > > > So I think this brings extra complexity that we probably don't need to
> > > > worry about now. The reason is that the spec doesn't allow the reset
> > > > to fail.
> > > >
> > >
> > > It can be left for the future for sure.
> > >
> > > > >
> > > > > > > +
> > > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > >
> > > > > > Any motivation for this? it looks to me it makes the feature coupled
> > > > > > with the virtqueue state proposal? It seems odd to allow avail change
> > > > > > but not the last_avail_idx change.
> > > > > >
> > > > >
> > > > > On second thought, I think you are right and this overlaps with the
> > > > > state proposal.
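
For illustration only: a small C sketch of the range check implied by the
quoted patch text ("the new avail_idx MUST be within used_idx and used_idx
plus virtqueue size"), assuming 16-bit free-running indices that wrap around
as in the split virtqueue. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if new_avail lies in [used_idx, used_idx + queue_size],
 * taking 16-bit wrap-around of the free-running indices into account.
 */
static int avail_idx_in_window(uint16_t new_avail, uint16_t used_idx,
                               uint16_t queue_size)
{
    /* Unsigned 16-bit subtraction yields the distance modulo 2^16. */
    uint16_t distance = (uint16_t)(new_avail - used_idx);

    assert(queue_size != 0);
    return distance <= queue_size;
}
```

The subtraction is done modulo 2^16, so an avail_idx that has wrapped past
65535 while used_idx has not is still accepted when it is within the window.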
> > > > >
> > > > > > > +
> > > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > > > > +the driver MAY change any descriptor.
> > > > > > > +
> > > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> > > > > > > +the driver can resume it clearing the STOP status bit. It MUST re-read the
> > > > > > > +device status to ensure the STOP bit is clear after the write. The device
> > > > > > > +acknowledges the new status clearing it. Since this change may not be
> > > > > > > +instantaneous, the driver MAY wait for the configuration change notification
> > > > > > > +that the device must send after the change.
> > > > > >
> > > > > > Do we really need resuming? it's kind of:
> > > > > >
> > > > > > 1) STOP -> clear STOP
> > > > > >
> > > > > > vs
> > > > > >
> > > > > > 2) STOP -> RESET -> DRIVER_OK
> > > > > >
> > > > > > Using 2) preserves the semantics that the driver can't clear the status bit.
> > > > > >
> > > > >
> > > > > You are totally right in that regard. But the use case simplifies the
> > > > > operation when the driver only wants to take back some available
> > > > > descriptors still not used, in the range last_avail_idx..avail_idx.
> > > > > Doing that could be a big burden for drivers, who would need to
> > > > > re-send every status. MST proposed that use case at [1].
> > > >
> > > > Yes, but it looks to me this doesn't require the resuming? And the per
> > > > virtqueue reset is being proposed here.
> > > >
> > > > https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07818.html
> > > >
> > > > Actually, there's a subtle difference between 1) and 2). That is using
> > > > 2) doesn't make sure we can "resume" from the index where we stopped.
> > > > But this won't be an issue considering we know that we need to support
> > > > setting device virtqueue state(index). So if we want to resume from
> > > > the exact index it could be:
> > > >
> > > > STOP -> RESET -> setting index -> DRIVER_OK
> > > >
> > >
> > > With the state I meant more than VQ state, but the device state in
> > > general. For example, for the network, you must also send all the
> > > needed control commands to recover mac, rx filters, etc.
> >
> > I'm not sure I get this. For those cvq stuff, with the help of the
> > shadow virtqueue, we don't need any spec extensions. What did I miss
> > here?
> >
>
> Right, that was not the example I intended to put actually. I did a
> lot of back and forth for the answer and I put the wrong one here,
> sorry :).
>
> But my concern is solved if we treat the inflight as idempotent. All
> unanswered things above this are solved with that too.
>
> > >
> > > That's what I meant with "if you just want to rewind some descriptors,
> > > resetting the whole device is overkill".
> > >
> > > The example may be wrong, I can think of virtiofs and the need to keep
> > > files opened:
> > > * If we go through a full reset cycle, the files opened may not be
> > > the same as the closed ones, like deleted files with open handles.
> > > * If we go through a full reset cycle, watchers may skip a change.
> > >
> > > Of course, this complexity may be left for the future and simply state
> > > that, if that is the case, the device cannot offer stop feature.
> > > Virtiofs already has other complexities that make its migration
> > > hard, but I think the point is explained.
> >
> > There are long discussions about the virtiofs migration. But it's out
> > of the scope for the discussion of device stop since it's mainly about
> > how to define and expose device states. For stop, it's more than
> > sufficient to say the device states need to be preserved after the
> > device is stopped.
> >
> > I'd rather go with something simple to work for a simple type of
> > device like ethernet. Otherwise there will be endless discussion. For
> > any features that are not needed by the ethernet device, I would leave
> > it for future investigation.
> >
> > >
> > > > >
> > > > > In that regard, the straightforward thing to do is modify avail_idx /
> > > > > descriptors from that range and let resume. However, the RESET path
> > > > > makes it easier to implement the device part of course, and the guest
> > > > > can also achieve the rewind that way.
> > > > >
> > > > > > > +
> > > > > > >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> > > > > > >
> > > > > > >  The device MUST NOT consume buffers or send any used buffer
> > > > > > >  notifications to the driver before DRIVER_OK.
> > > > > > >
> > > > > > > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > > > > > > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > > > > > > +or clear of STOP.
> > > > > > > +
> > > > > > > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > > > > > > +operations after the driver writes STOP.
> > > > > >
> > > > > > > I wonder if it's better to leave this to the device to decide. E.g. some
> > > > > > > block devices may require a very long time to finish the inflight
> > > > > > operations.
> > > > > >
> > > > >
> > > > > (Letting out SVQ + inflight descriptors for this part of the response,
> > > > > I will come back to it later)
> > > > >
> > > > > But if virtqueue is not valid anymore, how can it report them when
> > > > > finished?
> > > >
> > > > It's still valid since the STOP bit is not set by the device.
> > > >
> > >
> > > Then I don't understand your answer. To my proposal of:
> > >
> > > "If VIRTIO_F_STOP has been negotiated, the device MUST finish any
> > > in-flight operations after the driver writes STOP."
> > >
> > > You answered:
> > >
> > > "I wonder if it's better to leave this to device to decide. E.g some
> > > block devices may require a very long time to finish the inflight
> > > operations."
> > >
> > > The device must finish all requests before it shows the STOP bit as
> > > set to the driver. Maybe it is better to rephrase it like:
> > >
> > > If VIRTIO_F_STOP has been negotiated, the device MUST finish any
> > > in-flight operations after the driver writes STOP and before it
> > > exposes its STOP status bit as set.
> > >
> > > ?
> >
> > I meant that when to set STOP is highly device specific. E.g. for virtio
> > block devices, which allow the in-flight requests to be re-submitted
> > after resume, the device can choose to not wait for the completion of
> > the inflight operations and expose them to the driver. This helps to
> > reduce the time spent on the stop.
> >
> > >
> > > > > In that sense, I would say it's better to report failure and
> > > > > let the guest handle it as if the disk is unavailable (timeout?
> > > > > temporary faulty sector? I'm not sure what is the most suitable way).
> > > >
> > > > This could be addressed by leaving the following choices to the devices:
> > > >
> > > > 1) complete the inflight requests
> > > > 2) device or virtio specific for reporting inflight descriptors
> > > >
> > >
> > > As previously, I'm not sure how this relates with "the stop bit is not
> > > set by the device", so my answer may be completely wrong here,
> >
> > It's related to time spent on the stop. E.g for block devices, it can
> > simply show all the inflight buffers to guests and set the stop bit.
> > Then STOP should be very fast.
> >
> > >
> > > Even assuming the device can report in-flight descriptors, it needs to
> > > wait for the backend before reporting them anyhow.
> >
> > Why does it need to wait for the backend? I mean for the device that
> > supports in-flight descriptors, the semantic of the device should
> > allow the requests to be processed twice.
> >
>
> If that is true the proposal would be greatly simplified of course.

It seems to be true for virtio-blk (at least current QEMU migrates
inflight requests). But it might not be true for other types of devices
(anyhow we can leave them for future investigation).

>
> > > And we would need
> > > another indication. What is the use of separating these status?
> > > (waiting for stop bit, waiting for inflight descriptors to be valid).
> > >
> > > The only possibility I can come up with is to actually stop the
> > > request right in the middle of an operation. For example, to allow a
> > > big block read to stop and then when the device is informed about
> > > these inflight descriptors and its progress, it can continue. I would
> > > say this is very out of scope, more about this later ([1]).
> > >
> > > > >
> > > > > *If* we are not going to allow the guest to resume operation, where it
> > > > > knows all the status of the device, then there is no value in letting the
> > > > > device delay the operation: From the guest point of view it either
> > > > > succeeded in sending to the device backend and somebody else caused a
> > > > > failure (the external network lost the tx packet, bit rot caused me to
> > > > > read a different value than previously written), or it failed at
> > > > > the stop moment.
> > > >
> > > > So it's highly device specific, e.g for ethernet, we can afford the
> > > > loss of packets but not for the block devices so reporting inflight
> > > > descriptors may help to re-submit those after "resuming".
> > > >
> > >
> > > Right.
> > >
> > > > >
> > > > > This is different with the resume possibility, where the device can
> > > > > decide to hold the descriptors, stop operating, and then resume
> > > > > operation.
> > > > >
> > > > > > > Depending on the device, it can do it
> > > > > > > +in many ways as long as the driver can recover its normal operation if it
> > > > > > > +resumes the device without the need of resetting it:
> > > > > > > +\begin{itemize}
> > > > > > > +\item Drain and wait for the completion of all pending requests until a
> > > > > > > +convenient avail descriptor. Ignore any other posterior descriptor.
> > > > > > > +\item Return a device-specific failure for these descriptors, so the driver
> > > > > > > +can choose to retry or to cancel them.
> > > > > >
> > > > > > If we allow the driver to retry, we need a way to report inflight
> > > > > > buffers which is not supported by the spec. A way to solve this is to
> > > > > > make it device specific.
> > > > > >
> > > > >
> > > > > Regarding the retry, I don't get you here. Re-reading the patch, I
> > > > > think that "driver retry" is very ambiguous: I meant for the device to
> > > > > mark the descriptor as used, but with a communication specific error
> > > > > code, so the application, guest kernel, etc (driver in the standard)
> > > > > can decide to retry.
> > > >
> > > > That's why I think introducing the virtqueue state is a must for stop.
> > > > With all the indexes defined, it would be much easier to describe how
> > > > the device or driver is expected to work.
> > > >
> > >
> > > I still don't see the relationship, sorry.
> >
> > E.g how do you define the in-flight buffers accurately?
> >
>
> If we can make them idempotent, one descriptor is in flight if it is
> available and the device is aware of that.

To some extent, but it would be tricky to define 'aware' since that is
device specific.

>
> > >
> > > What I intended to say in the patch is that the device can choose to
> > > just return a device / communication error to indicate that the
> > > transaction has failed at device level, but related to virtio, the
> > > buffer would be marked as used.
> > >
> > > Maybe a good example of this is for the device to choose to return
> > > VIRTIO_BLK_S_IOERR, even if the transaction is still going in the
> > > block backend, but I don't know a lot of the blk device so I may be
> > > wrong. I guess that the guest cannot know about the value being
> > > written / read with that error code, and it is forced to re-read that.
> > > But the virtqueue will be in a good state, and the device can be reset
> > > and can recover its state. It's totally up to the device to choose to
> > > do so.
> >
> > I think not, if we tie STOP to some device errors that could be even
> > more complicated.
> >
>
> It's up to the device to implement that way but I understand your point.
>
> > >
> > > Virtqueue state is still needed, but not because the device chooses to
> > > return VIRTIO_BLK_S_IOERR, but because it needs a way to recover the
> > > status after the reset.
> > >
> > > > >
> > > > > Regarding the in-flight descriptor report, it's interesting but I
> > > > > cannot see a way where it does not complicate the solution a lot or
> > > > > adds new dependencies. I have the next thoughts:
> > > > > 1) If it works as inflight_fd, "a region of shared memory"
> > > > > 1.1) This region must be in the guest's AS so the device has access to
> > > > > > it. This invalidates the use of STOP from the driver point of
> > > > > > view as "let me know when you are not going to modify the guest's
> > > > > > memory anymore".
> > >
> > > Long shot here, but might this work in combination with the
> > > balloon device? That moves this further and further from simplicity, though...
> > >
> > > > > 1.2) This region is on the hypervisor's AS. If the device supports it,
> > > > > it is possible to implement the SVQ without the need of STOP bit. This
> > > > > is equivalent to "I have a PF that also supports VF dirty memory
> > > > > tracking".
> > > > > 2) If it works as the config space, where the driver can ask for its
> > > > > status, STOP means "STOP writing used and report via config space". No
> > > > > need for reset.
> > > > >
> > > > > Did you have something different in mind?
> > > >
> > > > Not sure, maybe config space is better. What I want is to make the
> > > > feature as small as possible but leaving spaces for future extension.
> > > >
> > > > E.g we start from the feature that is sufficient for networking
> > > > devices, (but doesn't prevent the future work to extend it to block
> > > > devices). I'm not familiar with the block device, but mandating the
> > > > completion of inflight descriptors may cause trouble, e.g. unexpected
> > > > downtime during live migration.
> > > >
> > >
> > > [1] I agree with that, but I feel that "device or virtio specific for
> > > reporting inflight descriptors" is way too broad to make it useful at
> > > the moment.
> >
> > Yes and that's not a must for an ethernet device.
> >
>
> This is not really true given how this proposal is specified, if we use
> it in the hypervisor. Just saving the virtqueue index is not enough if
> the device is not using the descriptors in order, since the available
> buffers may not be recoverable just looking at the guest memory.
>
> In that regard, we must either flush them (as this proposal does, with
> the unbounded time problem), or use the inflight descriptors.

I meant that for an ethernet device, the device can simply complete all
inflight buffers before the STOP. Or did I miss anything here?
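
To make the out-of-order concern quoted above concrete, here is a
rough Python sketch. It is only an illustration under simplifying
assumptions: the function name inflight_heads and its parameters are
hypothetical, ring wrap-around and descriptor chaining are ignored, and
this is not a proposed format for reporting inflight descriptors.

```python
from typing import List, Set


def inflight_heads(avail_ring: List[int],
                   last_avail_idx: int,
                   used_heads: Set[int]) -> List[int]:
    """Return descriptor heads the device fetched but has not marked used.

    avail_ring     -- heads in the order the driver made them available
    last_avail_idx -- how many avail entries the device has fetched so far
    used_heads     -- heads the device has already returned as used

    Hypothetical model: no wrap-around, each head appears at most once.
    """
    fetched = avail_ring[:last_avail_idx]
    return [head for head in fetched if head not in used_heads]


# In-order use: the inflight heads form one contiguous tail, so a single
# saved index is enough to describe them.
in_order = inflight_heads([0, 1, 2, 3], 4, {0, 1})

# Out-of-order use: the inflight heads are scattered, so a single index
# cannot describe them and they would have to be flushed or reported.
out_of_order = inflight_heads([0, 1, 2, 3], 4, {1, 3})
```

With in-order completion the result is [2, 3]; with out-of-order
completion it is [0, 2], which is the case where saving only the
virtqueue index loses information.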

>
> > >
> > > Maybe the best thing to do is to put all the restrictions at this
> > > moment, and when we figure out a good format for the inflight, add
> > > "\item report inflight descriptors". Then, the device and the driver
> > > are free to not accept any combination. Does it make sense?
> >
> > Somehow, to start from a version that works for networking devices.
> > Where we know we don't need to care:
> >
> > 1) stop fail
> > 2) unbound time of stop, so we don't need an interrupt
> > 3) inflight buffers
> > 4) new facility querying device states (shadow CVQ can do this)
> >
> > This will make it easier for both of us, as I feel the discussion might
> > not converge easily if we care about other types of devices with too
> > many things. With networking done, we can start to support block devices
> > and we can ask for help from the block gurus.
> >
>
> I'm fine with that except that we don't need 3.

I still think it's not a must for an ethernet device (we don't have that
in all the current virtio-net backends; does anything make hardware
different in this situation?).

But I'm ok if we start to propose the inflight stuff, which can make up
a complete virtqueue state. I wonder whether it's better to use the
device-area for that instead of a transport specific way to access
it.

>
> > >
> > > > >
> > > > > > > +\item Mark them as done even if they are not, if the kind of device can
> > > > > > > +assume to lose them.
> > > > > >
> > > > > > I think "make buffer used" is better than "mark them as done". And we
> > > > > > > need an accurate definition of what "them" refers to.
> > > > > >
> > > > >
> > > > > All items include other operations, like the ones that the device must
> > > > > do internally to process the control virtqueue. But I cannot find an
> > > > > example where telling the driver they are done when it's not is valid
> > > > > for this particular item.
> > > > >
> > > > > But I agree it needs better wording.
> > > > >
> > > > > And I will s/them/operations/. for the next one.
> > > > >
> > > > > > > +\end{itemize}
> > > > > > > +
> > > > > > > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > > > > > > +a guest's request,
> > > > > >
> > > > > > > It's not clear what "a guest's request" means.
> > > > > >
> > > > >
> > > > > Right. Would "operation" fit better here?
> > > >
> > > > Still unclear, I guess this sentence tries to define when the device
> > > > can fail the stop?
> > > >
> > >
> > > Not really, my intentions were to add a MUST operation for when the
> > > device fails. The first is needed for the second though, so maybe we
> > > can rephrase.
> > >
> > > If we agree that a device can fail the stop, I think we should not
> > > restrict the circumstances where the device can fail. "If the device
> > > can find external circumstances where it cannot satisfy STOP, it must
> > > not offer the STOP feature" works for me too, actually.
> >
> > I'd leave the STOP_FAILED for future investigation.
> >
> > >
> > > > >
> > > > > > > the device MUST set the STOP_FAILED bit for the guest to
> > > > > > > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > > > > > > +clears STOP_FAILED.
> > > > > > > +
> > > > > > > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > > > > > > +and the device can pause its operation, the device MUST set the descriptors
> > > > > > > +that it has done with them as used before exposing the STOP status bit as set.
> > > > > > > +
> > > > > > > +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> > > > > > > +after exposing the STOP bit set:
> > > > > > > +\begin{itemize}
> > > > > > > +\item Read updates on the descriptor or driver area, or consume more buffers.
> > > > > > > +\item Send any used buffer notifications to the driver.
> > > > > > > +\end{itemize}
> > > > > > > +
> > > > > > > +The device MUST send a configuration space change right after exposing the STOP
> > > > > > > +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> > > > > > > +send another configuration space change notification to the driver afterwards
> > > > > > > +until the guest clears it.
> > > > > > > +
> > > > > > > +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> > > > > > > +the device MUST resume operation when the driver clears the STOP bit. The
> > > > > > > +device MUST continue reading available descriptors as if an available buffer
> > > > > > > +notification has reach it, starting from the last descriptor it marked as used,
> > > > > >
> > > > > > > So I still tend to define virtqueue state as a basic facility before
> > > > > > > defining STOP. It can make things easier.
> > > > > >
> > > > >
> > > > > Yes, coming back to that approach can simplify the whole proposal.
> > > > >
> > > > > > > +and continue the regular operation after that. The device MUST read again
> > > > > > > +descriptor and driver area beyond the last descriptor it marked as used when it
> > > > > > > +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> > > > > > > +if for some reason it cannot continue.
> > > > > > > +
> > > > > > >  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> > > > > > >  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
> > > > > > >  MUST send a device configuration change notification to the driver.
> > > > > > > @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > > > > > >    transport specific.
> > > > > > >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> > > > > > >
> > > > > > > +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> > > > > > > +  stop the device.
> > > > > > > +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> > > > > > > +
> > > > > > >  \end{description}
> > > > > >
> > > > > > > So I think the patch complicates things in various ways:
> > > > > >
> > > > > > 1) STOP_FAILED status bit, which seems unnecessary or even duplicated
> > > > > > with NEEDS_RESET
> > > > > > > 2) configuration change interrupt, which looks to conflict with the semantics of STOP
> > > > >
> > > > > I'm not sure about those two, I find we will have devices with unbound
> > > > > stop time where both can be useful if we agree on making this a
> > > > > general facility.
> > > >
> > > > If the unbound stop time is the only worry, the way to report inflight
> > > > descriptors looks like a better solution.
> > >
> > > I'm not sure if that's the only condition under which a device can
> > > fail to stop, but if we agree on that we could prepare a format for
> > > block devices to report them, for example. They are needed somehow in
> > > the networking case of packed if buffers are used out of order.
> >
> > It can, but let's leave it for the future now.
> >
> > Actually, for inflight buffers, a better idea is to support it at the
> > virtqueue level without extra data structure. But it's for sure not a
> > short term solution.
> >
>
> Can you expand on this? Why do you think it is not a short term solution?

For the inflight buffers, I guess we need to add more data structures in
the device area. That's fine. But I wonder whether, if we re-design the
virtqueue carefully, the inflight buffers could be deduced from
the vring. That requires a re-design of the current vring, which is not
easy.

Thanks

>
> > >
> > > > And STOP_FAILED is actually
> > > > not accurate since it means the stop is not finished in bound time
> > > > (but we need to define how long should be a bound time?)
> > > >
> > > > > Resetting the whole device because of this leaves
> > > > > the driver with no possibility of knowing the state of the sent
> > > > > descriptors.
> > > > >
> > > > > Of course, if these use cases are not interesting, it's easier to
> > > > > leave them out for sure.
> > > > >
> > > > > > 3) status bit clearing (resuming), a functional duplication with RESET
> > > > > > + DRIVER_OK
> > > > > >
> > > > >
> > > > > I agree it can be obtained with a whole reset, so it can be left out
> > > > > for the future if needed. However it seems overkill if we
> > > > > just want to rewind some descriptors back, and there is no standard
> > > > > way to recover the device status beyond vq_state.
> > > >
> > > > It's more about the minimal self-contained set of the new features. If
> > > > it's just rewind, device or virtqueue reset is sufficient.
> > >
> > > I'm not sure if that is true for all devices with the features the
> > > standard offers at the moment, but it might be right for serial.
> >
> > Thanks
> >
> > >
> > > > If we want
> > > > to obtain the state, virtqueue state is a must and with virtqueue
> > > > state, resuming (clearing STOP) is not a must.
> > > >
> > >
> > > Right.
> > >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > > I think we'd better stick to the minimal set of functions to
> > > > > > reduce the complexity: virtqueue state + STOP bit (without clearing
> > > > > > and no config interrupt).
> > > > > >
> > > > >
> > > > > [1] https://lists.oasis-open.org/archives/virtio-comment/202107/msg00043.html
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > >
> > > > > > >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > > > > > > --
> > > > > > > 2.27.0
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-17  3:27               ` Jason Wang
@ 2021-11-17  8:08                 ` Eugenio Perez Martin
  2021-11-18  3:27                   ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-17  8:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Wed, Nov 17, 2021 at 4:27 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Nov 16, 2021 at 10:50 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Tue, Nov 16, 2021 at 7:56 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Tue, Nov 16, 2021 at 2:17 AM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Mon, Nov 15, 2021 at 5:08 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Fri, Nov 12, 2021 at 6:51 PM Eugenio Perez Martin
> > > > > <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 12, 2021 at 5:18 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > >
> > > > > > > > This patch introduces a new status bit STOP. This can be used by the
> > > > > > > > driver to stop the device in order to safely fetch used descriptors
> > > > > > > > status, making sure the device will not fetch new available ones.
> > > > > > > >
> > > > > > > > Its main use case is live migration, although it has other orthogonal
> > > > > > > > use cases. It can be used to safely discard requests that have not been
> > > > > > > > used: in other words, to rewind available descriptors.
> > > > > > > >
> > > > > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > >
> > > > > > > So this is much more complicated, see below.
> > > > > > >
> > > > > >
> > > > > > I agree it's more complicated, but it addresses some concerns raised
> > > > > > on previous patches sent to the list. Not saying that all of them must
> > > > > > be addressed, or addressed this way though :).
> > > > > >
> > > > > > > > ---
> > > > > > > >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > >  1 file changed, 83 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/content.tex b/content.tex
> > > > > > > > index 2aa3006..9ed0d09 100644
> > > > > > > > --- a/content.tex
> > > > > > > > +++ b/content.tex
> > > > > > > > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > > > > >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> > > > > > > >    drive the device.
> > > > > > > >
> > > > > > > > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > > > > > +  device has been stopped by the driver. This status bit is different
> > > > > > > > +  from the reset since the device state is preserved.
> > > > > > > > +
> > > > > > > > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > > > > > +  device could not stop the STOP request.
> > > > > > > > +
> > > > > > > >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> > > > > > > >    an error from which it can't recover.
> > > > > > > >  \end{description}
> > > > > > > > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > > > > >  recover by issuing a reset.
> > > > > > > >  \end{note}
> > > > > > > >
> > > > > > > > +If VIRTIO_F_STOP has been negotiated,
> > > > > > >
> > > > > > > "has not been" actually?
> > > > > > >
> > > > > >
> > > > > > I think the sentence is ok. In other words, "Even when VIRTIO_F_STOP
> > > > > > *has been* negotiated (in other words, driver sent FEATURES_OK), the
> > > > > > driver must not set or clear the STOP bit before setting DRIVER_OK".
> > > > >
> > > > > Ok, but what happens if we simply allow the STOP to be set if
> > > > > DRIVER_OK is not set? It looks to me that the DRIVER_OK doesn't
> > > > > conflict with STOP.
> > > > >
> > > > > (Anyhow we allow to set STOP after DRIVER_OK)
> > > > >
> > > >
> > > > We could change it to "the driver MUST NOT set or clear STOP if
> > > > FEATURES_OK is not set", which would allow the driver to start a
> > > > device in stop mode. Doing so before that should definitely not be done
> > > > by a good driver.
> > >
> > > Yes, limiting it before FEATURES_OK is a must.
> > >
> > > >
> > > > But if we don't allow the resume, it makes little sense to allow the
> > > > driver to start (as "set DRIVER_OK bit") in stop mode anyhow.
> > >
> > > Yes.
> > >
> > > > I would
> > > > say that it is better to limit that now, and allow it in the future if
> > > > we find a valid use case, enabling a specific feature flag for it.
> > > >
> > > > I'm also fine if we decide to leave this unspecified, but limiting it
> > > > now could enable us to make something useful with it in the future.
> > >
> > > Leaving it unspecified seems better since, if we add the limitation, it
> > > introduces extra effort for future extension. But I'm fine with
> > > either.
> > >
> > > >
> > > > > >
> > > > > > > > the driver MUST NOT set or clear STOP if
> > > > > > > > +DRIVER_OK is not set.
> > > > > > > > +
> > > > > > > > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > > > > > > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > > > > > > > +acknowledges the new paused status setting the first, or the failure setting
> > > > > > > > +the last. Since this change may not be instantaneous, the driver MAY wait for
> > > > > > > > +the configuration change notification that the device must send after the
> > > > > > > > +change.
> > > > > > >
> > > > > > > This is kind of tricky, it means the device can send notification
> > > > > > > after it has been stopped.
> > > > > >
> > > > > > I don't think this part is so tricky. That notification is also sent
> > > > > > when the DEVICE_NEEDS_RESET bit is set,
> > > > >
> > > > > I think they are different, DEVICE_NEEDS_RESET doesn't mean the device
> > > > > is stopped.
> > > >
> > > > To clarify, what I meant is that there are situations where this
> > > > notification is raised even if device configuration is not changed,
> > > > but its status.
> > >
> > > Right, but again the notification is not a must for a status change
> > > (e.g. reset).
> > >
> > > >
> > > > NEED_RESET does not mean the device is stopped, but it (should) signal
> > > > the driver that further interaction with the device will be for sure
> > > > invalid. I may be wrong about this, but this way of notifying the
> > > > driver relieves it of the need to check status in every interaction.
> > >
> > > You are right. But it's still different with stop: we don't need to
> > > check it for every interaction. What we need is to check it only after
> > > the driver tries to stop the device (as with reset).
> > >
> > > >
> > > > > But what we want to achieve is to make sure there won't be
> > > > > any interaction between device and driver after STOP is set by device.
> > > > >
> > > >
> > > > If I understand you correctly, what you meant is that a driver
> > > > could (and I think it will a lot of times) read the status change in
> > > > this order:
> > > > 1) STOP bit is set
> > > > 2) Notification change arrives
> > > >
> > > > And 2) is weird since the device promised no more interaction somehow.
> > > >
> > > > I agree to some extent, but it can be read even from the opposite
> > > > angle: From the moment the driver sets DRIVER_OK, every change on the
> > > > device (status or config) is notified using configuration change
> > > > interrupt.
> > >
> > > For status, it depends on what you mean by "change". If it's the value
> > > that is read by the driver:
> > >
> > > 1) The only thing that needs to be notified is the status that can
> > > only be set by the device, that is NEEDS_RESET.
> > > 2) For the status that is set by the driver and that the device can refuse
> > > (forever or temporarily), there's no change notification: DRIVER_OK,
> > > FEATURES_OK
> > >
> > > STOP belongs to 2). STOP_FAILED belongs to 1), but:
> > >
> > > 1) STOP status bit 0 means the device is not stopped
> > > 2) We don't have DRIVER_FAILED and FEATURE_FAILED, instead, we just
> > > check whether or not the bit is set by the device.
> > >
> > > So whether or not we need STOP_FAILED is still questionable.
> > >
> >
> > I don't split it into "bits set by the device" and "bits set by the
> > driver". The way I see it, the driver sends a request, and the device is
> > going to change its status in the future. The driver can read the status
> > by reading the two-bit bitfield:
> >
> > STOP_FAILED
> > |STOP
> > ||
> > 00 - Running normally
> > 01 - Device has stopped successfully
> > 10 - Device could not stop, but is running normally
>
> I wonder what's the advantage of distinguishing 10 from 00?
>

If the stop is not instantaneous, the driver knows when to stop
polling, or if it has to retry.
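
As a rough illustration of the two-bit encoding discussed here, a
driver could classify the status like this. It is a Python sketch that
only assumes the bit values from the patch (STOP = 16, STOP_FAILED =
32); the function name decode_stop_state and the returned strings are
hypothetical.

```python
# Bit values taken from the quoted patch; everything else is illustrative.
STOP = 16
STOP_FAILED = 32


def decode_stop_state(status: int) -> str:
    """Classify a device status byte after the driver has written STOP."""
    stopped = bool(status & STOP)
    failed = bool(status & STOP_FAILED)
    if stopped and failed:
        # "11": a combination the proposal does not expect to see.
        return "invalid"
    if stopped:
        # "01": the device has stopped, the driver can stop polling.
        return "stopped"
    if failed:
        # "10": the device could not stop and keeps running; the driver
        # knows polling is pointless and can decide whether to retry.
        return "stop-failed"
    # "00": still running, either not stopped yet or never asked to stop.
    return "running"
```

The "stop-failed" result is what lets a polling driver tell "not
stopped yet" apart from "will not stop".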

> > 11 - Cannot find this combination at this moment.
> >
> > > >
> > > > a) Regarding the standard, I don't see it so different from the
> > > > NEED_RESET: the config change keeps being an out of band notification
> > > > system the driver can rely on to learn of an (expected) status change.
> > > > b) I don't see a big deal with changing the semantics from "no more
> > > > interaction from the device" to "no more interaction but the
> > > > expected config change interrupt".
> > > > c) It's easy to ignore the interrupt, or even not to treat it
> > > > specially after the stop: The driver already should scan config to
> > > > look for changes in configuration and status, it will simply find
> > > > none. Although this is not implemented widely as far as I see.
> > > >
> > > > In that regard, I feel that interaction is very innocuous, and to me
> > > > is the straightforward solution to avoid the active polling.
> > >
> > > Well, the driver can certainly choose not to do busy polling, even
> > > without the interrupt.
> > >
> >
> > You mean using the transport specific method you describe in previous mails?
>
> Yes and no, it could be:
>
> 1) transport specific method
> 2) any other method that doesn't do busy polling (timer, sleep etc)
>

2 without 1 could increase migration time, and block a CPU only to
check for the status.

> >
> > > >
> > > > > > and (as I read) is for the
> > > > > > same reason somehow: To avoid the status polling:
> > > > > > * "The driver SHOULD NOT rely on completion of operations of a device
> > > > > > if DEVICE_NEEDS_RESET is set." (copied from the standard)
> > > > > > * The reading of the status field could be expensive / inconvenient in
> > > > > > each operation.
> > > > >
> > > > > It makes sense for the device initiated event to use interrupt. But
> > > > > for a stop, it's driver initiated, in this case the driver won't start
> > > > > the work (for example the cleanup) after it makes sure the device is
> > > > > stopped. Polling the status should be fine as this is how the rest
> > > > > works. Anything makes stop differ from reset here? Or what worries you
> > > > > without the interrupt?
> > > > >
> > > >
> > > > This is proposed only in the scope of the concerns I saw raised in
> > > > previous series: the time to stop a device could be unbound, and
> > > > tricks to poll less frequently will increase migration time.
> > >
> > > But I don't see how an interrupt can help to reduce the time spent on
> > > the stop.
> >
> > If a host has many pass-through devices, it needs to burn CPU to ask
> > > all of them periodically. To reduce that burden, it polls less
> > frequently. Because of that, some devices are stopped but hypervisor
> > is not aware of that.
> >
> > To solve it, the device must be able to tell the hypervisor / driver
> > when it has stopped. Of course, interrupt may not be the only way, and
> > actively polling will always be a choice.
>
> > Yes, usually Linux drivers will not do busy polling; instead they can
> > use cpu_relax() or msleep() during the polling.
>

I understand that both of them work technically, but cpu_relax() does
not let that thread do any other work, and msleep() increases the
migration time.
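
To make the trade-off concrete, here is a minimal sketch of a
sleep-based wait loop. Everything in it is an assumption made for
illustration: the EX_STATUS_STOP value, the callback types and the fake
device are not part of any real driver API. The status read and the
sleep are passed in as callbacks so the loop can be exercised without
hardware.

```c
#include <stdint.h>

#define EX_STATUS_STOP 0x10u /* hypothetical bit value */

typedef uint8_t (*ex_read_status_fn)(void *dev);
typedef void (*ex_sleep_fn)(unsigned int ms);

/*
 * Poll the status every interval_ms until STOP is seen or timeout_ms has
 * elapsed. Returns the time slept in ms when STOP is observed, or -1 on
 * timeout. The result is always a multiple of interval_ms: a longer
 * interval burns less CPU but adds up to one interval of latency.
 */
static long ex_wait_for_stop(ex_read_status_fn read_status, void *dev,
                             ex_sleep_fn do_sleep,
                             unsigned int interval_ms,
                             unsigned int timeout_ms)
{
    unsigned int waited = 0;

    if (interval_ms == 0)
        interval_ms = 1; /* avoid an unbounded busy loop in this sketch */

    for (;;) {
        if (read_status(dev) & EX_STATUS_STOP)
            return (long)waited;
        if (waited >= timeout_ms)
            return -1;
        do_sleep(interval_ms);
        waited += interval_ms;
    }
}

/* Test double: reports STOP after a fixed number of status reads. */
struct ex_fake_dev {
    unsigned int reads_until_stop;
};

static uint8_t ex_fake_read_status(void *dev)
{
    struct ex_fake_dev *d = dev;

    if (d->reads_until_stop == 0)
        return EX_STATUS_STOP;
    d->reads_until_stop--;
    return 0;
}

/* Test double: does not really sleep. */
static void ex_fake_sleep(unsigned int ms)
{
    (void)ms;
}
```

A device that stops just after a poll is only noticed one full interval
later, and that delay is added per device to the migration time.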

> >
> > > The downtime is usually a user policy, so the VMM can choose
> > > to timeout the stop and perform the resume.
> > >
> >
> > With "resume" do you mean to actually reset the device if the STOP bit
> > is not set in time? If that is a possibility then it is sure we will
> > need to handle the inflight descriptors.
>
> Yes, but I wonder whether or not we can leave the inflight descriptors
> for the future. (If it's not a must for networking). The idea is to
> have something that can work quickly. But I'm also fine to propose it
> now. With that I believe most of the device can be stopped in a short
> time (we don't need to wait for the completion of inflight requests).
>

(answered below, let me know if it is not enough).

> >
> > > As discussed, the way to advertise the inflight buffers might be a
> > > solution for this.
> > >
> >
> > If these transactions are idempotent, as you say later, then yes.
> >
> > > >
> > > > I will fully agree if these are left to the future: it is easy to
> > > > implement this chunk of the proposal under a separated feature flag if
> > > > this need arises. Sorry if that part was not clear enough.
> > >
> > > That's fine.
> > >
> > > >
> > > > > > * Solution: Instead of polling, make a device facility to notify the
> > > > > > driver that it cannot trust the device is going to behave properly /
> > > > > > same as before anymore via notification.
> > > > > >
> > > > > > We can add another exception to the "device configuration space
> > > > > > change" in "Notification of Device Configuration Changes", like the
> > > > > > one already present:
> > > > > > "In addition, this notification is triggered by the device setting
> > > > > > DEVICE_NEEDS_RESET".
> > > > > >
> > > > > > I understand it sounds tricky that the device sends a notification
> > > > > > when it's stopped, but in my opinion it's aligned with previous
> > > > > > behavior (DEVICE_NEEDS_RESET),
> > > > >
> > > > > I think not,  e.g DEVICE_NEEDS_RESET doesn't (or it can't) mean the
> > > > > device won't process the buffer or send an interrupt.
> > > > >
> > > >
> > > > From the driver point of view, it means that the driver cannot trust
> > > > the device anymore until the reset, so the driver actions are similar:
> > >
> > > > I think we need to clarify the exact semantics of STOP_FAILED, and it
> > > > will be very hard to distinguish
> > >
> > > 1) The device can not be stopped in short time
> > >
> > > from
> > >
> > > 2) The device can not be stopped forever
> > >
> > > At least in case 1) the driver still can trust the device.
> > >
> >
> > I think that from the driver POV is the same. If device cannot stop,
> > it will continue operating normally.
> >
> > > >
> > > > ""
> > > > the driver can’t assume requests in flight will be completed if
> > > > DEVICE_NEEDS_RESET is set, nor can it assume that they have not been
> > > > completed
> > > > ""
> > > >
> > > > (Sorry for being circular here, I think it proceeds here too) What I
> > > > meant is that the device sent an out of band notification when the
> > > > device status changed. The driver could check the status field before
> > > > processing every used buffer and also with a timer just in case, and
> > > > DEVICE_NEEDS_RESET would not need the config interrupt change. But the
> > > > interrupt gives convenience to the whole operation.
> > > >
> > > > Every time the driver gets that interrupt, it must re-check all the
> > > > device configuration and status anyway. It can still make buffers
> > > > available while processing it, but that's the meaning of the interrupt
> > > > to me. And a status change after DRIVER_OK fits to it, from my point
> > > > of view.
> > >
> > > I fully agree, but it's different:
> > >
> > > > 1) The driver doesn't know when there will be a NEEDS_RESET, so without
> > > > an interrupt, it must check the status for each operation
> > > > 2) The driver knows when there will be STOP/STOP_FAILED, it only needs
> > > > to check the status after it tries to stop the device
> > >
> >
> > > You know that the device will set it at some point in the future, in
> > > an unbounded time.
>
> So my point is introducing facilities to avoid the unbound time. I
> think both of us know dealing with it is not easy.
>

Sure, that works for me.

> > That could not be a problem, but there are concerns
> > raised on previous patches about this.
> >
> > > >
> > > > > > it's explicitly stated that it will be
> > > > > > the last one, and it's caused because of the inconvenience of polling
> > > > > > device status. Even if the driver can use other mechanisms.
> > > > >
> > > > > I think STOP works much more similarly to reset not NEEDS_RESET. The
> > > > > only difference with reset is that STOP needs to preserve the device
> > > > > states and we don't (or can't) use interrupt to signal the completion
> > > > > of reset.
> > > > >
> > > >
> > > > From the semantic point of view, yes. But in practical terms we can
> > > > face unbounded time.
> > >
> > > The problem is:
> > >
> > > 1) how to define the "unbounded time", if we don't define it exactly,
> > > the device may abuse the status bit which may cause a lot of troubles
> >
> > To me, the unbounded time is where the device does not guarantee that
> > you will have the STOP status set the next time you read it. In case
> > > of an untrusted device, this may include every one of them, for sure.
>
> Then the device behaviour is tied to the driver's behaviour. The
> problem is that there's no bound time between the write and the
> following read. There could be arbitrary time in the middle.
>

Right. In my opinion, that's why a device should be able to report
that the stop is either in progress or failed. But *idempotent* in
flight descriptors help for sure.

> >
> > You care a little bit less about it if the device notifies you about
> > the condition: You simply have the driver waiting for it, but
> > operating normally in other regards.
>
> This can be achieved even without an interrupt.
>

Sure, the interrupt is just the straightforward way to do so (or the
first that comes to my mind at least). There may be more convenient
ways to do so.

> >
> > That notification can be a basic facility, a transport device one, or
> > whichever other method.
> >
> > To mandate that the device is stopped the next time the status bit is
> > read is another possibility for sure but if I understood correctly
> > that would let other devices out. If vqs descriptors are idempotent,
> > inflight_fd may be a way to get rid of that unbound time for sure.
>
> To be clear, inflight_fd should be part of the device area?
>

This is a little bit tricky, but I would say they should.

I think this deserves another thread of its own, but it also allows
other situations I have in my backlog to work way better. For example,
in the case of devices emulated outside of QEMU (OVS, packed queue, no
inflight_fd), it is impossible to recover the queue after a fatal
failure of the device. This would allow the VMM to continue the
connection with a newly spawned device without the guest noticing.

It would be ideal to place that chunk of memory in the VMM, but I'm
not sure if self-contained devices would allow that.

> >
> > > 2) there are other approaches that we can deal with the unbound time,
> > > timeout in driver + reset
> > >
> > > > I mean, both operations have unbound time for
> > > > sure, but I would say that any device should handle reset way faster
> > > > than the STOP.
> > >
> > > Any reason that STOP is faster? I'd expect STOP is a subset of reset,
> > > that is, in order to do reset, we must first stop.
> > >
> >
> > I'm not sure if that is the way hardware can work, but with reset the
> > device can reclaim the needed resources for the communication in an
> > asynchronous way, since they are not needed for the driver to ask.
>
> Yes and I think we can do the same for the stop.
>
> > With the STOP, they need to be communicated to the driver somehow in a
> > virtio way.
>
> So I think if it's device specific, we'd better not assume which is
> faster. Instead, as discussed, it's better to try our best to avoid
> the unbound stop (as what we did for reset). If we fail, it's not late
> to consider how to recover from that.
>

I'm fine with that.

> >
> > > >
> > > > I fully agree on your point, but I can also see the other way around:
> > > > It would be convenient to have a configuration interrupt for the reset
> > > > too, but it is impossible since we cannot configure any before the
> > > > reset.
> > >
> > > I think it's a more about the question:
> > >
> > > 1) Why an interrupt is a must for STOP
> > >
> > > than
> > >
> > > 2) If we can use an interrupt
> > >
> > > My questions are all for 1).
> > >
> >
> > It's not a must.
> >
> > > >
> > > > > >
> > > > > > If the community still has concerns about it, another option is to
> > > > > > actually extract the way the device notifies it from the general
> > > > > > facilities, and make it transport specific. But to use the device
> > > > > > configuration change notification for this makes sense to me. The
> > > > > > device configuration has changed.
> > > > >
> > > > > See above, I think we should have a consistent way to handle reset and stop.
> > > > >
> > > > > >
> > > > > > > As discussed in the previous versions,
> > > > > > > > the driver is free to use a timer or whatever other mechanism if it
> > > > > > > > doesn't like the busy polling. I wonder how much value we can gain
> > > > > > > > from a dedicated config interrupt, especially considering that some
> > > > > > > > transports can use a transport-specific interrupt (not a
> > > > > > > > virtio-specific one) to report whether or not setting the status
> > > > > > > > succeeded.
> > > > > > >
> > > > > >
> > > > > > In my opinion, *if* we agree that a stop is a virtio facility and not
> > > > > > a per-device one, and *if* we agree that a notification is required
> > > > > > for the device to notify the stop, it makes sense to use a
> > > > > > transport-independent mechanism that the device must already
> > > > > > implement.
> > > > >
> > > > > So the major question is why a notification is a must? And Just to be
> > > > > clear, there could be transport specific mechanisms for error
> > > > > reporting.
> > > > >
> > > > > > E.g.
> > > > >
> > > > > 1) PCI can have non-posted write, if we use non-posted write to carry
> > > > > the stop command, the device can return whether or not the device is
> > > > > stopped successfully.
> > > > >
> > > > > or
> > > > >
> > > > > 2) Some other transport can convert the stop status bit set into a
> > > > > command and queue it to device specific queue, device can then use
> > > > > > its own specific interrupt to report when the stop is handled
> > > > > > (success or fail)
> > > > >
> > > >
> > > > I would be totally fine with that too.
> > > >
> > > > > >
> > > > > > > >If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > > > > > > > +try new STOP attempts.
> > > > > > >
> > > > > > > Does the device need to re-read the STOP_FAILED for synchronization?
> > > > > >
> > > > > > I tend to see the status as something that belongs to the device and
> > > > > > is exposed to the driver. In that sense, the write from the guest
> > > > > > triggers an event on the device, and the device decides what will be
> > > > > > exposed on that field (MMIO?) on the next driver read. If it's not
> > > > > > that way, we couldn't use the STOP bit that way, right?
> > > > >
> > > > > Yes, but this is not an answer to my question. It's about the
> > > > > ordering, when write returns it doesn't mean the write arrives at the
> > > > > device, this is the case of PCI at least. So we need a mechanism to
> > > > > make sure the write arrives at the device (PCI read will flush
> > > > > previous write).
> > > > >
> > > >
> > > > I didn't see that in your original question, sorry. But the PCI read
> > > > > that flushes the write is the driver's, isn't it?
> > > >
> > > > In that case I would say that "the read" is part of "the write". It's
> > > > an issue of the PCI protocol, which I would say doesn't belong to this
> > > > section (or even this document?): To implement virtio over PCI, you
> > > > know that virtio needs a write, and, in particular, you know that PCI
> > > > needs a posterior read to make sure that write is effective.
> > > >
> > > > Either that, or that the driver must use non-posted ones if it wants
> > > > the device to note it.
> > > >
> > > > Or am I still missing something?
> > >
> > > > Just to make sure we are on the same page, in this paragraph, you
> > > said this at the beginning:
> > >
> > > "If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > to ensure the STOP or STOP_FAILED bit is set after the write."
> > >
> > > And in the end of the paragraph:
> > >
> > > If the device sets the STOP_FAILED bit, the driver MUST clear it
> > > before try new STOP attempts. But you don't define whether we need to
> > > re-read to make sure STOP_FAILED is clear if the driver tries to clear
> > > it. Is this intended?
> > >
> >
> > Not really, I assumed that the write of clearing STOP_FAILED and the
> > write of STOP would be ordered, and that the device should have no
> > problem clearing that status bit, since I don't see any kind of
> > resource reclamation or anything like that there.
>
> Ok.
>
> >
> > > >
> > > > > >
> > > > > > > I
> > > > > > > wonder how much we can gain from STOP_FAILED, the patch is unclear on
> > > > > > > when that the device needs to set this bit. And driver can choose to
> > > > > > > reset after a specific timeout anyhow.
> > > > > > >
> > > > > >
> > > > > > The conditions where the device needs to set this bit are unspecified
> > > > > > > because it depends on the device: not only on the kind of device, but
> > > > > > > also on the device backend.
> > > > > >
> > > > > > The same condition (regarding the possibility of handling the pending
> > > > > > buffers) could cause different devices to react differently. A network
> > > > > > device could decide it's fine to drop pending tx, let the guest think
> > > > > > that "the network lost them", and mark them as done,
> > > > >
> > > > > We may meet the similar issue during reset.
> > > > >
> > > >
> > > > Yes, but the driver should be fine to fail a reset, it does not want
> > > > to use the device anymore or it wants to totally override the device
> > > > state. If a stop fails, the driver would expect the device to continue
> > > > operating in my opinion, because it will be impossible to recover the
> > > > device state.
> > >
> > > This is only true if we allow the stop to be failed. It would be an
> > > issue if the driver fails to stop a device since it can fail the stop
> > > of the entire VM which is not something that the VMM is expecting.
> > >
> > > > If we don't allow the stop to fail and we allow the device to expose
> > > the inflight buffers, we are all fine:
> > >
> > > 1) VM is guaranteed to be stopped
> > > 2) stop can be finished in time
> > >
> > > Devices are free to choose to wait for the short time request and tag
> > > the long time request as inflight.
> > >
> > > >
> > > > This is again something that we could leave if we decide it is not
> > > > necessary at this moment: It just shows how a concern of previous
> > > > proposals can be solved, at least technically.
> > >
> > > To me, I think we can start from a set of functions that can make e.g
> > > the virtio-net to work to unblock:
> > >
> > > 1) live migration work
> > > 2) extensions to other devices (e.g inflight could be done on top as
> > > new features)
> > >
> > > >
> > > > > > where a
> > > > > > persistent storage cannot do that for write requests. Just as an
> > > > > > example, not saying that networking devices must do that :).
> > > > >
> > > > > So I think this brings extra complexity that we probably don't need to
> > > > > worry about now. The reason is that the spec doesn't allow the reset
> > > > > to fail.
> > > > >
> > > >
> > > > It can be left for the future for sure.
> > > >
> > > > > >
> > > > > > > > +
> > > > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > > >
> > > > > > > Any motivation for this? it looks to me it makes the feature coupled
> > > > > > > with the virtqueue state proposal? It seems odd to allow avail change
> > > > > > > but not the last_avail_idx change.
> > > > > > >
> > > > > >
> > > > > > On second thought, I think you are right and this overlaps with the
> > > > > > state proposal.
> > > > > >
> > > > > > > > +
> > > > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > > > > > +the driver MAY change any descriptor.
> > > > > > > > +
> > > > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> > > > > > > > +the driver can resume it clearing the STOP status bit. It MUST re-read the
> > > > > > > > +device status to ensure the STOP bit is clear after the write. The device
> > > > > > > > +acknowledges the new status clearing it. Since this change may not be
> > > > > > > > +instantaneous, the driver MAY wait for the configuration change notification
> > > > > > > > +that the device must send after the change.
> > > > > > >
> > > > > > > > Do we really need resuming? It's kind of:
> > > > > > >
> > > > > > > 1) STOP -> clear STOP
> > > > > > >
> > > > > > > vs
> > > > > > >
> > > > > > > 2) STOP -> RESET -> DRIVER_OK
> > > > > > >
> > > > > > > Using 2) preserve the semantic that the driver can't clear the status bit.
> > > > > > >
> > > > > >
> > > > > > You are totally right in that regard. But the use case simplifies the
> > > > > > operation when the driver only wants to take back some available
> > > > > > descriptors still not used, in the range last_avail_idx..avail_idx.
> > > > > > Doing that could be a big burden for drivers, who would need to
> > > > > > re-send every status. MST proposed that use case at [1].
> > > > >
> > > > > Yes, but it looks to me this doesn't require the resuming? And the per
> > > > > virtqueue reset is being proposed here.
> > > > >
> > > > > https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07818.html
> > > > >
> > > > > Actually, there's a subtle difference between 1) and 2). That is using
> > > > > 2) doesn't make sure we can "resume" from the index where we stopped.
> > > > > But this won't be an issue considering we know that we need to support
> > > > > setting device virtqueue state(index). So if we want to resume from
> > > > > the exact index it could be:
> > > > >
> > > > > STOP -> RESET -> setting index -> DRIVER_OK
> > > > >
> > > >
> > > > With the state I meant more than VQ state, but the device state in
> > > > general. For example, for the network, you must also send all the
> > > > needed control commands to recover mac, rx filters, etc.
> > >
> > > I'm not sure I get this. For those cvq stuff, with the help of the
> > > shadow virtqueue, we don't need any spec extensions. What did I miss
> > > here?
> > >
> >
> > Right, that was not the example I intended to put actually. I did a
> > lot of back and forth for the answer and I put the wrong one here,
> > sorry :).
> >
> > > But my concern is solved if we treat the inflight as idempotent. All
> > unanswered things above this are solved with that too.
> >
> > > >
> > > > That's what I meant with "if you just want to rewind some descriptors,
> > > > resetting the whole device is overkill".
> > > >
> > > > The example may be wrong, I can think of virtiofs and the need to keep
> > > > files opened:
> > > > > * If we go through a full reset cycle, the files opened may not be
> > > > > the same as the closed ones, like deleted files with open handles.
> > > > > * If we go through a full reset cycle, watchers may skip a change.
> > > >
> > > > Of course, this complexity may be left for the future and simply state
> > > > that, if that is the case, the device cannot offer stop feature.
> > > > > Virtiofs already has other complexities that make its migration
> > > > hard, but I think the point is explained.
> > >
> > > There are long discussions about the virtiofs migration. But it's out
> > > of the scope for the discussion of device stop since it's mainly about
> > > how to define and expose device states. For stop, it's more than
> > > sufficient to say the device states needs to be preserved after the
> > > device is stopped.
> > >
> > > I'd rather go with something simple to work for a simple type of
> > > device like ethernet. Otherwise there will be endless discussion. For
> > > any features that are not needed by the ethernet device, I would leave
> > > it for future investigation.
> > >
> > > >
> > > > > >
> > > > > > In that regard, the straightforward thing to do is modify avail_idx /
> > > > > > descriptors from that range and let resume. However, the RESET path
> > > > > > makes it easier to implement the device part of course, and the guest
> > > > > > can also achieve the rewind that way.
> > > > > >
> > > > > > > > +
> > > > > > > >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> > > > > > > >
> > > > > > > >  The device MUST NOT consume buffers or send any used buffer
> > > > > > > >  notifications to the driver before DRIVER_OK.
> > > > > > > >
> > > > > > > > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > > > > > > > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > > > > > > > +or clear of STOP.
> > > > > > > > +
> > > > > > > > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > > > > > > > +operations after the driver writes STOP.
> > > > > > >
> > > > > > > > I wonder if it's better to leave this to the device to decide. E.g. some
> > > > > > > > block devices may require a very long time to finish the inflight
> > > > > > > > operations.
> > > > > > >
> > > > > >
> > > > > > (Letting out SVQ + inflight descriptors for this part of the response,
> > > > > > I will come back to it later)
> > > > > >
> > > > > > But if virtqueue is not valid anymore, how can it report them when
> > > > > > finished?
> > > > >
> > > > > It's still valid since the STOP bit is not set by the device.
> > > > >
> > > >
> > > > Then I don't understand your answer. To my proposal of:
> > > >
> > > > "If VIRTIO_F_STOP has been negotiated, the device MUST finish any
> > > > in-flight operations after the driver writes STOP."
> > > >
> > > > You answered:
> > > >
> > > > "I wonder if it's better to leave this to device to decide. E.g some
> > > > block devices may requires a very log time to finish the inflight
> > > > operations."
> > > >
> > > > > The device must finish all requests before it shows the STOP bit as
> > > > > set to the driver. Maybe it is better to rephrase it like:
> > > >
> > > > If VIRTIO_F_STOP has been negotiated, the device MUST finish any
> > > > in-flight operations after the driver writes STOP and before it sets
> > > > its status bit STOP as set.
> > > >
> > > > ?
> > >
> > > I meant when to set STOP is highly device specific. E.g for the virtio
> > > block devices which allows the in-flight requests to be re-submitted
> > > after resume, the device can choose to not wait for the completion of
> > > the inflight operations and expose them to the driver. This helps to
> > > reduce the time spent on the stop.
> > >
> > > >
> > > > > > In that sense, I would say it's better to report failure and
> > > > > > let the guest handle it as if the disk is unavailable (timeout?
> > > > > > temporary faulty sector? I'm not sure what is the most suitable way).
> > > > >
> > > > > This could be addressed by leaving the following choices to the devices:
> > > > >
> > > > > 1) complete the inflight requests
> > > > > 2) device or virtio specific for reporting inflight descriptors
> > > > >
> > > >
> > > > As previously, I'm not sure how this relates with "the stop bit is not
> > > > set by the device", so my answer may be completely wrong here,
> > >
> > > It's related to time spent on the stop. E.g for block devices, it can
> > > simply show all the inflight buffers to guests and set the stop bit.
> > > Then STOP should be very fast.
> > >
> > > >
> > > > Even assuming the device can report in-flight descriptors, it needs to
> > > > wait for the backend before reporting them anyhow.
> > >
> > > Why does it need to wait for the backend? I mean for the device that
> > > supports in-flight descriptors, the semantic of the device should
> > > allow the requests to be processed twice.
> > >
> >
> > If that is true the proposal would be greatly simplified of course.
>
> It seems to be true for virtio-blk (at least current Qemu migrates
> inflight requests). But it might not be true for other type of devices
> (anyhow we can leave them for future investigation).
>

That's the issue, but if QEMU already migrates them that way I think we
should be safe.

> >
> > > > And we would need
> > > > another indication. What is the use of separating these status?
> > > > (waiting for stop bit, waiting for inflight descriptors to be valid).
> > > >
> > > > The only possibility I can come up with is to actually stop the
> > > > request right in the middle of an operation. For example, to allow a
> > > > big block read to stop and then when the device is informed about
> > > > these inflight descriptors and its progress, it can continue. I would
> > > > say this is very out of scope, more about this later ([1]).
> > > >
> > > > > >
> > > > > > *If* we are not going to allow the guest to resume operation, where it
> > > > > > knows all the status of the device, then there is no value on let the
> > > > > > device delay the operation: From the guest point of view it either
> > > > > > succeed to send to the device backend and somebody else caused a
> > > > > > failure (external network lose the tx packet, bit rotting caused I'm
> > > > > > reading a different value than previously written), or it failed at
> > > > > > the stop moment.
> > > > >
> > > > > So it's highly device specific, e.g for ethernet, we can afford the
> > > > > loss of packets but not for the block devices so reporting inflight
> > > > > > descriptors may help to resubmit those after "resuming".
> > > > >
> > > >
> > > > Right.
> > > >
> > > > > >
> > > > > > This is different with the resume possibility, where the device can
> > > > > > decide to hold the descriptors, stop operating, and then resume
> > > > > > operation.
> > > > > >
> > > > > > > > Depending on the device, it can do it
> > > > > > > > +in many ways as long as the driver can recover its normal operation if it
> > > > > > > > +resumes the device without the need of resetting it:
> > > > > > > > +\begin{itemize}
> > > > > > > > +\item Drain and wait for the completion of all pending requests until a
> > > > > > > > +convenient avail descriptor. Ignore any other posterior descriptor.
> > > > > > > > +\item Return a device-specific failure for these descriptors, so the driver
> > > > > > > > +can choose to retry or to cancel them.
> > > > > > >
> > > > > > > If we allow the driver to retry, we need a way to report inflight
> > > > > > > buffers which is not supported by the spec. A way to solve this is to
> > > > > > > make it device specific.
> > > > > > >
> > > > > >
> > > > > > Regarding the retry, I don't get you here. Re-reading the patch, I
> > > > > > think that "driver retry" is very ambiguous: I meant for the device to
> > > > > > mark the descriptor as used, but with a communication specific error
> > > > > > code, so the application, guest kernel, etc (driver in the standard)
> > > > > > can decide to retry.
> > > > >
> > > > > That's why I think introducing the virtqueue state is a must for stop,
> > > > > With all the indexes defined, it would be much easier to describe what
> > > > > the device or driver is expected to work.
> > > > >
> > > >
> > > > I still don't see the relationship, sorry.
> > >
> > > E.g how do you define the in-flight buffers accurately?
> > >
> >
> > If we can make them idempotent, one descriptor is in flight if it is
> > available and the device is aware of that.
>
> Somehow, but it would be tricky to define 'aware' since it's device
> specific stuffs.
>

We can define it in terms of VirtIO queues but I see it way better if
I think from "the device point of view" about them. If you have a
packed virtqueue, the in-flight descriptors are the available
descriptors that the driver placed in a given position but then were
overridden by used descriptors with a different id. In the moment they
are used, they are not available or in-flight anymore.

So the in-flight descriptors are a subset of the available ones, but
they are not recoverable just looking at the descriptor ring and
knowing the vq status (that includes the wrap bit and ring idx). The
guest and the device track them in their regular working, but if the
hypervisor needs to override the device with no notice by the guest
(OVS case I described above), it's not possible.

The split ring is the same, but the descriptor length is recoverable
from the descriptor ring. There we cannot talk in terms of overridden
descriptors, only of the used index having passed the avail index of
older descriptors.

The only thing left to say is that for some devices they should be
reported in the same order they were made available by the driver, to
preserve ordering. We can skip this for net queues. And I think we
could delegate this requirement to be per device.

I agree this cannot happen in the current qemu or vhost-net, because
they use the indexes in order, but it's not something we should rely
on. That's why I forced these to be flushed but, as discussed, this
should not be a problem if they are treated as idempotent.
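
A minimal sketch of the split ring case, to show why the indexes alone
are not enough once a device completes buffers out of order. It is
simplified on purpose: the rings are plain arrays of descriptor head
ids (real used ring entries also carry a length), and it assumes the
queue has not wrapped yet, so the whole history is still visible. All
names are made up for the example.

```c
#include <stdint.h>

#define EX_MAX_QSIZE 256 /* arbitrary bound for this sketch */

/*
 * Collect the heads that were made available but not yet used, in the
 * order the driver made them available. Returns how many were written
 * to out[], or -1 if the input breaks the assumptions of the sketch.
 */
static int ex_find_inflight(const uint16_t *avail_heads, uint16_t avail_idx,
                            const uint16_t *used_heads, uint16_t used_idx,
                            uint16_t qsize, uint16_t *out)
{
    uint16_t used_count[EX_MAX_QSIZE] = {0};
    uint16_t i;
    int n = 0;

    /* No wrap-around handling: both indexes must fit in one ring pass. */
    if (qsize > EX_MAX_QSIZE || avail_idx > qsize || used_idx > avail_idx)
        return -1;

    /* Count how many times each head has been returned as used. */
    for (i = 0; i < used_idx; i++) {
        if (used_heads[i] >= qsize)
            return -1;
        used_count[used_heads[i]]++;
    }

    /* Walk the avail entries in order; unmatched ones are in flight. */
    for (i = 0; i < avail_idx; i++) {
        uint16_t head = avail_heads[i];

        if (head >= qsize)
            return -1;
        if (used_count[head] > 0)
            used_count[head]--; /* this submission already completed */
        else
            out[n++] = head;
    }
    return n;
}
```

For example, with heads 0, 1, 2, 3 made available and 1 and 3 used, the
in-flight heads are 0 and 2. Looking only at used_idx == 2 and
avail_idx == 4 would wrongly suggest the last two available entries
(heads 2 and 3).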

> >
> > > >
> > > > What I intended to say in the patch is that the device can choose to
> > > > just return a device / communication error to indicate that the
> > > > transaction has failed at device level, but related to virtio, the
> > > > buffer would be marked as used.
> > > >
> > > > Maybe a good example of this is for the device to choose to return
> > > > VIRTIO_BLK_S_IOERR, even if the transaction is still going in the
> > > > block backend, but I don't know a lot of the blk device so I may be
> > > > wrong. I guess that the guest cannot know about the value being
> > > > written / read with that error code, and it is forced to re-read that.
> > > > But the virtqueue will be in a good state, and the device can be reset
> > > > and can recover its state. It's totally up to the device to choose to
> > > > do so.
> > >
> > > I think not, if we tie STOP to some device errors that could be even
> > > more complicated.
> > >
> >
> > It's up to the device to implement that way but I understand your point.
> >
> > > >
> > > > Virtqueue state is still needed, but not because the device chooses to
> > > > return VIRTIO_BLK_S_IOERR, but because it needs a way to recover the
> > > > status after the reset.
> > > >
> > > > > >
> > > > > > Regarding the in-flight descriptor report, it's interesting but I
> > > > > > cannot see a way where it does not complicate the solution a lot or
> > > > > > adds new dependencies. I have the next thoughts:
> > > > > > 1) If it works as inflight_fd, "a region of shared memory"
> > > > > > 1.1) This region must be in the guest's AS so the device has access to
> > > > > > it. This either invalidates the use of STOP from the driver point of
> > > > > > view as "let me know where you are not going to modify the guest's
> > > > > > memory anymore".
> > > >
> > > > Long shot here, but might this work with the combination of the
> > > > balloon device? Making this far and far from the simplicity though...
> > > >
> > > > > > 1.2) This region is on the hypervisor's AS. If the device supports it,
> > > > > > it is possible to implement the SVQ without the need of STOP bit. This
> > > > > > is equivalent to "I have a PF that also supports VF dirty memory
> > > > > > tracking".
> > > > > > 2) If it works as the config space, where the driver can ask for its
> > > > > > status, STOP means "STOP writing used and report via config space". No
> > > > > > need for reset.
> > > > > >
> > > > > > Did you have something different in mind?
> > > > >
> > > > > Not sure, maybe config space is better. What I want is to make the
> > > > > feature as small as possible but leaving spaces for future extension.
> > > > >
> > > > > E.g we start from the feature that is sufficient for networking
> > > > > devices, (but doesn't prevent the future work to extend it to block
> > > > > devices). I'm not familiar with the block device, but mandating the
> > > > > > completion of inflight descriptors may cause trouble, e.g. unexpected
> > > > > downtime during live migration.
> > > > >
> > > >
> > > > [1] I agree with that, but I feel that "device or virtio specific for
> > > > reporting inflight descriptors" is way too broad to make it useful at
> > > > the moment.
> > >
> > > Yes and that's not a must for an ethernet device.
> > >
> >
> > This is not really true as how this proposal is specified, if we use
> > it in the hypervisor. Just saving the virtqueue index is not enough if
> > the device is not using the descriptors in order, since the available
> > buffers may not be recoverable just looking at the guest memory.
> >
> > In that regard, we must either flush them (as this proposal does, and
> > with the unbound time problem), or use the inflight descriptors.
>
> I meant for ethernet device, device can simply complete all inflight
> buffers before the STOP. Or anything I missed here?
>

That's what the proposal mandates :). It even allows the device to
mark them as completed (used) without sending / receiving them, since
the network should be prepared for that.

But it's not true for other devices, and concerns have been raised previously.
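
For reference, the ordering the proposal asks from the device can be
sketched like this. It is a toy model with invented names, not device
code: all in-flight buffers are made used first, and only then is the
STOP bit exposed.

```python
STOP = 16  # value proposed in this patch


class ToyDevice:
    def __init__(self):
        self.status = 0
        self.in_flight = []   # buffer ids fetched but not yet used
        self.used = []        # buffer ids returned to the driver, in order

    def fetch(self, buf_id):
        if self.status & STOP:
            raise RuntimeError("device is stopped, it must not consume buffers")
        self.in_flight.append(buf_id)

    def handle_stop_request(self):
        # Flush first: a net device may mark buffers used without actually
        # sending / receiving them.
        while self.in_flight:
            self.used.append(self.in_flight.pop(0))
        # Only now expose STOP to the driver.
        self.status |= STOP


dev = ToyDevice()
dev.fetch(3)
dev.fetch(1)
dev.handle_stop_request()
```

After `handle_stop_request()` nothing is left in flight, which is why
saving only the virtqueue index is enough in this case. A device that
cannot cheaply drop its requests does not have this shortcut.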

> >
> > > >
> > > > Maybe the best thing to do is to put all the restrictions at this
> > > > moment, and when we figure out a good format for the inflight, add
> > > > "\item report inflight descriptors". Then, the device and the driver
> > > > are free to not accept any combination. Does it make sense?
> > >
> > > Somehow, to start from a version that works for networking devices.
> > > Where we know we don't need to care:
> > >
> > > 1) stop fail
> > > 2) unbound time of stop, so we don't need an interrupt
> > > 3) inflight buffers
> > > 4) new facility querying device states (shadow CVQ can do this)
> > >
> > > This will ease both of us as I feel the discussion might not be easily
> > > converged if we care about other types of devices with too many
> > > things. With networking done, we can start to support block devices
> > > and we can ask help from block gurus.
> > >
> >
> > I'm fine with that except that we don't need 3.
>
> I still think it's not a must for ethernet device, (we don't have that
> in all the current virtio-net backends, anything make hardware
> different in this situation?).
>

As long as we mandate that the device flush them, it is not a problem.

It's convenient to have that way of reporting, since we are talking
about general virtio queues here. I mean, net can survive it because
it's the way the guest uses the net, but we don't allow it because
the virtqueue works differently.

> But I'm ok if we start to propose the inflight stuff, which can make
> a complete virtqueue state. I wonder maybe it's better to use
> device-area for those instead of transport specific way to access
> them.
>

I'd say it's the best bet.

> >
> > > >
> > > > > >
> > > > > > > > +\item Mark them as done even if they are not, if the kind of device can
> > > > > > > > +assume to lose them.
> > > > > > >
> > > > > > > I think "make buffer used" is better than "mark them as done". And we
> > > > > > > need a accurate definition on who is "them".
> > > > > > >
> > > > > >
> > > > > > All items include other operations, like the ones that the device must
> > > > > > do internally to process the control virtqueue. But I cannot find an
> > > > > > example where telling the driver they are done when it's not is valid
> > > > > > for this particular item.
> > > > > >
> > > > > > But I agree it needs better wording.
> > > > > >
> > > > > > And I will s/them/operations/. for the next one.
> > > > > >
> > > > > > > > +\end{itemize}
> > > > > > > > +
> > > > > > > > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > > > > > > > +a guest's request,
> > > > > > >
> > > > > > > It's not clear what did "a guest's request" means.
> > > > > > >
> > > > > >
> > > > > > Right. Would "operation" fit better here?
> > > > >
> > > > > Still unclear, I guess this sentence tries to define when the device
> > > > > can fail the stop?
> > > > >
> > > >
> > > > Not really, my intentions were to add a MUST operation for when the
> > > > device fails. The first is needed for the second though, so maybe we
> > > > can rephrase.
> > > >
> > > > If we agree that a device can fail the stop, I think we should not
> > > > restrict the circumstances where the device can fail. "If the device
> > > > can find external circumstances where it cannot satisfy STOP must not
> > > > offer STOP feature" works for me too, actually.
> > >
> > > I'd leave the STOP_FAILED for future investigation.
> > >
> > > >
> > > > > >
> > > > > > > > the device MUST set the STOP_FAILED bit for the guest to
> > > > > > > > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > > > > > > > +clears STOP_FAILED.
> > > > > > > > +
> > > > > > > > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > > > > > > > +and the device can pause its operation, the device MUST set the descriptors
> > > > > > > > +that it has done with them as used before exposing the STOP status bit as set.
> > > > > > > > +
> > > > > > > > +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> > > > > > > > +after exposing the STOP bit set:
> > > > > > > > +\begin{itemize}
> > > > > > > > +\item Read updates on the descriptor or driver area, or consume more buffers.
> > > > > > > > +\item Send any used buffer notifications to the driver.
> > > > > > > > +\end{itemize}
> > > > > > > > +
> > > > > > > > +The device MUST send a configuration space change right after exposing the STOP
> > > > > > > > +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> > > > > > > > +send another configuration space change notification to the driver afterwards
> > > > > > > > +until the guest clears it.
> > > > > > > > +
> > > > > > > > +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> > > > > > > > +the device MUST resume operation when the driver clears the STOP bit. The
> > > > > > > > +device MUST continue reading available descriptors as if an available buffer
> > > > > > > > +notification has reach it, starting from the last descriptor it marked as used,
> > > > > > >
> > > > > > > So I still tend to define virtqueue state as basic facility before
> > > > > > > defining STOP. It can makes thing easier.
> > > > > > >
> > > > > >
> > > > > > Yes, coming back to that approach can simplify the whole proposal.
> > > > > >
> > > > > > > > +and continue the regular operation after that. The device MUST read again
> > > > > > > > +descriptor and driver area beyond the last descriptor it marked as used when it
> > > > > > > > +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> > > > > > > > +if for some reason it cannot continue.
> > > > > > > > +
> > > > > > > >  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> > > > > > > >  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
> > > > > > > >  MUST send a device configuration change notification to the driver.
> > > > > > > > @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > > > > > > >    transport specific.
> > > > > > > >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> > > > > > > >
> > > > > > > > +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> > > > > > > > +  stop the device.
> > > > > > > > +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> > > > > > > > +
> > > > > > > >  \end{description}
> > > > > > >
> > > > > > > So I think the patch complicate thing is various ways:
> > > > > > >
> > > > > > > 1) STOP_FAILED status bit, which seems unnecessary or even duplicated
> > > > > > > with NEEDS_RESET
> > > > > > > 2) configuration change interrupt, looks conflict with the semantic of STOP
> > > > > >
> > > > > > I'm not sure about those two, I find we will have devices with unbound
> > > > > > stop time where both can be useful if we agree on making this a
> > > > > > general facility.
> > > > >
> > > > > If the unbound stop time is the only worry, the way to report inflight
> > > > > descriptors looks like a better solution.
> > > >
> > > > I'm not sure if that's the only condition under which a device can
> > > > fail to stop, but if we agree on that we could prepare a format for
> > > > block devices to report them, for example. They are needed somehow in
> > > > the networking case of packed if buffers are used out of order.
> > >
> > > It can, but let's leave it for the future now.
> > >
> > > Actually, for inflight buffers, a better idea is to support it at the
> > > virtqueue level without extra data structure. But it's for sure not a
> > > short term solution.
> > >
> >
> > Can you expand on this? Why do you think it is not a short term solution?
>
> For the inflight buffers, I guess we need add more data structure in
> the device area. That's fine. But I wonder if we can re-design the
> virtqueue carefully then the inflight buffers could be deduced from
> the vring. It requires a re-design of the current vring which is not
> easy.
>

I've never thought about that, but it's hard to pursue both vring
compaction and *interleaving* data that is not useful in normal
communication.

I'd say that allocating a region and letting the driver (including the
VMM) know when it can trust it is the way to go. SW device
implementations that can abort at any moment because of a bug or
whatever can use the area as if it were theirs, and keep it properly
updated. For HW, it's enough if they populate it at STOP. Memory is
allocated upfront but there is no need to use transport bandwidth to
keep it updated.

If we reuse that region, there is no need to make a separate config
call to get or set vq state. I find the config space call more
straightforward, but this covers way more use cases.

Thoughts?

Thanks!

> Thanks
>
> >
> > > >
> > > > > And STOP_FAILED is actually
> > > > > not accurate since it means the stop is not finished in bound time
> > > > > (but we need to define how long should be a bound time?)
> > > > >
> > > > > > Resetting the whole device because of this leaves
> > > > > > the driver with no possibility of knowing the state of the sent
> > > > > > descriptors.
> > > > > >
> > > > > > Of course, if these use cases are not interesting, it's easier to
> > > > > > leave them out for sure.
> > > > > >
> > > > > > > 3) status bit clearing (resuming), a functional duplication with RESET
> > > > > > > + DRIVER_OK
> > > > > > >
> > > > > >
> > > > > > I agree it can be obtained with a whole reset, so it can be out and
> > > > > > leave it for the future if needed. However it seems overkill if we
> > > > > > just want to rewind some descriptors back, and there is no standard
> > > > > > way to recover the device status beyond vq_state.
> > > > >
> > > > > It's more about the minimal self-contained set of the new features. If
> > > > > it's just rewind, device or virtqueue reset is sufficient.
> > > >
> > > > I'm not sure if that is true for all devices with the features the
> > > > standard offers at the moment, but it might be right for serial.
> > >
> > > Thanks
> > >
> > > >
> > > > > If we want
> > > > > to obtain the state, virtqueue state is a must and with virtqueue
> > > > > state, resuming (clearing STOP) is not a must.
> > > > >
> > > >
> > > > Right.
> > > >
> > > > Thanks!
> > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > > I think we'd better to stick to the minimal set of the function to
> > > > > > > reduce the complexity: virtqueue state + STOP bit (without clearing
> > > > > > > and no config interrupt).
> > > > > > >
> > > > > >
> > > > > > [1] https://lists.oasis-open.org/archives/virtio-comment/202107/msg00043.html
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > >
> > > > > > > >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > > > > > > > --
> > > > > > > > 2.27.0
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-17  8:08                 ` Eugenio Perez Martin
@ 2021-11-18  3:27                   ` Jason Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Wang @ 2021-11-18  3:27 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Virtio-Dev, virtio-comment, mst, Alexander Mikheev,
	Stefan Hajnoczi, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Wed, Nov 17, 2021 at 4:08 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Nov 17, 2021 at 4:27 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Tue, Nov 16, 2021 at 10:50 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Tue, Nov 16, 2021 at 7:56 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Tue, Nov 16, 2021 at 2:17 AM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Mon, Nov 15, 2021 at 5:08 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 12, 2021 at 6:51 PM Eugenio Perez Martin
> > > > > > <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Fri, Nov 12, 2021 at 5:18 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Nov 12, 2021 at 2:59 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > >
> > > > > > > > > This patch introduces a new status bit STOP. This can be used by the
> > > > > > > > > driver to stop the device in order to safely fetch used descriptors
> > > > > > > > > status, making sure the device will not fetch new available ones.
> > > > > > > > >
> > > > > > > > > Its main use case is live migration, although it has other orthogonal
> > > > > > > > > use cases. It can be used to safely discard requests that have not been
> > > > > > > > > used: in other words, to rewind available descriptors.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > > >
> > > > > > > > So this is much more complicated, see below.
> > > > > > > >
> > > > > > >
> > > > > > > I agree it's more complicated, but it addresses some concerns raised
> > > > > > > on previous patches sent to the list. Not saying that all of them must
> > > > > > > be addressed, or addressed this way though :).
> > > > > > >
> > > > > > > > > ---
> > > > > > > > >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >  1 file changed, 83 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/content.tex b/content.tex
> > > > > > > > > index 2aa3006..9ed0d09 100644
> > > > > > > > > --- a/content.tex
> > > > > > > > > +++ b/content.tex
> > > > > > > > > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > > > > > >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> > > > > > > > >    drive the device.
> > > > > > > > >
> > > > > > > > > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > > > > > > +  device has been stopped by the driver. This status bit is different
> > > > > > > > > +  from the reset since the device state is preserved.
> > > > > > > > > +
> > > > > > > > > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > > > > > > +  device could not stop the STOP request.
> > > > > > > > > +
> > > > > > > > >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> > > > > > > > >    an error from which it can't recover.
> > > > > > > > >  \end{description}
> > > > > > > > > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > > > > > >  recover by issuing a reset.
> > > > > > > > >  \end{note}
> > > > > > > > >
> > > > > > > > > +If VIRTIO_F_STOP has been negotiated,
> > > > > > > >
> > > > > > > > "has not been" actually?
> > > > > > > >
> > > > > > >
> > > > > > > I think the sentence is ok. In other words, "Even when VIRTIO_F_STOP
> > > > > > > *has been* negotiated (in other words, driver sent FEATURES_OK), the
> > > > > > > driver must not set or clear the STOP bit before setting DRIVER_OK".
> > > > > >
> > > > > > Ok, but what happens if we simply allow the STOP to be set if
> > > > > > DRIVER_OK is not set? It looks to me that the DRIVER_OK doesn't
> > > > > > conflict with STOP.
> > > > > >
> > > > > > (Anyhow we allow to set STOP after DRIVER_OK)
> > > > > >
> > > > >
> > > > > We could change it to "the driver MUST NOT set or clear STOP if
> > > > > FEATURES_OK is not set", which would allow the driver to start a
> > > > > device in stop mode. Before that should be definitely not done by a
> > > > > good driver.
> > > >
> > > > Yes, limiting it before FEATURES_OK is a must.
> > > >
> > > > >
> > > > > But if we don't allow the resume, it makes little sense to allow the
> > > > > driver to start (as "set DRIVER_OK bit") in stop mode anyhow.
> > > >
> > > > Yes.
> > > >
> > > > > I would
> > > > > say that it is better to limit that now, and allow it in the future if
> > > > > we find a valid use case, enabling a specific feature flag for it.
> > > > >
> > > > > I'm also fine if we decide to leave this unspecified, but limiting it
> > > > > now could enable us to make something useful with it in the future.
> > > >
> > > > Leaving it unspecified seems better since if we do the limitation, it
> > > > introduces extra efforts for future extension.  But I'm fine with
> > > > either.
> > > >
> > > > >
> > > > > > >
> > > > > > > > > the driver MUST NOT set or clear STOP if
> > > > > > > > > +DRIVER_OK is not set.
> > > > > > > > > +
> > > > > > > > > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > > > > > > > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > > > > > > > > +acknowledges the new paused status setting the first, or the failure setting
> > > > > > > > > +the last. Since this change may not be instantaneous, the driver MAY wait for
> > > > > > > > > +the configuration change notification that the device must send after the
> > > > > > > > > +change.
> > > > > > > >
> > > > > > > > This is kind of tricky, it means the device can send notification
> > > > > > > > after it has been stopped.
> > > > > > >
> > > > > > > I don't think this part it's so tricky. That notification is also sent
> > > > > > > when the DEVICE_NEEDS_RESET bit is set,
> > > > > >
> > > > > > I think they are different, DEVICE_NEEDS_RESET doesn't mean the device
> > > > > > is stopped.
> > > > >
> > > > > To clarify, what I meant is that there are situations where this
> > > > > notification is raised even if device configuration is not changed,
> > > > > but its status.
> > > >
> > > > Right, but again the notification is not a must for the status changed
> > > > (e.g reset).
> > > >
> > > > >
> > > > > NEED_RESET does not mean the device is stopped, but it (should) signal
> > > > > the driver that further interaction with the device will be for sure
> > > > > invalid. I may be wrong with this, but this way of notifying the
> > > > > driver relieves it to the need for check status in every interaction.
> > > >
> > > > You are right. But it's still different from stop: we don't need to
> > > > check it for every interaction. What we need is to check it only after
> > > > the driver tries to stop the device (as with reset).
> > > >
> > > > >
> > > > > > But what we want to achieve is to make sure there won't be
> > > > > > any interaction between device and driver after STOP is set by device.
> > > > > >
> > > > >
> > > > > If I understand you correctly, what you meant is that that a driver
> > > > > could (and I think it will a lot of times) read the status change in
> > > > > this order:
> > > > > 1) STOP bit is set
> > > > > 2) Notification change arrives
> > > > >
> > > > > And 2) is weird since the device promised no more interaction somehow.
> > > > >
> > > > > I agree to some extent, but it can be read even from the opposite
> > > > > angle: From the moment the driver sets DRIVER_OK, every change on the
> > > > > device (status or config) is notified using configuration change
> > > > > interrupt.
> > > >
> > > > For status, It depends on what you mean by "change". If it's the value
> > > > that read from the driver:
> > > >
> > > > 1) The only thing that needs to be notified is the status that can
> > > > only be set by device, that is NEEDS_RESET.
> > > > 2) For the status that set by the driver and device can refuse
> > > > (forever or temporarily), there's no notification change: DRIVER_OK,
> > > > FEATURE_OK
> > > >
> > > > STOP belongs to 2). STOP_FAILED belongs to 1), but:
> > > >
> > > > 1) STOP status bit 0 means the device is not stopped
> > > > 2) We don't have DRIVER_FAILED and FEATURE_FAILED, instead, we just
> > > > check whether or not the bit is set by the device.
> > > >
> > > > So whether or not we need STOP_FAILED is still questionable.
> > > >
> > >
> > > I don't split in "bits set by device" or "set by the driver". The way
> > > I see it, the driver send a request, and the device is going to change
> > > its status in the future. The driver can read the status reading the
> > > two bits bitfield:
> > >
> > > STOP_FAILED
> > > |STOP
> > > ||
> > > 00 - Running normally
> > > 01 - Device has stopped successfully
> > > 10 - Device could not stop, but is running normally
> >
> > I wonder what's the advantage of differing 10 from 00?
> >
>
> If the stop is not instantaneous, the driver knows when to stop
> polling, or if it has to retry.

Yes, I meant the driver knows the stop is sent from itself. So in this
case 00 after stop means 10 above.
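
To make the two-bit table above concrete, here is a small sketch of how
a driver could decode it. The bit values (STOP = 16, STOP_FAILED = 32)
are the ones in this patch; the function name and the returned strings
are only for illustration.

```python
STOP = 16         # value proposed in this patch
STOP_FAILED = 32  # value proposed in this patch


def decode_stop_state(status):
    """Map the (STOP_FAILED, STOP) pair of a device status byte to a label."""
    stopped = bool(status & STOP)
    failed = bool(status & STOP_FAILED)
    if stopped and failed:
        return "not expected at this moment"
    if stopped:
        return "stopped"
    if failed:
        return "could not stop, still running"
    return "running"


# ACKNOWLEDGE | DRIVER | DRIVER_OK | FEATURES_OK
base = 1 | 2 | 4 | 8
print(decode_stop_state(base))                # running
print(decode_stop_state(base | STOP))         # stopped
print(decode_stop_state(base | STOP_FAILED))  # could not stop, still running
```

If STOP_FAILED is dropped, a driver that has just written STOP and
still reads "running" is in the same position as one reading 10 here.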

>
> > > 11 - Cannot find this combination at this moment.
> > >
> > > > >
> > > > > a) Regarding the standard, I don't see it so different from the
> > > > > NEED_RESET: the config change keeps being an out of band notification
> > > > > system the driver can rely on to learn of an (expected) status change.
> > > > > b) I don't see a big deal with changing the semantic from "no more
> > > > > interaction from the device" with "no more interaction but the
> > > > > expected config change interrupt".
> > > > > c) It's easy to ignore the interrupt, or even not to treat it
> > > > > specially after the stop: The driver already should scan config to
> > > > > look for changes in configuration and status, it will simply find
> > > > > none. Although this is not implemented widely as far as I see.
> > > > >
> > > > > In that regard, I feel that interaction is very innocuous, and to me
> > > > > is the straightforward solution to avoid the active polling.
> > > >
> > > > Well, the driver can choose not to do busy polling for sure without
> > > > the interrupt for sure.
> > > >
> > >
> > > You mean using the transport specific method you describe in previous mails?
> >
> > Yes and no, it could be:
> >
> > 1) transport specific method
> > 2) any other method that doesn't do busy polling (timer, sleep etc)
> >
>
> 2 without 1 could increase migration time,

It depends on the VMM actually. E.g. QEMU's downtime is in ms, so we
can sleep for sub-ms intervals and the latency introduced can be ignored.

> and block a CPU only to
> check for the status.

It's not too bad as we do this for reset as well.
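
A minimal sketch of polling without busy-waiting, assuming a
hypothetical read_status() callable that returns the device status
byte (how the status is read is transport specific and not modeled
here). The interval and timeout values are placeholders, not
recommendations.

```python
import time

STOP = 16  # value proposed in this patch


def wait_for_stop(read_status, timeout_s=0.01, interval_s=0.0005, sleep=time.sleep):
    """Poll until the STOP bit is read back or the timeout expires.

    Returns True if the device reported STOP, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if read_status() & STOP:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Fake device for the example: reports STOP on the third read.
reads = iter([15, 15, 15 | STOP])
print(wait_for_stop(lambda: next(reads), timeout_s=1.0))  # True
```

On a False return the caller applies its own policy, for example
giving up on the stop.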

>
> > >
> > > > >
> > > > > > > and (as I read) is for the
> > > > > > > same reason somehow: To avoid the status polling:
> > > > > > > * "The driver SHOULD NOT rely on completion of operations of a device
> > > > > > > if DEVICE_NEEDS_RESET is set." (copied from the standard)
> > > > > > > * The reading of the status field could be expensive / inconvenient in
> > > > > > > each operation.
> > > > > >
> > > > > > It makes sense for the device initiated event to use interrupt. But
> > > > > > for a stop, it's driver initiated, in this case the driver won't start
> > > > > > the work (for example the cleanup) after it makes sure the device is
> > > > > > stopped. Polling the status should be fine as this is how the rest
> > > > > > works. Anything makes stop differ from reset here? Or what worries you
> > > > > > without the interrupt?
> > > > > >
> > > > >
> > > > > This is proposed only in the scope of the concerns I saw raised in
> > > > > previous series: the time to stop a device could be unbound, and
> > > > > tricks to poll less frequently will increase migration time.
> > > >
> > > > But I don't see how an interrupt can help to reduce the time spent on
> > > > the stop.
> > >
> > > If a host has many pass-through devices, it needs to burn CPU to ask
> > > all of them periodically. To reduce that burden, it poll less
> > > frequently. Because of that, some devices are stopped but hypervisor
> > > is not aware of that.
> > >
> > > To solve it, the device must be able to tell the hypervisor / driver
> > > when it has stopped. Of course, interrupt may not be the only way, and
> > > actively polling will always be a choice.
> >
> > Yes, usually linux drivers will not do busy polling, instead it can
> > use cpu_relax() or msleep() during the polling.
> >
>
> I understand that both of them work technically, but CPU relax does
> not allow to introduce more work with that thread and msleep increases
> the migration time.

(See above reply for reset)

>
> > >
> > > > The downtime is usually a user policy, so the VMM can choose
> > > > to timeout the stop and perform the resume.
> > > >
> > >
> > > With "resume" do you mean to actually reset the device if the STOP bit
> > > is not set in time? If that is a possibility then it is sure we will
> > > need to handle the inflight descriptors.
> >
> > Yes, but I wonder whether or not we can leave the inflight descriptors
> > for the future. (If it's not a must for networking). The idea is to
> > have something that can work quickly. But I'm also fine to propose it
> > now. With that I believe most of the device can be stopped in a short
> > time (we don't need to wait for the completion of inflight requests).
> >
>
> (answered below, let me know if it is not enough).
>
> > >
> > > > As discussed, the way to advertise the inflight buffers might be a
> > > > solution for this.
> > > >
> > >
> > > If these transactions are idempotent, as you say later, then yes.
> > >
> > > > >
> > > > > I will fully agree if these are left to the future: it is easy to
> > > > > implement this chunk of the proposal under a separated feature flag if
> > > > > this need arises. Sorry if that part was not clear enough.
> > > >
> > > > That's fine.
> > > >
> > > > >
> > > > > > > * Solution: Instead of polling, make a device facility to notify the
> > > > > > > driver that it cannot trust the device is going to behave properly /
> > > > > > > same as before anymore via notification.
> > > > > > >
> > > > > > > We can add another exception to the "device configuration space
> > > > > > > change" in "Notification of Device Configuration Changes", like the
> > > > > > > one already present:
> > > > > > > "In addition, this notification is triggered by the device setting
> > > > > > > DEVICE_NEEDS_RESET".
> > > > > > >
> > > > > > > I understand it sounds tricky that the device sends a notification
> > > > > > > when it's stopped, but in my opinion it's aligned with previous
> > > > > > > behavior (DEVICE_NEEDS_RESET),
> > > > > >
> > > > > > I think not,  e.g DEVICE_NEEDS_RESET doesn't (or it can't) mean the
> > > > > > device won't process the buffer or send an interrupt.
> > > > > >
> > > > >
> > > > > From the driver point of view, it means that the driver cannot trust
> > > > > the device anymore until the reset, so the driver actions are similar:
> > > >
> > > > I think we need to clarify the exact semantics of STOP_FAILED, and it will
> > > > be very hard to differentiate
> > > >
> > > > 1) The device can not be stopped in short time
> > > >
> > > > from
> > > >
> > > > 2) The device can not be stopped forever
> > > >
> > > > At least in case 1) the driver still can trust the device.
> > > >
> > >
> > > I think that from the driver POV is the same. If device cannot stop,
> > > it will continue operating normally.
> > >
> > > > >
> > > > > ""
> > > > > the driver can’t assume requests in flight will be completed if
> > > > > DEVICE_NEEDS_RESET is set, nor can it assume that they have not been
> > > > > completed
> > > > > ""
> > > > >
> > > > > (Sorry for being circular here, I think it proceeds here too) What I
> > > > > meant is that the device sent an out of band notification when the
> > > > > device status changed. The driver could check the status field before
> > > > > processing every used buffer and also with a timer just in case, and
> > > > > DEVICE_NEEDS_RESET would not need the config interrupt change. But the
> > > > > interrupt gives convenience to the whole operation.
> > > > >
> > > > > Every time the driver gets that interrupt, it must re-check all the
> > > > > device configuration and status anyway. It can still make buffers
> > > > > available while processing it, but that's the meaning of the interrupt
> > > > > to me. And a status change after DRIVER_OK fits to it, from my point
> > > > > of view.
> > > >
> > > > I fully agree, but it's different:
> > > >
> > > > > 1) The driver doesn't know when there will be a NEEDS_RESET, so without
> > > > > an interrupt, it must check the status for each operation
> > > > > 2) The driver knows when there will be STOP/STOP_FAILED, so it only needs
> > > > to check the status after it tries to stop the device
> > > >
> > >
> > > > You know that the device will set it at some point in the future, within
> > > > an unbounded time.
> >
> > So my point is introducing facilities to avoid the unbound time. I
> > think both of us know dealing with it is not easy.
> >
>
> Sure, that works for me.
>
> > > That could not be a problem, but there are concerns
> > > raised on previous patches about this.
> > >
> > > > >
> > > > > > > it's explicitly stated that it will be
> > > > > > > the last one, and it's caused because of the inconvenience of polling
> > > > > > > device status. Even if the driver can use other mechanisms.
> > > > > >
> > > > > > I think STOP works much more similarly to reset not NEEDS_RESET. The
> > > > > > only difference with reset is that STOP needs to preserve the device
> > > > > > states and we don't (or can't) use interrupt to signal the completion
> > > > > > of reset.
> > > > > >
> > > > >
> > > > > From the semantic point of view, yes. But in practical terms we can
> > > > > face unbounded time.
> > > >
> > > > The problem is:
> > > >
> > > > 1) how to define the "unbounded time", if we don't define it exactly,
> > > > the device may abuse the status bit which may cause a lot of troubles
> > >
> > > To me, the unbounded time is where the device does not guarantee that
> > > you will have the STOP status set the next time you read it. In case
> > > > of an untrusted device, this may include every one of them, for sure.
> >
> > Then the device behaviour is tied to the driver's behaviour. The
> > problem is that there's no bound time between the write and the
> > following read. There could be arbitrary time in the middle.
> >
>
> Right. In my opinion, that's why a device should be able to report
> that the stop is either in progress or failed. But *idempotent* in
> flight descriptors help for sure.

Yes.
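
As a rough illustration of the "timeout in driver + reset" fallback
discussed in this thread, here is a minimal Python model. It is only a
sketch: the FakeDevice class, the STATUS_STOP / STATUS_STOP_FAILED bit
values and the max_polls parameter are assumptions made up for the
example; they are not taken from the patch or from any existing driver.

```python
# Placeholder bit values, chosen only for this sketch (not from the patch).
STATUS_DRIVER_OK = 0x04
STATUS_STOP = 0x20         # hypothetical
STATUS_STOP_FAILED = 0x10  # hypothetical


class FakeDevice:
    """Toy device that answers a STOP request only after `delay` status reads."""

    def __init__(self, delay, fails=False):
        self.status = STATUS_DRIVER_OK
        self._delay = delay
        self._fails = fails
        self._stop_requested = False

    def write_status(self, value):
        if value == 0:
            # Reset: drop all state.
            self.status = 0
            self._stop_requested = False
        elif value & STATUS_STOP:
            self._stop_requested = True

    def read_status(self):
        if self._stop_requested:
            if self._delay > 0:
                self._delay -= 1
            else:
                self.status |= STATUS_STOP_FAILED if self._fails else STATUS_STOP
                self._stop_requested = False
        return self.status


def stop_or_reset(dev, max_polls):
    """Request STOP, poll the status a bounded number of times, else reset."""
    dev.write_status(dev.read_status() | STATUS_STOP)
    for _ in range(max_polls):
        status = dev.read_status()
        if status & STATUS_STOP:
            return "stopped"
        if status & STATUS_STOP_FAILED:
            return "stop_failed"
    dev.write_status(0)  # driver-side timeout: give up and reset
    return "reset"
```

The poll count stands in for a real timer. The point is only that the
driver bounds the wait on its own side, whatever the device does.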

>
> > >
> > > You care a little bit less about it if the device notifies you about
> > > the condition: You simply have the driver waiting for it, but
> > > operating normally in other regards.
> >
> > This can be achieved even without an interrupt.
> >
>
> Sure, the interrupt is just the straightforward way to do so (or the
> first that comes to my mind at least). There may be more convenient
> ways to do so.
>
> > >
> > > That notification can be a basic facility, a transport device one, or
> > > whichever other method.
> > >
> > > To mandate that the device is stopped the next time the status bit is
> > > read is another possibility for sure but if I understood correctly
> > > that would let other devices out. If vqs descriptors are idempotent,
> > > inflight_fd may be a way to get rid of that unbound time for sure.
> >
> > To be clear, inflight_fd should be part of the device area?
> >
>
> This is a little bit tricky, but I would say they should.
>
> I think this deserves another thread of its own, but it also allows
> other situations I have in my backlog to work way better. For example,
> in the case of emulated out of qemu devices (OVS, packed queue, no
> inflight_fd), it is impossible to recover the queue in case of fatal
> failure of the device. This would allow VMM to continue the connection
> with a new spawned device with no notice for guests.
>
> It would be ideal to place that chunk of memory in the VMM, but I'm
> not sure if self-contained devices would allow that.

Yes, that's somewhat my point as well. From the point of view of in-flight
reporting, the current virtqueue is not self-contained. So it should be
fine to introduce facilities to report them.

>
> > >
> > > > 2) there are other approaches that we can deal with the unbound time,
> > > > timeout in driver + reset
> > > >
> > > > > I mean, both operations have unbound time for
> > > > > sure, but I would say that any device should handle reset way faster
> > > > > than the STOP.
> > > >
> > > > Any reason that STOP is faster? I'd expect STOP is a subset of reset,
> > > > that is, in order to do reset, we must first stop.
> > > >
> > >
> > > I'm not sure if that is the way hardware can work, but with reset the
> > > device can reclaim the needed resources for the communication in an
> > > asynchronous way, since they are not needed for the driver to ask.
> >
> > Yes and I think we can do the same for the stop.
> >
> > > With the STOP, they need to be communicated to the driver somehow in a
> > > virtio way.
> >
> > So I think if it's device specific, we'd better not assume which is
> > faster. Instead, as discussed, it's better to try our best to avoid
> > > the unbound stop (as we did for reset). If we fail, it's not too late
> > to consider how to recover from that.
> >
>
> I'm fine with that.
>
> > >
> > > > >
> > > > > I fully agree on your point, but I can also see the other way around:
> > > > > It would be convenient to have a configuration interrupt for the reset
> > > > > too, but it is impossible since we cannot configure any before the
> > > > > reset.
> > > >
> > > > > I think it's more about the question:
> > > >
> > > > 1) Why an interrupt is a must for STOP
> > > >
> > > > than
> > > >
> > > > 2) If we can use an interrupt
> > > >
> > > > My questions are all for 1).
> > > >
> > >
> > > It's not a must.
> > >
> > > > >
> > > > > > >
> > > > > > > If the community still has concerns about it, another option is to
> > > > > > > actually extract the way the device notifies it from the general
> > > > > > > facilities, and make it transport specific. But to use the device
> > > > > > > configuration change notification for this makes sense to me. The
> > > > > > > device configuration has changed.
> > > > > >
> > > > > > See above, I think we should have a consistent way to handle reset and stop.
> > > > > >
> > > > > > >
> > > > > > > > As discussed in the previous versions,
> > > > > > > > > the driver is free to use a timer or whatever other mechanism if it
> > > > > > > > > doesn't like the busy polling. I wonder how much value we can gain
> > > > > > > > > from a dedicated config interrupt, especially considering that some
> > > > > > > > > transports can use a transport-specific interrupt (not a virtio-specific
> > > > > > > > > one) to report whether or not setting the status succeeded.
> > > > > > > >
> > > > > > >
> > > > > > > In my opinion, *if* we agree that a stop is a virtio facility and not
> > > > > > > a per-device one, and *if* we agree that a notification is required
> > > > > > > for the device to notify the stop, it makes sense to use a
> > > > > > > transport-independent mechanism that the device must already
> > > > > > > implement.
> > > > > >
> > > > > > So the major question is why a notification is a must? And Just to be
> > > > > > clear, there could be transport specific mechanisms for error
> > > > > > reporting.
> > > > > >
> > > > > > > E.g.
> > > > > >
> > > > > > 1) PCI can have non-posted write, if we use non-posted write to carry
> > > > > > the stop command, the device can return whether or not the device is
> > > > > > stopped successfully.
> > > > > >
> > > > > > or
> > > > > >
> > > > > > 2) Some other transport can convert the stop status bit set into a
> > > > > > > command and queue it to a device-specific queue; the device can then use
> > > > > > > its own specific interrupt to report when the stop is handled
> > > > > > (success or fail)
> > > > > >
> > > > >
> > > > > I would be totally fine with that too.
> > > > >
> > > > > > >
> > > > > > > > >If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > > > > > > > > +try new STOP attempts.
> > > > > > > >
> > > > > > > > Does the device need to re-read the STOP_FAILED for synchronization?
> > > > > > >
> > > > > > > I tend to see the status as something that belongs to the device and
> > > > > > > is exposed to the driver. In that sense, the write from the guest
> > > > > > > triggers an event on the device, and the device decides what will be
> > > > > > > exposed on that field (MMIO?) on the next driver read. If it's not
> > > > > > > that way, we couldn't use the STOP bit that way, right?
> > > > > >
> > > > > > Yes, but this is not an answer to my question. It's about the
> > > > > > ordering, when write returns it doesn't mean the write arrives at the
> > > > > > device, this is the case of PCI at least. So we need a mechanism to
> > > > > > make sure the write arrives at the device (PCI read will flush
> > > > > > previous write).
> > > > > >
> > > > >
> > > > > I didn't see that in your original question, sorry. But the PCI read
> > > > > > that flushes the write is the driver's one, isn't it?
> > > > >
> > > > > In that case I would say that "the read" is part of "the write". It's
> > > > > an issue of the PCI protocol, which I would say doesn't belong to this
> > > > > section (or even this document?): To implement virtio over PCI, you
> > > > > know that virtio needs a write, and, in particular, you know that PCI
> > > > > needs a posterior read to make sure that write is effective.
> > > > >
> > > > > Either that, or that the driver must use non-posted ones if it wants
> > > > > the device to note it.
> > > > >
> > > > > Or am I still missing something?
> > > >
> > > > > Just to make sure we are on the same page, in this paragraph, you
> > > > said this at the beginning:
> > > >
> > > > "If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > > to ensure the STOP or STOP_FAILED bit is set after the write."
> > > >
> > > > And in the end of the paragraph:
> > > >
> > > > If the device sets the STOP_FAILED bit, the driver MUST clear it
> > > > before try new STOP attempts. But you don't define whether we need to
> > > > re-read to make sure STOP_FAILED is clear if the driver tries to clear
> > > > it. Is this intended?
> > > >
> > >
> > > Not really, I assumed that the write of clearing STOP_FAILED and the
> > > write of STOP would be ordered, and that the device should have no
> > > problem clearing that status bit, since I don't see any kind of
> > > resource reclamation or anything like that there.
> >
> > Ok.
> >
> > >
> > > > >
> > > > > > >
> > > > > > > > I
> > > > > > > > wonder how much we can gain from STOP_FAILED, the patch is unclear on
> > > > > > > > when that the device needs to set this bit. And driver can choose to
> > > > > > > > reset after a specific timeout anyhow.
> > > > > > > >
> > > > > > >
> > > > > > > The conditions where the device needs to set this bit are unspecified
> > > > > > > > because they depend on the device: not only on the kind of device, but
> > > > > > > also on the device backend.
> > > > > > >
> > > > > > > The same condition (regarding the possibility of handling the pending
> > > > > > > buffers) could cause different devices to react differently. A network
> > > > > > > device could decide it's fine to drop pending tx, let the guest think
> > > > > > > that "the network lost them", and mark them as done,
> > > > > >
> > > > > > We may meet the similar issue during reset.
> > > > > >
> > > > >
> > > > > Yes, but the driver should be fine to fail a reset, it does not want
> > > > > to use the device anymore or it wants to totally override the device
> > > > > state. If a stop fails, the driver would expect the device to continue
> > > > > operating in my opinion, because it will be impossible to recover the
> > > > > device state.
> > > >
> > > > This is only true if we allow the stop to be failed. It would be an
> > > > issue if the driver fails to stop a device since it can fail the stop
> > > > of the entire VM which is not something that the VMM is expecting.
> > > >
> > > > > If we don't allow the stop to fail and we allow the device to expose
> > > > the inflight buffers, we are all fine:
> > > >
> > > > 1) VM is guaranteed to be stopped
> > > > 2) stop can be finished in time
> > > >
> > > > Devices are free to choose to wait for the short time request and tag
> > > > the long time request as inflight.
> > > >
> > > > >
> > > > > This is again something that we could leave if we decide it is not
> > > > > necessary at this moment: It just shows how a concern of previous
> > > > > proposals can be solved, at least technically.
> > > >
> > > > To me, I think we can start from a set of functions that can make e.g
> > > > the virtio-net to work to unblock:
> > > >
> > > > 1) live migration work
> > > > 2) extensions to other devices (e.g inflight could be done on top as
> > > > new features)
> > > >
> > > > >
> > > > > > > where a
> > > > > > > persistent storage cannot do that for write requests. Just as an
> > > > > > > example, not saying that networking devices must do that :).
> > > > > >
> > > > > > So I think this brings extra complexity that we probably don't need to
> > > > > > worry about now. The reason is that the spec doesn't allow the reset
> > > > > > to fail.
> > > > > >
> > > > >
> > > > > It can be left for the future for sure.
> > > > >
> > > > > > >
> > > > > > > > > +
> > > > > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > > > >
> > > > > > > > Any motivation for this? it looks to me it makes the feature coupled
> > > > > > > > with the virtqueue state proposal? It seems odd to allow avail change
> > > > > > > > but not the last_avail_idx change.
> > > > > > > >
> > > > > > >
> > > > > > > On second thought, I think you are right and this overlaps with the
> > > > > > > state proposal.
> > > > > > >
> > > > > > > > > +
> > > > > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > > > > > > > > +the driver MAY change any descriptor.
> > > > > > > > > +
> > > > > > > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> > > > > > > > > +the driver can resume it clearing the STOP status bit. It MUST re-read the
> > > > > > > > > +device status to ensure the STOP bit is clear after the write. The device
> > > > > > > > > +acknowledges the new status clearing it. Since this change may not be
> > > > > > > > > +instantaneous, the driver MAY wait for the configuration change notification
> > > > > > > > > +that the device must send after the change.
> > > > > > > >
> > > > > > > > > Do we really need resuming? It's kind of:
> > > > > > > >
> > > > > > > > 1) STOP -> clear STOP
> > > > > > > >
> > > > > > > > vs
> > > > > > > >
> > > > > > > > 2) STOP -> RESET -> DRIVER_OK
> > > > > > > >
> > > > > > > > > Using 2) preserves the semantics that the driver can't clear the status bit.
> > > > > > > >
> > > > > > >
> > > > > > > You are totally right in that regard. But the use case simplifies the
> > > > > > > operation when the driver only wants to take back some available
> > > > > > > descriptors still not used, in the range last_avail_idx..avail_idx.
> > > > > > > Doing that could be a big burden for drivers, who would need to
> > > > > > > re-send every status. MST proposed that use case at [1].
> > > > > >
> > > > > > Yes, but it looks to me this doesn't require the resuming? And the per
> > > > > > virtqueue reset is being proposed here.
> > > > > >
> > > > > > https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07818.html
> > > > > >
> > > > > > Actually, there's a subtle difference between 1) and 2). That is using
> > > > > > 2) doesn't make sure we can "resume" from the index where we stopped.
> > > > > > But this won't be an issue considering we know that we need to support
> > > > > > setting device virtqueue state(index). So if we want to resume from
> > > > > > the exact index it could be:
> > > > > >
> > > > > > STOP -> RESET -> setting index -> DRIVER_OK
> > > > > >
> > > > >
> > > > > With the state I meant more than VQ state, but the device state in
> > > > > general. For example, for the network, you must also send all the
> > > > > needed control commands to recover mac, rx filters, etc.
> > > >
> > > > I'm not sure I get this. For those cvq stuff, with the help of the
> > > > shadow virtqueue, we don't need any spec extensions. What did I miss
> > > > here?
> > > >
> > >
> > > Right, that was not the example I intended to put actually. I did a
> > > lot of back and forth for the answer and I put the wrong one here,
> > > sorry :).
> > >
> > > > But my concern is solved if we treat the in-flight descriptors as idempotent. All
> > > unanswered things above this are solved with that too.
> > >
> > > > >
> > > > > That's what I meant with "if you just want to rewind some descriptors,
> > > > > resetting the whole device is overkill".
> > > > >
> > > > > The example may be wrong, I can think of virtiofs and the need to keep
> > > > > files opened:
> > > > > > * If we go through a full reset cycle, the files opened may not be
> > > > > > the same as the closed ones, like deleted files with open handles.
> > > > > > * If we go through a full reset cycle, watchers may skip a change.
> > > > >
> > > > > Of course, this complexity may be left for the future and simply state
> > > > > that, if that is the case, the device cannot offer stop feature.
> > > > > > Virtiofs already has other complexities that make its migration
> > > > > hard, but I think the point is explained.
> > > >
> > > > There are long discussions about the virtiofs migration. But it's out
> > > > of the scope for the discussion of device stop since it's mainly about
> > > > how to define and expose device states. For stop, it's more than
> > > > > sufficient to say the device states need to be preserved after the
> > > > device is stopped.
> > > >
> > > > I'd rather go with something simple to work for a simple type of
> > > > device like ethernet. Otherwise there will be endless discussion. For
> > > > any features that are not needed by the ethernet device, I would leave
> > > > it for future investigation.
> > > >
> > > > >
> > > > > > >
> > > > > > > In that regard, the straightforward thing to do is modify avail_idx /
> > > > > > > descriptors from that range and let resume. However, the RESET path
> > > > > > > makes it easier to implement the device part of course, and the guest
> > > > > > > can also achieve the rewind that way.
> > > > > > >
> > > > > > > > > +
> > > > > > > > >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> > > > > > > > >
> > > > > > > > >  The device MUST NOT consume buffers or send any used buffer
> > > > > > > > >  notifications to the driver before DRIVER_OK.
> > > > > > > > >
> > > > > > > > > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > > > > > > > > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > > > > > > > > +or clear of STOP.
> > > > > > > > > +
> > > > > > > > > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > > > > > > > > +operations after the driver writes STOP.
> > > > > > > >
> > > > > > > > I wonder if it's better to leave this to device to decide. E.g some
> > > > > > > > > block devices may require a very long time to finish the inflight
> > > > > > > > operations.
> > > > > > > >
> > > > > > >
> > > > > > > (Letting out SVQ + inflight descriptors for this part of the response,
> > > > > > > I will come back to it later)
> > > > > > >
> > > > > > > But if virtqueue is not valid anymore, how can it report them when
> > > > > > > finished?
> > > > > >
> > > > > > It's still valid since the STOP bit is not set by the device.
> > > > > >
> > > > >
> > > > > Then I don't understand your answer. To my proposal of:
> > > > >
> > > > > "If VIRTIO_F_STOP has been negotiated, the device MUST finish any
> > > > > in-flight operations after the driver writes STOP."
> > > > >
> > > > > You answered:
> > > > >
> > > > > "I wonder if it's better to leave this to device to decide. E.g some
> > > > > > block devices may require a very long time to finish the inflight
> > > > > operations."
> > > > >
> > > > > The device must finish all requests before it shows the STOP bit as
> > > > > > set to the driver. Maybe it is better to rephrase it like:
> > > > >
> > > > > If VIRTIO_F_STOP has been negotiated, the device MUST finish any
> > > > > in-flight operations after the driver writes STOP and before it sets
> > > > > its status bit STOP as set.
> > > > >
> > > > > ?
> > > >
> > > > I meant when to set STOP is highly device specific. E.g for the virtio
> > > > > block devices which allow the in-flight requests to be re-submitted
> > > > after resume, the device can choose to not wait for the completion of
> > > > the inflight operations and expose them to the driver. This helps to
> > > > reduce the time spent on the stop.
> > > >
> > > > >
> > > > > > > In that sense, I would say it's better to report failure and
> > > > > > > let the guest handle it as if the disk is unavailable (timeout?
> > > > > > > temporary faulty sector? I'm not sure what is the most suitable way).
> > > > > >
> > > > > > This could be addressed by leaving the following choices to the devices:
> > > > > >
> > > > > > 1) complete the inflight requests
> > > > > > 2) device or virtio specific for reporting inflight descriptors
> > > > > >
> > > > >
> > > > > As previously, I'm not sure how this relates with "the stop bit is not
> > > > > set by the device", so my answer may be completely wrong here,
> > > >
> > > > It's related to time spent on the stop. E.g for block devices, it can
> > > > simply show all the inflight buffers to guests and set the stop bit.
> > > > Then STOP should be very fast.
> > > >
> > > > >
> > > > > Even assuming the device can report in-flight descriptors, it needs to
> > > > > wait for the backend before reporting them anyhow.
> > > >
> > > > Why does it need to wait for the backend? I mean for the device that
> > > > supports in-flight descriptors, the semantic of the device should
> > > > allow the requests to be processed twice.
> > > >
> > >
> > > If that is true the proposal would be greatly simplified of course.
> >
> > It seems to be true for virtio-blk (at least current Qemu migrates
> > inflight requests). But it might not be true for other type of devices
> > (anyhow we can leave them for future investigation).
> >
>
> That's the issue, but if qemu already migrate them that way I think we
> should be safe.
>
> > >
> > > > > And we would need
> > > > > another indication. What is the use of separating these status?
> > > > > (waiting for stop bit, waiting for inflight descriptors to be valid).
> > > > >
> > > > > The only possibility I can come up with is to actually stop the
> > > > > request right in the middle of an operation. For example, to allow a
> > > > > big block read to stop and then when the device is informed about
> > > > > these inflight descriptors and its progress, it can continue. I would
> > > > > say this is very out of scope, more about this later ([1]).
> > > > >
> > > > > > >
> > > > > > > *If* we are not going to allow the guest to resume operation, where it
> > > > > > > > knows all the status of the device, then there is no value in letting the
> > > > > > > > device delay the operation: from the guest's point of view, it either
> > > > > > > > succeeded in sending to the device backend and somebody else caused a
> > > > > > > > failure (the external network lost the tx packet, bit rot made me
> > > > > > > > read a different value than previously written), or it failed at
> > > > > > > the stop moment.
> > > > > >
> > > > > > So it's highly device specific, e.g for ethernet, we can afford the
> > > > > > loss of packets but not for the block devices so reporting inflight
> > > > > > > descriptors may help to resubmit those after "resuming".
> > > > > >
> > > > >
> > > > > Right.
> > > > >
> > > > > > >
> > > > > > > This is different with the resume possibility, where the device can
> > > > > > > decide to hold the descriptors, stop operating, and then resume
> > > > > > > operation.
> > > > > > >
> > > > > > > > > Depending on the device, it can do it
> > > > > > > > > +in many ways as long as the driver can recover its normal operation if it
> > > > > > > > > +resumes the device without the need of resetting it:
> > > > > > > > > +\begin{itemize}
> > > > > > > > > +\item Drain and wait for the completion of all pending requests until a
> > > > > > > > > +convenient avail descriptor. Ignore any other posterior descriptor.
> > > > > > > > > +\item Return a device-specific failure for these descriptors, so the driver
> > > > > > > > > +can choose to retry or to cancel them.
> > > > > > > >
> > > > > > > > If we allow the driver to retry, we need a way to report inflight
> > > > > > > > buffers which is not supported by the spec. A way to solve this is to
> > > > > > > > make it device specific.
> > > > > > > >
> > > > > > >
> > > > > > > Regarding the retry, I don't get you here. Re-reading the patch, I
> > > > > > > think that "driver retry" is very ambiguous: I meant for the device to
> > > > > > > mark the descriptor as used, but with a communication specific error
> > > > > > > code, so the application, guest kernel, etc (driver in the standard)
> > > > > > > can decide to retry.
> > > > > >
> > > > > > > That's why I think introducing the virtqueue state is a must for stop.
> > > > > > > With all the indexes defined, it would be much easier to describe what
> > > > > > > the device or driver is expected to do.
> > > > > >
> > > > >
> > > > > I still don't see the relationship, sorry.
> > > >
> > > > E.g how do you define the in-flight buffers accurately?
> > > >
> > >
> > > If we can make them idempotent, one descriptor is in flight if it is
> > > available and the device is aware of that.
> >
> > > Somewhat, but it would be tricky to define 'aware' since that is
> > > device-specific.
> >
>
> We can define it in terms of VirtIO queues but I see it way better if
> I think from "the device point of view" about them. If you have a
> packed virtqueue, the in-flight descriptors are the available
> descriptors that the driver placed in a given position but then were
> overridden by used descriptors with a different id. In the moment they
> are used, they are not available or in-flight anymore.
>
> So the in-flight descriptors are a subset of the available ones, but
> they are not recoverable just looking at the descriptor ring and
> knowing the vq status (that includes the wrap bit and ring idx). The
> guest and the device track them in their regular working, but if the
> hypervisor needs to override the device with no notice by the guest
> (OVS case I described above), it's not possible.

Yes.
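
To make the definition above concrete, here is a toy Python model. It
is a sketch under assumptions: the function names and the flat lists of
head ids are invented for the example and do not mirror any real ring
layout. It shows why the counters alone are not enough once the device
uses buffers out of order.

```python
def inflight_heads(fetched, used):
    """Descriptor heads the device fetched but has not marked used yet.

    `fetched` lists head ids in the order the driver made them available
    and the device became aware of them; `used` lists the ids already
    returned, in any order. The result keeps the availability order.
    """
    done = set(used)
    return [head for head in fetched if head not in done]


def inflight_from_counters(fetched, used_count):
    """What the counters alone suggest; valid only for in-order devices."""
    return fetched[used_count:]


# Four buffers fetched; the device completed heads 2 and 0 out of order.
fetched = [0, 1, 2, 3]
used = [2, 0]

actual = inflight_heads(fetched, used)                # [1, 3]
guessed = inflight_from_counters(fetched, len(used))  # [2, 3]
```

With in-order completion both functions agree, which matches the
remark that current qemu and vhost-net do not hit this case.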

>
> The split ring is the same, except that the descriptor length is recoverable
> from the descriptor ring, and we cannot talk in terms of overriding but
> of when the used index has surpassed the avail idx of old descriptors.
>
> The only thing left to say is that for some devices they should be
> reported in the same order they were made available by the driver, to
> preserve ordering. We can skip this for net queues. And I think we
> could delegate this requirement to be per device.
>
> I agree this cannot happen in the current qemu or vhost-net, because
> they use the indexes in order, but it's not something we should rely
> on. That's why I forced these to be flushed but, as discussed, this
> should not be a problem if they are treated as idempotent.

Ok.

>
> > >
> > > > >
> > > > > What I intended to say in the patch is that the device can choose to
> > > > > just return a device / communication error to indicate that the
> > > > > transaction has failed at device level, but related to virtio, the
> > > > > buffer would be marked as used.
> > > > >
> > > > > Maybe a good example of this is for the device to choose to return
> > > > > VIRTIO_BLK_S_IOERR, even if the transaction is still going in the
> > > > > block backend, but I don't know a lot of the blk device so I may be
> > > > > wrong. I guess that the guest cannot know about the value being
> > > > > written / read with that error code, and it is forced to re-read that.
> > > > > But the virtqueue will be in a good state, and the device can be reset
> > > > > and can recover its state. It's totally up to the device to choose to
> > > > > do so.
> > > >
> > > > I think not, if we tie STOP to some device errors that could be even
> > > > more complicated.
> > > >
> > >
> > > It's up to the device to implement that way but I understand your point.
> > >
> > > > >
> > > > > Virtqueue state is still needed, but not because the device chooses to
> > > > > return VIRTIO_BLK_S_IOERR, but because it needs a way to recover the
> > > > > status after the reset.
> > > > >
> > > > > > >
> > > > > > > Regarding the in-flight descriptor report, it's interesting but I
> > > > > > > cannot see a way where it does not complicate the solution a lot or
> > > > > > > adds new dependencies. I have the next thoughts:
> > > > > > > 1) If it works as inflight_fd, "a region of shared memory"
> > > > > > > 1.1) This region must be in the guest's AS so the device has access to
> > > > > > > it. This either invalidates the use of STOP from the driver point of
> > > > > > > view as "let me know where you are not going to modify the guest's
> > > > > > > memory anymore".
> > > > >
> > > > > Long shot here, but might this work with the combination of the
> > > > > balloon device? Making this far and far from the simplicity though...
> > > > >
> > > > > > > 1.2) This region is on the hypervisor's AS. If the device supports it,
> > > > > > > it is possible to implement the SVQ without the need of STOP bit. This
> > > > > > > is equivalent to "I have a PF that also supports VF dirty memory
> > > > > > > tracking".
> > > > > > > 2) If it works as the config space, where the driver can ask for its
> > > > > > > status, STOP means "STOP writing used and report via config space". No
> > > > > > > need for reset.
> > > > > > >
> > > > > > > Did you have something different in mind?
> > > > > >
> > > > > > Not sure, maybe config space is better. What I want is to make the
> > > > > > feature as small as possible but leaving spaces for future extension.
> > > > > >
> > > > > > E.g we start from the feature that is sufficient for networking
> > > > > > devices, (but doesn't prevent the future work to extend it to block
> > > > > > devices). I'm not familiar with the block device, but mandating the
> > > > > > > completion of in-flight descriptors may cause trouble, e.g. unexpected
> > > > > > downtime during live migration.
> > > > > >
> > > > >
> > > > > [1] I agree with that, but I feel that "device or virtio specific for
> > > > > reporting inflight descriptors" is way too broad to make it useful at
> > > > > the moment.
> > > >
> > > > Yes and that's not a must for an ethernet device.
> > > >
> > >
> > > This is not really true as how this proposal is specified, if we use
> > > it in the hypervisor. Just saving the virtqueue index is not enough if
> > > the device is not using the descriptors in order, since the available
> > > buffers may not be recoverable just looking at the guest memory.
> > >
> > > > In that regard, we must either flush them (as this proposal does, and
> > > with the unbound time problem), or use the inflight descriptors.
> >
> > > I meant that for an ethernet device, the device can simply complete all inflight
> > buffers before the STOP. Or anything I missed here?
> >
>
> That's what the proposal mandates :). It even allows the device to
> > mark them as completed (used) without sending / receiving them, since the
> network should be prepared for that.
>
> > But it's not true for other devices, and concerns have been raised previously.

Right, that's my understanding as well.

>
> > >
> > > > >
> > > > > Maybe the best thing to do is to put all the restrictions at this
> > > > > moment, and when we figure out a good format for the inflight, add
> > > > > "\item report inflight descriptors". Then, the device and the driver
> > > > > are free to not accept any combination. Does it make sense?
> > > >
> > > > Somehow, to start from a version that works for networking devices.
> > > > Where we know we don't need to care:
> > > >
> > > > 1) stop fail
> > > > 2) unbound time of stop, so we don't need an interrupt
> > > > 3) inflight buffers
> > > > 4) new facility querying device states (shadow CVQ can do this)
> > > >
> > > > This will ease both of us as I feel the discussion might not be easily
> > > > converged if we care about other types of devices with too many
> > > > things. With networking done, we can start to support block devices
> > > > and we can ask help from block gurus.
> > > >
> > >
> > > I'm fine with that except that we don't need 3.
> >
> > I still think it's not a must for ethernet device, (we don't have that
> > in all the current virtio-net backends, anything make hardware
> > different in this situation?).
> >
>
> As long as we mandate that the device flushes them, it is not a problem.

Just to make sure we are on the same page, what I meant is we need either:

1) mandate the "flush"

or

2) let the device choose to "flush", to report inflight descriptors,
or to use both

We know 1) works for net but probably not for others, while 2) seems to
work for most devices (e.g. block). That's why I think 2) is better.

>
> It's convenient to have that way of reporting, since we are talking
> about general virtio queues here. I mean, net can survive it because
> it's the way the guest uses the net, but we don't allow because
> virtqueue works differently.
>
> > But I'm ok if we start to propose the inflight stuff, that can make
> > a complete virtqueue state. I wonder maybe it's better to use
> > device-area for those instead of transport specific way to access
> > them.
> >
>
> I'd say it's the best bet.

Yes.

>
> > >
> > > > >
> > > > > > >
> > > > > > > > > +\item Mark them as done even if they are not, if the kind of device can
> > > > > > > > > +assume to lose them.
> > > > > > > >
> > > > > > > > I think "make buffer used" is better than "mark them as done". And we
> > > > > > > > need a accurate definition on who is "them".
> > > > > > > >
> > > > > > >
> > > > > > > All items include other operations, like the ones that the device must
> > > > > > > do internally to process the control virtqueue. But I cannot find an
> > > > > > > example where telling the driver they are done when it's not is valid
> > > > > > > for this particular item.
> > > > > > >
> > > > > > > But I agree it needs better wording.
> > > > > > >
> > > > > > > And I will s/them/operations/. for the next one.
> > > > > > >
> > > > > > > > > +\end{itemize}
> > > > > > > > > +
> > > > > > > > > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > > > > > > > > +a guest's request,
> > > > > > > >
> > > > > > > > It's not clear what did "a guest's request" means.
> > > > > > > >
> > > > > > >
> > > > > > > Right. Would "operation" fit better here?
> > > > > >
> > > > > > Still unclear, I guess this sentence tries to define when the device
> > > > > > can fail the stop?
> > > > > >
> > > > >
> > > > > Not really, my intentions were to add a MUST operation for when the
> > > > > device fails. The first is needed for the second though, so maybe we
> > > > > can rephrase.
> > > > >
> > > > > If we agree that a device can fail the stop, I think we should not
> > > > > restrict the circumstances where the device can fail. "If the device
> > > > > can find external circumstances where it cannot satisfy STOP must not
> > > > > offer STOP feature" works for me too, actually.
> > > >
> > > > I'd leave the STOP_FAILED for future investigation.
> > > >
> > > > >
> > > > > > >
> > > > > > > > > the device MUST set the STOP_FAILED bit for the guest to
> > > > > > > > > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > > > > > > > > +clears STOP_FAILED.
> > > > > > > > > +
> > > > > > > > > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > > > > > > > > +and the device can pause its operation, the device MUST set the descriptors
> > > > > > > > > +that it has done with them as used before exposing the STOP status bit as set.
> > > > > > > > > +
> > > > > > > > > +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> > > > > > > > > +after exposing the STOP bit set:
> > > > > > > > > +\begin{itemize}
> > > > > > > > > +\item Read updates on the descriptor or driver area, or consume more buffers.
> > > > > > > > > +\item Send any used buffer notifications to the driver.
> > > > > > > > > +\end{itemize}
> > > > > > > > > +
> > > > > > > > > +The device MUST send a configuration space change right after exposing the STOP
> > > > > > > > > +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> > > > > > > > > +send another configuration space change notification to the driver afterwards
> > > > > > > > > +until the guest clears it.
> > > > > > > > > +
> > > > > > > > > +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> > > > > > > > > +the device MUST resume operation when the driver clears the STOP bit. The
> > > > > > > > > +device MUST continue reading available descriptors as if an available buffer
> > > > > > > > > +notification has reach it, starting from the last descriptor it marked as used,
> > > > > > > >
> > > > > > > > So I still tend to define virtqueue state as basic facility before
> > > > > > > > defining STOP. It can makes thing easier.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, coming back to that approach can simplify the whole proposal.
> > > > > > >
> > > > > > > > > +and continue the regular operation after that. The device MUST read again
> > > > > > > > > +descriptor and driver area beyond the last descriptor it marked as used when it
> > > > > > > > > +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> > > > > > > > > +if for some reason it cannot continue.
> > > > > > > > > +
> > > > > > > > >  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> > > > > > > > >  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
> > > > > > > > >  MUST send a device configuration change notification to the driver.
> > > > > > > > > @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > > > > > > > >    transport specific.
> > > > > > > > >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> > > > > > > > >
> > > > > > > > > +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> > > > > > > > > +  stop the device.
> > > > > > > > > +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> > > > > > > > > +
> > > > > > > > >  \end{description}
> > > > > > > >
> > > > > > > > So I think the patch complicate thing is various ways:
> > > > > > > >
> > > > > > > > 1) STOP_FAILED status bit, which seems unnecessary or even duplicated
> > > > > > > > with NEEDS_RESET
> > > > > > > > 2) configuration change interrupt, looks conflict with the semantic of STOP
> > > > > > >
> > > > > > > I'm not sure about those two, I find we will have devices with unbound
> > > > > > > stop time where both can be useful if we agree on making this a
> > > > > > > general facility.
> > > > > >
> > > > > > If the unbound stop time is the only worry, the way to report inflight
> > > > > > descriptors looks like a better solution.
> > > > >
> > > > > I'm not sure if that's the only condition under which a device can
> > > > > fail to stop, but if we agree on that we could prepare a format for
> > > > > block devices to report them, for example. They are needed somehow in
> > > > > the networking case of packed if buffers are used out of order.
> > > >
> > > > It can, but let's leave it for the future now.
> > > >
> > > > Actually, for inflight buffers, a better idea is to support it at the
> > > > virtqueue level without extra data structure. But it's for sure not a
> > > > short term solution.
> > > >
> > >
> > > Can you expand on this? Why do you think it is not a short term solution?
> >
> > For the inflight buffers, I guess we need add more data structure in
> > the device area. That's fine. But I wonder if we can re-design the
> > virtqueue carefully then the inflight buffers could be deduced from
> > the vring. It requires a re-design of the current vring which is not
> > easy.
> >
>
> I've never thought that, but it's hard to pursue both vring
> compaction and to *interleave* data not useful in the normal
> communication.
>
> I'd say that allocate a region and let driver (including VMM) know
> when it can trust it is the way to go. For SW device implementations
> that can abort in any moment because of a bug or whatever, they can
> use the area as if it were theirs, and keep them properly updated. For
> HW, it's enough if they populate at STOP. Memory is allocated upfront
> but there is no need to use transport bandwidth to keep it updated.
>
> If we reuse that region, there is no need to make a separate config
> call to get or set vq state. I find the config space call more
> straightforward, but this covers way more use cases.
>
> Thoughts?

The advantage of config space is that the data is provided on demand.
E.g the vq state could be also useful for debugging (e.g via ethtool
and other tools).

If we use virtqueue, it will force the device to update the data via
DMA or make it only available after STOP. So I think I agree, using
config and device memory is probably better.

Thanks

>
> Thanks!
>
> > Thanks
> >
> > >
> > > > >
> > > > > > And STOP_FAILED is actually
> > > > > > not accurate since it means the stop is not finished in bound time
> > > > > > (but we need to define how long should be a bound time?)
> > > > > >
> > > > > > > Resetting the whole device because of this leaves
> > > > > > > the driver with no possibility of knowing the state of the sent
> > > > > > > descriptors.
> > > > > > >
> > > > > > > Of course, if these use cases are not interesting, it's easier to
> > > > > > > leave them out for sure.
> > > > > > >
> > > > > > > > 3) status bit clearing (resuming), a functional duplication with RESET
> > > > > > > > + DRIVER_OK
> > > > > > > >
> > > > > > >
> > > > > > > I agree it can be obtained with a whole reset, so it can be out and
> > > > > > > leave it for the future if needed. However it seems overkill if we
> > > > > > > just want to rewind some descriptors back, and there is no standard
> > > > > > > way to recover the device status beyond vq_state.
> > > > > >
> > > > > > It's more about the minimal self-contained set of the new features. If
> > > > > > it's just rewind, device or virtqueue reset is sufficient.
> > > > >
> > > > > I'm not sure if that is true for all devices with the features the
> > > > > standard offers at the moment, but it might be right for serial.
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > > If we want
> > > > > > to obtain the state, virtqueue state is a must and with virtqueue
> > > > > > state, resuming (clearing STOP) is not a must.
> > > > > >
> > > > >
> > > > > Right.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > > I think we'd better to stick to the minimal set of the function to
> > > > > > > > reduce the complexity: virtqueue state + STOP bit (without clearing
> > > > > > > > and no config interrupt).
> > > > > > > >
> > > > > > >
> > > > > > > [1] https://lists.oasis-open.org/archives/virtio-comment/202107/msg00043.html
> > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > >
> > > > > > > > >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > > > > > > > > --
> > > > > > > > > 2.27.0
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-11 18:58 [PATCH v3 0/2] virtio: introduce STOP status bit Eugenio Pérez
  2021-11-11 18:58 ` [PATCH v3 1/2] content: Explain better the status clearing bits Eugenio Pérez
  2021-11-11 18:58 ` [PATCH v3 2/2] virtio: introduce STOP status bit Eugenio Pérez
@ 2021-11-18 14:45 ` Stefan Hajnoczi
  2021-11-18 16:49   ` Eugenio Perez Martin
  2 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2021-11-18 14:45 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: virtio-dev, virtio-comment, mst, jasowang, amikheev, shahafs,
	oren, pasic, cohuck, bodong, Dr . David Alan Gilbert, parav,
	mgurtovoy

On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> This patch introduces a new status bit STOP. This can be used by the
> driver to stop the device in order to safely fetch used descriptors
> status, making sure the device will not fetch new available ones.
> 
> Its main use case is live migration, although it has other orthogonal
> use cases. It can be used to safely discard requests that have not been
> used: in other words, to rewind available descriptors.

This sounds non-trivial and would require more explanation.

> Stopping the device in the live migration context is done by per-device
> operations in vhost backends, but the introduction of STOP as a basic
> virtio facility comes with advantages:
> * All the device virtio-specific state is summarized in a single entity,
>   making easier to reason about it.

This point isn't clear to me. I think it's saying that using STOP
somehow unifies things compared to the way that vhost devices are
stopped today. Given that vhost already syncs state back to the VMM's
emulated VIRTIO device, I'm not sure how STOP is different.

> * VMM does not need to implement device specific operations in the
>   driver part.

What is the "driver part"?

> * Work out of the box for devices that use pure virtio backends in some
>   part of the device emulation chain (virtio_pci_vdpa or virtio_vdpa),
>   in any transport the device can use.

?

> * It's totally self-contained, solving the nested virtualization case
>   straightforwardly.

Makes sense.

> To fully understand its position in the live migration case, it's needed
> to note that the VMM acts as a part (or the whole) of the virtio device
> from the guest point of view, and it can act as a part of the driver
> from an external virtio device point of view. This is already the case
> when using vhost-net, for example, where VMM exposes a combination of
> backend and VMM features, and can mask them if needed.
> 
> To migrate an external device the VMM needs to retrieve its (guest
> visible) status and make sure the device does not modify it or

"status" means device state here, not VIRTIO Status Register? (There
is even a third use of "status" in the first paragraph related to used
descriptors. I found this confusing.)

> communicate with the guest anymore. The STOP status bit achieves the
> last part, and even the first one in case of a pure stateless device
> using the split vring.
> 
> In its simpler way of working, the VMM masks the VIRTIO_F_STOP feature
> to the guest, and also masks the STOP and STOP_FAILED status bit. This
> way the VMM can stop and resume operation unilaterally, totally
> transparent for the latter.

"latter" means the guest here?


* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-11 18:58 ` [PATCH v3 2/2] virtio: introduce STOP status bit Eugenio Pérez
  2021-11-12  4:18   ` Jason Wang
@ 2021-11-18 15:59   ` Stefan Hajnoczi
  2021-11-18 19:58     ` Eugenio Perez Martin
  1 sibling, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2021-11-18 15:59 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: virtio-dev, virtio-comment, mst, jasowang, amikheev, shahafs,
	oren, pasic, cohuck, bodong, Dr . David Alan Gilbert, parav,
	mgurtovoy

On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> From: Jason Wang <jasowang@redhat.com>
> 
> This patch introduces a new status bit STOP. This can be used by the
> driver to stop the device in order to safely fetch used descriptors
> status, making sure the device will not fetch new available ones.
> 
> Its main use case is live migration, although it has other orthogonal
> use cases. It can be used to safely discard requests that have not been
> used: in other words, to rewind available descriptors.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 83 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index 2aa3006..9ed0d09 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
>    drive the device.
>  
> +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> +  device has been stopped by the driver.

Who controls the STOP bit? If I understand correctly, the driver writes
STOP to the Status Register but the device will not report the STOP bit
until the device has fully stopped?

> +  This status bit is different
> +  from the reset since the device state is preserved.

"the reset" -> "resetting the device"

> +
> +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> +  device could not stop the STOP request.

"could not stop the STOP request" is weird, maybe "could not complete
the STOP request" or just "could not stop".

> +
>  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>    an error from which it can't recover.
>  \end{description}
> @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  recover by issuing a reset.
>  \end{note}
>  
> +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set or clear STOP if
> +DRIVER_OK is not set.

Small tweak: "if DRIVER_OK is not set" -> "when DRIVER_OK is not set".
That makes the sentence a little easier to read (for me, at least).

> +
> +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> +acknowledges the new paused status setting the first, or the failure setting
> +the last.

This sentence is confusing. "paused status" introduces an alias for the
STOP functionality that is being described. Don't use multiple names for
the same thing. The "first"/"last" is unnecessary indirection, just use
"STOP" and "STOP_FAILED" so the reader doesn't have to figure out what
you meant. I suggest:

  "The device indicates that it has stopped by reporting the STOP bit or
  indicates failure by reporting the STOP_FAILED bit in the device
  status field."
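
For illustration, the write-then-re-read handshake discussed above could
look like the sketch below. It is not taken from the patch: FakeDevice,
stop_device and the delay / max_reads parameters are hypothetical names
made up for this example, and only the bit values (DRIVER_OK = 4,
STOP = 16, STOP_FAILED = 32) come from the quoted text.

```python
# Status bit values as listed in the patch under review.
DRIVER_OK = 4
STOP = 16
STOP_FAILED = 32


class FakeDevice:
    """Hypothetical device model: reports STOP only after `delay` status reads."""

    def __init__(self, delay=2, fail=False):
        self.status = DRIVER_OK
        self._pending = None  # reads left before the stop request completes
        self._delay = delay
        self._fail = fail

    def write_status(self, value):
        # The driver asks to stop; the device does not report STOP yet.
        if value & STOP and not self.status & (STOP | STOP_FAILED):
            self._pending = self._delay

    def read_status(self):
        if self._pending is not None:
            if self._pending == 0:
                # Stop attempt finished: report success or failure.
                self.status |= STOP_FAILED if self._fail else STOP
                self._pending = None
            else:
                self._pending -= 1
        return self.status


def stop_device(dev, max_reads=10):
    """Write STOP, then re-read the status until STOP or STOP_FAILED shows up."""
    status = dev.read_status()
    if not status & DRIVER_OK:
        raise RuntimeError("STOP must not be written when DRIVER_OK is not set")
    dev.write_status(status | STOP)
    for _ in range(max_reads):
        status = dev.read_status()
        if status & STOP_FAILED:
            return False
        if status & STOP:
            return True
    raise TimeoutError("device did not acknowledge STOP")
```

A real driver would presumably wait for the configuration change
notification instead of polling; the loop only stands in for "the change
may not be instantaneous".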

> +Since this change may not be instantaneous, the driver MAY wait for
> +the configuration change notification that the device must send after the

"must" is lowercase. If this is a device normative section then it
should be "MUST". Otherwise I suggest removing the "must": "that the
device sends after ...".

> +change. If the device sets the STOP_FAILED bit, the driver MUST clear it before
> +try new STOP attempts.

s/try/trying/

> +
> +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,

"its stop" should probably be "it has stopped", but a more explicit way
to explain this is:

"If VIRTIO_F_STOP has been negotiated and the driver reads the STOP bit
from the device status field,"

> +the driver MAY change avail_idx in the case of split virtqueue, but the new
> +avail_idx MUST be within used_idx and used_idx plus virtqueue size.

I'm trying to understand how this would work. Available buffers may be
consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
avail ring could contain something like:

  avail.ring = [Used, Not used, Used, Not used, ...]
                                                ^--- avail.idx

There are num_not_used = avail.idx - used.idx requests that are "Not
used" in avail.ring.

Does this mean the driver can rewind avail.idx by counting the number of
"Not used" buffers and skipping "Used" buffers until it reaches
num_not_used "Not used" buffers?

I think there is a known issue with this approach:

Imagine a vring with 4 elements:

  avail.ring = [0,        1,    2,    3   ]
                Not used, used, used, used
                                           ^--- avail.idx

Since the device has used 3 buffers the driver now has space to make
more buffers available. avail.idx wraps back to the start of the ring
and the driver overwrites the first element ("Not used"):

  avail.ring = [1,        N/A,  N/A,  N/A]
                Not used, N/A,  N/A,  N/A
                          ^--- avail.idx

Since vring descriptor 0 is still in use the driver chose descriptor 1
for the new available buffer.

Now we stop the device, knowing there are two buffers available that
have not been used. But avail.ring[] actually only contains the new
buffer (vring descriptor 1) that we made available because we overwrote
the old avail.ring[] element (vring descriptor 0).

What now? Where does the device reset its internal avail_idx to? Does
the device remember the available buffer with vring descriptor 0 or do
we need to add it again?

I'm not sure if this idea works even with split virtqueues.
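
The wrap-around scenario above can be replayed with a small model. This
is only a sketch under assumptions: SplitRing, wrap_scenario and
heads_visible_in_ring are hypothetical names, the model tracks nothing
but avail.ring[], avail.idx and used.idx, and the "truth" set of
not-yet-used heads is bookkeeping that a real ring does not expose.

```python
class SplitRing:
    """Toy model of the driver-visible indexes of a split virtqueue."""

    def __init__(self, size):
        self.size = size
        self.avail_ring = [None] * size
        self.avail_idx = 0      # free-running 16-bit index; slot = idx % size
        self.used_idx = 0
        self.in_flight = set()  # heads made available but not yet used

    def make_available(self, head):
        self.avail_ring[self.avail_idx % self.size] = head
        self.avail_idx = (self.avail_idx + 1) & 0xFFFF
        self.in_flight.add(head)

    def mark_used(self, head):
        self.in_flight.remove(head)
        self.used_idx = (self.used_idx + 1) & 0xFFFF

    def heads_visible_in_ring(self):
        """Heads found by reading back the last (avail_idx - used_idx) slots."""
        pending = (self.avail_idx - self.used_idx) & 0xFFFF
        return [self.avail_ring[(self.avail_idx - 1 - i) % self.size]
                for i in range(pending)]


def wrap_scenario():
    ring = SplitRing(4)
    for head in (0, 1, 2, 3):   # driver fills the ring
        ring.make_available(head)
    for head in (1, 2, 3):      # device uses all but descriptor 0
        ring.mark_used(head)
    ring.make_available(1)      # driver reuses descriptor 1, overwriting slot 0
    return ring


ring = wrap_scenario()
print("avail.ring      :", ring.avail_ring)
print("not used (truth):", sorted(ring.in_flight))
print("found in ring   :", ring.heads_visible_in_ring())
```

In this model descriptor 0 is still in flight but no longer appears
anywhere in avail.ring[], so walking the ring backwards from avail.idx
cannot recover it.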

> +
> +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,

"its stop" -> "it has stopped".

> +the driver MAY change any descriptor.

Not just descriptors, but also avail.ring[] and avail.idx?

> +
> +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,

"its" -> "it has"

> +the driver can resume it clearing the STOP status bit. It MUST re-read the

"it clearing" -> "it by clearing"

> +device status to ensure the STOP bit is clear after the write. The device
> +acknowledges the new status clearing it. Since this change may not be

"new status clearing" -> "new status by clearing"

> +instantaneous, the driver MAY wait for the configuration change notification
> +that the device must send after the change.

"must" -> "MUST" or "that the device sends"

> +
>  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>  
>  The device MUST NOT consume buffers or send any used buffer
>  notifications to the driver before DRIVER_OK.
>  
> +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> +or clear of STOP.
> +
> +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> +operations after the driver writes STOP.  Depending on the device, it can do it
> +in many ways as long as the driver can recover its normal operation if it
> +resumes the device without the need of resetting it:
> +\begin{itemize}
> +\item Drain and wait for the completion of all pending requests until a
> +convenient avail descriptor. Ignore any other posterior descriptor.
> +\item Return a device-specific failure for these descriptors, so the driver
> +can choose to retry or to cancel them.

This means each device type in the spec needs to define STOP semantics
so drivers know what to expect. Not sure it's feasible to do this. If
you can drop this point from the patch, I would, because this is going
to be hard to get right in implementations and a pain to specify
properly.

> +\item Mark them as done even if they are not, if the kind of device can
> +assume to lose them.
> +\end{itemize}
> +
> +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> +a guest's request, the device MUST set the STOP_FAILED bit for the guest to
> +read it. The device MUST ignore new writes to the STOP bit until the guest
> +clears STOP_FAILED.

s/guest/driver/ here and elsewhere in this patch

> +
> +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> +and the device can pause its operation, the device MUST set the descriptors
> +that it has done with them as used before exposing the STOP status bit as set.

This sentence is a bit confusing. What does "can" mean here? Can the
device negotiate VIRTIO_F_STOP but always set STOP_FAILED because it
cannot actually stop?

The second part of the sentence is unclear. Does it mean that the device
must complete all in-flight buffers? When VIRTIO_F_IN_ORDER is not
negotiated the device doesn't have to process available buffers in
order, so that means the device may leave a mix of available and used
buffers rather than a sequence of used buffers followed by available
buffers?

> +
> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> +after exposing the STOP bit set:
> +\begin{itemize}
> +\item Read updates on the descriptor or driver area, or consume more buffers.
> +\item Send any used buffer notifications to the driver.
> +\end{itemize}
> +
> +The device MUST send a configuration space change right after exposing the STOP
> +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> +send another configuration space change notification to the driver afterwards
> +until the guest clears it.
> +
> +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> +the device MUST resume operation when the driver clears the STOP bit. The
> +device MUST continue reading available descriptors as if an available buffer
> +notification has reach it, starting from the last descriptor it marked as used,

"has reach it" -> "has been received"?

"starting from the last descriptor it marked as used" sounds wrong. I
guess this assumes VIRTIO_F_IN_ORDER was negotiated?

> +and continue the regular operation after that. The device MUST read again
> +descriptor and driver area beyond the last descriptor it marked as used when it
> +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> +if for some reason it cannot continue.
> +
>  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
>  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>  MUST send a device configuration change notification to the driver.
> @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>    transport specific.
>    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
>  
> +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> +  stop the device.
> +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> +
>  \end{description}
>  
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> -- 
> 2.27.0
> 


* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-18 14:45 ` [PATCH v3 0/2] " Stefan Hajnoczi
@ 2021-11-18 16:49   ` Eugenio Perez Martin
  2021-11-23 11:33     ` Stefan Hajnoczi
  0 siblings, 1 reply; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-18 16:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> > This patch introduces a new status bit STOP. This can be used by the
> > driver to stop the device in order to safely fetch used descriptors
> > status, making sure the device will not fetch new available ones.
> >
> > Its main use case is live migration, although it has other orthogonal
> > use cases. It can be used to safely discard requests that have not been
> > used: in other words, to rewind available descriptors.
>
> This sounds non-trivial and would require more explanation.
>

You are right and it's not well explained here, I will try to develop
it better for the next version. It's in the cover letter explaining
one use case proposed by MST [1] and the answer by Jason [2].

When the VQ is stopped, the device is forced to flush all the
descriptors (in this version) and, in the case of split, the used
index. With that information, the driver can modify any available
descriptor that has not been used by the device, and rewind the
virtqueue available index to an extent.

If we add Jason's virtqueue state (first patch of the previous version),
which I need to recover in later versions, the driver can freely move
the available and used indexes prior to resetting the queue.

The solution proposed in the comments for a device that cannot flush
its descriptors in a timely manner is to provide it with a way to report
in-flight descriptors. As a reference, vhost-user has done it before,
but with a memory region shared through a file descriptor [3]. If we add
something similar, the driver still knows which descriptors it is able
to rewind.

Does that explanation make the driver rewind use case more clear?
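
As a rough sketch of the bound in the patch text ("the new avail_idx
MUST be within used_idx and used_idx plus virtqueue size"), assuming
both ends of the range are inclusive and taking into account that the
split ring indexes are free-running 16-bit counters. The function names
rewind_is_valid and rewind are hypothetical and not part of the
proposal.

```python
U16 = 0xFFFF  # avail_idx and used_idx wrap at 2**16


def rewind_is_valid(new_avail_idx, used_idx, queue_size):
    """Check used_idx <= new_avail_idx <= used_idx + queue_size, modulo 2**16."""
    return ((new_avail_idx - used_idx) & U16) <= queue_size


def rewind(avail_idx, used_idx, queue_size, count):
    """Return avail_idx moved back by `count` entries, or raise if out of range."""
    new_idx = (avail_idx - count) & U16
    pending = (avail_idx - used_idx) & U16  # entries available but not used
    if count > pending or not rewind_is_valid(new_idx, used_idx, queue_size):
        raise ValueError("cannot rewind past used_idx")
    return new_idx
```

For example, with used_idx = 65534 and a queue size of 4, an avail_idx
of 2 can be rewound by up to 4 entries; the subtraction has to be done
modulo 2**16 or the comparison fails across the wrap.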

> > Stopping the device in the live migration context is done by per-device
> > operations in vhost backends, but the introduction of STOP as a basic
> > virtio facility comes with advantages:
> > * All the device virtio-specific state is summarized in a single entity,
> >   making easier to reason about it.
>
> This point isn't clear to me. I think it's saying that using STOP
> somehow unifies things compared to the way that vhost devices are
> stopped today. Given that vhost already syncs state back to the VMM's
> emulated VIRTIO device, I'm not sure how STOP is different.
>

It also achieves that, but it's more related to the fact that the
current way of getting the index through vhost-net, vhost-user, ... is
not reusable by other methods of exposing the device to the VMM. Not
having to develop a different way to stop and get the status of each
kind of device helps other device implementations outside the VMM.

What you mention has more to do with the next bullet point.

> > * VMM does not need to implement device specific operations in the
> >   driver part.
>
> What is the "driver part"?
>

The part of qemu that talks to devices using virtio through (for
example) vhost messages: set features, get features, etc.

Each vhost device has its own method to stop the device. In networking
it is setting the backend file descriptor to -1, and others have their
own way.

Using the status field allows devices outside the VMM to unify that part too.

Maybe this one and the above would be clearer if I use vhost as
examples. I will rewrite them anyhow, thanks!

> > * Work out of the box for devices that use pure virtio backends in some
> >   part of the device emulation chain (virtio_pci_vdpa or virtio_vdpa),
> >   in any transport the device can use.
>
> ?

vp_vdpa exposes a standard virtio device as a vdpa one. This
implies that each of the vhost commands sent to vhost-vdpa needs to be
converted to a standard virtio request if it needs to reach the actual
device. But there is currently no way to stop the device or retrieve
its state using just virtio.

Because of that, it's also usable by a pure virtio device, like in the
case of vdpa devices using virtio_vdpa or other devices that simply
expose themselves as virtio ones with no further facilities.

It is also not restricted by the transport you use to expose the
virtio device: PCI, MMIO, etc., since you only need to perform
operations already defined by any usable device (set and get the
status).

>
> > * It's totally self-contained, solving the nested virtualization case
> >   straightforwardly.
>
> Make sense.
>
> > To fully understand its position in the live migration case, it's needed
> > to note that the VMM acts as a part (or the whole) of the virtio device
> > from the guest point of view, and it can act as a part of the driver
> > from an external virtio device point of view. This is already the case
> > when using vhost-net, for example, where VMM exposes a combination of
> > backend and VMM features, and can mask them if needed.
> >
> > To migrate an external device the VMM needs to retrieve its (guest
> > visible) status and make sure the device does not modify it or
>
> "status" means device state here, not VIRTIO Status Register? (There
> even a third use of "status" in the first paragraph related to used
> descriptors. I found this confusing.)
>

Yes, this one should be "state", as used in previous series.

Status in the first paragraph is totally redundant, I will delete it
for the next revision.

> > communicate with the guest anymore. The STOP status bit achieves the
> > last part, and even the first one in case of a pure stateless device
> > using the split vring.
> >
> > In its simpler way of working, the VMM masks the VIRTIO_F_STOP feature
> > to the guest, and also masks the STOP and STOP_FAILED status bit. This
> > way the VMM can stop and resume operation unilaterally, totally
> > transparent for the latter.
>
> "latter" means the guest here?

Yes. I will rephrase that too.

Thanks!

[1] https://lists.oasis-open.org/archives/virtio-comment/202107/msg00043.html
[2] https://lists.oasis-open.org/archives/virtio-comment/202107/msg00047.html
[3] https://qemu.readthedocs.io/en/latest/interop/vhost-user.html#inflight-i-o-tracking



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-18 15:59   ` Stefan Hajnoczi
@ 2021-11-18 19:58     ` Eugenio Perez Martin
  2021-11-23 12:16       ` Stefan Hajnoczi
  0 siblings, 1 reply; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-18 19:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > From: Jason Wang <jasowang@redhat.com>
> >
> > This patch introduces a new status bit STOP. This can be used by the
> > driver to stop the device in order to safely fetch used descriptors
> > status, making sure the device will not fetch new available ones.
> >
> > Its main use case is live migration, although it has other orthogonal
> > use cases. It can be used to safely discard requests that have not been
> > used: in other words, to rewind available descriptors.
> >
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 83 insertions(+)
> >
> > diff --git a/content.tex b/content.tex
> > index 2aa3006..9ed0d09 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> >    drive the device.
> >
> > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > +  device has been stopped by the driver.
>
> Who controls the STOP bit? If I understand correctly, the driver writes
> STOP to the Status Register but the device will not report the STOP bit
> until the device has fully stopped?
>

Yes, but adding the driver's point of view here seems too much.
However, DRIVER_OK is already explained from the point of view of the
driver, so I should try to accommodate that here too.
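
As an illustration of the handshake discussed above (the driver writes
STOP, and the device reports STOP only once it has fully stopped), here
is a minimal Python sketch. The status values are the ones listed in
the patch. ToyDevice, driver_stop and the rule that each status read
retires one in-flight request are assumptions made only for this
example; they are not part of the proposal or of any real
implementation.

```python
# Status bit values as listed in the patch under discussion. STOP and
# STOP_FAILED are proposed bits, not part of the published spec.
DRIVER_OK = 4
STOP = 16
STOP_FAILED = 32


class ToyDevice:
    """Hypothetical device: reports STOP only after draining in-flight work."""

    def __init__(self, in_flight):
        self.status = DRIVER_OK
        self.in_flight = in_flight
        self.stop_requested = False

    def write_status(self, value):
        # A STOP write is only honoured while DRIVER_OK is set.
        if value & STOP and self.status & DRIVER_OK:
            self.stop_requested = True

    def read_status(self):
        # Toy rule: every read lets the device retire one in-flight request.
        if self.stop_requested:
            if self.in_flight > 0:
                self.in_flight -= 1
            else:
                self.status |= STOP
        return self.status


def driver_stop(dev, max_polls=100):
    """Write STOP, then re-read the status until the device reports it."""
    dev.write_status(dev.read_status() | STOP)
    for polls in range(1, max_polls + 1):
        status = dev.read_status()
        if status & STOP_FAILED:
            return False, polls
        if status & STOP:
            return True, polls
    return False, max_polls
```

With two requests in flight, driver_stop(ToyDevice(2)) returns
(True, 3): two reads drain the requests and the third one sees STOP. A
real driver could wait for the configuration change notification
instead of polling.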

> > +  This status bit is different
> > +  from the reset since the device state is preserved.
>
> "the reset" -> "resetting the device"
>

I will rephrase that.

> > +
> > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > +  device could not stop the STOP request.
>
> "could not stop the STOP request" is weird, maybe "could not complete
> the STOP request" or just "could not stop".
>

I think STOP_FAILED should be left out of the proposal for the next
revision actually, but you are right.

> > +
> >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> >    an error from which it can't recover.
> >  \end{description}
> > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> >  recover by issuing a reset.
> >  \end{note}
> >
> > +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set or clear STOP if
> > +DRIVER_OK is not set.
>
> Small tweak: "if DRIVER_OK is not set" -> "when DRIVER_OK is not set".
> That makes the sentence a little easier to read (for me, at least).
>

Ok I will change that.

> > +
> > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > +acknowledges the new paused status setting the first, or the failure setting
> > +the last.
>
> This sentence is confusing. "paused status" introduces an alias for the
> STOP functionality that is being described. Don't use multiple names for
> the same thing. The "first"/"last" is unnecessary indirection, just use
> "STOP" and "STOP_FAILED" so the reader doesn't have to figure out what
> you meant. I suggest:
>
>   "The device indicates that it has stopped by reporting the STOP bit or
>   indicates failure by reporting the STOP_FAILED bit in the device
>   status field."
>

I agree, I will change that.

To be sure, you mean to replace just the second part of the paragraph, right?

> > +Since this change may not be instantaneous, the driver MAY wait for
> > +the configuration change notification that the device must send after the
>
> "must" is lowercase. If this is a device normative section then it
> should be "MUST". Otherwise I suggest removing the "must": "that the
> device sends after ...".
>

It's in lowercase because this is the driver normative section, not
the device one. But the "after" alternative sounds perfect to me.

> > +change. If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > +try new STOP attempts.
>
> s/try/trying/
>

I'll probably discard the STOP_FAILED part for the next revisions, but
I will take this into account if I bring it back, thanks!

> > +
> > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
>
> "its stop" should probably be "it has stopped", but a more explicit way
> to explain this is:
>
> "If VIRTIO_F_STOP has been negotiated and the driver reads the STOP bit
> from the device status field,"
>

I will change that.

> > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
>
> I'm trying to understand how this would work. Available buffers may be
> consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> avail ring could contain something like:
>
>   avail.ring = [Used, Not used, Used, Not used, ...]
>                                                 ^--- avail.idx
>
> There are num_not_used = avail.idx - used.idx requests that are "Not
> used" in avail.ring.
>
> Does this mean the driver can rewind avail.idx by counting the number of
> "Not used" buffers and skipping "Used" buffers until it reaches
> num_not_used "Not used" buffers?
>

I'm going to also drop the "resume" part for the next version because
it adds extra complexity not actually needed, and it can be achieved
with a full reset in a simpler way.

But I'll explain it below with your examples. Long story short, the
driver can only rewind the available descriptors that are still in the
available ring, and the device must flush the ones that cannot be
recovered from the ring.
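
For reference, the bound quoted above (the new avail_idx within
used_idx and used_idx plus the virtqueue size) can be sketched as
follows. Split ring indices are free-running 16-bit counters, so the
comparison has to be done modulo 2**16. The function name is
hypothetical, and treating both ends of the range as inclusive is an
assumption, since the patch text only says "within".

```python
def avail_idx_in_window(new_avail_idx, used_idx, queue_size):
    """Check new_avail_idx against [used_idx, used_idx + queue_size], mod 2**16."""
    # Distance from used_idx to the candidate index, with 16-bit wraparound.
    distance = (new_avail_idx - used_idx) & 0xFFFF
    return distance <= queue_size
```

For example, with used_idx = 65534 and a queue of 4 entries, a new
avail_idx of 1 is accepted (distance 3 after wrapping), while 65533 is
rejected because it lies behind used_idx.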

> I think there is a known issue with this approach:
>
> Imagine a vring with 4 elements:
>
>   avail.ring = [0,        1,    2,    3   ]
>                 Not used, used, used, used
>                                            ^--- avail.idx
>
> Since the device has used 3 buffers the driver now has space to make
> more buffers available. avail.idx wraps back to the start of the ring
> and the driver overwrites the first element ("Not used"):
>
>   avail.ring = [1,        N/A,  N/A,  N/A]
>                 Not used, N/A,  N/A,  N/A
>                          ^--- avail.idx
>
> Since vring descriptor 0 is still in use the driver chose descriptor 1
> for the new available buffer.
>
> Now we stop the device, knowing there are two buffers available that
> have not been used. But avail.ring[] actually only contains the new
> buffer (vring descriptor 1) that we made available because we overwrote
> the old avail.ring[] element (vring descriptor 0).
>
> What now? Where does the device reset its internal avail_idx to?

To be on the same page, in qemu the device maintains two "internal
avail idx": shadow_avail_idx (last seen in the available ring, could
be 4 in this case) and last_avail_idx (next descriptor to fetch from
avail, 2). The device must forget shadow_avail_idx and flush the
descriptors that cannot be recovered (0). So last_avail_idx is now 3.
Now it can stop.

The proposal allows the device to fail descriptor 0 in a
device-specific way, but I now think it was a bad choice.

The driver cannot move the device's last_avail_idx in this operation:
the device is simply forced to flush used ones to the used ring, or to
the descriptor ring in the packed vq case. So the device's internal
avail_idx == used_idx == 3. When the device resumes, it's still 3.

The device must keep its last_avail_idx through a stop and resume cycle.
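
The wraparound in the quoted 4-element example can be replayed with a
small Python model. This is only a sketch of the ring bookkeeping in
that example, not qemu code: the function name and the "naive window"
(the ring slots between used_idx and avail_idx) are assumptions chosen
to show why descriptor 0 cannot be recovered from the ring.

```python
def wraparound_example():
    """Replay the 4-entry split ring example and report what the ring still holds."""
    queue_size = 4

    # The driver makes descriptors 0..3 available, one ring slot each.
    avail_ring = [0, 1, 2, 3]
    avail_idx = 4

    # The device uses descriptors 1, 2 and 3 out of order; 0 stays in flight.
    used_idx = 3
    in_flight = {0}

    # The driver makes a new buffer available with descriptor 1. The slot
    # index wraps to 0 and overwrites the only entry that named descriptor 0.
    avail_ring[avail_idx % queue_size] = 1
    avail_idx += 1
    in_flight.add(1)

    # Number of buffers the driver counts as available but not used.
    not_used = avail_idx - used_idx

    # Ring slots a naive rewind would inspect: used_idx up to avail_idx.
    window = [avail_ring[i % queue_size] for i in range(used_idx, avail_idx)]

    # In-flight descriptors that the inspected slots no longer mention.
    lost = in_flight - set(window)
    return not_used, window, lost
```

wraparound_example() returns (2, [3, 1], {0}): two buffers are not
used, but the inspected slots name descriptor 3 (already used) and
descriptor 1, while descriptor 0 appears nowhere. That is the
descriptor the device has to flush before it can stop.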

> Does
> the device remember the available buffer with vring descriptor 0 or do
> we need to add it again?
>

If we want to keep the descriptor 0 as available, device and driver
should forget these internally tracked available buffers on stop and
skip those in the packed case, similarly to the descriptor chain or
IN_ORDER case. Regular stop and resume gets more complicated, but
descriptor rewinding is more powerful. And this solution gets way
more difficult or impossible to use from the VMM.

We could allow the device and the driver to remember these internally
tracked available descriptors. But this makes it impossible to use the
solution from the VMM transparently to the guest unless a way to
provide them is used.

However this gets more and more complicated. I think it's better to
rely on device stop + reset for descriptor recovery. Everything is much
clearer even if a few more steps are needed to resume, and all
descriptors are recoverable. The device still needs a way to expose
the in-flight ones in the non-IN_ORDER case.

> I'm not sure if this idea works even with split virtqueues.
>
> > +
> > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
>
> "its stop" -> "it has stopped".
>

I will replace it.

> > +the driver MAY change any descriptor.
>
> Not just descriptors, but also avail.ring[] and avail.idx?
>

Right.

> > +
> > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
>
> "its" -> "it has"
>
> > +the driver can resume it clearing the STOP status bit. It MUST re-read the
>
> "it clearing" -> "it by clearing"
>
> > +device status to ensure the STOP bit is clear after the write. The device
> > +acknowledges the new status clearing it. Since this change may not be
>
> "new status clearing" -> "new status by clearing"
>

I will correct all of the above.

> > +instantaneous, the driver MAY wait for the configuration change notification
> > +that the device must send after the change.
>
> "must" -> "MUST" or "that the device sends"
>

I will delete it.

> > +
> >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> >
> >  The device MUST NOT consume buffers or send any used buffer
> >  notifications to the driver before DRIVER_OK.
> >
> > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > +or clear of STOP.
> > +
> > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > +operations after the driver writes STOP.  Depending on the device, it can do it
> > +in many ways as long as the driver can recover its normal operation if it
> > +resumes the device without the need of resetting it:
> > +\begin{itemize}
> > +\item Drain and wait for the completion of all pending requests until a
> > +convenient avail descriptor. Ignore any other posterior descriptor.
> > +\item Return a device-specific failure for these descriptors, so the driver
> > +can choose to retry or to cancel them.
>
> This means each device type in the spec needs to define STOP semantics
> so drivers know what to expect. Not sure it's feasible to do this. If
> you can drop this point from the patch I would because this is going to
> be hard to get right in implementations and a pain to specify properly.
>

Yes, this complicates the solution and is not needed if the device is
able to report the in-flight ones.

> > +\item Mark them as done even if they are not, if the kind of device can
> > +assume to lose them.
> > +\end{itemize}
> > +
> > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > +a guest's request, the device MUST set the STOP_FAILED bit for the guest to
> > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > +clears STOP_FAILED.
>
> s/guest/driver/ here and elsewhere in this patch
>

I will review it.

> > +
> > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > +and the device can pause its operation, the device MUST set the descriptors
> > +that it has done with them as used before exposing the STOP status bit as set.
>
> This sentence is a bit confusing. What does "can" mean here?
>

"the device can pause its operation" can be replaced by "the device
has set the STOP bit". Is that clearer?

> Can the
> device negotiate VIRTIO_F_STOP but always set STOP_FAILED because it
> cannot actually stop?

In this proposal yes, although it is not very polite of course.
STOP_FAILED will be dropped for the next revision so there is no need
to worry about this at the moment.

> The second part of the sentence is unclear. Does it mean that the device
> must complete an in-flight buffers? When VIRTIO_F_IN_ORDER is not
> negotiated the device doesn't have to process available buffers in
> order, so that means the device may leave a mix of available and used
> buffers rather than a sequence of used buffers followed by available
> buffers?
>

It does not need to complete all in-flight ones, only the non-recoverable ones.

> > +
> > +If VIRTIO_F_STOP has been negotiated, the device MUST NOT perform these actions
> > +after exposing the STOP bit set:
> > +\begin{itemize}
> > +\item Read updates on the descriptor or driver area, or consume more buffers.
> > +\item Send any used buffer notifications to the driver.
> > +\end{itemize}
> > +
> > +The device MUST send a configuration space change right after exposing the STOP
> > +or STOP_FAILED as set to the driver, and MUST NOT change configuration space or
> > +send another configuration space change notification to the driver afterwards
> > +until the guest clears it.
> > +
> > +If VIRTIO_F_STOP has been negotiated and STOP device status flag is set,
> > +the device MUST resume operation when the driver clears the STOP bit. The
> > +device MUST continue reading available descriptors as if an available buffer
> > +notification has reach it, starting from the last descriptor it marked as used,
>
> "has reach it" -> "has been received"?
>
> "starting from the last descriptor it marked as used" sounds wrong. I
> guess this assumes VIRTIO_F_IN_ORDER was negotiated?
>

It is written with the packed vq in mind but I think that "last used
idx" is ok too. No need for VIRTIO_F_IN_ORDER.

I hope these explanations make the proposal clearer. A big part of
it will change in the next revisions with the "in flight" reporting in
mind, but feel free to request more clarifications. I will
address all the other comments, so thank you a lot for all of them!

> > +and continue the regular operation after that. The device MUST read again
> > +descriptor and driver area beyond the last descriptor it marked as used when it
> > +stopped, because the driver can change it. Device MUST set DEVICE_NEEDS_RESET
> > +if for some reason it cannot continue.
> > +
> >  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> >  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
> >  MUST send a device configuration change notification to the driver.
> > @@ -6694,6 +6773,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> >    transport specific.
> >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> >
> > +\item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> > +  stop the device.
> > +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> > +
> >  \end{description}
> >
> >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > --
> > 2.27.0
> >



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-18 16:49   ` Eugenio Perez Martin
@ 2021-11-23 11:33     ` Stefan Hajnoczi
  2021-11-23 16:19       ` Eugenio Perez Martin
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2021-11-23 11:33 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy


On Thu, Nov 18, 2021 at 05:49:11PM +0100, Eugenio Perez Martin wrote:
> On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> > > This patch introduces a new status bit STOP. This can be used by the
> > > driver to stop the device in order to safely fetch used descriptors
> > > status, making sure the device will not fetch new available ones.
> > >
> > > Its main use case is live migration, although it has other orthogonal
> > > use cases. It can be used to safely discard requests that have not been
> > > used: in other words, to rewind available descriptors.
> >
> > This sounds non-trivial and would require more explanation.
> >
> 
> You are right and it's not well explained here, I will try to develop
> it better for the next version. It's in the cover letter explaining
> one use case proposed by MST [1] and the answer by Jason [2].
> 
> When the VQ is stopped, it is forced to flush all the descriptors (in
> this version) and the used index in the case of split. With that
> information, the driver can modify any available descriptor that has
> not been used by the device, and to rewind the virtqueue available
> index to an extent.
> 
> If we add Jason's virtqueue state (first patch of previous version),
> which I need to recover in later versions, the driver can move freely
> available and used index prior to queue resetting.
> 
> The (in comments) proposed solution for the device that cannot flush
> its descriptor timely is to provide it with a way to report in-flight
> descriptors. As a reference, vhost-user has done it before, but with a
> memory region shared by a file descriptor [3]. If we add something
> similar, the driver still knows what file descriptors is able to
> rewind.
> 
> Does that explanation make the driver rewind use case more clear?

I understand the use case but it's not clear how the mechanism is
supposed to work. Let's continue discussing in the sub-thread where I
posted details.

> 
> > > Stopping the device in the live migration context is done by per-device
> > > operations in vhost backends, but the introduction of STOP as a basic
> > > virtio facility comes with advantages:
> > > * All the device virtio-specific state is summarized in a single entity,
> > >   making easier to reason about it.
> >
> > This point isn't clear to me. I think it's saying that using STOP
> > somehow unifies things compared to the way that vhost devices are
> > stopped today. Given that vhost already syncs state back to the VMM's
> > emulated VIRTIO device, I'm not sure how STOP is different.
> >
> 
> It also achieves that, but it's more related to the fact that the
> current way of getting the index through vhost net, user, ... is not
> reusable by others methods to expose the device to VMM.

->vhost_get_vring_base() is a common interface across vhost-kernel,
vhost-user, etc and all VIRTIO Device Types. Is there a problem with it
that I'm missing?

"vhost net, user" is confusing. Do you mean vhost-kernel and vhost-user
or do you mean vhost-net and vhost-user-net?

> Avoiding
> developing a different way to stop and get the status of each kind of
> device helps others devices implementations out of VMM.

What does "kind of device" mean? I think you mean vhost device types
like net, scsi, blk, vsock, etc (a subset of VIRTIO Device Types that
have been defined for vhost). As you say, they have different stop
operations, but it's not true that getting the status of a vring is
different for each one (they all use ->vhost_get_vring_base()).

> 
> What you mention has more to do with the next bullet point.
> 
> > > * VMM does not need to implement device specific operations in the
> > >   driver part.
> >
> > What is the "driver part"?
> >
> 
> The part of qemu that talks to devices using virtio through (for
> example) vhost messages. This set features, get features, etc.
> 
> Each vhost device has its own method to stop the device. In networking
> is setting a backend file descriptor -1, and others have their own
> way.
> 
> Using the status field allows out of VMM to unify that part too.
> 
> Maybe this one and the above would be clearer if I use vhost as
> examples. I will rewrite them anyhow, thanks!

Thanks. Something like "The VMM does not need to implement a different
stop operation like VHOST_NET_SET_BACKEND -1 for each device type" would
be clearer.

> > > * Work out of the box for devices that use pure virtio backends in some
> > >   part of the device emulation chain (virtio_pci_vdpa or virtio_vdpa),
> > >   in any transport the device can use.
> >
> > ?
> 
> vp_vdpa makes a standard virtio device exposed as a vdpa one. This
> implies that each of the vhost commands sent to vhost-vdpa needs to be
> converted to standard virtio request if it needs to reach the actual
> device. But there is currently no way to stop the device or retrieve
> its state using just virtio.
> 
> Because of that, it's also usable by a pure virtio device, like in the
> case of vdpa devices using virtio_vdpa or other devices that simply
> exposes itself as a virtio one with no further facilities.
> 
> It is also not restricted by the transport you use to expose the
> virtio: PCI, MMIO, etc, since you need to perform operations already
> defined by any usable device (set and get the status).

I see. This says STOP needs to be an in-band VIRTIO operation so that
vDPA/vhost can be stacked on top of VIRTIO devices. If STOP was only a
vhost operation then it wouldn't be possible to forward it to underlying
VIRTIO devices.

Stefan


* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-18 19:58     ` Eugenio Perez Martin
@ 2021-11-23 12:16       ` Stefan Hajnoczi
  2021-11-23 17:00         ` [virtio-dev] " Eugenio Perez Martin
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2021-11-23 12:16 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy


On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > From: Jason Wang <jasowang@redhat.com>
> > >
> > > This patch introduces a new status bit STOP. This can be used by the
> > > driver to stop the device in order to safely fetch used descriptors
> > > status, making sure the device will not fetch new available ones.
> > >
> > > Its main use case is live migration, although it has other orthogonal
> > > use cases. It can be used to safely discard requests that have not been
> > > used: in other words, to rewind available descriptors.
> > >
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 83 insertions(+)
> > >
> > > diff --git a/content.tex b/content.tex
> > > index 2aa3006..9ed0d09 100644
> > > --- a/content.tex
> > > +++ b/content.tex
> > > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> > >    drive the device.
> > >
> > > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > +  device has been stopped by the driver.
> >
> > Who controls the STOP bit? If I understand correctly, the driver writes
> > STOP to the Status Register but the device will not report the STOP bit
> > until the device has fully stopped?
> >
> 
> Yes, but to add the point of view of the driver here seems too much.
> However, driver_ok is already explained from the point of view of the
> driver, so I should try to accommodate that here too.

I would drop "by the driver" to make it less confusing. The role of the
driver and the device is explained later in this patch, so it can be
omitted here to save the reader from guessing what "by the driver"
means.

> 
> > > +  This status bit is different
> > > +  from the reset since the device state is preserved.
> >
> > "the reset" -> "resetting the device"
> >
> 
> I will rephrase that.
> 
> > > +
> > > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > +  device could not stop the STOP request.
> >
> > "could not stop the STOP request" is weird, maybe "could not complete
> > the STOP request" or just "could not stop".
> >
> 
> I think STOP_FAILED should be left out of the proposal for the next
> revision actually, but you are right.
> 
> > > +
> > >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> > >    an error from which it can't recover.
> > >  \end{description}
> > > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > >  recover by issuing a reset.
> > >  \end{note}
> > >
> > > +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set or clear STOP if
> > > +DRIVER_OK is not set.
> >
> > Small tweak: "if DRIVER_OK is not set" -> "when DRIVER_OK is not set".
> > That makes the sentence a little easier to read (for me, at least).
> >
> 
> Ok I will change that.
> 
> > > +
> > > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > > +acknowledges the new paused status setting the first, or the failure setting
> > > +the last.
> >
> > This sentence is confusing. "paused status" introduces an alias for the
> > STOP functionality that is being described. Don't use multiple names for
> > the same thing. The "first"/"last" is unnecessary indirection, just use
> > "STOP" and "STOP_FAILED" so the reader doesn't have to figure out what
> > you meant. I suggest:
> >
> >   "The device indicates that it has stopped by reporting the STOP bit or
> >   indicates failure by reporting the STOP_FAILED bit in the device
> >   status field."
> >
> 
> I agree, I will change that.
> 
> To be sure, you mean to replace just the second part of the paragraph, isn't it?

Yes, just the last sentence of the paragraph.

> > > +Since this change may not be instantaneous, the driver MAY wait for
> > > +the configuration change notification that the device must send after the
> >
> > "must" is lowercase. If this is a device normative section then it
> > should be "MUST". Otherwise I suggest removing the "must": "that the
> > device sends after ...".
> >
> 
> It's in lowercase because it's the driver normative, not the device
> one. But the "after" alternative sounds perfect to me.
> 
> > > +change. If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > > +try new STOP attempts.
> >
> > s/try/trying/
> >
> 
> I'll probably discard the STOP_FAILED part for the next revisions but
> I will take into account if I recover that, thanks!
> 
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> >
> > "its stop" should probably be "it has stopped", but a more explicit way
> > to explain this is:
> >
> > "If VIRTIO_F_STOP has been negotiated and the driver reads the STOP bit
> > from the device status field,"
> >
> 
> I will change that.
> 
> > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> >
> > I'm trying to understand how this would work. Available buffers may be
> > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > avail ring could contain something like:
> >
> >   avail.ring = [Used, Not used, Used, Not used, ...]
> >                                                 ^--- avail.idx
> >
> > There are num_not_used = avail.idx - used.idx requests that are "Not
> > used" in avail.ring.
> >
> > Does this mean the driver can rewind avail.idx by counting the number of
> > "Not used" buffers and skipping "Used" buffers until it reaches
> > num_not_used "Not used" buffers?
> >
> 
> I'm going to also drop the "resume" part for the next version because
> it adds extra complexity not actually needed, and it can be achieved
> with a full reset in a simpler way.
> 
> But I'll explain it below with your examples. Long story short, the
> driver only can rewind the available descriptors that are still in the
> available ring, and the device must flush the ones that cannot recover
> from the ring.
> 
> > I think there is a known issue with this approach:
> >
> > Imagine a vring with 4 elements:
> >
> >   avail.ring = [0,        1,    2,    3   ]
> >                 Not used, used, used, used
> >                                            ^--- avail.idx
> >
> > Since the device has used 3 buffers the driver now has space to make
> > more buffers available. avail.idx wraps back to the start of the ring
> > and the driver overwrites the first element ("Not used"):
> >
> >   avail.ring = [1,        N/A,  N/A,  N/A]
> >                 Not used, N/A,  N/A,  N/A
> >                          ^--- avail.idx
> >
> > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > for the new available buffer.
> >
> > Now we stop the device, knowing there are two buffers available that
> > have not been used. But avail.ring[] actually only contains the new
> > buffer (vring descriptor 1) that we made available because we overwrote
> > the old avail.ring[] element (vring descriptor 0).
> >
> > What now? Where does the device reset its internal avail_idx to?
> 
> To be on the same page, in qemu the device maintains two "internal
> avail idx": shadow_avail_idx (last seen in the available ring, could
> be 4 in this case) and last_avail_idx (next descriptor to fetch from
> avail, 2). The device must forget shadow_avail_idx and flush the
> descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> it can stop.
> 
> The proposal allows the device to fail descriptor 0 in a
> device-specific way, but I think now it was a bad choice.
> 
> The driver cannot move the device's last_avail_idx in this operation:
> The device is simply forced to flush used ones to the used ring or
> descriptor ring in a packed vq case. So the device's internal
> avail_idx == used_idx == 3. When the device resumes, it's still 3.
> 
> The device must keep its last_avail_idx through stop and resume cycle.

Are you saying that all buffers avail->ring[i % ring_size] must be
completed by the device before the STOP bit is reported where i <=
last_avail_idx?

This means the driver can modify avail->ring[i % ring_size] where
avail_idx >= i > used_idx.

(There might be off-by-one errors, I didn't check whether avail_idx and
used_idx are inclusive or exclusive bounds in the spec.)

The example I gave violates these constraints and wouldn't be allowed.
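
To make the index arithmetic concrete, here is a minimal sketch in C. It
assumes the split-ring convention that avail_idx and used_idx are
free-running 16-bit counters pointing at the next slot to fill, so with
in-order completion the pending window is the half-open range
[used_idx, avail_idx). The helper names (vq_pending, vq_index_pending,
vq_slot) are made up for illustration and are not from the spec or QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Buffers made available but not yet used. Both indices are
 * free-running, so the subtraction relies on 16-bit wraparound. */
static uint16_t vq_pending(uint16_t avail_idx, uint16_t used_idx)
{
    return (uint16_t)(avail_idx - used_idx);
}

/* True if free-running index i lies in [used_idx, avail_idx), i.e. the
 * avail ring entry at i has not been consumed yet (assumed bounds). */
static bool vq_index_pending(uint16_t i, uint16_t avail_idx, uint16_t used_idx)
{
    return (uint16_t)(i - used_idx) < vq_pending(avail_idx, used_idx);
}

/* Ring slot for free-running index i. A split virtqueue size is a
 * power of two, so the modulo can be done with a mask. */
static uint16_t vq_slot(uint16_t i, uint16_t ring_size)
{
    assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
    return (uint16_t)(i & (ring_size - 1));
}
```

With the 4-element example, used_idx = 3 and avail_idx = 5 after the
driver adds the new buffer. Index 0 is outside the window, and index 4
maps to the same slot as index 0, which is why the old avail.ring[]
entry is gone.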

> > Does
> > the device remember the available buffer with vring descriptor 0 or do
> > we need to add it again?
> >
> 
> If we want to keep the descriptor 0 as available, device and driver
> should forget these internally tracked available buffers on stop and
> skip those in the packed case, similar as in the descriptor chain or
> IN_ORDER case. Regular stop and resume gets more complicated, but
> descriptors rewinding is more powerful. And this solution gets way
> more difficult or impossible to use from the VMM.
> 
> We could allow the device and the driver to remember these internally
> tracked available descriptors. But this makes it impossible to use the
> solution from the VMM transparently to the guest unless a way to
> provide them is used.
> 
> However this gets more and more complicated. I think it's better to
> rely on device stop + reset for descriptor recovery. Everything is way
> more clear even if a few more steps are needed to resume, and all
> descriptors are recoverable. The device still needs a way to expose
> the in-flight ones in the not IN_ORDER case.

I'm also wary of a complicated mechanism for modifying available
descriptors. Reset sounds good.

> 
> > I'm not sure if this idea works even with split virtqueues.
> >
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> >
> > "its stop" -> "it has stopped".
> >
> 
> I will replace it.
> 
> > > +the driver MAY change any descriptor.
> >
> > Not just descriptors, but also avail.ring[] and avail.idx?
> >
> 
> Right.
> 
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> >
> > "its" -> "it has"
> >
> > > +the driver can resume it clearing the STOP status bit. It MUST re-read the
> >
> > "it clearing" -> "it by clearing"
> >
> > > +device status to ensure the STOP bit is clear after the write. The device
> > > +acknowledges the new status clearing it. Since this change may not be
> >
> > "new status clearing" -> "new status by clearing"
> >
> 
> I will correct all of the above.
> 
> > > +instantaneous, the driver MAY wait for the configuration change notification
> > > +that the device must send after the change.
> >
> > "must" -> "MUST" or "that the device sends"
> >
> 
> I will delete it.
> 
> > > +
> > >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> > >
> > >  The device MUST NOT consume buffers or send any used buffer
> > >  notifications to the driver before DRIVER_OK.
> > >
> > > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > > +or clear of STOP.
> > > +
> > > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > > +operations after the driver writes STOP.  Depending on the device, it can do it
> > > +in many ways as long as the driver can recover its normal operation if it
> > > +resumes the device without the need of resetting it:
> > > +\begin{itemize}
> > > +\item Drain and wait for the completion of all pending requests until a
> > > +convenient avail descriptor. Ignore any other posterior descriptor.
> > > +\item Return a device-specific failure for these descriptors, so the driver
> > > +can choose to retry or to cancel them.
> >
> > This means each device type in the spec needs to define STOP semantics
> > so drivers know what to expect. Not sure it's feasible to do this. If
> > you can drop this point from the patch I would because this is going to
> > be hard to get right in implementations and a pain to specify properly.
> >
> 
> Yes, this complicates the solution and is not needed if the device is
> able to report the in-flight ones.
> 
> > > +\item Mark them as done even if they are not, if the kind of device can
> > > +assume to lose them.
> > > +\end{itemize}
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > > +a guest's request, the device MUST set the STOP_FAILED bit for the guest to
> > > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > > +clears STOP_FAILED.
> >
> > s/guest/driver/ here and elsewhere in this patch
> >
> 
> I will review it.
> 
> > > +
> > > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > > +and the device can pause its operation, the device MUST set the descriptors
> > > +that it has done with them as used before exposing the STOP status bit as set.
> >
> > This sentence is a bit confusing. What does "can" mean here?
> >
> 
> "the device can pause its operation" can be replaced by "The device
> has set STOP bit". Is that clearer?

I think that clause can be removed to make the sentence simpler:

  If VIRTIO_F_STOP has been negotiated and the guest has written the
  STOP bit, the device MUST mark as used all descriptors currently being
  processed before reporting the STOP bit.

BTW this sentence needs to be made more specific if you want the "all
buffers avail->ring[i % ring_size] must be completed by the device
before the STOP bit is reported where i <= last_avail_idx" semantics. We
don't just want to mark in-flight buffers as used, but all buffers
before the used descriptor with the highest free-running ring index.
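
As a rough illustration of the ordering that sentence asks for, here is
a toy model in C. The struct and the toy_* functions are hypothetical
and only model the handshake; DRIVER_OK (4) is from the existing spec
text and STOP (16) is the value proposed in this patch. The only point
is that the device reports STOP after everything in flight has been
marked used, and the driver re-reads the status until it sees the bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STATUS_DRIVER_OK 4u   /* existing device status bit */
#define STATUS_STOP      16u  /* value proposed in this patch */

/* Hypothetical toy device, not a real virtio implementation. */
struct toy_dev {
    uint8_t status;       /* what the driver reads back */
    bool stop_requested;  /* driver has written STOP */
    unsigned in_flight;   /* buffers fetched but not yet marked used */
    unsigned used_idx;    /* number of buffers marked used so far */
};

/* Driver side: request a stop. Ignored before DRIVER_OK, as in the draft. */
static void toy_write_stop(struct toy_dev *d)
{
    if (d->status & STATUS_DRIVER_OK)
        d->stop_requested = true;
}

/* Device side: one unit of progress. In-flight buffers are completed
 * first; STOP becomes visible only once none are left. */
static void toy_dev_step(struct toy_dev *d)
{
    if (d->in_flight > 0) {
        d->in_flight--;
        d->used_idx++;
        return;
    }
    if (d->stop_requested) {
        assert(d->in_flight == 0); /* nothing in flight once STOP is visible */
        d->status |= STATUS_STOP;
    }
}

/* Driver side: re-read the status until STOP is reported, giving up
 * after max_steps device steps. */
static bool toy_wait_stopped(struct toy_dev *d, unsigned max_steps)
{
    for (unsigned i = 0; i < max_steps; i++) {
        if (d->status & STATUS_STOP)
            return true;
        toy_dev_step(d);
    }
    return (d->status & STATUS_STOP) != 0;
}
```

A real driver would wait on the configuration change notification or
poll the status register instead of stepping the device itself; the
loop here just stands in for that.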



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-23 11:33     ` Stefan Hajnoczi
@ 2021-11-23 16:19       ` Eugenio Perez Martin
  2021-11-24 15:26         ` Stefan Hajnoczi
  2021-11-25  3:05         ` Jason Wang
  0 siblings, 2 replies; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-23 16:19 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Tue, Nov 23, 2021 at 12:33 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Nov 18, 2021 at 05:49:11PM +0100, Eugenio Perez Martin wrote:
> > On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> > > > This patch introduces a new status bit STOP. This can be used by the
> > > > driver to stop the device in order to safely fetch used descriptors
> > > > status, making sure the device will not fetch new available ones.
> > > >
> > > > Its main use case is live migration, although it has other orthogonal
> > > > use cases. It can be used to safely discard requests that have not been
> > > > used: in other words, to rewind available descriptors.
> > >
> > > This sounds non-trivial and would require more explanation.
> > >
> >
> > You are right and it's not well explained here, I will try to develop
> > it better for the next version. It's in the cover letter explaining
> > one use case proposed by MST [1] and the answer by Jason [2].
> >
> > When the VQ is stopped, it is forced to flush all the descriptors (in
> > this version) and the used index in the case of split. With that
> > information, the driver can modify any available descriptor that has
> > not been used by the device, and to rewind the virtqueue available
> > index to an extent.
> >
> > If we add Jason's virtqueue state (first patch of previous version),
> > which I need to recover in later versions, the driver can move freely
> > available and used index prior to queue resetting.
> >
> > The (in comments) proposed solution for the device that cannot flush
> > its descriptor timely is to provide it with a way to report in-flight
> > descriptors. As a reference, vhost-user has done it before, but with a
> > memory region shared by a file descriptor [3]. If we add something
> > similar, the driver still knows what file descriptors is able to
> > rewind.
> >
> > Does that explanation make the driver rewind use case more clear?
>
> I understand the use case but it's not clear how the mechanism is
> supposed to work. Let's continue discussing in the sub-thread where I
> posted details.
>

Ok!

> >
> > > > Stopping the device in the live migration context is done by per-device
> > > > operations in vhost backends, but the introduction of STOP as a basic
> > > > virtio facility comes with advantages:
> > > > * All the device virtio-specific state is summarized in a single entity,
> > > >   making easier to reason about it.
> > >
> > > This point isn't clear to me. I think it's saying that using STOP
> > > somehow unifies things compared to the way that vhost devices are
> > > stopped today. Given that vhost already syncs state back to the VMM's
> > > emulated VIRTIO device, I'm not sure how STOP is different.
> > >
> >
> > It also achieves that, but it's more related to the fact that the
> > current way of getting the index through vhost net, user, ... is not
> > reusable by others methods to expose the device to VMM.
>
> ->vhost_get_vring_base() is a common interface across vhost-kernel,
> vhost-user, etc and all VIRTIO Device Types. Is there a problem with it
> that I'm missing?
>

It is usable by all VIRTIO Device types *exposed through vhost*, so it
fails to address the case when we cannot use vhost to expose the
device. It can happen with the cases you explained better than me in
[1], or when exposing a vdpa device as a pure virtio device using
virtio_vdpa.

> "vhost net, user" is confusing. Do you mean vhost-kernel and vhost-user
> or do you mean vhost-net and vhost-user-net?
>

I meant vhost-kernel and vhost-user, sorry.

> > Avoiding
> > developing a different way to stop and get the status of each kind of
> > device helps others devices implementations out of VMM.
>
> What does "kind of device" mean? I think you mean vhost device types
> like net, scsi, blk, vsock, etc (a subset of VIRTIO Device Types that
> have been defined for vhost). As you say, they have different stop
> operations, but it's not true that getting the status of a vring is
> different for each one (they all use ->vhost_get_vring_base()).
>

Reading that way I meant all VIRTIO Device Types, since this proposal
also addresses them even if they don't use vhost-*.

I said "stop and get the status" as a single operation, but now I see
it's confusing, and I mostly meant stop, as you say.

> >
> > What you mention has more to do with the next bullet point.
> >
> > > > * VMM does not need to implement device specific operations in the
> > > >   driver part.
> > >
> > > What is the "driver part"?
> > >
> >
> > The part of qemu that talks to devices using virtio through (for
> > example) vhost messages. This set features, get features, etc.
> >
> > Each vhost device has its own method to stop the device. In networking
> > is setting a backend file descriptor -1, and others have their own
> > way.
> >
> > Using the status field allows out of VMM to unify that part too.
> >
> > Maybe this one and the above would be clearer if I use vhost as
> > examples. I will rewrite them anyhow, thanks!
>
> Thanks. Something like "The VMM does not need to implement a different
> stop operation like VHOST_NET_SET_BACKEND -1 for each device type" would
> be clearer.
>

I will change, thanks!

> > > > * Work out of the box for devices that use pure virtio backends in some
> > > >   part of the device emulation chain (virtio_pci_vdpa or virtio_vdpa),
> > > >   in any transport the device can use.
> > >
> > > ?
> >
> > vp_vdpa makes a standard virtio device exposed as a vdpa one. This
> > implies that each of the vhost commands sent to vhost-vdpa needs to be
> > converted to standard virtio request if it needs to reach the actual
> > device. But there is currently no way to stop the device or retrieve
> > its state using just virtio.
> >
> > Because of that, it's also usable by a pure virtio device, like in the
> > case of vdpa devices using virtio_vdpa or other devices that simply
> > exposes itself as a virtio one with no further facilities.
> >
> > It is also not restricted by the transport you use to expose the
> > virtio: PCI, MMIO, etc, since you need to perform operations already
> > defined by any usable device (set and get the status).
>

[1]
> I see. This says STOP needs to be an in-band VIRTIO operation so that
> vDPA/vhost can be stacked on top of VIRTIO devices. If STOP was only a
> vhost operation then it wouldn't be possible to forward it to underlying
> VIRTIO devices.
>

I will use that to clarify the point, thanks!

I think that all the points overlap too much, so I will try to rewrite
them differently for the next version. Thank you very much for the
feedback!

> Stefan



* [virtio-dev] Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-23 12:16       ` Stefan Hajnoczi
@ 2021-11-23 17:00         ` Eugenio Perez Martin
  2021-11-24 11:20           ` Stefan Hajnoczi
  0 siblings, 1 reply; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-23 17:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > From: Jason Wang <jasowang@redhat.com>
> > > >
> > > > This patch introduces a new status bit STOP. This can be used by the
> > > > driver to stop the device in order to safely fetch used descriptors
> > > > status, making sure the device will not fetch new available ones.
> > > >
> > > > Its main use case is live migration, although it has other orthogonal
> > > > use cases. It can be used to safely discard requests that have not been
> > > > used: in other words, to rewind available descriptors.
> > > >
> > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > ---
> > > >  content.tex | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 83 insertions(+)
> > > >
> > > > diff --git a/content.tex b/content.tex
> > > > index 2aa3006..9ed0d09 100644
> > > > --- a/content.tex
> > > > +++ b/content.tex
> > > > @@ -47,6 +47,13 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > >  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> > > >    drive the device.
> > > >
> > > > +\item[STOP (16)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > +  device has been stopped by the driver.
> > >
> > > Who controls the STOP bit? If I understand correctly, the driver writes
> > > STOP to the Status Register but the device will not report the STOP bit
> > > until the device has fully stopped?
> > >
> >
> > Yes, but to add the point of view of the driver here seems too much.
> > However, driver_ok is already explained from the point of view of the
> > driver, so I should try to accommodate that here too.
>
> I would drop "by the driver" to make it less confusing. The role of the
> driver and the device is explained later in this patch, so it can be
> omitted here to save the reader from guessing what "by the driver"
> means.
>

Ok we can do it that way.

> >
> > > > +  This status bit is different
> > > > +  from the reset since the device state is preserved.
> > >
> > > "the reset" -> "resetting the device"
> > >
> >
> > I will rephrase that.
> >
> > > > +
> > > > +\item[STOP_FAILED (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> > > > +  device could not stop the STOP request.
> > >
> > > "could not stop the STOP request" is weird, maybe "could not complete
> > > the STOP request" or just "could not stop".
> > >
> >
> > I think STOP_FAILED should be left out of the proposal for the next
> > revision actually, but you are right.
> >
> > > > +
> > > >  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> > > >    an error from which it can't recover.
> > > >  \end{description}
> > > > @@ -74,11 +81,83 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> > > >  recover by issuing a reset.
> > > >  \end{note}
> > > >
> > > > +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set or clear STOP if
> > > > +DRIVER_OK is not set.
> > >
> > > Small tweak: "if DRIVER_OK is not set" -> "when DRIVER_OK is not set".
> > > That makes the sentence a little easier to read (for me, at least).
> > >
> >
> > Ok I will change that.
> >
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated the driver MUST re-read the device status
> > > > +to ensure the STOP or STOP_FAILED bit is set after the write. The device
> > > > +acknowledges the new paused status setting the first, or the failure setting
> > > > +the last.
> > >
> > > This sentence is confusing. "paused status" introduces an alias for the
> > > STOP functionality that is being described. Don't use multiple names for
> > > the same thing. The "first"/"last" is unnecessary indirection, just use
> > > "STOP" and "STOP_FAILED" so the reader doesn't have to figure out what
> > > you meant. I suggest:
> > >
> > >   "The device indicates that it has stopped by reporting the STOP bit or
> > >   indicates failure by reporting the STOP_FAILED bit in the device
> > >   status field."
> > >
> >
> > I agree, I will change that.
> >
> > To be sure, you mean to replace just the second part of the paragraph, isn't it?
>
> Yes, just the last sentence of the paragraph.
>
> > > > +Since this change may not be instantaneous, the driver MAY wait for
> > > > +the configuration change notification that the device must send after the
> > >
> > > "must" is lowercase. If this is a device normative section then it
> > > should be "MUST". Otherwise I suggest removing the "must": "that the
> > > device sends after ...".
> > >
> >
> > It's in lowercase because it's the driver normative, not the device
> > one. But the "after" alternative sounds perfect to me.
> >
> > > > +change. If the device sets the STOP_FAILED bit, the driver MUST clear it before
> > > > +try new STOP attempts.
> > >
> > > s/try/trying/
> > >
> >
> > I'll probably discard the STOP_FAILED part for the next revisions but
> > I will take into account if I recover that, thanks!
> >
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > >
> > > "its stop" should probably be "it has stopped", but a more explicit way
> > > to explain this is:
> > >
> > > "If VIRTIO_F_STOP has been negotiated and the driver reads the STOP bit
> > > from the device status field,"
> > >
> >
> > I will change that.
> >
> > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > >
> > > I'm trying to understand how this would work. Available buffers may be
> > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > avail ring could contain something like:
> > >
> > >   avail.ring = [Used, Not used, Used, Not used, ...]
> > >                                                 ^--- avail.idx
> > >
> > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > used" in avail.ring.
> > >
> > > Does this mean the driver can rewind avail.idx by counting the number of
> > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > num_not_used "Not used" buffers?
> > >
> >
> > I'm going to also drop the "resume" part for the next version because
> > it adds extra complexity not actually needed, and it can be achieved
> > with a full reset in a simpler way.
> >
> > But I'll explain it below with your examples. Long story short, the
> > driver only can rewind the available descriptors that are still in the
> > available ring, and the device must flush the ones that cannot recover
> > from the ring.
> >
> > > I think there is a known issue with this approach:
> > >
> > > Imagine a vring with 4 elements:
> > >
> > >   avail.ring = [0,        1,    2,    3   ]
> > >                 Not used, used, used, used
> > >                                            ^--- avail.idx
> > >
> > > Since the device has used 3 buffers the driver now has space to make
> > > more buffers available. avail.idx wraps back to the start of the ring
> > > and the driver overwrites the first element ("Not used"):
> > >
> > >   avail.ring = [1,        N/A,  N/A,  N/A]
> > >                 Not used, N/A,  N/A,  N/A
> > >                          ^--- avail.idx
> > >
> > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > for the new available buffer.
> > >
> > > Now we stop the device, knowing there are two buffers available that
> > > have not been used. But avail.ring[] actually only contains the new
> > > buffer (vring descriptor 1) that we made available because we overwrote
> > > the old avail.ring[] element (vring descriptor 0).
> > >
> > > What now? Where does the device reset its internal avail_idx to?
> >
> > To be on the same page, in qemu the device maintains two "internal
> > avail idx": shadow_avail_idx (last seen in the available ring, could
> > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > avail, 2). The device must forget shadow_avail_idx and flush the
> > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > it can stop.
> >
> > The proposal allows the device to fail descriptor 0 in a
> > device-specific way, but I think now it was a bad choice.
> >
> > The driver cannot move the device's last_avail_idx in this operation:
> > The device is simply forced to flush used ones to the used ring or
> > descriptor ring in a packed vq case. So the device's internal
> > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> >
> > The device must keep its last_avail_idx through stop and resume cycle.
>
> Are you saying that all buffers avail->ring[i % ring_size] must be
> completed by the device before the STOP bit is reported where i <=
> last_avail_idx?
>
> This means the driver can modify avail->ring[i % ring_size] where
> avail_idx >= i > used_idx.
>

Yes, that's correct. The driver could also decide to modify the
descriptor table instead of the avail ring to do so, but I think the
point is clear now.

The design starts from the premise that out-of-order descriptors are
ones the device must wait to complete before the pause anyway.
Depending on the device, it might prefer to cancel them, to wait for
them, etc. The interesting descriptors to rewind are the ones that have
not reached the device (i > used_idx). The driver can do whatever it
wants with them.

If we assume all the in-flight descriptors are idempotent and we
provide a way for the device to expose them, the model is much simpler
than this.
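
For reference, the rewind rule from the patch text ("the new avail_idx
MUST be within used_idx and used_idx plus virtqueue size") could be
checked like this. It is only a sketch: it reads "within" as inclusive
on both ends, treats the indices as free-running 16-bit counters, and
the function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Draft rule: new_avail_idx must lie in [used_idx, used_idx + queue_size],
 * computed modulo 2^16 because the indices are free-running. */
static bool avail_idx_in_window(uint16_t new_avail_idx, uint16_t used_idx,
                                uint16_t queue_size)
{
    return (uint16_t)(new_avail_idx - used_idx) <= queue_size;
}

/* Hypothetical driver helper: take back the n most recently exposed
 * buffers after the device has reported STOP. Fails if n is larger
 * than the number of buffers that have not reached the device. */
static bool rewind_avail(uint16_t *avail_idx, uint16_t used_idx,
                         uint16_t queue_size, uint16_t n)
{
    uint16_t pending = (uint16_t)(*avail_idx - used_idx);

    assert(pending <= queue_size); /* ring can never hold more than its size */
    if (n > pending)
        return false;
    *avail_idx = (uint16_t)(*avail_idx - n);
    return true;
}
```

Note this only covers buffers the device has not fetched; the
out-of-order case discussed above still needs the device to flush or
report its in-flight descriptors.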

> (There might be off-by-one errors, I didn't check whether avail_idx and
> used_idx are inclusive or exclusive bounds the spec.)
>
> The example I gave violates these constraints and wouldn't be allowed.
>
> > > Does
> > > the device remember the available buffer with vring descriptor 0 or do
> > > we need to add it again?
> > >
> >
> > If we want to keep the descriptor 0 as available, device and driver
> > should forget these internally tracked available buffers on stop and
> > skip those in the packed case, similar as in the descriptor chain or
> > IN_ORDER case. Regular stop and resume gets more complicated, but
> > descriptors rewinding is more powerful. And this solution gets way
> > more difficult or impossible to use from the VMM.
> >
> > We could allow the device and the driver to remember these internally
> > tracked available descriptors. But this makes it impossible to use the
> > solution from the VMM transparently to the guest unless a way to
> > provide them is used.
> >
> > However this gets more and more complicated. I think it's better to
> > rely on device stop + reset for descriptor recovery. Everything is way
> > more clear even if a few more steps are needed to resume, and all
> > descriptors are recoverable. The device still needs a way to expose
> > the in-flight ones in the not IN_ORDER case.
>
> I'm also wary of a complicated mechanism for modifying available
> descriptors. Reset sounds good.
>

The next version will include it.

> >
> > > I'm not sure if this idea works even with split virtqueues.
> > >
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stop,
> > >
> > > "its stop" -> "it has stopped".
> > >
> >
> > I will replace it.
> >
> > > > +the driver MAY change any descriptor.
> > >
> > > Not just descriptors, but also avail.ring[] and avail.idx?
> > >
> >
> > Right.
> >
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and the device has confirmed its stopped,
> > >
> > > "its" -> "it has"
> > >
> > > > +the driver can resume it clearing the STOP status bit. It MUST re-read the
> > >
> > > "it clearing" -> "it by clearing"
> > >
> > > > +device status to ensure the STOP bit is clear after the write. The device
> > > > +acknowledges the new status clearing it. Since this change may not be
> > >
> > > "new status clearing" -> "new status by clearing"
> > >
> >
> > I will correct all of the above.
> >
> > > > +instantaneous, the driver MAY wait for the configuration change notification
> > > > +that the device must send after the change.
> > >
> > > "must" -> "MUST" or "that the device sends"
> > >
> >
> > I will delete it.
> >
> > > > +
> > > >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> > > >
> > > >  The device MUST NOT consume buffers or send any used buffer
> > > >  notifications to the driver before DRIVER_OK.
> > > >
> > > > +If VIRTIO_F_STOP has not been negotiated the device MUST ignore the write of
> > > > +STOP. If the DRIVER_OK status bit is not set the device SHOULD ignore the write
> > > > +or clear of STOP.
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated, the device MUST finish any in flight
> > > > +operations after the driver writes STOP.  Depending on the device, it can do it
> > > > +in many ways as long as the driver can recover its normal operation if it
> > > > +resumes the device without the need of resetting it:
> > > > +\begin{itemize}
> > > > +\item Drain and wait for the completion of all pending requests until a
> > > > +convenient avail descriptor. Ignore any other posterior descriptor.
> > > > +\item Return a device-specific failure for these descriptors, so the driver
> > > > +can choose to retry or to cancel them.
> > >
> > > This means each device type in the spec needs to define STOP semantics
> > > so drivers know what to expect. Not sure it's feasible to do this. If
> > > you can drop this point from the patch I would because this is going to
> > > be hard to get right in implementations and a pain to specify properly.
> > >
> >
> > Yes, this complicates the solution and is not needed if the device is
> > able to report the in-flight ones.
> >
> > > > +\item Mark them as done even if they are not, if the kind of device can
> > > > +assume to lose them.
> > > > +\end{itemize}
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and it needs to fail the device stop after
> > > > +a guest's request, the device MUST set the STOP_FAILED bit for the guest to
> > > > +read it. The device MUST ignore new writes to the STOP bit until the guest
> > > > +clears STOP_FAILED.
> > >
> > > s/guest/driver/ here and elsewhere in this patch
> > >
> >
> > I will review it.
> >
> > > > +
> > > > +If VIRTIO_F_STOP has been negotiated and the guest has written the STOP bit,
> > > > +and the device can pause its operation, the device MUST set the descriptors
> > > > +that it has done with them as used before exposing the STOP status bit as set.
> > >
> > > This sentence is a bit confusing. What does "can" mean here?
> > >
> >
> > "the device can pause its operation" can be replaced by "The device
> > has set STOP bit". Is that clearer?
>
> I think that clause can be removed to make the sentence simpler:
>
>   If VIRTIO_F_STOP has been negotiated and the guest has written the
>   STOP bit, the device MUST mark as used all descriptors currently being
>   processed before reporting the STOP bit.
>
> BTW this sentence need to be made more specific if you want the "all
> buffers avail->ring[i % ring_size] must be completed by the device
> before the STOP bit is reported where i <= last_avail_idx" semantics. We
> don't just want to mark in-flight buffers as used, but all buffers
> before the used descriptor with the highest free-running ring index.

Now I think it overlaps too much with the second clause, regarding
their completion. These are going to be dropped in the next revision,
but I'll save this suggestion in case we need to come back to
this :).

Thanks!





* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-23 17:00         ` [virtio-dev] " Eugenio Perez Martin
@ 2021-11-24 11:20           ` Stefan Hajnoczi
  2021-11-24 16:41             ` Eugenio Perez Martin
  2021-11-25  2:57             ` Jason Wang
  0 siblings, 2 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2021-11-24 11:20 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy


On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > >
> > > > I'm trying to understand how this would work. Available buffers may be
> > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > avail ring could contain something like:
> > > >
> > > >   avail.ring = [Used, Not used, Used, Not used, ...]
> > > >                                                 ^--- avail.idx
> > > >
> > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > used" in avail.ring.
> > > >
> > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > num_not_used "Not used" buffers?
> > > >
> > >
> > > I'm going to also drop the "resume" part for the next version because
> > > it adds extra complexity not actually needed, and it can be achieved
> > > with a full reset in a simpler way.
> > >
> > > But I'll explain it below with your examples. Long story short, the
> > > driver only can rewind the available descriptors that are still in the
> > > available ring, and the device must flush the ones that cannot recover
> > > from the ring.
> > >
> > > > I think there is a known issue with this approach:
> > > >
> > > > Imagine a vring with 4 elements:
> > > >
> > > >   avail.ring = [0,        1,    2,    3   ]
> > > >                 Not used, used, used, used
> > > >                                            ^--- avail.idx
> > > >
> > > > Since the device has used 3 buffers the driver now has space to make
> > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > and the driver overwrites the first element ("Not used"):
> > > >
> > > >   avail.ring = [1,        N/A,  N/A,  N/A]
> > > >                 Not used, N/A,  N/A,  N/A
> > > >                          ^--- avail.idx
> > > >
> > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > for the new available buffer.
> > > >
> > > > Now we stop the device, knowing there are two buffers available that
> > > > have not been used. But avail.ring[] actually only contains the new
> > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > the old avail.ring[] element (vring descriptor 0).
> > > >
> > > > What now? Where does the device reset its internal avail_idx to?
> > >
> > > To be on the same page, in qemu the device maintains two "internal
> > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > it can stop.
> > >
> > > The proposal allows the device to fail descriptor 0 in a
> > > device-specific way, but I think now it was a bad choice.
> > >
> > > The driver cannot move the device's last_avail_idx in this operation:
> > > The device is simply forced to flush used ones to the used ring or
> > > descriptor ring in a packed vq case. So the device's internal
> > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > >
> > > The device must keep its last_avail_idx through stop and resume cycle.
> >
> > Are you saying that all buffers avail->ring[i % ring_size] must be
> > completed by the device before the STOP bit is reported where i <=
> > last_avail_idx?
> >
> > This means the driver can modify avail->ring[i % ring_size] where
> > avail_idx >= i > used_idx.
> >
> 
> Yes, That's correct. The driver could also decide to modify the
> descriptor table instead of the avail ring to do so, but I think the
> point is clear now.
> 
> Somehow it is thought after the premise that the out of order

"Somehow it is thought after the premise" == "there is a fundamental
design assumption"?

> descriptors are descriptors that the device must wait to complete
> before the pause anyway. Depending on the device, it might prefer to
> cancel them, to wait for them, etc. The interesting descriptors to
> rewind are the ones that have not reached the device (i > used_idx).
> The driver can do whatever it wants with them.
> 
> If we assume all the in-flight descriptors are idempotent and we
> expose a way for the device to expose them, the model is way more
> simpler than this.

The constraint that the device has to mark all previously seen "avail"
buffers as "used" is problematic. It makes STOP visible to the driver
when the device has to fail requests. That is incompatible with how
devices behave across live migration today. If you want to use STOP for
live migration then it's probably necessary to rethink this constraint.

QEMU's virtio-blk and virtio-scsi device models put failed requests onto
a list so they can be retried after problems with the underlying storage
have been resolved (e.g. more disk space becomes available and ENOSPC
requests can be retried). Based on the constraints you described, those
requests cannot be kept in the list across STOP.

QEMU live migration sends the retry list to the migration destination. I
think you're saying that won't be possible when STOP is used to
implement live migration?

That would be a shame since one of the ways to resolve I/O errors is by
migrating to another host :).
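
To make the concern concrete, here is a minimal C sketch of a retry
list that has to survive a stop. It is not QEMU's actual code:
ex_retry_list, ex_retry_add() and ex_retry_export() are hypothetical
names, and the fixed capacity is an arbitrary choice. The point is only
that the queued descriptor heads stay unused across the stop and are
handed to the destination instead of being completed with an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_RETRY_MAX 8  /* arbitrary capacity for the sketch */

/* Hypothetical retry list: heads of requests that failed and wait
 * for a retry. */
struct ex_retry_list {
    uint16_t heads[EX_RETRY_MAX];
    size_t count;
};

/* Queue a failed request for a later retry.
 * Returns 0 on success, -1 if the list is full. */
static int ex_retry_add(struct ex_retry_list *l, uint16_t head)
{
    assert(l != NULL);
    if (l->count == EX_RETRY_MAX) {
        return -1;
    }
    l->heads[l->count++] = head;
    return 0;
}

/* Copy the pending heads into out[] for the migration stream.
 * Returns how many entries were copied. */
static size_t ex_retry_export(const struct ex_retry_list *l,
                              uint16_t *out, size_t out_len)
{
    size_t n = l->count < out_len ? l->count : out_len;

    for (size_t i = 0; i < n; i++) {
        out[i] = l->heads[i];
    }
    return n;
}
```

If STOP forces every previously seen buffer to be marked as used, a
list like this would have to be empty by the time STOP is reported,
which is the incompatibility described above.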

Stefan



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-23 16:19       ` Eugenio Perez Martin
@ 2021-11-24 15:26         ` Stefan Hajnoczi
  2021-11-24 16:58           ` Eugenio Perez Martin
  2021-11-25  3:05         ` Jason Wang
  1 sibling, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2021-11-24 15:26 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy


On Tue, Nov 23, 2021 at 05:19:23PM +0100, Eugenio Perez Martin wrote:
> On Tue, Nov 23, 2021 at 12:33 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Nov 18, 2021 at 05:49:11PM +0100, Eugenio Perez Martin wrote:
> > > On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> > > > > This patch introduces a new status bit STOP. This can be used by the
> > > > > driver to stop the device in order to safely fetch used descriptors
> > > > > status, making sure the device will not fetch new available ones.
> > > > >
> > > > > Its main use case is live migration, although it has other orthogonal
> > > > > use cases. It can be used to safely discard requests that have not been
> > > > > used: in other words, to rewind available descriptors.
> > > >
> > > > This sounds non-trivial and would require more explanation.
> > > >
> > >
> > > You are right and it's not well explained here, I will try to develop
> > > it better for the next version. It's in the cover letter explaining
> > > one use case proposed by MST [1] and the answer by Jason [2].
> > >
> > > When the VQ is stopped, it is forced to flush all the descriptors (in
> > > this version) and the used index in the case of split. With that
> > > information, the driver can modify any available descriptor that has
> > > not been used by the device, and to rewind the virtqueue available
> > > index to an extent.
> > >
> > > If we add Jason's virtqueue state (first patch of previous version),
> > > which I need to recover in later versions, the driver can move freely
> > > available and used index prior to queue resetting.
> > >
> > > The (in comments) proposed solution for the device that cannot flush
> > > its descriptor timely is to provide it with a way to report in-flight
> > > descriptors. As a reference, vhost-user has done it before, but with a
> > > memory region shared by a file descriptor [3]. If we add something
> > > similar, the driver still knows what file descriptors is able to
> > > rewind.
> > >
> > > Does that explanation make the driver rewind use case more clear?
> >
> > I understand the use case but it's not clear how the mechanism is
> > supposed to work. Let's continue discussing in the sub-thread where I
> > posted details.
> >
> 
> Ok!
> 
> > >
> > > > > Stopping the device in the live migration context is done by per-device
> > > > > operations in vhost backends, but the introduction of STOP as a basic
> > > > > virtio facility comes with advantages:
> > > > > * All the device virtio-specific state is summarized in a single entity,
> > > > >   making easier to reason about it.
> > > >
> > > > This point isn't clear to me. I think it's saying that using STOP
> > > > somehow unifies things compared to the way that vhost devices are
> > > > stopped today. Given that vhost already syncs state back to the VMM's
> > > > emulated VIRTIO device, I'm not sure how STOP is different.
> > > >
> > >
> > > It also achieves that, but it's more related to the fact that the
> > > current way of getting the index through vhost net, user, ... is not
> > > reusable by others methods to expose the device to VMM.
> >
> > ->vhost_get_vring_base() is a common interface across vhost-kernel,
> > vhost-user, etc and all VIRTIO Device Types. Is there a problem with it
> > that I'm missing?
> >
> 
> It is usable by all VIRTIO Device types *exposed through vhost*, so it
> fails to address the case when we cannot use vhost to expose the
> device. It can happen with the cases you explained better than me in
> [1], or when exposing a vdpa device as a pure virtio device using
> virtio_vdpa.

I see. Maybe you can describe that motivation in the cover letter; I
didn't get it until much later in our discussion.

> 
> > "vhost net, user" is confusing. Do you mean vhost-kernel and vhost-user
> > or do you mean vhost-net and vhost-user-net?
> >
> 
> I meant vhost-kernel and vhost-user, sorry.
> 
> > > Avoiding
> > > developing a different way to stop and get the status of each kind of
> > > device helps others devices implementations out of VMM.
> >
> > What does "kind of device" mean? I think you mean vhost device types
> > like net, scsi, blk, vsock, etc (a subset of VIRTIO Device Types that
> > have been defined for vhost). As you say, they have different stop
> > operations, but it's not true that getting the status of a vring is
> > different for each one (they all use ->vhost_get_vring_base()).
> >
> 
> Reading that way I meant all VIRTIO Device Types, since this proposal
> also addresses them even if they don't use vhost-*.
> 
> I said "stop and get the status" as a the operation, but now I see
> it's confusing, and I meant mostly stop as you say.
> 
> > >
> > > What you mention has more to do with the next bullet point.
> > >
> > > > > * VMM does not need to implement device specific operations in the
> > > > >   driver part.
> > > >
> > > > What is the "driver part"?
> > > >
> > >
> > > The part of qemu that talks to devices using virtio through (for
> > > example) vhost messages. This set features, get features, etc.
> > >
> > > Each vhost device has its own method to stop the device. In networking
> > > is setting a backend file descriptor -1, and others have their own
> > > way.
> > >
> > > Using the status field allows out of VMM to unify that part too.
> > >
> > > Maybe this one and the above would be clearer if I use vhost as
> > > examples. I will rewrite them anyhow, thanks!
> >
> > Thanks. Something like "The VMM does not need to implement a different
> > stop operation like VHOST_NET_SET_BACKEND -1 for each device type" would
> > be clearer.
> >
> 
> I will change, thanks!
> 
> > > > > * Work out of the box for devices that use pure virtio backends in some
> > > > >   part of the device emulation chain (virtio_pci_vdpa or virtio_vdpa),
> > > > >   in any transport the device can use.
> > > >
> > > > ?
> > >
> > > vp_vdpa makes a standard virtio device exposed as a vdpa one. This
> > > implies that each of the vhost commands sent to vhost-vdpa needs to be
> > > converted to standard virtio request if it needs to reach the actual
> > > device. But there is currently no way to stop the device or retrieve
> > > its state using just virtio.
> > >
> > > Because of that, it's also usable by a pure virtio device, like in the
> > > case of vdpa devices using virtio_vdpa or other devices that simply
> > > exposes itself as a virtio one with no further facilities.
> > >
> > > It is also not restricted by the transport you use to expose the
> > > virtio: PCI, MMIO, etc, since you need to perform operations already
> > > defined by any usable device (set and get the status).
> >
> 
> [1]
> > I see. This says STOP needs to be an in-band VIRTIO operation so that
> > vDPA/vhost can be stacked on top of VIRTIO devices. If STOP was only a
> > vhost operation then it wouldn't be possible to forward it to underlying
> > VIRTIO devices.
> >
> 
> I will use that to clarify the point, thanks!
> 
> I think that all the points overlap too much, so I will try to rewrite
> differently for the next version. Thank you very much for the
> feedback!

Sorry that I've been insisting on all these details. I was worried that
I'm missing the motivation for this feature or misunderstanding it.

Stefan



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-24 11:20           ` Stefan Hajnoczi
@ 2021-11-24 16:41             ` Eugenio Perez Martin
  2021-11-29 10:32               ` Stefan Hajnoczi
  2021-11-25  2:57             ` Jason Wang
  1 sibling, 1 reply; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-24 16:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Wed, Nov 24, 2021 at 12:21 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > >
> > > > > I'm trying to understand how this would work. Available buffers may be
> > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > > avail ring could contain something like:
> > > > >
> > > > >   avail.ring = [Used, Not used, Used, Not used, ...]
> > > > >                                                 ^--- avail.idx
> > > > >
> > > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > > used" in avail.ring.
> > > > >
> > > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > > num_not_used "Not used" buffers?
> > > > >
> > > >
> > > > I'm going to also drop the "resume" part for the next version because
> > > > it adds extra complexity not actually needed, and it can be achieved
> > > > with a full reset in a simpler way.
> > > >
> > > > But I'll explain it below with your examples. Long story short, the
> > > > driver only can rewind the available descriptors that are still in the
> > > > available ring, and the device must flush the ones that cannot recover
> > > > from the ring.
> > > >
> > > > > I think there is a known issue with this approach:
> > > > >
> > > > > Imagine a vring with 4 elements:
> > > > >
> > > > >   avail.ring = [0,        1,    2,    3   ]
> > > > >                 Not used, used, used, used
> > > > >                                            ^--- avail.idx
> > > > >
> > > > > Since the device has used 3 buffers the driver now has space to make
> > > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > > and the driver overwrites the first element ("Not used"):
> > > > >
> > > > >   avail.ring = [1,        N/A,  N/A,  N/A]
> > > > >                 Not used, N/A,  N/A,  N/A
> > > > >                          ^--- avail.idx
> > > > >
> > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > > for the new available buffer.
> > > > >
> > > > > Now we stop the device, knowing there are two buffers available that
> > > > > have not been used. But avail.ring[] actually only contains the new
> > > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > > the old avail.ring[] element (vring descriptor 0).
> > > > >
> > > > > What now? Where does the device reset its internal avail_idx to?
> > > >
> > > > To be on the same page, in qemu the device maintains two "internal
> > > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > > it can stop.
> > > >
> > > > The proposal allows the device to fail descriptor 0 in a
> > > > device-specific way, but I think now it was a bad choice.
> > > >
> > > > The driver cannot move the device's last_avail_idx in this operation:
> > > > The device is simply forced to flush used ones to the used ring or
> > > > descriptor ring in a packed vq case. So the device's internal
> > > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > > >
> > > > The device must keep its last_avail_idx through stop and resume cycle.
> > >
> > > Are you saying that all buffers avail->ring[i % ring_size] must be
> > > completed by the device before the STOP bit is reported where i <=
> > > last_avail_idx?
> > >
> > > This means the driver can modify avail->ring[i % ring_size] where
> > > avail_idx >= i > used_idx.
> > >
> >
> > Yes, That's correct. The driver could also decide to modify the
> > descriptor table instead of the avail ring to do so, but I think the
> > point is clear now.
> >
> > Somehow it is thought after the premise that the out of order
>
> "Somehow it is thought after the premise" == "there is a fundamental
> design assumption"?
>

Well, there are always design assumptions :). I didn't see it as
fundamental at the time of sending it, when I didn't consider
"idempotent in-flight descriptors" as something I could take for
granted. So I thought of it as the best we could do with a backend
that must wait for them, and without introducing other complicated
things (such as in-flight descriptor reporting).

> > descriptors are descriptors that the device must wait to complete
> > before the pause anyway. Depending on the device, it might prefer to
> > cancel them, to wait for them, etc. The interesting descriptors to
> > rewind are the ones that have not reached the device (i > used_idx).
> > The driver can do whatever it wants with them.
> >
> > If we assume all the in-flight descriptors are idempotent and we
> > expose a way for the device to expose them, the model is way more
> > simpler than this.
>
> The constraint that the device has to mark all previously seen "avail"
> buffers as "used" is problematic. It makes STOP visible to the driver
> when the device has to fail requests. That is incompatible with how
> devices behave across live migration today. If you want to use STOP for
> live migration then it's probably necessary to rethink this constraint.
>
> QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> a list so they can be retried after problems with the underlying storage
> have been resolved (e.g. more disk space becomes available and ENOSPC
> requests can be retried). Based on the constraints you described, those
> requests cannot be kept in the list across STOP.
>

I didn't know about that, thanks for the information! Can the vhost
devices do the same?

> QEMU live migration sends the retry list to the migration destination. I
> think you're saying that won't be possible when STOP is used to
> implement live migration?
>

Without out-of-order descriptor usage, the device can simply not mark
them as used, and the destination will retry them. Would that work?

In the out-of-order case, this proposal does not cover it 100%, but
the next one will.
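
A minimal C sketch of the in-order case, assuming a split virtqueue
with 16-bit free-running indices. The helper names are hypothetical.
With in-order usage, everything from used_idx up to (but not including)
avail_idx is still pending, so the destination can start fetching again
from used_idx. The last helper checks the bound from the patch text: a
rewound avail_idx must lie within used_idx and used_idx plus the
virtqueue size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Buffers made available but not yet used. The uint16_t arithmetic
 * wraps around the same way the ring indices do. */
static uint16_t ex_pending(uint16_t avail_idx, uint16_t used_idx)
{
    return (uint16_t)(avail_idx - used_idx);
}

/* In-order case: the destination re-fetches starting at used_idx. */
static uint16_t ex_resume_last_avail_idx(uint16_t used_idx)
{
    return used_idx;
}

/* Check used_idx <= new_avail_idx <= used_idx + vq_size, modulo 2^16. */
static bool ex_rewind_ok(uint16_t new_avail_idx, uint16_t used_idx,
                         uint16_t vq_size)
{
    assert(vq_size != 0);
    return ex_pending(new_avail_idx, used_idx) <= vq_size;
}
```

For example, with used_idx = 3 and a virtqueue size of 4, any
avail_idx from 3 to 7 passes the check, while 8 or 2 does not.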

> That would be a shame since one of the ways to resolve I/O errors is by
> migrating to another host :).
>

I totally agree.

Thanks!

> Stefan



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-24 15:26         ` Stefan Hajnoczi
@ 2021-11-24 16:58           ` Eugenio Perez Martin
  0 siblings, 0 replies; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-24 16:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Wed, Nov 24, 2021 at 4:26 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Nov 23, 2021 at 05:19:23PM +0100, Eugenio Perez Martin wrote:
> > On Tue, Nov 23, 2021 at 12:33 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Nov 18, 2021 at 05:49:11PM +0100, Eugenio Perez Martin wrote:
> > > > On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> > > > > > This patch introduces a new status bit STOP. This can be used by the
> > > > > > driver to stop the device in order to safely fetch used descriptors
> > > > > > status, making sure the device will not fetch new available ones.
> > > > > >
> > > > > > Its main use case is live migration, although it has other orthogonal
> > > > > > use cases. It can be used to safely discard requests that have not been
> > > > > > used: in other words, to rewind available descriptors.
> > > > >
> > > > > This sounds non-trivial and would require more explanation.
> > > > >
> > > >
> > > > You are right and it's not well explained here, I will try to develop
> > > > it better for the next version. It's in the cover letter explaining
> > > > one use case proposed by MST [1] and the answer by Jason [2].
> > > >
> > > > When the VQ is stopped, it is forced to flush all the descriptors (in
> > > > this version) and the used index in the case of split. With that
> > > > information, the driver can modify any available descriptor that has
> > > > not been used by the device, and to rewind the virtqueue available
> > > > index to an extent.
> > > >
> > > > If we add Jason's virtqueue state (first patch of previous version),
> > > > which I need to recover in later versions, the driver can move freely
> > > > available and used index prior to queue resetting.
> > > >
> > > > The (in comments) proposed solution for the device that cannot flush
> > > > its descriptor timely is to provide it with a way to report in-flight
> > > > descriptors. As a reference, vhost-user has done it before, but with a
> > > > memory region shared by a file descriptor [3]. If we add something
> > > > similar, the driver still knows what file descriptors is able to
> > > > rewind.
> > > >
> > > > Does that explanation make the driver rewind use case more clear?
> > >
> > > I understand the use case but it's not clear how the mechanism is
> > > supposed to work. Let's continue discussing in the sub-thread where I
> > > posted details.
> > >
> >
> > Ok!
> >
> > > >
> > > > > > Stopping the device in the live migration context is done by per-device
> > > > > > operations in vhost backends, but the introduction of STOP as a basic
> > > > > > virtio facility comes with advantages:
> > > > > > * All the device virtio-specific state is summarized in a single entity,
> > > > > >   making easier to reason about it.
> > > > >
> > > > > This point isn't clear to me. I think it's saying that using STOP
> > > > > somehow unifies things compared to the way that vhost devices are
> > > > > stopped today. Given that vhost already syncs state back to the VMM's
> > > > > emulated VIRTIO device, I'm not sure how STOP is different.
> > > > >
> > > >
> > > > It also achieves that, but it's more related to the fact that the
> > > > current way of getting the index through vhost net, user, ... is not
> > > > reusable by others methods to expose the device to VMM.
> > >
> > > ->vhost_get_vring_base() is a common interface across vhost-kernel,
> > > vhost-user, etc and all VIRTIO Device Types. Is there a problem with it
> > > that I'm missing?
> > >
> >
> > It is usable by all VIRTIO Device types *exposed through vhost*, so it
> > fails to address the case when we cannot use vhost to expose the
> > device. It can happen with the cases you explained better than me in
> > [1], or when exposing a vdpa device as a pure virtio device using
> > virtio_vdpa.
>
> I see. Maybe you can describe that motivation in the cover letter, I
> didn't get it until much later in our discussion.
>

Good point. I will clarify it.

> >
> > > "vhost net, user" is confusing. Do you mean vhost-kernel and vhost-user
> > > or do you mean vhost-net and vhost-user-net?
> > >
> >
> > I meant vhost-kernel and vhost-user, sorry.
> >
> > > > Avoiding
> > > > developing a different way to stop and get the status of each kind of
> > > > device helps others devices implementations out of VMM.
> > >
> > > What does "kind of device" mean? I think you mean vhost device types
> > > like net, scsi, blk, vsock, etc (a subset of VIRTIO Device Types that
> > > have been defined for vhost). As you say, they have different stop
> > > operations, but it's not true that getting the status of a vring is
> > > different for each one (they all use ->vhost_get_vring_base()).
> > >
> >
> > Reading that way I meant all VIRTIO Device Types, since this proposal
> > also addresses them even if they don't use vhost-*.
> >
> > I said "stop and get the status" as a the operation, but now I see
> > it's confusing, and I meant mostly stop as you say.
> >
> > > >
> > > > What you mention has more to do with the next bullet point.
> > > >
> > > > > > * VMM does not need to implement device specific operations in the
> > > > > >   driver part.
> > > > >
> > > > > What is the "driver part"?
> > > > >
> > > >
> > > > The part of qemu that talks to devices using virtio through (for
> > > > example) vhost messages. This set features, get features, etc.
> > > >
> > > > Each vhost device has its own method to stop the device. In networking
> > > > is setting a backend file descriptor -1, and others have their own
> > > > way.
> > > >
> > > > Using the status field allows out of VMM to unify that part too.
> > > >
> > > > Maybe this one and the above would be clearer if I use vhost as
> > > > examples. I will rewrite them anyhow, thanks!
> > >
> > > Thanks. Something like "The VMM does not need to implement a different
> > > stop operation like VHOST_NET_SET_BACKEND -1 for each device type" would
> > > be clearer.
> > >
> >
> > I will change, thanks!
> >
> > > > > > * Work out of the box for devices that use pure virtio backends in some
> > > > > >   part of the device emulation chain (virtio_pci_vdpa or virtio_vdpa),
> > > > > >   in any transport the device can use.
> > > > >
> > > > > ?
> > > >
> > > > vp_vdpa makes a standard virtio device exposed as a vdpa one. This
> > > > implies that each of the vhost commands sent to vhost-vdpa needs to be
> > > > converted to standard virtio request if it needs to reach the actual
> > > > device. But there is currently no way to stop the device or retrieve
> > > > its state using just virtio.
> > > >
> > > > Because of that, it's also usable by a pure virtio device, like in the
> > > > case of vdpa devices using virtio_vdpa or other devices that simply
> > > > exposes itself as a virtio one with no further facilities.
> > > >
> > > > It is also not restricted by the transport you use to expose the
> > > > virtio: PCI, MMIO, etc, since you need to perform operations already
> > > > defined by any usable device (set and get the status).
> > >
> >
> > [1]
> > > I see. This says STOP needs to be an in-band VIRTIO operation so that
> > > vDPA/vhost can be stacked on top of VIRTIO devices. If STOP was only a
> > > vhost operation then it wouldn't be possible to forward it to underlying
> > > VIRTIO devices.
> > >
> >
> > I will use that to clarify the point, thanks!
> >
> > I think that all the points overlap too much, so I will try to rewrite
> > differently for the next version. Thank you very much for the
> > feedback!
>
> Sorry that I've been insisting on all these details. I was worried that
> I'm missing the motivation for this feature or misunderstanding it.
>

On the contrary: although the proposed features and changes are
relatively small, it's hard to place them in the big picture and
introduce them while keeping everything balanced.

These questions and comments help a lot to focus and clarify the
proposal, so thank you very much for them!

> Stefan



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-24 11:20           ` Stefan Hajnoczi
  2021-11-24 16:41             ` Eugenio Perez Martin
@ 2021-11-25  2:57             ` Jason Wang
  2021-11-29 10:29               ` Stefan Hajnoczi
  1 sibling, 1 reply; 43+ messages in thread
From: Jason Wang @ 2021-11-25  2:57 UTC (permalink / raw)
  To: Stefan Hajnoczi, Eugenio Perez Martin
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Alexander Mikheev,
	Shahaf Shuler, Oren Duer, Halil Pasic, Cornelia Huck,
	Bodong Wang, Dr . David Alan Gilbert, Parav Pandit, Max Gurtovoy


On 2021/11/24 7:20 PM, Stefan Hajnoczi wrote:
> On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
>> On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
>>>> On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
>>>>>> +the driver MAY change avail_idx in the case of split virtqueue, but the new
>>>>>> +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
>>>>> I'm trying to understand how this would work. Available buffers may be
>>>>> consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
>>>>> avail ring could contain something like:
>>>>>
>>>>>    avail.ring = [Used, Not used, Used, Not used, ...]
>>>>>                                                  ^--- avail.idx
>>>>>
>>>>> There are num_not_used = avail.idx - used.idx requests that are "Not
>>>>> used" in avail.ring.
>>>>>
>>>>> Does this mean the driver can rewind avail.idx by counting the number of
>>>>> "Not used" buffers and skipping "Used" buffers until it reaches
>>>>> num_not_used "Not used" buffers?
>>>>>
>>>> I'm going to also drop the "resume" part for the next version because
>>>> it adds extra complexity not actually needed, and it can be achieved
>>>> with a full reset in a simpler way.
>>>>
>>>> But I'll explain it below with your examples. Long story short, the
>>>> driver only can rewind the available descriptors that are still in the
>>>> available ring, and the device must flush the ones that cannot recover
>>>> from the ring.
>>>>
>>>>> I think there is a known issue with this approach:
>>>>>
>>>>> Imagine a vring with 4 elements:
>>>>>
>>>>>    avail.ring = [0,        1,    2,    3   ]
>>>>>                  Not used, used, used, used
>>>>>                                             ^--- avail.idx
>>>>>
>>>>> Since the device has used 3 buffers the driver now has space to make
>>>>> more buffers available. avail.idx wraps back to the start of the ring
>>>>> and the driver overwrites the first element ("Not used"):
>>>>>
>>>>>    avail.ring = [1,        N/A,  N/A,  N/A]
>>>>>                  Not used, N/A,  N/A,  N/A
>>>>>                           ^--- avail.idx
>>>>>
>>>>> Since vring descriptor 0 is still in use the driver chose descriptor 1
>>>>> for the new available buffer.
>>>>>
>>>>> Now we stop the device, knowing there are two buffers available that
>>>>> have not been used. But avail.ring[] actually only contains the new
>>>>> buffer (vring descriptor 1) that we made available because we overwrote
>>>>> the old avail.ring[] element (vring descriptor 0).
>>>>>
>>>>> What now? Where does the device reset its internal avail_idx to?
>>>> To be on the same page, in qemu the device maintains two "internal
>>>> avail idx": shadow_avail_idx (last seen in the available ring, could
>>>> be 4 in this case) and last_avail_idx (next descriptor to fetch from
>>>> avail, 2). The device must forget shadow_avail_idx and flush the
>>>> descriptors that cannot recover (0). So last_avail_idx is now 3. Now
>>>> it can stop.
>>>>
>>>> The proposal allows the device to fail descriptor 0 in a
>>>> device-specific way, but I think now it was a bad choice.
>>>>
>>>> The driver cannot move the device's last_avail_idx in this operation:
>>>> The device is simply forced to flush used ones to the used ring or
>>>> descriptor ring in a packed vq case. So the device's internal
>>>> avail_idx == used_idx == 3. When the device resumes, it's still 3.
>>>>
>>>> The device must keep its last_avail_idx through stop and resume cycle.
>>> Are you saying that all buffers avail->ring[i % ring_size] must be
>>> completed by the device before the STOP bit is reported where i <=
>>> last_avail_idx?
>>>
>>> This means the driver can modify avail->ring[i % ring_size] where
>>> avail_idx >= i > used_idx.
>>>
>> Yes, That's correct. The driver could also decide to modify the
>> descriptor table instead of the avail ring to do so, but I think the
>> point is clear now.
>>
>> Somehow it is thought after the premise that the out of order
> "Somehow it is thought after the premise" == "there is a fundamental
> design assumption"?
>
>> descriptors are descriptors that the device must wait to complete
>> before the pause anyway. Depending on the device, it might prefer to
>> cancel them, to wait for them, etc. The interesting descriptors to
>> rewind are the ones that have not reached the device (i > used_idx).
>> The driver can do whatever it wants with them.
>>
>> If we assume all the in-flight descriptors are idempotent and we
>> expose a way for the device to expose them, the model is way more
>> simpler than this.
> The constraint that the device has to mark all previously seen "avail"
> buffers as "used" is problematic. It makes STOP visible to the driver
> when the device has to fail requests.


I think we need some clarification here about the driver. For doing
migration, some kind of mediation is a must.

As we've discussed in the previous versions of this proposal, the VMM
usually won't advertise the STOP feature to the guest if we don't want
to do nested live migration (if we do, we can shadow it anyhow).

So, from the guest's point of view, it will see neither STOP nor the
in-flight descriptors.
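
As a rough illustration of that masking, a VMM could filter the feature
bits it offers to the guest along these lines. This is only a minimal
Python sketch, not QEMU code: VIRTIO_F_STOP_HYPOTHETICAL and its bit
number are placeholders (the thread does not say which bit STOP would
use), and guest_visible_features() is a made-up helper name.

```python
# Real bit number from the virtio specification.
VIRTIO_F_VERSION_1 = 1 << 32
# Placeholder only: the thread does not say which feature bit STOP would use.
VIRTIO_F_STOP_HYPOTHETICAL = 1 << 62


def guest_visible_features(backend_features, nested_live_migration=False):
    """Return the feature bits the VMM offers to the guest."""
    offered = backend_features
    if not nested_live_migration:
        # The VMM keeps STOP for its own use and hides it from the guest.
        offered &= ~VIRTIO_F_STOP_HYPOTHETICAL
    return offered


backend = VIRTIO_F_VERSION_1 | VIRTIO_F_STOP_HYPOTHETICAL
hidden = guest_visible_features(backend)
exposed = guest_visible_features(backend, nested_live_migration=True)
```

With the default arguments the guest never negotiates STOP, so only the
VMM, acting as the driver of the backend device, ever sets it.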


> That is incompatible with how
> devices behave across live migration today. If you want to use STOP for
> live migration then it's probably necessary to rethink this constraint.
>
> QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> a list so they can be retried after problems with the underlying storage
> have been resolved (e.g. more disk space becomes available and ENOSPC
> requests can be retried).


A question: I think those "failures" are actually not visible to the
driver? If this is true then, from the spec point of view, the requests
are still in flight. The VMM may choose to migrate them to the
destination and re-submit them there. This works more like vhost
re-connection.
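
To make the "still in flight" idea concrete, here is a small Python
model of the four-entry ring example quoted above. It is a sketch under
assumptions, not how any existing device does it: SplitRingModel, its
inflight list and resubmit_on_destination() are hypothetical names, and
the side log only stands in for whatever mechanism the VMM would use to
track in-flight requests.

```python
RING_SIZE = 4


class SplitRingModel:
    """Toy split virtqueue: an avail ring plus a side log of in-flight heads."""

    def __init__(self, size=RING_SIZE):
        self.size = size
        self.avail_ring = [None] * size
        self.avail_idx = 0    # free-running 16-bit counter in a real ring
        self.inflight = []    # heads made available but not yet used

    def make_available(self, head):
        # The driver writes the head into the next slot, wrapping as needed.
        self.avail_ring[self.avail_idx % self.size] = head
        self.avail_idx = (self.avail_idx + 1) & 0xFFFF
        self.inflight.append(head)

    def mark_used(self, head):
        self.inflight.remove(head)


def resubmit_on_destination(inflight_heads, submit):
    """Re-submit every migrated in-flight head on the destination."""
    for head in inflight_heads:
        submit(head)


ring = SplitRingModel()
for head in (0, 1, 2, 3):
    ring.make_available(head)
for head in (1, 2, 3):        # completed out of order, head 0 still pending
    ring.mark_used(head)
ring.make_available(1)        # wraps around and overwrites slot 0

resubmitted = []
resubmit_on_destination(list(ring.inflight), resubmitted.append)
```

After the wrap, head 0 no longer appears anywhere in avail_ring, yet it
is still in the side log, so the destination can re-submit both 0 and 1
without the guest ever seeing a failed request.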


Thanks


> Based on the constraints you described, those
> requests cannot be kept in the list across STOP.
>
> QEMU live migration sends the retry list to the migration destination. I
> think you're saying that won't be possible when STOP is used to
> implement live migration?
>
> That would be a shame since one of the ways to resolve I/O errors is by
> migrating to another host :).
>
> Stefan



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-23 16:19       ` Eugenio Perez Martin
  2021-11-24 15:26         ` Stefan Hajnoczi
@ 2021-11-25  3:05         ` Jason Wang
  2021-11-25  7:24           ` Eugenio Perez Martin
  1 sibling, 1 reply; 43+ messages in thread
From: Jason Wang @ 2021-11-25  3:05 UTC (permalink / raw)
  To: Eugenio Perez Martin, Stefan Hajnoczi
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Alexander Mikheev,
	Shahaf Shuler, Oren Duer, Halil Pasic, Cornelia Huck,
	Bodong Wang, Dr . David Alan Gilbert, Parav Pandit, Max Gurtovoy


On 2021/11/24 at 12:19 AM, Eugenio Perez Martin wrote:
> On Tue, Nov 23, 2021 at 12:33 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Thu, Nov 18, 2021 at 05:49:11PM +0100, Eugenio Perez Martin wrote:
>>> On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>> On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
>>>>> This patch introduces a new status bit STOP. This can be used by the
>>>>> driver to stop the device in order to safely fetch used descriptors
>>>>> status, making sure the device will not fetch new available ones.
>>>>>
>>>>> Its main use case is live migration, although it has other orthogonal
>>>>> use cases. It can be used to safely discard requests that have not been
>>>>> used: in other words, to rewind available descriptors.
>>>> This sounds non-trivial and would require more explanation.
>>>>
>>> You are right and it's not well explained here, I will try to develop
>>> it better for the next version. It's in the cover letter explaining
>>> one use case proposed by MST [1] and the answer by Jason [2].
>>>
>>> When the VQ is stopped, it is forced to flush all the descriptors (in
>>> this version) and the used index in the case of split. With that
>>> information, the driver can modify any available descriptor that has
>>> not been used by the device, and to rewind the virtqueue available
>>> index to an extent.
>>>
>>> If we add Jason's virtqueue state (first patch of previous version),
>>> which I need to recover in later versions, the driver can move freely
>>> available and used index prior to queue resetting.
>>>
>>> The (in comments) proposed solution for the device that cannot flush
>>> its descriptor timely is to provide it with a way to report in-flight
>>> descriptors. As a reference, vhost-user has done it before, but with a
>>> memory region shared by a file descriptor [3]. If we add something
>>> similar, the driver still knows what file descriptors is able to
>>> rewind.
>>>
>>> Does that explanation make the driver rewind use case more clear?
>> I understand the use case but it's not clear how the mechanism is
>> supposed to work. Let's continue discussing in the sub-thread where I
>> posted details.
>>
> Ok!
>
>>>>> Stopping the device in the live migration context is done by per-device
>>>>> operations in vhost backends, but the introduction of STOP as a basic
>>>>> virtio facility comes with advantages:
>>>>> * All the device virtio-specific state is summarized in a single entity,
>>>>>    making easier to reason about it.
>>>> This point isn't clear to me. I think it's saying that using STOP
>>>> somehow unifies things compared to the way that vhost devices are
>>>> stopped today. Given that vhost already syncs state back to the VMM's
>>>> emulated VIRTIO device, I'm not sure how STOP is different.
>>>>
>>> It also achieves that, but it's more related to the fact that the
>>> current way of getting the index through vhost net, user, ... is not
>>> reusable by others methods to expose the device to VMM.
>> ->vhost_get_vring_base() is a common interface across vhost-kernel,
>> vhost-user, etc and all VIRTIO Device Types. Is there a problem with it
>> that I'm missing?
>>
> It is usable by all VIRTIO Device types *exposed through vhost*, so it
> fails to address the case when we cannot use vhost to expose the
> device. It can happen with the cases you explained better than me in
> [1], or when exposing a vdpa device as a pure virtio device using
> virtio_vdpa.


I don't get this; I may be missing something. Stop is a basic facility
that is required as a building block for live migration. We need it
regardless of what kind of software technology or framework is used.

For the case of virtio_vdpa, it's not that it can't use STOP, but that
we don't have a valid use case for it.

Thanks


>
>> "vhost net, user" is confusing. Do you mean vhost-kernel and vhost-user
>> or do you mean vhost-net and vhost-user-net?
>>
> I meant vhost-kernel and vhost-user, sorry.
>
>>> Avoiding
>>> developing a different way to stop and get the status of each kind of
>>> device helps others devices implementations out of VMM.
>> What does "kind of device" mean? I think you mean vhost device types
>> like net, scsi, blk, vsock, etc (a subset of VIRTIO Device Types that
>> have been defined for vhost). As you say, they have different stop
>> operations, but it's not true that getting the status of a vring is
>> different for each one (they all use ->vhost_get_vring_base()).
>>
> Reading that way I meant all VIRTIO Device Types, since this proposal
> also addresses them even if they don't use vhost-*.
>
> I said "stop and get the status" as a the operation, but now I see
> it's confusing, and I meant mostly stop as you say.
>
>>> What you mention has more to do with the next bullet point.
>>>
>>>>> * VMM does not need to implement device specific operations in the
>>>>>    driver part.
>>>> What is the "driver part"?
>>>>
>>> The part of qemu that talks to devices using virtio through (for
>>> example) vhost messages. This set features, get features, etc.
>>>
>>> Each vhost device has its own method to stop the device. In networking
>>> is setting a backend file descriptor -1, and others have their own
>>> way.
>>>
>>> Using the status field allows out of VMM to unify that part too.
>>>
>>> Maybe this one and the above would be clearer if I use vhost as
>>> examples. I will rewrite them anyhow, thanks!
>> Thanks. Something like "The VMM does not need to implement a different
>> stop operation like VHOST_NET_SET_BACKEND -1 for each device type" would
>> be clearer.
>>
> I will change, thanks!
>
>>>>> * Work out of the box for devices that use pure virtio backends in some
>>>>>    part of the device emulation chain (virtio_pci_vdpa or virtio_vdpa),
>>>>>    in any transport the device can use.
>>>> ?
>>> vp_vdpa makes a standard virtio device exposed as a vdpa one. This
>>> implies that each of the vhost commands sent to vhost-vdpa needs to be
>>> converted to standard virtio request if it needs to reach the actual
>>> device. But there is currently no way to stop the device or retrieve
>>> its state using just virtio.
>>>
>>> Because of that, it's also usable by a pure virtio device, like in the
>>> case of vdpa devices using virtio_vdpa or other devices that simply
>>> exposes itself as a virtio one with no further facilities.
>>>
>>> It is also not restricted by the transport you use to expose the
>>> virtio: PCI, MMIO, etc, since you need to perform operations already
>>> defined by any usable device (set and get the status).
> [1]
>> I see. This says STOP needs to be an in-band VIRTIO operation so that
>> vDPA/vhost can be stacked on top of VIRTIO devices. If STOP was only a
>> vhost operation then it wouldn't be possible to forward it to underlying
>> VIRTIO devices.
>>
> I will use that to clarify the point, thanks!
>
> I think that all the points overlap too much, so I will try to rewrite
> differently for the next version. Thank you very much for the
> feedback!
>
>> Stefan



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-25  3:05         ` Jason Wang
@ 2021-11-25  7:24           ` Eugenio Perez Martin
  2021-11-25  7:38             ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-25  7:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Virtio-Dev, virtio-comment, Michael Tsirkin,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Thu, Nov 25, 2021 at 4:05 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/11/24 at 12:19 AM, Eugenio Perez Martin wrote:
> > On Tue, Nov 23, 2021 at 12:33 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >> On Thu, Nov 18, 2021 at 05:49:11PM +0100, Eugenio Perez Martin wrote:
> >>> On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>> On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> >>>>> This patch introduces a new status bit STOP. This can be used by the
> >>>>> driver to stop the device in order to safely fetch used descriptors
> >>>>> status, making sure the device will not fetch new available ones.
> >>>>>
> >>>>> Its main use case is live migration, although it has other orthogonal
> >>>>> use cases. It can be used to safely discard requests that have not been
> >>>>> used: in other words, to rewind available descriptors.
> >>>> This sounds non-trivial and would require more explanation.
> >>>>
> >>> You are right and it's not well explained here, I will try to develop
> >>> it better for the next version. It's in the cover letter explaining
> >>> one use case proposed by MST [1] and the answer by Jason [2].
> >>>
> >>> When the VQ is stopped, it is forced to flush all the descriptors (in
> >>> this version) and the used index in the case of split. With that
> >>> information, the driver can modify any available descriptor that has
> >>> not been used by the device, and to rewind the virtqueue available
> >>> index to an extent.
> >>>
> >>> If we add Jason's virtqueue state (first patch of previous version),
> >>> which I need to recover in later versions, the driver can move freely
> >>> available and used index prior to queue resetting.
> >>>
> >>> The (in comments) proposed solution for the device that cannot flush
> >>> its descriptor timely is to provide it with a way to report in-flight
> >>> descriptors. As a reference, vhost-user has done it before, but with a
> >>> memory region shared by a file descriptor [3]. If we add something
> >>> similar, the driver still knows what file descriptors is able to
> >>> rewind.
> >>>
> >>> Does that explanation make the driver rewind use case more clear?
> >> I understand the use case but it's not clear how the mechanism is
> >> supposed to work. Let's continue discussing in the sub-thread where I
> >> posted details.
> >>
> > Ok!
> >
> >>>>> Stopping the device in the live migration context is done by per-device
> >>>>> operations in vhost backends, but the introduction of STOP as a basic
> >>>>> virtio facility comes with advantages:
> >>>>> * All the device virtio-specific state is summarized in a single entity,
> >>>>>    making easier to reason about it.
> >>>> This point isn't clear to me. I think it's saying that using STOP
> >>>> somehow unifies things compared to the way that vhost devices are
> >>>> stopped today. Given that vhost already syncs state back to the VMM's
> >>>> emulated VIRTIO device, I'm not sure how STOP is different.
> >>>>
> >>> It also achieves that, but it's more related to the fact that the
> >>> current way of getting the index through vhost net, user, ... is not
> >>> reusable by others methods to expose the device to VMM.
> >> ->vhost_get_vring_base() is a common interface across vhost-kernel,
> >> vhost-user, etc and all VIRTIO Device Types. Is there a problem with it
> >> that I'm missing?
> >>
> > It is usable by all VIRTIO Device types *exposed through vhost*, so it
> > fails to address the case when we cannot use vhost to expose the
> > device. It can happen with the cases you explained better than me in
> > [1], or when exposing a vdpa device as a pure virtio device using
> > virtio_vdpa.
>
>
> I don't get this and may miss something. The stop is a basic facility
> that is required for the building block of live migration. We need that
> regardless what kind of software technology or framework is used.
>

Yes, those are just examples using vdpa, to expose the actual
limitation that the vhost protocol has. But there can surely be other
examples with other frameworks.

And they are valid both for getting the vq index and for stopping.

> For the case of virtio_vdpa, it's not because it can't use STOP but
> because we don't have a valid use case for that.
>

I had rewinding descriptors in mind actually.

Thanks!

> Thanks
>



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-25  7:24           ` Eugenio Perez Martin
@ 2021-11-25  7:38             ` Jason Wang
  2021-11-25  9:01               ` Eugenio Perez Martin
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2021-11-25  7:38 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Virtio-Dev, virtio-comment, Michael Tsirkin,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Thu, Nov 25, 2021 at 3:25 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Thu, Nov 25, 2021 at 4:05 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2021/11/24 at 12:19 AM, Eugenio Perez Martin wrote:
> > > On Tue, Nov 23, 2021 at 12:33 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >> On Thu, Nov 18, 2021 at 05:49:11PM +0100, Eugenio Perez Martin wrote:
> > >>> On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >>>> On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> > >>>>> This patch introduces a new status bit STOP. This can be used by the
> > >>>>> driver to stop the device in order to safely fetch used descriptors
> > >>>>> status, making sure the device will not fetch new available ones.
> > >>>>>
> > >>>>> Its main use case is live migration, although it has other orthogonal
> > >>>>> use cases. It can be used to safely discard requests that have not been
> > >>>>> used: in other words, to rewind available descriptors.
> > >>>> This sounds non-trivial and would require more explanation.
> > >>>>
> > >>> You are right and it's not well explained here, I will try to develop
> > >>> it better for the next version. It's in the cover letter explaining
> > >>> one use case proposed by MST [1] and the answer by Jason [2].
> > >>>
> > >>> When the VQ is stopped, it is forced to flush all the descriptors (in
> > >>> this version) and the used index in the case of split. With that
> > >>> information, the driver can modify any available descriptor that has
> > >>> not been used by the device, and to rewind the virtqueue available
> > >>> index to an extent.
> > >>>
> > >>> If we add Jason's virtqueue state (first patch of previous version),
> > >>> which I need to recover in later versions, the driver can move freely
> > >>> available and used index prior to queue resetting.
> > >>>
> > >>> The (in comments) proposed solution for the device that cannot flush
> > >>> its descriptor timely is to provide it with a way to report in-flight
> > >>> descriptors. As a reference, vhost-user has done it before, but with a
> > >>> memory region shared by a file descriptor [3]. If we add something
> > >>> similar, the driver still knows what file descriptors is able to
> > >>> rewind.
> > >>>
> > >>> Does that explanation make the driver rewind use case more clear?
> > >> I understand the use case but it's not clear how the mechanism is
> > >> supposed to work. Let's continue discussing in the sub-thread where I
> > >> posted details.
> > >>
> > > Ok!
> > >
> > >>>>> Stopping the device in the live migration context is done by per-device
> > >>>>> operations in vhost backends, but the introduction of STOP as a basic
> > >>>>> virtio facility comes with advantages:
> > >>>>> * All the device virtio-specific state is summarized in a single entity,
> > >>>>>    making easier to reason about it.
> > >>>> This point isn't clear to me. I think it's saying that using STOP
> > >>>> somehow unifies things compared to the way that vhost devices are
> > >>>> stopped today. Given that vhost already syncs state back to the VMM's
> > >>>> emulated VIRTIO device, I'm not sure how STOP is different.
> > >>>>
> > >>> It also achieves that, but it's more related to the fact that the
> > >>> current way of getting the index through vhost net, user, ... is not
> > >>> reusable by others methods to expose the device to VMM.
> > >> ->vhost_get_vring_base() is a common interface across vhost-kernel,
> > >> vhost-user, etc and all VIRTIO Device Types. Is there a problem with it
> > >> that I'm missing?
> > >>
> > > It is usable by all VIRTIO Device types *exposed through vhost*, so it
> > > fails to address the case when we cannot use vhost to expose the
> > > device. It can happen with the cases you explained better than me in
> > > [1], or when exposing a vdpa device as a pure virtio device using
> > > virtio_vdpa.
> >
> >
> > I don't get this and may miss something. The stop is a basic facility
> > that is required for the building block of live migration. We need that
> > regardless what kind of software technology or framework is used.
> >
>
> Yes, those are just examples using vdpa, to expose the actual
> limitation that vhost protocol has. But there can be other examples
> with other frameworks for sure.
>
> And they are valid for both getting vq index and stop.
>
> > For the case of virtio_vdpa, it's not because it can't use STOP but
> > because we don't have a valid use case for that.
> >
>
> I had rewinding descriptors in mind actually.

Yes, STOP could be one of the building blocks for this. But it
actually requires more, e.g. the ability to let the driver change
the last_avail_idx in the device. And it depends on whether we have
other requirements during the rewind, e.g. whether we can afford to
stop the whole device or just a specific virtqueue. If it's the
latter, we have another special case (rewind to 0): Ali is proposing
a per-virtqueue reset, which could be used to detach unused buffers
safely from a specific queue (then it's something unrelated to STOP
itself).

So I think STOP + virtqueue index state should be sufficient for
rewind if we can afford to stop the whole device (which might not be
the case for Ali's proposal).
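
A minimal Python sketch of how STOP and a writable index state could
fit together is below. It is an illustration under assumptions, not an
interface from the proposal: StoppableQueueModel and its methods are
hypothetical names, the bound check borrows the "within used_idx and
used_idx plus virtqueue size" wording from the quoted patch, and which
values a real device would accept is not settled in this thread.

```python
IDX_MASK = 0xFFFF  # split virtqueue indices are free-running 16-bit counters


def idx_distance(newer, older):
    """Distance from older to newer, modulo 2**16."""
    return (newer - older) & IDX_MASK


class StoppableQueueModel:
    """Toy device-side view of one virtqueue's index state."""

    def __init__(self, queue_size, last_avail_idx, used_idx):
        self.queue_size = queue_size
        self.last_avail_idx = last_avail_idx
        self.used_idx = used_idx
        self.stopped = False

    def stop(self):
        # Stands in for the driver setting STOP and the device acknowledging it.
        self.stopped = True

    def set_last_avail_idx(self, new_idx):
        if not self.stopped:
            raise RuntimeError("index state may only change while stopped")
        if idx_distance(new_idx, self.used_idx) > self.queue_size:
            raise ValueError("index outside [used_idx, used_idx + queue_size]")
        self.last_avail_idx = new_idx & IDX_MASK


vq = StoppableQueueModel(queue_size=4, last_avail_idx=1, used_idx=65534)
vq.stop()
vq.set_last_avail_idx(65535)   # rewind across the 16-bit wrap
```

The modular distance matters because the indices wrap at 65536: a plain
"used_idx <= new_idx" comparison would wrongly reject index 1 when
used_idx is 65534.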

Thanks

>
> Thanks!
>
> > Thanks
> >
>



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-25  7:38             ` Jason Wang
@ 2021-11-25  9:01               ` Eugenio Perez Martin
  2021-11-25  9:10                 ` Eugenio Perez Martin
       [not found]                 ` <CACGkMEvD+Z7cYszhMzBsnEaC0K0kfnHxzFDEfjT_qLOFiMR-XA@mail.gmail.com>
  0 siblings, 2 replies; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-25  9:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Virtio-Dev, virtio-comment, Michael Tsirkin,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Thu, Nov 25, 2021 at 8:38 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Nov 25, 2021 at 3:25 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Thu, Nov 25, 2021 at 4:05 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > >
> > > On 2021/11/24 at 12:19 AM, Eugenio Perez Martin wrote:
> > > > On Tue, Nov 23, 2021 at 12:33 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >> On Thu, Nov 18, 2021 at 05:49:11PM +0100, Eugenio Perez Martin wrote:
> > > >>> On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >>>> On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> > > >>>>> This patch introduces a new status bit STOP. This can be used by the
> > > >>>>> driver to stop the device in order to safely fetch used descriptors
> > > >>>>> status, making sure the device will not fetch new available ones.
> > > >>>>>
> > > >>>>> Its main use case is live migration, although it has other orthogonal
> > > >>>>> use cases. It can be used to safely discard requests that have not been
> > > >>>>> used: in other words, to rewind available descriptors.
> > > >>>> This sounds non-trivial and would require more explanation.
> > > >>>>
> > > >>> You are right and it's not well explained here, I will try to develop
> > > >>> it better for the next version. It's in the cover letter explaining
> > > >>> one use case proposed by MST [1] and the answer by Jason [2].
> > > >>>
> > > >>> When the VQ is stopped, it is forced to flush all the descriptors (in
> > > >>> this version) and the used index in the case of split. With that
> > > >>> information, the driver can modify any available descriptor that has
> > > >>> not been used by the device, and to rewind the virtqueue available
> > > >>> index to an extent.
> > > >>>
> > > >>> If we add Jason's virtqueue state (first patch of previous version),
> > > >>> which I need to recover in later versions, the driver can move freely
> > > >>> available and used index prior to queue resetting.
> > > >>>
> > > >>> The (in comments) proposed solution for the device that cannot flush
> > > >>> its descriptor timely is to provide it with a way to report in-flight
> > > >>> descriptors. As a reference, vhost-user has done it before, but with a
> > > >>> memory region shared by a file descriptor [3]. If we add something
> > > >>> similar, the driver still knows what file descriptors is able to
> > > >>> rewind.
> > > >>>
> > > >>> Does that explanation make the driver rewind use case more clear?
> > > >> I understand the use case but it's not clear how the mechanism is
> > > >> supposed to work. Let's continue discussing in the sub-thread where I
> > > >> posted details.
> > > >>
> > > > Ok!
> > > >
> > > >>>>> Stopping the device in the live migration context is done by per-device
> > > >>>>> operations in vhost backends, but the introduction of STOP as a basic
> > > >>>>> virtio facility comes with advantages:
> > > >>>>> * All the device virtio-specific state is summarized in a single entity,
> > > >>>>>    making easier to reason about it.
> > > >>>> This point isn't clear to me. I think it's saying that using STOP
> > > >>>> somehow unifies things compared to the way that vhost devices are
> > > >>>> stopped today. Given that vhost already syncs state back to the VMM's
> > > >>>> emulated VIRTIO device, I'm not sure how STOP is different.
> > > >>>>
> > > >>> It also achieves that, but it's more related to the fact that the
> > > >>> current way of getting the index through vhost net, user, ... is not
> > > >>> reusable by others methods to expose the device to VMM.
> > > >> ->vhost_get_vring_base() is a common interface across vhost-kernel,
> > > >> vhost-user, etc and all VIRTIO Device Types. Is there a problem with it
> > > >> that I'm missing?
> > > >>
> > > > It is usable by all VIRTIO Device types *exposed through vhost*, so it
> > > > fails to address the case when we cannot use vhost to expose the
> > > > device. It can happen with the cases you explained better than me in
> > > > [1], or when exposing a vdpa device as a pure virtio device using
> > > > virtio_vdpa.
> > >
> > >
> > > I don't get this and may miss something. The stop is a basic facility
> > > that is required for the building block of live migration. We need that
> > > regardless what kind of software technology or framework is used.
> > >
> >
> > Yes, those are just examples using vdpa, to expose the actual
> > limitation that vhost protocol has. But there can be other examples
> > with other frameworks for sure.
> >
> > And they are valid for both getting vq index and stop.
> >
> > > For the case of virtio_vdpa, it's not because it can't use STOP but
> > > because we don't have a valid use case for that.
> > >
> >
> > I had rewinding descriptors in mind actually.
>
> Yes, STOP could be one of the building blocks for this. But it
> actually requires more. e.g the ability to let the driver to change
> the last_avail_idx in the device.

I think the guest can stop and rewind without changing
last_avail_idx.

The guest must already know what buffers are available for its normal
operation. In case of stop & reset, the driver knows that the next
last_avail_idx will be 0, so it just needs to set the available
descriptors it does *not* want to rewind in the virtqueue and change
avail_idx accordingly.
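
Something like the following Python sketch, where republish_after_reset()
and its arguments are hypothetical names and the avail ring is modelled
as a plain list of descriptor heads:

```python
def republish_after_reset(pending_heads, keep, queue_size):
    """Rebuild the avail ring after stop & reset, starting from index 0.

    pending_heads: heads the driver had made available but that were not used.
    keep: predicate selecting the heads the driver does *not* want to rewind.
    Returns (avail_ring, avail_idx).
    """
    kept = [head for head in pending_heads if keep(head)]
    if len(kept) > queue_size:
        raise ValueError("more buffers than ring slots")
    # After reset the device starts fetching at 0, so the kept heads go first.
    avail_ring = kept + [None] * (queue_size - len(kept))
    avail_idx = len(kept)
    return avail_ring, avail_idx


ring, idx = republish_after_reset(
    [5, 2, 7], keep=lambda head: head != 2, queue_size=4
)
```

Here head 2 is the rewound buffer: it is simply not placed in the ring
again, and avail_idx ends up as the number of buffers that were kept.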

Then we have the opposite problem of the live migration: What do "in
flight" descriptors mean here? For example, for block, is it valid to
rewind a write descriptor that is in-flight?

> And it depends on if we had other
> requirements during the rewind, e.g can we afford to stop the whole
> device or just a specific virtqueue. If it's the latter, we had
> another special example (rewind to 0), Ali is proposing per virtqueue
> reset which could be used to death unused buffers safely from a
> specific queue (then it's something unrelated to STOP itself).
>
> So I think STOP + virtqueue index state should be sufficient for
> rewind if we can afford to stop the whole device (which might not be
> the case for Ali's proposal).
>

Totally agree on this part.

Thanks!

> Thanks
>
> >
> > Thanks!
> >
> > > Thanks
> > >
> >
>



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
  2021-11-25  9:01               ` Eugenio Perez Martin
@ 2021-11-25  9:10                 ` Eugenio Perez Martin
       [not found]                 ` <CACGkMEvD+Z7cYszhMzBsnEaC0K0kfnHxzFDEfjT_qLOFiMR-XA@mail.gmail.com>
  1 sibling, 0 replies; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-25  9:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Virtio-Dev, virtio-comment, Michael Tsirkin,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Thu, Nov 25, 2021 at 10:01 AM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Thu, Nov 25, 2021 at 8:38 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, Nov 25, 2021 at 3:25 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Thu, Nov 25, 2021 at 4:05 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > >
> > > > > On 2021/11/24 at 12:19 AM, Eugenio Perez Martin wrote:
> > > > > On Tue, Nov 23, 2021 at 12:33 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >> On Thu, Nov 18, 2021 at 05:49:11PM +0100, Eugenio Perez Martin wrote:
> > > > >>> On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >>>> On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> > > > >>>>> This patch introduces a new status bit STOP. This can be used by the
> > > > >>>>> driver to stop the device in order to safely fetch used descriptors
> > > > >>>>> status, making sure the device will not fetch new available ones.
> > > > >>>>>
> > > > >>>>> Its main use case is live migration, although it has other orthogonal
> > > > >>>>> use cases. It can be used to safely discard requests that have not been
> > > > >>>>> used: in other words, to rewind available descriptors.
> > > > >>>> This sounds non-trivial and would require more explanation.
> > > > >>>>
> > > > >>> You are right and it's not well explained here, I will try to develop
> > > > >>> it better for the next version. It's in the cover letter explaining
> > > > >>> one use case proposed by MST [1] and the answer by Jason [2].
> > > > >>>
> > > > >>> When the VQ is stopped, it is forced to flush all the descriptors (in
> > > > >>> this version) and the used index in the case of split. With that
> > > > >>> information, the driver can modify any available descriptor that has
> > > > >>> not been used by the device, and to rewind the virtqueue available
> > > > >>> index to an extent.
> > > > >>>
> > > > >>> If we add Jason's virtqueue state (first patch of previous version),
> > > > >>> which I need to recover in later versions, the driver can move freely
> > > > >>> available and used index prior to queue resetting.
> > > > >>>
> > > > >>> The (in comments) proposed solution for the device that cannot flush
> > > > >>> its descriptor timely is to provide it with a way to report in-flight
> > > > >>> descriptors. As a reference, vhost-user has done it before, but with a
> > > > >>> memory region shared by a file descriptor [3]. If we add something
> > > > >>> similar, the driver still knows what file descriptors is able to
> > > > >>> rewind.
> > > > >>>
> > > > >>> Does that explanation make the driver rewind use case more clear?
> > > > >> I understand the use case but it's not clear how the mechanism is
> > > > >> supposed to work. Let's continue discussing in the sub-thread where I
> > > > >> posted details.
> > > > >>
> > > > > Ok!
> > > > >
> > > > >>>>> Stopping the device in the live migration context is done by per-device
> > > > >>>>> operations in vhost backends, but the introduction of STOP as a basic
> > > > >>>>> virtio facility comes with advantages:
> > > > >>>>> * All the device virtio-specific state is summarized in a single entity,
> > > > >>>>>    making easier to reason about it.
> > > > >>>> This point isn't clear to me. I think it's saying that using STOP
> > > > >>>> somehow unifies things compared to the way that vhost devices are
> > > > >>>> stopped today. Given that vhost already syncs state back to the VMM's
> > > > >>>> emulated VIRTIO device, I'm not sure how STOP is different.
> > > > >>>>
> > > > >>> It also achieves that, but it's more related to the fact that the
> > > > >>> current way of getting the index through vhost net, user, ... is not
> > > > >>> reusable by others methods to expose the device to VMM.
> > > > >> ->vhost_get_vring_base() is a common interface across vhost-kernel,
> > > > >> vhost-user, etc and all VIRTIO Device Types. Is there a problem with it
> > > > >> that I'm missing?
> > > > >>
> > > > > It is usable by all VIRTIO Device types *exposed through vhost*, so it
> > > > > fails to address the case when we cannot use vhost to expose the
> > > > > device. It can happen with the cases you explained better than me in
> > > > > [1], or when exposing a vdpa device as a pure virtio device using
> > > > > virtio_vdpa.
> > > >
> > > >
> > > > I don't get this and may miss something. The stop is a basic facility
> > > > that is required for the building block of live migration. We need that
> > > > regardless what kind of software technology or framework is used.
> > > >
> > >
> > > Yes, those are just examples using vdpa, to expose the actual
> > > limitation that vhost protocol has. But there can be other examples
> > > with other frameworks for sure.
> > >
> > > And they are valid for both getting vq index and stop.
> > >
> > > > For the case of virtio_vdpa, it's not because it can't use STOP but
> > > > because we don't have a valid use case for that.
> > > >
> > >
> > > I had rewinding descriptors in mind actually.
> >
> > Yes, STOP could be one of the building blocks for this. But it
> > actually requires more. e.g the ability to let the driver to change
> > the last_avail_idx in the device.
>
> I think it can be done without changing last_avail_idx for the guest
> to be able to stop and rewind.
>
> The guest must already know what buffers are available for its normal
> operation. In case of stop & reset, the driver knows that the next
> last_avail_idx will be 0, so it just needs to set the available
> descriptors it does *not* want to rewind in the virtqueue and change
> avail_idx accordingly.
>
> Then we have the opposite problem of the live migration: What do "in
> flight" descriptors mean here? For example, for block, is it valid to
> rewind a write descriptor that is in-flight?
>

PS: To be on the same page, in-flight here means descriptors in the
range (used_idx, device's last_avail_idx).
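As a rough illustration of that range, here is a minimal C sketch for a
split virtqueue. The function names are hypothetical and not part of any
patch; it only assumes that the indices are free-running 16-bit counters,
so differences have to wrap modulo 2^16. With out-of-order completion the
result is a count of outstanding buffers, not a list of ring positions.

```c
#include <assert.h>
#include <stdint.h>

/* Buffers the device has fetched from the avail ring but not yet marked
 * used. Both indices are free-running 16-bit counters, so the
 * subtraction is done modulo 2^16. */
uint16_t vq_in_flight(uint16_t last_avail_idx, uint16_t used_idx)
{
    return (uint16_t)(last_avail_idx - used_idx);
}

/* Buffers the driver made available that the device has not fetched
 * yet. These are the candidates for a driver-side rewind. */
uint16_t vq_not_fetched(uint16_t avail_idx, uint16_t last_avail_idx)
{
    return (uint16_t)(avail_idx - last_avail_idx);
}
```

For example, vq_in_flight(1, 65535) evaluates to 2: the device fetched
two buffers across the index wrap and has not completed either yet.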

> > And it depends on if we had other
> > requirements during the rewind, e.g can we afford to stop the whole
> > device or just a specific virtqueue. If it's the latter, we had
> > another special example (rewind to 0), Ali is proposing per virtqueue
> > reset which could be used to death unused buffers safely from a
> > specific queue (then it's something unrelated to STOP itself).
> >
> > So I think STOP + virtqueue index state should be sufficient for
> > rewind if we can afford to stop the whole device (which might not be
> > the case for Ali's proposal).
> >
>
> Totally agree on this part.
>
> Thanks!
>
> > Thanks
> >
> > >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > >
> >



* Re: [PATCH v3 0/2] virtio: introduce STOP status bit
       [not found]                 ` <CACGkMEvD+Z7cYszhMzBsnEaC0K0kfnHxzFDEfjT_qLOFiMR-XA@mail.gmail.com>
@ 2021-11-26  8:26                   ` Eugenio Perez Martin
  0 siblings, 0 replies; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-26  8:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Virtio-Dev, virtio-comment, Michael Tsirkin,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Fri, Nov 26, 2021 at 4:07 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Nov 25, 2021 at 5:01 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Thu, Nov 25, 2021 at 8:38 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, Nov 25, 2021 at 3:25 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Thu, Nov 25, 2021 at 4:05 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > >
> > > > > > On 2021/11/24 at 12:19 AM, Eugenio Perez Martin wrote:
> > > > > > On Tue, Nov 23, 2021 at 12:33 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > >> On Thu, Nov 18, 2021 at 05:49:11PM +0100, Eugenio Perez Martin wrote:
> > > > > >>> On Thu, Nov 18, 2021 at 3:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > >>>> On Thu, Nov 11, 2021 at 07:58:10PM +0100, Eugenio Pérez wrote:
> > > > > >>>>> This patch introduces a new status bit STOP. This can be used by the
> > > > > >>>>> driver to stop the device in order to safely fetch used descriptors
> > > > > >>>>> status, making sure the device will not fetch new available ones.
> > > > > >>>>>
> > > > > >>>>> Its main use case is live migration, although it has other orthogonal
> > > > > >>>>> use cases. It can be used to safely discard requests that have not been
> > > > > >>>>> used: in other words, to rewind available descriptors.
> > > > > >>>> This sounds non-trivial and would require more explanation.
> > > > > >>>>
> > > > > >>> You are right and it's not well explained here, I will try to develop
> > > > > >>> it better for the next version. It's in the cover letter explaining
> > > > > >>> one use case proposed by MST [1] and the answer by Jason [2].
> > > > > >>>
> > > > > >>> When the VQ is stopped, it is forced to flush all the descriptors (in
> > > > > >>> this version) and the used index in the case of split. With that
> > > > > >>> information, the driver can modify any available descriptor that has
> > > > > >>> not been used by the device, and to rewind the virtqueue available
> > > > > >>> index to an extent.
> > > > > >>>
> > > > > >>> If we add Jason's virtqueue state (first patch of previous version),
> > > > > >>> which I need to recover in later versions, the driver can move freely
> > > > > >>> available and used index prior to queue resetting.
> > > > > >>>
> > > > > >>> The (in comments) proposed solution for the device that cannot flush
> > > > > >>> its descriptor timely is to provide it with a way to report in-flight
> > > > > >>> descriptors. As a reference, vhost-user has done it before, but with a
> > > > > >>> memory region shared by a file descriptor [3]. If we add something
> > > > > >>> similar, the driver still knows what file descriptors is able to
> > > > > >>> rewind.
> > > > > >>>
> > > > > >>> Does that explanation make the driver rewind use case more clear?
> > > > > >> I understand the use case but it's not clear how the mechanism is
> > > > > >> supposed to work. Let's continue discussing in the sub-thread where I
> > > > > >> posted details.
> > > > > >>
> > > > > > Ok!
> > > > > >
> > > > > >>>>> Stopping the device in the live migration context is done by per-device
> > > > > >>>>> operations in vhost backends, but the introduction of STOP as a basic
> > > > > >>>>> virtio facility comes with advantages:
> > > > > >>>>> * All the device virtio-specific state is summarized in a single entity,
> > > > > >>>>>    making easier to reason about it.
> > > > > >>>> This point isn't clear to me. I think it's saying that using STOP
> > > > > >>>> somehow unifies things compared to the way that vhost devices are
> > > > > >>>> stopped today. Given that vhost already syncs state back to the VMM's
> > > > > >>>> emulated VIRTIO device, I'm not sure how STOP is different.
> > > > > >>>>
> > > > > >>> It also achieves that, but it's more related to the fact that the
> > > > > >>> current way of getting the index through vhost net, user, ... is not
> > > > > >>> reusable by others methods to expose the device to VMM.
> > > > > >> ->vhost_get_vring_base() is a common interface across vhost-kernel,
> > > > > >> vhost-user, etc and all VIRTIO Device Types. Is there a problem with it
> > > > > >> that I'm missing?
> > > > > >>
> > > > > > It is usable by all VIRTIO Device types *exposed through vhost*, so it
> > > > > > fails to address the case when we cannot use vhost to expose the
> > > > > > device. It can happen with the cases you explained better than me in
> > > > > > [1], or when exposing a vdpa device as a pure virtio device using
> > > > > > virtio_vdpa.
> > > > >
> > > > >
> > > > > I don't get this and may miss something. The stop is a basic facility
> > > > > that is required for the building block of live migration. We need that
> > > > > regardless what kind of software technology or framework is used.
> > > > >
> > > >
> > > > Yes, those are just examples using vdpa, to expose the actual
> > > > limitation that vhost protocol has. But there can be other examples
> > > > with other frameworks for sure.
> > > >
> > > > And they are valid for both getting vq index and stop.
> > > >
> > > > > For the case of virtio_vdpa, it's not because it can't use STOP but
> > > > > because we don't have a valid use case for that.
> > > > >
> > > >
> > > > I had rewinding descriptors in mind actually.
> > >
> > > Yes, STOP could be one of the building blocks for this. But it
> > > actually requires more. e.g the ability to let the driver to change
> > > the last_avail_idx in the device.
> >
> > I think it can be done without changing last_avail_idx for the guest
> > to be able to stop and rewind.
>
> At least the driver needs to know the exact value of last_avail_idx in
> this case.
>

In this proposal the device must dump it to the used ring. In the
next version the virtqueue state basic facility will be restored.

> >
> > The guest must already know what buffers are available for its normal
> > operation. In case of stop & reset, the driver knows that the next
> > last_avail_idx will be 0, so it just needs to set the available
> > descriptors it does *not* want to rewind in the virtqueue and change
> > avail_idx accordingly.
>
> I may miss something, but in this case the device may start processing
> [0, avail_idx), is it safe?
>

The guest should also set the proper avail_idx and the descriptors
before the DRIVER_OK, so I'd say yes.

Something like:
* Driver sets STOP, so it knows it has (at least) [used_idx ==
last_avail_idx, avail_idx) descriptors for rewinding.
* Driver adds the ones it wants to keep to [0, new_avail_idx <=
avail_idx). It releases the resources of the descriptors that it wants
to rewind.
* Driver sets DRIVER_OK.
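
The steps above can be sketched in C as follows. This is only a toy
model under stated assumptions: struct toy_vq, QSIZE and toy_vq_rewind()
are hypothetical names, the device is assumed to have flushed everything
it fetched (used_idx == last_avail_idx) before reporting STOP, and the
device is assumed to restart from last_avail_idx == 0 after the reset.
Setting STOP and DRIVER_OK themselves is transport specific and left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QSIZE 8u /* toy queue size, power of two; hypothetical */

struct toy_vq {
    uint16_t avail_ring[QSIZE]; /* descriptor heads made available */
    uint16_t avail_idx;         /* driver's free-running avail index */
    uint16_t used_idx;          /* device's free-running used index */
};

/*
 * Call after the device reports STOP. The entries in
 * [used_idx, avail_idx) were never fetched by the device.
 * keep[head] tells whether descriptor chain `head` stays available.
 * Heads that are not kept are stored in released[] so the caller can
 * free their resources. Returns the number of released heads.
 */
unsigned toy_vq_rewind(struct toy_vq *vq, const bool keep[QSIZE],
                       uint16_t released[QSIZE])
{
    uint16_t pending = (uint16_t)(vq->avail_idx - vq->used_idx);
    uint16_t kept[QSIZE];
    unsigned n_kept = 0, n_released = 0;

    assert(pending <= QSIZE);

    /* Collect first: old and new ring slots may overlap. */
    for (uint16_t i = 0; i < pending; i++) {
        uint16_t slot = (uint16_t)(vq->used_idx + i) % QSIZE;
        uint16_t head = vq->avail_ring[slot];

        assert(head < QSIZE);
        if (keep[head])
            kept[n_kept++] = head;
        else
            released[n_released++] = head;
    }

    /* The device restarts at index 0, so the kept buffers are placed
     * in [0, new_avail_idx) before DRIVER_OK is set again. */
    for (unsigned i = 0; i < n_kept; i++)
        vq->avail_ring[i] = kept[i];
    vq->avail_idx = (uint16_t)n_kept;
    vq->used_idx = 0;

    return n_released;
}
```

The temporary kept[] array matters: when the pending range wraps around
the end of the ring, writing slot 0 directly could overwrite an entry
that has not been read yet.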

> >
> > Then we have the opposite problem of the live migration: What do "in
> > flight" descriptors mean here? For example, for block, is it valid to
> > rewind a write descriptor that is in-flight?
>
> I think if we don't care about rewind it would be much easier. (Anyhow
> we know we will support setting/getting index states from driver).
>

It may be, yes.

Thanks!

> Thanks
>
> >
> > > And it depends on if we had other
> > > requirements during the rewind, e.g can we afford to stop the whole
> > > device or just a specific virtqueue. If it's the latter, we had
> > > another special example (rewind to 0), Ali is proposing per virtqueue
> > > reset which could be used to death unused buffers safely from a
> > > specific queue (then it's something unrelated to STOP itself).
> > >
> > > So I think STOP + virtqueue index state should be sufficient for
> > > rewind if we can afford to stop the whole device (which might not be
> > > the case for Ali's proposal).
> > >
> >
> > Totally agree on this part.
> >
> > Thanks!
> >
> > > Thanks
> > >
> > > >
> > > > Thanks!
> > > >
> > > > > Thanks
> > > > >
> > > >
> > >
> >
>



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-25  2:57             ` Jason Wang
@ 2021-11-29 10:29               ` Stefan Hajnoczi
  2021-11-29 16:55                 ` Eugenio Perez Martin
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2021-11-29 10:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: Eugenio Perez Martin, Virtio-Dev, virtio-comment,
	Michael Tsirkin, Alexander Mikheev, Shahaf Shuler, Oren Duer,
	Halil Pasic, Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy


On Thu, Nov 25, 2021 at 10:57:28AM +0800, Jason Wang wrote:
> 
> On 2021/11/24 at 7:20 PM, Stefan Hajnoczi wrote:
> > On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> > > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > > I'm trying to understand how this would work. Available buffers may be
> > > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > > > avail ring could contain something like:
> > > > > > 
> > > > > >    avail.ring = [Used, Not used, Used, Not used, ...]
> > > > > >                                                  ^--- avail.idx
> > > > > > 
> > > > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > > > used" in avail.ring.
> > > > > > 
> > > > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > > > num_not_used "Not used" buffers?
> > > > > > 
> > > > > I'm going to also drop the "resume" part for the next version because
> > > > > it adds extra complexity not actually needed, and it can be achieved
> > > > > with a full reset in a simpler way.
> > > > > 
> > > > > But I'll explain it below with your examples. Long story short, the
> > > > > driver only can rewind the available descriptors that are still in the
> > > > > available ring, and the device must flush the ones that cannot recover
> > > > > from the ring.
> > > > > 
> > > > > > I think there is a known issue with this approach:
> > > > > > 
> > > > > > Imagine a vring with 4 elements:
> > > > > > 
> > > > > >    avail.ring = [0,        1,    2,    3   ]
> > > > > >                  Not used, used, used, used
> > > > > >                                             ^--- avail.idx
> > > > > > 
> > > > > > Since the device has used 3 buffers the driver now has space to make
> > > > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > > > and the driver overwrites the first element ("Not used"):
> > > > > > 
> > > > > >    avail.ring = [1,        N/A,  N/A,  N/A]
> > > > > >                  Not used, N/A,  N/A,  N/A
> > > > > >                           ^--- avail.idx
> > > > > > 
> > > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > > > for the new available buffer.
> > > > > > 
> > > > > > Now we stop the device, knowing there are two buffers available that
> > > > > > have not been used. But avail.ring[] actually only contains the new
> > > > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > > > the old avail.ring[] element (vring descriptor 0).
> > > > > > 
> > > > > > What now? Where does the device reset its internal avail_idx to?
> > > > > To be on the same page, in qemu the device maintains two "internal
> > > > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > > > it can stop.
> > > > > 
> > > > > The proposal allows the device to fail descriptor 0 in a
> > > > > device-specific way, but I think now it was a bad choice.
> > > > > 
> > > > > The driver cannot move the device's last_avail_idx in this operation:
> > > > > The device is simply forced to flush used ones to the used ring or
> > > > > descriptor ring in a packed vq case. So the device's internal
> > > > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > > > > 
> > > > > The device must keep its last_avail_idx through stop and resume cycle.
> > > > Are you saying that all buffers avail->ring[i % ring_size] must be
> > > > completed by the device before the STOP bit is reported where i <=
> > > > last_avail_idx?
> > > > 
> > > > This means the driver can modify avail->ring[i % ring_size] where
> > > > avail_idx >= i > used_idx.
> > > > 
> > > Yes, That's correct. The driver could also decide to modify the
> > > descriptor table instead of the avail ring to do so, but I think the
> > > point is clear now.
> > > 
> > > Somehow it is thought after the premise that the out of order
> > "Somehow it is thought after the premise" == "there is a fundamental
> > design assumption"?
> > 
> > > descriptors are descriptors that the device must wait to complete
> > > before the pause anyway. Depending on the device, it might prefer to
> > > cancel them, to wait for them, etc. The interesting descriptors to
> > > rewind are the ones that have not reached the device (i > used_idx).
> > > The driver can do whatever it wants with them.
> > > 
> > > If we assume all the in-flight descriptors are idempotent and we
> > > expose a way for the device to expose them, the model is way more
> > > simpler than this.
> > The constraint that the device has to mark all previously seen "avail"
> > buffers as "used" is problematic. It makes STOP visible to the driver
> > when the device has to fail requests.
> 
> 
> I think we need some clarification here on the driver. For doing migration,
> some kind of mediation is a must.
> 
> As we've discussed in the previous versions of this proposal, the VMM
> usually won't advertise the STOP feature to guest if we don't want to do
> nested live migration (if we do we can shadow it anyhow).
> 
> So from the guest point of view it won't see neither STOP nor the inflight
> descriptors.

That's not how I understand STOP semantics. See below.

> > That is incompatible with how
> > devices behave across live migration today. If you want to use STOP for
> > live migration then it's probably necessary to rethink this constraint.
> > 
> > QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> > a list so they can be retried after problems with the underlying storage
> > have been resolved (e.g. more disk space becomes available and ENOSPC
> > requests can be retried).
> 
> 
> A question, I think for those "failure" it's actually not visible from the
> drive? If this is true, from the spec point of view, there are still
> inflight. The VMM may choose to migrate them to the destination and
> re-submit them there. This works more like vhost re-connection.

That's how I would like STOP to work, but the semantics seem to be
different. Eugenio can correct me if this is wrong:

All avail descriptors before the last used descriptor must be marked
used before the device reports the STOP bit. For example:

  avail.ring = [1, 2, 3, 4]
  used.ring = [3]

The driver writes the STOP bit. Now the device MUST complete 1 and 2
before reporting the STOP bit. Therefore the device cannot keep 1 and 2
in flight, but it can keep 4 in flight. The problem is that this
conflicts with the virtio-blk/scsi failed requests behavior, where 1 and
2 should be kept in flight and migrated.
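
To make the constraint concrete, here is a small C sketch of it as I
read it. The helper must_complete_before_stop() and the MAX_HEADS bound
are hypothetical and only model the rule stated above: every buffer that
sits earlier in the avail ring than the last used one has to be
completed before STOP is reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HEADS 16u /* hypothetical bound on descriptor head values */

/*
 * avail[] lists descriptor heads in the order they were made available.
 * used[head] is true if the device already marked that head used.
 * Writes to out[] the heads that still have to be completed before the
 * device may report STOP and returns how many there are.
 */
size_t must_complete_before_stop(const uint16_t *avail, size_t n_avail,
                                 const bool used[MAX_HEADS], uint16_t *out)
{
    size_t last_used_pos = 0, n_out = 0;
    bool any_used = false;

    /* Find the ring position of the last buffer already used. */
    for (size_t i = 0; i < n_avail; i++) {
        assert(avail[i] < MAX_HEADS);
        if (used[avail[i]]) {
            last_used_pos = i;
            any_used = true;
        }
    }
    if (!any_used)
        return 0;

    /* Everything before that position that is not used yet must be. */
    for (size_t i = 0; i < last_used_pos; i++) {
        if (!used[avail[i]])
            out[n_out++] = avail[i];
    }
    return n_out;
}
```

With avail = {1, 2, 3, 4} and only head 3 used, the function reports
heads 1 and 2, while head 4 may stay in flight.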

Stefan



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-24 16:41             ` Eugenio Perez Martin
@ 2021-11-29 10:32               ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2021-11-29 10:32 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Virtio-Dev, virtio-comment, Michael Tsirkin, Jason Wang,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy


On Wed, Nov 24, 2021 at 05:41:27PM +0100, Eugenio Perez Martin wrote:
> On Wed, Nov 24, 2021 at 12:21 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> > > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > >
> > > > > > I'm trying to understand how this would work. Available buffers may be
> > > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > > > avail ring could contain something like:
> > > > > >
> > > > > >   avail.ring = [Used, Not used, Used, Not used, ...]
> > > > > >                                                 ^--- avail.idx
> > > > > >
> > > > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > > > used" in avail.ring.
> > > > > >
> > > > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > > > num_not_used "Not used" buffers?
> > > > > >
> > > > >
> > > > > I'm going to also drop the "resume" part for the next version because
> > > > > it adds extra complexity not actually needed, and it can be achieved
> > > > > with a full reset in a simpler way.
> > > > >
> > > > > But I'll explain it below with your examples. Long story short, the
> > > > > driver only can rewind the available descriptors that are still in the
> > > > > available ring, and the device must flush the ones that cannot recover
> > > > > from the ring.
> > > > >
> > > > > > I think there is a known issue with this approach:
> > > > > >
> > > > > > Imagine a vring with 4 elements:
> > > > > >
> > > > > >   avail.ring = [0,        1,    2,    3   ]
> > > > > >                 Not used, used, used, used
> > > > > >                                            ^--- avail.idx
> > > > > >
> > > > > > Since the device has used 3 buffers the driver now has space to make
> > > > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > > > and the driver overwrites the first element ("Not used"):
> > > > > >
> > > > > >   avail.ring = [1,        N/A,  N/A,  N/A]
> > > > > >                 Not used, N/A,  N/A,  N/A
> > > > > >                          ^--- avail.idx
> > > > > >
> > > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > > > for the new available buffer.
> > > > > >
> > > > > > Now we stop the device, knowing there are two buffers available that
> > > > > > have not been used. But avail.ring[] actually only contains the new
> > > > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > > > the old avail.ring[] element (vring descriptor 0).
> > > > > >
> > > > > > What now? Where does the device reset its internal avail_idx to?
> > > > >
> > > > > To be on the same page, in qemu the device maintains two "internal
> > > > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > > > it can stop.
> > > > >
> > > > > The proposal allows the device to fail descriptor 0 in a
> > > > > device-specific way, but I think now it was a bad choice.
> > > > >
> > > > > The driver cannot move the device's last_avail_idx in this operation:
> > > > > The device is simply forced to flush used ones to the used ring or
> > > > > descriptor ring in a packed vq case. So the device's internal
> > > > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > > > >
> > > > > The device must keep its last_avail_idx through stop and resume cycle.
> > > >
> > > > Are you saying that all buffers avail->ring[i % ring_size] must be
> > > > completed by the device before the STOP bit is reported where i <=
> > > > last_avail_idx?
> > > >
> > > > This means the driver can modify avail->ring[i % ring_size] where
> > > > avail_idx >= i > used_idx.
> > > >
> > >
> > > Yes, That's correct. The driver could also decide to modify the
> > > descriptor table instead of the avail ring to do so, but I think the
> > > point is clear now.
> > >
> > > Somehow it is thought after the premise that the out of order
> >
> > "Somehow it is thought after the premise" == "there is a fundamental
> > design assumption"?
> >
> 
> Well, there always are design assumptions :). I didn't see it as
> fundamental at the time of sending it, when I didn't consider
> "idempotents in-flight descriptors" as something I could take for
> granted. So I thought of it as the best we could do with the backend
> that must wait for them, and without introducing other complicated
> things (in-flight).
> 
> > > descriptors are descriptors that the device must wait to complete
> > > before the pause anyway. Depending on the device, it might prefer to
> > > cancel them, to wait for them, etc. The interesting descriptors to
> > > rewind are the ones that have not reached the device (i > used_idx).
> > > The driver can do whatever it wants with them.
> > >
> > > If we assume all the in-flight descriptors are idempotent and we
> > > expose a way for the device to expose them, the model is way more
> > > simpler than this.
> >
> > The constraint that the device has to mark all previously seen "avail"
> > buffers as "used" is problematic. It makes STOP visible to the driver
> > when the device has to fail requests. That is incompatible with how
> > devices behave across live migration today. If you want to use STOP for
> > live migration then it's probably necessary to rethink this constraint.
> >
> > QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> > a list so they can be retried after problems with the underlying storage
> > have been resolved (e.g. more disk space becomes available and ENOSPC
> > requests can be retried). Based on the constraints you described, those
> > requests cannot be kept in the list across STOP.
> >
> 
> I didn't know about that, thanks for the information! Can vhost ones
> do the same?

No, there is no vhost-blk upstream and vhost-scsi doesn't migrate failed
requests.

> > QEMU live migration sends the retry list to the migration destination. I
> > think you're saying that won't be possible when STOP is used to
> > implement live migration?
> >
> 
> Without out of order descriptor usage, the device can simply not mark
> them as used, and the destination will re-try them. Would that work?

The failed requests feature is not useful with VIRTIO_F_IN_ORDER since a
failed request prevents further I/O processing.
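
A tiny C sketch of why, assuming the usual VIRTIO_F_IN_ORDER rule that
buffers are used in the same order in which they were made available.
in_order_reportable() is a hypothetical helper, not QEMU code: it counts
how many completions the device could report when one request is being
held back for retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * completed[i] is true if the i-th available request has finished.
 * With in-order use, only the leading run of finished requests can be
 * marked used; a request held for retry blocks everything after it.
 */
size_t in_order_reportable(const bool *completed, size_t n)
{
    size_t i = 0;

    while (i < n && completed[i])
        i++;
    return i;
}
```

For completed = {true, false, true, true} only the first request can be
reported, even though three of the four have finished.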

> In the case of out or order, this proposal does not cover it 100% but
> the next one will do.

Great, looking forward to the next revision.

Stefan



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-29 10:29               ` Stefan Hajnoczi
@ 2021-11-29 16:55                 ` Eugenio Perez Martin
  2021-12-01 10:21                   ` Stefan Hajnoczi
  2021-12-02  2:40                   ` Jason Wang
  0 siblings, 2 replies; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-11-29 16:55 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Jason Wang, Virtio-Dev, virtio-comment, Michael Tsirkin,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Mon, Nov 29, 2021 at 11:29 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Nov 25, 2021 at 10:57:28AM +0800, Jason Wang wrote:
> >
> > On 2021/11/24 at 7:20 PM, Stefan Hajnoczi wrote:
> > > On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> > > > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > > > I'm trying to understand how this would work. Available buffers may be
> > > > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > > > > avail ring could contain something like:
> > > > > > >
> > > > > > >    avail.ring = [Used, Not used, Used, Not used, ...]
> > > > > > >                                                  ^--- avail.idx
> > > > > > >
> > > > > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > > > > used" in avail.ring.
> > > > > > >
> > > > > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > > > > num_not_used "Not used" buffers?
> > > > > > >
> > > > > > I'm going to also drop the "resume" part for the next version because
> > > > > > it adds extra complexity not actually needed, and it can be achieved
> > > > > > with a full reset in a simpler way.
> > > > > >
> > > > > > But I'll explain it below with your examples. Long story short, the
> > > > > > driver only can rewind the available descriptors that are still in the
> > > > > > available ring, and the device must flush the ones that cannot recover
> > > > > > from the ring.
> > > > > >
> > > > > > > I think there is a known issue with this approach:
> > > > > > >
> > > > > > > Imagine a vring with 4 elements:
> > > > > > >
> > > > > > >    avail.ring = [0,        1,    2,    3   ]
> > > > > > >                  Not used, used, used, used
> > > > > > >                                             ^--- avail.idx
> > > > > > >
> > > > > > > Since the device has used 3 buffers the driver now has space to make
> > > > > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > > > > and the driver overwrites the first element ("Not used"):
> > > > > > >
> > > > > > >    avail.ring = [1,        N/A,  N/A,  N/A]
> > > > > > >                  Not used, N/A,  N/A,  N/A
> > > > > > >                           ^--- avail.idx
> > > > > > >
> > > > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > > > > for the new available buffer.
> > > > > > >
> > > > > > > Now we stop the device, knowing there are two buffers available that
> > > > > > > have not been used. But avail.ring[] actually only contains the new
> > > > > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > > > > the old avail.ring[] element (vring descriptor 0).
> > > > > > >
> > > > > > > What now? Where does the device reset its internal avail_idx to?
> > > > > > To be on the same page, in qemu the device maintains two "internal
> > > > > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > > > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > > > > it can stop.
> > > > > >
> > > > > > The proposal allows the device to fail descriptor 0 in a
> > > > > > device-specific way, but I think now it was a bad choice.
> > > > > >
> > > > > > The driver cannot move the device's last_avail_idx in this operation:
> > > > > > The device is simply forced to flush used ones to the used ring or
> > > > > > descriptor ring in a packed vq case. So the device's internal
> > > > > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > > > > >
> > > > > > The device must keep its last_avail_idx through stop and resume cycle.
> > > > > Are you saying that all buffers avail->ring[i % ring_size] must be
> > > > > completed by the device before the STOP bit is reported where i <=
> > > > > last_avail_idx?
> > > > >
> > > > > This means the driver can modify avail->ring[i % ring_size] where
> > > > > avail_idx >= i > used_idx.
> > > > >
> > > > Yes, That's correct. The driver could also decide to modify the
> > > > descriptor table instead of the avail ring to do so, but I think the
> > > > point is clear now.
> > > >
> > > > Somehow it is thought after the premise that the out of order
> > > "Somehow it is thought after the premise" == "there is a fundamental
> > > design assumption"?
> > >
> > > > descriptors are descriptors that the device must wait to complete
> > > > before the pause anyway. Depending on the device, it might prefer to
> > > > cancel them, to wait for them, etc. The interesting descriptors to
> > > > rewind are the ones that have not reached the device (i > used_idx).
> > > > The driver can do whatever it wants with them.
> > > >
> > > > If we assume all the in-flight descriptors are idempotent and we
> > > > provide a way for the device to expose them, the model is much
> > > > simpler than this.
> > > The constraint that the device has to mark all previously seen "avail"
> > > buffers as "used" is problematic. It makes STOP visible to the driver
> > > when the device has to fail requests.
> >
> >
> > I think we need some clarification here on the driver. For doing migration,
> > some kind of mediation is a must.
> >
> > As we've discussed in the previous versions of this proposal, the VMM
> > usually won't advertise the STOP feature to guest if we don't want to do
> > nested live migration (if we do we can shadow it anyhow).
> >
> > So from the guest's point of view, it will see neither STOP nor the
> > inflight descriptors.
>
> That's not how I understand STOP semantics. See below.
>
> > > That is incompatible with how
> > > devices behave across live migration today. If you want to use STOP for
> > > live migration then it's probably necessary to rethink this constraint.
> > >
> > > QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> > > a list so they can be retried after problems with the underlying storage
> > > have been resolved (e.g. more disk space becomes available and ENOSPC
> > > requests can be retried).
> >
> >
> > A question: I think those "failures" are actually not visible to the
> > driver? If this is true, from the spec point of view, they are still
> > inflight. The VMM may choose to migrate them to the destination and
> > re-submit them there. This works more like vhost re-connection.
>
> That's how I would like STOP to work, but the semantics seem to be
> different. Eugenio can correct me if this is wrong:
>
> All avail descriptors before the last used descriptor must be marked
> used before the device reports the STOP bit. For example:
>
>   avail.ring = [1, 2, 3, 4]
>   used.ring = [3]
>
> The driver writes the STOP bit. Now the device MUST complete 1 and 2
> before reporting the STOP bit. Therefore we cannot keep 1 and 2
> in flight but it can keep 4 in flight. The problem is that this
> conflicts with the virtio-blk/scsi failed requests behavior where 1 and
> 2 should be kept in flight and migrated.
>

(Only answering your example use case in this part of the mail)

My intention was a little bit less rigid actually, but it does not
meet the blk use case anyway.

In that case, the device should be free to not mark descriptor 2 as
used, since the device will start on last_avail_idx == 1, and it will
read it again after the reset. And that requirement was intended to be
removed once we implement a standard / device specific way to report
them differently. It's loosely expressed as "Depending on the device,
... as long as the driver can recover its normal operation if it
resumes the device without the need of resetting it".

Although I thought this freedom would help devices implement stop
semantics, tracking overridden descriptors could actually be far
worse than simply splitting them into in flight == (used_idx,
last_avail_idx) and available == (last_avail_idx, avail_idx).
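
As a rough sketch of that split (not part of the proposal): assuming
buffers are used in order, so that the in-flight set is one contiguous
range, both ranges can be derived from the three indices alone. The
helper names ring_span and classify are made up for illustration, and
the ranges are treated as half-open.

```python
RING_MASK = 0xFFFF  # split virtqueue indices are free-running 16-bit counters


def ring_span(start, end):
    """Return the free-running indices in [start, end), honouring 16-bit wrap."""
    count = (end - start) & RING_MASK
    return [(start + i) & RING_MASK for i in range(count)]


def classify(used_idx, last_avail_idx, avail_idx, queue_size):
    """Map the two index ranges to ring slots.

    in_flight: fetched by the device but not yet marked used.
    available: published by the driver but not yet fetched.
    Assumes in-order use, so in_flight is contiguous.
    """
    in_flight = [i % queue_size for i in ring_span(used_idx, last_avail_idx)]
    available = [i % queue_size for i in ring_span(last_avail_idx, avail_idx)]
    return in_flight, available
```

For a 4-entry ring with used_idx == 1, last_avail_idx == 3 and
avail_idx == 4, this gives ring slots 1 and 2 as in flight and slot 3
as available.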

I still think that, ideally, the device should be able to report
non-rewindable descriptors (for example, in-flight writes, because
rewinding them leaves the device in an inconsistent state) separately
from rewindable ones (not-yet-started writes, reads) to the driver,
just for the sake of flexibility. But potentially overridden
descriptors complicate it, so it's probably not worth it. And our
intended use case (live migration) has no use for it, so I think it's
better to stick with in flight vs avail.

(Now adding my view of Jason's point on top)

At this moment, blk is able to detect ENOSPC because the device is
implemented in qemu's code, in software. If the device lives outside
of qemu, it will need either:
* A way to signal the error condition to qemu, so it can start the
migration to solve it.
* Another process to monitor available space so it can react & migrate.

Since you pointed out a queue of failed requests, I will go with the
first method. The device's data queues reach the guest directly, so
the device cannot use them to signal ENOSPC: delivering it via the
virtqueue would bypass qemu. This is already outside of VirtIO. How
would that work in the nested migration case, for example? The only
way I can think of at the moment is to use a shadow virtqueue from the
beginning of qemu operation.

Once qemu receives that signal, the guest would only see that
descriptor id 1 has been used. For the next revision, it will see no
descriptor used.

Thanks!



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-29 16:55                 ` Eugenio Perez Martin
@ 2021-12-01 10:21                   ` Stefan Hajnoczi
  2021-12-02  8:30                     ` Eugenio Perez Martin
  2021-12-02  2:40                   ` Jason Wang
  1 sibling, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2021-12-01 10:21 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, Virtio-Dev, virtio-comment, Michael Tsirkin,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy


On Mon, Nov 29, 2021 at 05:55:57PM +0100, Eugenio Perez Martin wrote:
> On Mon, Nov 29, 2021 at 11:29 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Nov 25, 2021 at 10:57:28AM +0800, Jason Wang wrote:
> > >
> > > On 2021/11/24 at 7:20 PM, Stefan Hajnoczi wrote:
> > > > On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> > > > > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > > > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > > > > I'm trying to understand how this would work. Available buffers may be
> > > > > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > > > > > avail ring could contain something like:
> > > > > > > >
> > > > > > > >    avail.ring = [Used, Not used, Used, Not used, ...]
> > > > > > > >                                                  ^--- avail.idx
> > > > > > > >
> > > > > > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > > > > > used" in avail.ring.
> > > > > > > >
> > > > > > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > > > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > > > > > num_not_used "Not used" buffers?
> > > > > > > >
> > > > > > > I'm going to also drop the "resume" part for the next version because
> > > > > > > it adds extra complexity not actually needed, and it can be achieved
> > > > > > > with a full reset in a simpler way.
> > > > > > >
> > > > > > > But I'll explain it below with your examples. Long story short, the
> > > > > > > driver only can rewind the available descriptors that are still in the
> > > > > > > available ring, and the device must flush the ones that cannot recover
> > > > > > > from the ring.
> > > > > > >
> > > > > > > > I think there is a known issue with this approach:
> > > > > > > >
> > > > > > > > Imagine a vring with 4 elements:
> > > > > > > >
> > > > > > > >    avail.ring = [0,        1,    2,    3   ]
> > > > > > > >                  Not used, used, used, used
> > > > > > > >                                             ^--- avail.idx
> > > > > > > >
> > > > > > > > Since the device has used 3 buffers the driver now has space to make
> > > > > > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > > > > > and the driver overwrites the first element ("Not used"):
> > > > > > > >
> > > > > > > >    avail.ring = [1,        N/A,  N/A,  N/A]
> > > > > > > >                  Not used, N/A,  N/A,  N/A
> > > > > > > >                           ^--- avail.idx
> > > > > > > >
> > > > > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > > > > > for the new available buffer.
> > > > > > > >
> > > > > > > > Now we stop the device, knowing there are two buffers available that
> > > > > > > > have not been used. But avail.ring[] actually only contains the new
> > > > > > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > > > > > the old avail.ring[] element (vring descriptor 0).
> > > > > > > >
> > > > > > > > What now? Where does the device reset its internal avail_idx to?
> > > > > > > To be on the same page, in qemu the device maintains two "internal
> > > > > > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > > > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > > > > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > > > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > > > > > it can stop.
> > > > > > >
> > > > > > > The proposal allows the device to fail descriptor 0 in a
> > > > > > > device-specific way, but I think now it was a bad choice.
> > > > > > >
> > > > > > > The driver cannot move the device's last_avail_idx in this operation:
> > > > > > > The device is simply forced to flush used ones to the used ring or
> > > > > > > descriptor ring in a packed vq case. So the device's internal
> > > > > > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > > > > > >
> > > > > > > The device must keep its last_avail_idx through stop and resume cycle.
> > > > > > Are you saying that all buffers avail->ring[i % ring_size] must be
> > > > > > completed by the device before the STOP bit is reported where i <=
> > > > > > last_avail_idx?
> > > > > >
> > > > > > This means the driver can modify avail->ring[i % ring_size] where
> > > > > > avail_idx >= i > used_idx.
> > > > > >
> > > > > Yes, That's correct. The driver could also decide to modify the
> > > > > descriptor table instead of the avail ring to do so, but I think the
> > > > > point is clear now.
> > > > >
> > > > > Somehow it is thought after the premise that the out of order
> > > > "Somehow it is thought after the premise" == "there is a fundamental
> > > > design assumption"?
> > > >
> > > > > descriptors are descriptors that the device must wait to complete
> > > > > before the pause anyway. Depending on the device, it might prefer to
> > > > > cancel them, to wait for them, etc. The interesting descriptors to
> > > > > rewind are the ones that have not reached the device (i > used_idx).
> > > > > The driver can do whatever it wants with them.
> > > > >
> > > > > If we assume all the in-flight descriptors are idempotent and we
> > > > > provide a way for the device to expose them, the model is much
> > > > > simpler than this.
> > > > The constraint that the device has to mark all previously seen "avail"
> > > > buffers as "used" is problematic. It makes STOP visible to the driver
> > > > when the device has to fail requests.
> > >
> > >
> > > I think we need some clarification here on the driver. For doing migration,
> > > some kind of mediation is a must.
> > >
> > > As we've discussed in the previous versions of this proposal, the VMM
> > > usually won't advertise the STOP feature to guest if we don't want to do
> > > nested live migration (if we do we can shadow it anyhow).
> > >
> > > So from the guest's point of view, it will see neither STOP nor the
> > > inflight descriptors.
> >
> > That's not how I understand STOP semantics. See below.
> >
> > > > That is incompatible with how
> > > > devices behave across live migration today. If you want to use STOP for
> > > > live migration then it's probably necessary to rethink this constraint.
> > > >
> > > > QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> > > > a list so they can be retried after problems with the underlying storage
> > > > have been resolved (e.g. more disk space becomes available and ENOSPC
> > > > requests can be retried).
> > >
> > >
> > > A question: I think those "failures" are actually not visible to the
> > > driver? If this is true, from the spec point of view, they are still
> > > inflight. The VMM may choose to migrate them to the destination and
> > > re-submit them there. This works more like vhost re-connection.
> >
> > That's how I would like STOP to work, but the semantics seem to be
> > different. Eugenio can correct me if this is wrong:
> >
> > All avail descriptors before the last used descriptor must be marked
> > used before the device reports the STOP bit. For example:
> >
> >   avail.ring = [1, 2, 3, 4]
> >   used.ring = [3]
> >
> > The driver writes the STOP bit. Now the device MUST complete 1 and 2
> > before reporting the STOP bit. Therefore we cannot keep 1 and 2
> > in flight but it can keep 4 in flight. The problem is that this
> > conflicts with the virtio-blk/scsi failed requests behavior where 1 and
> > 2 should be kept in flight and migrated.
> >
> 
> (Only answering your example use case in this part of the mail)
> 
> My intention was a little bit less rigid actually, but it does not
> meet the blk use case anyway.
> 
> In that case, the device should be free to not mark descriptor 2 as
> used, since the device will start on last_avail_idx == 1, and it will
> read it again after the reset.

last_avail_idx must be 3 since the device already saw avail.ring[0],
avail.ring[1], and avail.ring[2]. How can last_avail_idx be 1?
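
To make the index arithmetic concrete, here is a minimal model of the
example above, assuming the device has fetched the first three avail
entries and has used only descriptor 3. device_view is a hypothetical
helper, not a QEMU function, and wrap-around is ignored to keep the
model small.

```python
def device_view(avail_ring, used_ring, last_avail_idx):
    """Split avail entries into in-flight and not-yet-fetched descriptors.

    last_avail_idx counts the avail entries the device has already
    fetched. A fetched descriptor that is absent from the used ring is
    still in flight.
    """
    fetched = avail_ring[:last_avail_idx]
    in_flight = [d for d in fetched if d not in used_ring]
    not_fetched = avail_ring[last_avail_idx:]
    return in_flight, not_fetched


in_flight, not_fetched = device_view([1, 2, 3, 4], [3], last_avail_idx=3)
print(in_flight, not_fetched)
```

With last_avail_idx == 3, descriptors 1 and 2 are in flight and only
descriptor 4 has not been fetched yet.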

Also, what does "reset" mean? I don't think a VIRTIO device reset is
part of this process, just setting and clearing the STOP bit.

> And that requirement was intended to be
> removed once we implement a standard / device specific way to report
> them differently. It's loosely expressed as "Depending on the device,
> ... as long as the driver can recover its normal operation if it
> resumes the device without the need of resetting it".
> 
> Although I thought this freedom would help devices implement stop
> semantics, tracking overridden descriptors could actually be far
> worse than simply splitting them into in flight == (used_idx,
> last_avail_idx) and available == (last_avail_idx, avail_idx).
> 
> I still think that, ideally, the device should be able to report
> non-rewindable descriptors (for example, in-flight writes, because
> rewinding them leaves the device in an inconsistent state) separately
> from rewindable ones (not-yet-started writes, reads) to the driver,
> just for the sake of flexibility. But potentially overridden
> descriptors complicate it, so it's probably not worth it. And our
> intended use case (live migration) has no use for it, so I think it's
> better to stick with in flight vs avail.
> 
> (Now adding my view of Jason's point on top)
> 
> At this moment, blk is able to detect ENOSPC because the device is
> implemented in qemu's code, in software. If the device lives outside
> of qemu, it will need either:
> * A way to signal the error condition to qemu, so it can start the
> migration to solve it.
> * Another process to monitor available space so it can react & migrate.
> 
> Since you pointed out a queue of failed requests, I will go with the
> first method. The device's data queues reach the guest directly, so
> the device cannot use them to signal ENOSPC: delivering it via the
> virtqueue would bypass qemu. This is already outside of VirtIO. How
> would that work in the nested migration case, for example? The only
> way I can think of at the moment is to use a shadow virtqueue from the
> beginning of qemu operation.
> 
> Once qemu receives that signal, the guest would only see that
> descriptor id 1 has been used. For the next revision, it will see no
> descriptor used.

I don't understand the idea here. I'll wait until the next revision of
this series to think through virtio-blk again.

Stefan



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-11-29 16:55                 ` Eugenio Perez Martin
  2021-12-01 10:21                   ` Stefan Hajnoczi
@ 2021-12-02  2:40                   ` Jason Wang
  2021-12-02  9:44                     ` Stefan Hajnoczi
  1 sibling, 1 reply; 43+ messages in thread
From: Jason Wang @ 2021-12-02  2:40 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Virtio-Dev, virtio-comment, Michael Tsirkin,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Tue, Nov 30, 2021 at 12:56 AM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Nov 29, 2021 at 11:29 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Nov 25, 2021 at 10:57:28AM +0800, Jason Wang wrote:
> > >
> > > On 2021/11/24 at 7:20 PM, Stefan Hajnoczi wrote:
> > > > On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> > > > > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > > > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > > > > I'm trying to understand how this would work. Available buffers may be
> > > > > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > > > > > avail ring could contain something like:
> > > > > > > >
> > > > > > > >    avail.ring = [Used, Not used, Used, Not used, ...]
> > > > > > > >                                                  ^--- avail.idx
> > > > > > > >
> > > > > > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > > > > > used" in avail.ring.
> > > > > > > >
> > > > > > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > > > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > > > > > num_not_used "Not used" buffers?
> > > > > > > >
> > > > > > > I'm going to also drop the "resume" part for the next version because
> > > > > > > it adds extra complexity not actually needed, and it can be achieved
> > > > > > > with a full reset in a simpler way.
> > > > > > >
> > > > > > > But I'll explain it below with your examples. Long story short, the
> > > > > > > driver only can rewind the available descriptors that are still in the
> > > > > > > available ring, and the device must flush the ones that cannot recover
> > > > > > > from the ring.
> > > > > > >
> > > > > > > > I think there is a known issue with this approach:
> > > > > > > >
> > > > > > > > Imagine a vring with 4 elements:
> > > > > > > >
> > > > > > > >    avail.ring = [0,        1,    2,    3   ]
> > > > > > > >                  Not used, used, used, used
> > > > > > > >                                             ^--- avail.idx
> > > > > > > >
> > > > > > > > Since the device has used 3 buffers the driver now has space to make
> > > > > > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > > > > > and the driver overwrites the first element ("Not used"):
> > > > > > > >
> > > > > > > >    avail.ring = [1,        N/A,  N/A,  N/A]
> > > > > > > >                  Not used, N/A,  N/A,  N/A
> > > > > > > >                           ^--- avail.idx
> > > > > > > >
> > > > > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > > > > > for the new available buffer.
> > > > > > > >
> > > > > > > > Now we stop the device, knowing there are two buffers available that
> > > > > > > > have not been used. But avail.ring[] actually only contains the new
> > > > > > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > > > > > the old avail.ring[] element (vring descriptor 0).
> > > > > > > >
> > > > > > > > What now? Where does the device reset its internal avail_idx to?
> > > > > > > To be on the same page, in qemu the device maintains two "internal
> > > > > > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > > > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > > > > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > > > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > > > > > it can stop.
> > > > > > >
> > > > > > > The proposal allows the device to fail descriptor 0 in a
> > > > > > > device-specific way, but I think now it was a bad choice.
> > > > > > >
> > > > > > > The driver cannot move the device's last_avail_idx in this operation:
> > > > > > > The device is simply forced to flush used ones to the used ring or
> > > > > > > descriptor ring in a packed vq case. So the device's internal
> > > > > > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > > > > > >
> > > > > > > The device must keep its last_avail_idx through stop and resume cycle.
> > > > > > Are you saying that all buffers avail->ring[i % ring_size] must be
> > > > > > completed by the device before the STOP bit is reported where i <=
> > > > > > last_avail_idx?
> > > > > >
> > > > > > This means the driver can modify avail->ring[i % ring_size] where
> > > > > > avail_idx >= i > used_idx.
> > > > > >
> > > > > Yes, That's correct. The driver could also decide to modify the
> > > > > descriptor table instead of the avail ring to do so, but I think the
> > > > > point is clear now.
> > > > >
> > > > > Somehow it is thought after the premise that the out of order
> > > > "Somehow it is thought after the premise" == "there is a fundamental
> > > > design assumption"?
> > > >
> > > > > descriptors are descriptors that the device must wait to complete
> > > > > before the pause anyway. Depending on the device, it might prefer to
> > > > > cancel them, to wait for them, etc. The interesting descriptors to
> > > > > rewind are the ones that have not reached the device (i > used_idx).
> > > > > The driver can do whatever it wants with them.
> > > > >
> > > > > If we assume all the in-flight descriptors are idempotent and we
> > > > > provide a way for the device to expose them, the model is much
> > > > > simpler than this.
> > > > The constraint that the device has to mark all previously seen "avail"
> > > > buffers as "used" is problematic. It makes STOP visible to the driver
> > > > when the device has to fail requests.
> > >
> > >
> > > I think we need some clarification here on the driver. For doing migration,
> > > some kind of mediation is a must.
> > >
> > > As we've discussed in the previous versions of this proposal, the VMM
> > > usually won't advertise the STOP feature to guest if we don't want to do
> > > nested live migration (if we do we can shadow it anyhow).
> > >
> > > So from the guest's point of view, it will see neither STOP nor the
> > > inflight descriptors.
> >
> > That's not how I understand STOP semantics. See below.
> >
> > > > That is incompatible with how
> > > > devices behave across live migration today. If you want to use STOP for
> > > > live migration then it's probably necessary to rethink this constraint.
> > > >
> > > > QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> > > > a list so they can be retried after problems with the underlying storage
> > > > have been resolved (e.g. more disk space becomes available and ENOSPC
> > > > requests can be retried).
> > >
> > >
> > > A question: I think those "failures" are actually not visible to the
> > > driver? If this is true, from the spec point of view, they are still
> > > inflight. The VMM may choose to migrate them to the destination and
> > > re-submit them there. This works more like vhost re-connection.
> >
> > That's how I would like STOP to work, but the semantics seem to be
> > different. Eugenio can correct me if this is wrong:
> >
> > All avail descriptors before the last used descriptor must be marked
> > used before the device reports the STOP bit. For example:
> >
> >   avail.ring = [1, 2, 3, 4]
> >   used.ring = [3]
> >
> > The driver writes the STOP bit. Now the device MUST complete 1 and 2
> > before reporting the STOP bit. Therefore we cannot keep 1 and 2
> > in flight but it can keep 4 in flight. The problem is that this
> > conflicts with the virtio-blk/scsi failed requests behavior where 1 and
> > 2 should be kept in flight and migrated.

Ok, so I think I get your comments on vhost. Regarding the failed
requests behaviour, it looks like an implementation-specific feature
that is outside the virtio spec. If we want to preserve the
behaviour, we need to extend the virtio spec. Some quick thoughts:

1) extend the virtio-blk error codes
2) allow configuring the behaviour (report, ignore, stop) on error
via config space or control vq
3) signal the error via config interrupt

With all of the above, we could provide a virtio-blk that is fully
compatible with what qemu provides. Considering the complexity,
I wonder if we can start from something simple and build the features
gradually. If I'm not wrong, we can start by exposing something to
make it work like vhost-(user)-blk. Once all of those facilities are
implemented in the spec, vhost-vDPA gets them as well. Then
it looks to me that it's sufficient to define:

1) STOP
2) indices state synchronization
3) inflight descriptors report (is this a must?)

Or 3) could even be optional: to make things easier, having 1) and 2)
is sufficient to migrate networking devices, and we can do
3) on top?
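
A minimal sketch of what 2) and 3) might carry per virtqueue, assuming
a split virtqueue. The VirtqueueState type and its field names are
hypothetical and do not come from the spec or from any existing vhost
interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VirtqueueState:
    """Hypothetical per-virtqueue state read back after STOP."""
    last_avail_idx: int  # 2) next avail entry the device would fetch
    used_idx: int        # 2) next used entry the device would write
    inflight: List[int] = field(default_factory=list)  # 3) optional heads


def is_quiescent(state: VirtqueueState) -> bool:
    """True when the indices alone describe the queue (nothing in flight)."""
    same_idx = ((state.last_avail_idx - state.used_idx) & 0xFFFF) == 0
    return same_idx and not state.inflight
```

Under this model, a device that has nothing pending when it reports
STOP satisfies is_quiescent, which is the case where 1) and 2) alone
would be enough.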

> >
>
> (Only answering your example use case in this part of the mail)
>
> My intention was a little bit less rigid actually, but it does not
> meet the blk use case anyway.
>
> In that case, the device should be free to not mark descriptor 2 as
> used, since the device will start on last_avail_idx == 1, and it will
> read it again after the reset. And that requirement was intended to be
> removed once we implement a standard / device specific way to report
> them differently. It's loosely expressed as "Depending on the device,
> ... as long as the driver can recover its normal operation if it
> resumes the device without the need of resetting it".
>
> Although I thought this freedom would help devices to implement stop
> semantics, to track overridden descriptors could actually be way
> worse than simply split them as in flight == (used_idx,
> last_avail_idx) or available (last_avail_idx, avail_idx).
>
> I still think that, ideally, the device should be able to report
> differently the descriptors that are not-rewindables (for example,
> in-flight writes, because rewind them leave the device in an
> inconsistent state) and rewindable (not started writes, reads) to the
> driver. Just for the sake of flexibility. But potentially overridden
> descriptors complicate it, so it's probably not worth it. And our
> intended use case (live migration) has no use for it, so I think it's
> better to stick with in flight vs avail.
>
> (Now adding my view of Jason's point on top)
>
> At this moment, blk is able to detect ENOSPC because the device is in
> qemu's code, software based. If the device is out of qemu, it will
> need either:
> * A way to signal the error condition to qemu, so it can start the
> migration to solve it.
> * Another process to monitor available space so it can react & migrate.

See my reply above: that is not sufficient to be fully compatible with
the block layer behaviour of current Qemu.

>
> Since you pointed out a queue of failed requests, I will go with the
> first method. The data queues of the device reach directly the guest,
> so the device cannot use them to signal ENOSPC: To deliver it via
> VirtQueue will skip qemu. This is already outside of VirtIO. How would
> that work in the nested migration case, for example? The only way I
> can think at this moment is to use shadow virtqueue from the beginning
> of qemu operation.

Yes, we can't work around the feature without explicitly defining it in
the spec. I think we should start from something simpler (vhost-blk),
and extend virtio-blk on top (or do them in parallel).

Thanks

>
> Once qemu receives that signal, the guest would only see that
> descriptor id 1 has been used. For the next revision, it will see no
> descriptor used.
>
> Thanks!
>



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-12-01 10:21                   ` Stefan Hajnoczi
@ 2021-12-02  8:30                     ` Eugenio Perez Martin
  0 siblings, 0 replies; 43+ messages in thread
From: Eugenio Perez Martin @ 2021-12-02  8:30 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Jason Wang, Virtio-Dev, virtio-comment, Michael Tsirkin,
	Alexander Mikheev, Shahaf Shuler, Oren Duer, Halil Pasic,
	Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Wed, Dec 1, 2021 at 11:45 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Mon, Nov 29, 2021 at 05:55:57PM +0100, Eugenio Perez Martin wrote:
> > On Mon, Nov 29, 2021 at 11:29 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Nov 25, 2021 at 10:57:28AM +0800, Jason Wang wrote:
> > > >
> > > > On 2021/11/24 7:20 PM, Stefan Hajnoczi wrote:
> > > > > On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> > > > > > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > > > > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > > > > > I'm trying to understand how this would work. Available buffers may be
> > > > > > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > > > > > > avail ring could contain something like:
> > > > > > > > >
> > > > > > > > >    avail.ring = [Used, Not used, Used, Not used, ...]
> > > > > > > > >                                                  ^--- avail.idx
> > > > > > > > >
> > > > > > > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > > > > > > used" in avail.ring.
> > > > > > > > >
> > > > > > > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > > > > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > > > > > > num_not_used "Not used" buffers?
> > > > > > > > >
> > > > > > > > I'm going to also drop the "resume" part for the next version because
> > > > > > > > it adds extra complexity not actually needed, and it can be achieved
> > > > > > > > with a full reset in a simpler way.
> > > > > > > >
> > > > > > > > But I'll explain it below with your examples. Long story short, the
> > > > > > > > driver only can rewind the available descriptors that are still in the
> > > > > > > > available ring, and the device must flush the ones that cannot recover
> > > > > > > > from the ring.
> > > > > > > >
> > > > > > > > > I think there is a known issue with this approach:
> > > > > > > > >
> > > > > > > > > Imagine a vring with 4 elements:
> > > > > > > > >
> > > > > > > > >    avail.ring = [0,        1,    2,    3   ]
> > > > > > > > >                  Not used, used, used, used
> > > > > > > > >                                             ^--- avail.idx
> > > > > > > > >
> > > > > > > > > Since the device has used 3 buffers the driver now has space to make
> > > > > > > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > > > > > > and the driver overwrites the first element ("Not used"):
> > > > > > > > >
> > > > > > > > >    avail.ring = [1,        N/A,  N/A,  N/A]
> > > > > > > > >                  Not used, N/A,  N/A,  N/A
> > > > > > > > >                           ^--- avail.idx
> > > > > > > > >
> > > > > > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > > > > > > for the new available buffer.
> > > > > > > > >
> > > > > > > > > Now we stop the device, knowing there are two buffers available that
> > > > > > > > > have not been used. But avail.ring[] actually only contains the new
> > > > > > > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > > > > > > the old avail.ring[] element (vring descriptor 0).
> > > > > > > > >
> > > > > > > > > What now? Where does the device reset its internal avail_idx to?
> > > > > > > > To be on the same page, in qemu the device maintains two "internal
> > > > > > > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > > > > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > > > > > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > > > > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > > > > > > it can stop.
> > > > > > > >
> > > > > > > > The proposal allows the device to fail descriptor 0 in a
> > > > > > > > device-specific way, but I think now it was a bad choice.
> > > > > > > >
> > > > > > > > The driver cannot move the device's last_avail_idx in this operation:
> > > > > > > > The device is simply forced to flush used ones to the used ring or
> > > > > > > > descriptor ring in a packed vq case. So the device's internal
> > > > > > > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > > > > > > >
> > > > > > > > The device must keep its last_avail_idx through stop and resume cycle.
> > > > > > > Are you saying that all buffers avail->ring[i % ring_size] must be
> > > > > > > completed by the device before the STOP bit is reported where i <=
> > > > > > > last_avail_idx?
> > > > > > >
> > > > > > > This means the driver can modify avail->ring[i % ring_size] where
> > > > > > > avail_idx >= i > used_idx.
> > > > > > >
> > > > > > Yes, That's correct. The driver could also decide to modify the
> > > > > > descriptor table instead of the avail ring to do so, but I think the
> > > > > > point is clear now.
> > > > > >
> > > > > > Somehow it is thought after the premise that the out of order
> > > > > "Somehow it is thought after the premise" == "there is a fundamental
> > > > > design assumption"?
> > > > >
> > > > > > descriptors are descriptors that the device must wait to complete
> > > > > > before the pause anyway. Depending on the device, it might prefer to
> > > > > > cancel them, to wait for them, etc. The interesting descriptors to
> > > > > > rewind are the ones that have not reached the device (i > used_idx).
> > > > > > The driver can do whatever it wants with them.
> > > > > >
> > > > > > If we assume all the in-flight descriptors are idempotent and we
> > > > > > expose a way for the device to expose them, the model is way more
> > > > > > simpler than this.
> > > > > The constraint that the device has to mark all previously seen "avail"
> > > > > buffers as "used" is problematic. It makes STOP visible to the driver
> > > > > when the device has to fail requests.
> > > >
> > > >
> > > > I think we need some clarification here on the driver. For doing migration,
> > > > some kind of mediation is a must.
> > > >
> > > > As we've discussed in the previous versions of this proposal, the VMM
> > > > usually won't advertise the STOP feature to guest if we don't want to do
> > > > nested live migration (if we do we can shadow it anyhow).
> > > >
> > > > So from the guest point of view it won't see neither STOP nor the inflight
> > > > descriptors.
> > >
> > > That's not how I understand STOP semantics. See below.
> > >
> > > > > That is incompatible with how
> > > > > devices behave across live migration today. If you want to use STOP for
> > > > > live migration then it's probably necessary to rethink this constraint.
> > > > >
> > > > > QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> > > > > a list so they can be retried after problems with the underlying storage
> > > > > have been resolved (e.g. more disk space becomes available and ENOSPC
> > > > > requests can be retried).
> > > >
> > > >
> > > > A question, I think for those "failure" it's actually not visible from the
> > > > drive? If this is true, from the spec point of view, there are still
> > > > inflight. The VMM may choose to migrate them to the destination and
> > > > re-submit them there. This works more like vhost re-connection.
> > >
> > > That's how I would like STOP to work, but the semantics seem to be
> > > different. Eugenio can correct me if this is wrong:
> > >
> > > All avail descriptors before the last used descriptor must be marked
> > > used before the device reports the STOP bit. For example:
> > >
> > >   avail.ring = [1, 2, 3, 4]
> > >   used.ring = [3]
> > >
> > > The driver writes the STOP bit. Now the device MUST complete 1 and 2
> > > before reporting the STOP bit. Therefore we cannot keep 1 and 2
> > > in flight but it can keep 4 in flight. The problem is that this
> > > conflicts with the virtio-blk/scsi failed requests behavior where 1 and
> > > 2 should be kept in flight and migrated.
> > >
> >
> > (Only answering to your example use case at this part of the mail)
> >
> > My intention was a little bit less rigid actually, but it does not
> > meet the blk use case anyway.
> >
> > In that case, the device should be free to not mark descriptor 2 as
> > used, since the device will start on last_avail_idx == 1, and it will
> > read it again after the reset.
>
> last_avail_idx must be 3 since the device already saw avail.ring[0],
> avail.ring[1], and avail.ring[2]. How can last_avail_idx be 1?
>

(This is another proof of how complicated my proposal was without
realizing it, and that giving so much freedom to the device could have
been a mistake :) ).

You are right in case of tx queues or blk writes, for example. Writes
of descriptors 1 and 2 could already be in progress, so it's NOT
transparent to the guest. In that case, it should act as you say.

Seen as an rx queue or read-only operations, last_avail_idx will
be 3 before the stop, but after the stop the device is free to give
the status it wants as long as it can recover its state. And it can
recover those descriptors.

The main point is that the rewind by the device is transparent to the
driver in that case, because it is part of the device's internal state.
The driver cannot tell if the device was able to see them or not. So
the device can recover its normal operation on resume. And those
operations would not block the STOP.

But I can see how this freedom blurs the proposal a lot, adding no real value.
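
To make the example from the thread concrete, here is a minimal sketch
of the plain "in flight vs avail" split. It is only an illustration:
the function name is hypothetical, the used ring is reduced to a list
of descriptor ids, and it assumes indices start at 0 with no
wrap-around.

```python
def classify(avail_ring, avail_idx, last_avail_idx, used_ring):
    """Split descriptor heads into in-flight and still-available lists.

    avail_ring     -- descriptor heads the driver made available
    avail_idx      -- number of entries the driver has published
    last_avail_idx -- number of entries the device has already fetched
    used_ring      -- descriptor ids the device has marked as used
    """
    seen = avail_ring[:last_avail_idx]
    in_flight = [head for head in seen if head not in used_ring]
    available = avail_ring[last_avail_idx:avail_idx]
    return in_flight, available


# The example from the thread: avail.ring = [1, 2, 3, 4], used.ring = [3],
# and the device has already seen the first three avail entries.
in_flight, available = classify([1, 2, 3, 4], 4, 3, [3])
```

Here descriptors 1 and 2 end up in flight and 4 stays available, so a
tx queue or blk write could not pretend it never saw 1 and 2.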

> Also, what does "reset" mean? I don't think a VIRTIO device reset is
> part of this process, just setting and clearing the STOP bit.
>

Sorry, s/reset/resume/.

> > And that requirement was intended to be
> > removed once we implement a standard / device specific way to report
> > them differently. It's loosely expressed as "Depending on the device,
> > ... as long as the driver can recover its normal operation if it
> > resumes the device without the need of resetting it".
> >
> > Although I thought this freedom would help devices to implement stop
> > semantics, to track overridden descriptors could actually be way
> > worse than simply split them as in flight == (used_idx,
> > last_avail_idx) or available (last_avail_idx, avail_idx).
> >
> > I still think that, ideally, the device should be able to report
> > differently the descriptors that are not-rewindables (for example,
> > in-flight writes, because rewind them leave the device in an
> > inconsistent state) and rewindable (not started writes, reads) to the
> > driver. Just for the sake of flexibility. But potentially overridden
> > descriptors complicate it, so it's probably not worth it. And our
> > intended use case (live migration) has no use for it, so I think it's
> > better to stick with in flight vs avail.
> >
> > (Now adding my view of Jason's point on top)
> >
> > At this moment, blk is able to detect ENOSPC because the device is in
> > qemu's code, software based. If the device is out of qemu, it will
> > need either:
> > * A way to signal the error condition to qemu, so it can start the
> > migration to solve it.
> > * Another process to monitor available space so it can react & migrate.
> >
> > Since you pointed out a queue of failed requests, I will go with the
> > first method. The data queues of the device reach directly the guest,
> > so the device cannot use them to signal ENOSPC: To deliver it via
> > VirtQueue will skip qemu. This is already outside of VirtIO. How would
> > that work in the nested migration case, for example? The only way I
> > can think at this moment is to use shadow virtqueue from the beginning
> > of qemu operation.
> >
> > Once qemu receives that signal, the guest would only see that
> > descriptor id 1 has been used. For the next revision, it will see no
> > descriptor used.
>
> I don't understand the idea here. I'll wait until the next revision of
> this series to think through virtio-blk again.
>
> Stefan



* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-12-02  2:40                   ` Jason Wang
@ 2021-12-02  9:44                     ` Stefan Hajnoczi
  2021-12-03  2:09                       ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2021-12-02  9:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: Eugenio Perez Martin, Virtio-Dev, virtio-comment,
	Michael Tsirkin, Alexander Mikheev, Shahaf Shuler, Oren Duer,
	Halil Pasic, Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Thu, Dec 02, 2021 at 10:40:57AM +0800, Jason Wang wrote:
> On Tue, Nov 30, 2021 at 12:56 AM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Mon, Nov 29, 2021 at 11:29 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Nov 25, 2021 at 10:57:28AM +0800, Jason Wang wrote:
> > > >
> > > > On 2021/11/24 7:20 PM, Stefan Hajnoczi wrote:
> > > > > On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> > > > > > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > > > > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > > > > > I'm trying to understand how this would work. Available buffers may be
> > > > > > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > > > > > > avail ring could contain something like:
> > > > > > > > >
> > > > > > > > >    avail.ring = [Used, Not used, Used, Not used, ...]
> > > > > > > > >                                                  ^--- avail.idx
> > > > > > > > >
> > > > > > > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > > > > > > used" in avail.ring.
> > > > > > > > >
> > > > > > > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > > > > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > > > > > > num_not_used "Not used" buffers?
> > > > > > > > >
> > > > > > > > I'm going to also drop the "resume" part for the next version because
> > > > > > > > it adds extra complexity not actually needed, and it can be achieved
> > > > > > > > with a full reset in a simpler way.
> > > > > > > >
> > > > > > > > But I'll explain it below with your examples. Long story short, the
> > > > > > > > driver only can rewind the available descriptors that are still in the
> > > > > > > > available ring, and the device must flush the ones that cannot recover
> > > > > > > > from the ring.
> > > > > > > >
> > > > > > > > > I think there is a known issue with this approach:
> > > > > > > > >
> > > > > > > > > Imagine a vring with 4 elements:
> > > > > > > > >
> > > > > > > > >    avail.ring = [0,        1,    2,    3   ]
> > > > > > > > >                  Not used, used, used, used
> > > > > > > > >                                             ^--- avail.idx
> > > > > > > > >
> > > > > > > > > Since the device has used 3 buffers the driver now has space to make
> > > > > > > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > > > > > > and the driver overwrites the first element ("Not used"):
> > > > > > > > >
> > > > > > > > >    avail.ring = [1,        N/A,  N/A,  N/A]
> > > > > > > > >                  Not used, N/A,  N/A,  N/A
> > > > > > > > >                           ^--- avail.idx
> > > > > > > > >
> > > > > > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > > > > > > for the new available buffer.
> > > > > > > > >
> > > > > > > > > Now we stop the device, knowing there are two buffers available that
> > > > > > > > > have not been used. But avail.ring[] actually only contains the new
> > > > > > > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > > > > > > the old avail.ring[] element (vring descriptor 0).
> > > > > > > > >
> > > > > > > > > What now? Where does the device reset its internal avail_idx to?
> > > > > > > > To be on the same page, in qemu the device maintains two "internal
> > > > > > > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > > > > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > > > > > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > > > > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > > > > > > it can stop.
> > > > > > > >
> > > > > > > > The proposal allows the device to fail descriptor 0 in a
> > > > > > > > device-specific way, but I think now it was a bad choice.
> > > > > > > >
> > > > > > > > The driver cannot move the device's last_avail_idx in this operation:
> > > > > > > > The device is simply forced to flush used ones to the used ring or
> > > > > > > > descriptor ring in a packed vq case. So the device's internal
> > > > > > > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > > > > > > >
> > > > > > > > The device must keep its last_avail_idx through stop and resume cycle.
> > > > > > > Are you saying that all buffers avail->ring[i % ring_size] must be
> > > > > > > completed by the device before the STOP bit is reported where i <=
> > > > > > > last_avail_idx?
> > > > > > >
> > > > > > > This means the driver can modify avail->ring[i % ring_size] where
> > > > > > > avail_idx >= i > used_idx.
> > > > > > >
> > > > > > Yes, That's correct. The driver could also decide to modify the
> > > > > > descriptor table instead of the avail ring to do so, but I think the
> > > > > > point is clear now.
> > > > > >
> > > > > > Somehow it is thought after the premise that the out of order
> > > > > "Somehow it is thought after the premise" == "there is a fundamental
> > > > > design assumption"?
> > > > >
> > > > > > descriptors are descriptors that the device must wait to complete
> > > > > > before the pause anyway. Depending on the device, it might prefer to
> > > > > > cancel them, to wait for them, etc. The interesting descriptors to
> > > > > > rewind are the ones that have not reached the device (i > used_idx).
> > > > > > The driver can do whatever it wants with them.
> > > > > >
> > > > > > If we assume all the in-flight descriptors are idempotent and we
> > > > > > expose a way for the device to expose them, the model is way more
> > > > > > simpler than this.
> > > > > The constraint that the device has to mark all previously seen "avail"
> > > > > buffers as "used" is problematic. It makes STOP visible to the driver
> > > > > when the device has to fail requests.
> > > >
> > > >
> > > > I think we need some clarification here on the driver. For doing migration,
> > > > some kind of mediation is a must.
> > > >
> > > > As we've discussed in the previous versions of this proposal, the VMM
> > > > usually won't advertise the STOP feature to guest if we don't want to do
> > > > nested live migration (if we do we can shadow it anyhow).
> > > >
> > > > So from the guest point of view it won't see neither STOP nor the inflight
> > > > descriptors.
> > >
> > > That's not how I understand STOP semantics. See below.
> > >
> > > > > That is incompatible with how
> > > > > devices behave across live migration today. If you want to use STOP for
> > > > > live migration then it's probably necessary to rethink this constraint.
> > > > >
> > > > > QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> > > > > a list so they can be retried after problems with the underlying storage
> > > > > have been resolved (e.g. more disk space becomes available and ENOSPC
> > > > > requests can be retried).
> > > >
> > > >
> > > > A question, I think for those "failure" it's actually not visible from the
> > > > drive? If this is true, from the spec point of view, there are still
> > > > inflight. The VMM may choose to migrate them to the destination and
> > > > re-submit them there. This works more like vhost re-connection.
> > >
> > > That's how I would like STOP to work, but the semantics seem to be
> > > different. Eugenio can correct me if this is wrong:
> > >
> > > All avail descriptors before the last used descriptor must be marked
> > > used before the device reports the STOP bit. For example:
> > >
> > >   avail.ring = [1, 2, 3, 4]
> > >   used.ring = [3]
> > >
> > > The driver writes the STOP bit. Now the device MUST complete 1 and 2
> > > before reporting the STOP bit. Therefore we cannot keep 1 and 2
> > > in flight but it can keep 4 in flight. The problem is that this
> > > conflicts with the virtio-blk/scsi failed requests behavior where 1 and
> > > 2 should be kept in flight and migrated.
> 
> Ok, so I think I get your comments on the vhost. Regarding the failed
> requests behaviour, it looks like it's an implementation specific
> feature which is out of the virtio spec. If we want to preserve the
> behaviour, we need to extend the virtio spec.
> 
> Some quick thoughts:
> 
> 1) extend the virtio-blk error codes
> 2) allow to configure the behaviour (report, ignore, stop) on error
> via config space or control vq
> 3) signal the error via config interrupt
> 
> With all of the above, it may provide a virtio-blk that is fully
> compatible with what is provided by qemu. Considering its complexity,
> I wonder if we can start from something simple and build the features
> gradually. If I was not wrong, we can start by exposing something to
> make it work like vhost-(user)-blk. When all of those facilities were
> implemented in the spec, vhost-vDPA got those facilities as well. Then
> it looks to me it's sufficient to define:
> 
> 1) STOP
> 2) indices state synchronization
> 3) inflight descriptors report (is this a must?)
> 
> Or even 3) could be optional, to make things easier, having 1) and 2)
> makes it sufficient to migrate the networking devices. And we can do
> 3) on top?

I suggest:

1. STOP
2. vhost-style get_vring_base() to fetch last_avail_idx
3. Device state save/load

Under this model the driver can rewind the avail ring, but only back to
last_avail_idx. This gives the device the freedom to keep in-flight
requests like virtio-blk/scsi's failed requests list. Information about
those requests will be transferred as part of device state save/load.

This patch series just needs to define STOP. Rewinding descriptors
requires a new get_vring_base()-style operation while the device is in
the stopped state. This can be proposed separately. Device state
save/load can also be proposed separately.
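
A minimal sketch of the rewind rule, assuming split virtqueue indices
(free-running 16-bit counters). The helper name is hypothetical, and
last_avail_idx stands for whatever a get_vring_base()-style operation
would return while the device is stopped:

```python
RING_IDX_MASK = 0xFFFF  # split virtqueue indices wrap at 2**16


def can_rewind_to(new_avail_idx: int, last_avail_idx: int, avail_idx: int) -> bool:
    """True if new_avail_idx lies in [last_avail_idx, avail_idx], modulo 2**16.

    Anything before last_avail_idx has already been fetched by the device
    and may be in flight, so the driver must not rewind past it.
    """
    span = (avail_idx - last_avail_idx) & RING_IDX_MASK
    offset = (new_avail_idx - last_avail_idx) & RING_IDX_MASK
    return offset <= span
```

With last_avail_idx == 3 and avail_idx == 4, rewinding to 3 is allowed
and rewinding to 2 is not.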

Stefan


* Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  2021-12-02  9:44                     ` Stefan Hajnoczi
@ 2021-12-03  2:09                       ` Jason Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Wang @ 2021-12-03  2:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Eugenio Perez Martin, Virtio-Dev, virtio-comment,
	Michael Tsirkin, Alexander Mikheev, Shahaf Shuler, Oren Duer,
	Halil Pasic, Cornelia Huck, Bodong Wang, Dr . David Alan Gilbert,
	Parav Pandit, Max Gurtovoy

On Thu, Dec 2, 2021 at 9:09 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Dec 02, 2021 at 10:40:57AM +0800, Jason Wang wrote:
> > On Tue, Nov 30, 2021 at 12:56 AM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Nov 29, 2021 at 11:29 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Nov 25, 2021 at 10:57:28AM +0800, Jason Wang wrote:
> > > > >
> > > > > On 2021/11/24 7:20 PM, Stefan Hajnoczi wrote:
> > > > > > On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> > > > > > > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > > > > > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio Pérez wrote:
> > > > > > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > > > > > > I'm trying to understand how this would work. Available buffers may be
> > > > > > > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > > > > > > > avail ring could contain something like:
> > > > > > > > > >
> > > > > > > > > >    avail.ring = [Used, Not used, Used, Not used, ...]
> > > > > > > > > >                                                  ^--- avail.idx
> > > > > > > > > >
> > > > > > > > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > > > > > > > used" in avail.ring.
> > > > > > > > > >
> > > > > > > > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > > > > > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > > > > > > > num_not_used "Not used" buffers?
> > > > > > > > > >
> > > > > > > > > I'm going to also drop the "resume" part for the next version because
> > > > > > > > > it adds extra complexity not actually needed, and it can be achieved
> > > > > > > > > with a full reset in a simpler way.
> > > > > > > > >
> > > > > > > > > But I'll explain it below with your examples. Long story short, the
> > > > > > > > > driver only can rewind the available descriptors that are still in the
> > > > > > > > > available ring, and the device must flush the ones that cannot recover
> > > > > > > > > from the ring.
> > > > > > > > >
> > > > > > > > > > I think there is a known issue with this approach:
> > > > > > > > > >
> > > > > > > > > > Imagine a vring with 4 elements:
> > > > > > > > > >
> > > > > > > > > >    avail.ring = [0,        1,    2,    3   ]
> > > > > > > > > >                  Not used, used, used, used
> > > > > > > > > >                                             ^--- avail.idx
> > > > > > > > > >
> > > > > > > > > > Since the device has used 3 buffers the driver now has space to make
> > > > > > > > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > > > > > > > and the driver overwrites the first element ("Not used"):
> > > > > > > > > >
> > > > > > > > > >    avail.ring = [1,        N/A,  N/A,  N/A]
> > > > > > > > > >                  Not used, N/A,  N/A,  N/A
> > > > > > > > > >                           ^--- avail.idx
> > > > > > > > > >
> > > > > > > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > > > > > > > for the new available buffer.
> > > > > > > > > >
> > > > > > > > > > Now we stop the device, knowing there are two buffers available that
> > > > > > > > > > have not been used. But avail.ring[] actually only contains the new
> > > > > > > > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > > > > > > > the old avail.ring[] element (vring descriptor 0).
> > > > > > > > > >
> > > > > > > > > > What now? Where does the device reset its internal avail_idx to?
> > > > > > > > > To be on the same page, in qemu the device maintains two "internal
> > > > > > > > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > > > > > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > > > > > > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > > > > > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > > > > > > > it can stop.
> > > > > > > > >
> > > > > > > > > The proposal allows the device to fail descriptor 0 in a
> > > > > > > > > device-specific way, but I think now it was a bad choice.
> > > > > > > > >
> > > > > > > > > The driver cannot move the device's last_avail_idx in this operation:
> > > > > > > > > The device is simply forced to flush used ones to the used ring or
> > > > > > > > > descriptor ring in a packed vq case. So the device's internal
> > > > > > > > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > > > > > > > >
> > > > > > > > > The device must keep its last_avail_idx through stop and resume cycle.
> > > > > > > > Are you saying that all buffers avail->ring[i % ring_size] must be
> > > > > > > > completed by the device before the STOP bit is reported where i <=
> > > > > > > > last_avail_idx?
> > > > > > > >
> > > > > > > > This means the driver can modify avail->ring[i % ring_size] where
> > > > > > > > avail_idx >= i > used_idx.
> > > > > > > >
> > > > > > > > Yes, that's correct. The driver could also decide to modify the
> > > > > > > descriptor table instead of the avail ring to do so, but I think the
> > > > > > > point is clear now.
> > > > > > >
> > > > > > > Somehow it is thought after the premise that the out of order
> > > > > > "Somehow it is thought after the premise" == "there is a fundamental
> > > > > > design assumption"?
> > > > > >
> > > > > > > descriptors are descriptors that the device must wait to complete
> > > > > > > before the pause anyway. Depending on the device, it might prefer to
> > > > > > > cancel them, to wait for them, etc. The interesting descriptors to
> > > > > > > rewind are the ones that have not reached the device (i > used_idx).
> > > > > > > The driver can do whatever it wants with them.
> > > > > > >
> > > > > > > > If we assume all the in-flight descriptors are idempotent and we
> > > > > > > > provide a way for the device to expose them, the model is much
> > > > > > > > simpler than this.
> > > > > > The constraint that the device has to mark all previously seen "avail"
> > > > > > buffers as "used" is problematic. It makes STOP visible to the driver
> > > > > > when the device has to fail requests.
> > > > >
> > > > >
> > > > > I think we need some clarification here on the driver. For doing migration,
> > > > > some kind of mediation is a must.
> > > > >
> > > > > As we've discussed in the previous versions of this proposal, the VMM
> > > > > usually won't advertise the STOP feature to guest if we don't want to do
> > > > > nested live migration (if we do we can shadow it anyhow).
> > > > >
> > > > > > So from the guest point of view it will see neither STOP nor the inflight
> > > > > > descriptors.
> > > >
> > > > That's not how I understand STOP semantics. See below.
> > > >
> > > > > > That is incompatible with how
> > > > > > devices behave across live migration today. If you want to use STOP for
> > > > > > live migration then it's probably necessary to rethink this constraint.
> > > > > >
> > > > > > QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> > > > > > a list so they can be retried after problems with the underlying storage
> > > > > > have been resolved (e.g. more disk space becomes available and ENOSPC
> > > > > > requests can be retried).
> > > > >
> > > > >
> > > > > > A question: I think those "failures" are actually not visible to the
> > > > > > driver? If this is true, then from the spec point of view they are still
> > > > > > inflight. The VMM may choose to migrate them to the destination and
> > > > > > re-submit them there. This works more like vhost re-connection.
> > > >
> > > > That's how I would like STOP to work, but the semantics seem to be
> > > > different. Eugenio can correct me if this is wrong:
> > > >
> > > > All avail descriptors before the last used descriptor must be marked
> > > > used before the device reports the STOP bit. For example:
> > > >
> > > >   avail.ring = [1, 2, 3, 4]
> > > >   used.ring = [3]
> > > >
> > > > The driver writes the STOP bit. Now the device MUST complete 1 and 2
> > > > before reporting the STOP bit. Therefore it cannot keep 1 and 2
> > > > in flight but it can keep 4 in flight. The problem is that this
> > > > conflicts with the virtio-blk/scsi failed requests behavior where 1 and
> > > > 2 should be kept in flight and migrated.
> >
> > Ok, so I think I get your comments on the vhost. Regarding the failed
> > requests behaviour, it looks like it's an implementation specific
> > feature which is out of the virtio spec. If we want to preserve the
> > behaviour, we need to extend the virtio spec.
> >
> > Some quick thoughts:
> >
> > 1) extend the virtio-blk error codes
> > 2) allow configuring the behaviour (report, ignore, stop) on error
> > via config space or control vq
> > 3) signal the error via config interrupt
> >
> > With all of the above, it may provide a virtio-blk that is fully
> > compatible with what is provided by qemu. Considering its complexity,
> > I wonder if we can start from something simple and build the features
> > gradually. If I am not wrong, we can start by exposing something to
> > make it work like vhost-(user)-blk. Once all of those facilities are
> > implemented in the spec, vhost-vDPA gets those facilities as well. Then
> > it looks to me it's sufficient to define:
> >
> > 1) STOP
> > 2) indices state synchronization
> > 3) inflight descriptors report (is this a must?)
> >
> > Or 3) could even be optional to make things easier: having 1) and 2)
> > is sufficient to migrate the networking devices. And we can do
> > 3) on top?
>
> I suggest:
>
> 1. STOP
> 2. vhost-style get_vring_base() to fetch last_avail_idx
> 3. Device state save/load
>
> Under this model the driver can rewind the avail ring, but only back to
> last_avail_idx. This gives the device the freedom to keep in-flight
> requests like virtio-blk/scsi's failed requests list. Information about
> those requests will be transferred as part of device state save/load.
>
> This patch series just needs to define STOP. Rewinding descriptors
> requires a new get_vring_base()-style operation while the device is in
> the stopped state.

Right, I think we agree that in the next version we will send the
last_avail_idx stuff.
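
The rewind rule quoted above (the driver can rewind the avail ring, but
only back to last_avail_idx) could be sketched roughly as below. This is
only an illustration: struct vq_stop_state and both helper names are
hypothetical and come from neither the spec nor QEMU, and the sketch
assumes split-ring style free-running 16-bit indices.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of one virtqueue, read while the device is stopped. */
struct vq_stop_state {
    uint16_t last_avail_idx; /* next avail entry the device would fetch */
    uint16_t avail_idx;      /* avail->idx as published by the driver */
};

/* Entries made available but not yet fetched (wraps modulo 2^16). */
static uint16_t vq_unfetched(const struct vq_stop_state *s)
{
    return (uint16_t)(s->avail_idx - s->last_avail_idx);
}

/*
 * True if the driver may move avail->idx to new_idx, that is, if new_idx
 * lies in the wrapping range [last_avail_idx, avail_idx]. Anything before
 * last_avail_idx has already been fetched and stays owned by the device.
 */
static bool vq_can_rewind_to(const struct vq_stop_state *s, uint16_t new_idx)
{
    uint16_t offset = (uint16_t)(new_idx - s->last_avail_idx);

    return offset <= vq_unfetched(s);
}
```

With last_avail_idx == 3 and avail_idx == 5, rewinding to 3, 4 or 5 is
accepted, while 2 is rejected because the device has already fetched it.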

> This can be proposed separately. Device state
> save/load can also be proposed separately.

Yes.

Thanks

>
> Stefan

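For reference, the v3 completion constraint as Stefan describes it in
the quoted example (avail.ring = [1, 2, 3, 4], used.ring = [3]) can be
sketched as below. It is a sketch of the semantics under discussion, not
of the model agreed at the end of the thread, and stop_must_complete()
and is_used() are hypothetical helper names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if descriptor head `head` already appears in the used ring. */
static bool is_used(const uint16_t *used, size_t n_used, uint16_t head)
{
    for (size_t i = 0; i < n_used; i++) {
        if (used[i] == head)
            return true;
    }
    return false;
}

/*
 * Collect the descriptor heads that were made available before the last
 * already-used one and are not used yet. Under the v3 semantics these
 * must be completed before the device reports STOP; later entries may
 * stay in flight. Returns the number of heads written to must_complete,
 * which needs room for n_avail entries.
 */
static size_t stop_must_complete(const uint16_t *avail, size_t n_avail,
                                 const uint16_t *used, size_t n_used,
                                 uint16_t *must_complete)
{
    size_t last_used_pos = 0;
    bool any_used = false;
    size_t count = 0;

    /* Position, in avail order, of the last entry that is already used. */
    for (size_t i = 0; i < n_avail; i++) {
        if (is_used(used, n_used, avail[i])) {
            last_used_pos = i;
            any_used = true;
        }
    }
    if (!any_used)
        return 0;

    for (size_t i = 0; i < last_used_pos; i++) {
        if (!is_used(used, n_used, avail[i]))
            must_complete[count++] = avail[i];
    }
    return count;
}
```

For the example above the function reports descriptors 1 and 2, while 4
may remain in flight. That forced completion of 1 and 2 is what
conflicts with the virtio-blk/scsi failed requests list.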


end of thread, other threads:[~2021-12-03  2:09 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
2021-11-11 18:58 [PATCH v3 0/2] virtio: introduce STOP status bit Eugenio Pérez
2021-11-11 18:58 ` [PATCH v3 1/2] content: Explain better the status clearing bits Eugenio Pérez
2021-11-12  3:46   ` Jason Wang
2021-11-12 11:41     ` Eugenio Perez Martin
2021-11-12 10:34   ` [virtio-dev] " Cornelia Huck
2021-11-12 11:41     ` Eugenio Perez Martin
2021-11-11 18:58 ` [PATCH v3 2/2] virtio: introduce STOP status bit Eugenio Pérez
2021-11-12  4:18   ` Jason Wang
2021-11-12 10:50     ` Eugenio Perez Martin
2021-11-15  4:08       ` Jason Wang
2021-11-15 18:16         ` Eugenio Perez Martin
2021-11-16  6:56           ` Jason Wang
2021-11-16 14:50             ` Eugenio Perez Martin
2021-11-17  3:27               ` Jason Wang
2021-11-17  8:08                 ` Eugenio Perez Martin
2021-11-18  3:27                   ` Jason Wang
2021-11-18 15:59   ` Stefan Hajnoczi
2021-11-18 19:58     ` Eugenio Perez Martin
2021-11-23 12:16       ` Stefan Hajnoczi
2021-11-23 17:00         ` [virtio-dev] " Eugenio Perez Martin
2021-11-24 11:20           ` Stefan Hajnoczi
2021-11-24 16:41             ` Eugenio Perez Martin
2021-11-29 10:32               ` Stefan Hajnoczi
2021-11-25  2:57             ` Jason Wang
2021-11-29 10:29               ` Stefan Hajnoczi
2021-11-29 16:55                 ` Eugenio Perez Martin
2021-12-01 10:21                   ` Stefan Hajnoczi
2021-12-02  8:30                     ` Eugenio Perez Martin
2021-12-02  2:40                   ` Jason Wang
2021-12-02  9:44                     ` Stefan Hajnoczi
2021-12-03  2:09                       ` Jason Wang
2021-11-18 14:45 ` [PATCH v3 0/2] " Stefan Hajnoczi
2021-11-18 16:49   ` Eugenio Perez Martin
2021-11-23 11:33     ` Stefan Hajnoczi
2021-11-23 16:19       ` Eugenio Perez Martin
2021-11-24 15:26         ` Stefan Hajnoczi
2021-11-24 16:58           ` Eugenio Perez Martin
2021-11-25  3:05         ` Jason Wang
2021-11-25  7:24           ` Eugenio Perez Martin
2021-11-25  7:38             ` Jason Wang
2021-11-25  9:01               ` Eugenio Perez Martin
2021-11-25  9:10                 ` Eugenio Perez Martin
     [not found]                 ` <CACGkMEvD+Z7cYszhMzBsnEaC0K0kfnHxzFDEfjT_qLOFiMR-XA@mail.gmail.com>
2021-11-26  8:26                   ` Eugenio Perez Martin
