All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] Add device reset timeout field
@ 2021-10-06 14:10 Parav Pandit
  2021-10-06 15:22 ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-06 14:10 UTC (permalink / raw)
  To: cohuck, virtio-dev, mst; +Cc: mgurtovoy, shahafs, oren, parav

Motivation:
==========
Currently when driver initiates a device reset operation, a device may
not be able finish the reset operation immediately. In such case driver
waits for an arbitrary amount of time in a hope that device will finish
the reset. Depending on the device type and its backend implementation a
device timeout can be different. Trying to wait for all device type in
same value may not be efficient or adequate.

Proposal:
========
Hence, enhance the specification to let device report a device reset
timeout value in milliseconds. Such hint helps driver to know how long
should it wait for device reset to finish.

This proposal introduces optional device reset timeout field for MMIO
and PCI transports. Such transports have a use case where virtio devices
are implemented in hardware and made available to the guest.

A device reset timeout field has following attributes.
(a) It is an optional field to maintain backward compatibility.
(b) It is valid only when device advertises a new feature bit
    VIRTIO_F_DEV_RESET_TO
(b) When it is not advertised, this field is invalid and driver should
    choose the reasonable reset timeout.
(c) It is a 16-bit field covering wide range of timeout from 10
milliseconds to several hundred seconds.

This proposal is an improvement over a past proposal [1].
Instead of just passing a flag for wait, this proposal offers upper bound
wait time making the device reset timeout and overall device initialization
more predictable.

[1] https://lists.oasis-open.org/archives/virtio-dev/202108/msg00161.html

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>

---
changelog:
v1->v2:
- v1: https://lists.oasis-open.org/archives/virtio-dev/202110/msg00027.html
- Addressed below comments from Cornelia Huck
- fixed article 'the' addition before driver
- removed maximum from reset timeout
- fixed removed article 'the' from unwanted places
- changed 'feature is set' to 'feature offered'
- rewrote device reset timeout paragraph for driver handling
v0->v1:
- v0: https://lists.oasis-open.org/archives/virtio-dev/202109/msg00133.html
- Addressed below comments from Cornelia Huck
- renamed VIRTIO_F_DEV_RESET_TO to VIRTIO_F_DEV_RESET_TIMEOUT
- removed references to device initialization sequence during device
  reset flow
- rewrote 'giving up on device'
---
 content.tex | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/content.tex b/content.tex
index 37c45d3..9b1399f 100644
--- a/content.tex
+++ b/content.tex
@@ -73,6 +73,12 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
 recover by issuing a reset.
 \end{note}
 
+Once the driver initiates a reset, the device may not be able to finish the reset
+immediately. To accommodate that situation, the device can provide a hint to the
+driver about how long it might take the device to complete the reset. The driver should
+wait at least for the time specified by the device to let device finish the reset or if
+the device status field to become 0 within the specified time interval.
+
 \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
 
 The device MUST NOT consume buffers or send any used buffer
@@ -210,11 +216,23 @@ \section{Device Reset}\label{sec:Basic Facilities of a Virtio Device / Device Re
 indicating completion of the reset by reinitializing \field{device status}
 to 0, until the driver re-initializes the device.
 
+A device may not be able to complete reset action immediately when the driver initiates a reset operation.
+Such a device should provide a hint as to how long a device may take to finish the reset operation.
+This hint is provided by the device via a device reset timeout value in units of 10 milliseconds.
+
 \drivernormative{\subsection}{Device Reset}{Basic Facilities of a Virtio Device / Device Reset}
 
 The driver SHOULD consider a driver-initiated reset complete when it
 reads \field{device status} as 0.
 
+When a device reports a non-zero device reset timeout value, the driver SHOULD wait for the device
+to finish the reset action before considering the reset operation to have failed.
+When a device reports a non-zero device reset timeout value, the driver SHOULD wait for the device
+to finish the reset action. Once the specified timeout has expired, the driver should consider that reset
+operation has failed. When device reset timeout is not provided by the device, the driver SHOULD choose a
+reasonable timeout value for reset operation to complete.
+
+
 \section{Device Configuration Space}\label{sec:Basic Facilities of a Virtio Device / Device Configuration Space}
 
 Device configuration space is generally used for rarely-changing or
@@ -859,6 +877,8 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
         le64 queue_driver;              /* read-write */
         le64 queue_device;              /* read-write */
         le16 queue_notify_data;         /* read-only for driver */
+        /* About the whole device. */
+        le16 device_reset_timeout;      /* read-only for driver */
 };
 \end{lstlisting}
 
@@ -938,6 +958,15 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
         may benefit from providing another value, for example an internal virtqueue
         identifier, or an internal offset related to the virtqueue number.
         \end{note}
+
+\item[\field{device_reset_timeout}]
+        This field exists only if VIRTIO_F_DEV_RESET_TIMEOUT is advertised by the device.
+        This field provides a hint to the driver indicating how much maximum time a
+        device can take to complete the reset once the driver initiates the device reset
+        operation. This field unit is in 10 milliseconds. For example, a field value of
+        3 indicates device reset timeout is 30 milliseconds. When VIRTIO_F_DEV_RESET_TIMEOUT
+	feature is offered by the device this field must contain a non zero value.
+
 \end{description}
 
 \devicenormative{\paragraph}{Common configuration structure layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout}
@@ -1804,6 +1833,15 @@ \subsection{MMIO Device Register Layout}\label{sec:Virtio Transport Options / Vi
     accesses apply to the queue selected by writing to \field{QueueSel}.
   }
   \hline 
+  \mmioreg{DeviceResetTimeout}{Device reset timeout}{0x04c}{R}{%
+    This register exists only if VIRTIO_F_DEV_RESET_TIMEOUT is advertised by the device.
+    It provides the hint to the driver indicating how much maximum time a device can take
+    to complete the reset once the driver initiates the device reset operation.
+    This field unit is in 10 milliseconds. For example, a field value of 3 indicates device
+    reset timeout is 30 milliseconds. When VIRTIO_F_DEV_RESET_TIMEOUT feature is offered by the device
+    this field must contain a non zero value.
+  }
+  \hline
   \mmioreg{QueueNotify}{Queue notifier}{0x050}{W}{%
     Writing a value to this register notifies the device that
     there are new buffers to process in a queue.
@@ -6673,6 +6711,11 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
   transport specific.
   For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
 
+  \item[VIRTIO_F_DEV_RESET_TIMEOUT(40)] This feature indicates that the device
+  advertises a reset timeout which the driver should use during device reset stage.
+  For more details about device reset timeout over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout}.
+  For more details about device reset timeout over MMIO see \ref{sec:Virtio Transport Options / Virtio Over MMIO / MMIO Device Register Layout}.
+
 \end{description}
 
 \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] Add device reset timeout field
  2021-10-06 14:10 [PATCH v2] Add device reset timeout field Parav Pandit
@ 2021-10-06 15:22 ` Michael S. Tsirkin
  2021-10-06 16:11   ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-06 15:22 UTC (permalink / raw)
  To: Parav Pandit; +Cc: cohuck, virtio-dev, mgurtovoy, shahafs, oren

On Wed, Oct 06, 2021 at 05:10:33PM +0300, Parav Pandit wrote:
> Motivation:
> ==========
> Currently when driver initiates a device reset operation, a device may
> not be able finish the reset operation immediately. In such case driver
> waits for an arbitrary amount of time in a hope that device will finish
> the reset.

Hmm does it? Where does spec say that it does? Does not linux
really wait forever?

> Depending on the device type and its backend implementation a
> device timeout can be different. Trying to wait for all device type in
> same value may not be efficient or adequate.

Right but do we really want a timeout at all?
Why not wait until device is ready or until ctrl-c?
A timeout makes things like debugging backends trickier ...

> 
> Proposal:
> ========
> Hence, enhance the specification to let device report a device reset
> timeout value in milliseconds. Such hint helps driver to know how long
> should it wait for device reset to finish.
> 
> This proposal introduces optional device reset timeout field for MMIO
> and PCI transports. Such transports have a use case where virtio devices
> are implemented in hardware and made available to the guest.
> 
> A device reset timeout field has following attributes.
> (a) It is an optional field to maintain backward compatibility.
> (b) It is valid only when device advertises a new feature bit
>     VIRTIO_F_DEV_RESET_TO
> (b) When it is not advertised, this field is invalid and driver should
>     choose the reasonable reset timeout.
> (c) It is a 16-bit field covering wide range of timeout from 10
> milliseconds to several hundred seconds.
> 
> This proposal is an improvement over a past proposal [1].
> Instead of just passing a flag for wait, this proposal offers upper bound
> wait time making the device reset timeout and overall device initialization
> more predictable.
> 
> [1] https://lists.oasis-open.org/archives/virtio-dev/202108/msg00161.html
> 
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> 
> ---
> changelog:
> v1->v2:
> - v1: https://lists.oasis-open.org/archives/virtio-dev/202110/msg00027.html
> - Addressed below comments from Cornelia Huck
> - fixed article 'the' addition before driver
> - removed maximum from reset timeout
> - fixed removed article 'the' from unwanted places
> - changed 'feature is set' to 'feature offered'
> - rewrote device reset timeout paragraph for driver handling
> v0->v1:
> - v0: https://lists.oasis-open.org/archives/virtio-dev/202109/msg00133.html
> - Addressed below comments from Cornelia Huck
> - renamed VIRTIO_F_DEV_RESET_TO to VIRTIO_F_DEV_RESET_TIMEOUT
> - removed references to device initialization sequence during device
>   reset flow
> - rewrote 'giving up on device'
> ---
>  content.tex | 43 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index 37c45d3..9b1399f 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -73,6 +73,12 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  recover by issuing a reset.
>  \end{note}
>  
> +Once the driver initiates a reset, the device may not be able to finish the reset
> +immediately. To accommodate that situation, the device can provide a hint to the
> +driver about how long it might take the device to complete the reset. The driver should
> +wait at least for the time specified by the device to let device finish the reset or if
> +the device status field to become 0 within the specified time interval.
> +
>  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>  
>  The device MUST NOT consume buffers or send any used buffer

First, do not use should outside normative sections. e.g. you can say
"is expected".

second, I don't see a story around compatibility here.
previously pci drivers did wait, other drivers didn't.

I think drivers that actually do wait should somehow
let the device know it's ready to wait.

Third how about making e.g. 0 a special value meaning
no limit wait forever?



> @@ -210,11 +216,23 @@ \section{Device Reset}\label{sec:Basic Facilities of a Virtio Device / Device Re
>  indicating completion of the reset by reinitializing \field{device status}
>  to 0, until the driver re-initializes the device.
>  
> +A device may not be able to complete reset action immediately when the driver initiates a reset operation.
> +Such a device should provide a hint as to how long a device may take to finish the reset operation.
> +This hint is provided by the device via a device reset timeout value in units of 10 milliseconds.
> +
>  \drivernormative{\subsection}{Device Reset}{Basic Facilities of a Virtio Device / Device Reset}
>  
>  The driver SHOULD consider a driver-initiated reset complete when it
>  reads \field{device status} as 0.
>  
> +When a device reports a non-zero device reset timeout value, the driver SHOULD wait for the device
> +to finish the reset action before considering the reset operation to have failed.
> +When a device reports a non-zero device reset timeout value, the driver SHOULD wait for the device
> +to finish the reset action. Once the specified timeout has expired, the driver should consider that reset
> +operation has failed. When device reset timeout is not provided by the device, the driver SHOULD choose a
> +reasonable timeout value for reset operation to complete.
> +
> +
>  \section{Device Configuration Space}\label{sec:Basic Facilities of a Virtio Device / Device Configuration Space}
>  
>  Device configuration space is generally used for rarely-changing or
> @@ -859,6 +877,8 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
>          le64 queue_driver;              /* read-write */
>          le64 queue_device;              /* read-write */
>          le16 queue_notify_data;         /* read-only for driver */
> +        /* About the whole device. */
> +        le16 device_reset_timeout;      /* read-only for driver */
>  };
>  \end{lstlisting}
>  
> @@ -938,6 +958,15 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
>          may benefit from providing another value, for example an internal virtqueue
>          identifier, or an internal offset related to the virtqueue number.
>          \end{note}
> +
> +\item[\field{device_reset_timeout}]
> +        This field exists only if VIRTIO_F_DEV_RESET_TIMEOUT is advertised by the device.
> +        This field provides a hint to the driver indicating how much maximum time a
> +        device can take to complete the reset once the driver initiates the device reset
> +        operation. This field unit is in 10 milliseconds. For example, a field value of
> +        3 indicates device reset timeout is 30 milliseconds. When VIRTIO_F_DEV_RESET_TIMEOUT
> +	feature is offered by the device this field must contain a non zero value.
> +
>  \end{description}
>  
>  \devicenormative{\paragraph}{Common configuration structure layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout}
> @@ -1804,6 +1833,15 @@ \subsection{MMIO Device Register Layout}\label{sec:Virtio Transport Options / Vi
>      accesses apply to the queue selected by writing to \field{QueueSel}.
>    }
>    \hline 
> +  \mmioreg{DeviceResetTimeout}{Device reset timeout}{0x04c}{R}{%
> +    This register exists only if VIRTIO_F_DEV_RESET_TIMEOUT is advertised by the device.
> +    It provides the hint to the driver indicating how much maximum time a device can take
> +    to complete the reset once the driver initiates the device reset operation.
> +    This field unit is in 10 milliseconds. For example, a field value of 3 indicates device
> +    reset timeout is 30 milliseconds. When VIRTIO_F_DEV_RESET_TIMEOUT feature is offered by the device
> +    this field must contain a non zero value.
> +  }
> +  \hline
>    \mmioreg{QueueNotify}{Queue notifier}{0x050}{W}{%
>      Writing a value to this register notifies the device that
>      there are new buffers to process in a queue.
> @@ -6673,6 +6711,11 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>    transport specific.
>    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
>  
> +  \item[VIRTIO_F_DEV_RESET_TIMEOUT(40)] This feature indicates that the device
> +  advertises a reset timeout which the driver should use during device reset stage.
> +  For more details about device reset timeout over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout}.
> +  For more details about device reset timeout over MMIO see \ref{sec:Virtio Transport Options / Virtio Over MMIO / MMIO Device Register Layout}.
> +
>  \end{description}
>  
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> -- 
> 2.31.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-06 15:22 ` Michael S. Tsirkin
@ 2021-10-06 16:11   ` Parav Pandit
  2021-10-06 20:53     ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-06 16:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: cohuck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 6, 2021 8:52 PM
> 
> On Wed, Oct 06, 2021 at 05:10:33PM +0300, Parav Pandit wrote:
> > Motivation:
> > ==========
> > Currently when driver initiates a device reset operation, a device may
> > not be able finish the reset operation immediately. In such case
> > driver waits for an arbitrary amount of time in a hope that device
> > will finish the reset.
> 
> Hmm does it? Where does spec say that it does? Does not linux really wait
> forever?
This proposal doesn't claim that spec says so.
Linux pci transport waits infinitely.
Infinite is a valid arbitrary amount. Isn't it ?:-)

> 
> > Depending on the device type and its backend implementation a device
> > timeout can be different. Trying to wait for all device type in same
> > value may not be efficient or adequate.
> 
> Right but do we really want a timeout at all?
Yes, so that instead of waiting infinitely, it can be time bounded. More below.
> Why not wait until device is ready or until ctrl-c?
Device may not be ready at all if its hotunplug is already initiated.
A device may not come back at all from reset.
Hence, device providing an upper bound is good hint to driver.

[..]

> >
> > +Once the driver initiates a reset, the device may not be able to
> > +finish the reset immediately. To accommodate that situation, the
> > +device can provide a hint to the driver about how long it might take
> > +the device to complete the reset. The driver should wait at least for
> > +the time specified by the device to let device finish the reset or if the device
> status field to become 0 within the specified time interval.
> > +
> >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities
> > of a Virtio Device / Device Status Field}
> >
> >  The device MUST NOT consume buffers or send any used buffer
> 
> First, do not use should outside normative sections. e.g. you can say "is
> expected".
> 
should is used in same section which has used _must_ in it.
Can you please help me understand what makes a section normative?

> second, I don't see a story around compatibility here.
Device reset timeout is supported only when device offers the new VIRTIO_DEV_RESET_TIMEOUT.
When such feature is not offered, driver follows the guidance provided in this proposal.
Compatibility is covered with the feature bit.

> previously pci drivers did wait, other drivers didn't.
>
The newer driver that implements updated spec, will adhere to newer content of the spec.
 
> I think drivers that actually do wait should somehow let the device know it's
> ready to wait.
>
Why should driver tell device that it is waiting?
Driver initiated a reset, device may take its time for reset and eventually comes out.
What can device do differently by knowing that driver is waiting for reset to finish?
Device usually follow its standard steps to complete the reset action.
I do not anticipate that device will do reset differently if it knows that no one is waiting for reset to finish.

> Third how about making e.g. 0 a special value meaning no limit wait forever?
The whole idea is to have some finite/deterministic behavior, and hence the reasonable finite value.
When value is not specified, driver is expected to choose reasonable value terminate the reset operation.
Do you propose 0 as no limit to follow current driver behavior?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] Add device reset timeout field
  2021-10-06 16:11   ` Parav Pandit
@ 2021-10-06 20:53     ` Michael S. Tsirkin
  2021-10-07  3:42       ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-06 20:53 UTC (permalink / raw)
  To: Parav Pandit; +Cc: cohuck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Wed, Oct 06, 2021 at 04:11:19PM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Wednesday, October 6, 2021 8:52 PM
> > 
> > On Wed, Oct 06, 2021 at 05:10:33PM +0300, Parav Pandit wrote:
> > > Motivation:
> > > ==========
> > > Currently when driver initiates a device reset operation, a device may
> > > not be able finish the reset operation immediately. In such case
> > > driver waits for an arbitrary amount of time in a hope that device
> > > will finish the reset.
> > 
> > Hmm does it? Where does spec say that it does? Does not linux really wait
> > forever?
> This proposal doesn't claim that spec says so.
> Linux pci transport waits infinitely.
> Infinite is a valid arbitrary amount. Isn't it ?:-)
> 
> > 
> > > Depending on the device type and its backend implementation a device
> > > timeout can be different. Trying to wait for all device type in same
> > > value may not be efficient or adequate.
> > 
> > Right but do we really want a timeout at all?
> Yes, so that instead of waiting infinitely, it can be time bounded. More below.
> > Why not wait until device is ready or until ctrl-c?
> Device may not be ready at all if its hotunplug is already initiated.
> A device may not come back at all from reset.
> Hence, device providing an upper bound is good hint to driver.
> 
> [..]
> 
> > >
> > > +Once the driver initiates a reset, the device may not be able to
> > > +finish the reset immediately. To accommodate that situation, the
> > > +device can provide a hint to the driver about how long it might take
> > > +the device to complete the reset. The driver should wait at least for
> > > +the time specified by the device to let device finish the reset or if the device
> > status field to become 0 within the specified time interval.
> > > +
> > >  \devicenormative{\subsection}{Device Status Field}{Basic Facilities
> > > of a Virtio Device / Device Status Field}
> > >
> > >  The device MUST NOT consume buffers or send any used buffer
> > 
> > First, do not use should outside normative sections. e.g. you can say "is
> > expected".
> > 
> should is used in same section which has used _must_ in it.

yea, some of these things slip through.

> Can you please help me understand what makes a section normative?

It starts with 
\devicenormative
or
\drivernormative

and is linked from the conformance section.



> 
> > second, I don't see a story around compatibility here.
> Device reset timeout is supported only when device offers the new VIRTIO_DEV_RESET_TIMEOUT.
> When such feature is not offered, driver follows the guidance provided in this proposal.
> Compatibility is covered with the feature bit.

Well hmm. Problem is, devices need to be reset before feature bits
are read.

> > previously pci drivers did wait, other drivers didn't.
> >
> The newer driver that implements updated spec, will adhere to newer content of the spec.

right but we can't just unconditionally say driver must wait,
this would declare old driver non-compliant and we promised
not to do that.

> > I think drivers that actually do wait should somehow let the device know it's
> > ready to wait.
> >
> Why should driver tell device that it is waiting?
> Driver initiated a reset, device may take its time for reset and eventually comes out.
> What can device do differently by knowing that driver is waiting for reset to finish?
> Device usually follow its standard steps to complete the reset action.
> I do not anticipate that device will do reset differently if it knows that no one is waiting for reset to finish.

I think it might matter for hypervisors, these can thinkably block the
write until reset completes if driver is not going to wait.  For the
hardware - if driver is not waiting then hardware has to block the
response to the following non posted transaction whatever that is.
Doing that blocks the cpu and generally breaks the whole
timeout idea. Which you seem to want for some reason ...


> 
> > Third how about making e.g. 0 a special value meaning no limit wait forever?
> The whole idea is to have some finite/deterministic behavior,

I guess I'm being dense, I just don't yet understand the motivation.
The cost is difficulty in migration since each and every
piece of hardware will have a different timeout.

The idea is to have finite behaviour, I got that, but
I don't really know where does the requirement come from,
in what sense it's deterministic and why do we worry
about reset specifically.

Recovering from failure when hardware is broken?

Can we even trust timeout values hardware gives us then?

What about a ton of other stuff like command vq commands
timing out?

> and hence the reasonable finite value.
> When value is not specified, driver is expected to choose reasonable value terminate the reset operation.

No one has any idea what's reasonable.
There's no reason for a driver to choose any
value - it has nothing else to do.

> Do you propose 0 as no limit to follow current driver behavior?

Or maybe 0 will make driver poll.

-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-06 20:53     ` Michael S. Tsirkin
@ 2021-10-07  3:42       ` Parav Pandit
  2021-10-07 16:10         ` [virtio-dev] " Cornelia Huck
  2021-10-08 10:44         ` Michael S. Tsirkin
  0 siblings, 2 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-07  3:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: cohuck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 7, 2021 2:23 AM
> On Wed, Oct 06, 2021 at 04:11:19PM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Wednesday, October 6, 2021 8:52 PM
> > >
> > > On Wed, Oct 06, 2021 at 05:10:33PM +0300, Parav Pandit wrote:
> > > > Motivation:
> > > > ==========
> > > > Currently when driver initiates a device reset operation, a device
> > > > may not be able finish the reset operation immediately. In such
> > > > case driver waits for an arbitrary amount of time in a hope that
> > > > device will finish the reset.
> > >
> > > Hmm does it? Where does spec say that it does? Does not linux really
> > > wait forever?
> > This proposal doesn't claim that spec says so.
> > Linux pci transport waits infinitely.
> > Infinite is a valid arbitrary amount. Isn't it ?:-)

[..]
> > > First, do not use should outside normative sections. e.g. you can
> > > say "is expected".
> > >
So this proposal used the SHOULD in the section 2.1.1 which is the normative section as per the driver conformance clause 1.
Or I misunderstood?

> > should is used in same section which has used _must_ in it.
> 
> yea, some of these things slip through.
> 
> > Can you please help me understand what makes a section normative?
> 
> It starts with
> \devicenormative
> or
> \drivernormative
> 
> and is linked from the conformance section.
>
Ok. I understand better now.

> >
> > > second, I don't see a story around compatibility here.
> > Device reset timeout is supported only when device offers the new
> VIRTIO_DEV_RESET_TIMEOUT.
> > When such feature is not offered, driver follows the guidance provided in
> this proposal.
> > Compatibility is covered with the feature bit.
> 
> Well hmm. Problem is, devices need to be reset before feature bits are read.
>
Yeah, this means that we should remove the feature bit and instead follow your suggestion that if the register value is zero, the feature is not offered,
If it is non zero, driver should follow the timeout.
 
> > > previously pci drivers did wait, other drivers didn't.
> > >
> > The newer driver that implements updated spec, will adhere to newer
> content of the spec.
> 
> right but we can't just unconditionally say driver must wait, this would declare
> old driver non-compliant and we promised not to do that.
> 
It is not said unconditionally. It comes with spec version, isn't it?
This change should increment the spec version so old driver are compliant to some older version of the spec.
Hence, there is no breakage with some old driver.

> > > I think drivers that actually do wait should somehow let the device
> > > know it's ready to wait.
> > >
> > Why should driver tell device that it is waiting?
> > Driver initiated a reset, device may take its time for reset and eventually
> comes out.
> > What can device do differently by knowing that driver is waiting for reset to
> finish?
> > Device usually follow its standard steps to complete the reset action.
> > I do not anticipate that device will do reset differently if it knows that no one
> is waiting for reset to finish.
> 
> I think it might matter for hypervisors, these can thinkably block the write until
> reset completes if driver is not going to wait.  For the hardware - if driver is not
> waiting then hardware has to block the response to the following non posted
> transaction whatever that is.
Hardware will continue to follow reset semantics regardless of advertising the timeout. (and hence 

> Doing that blocks the cpu and generally breaks the whole timeout idea. Which
> you seem to want for some reason ...
No. advertising the reset timeout value doesn't change the reset flow of device or driver.
So it doesn't break anything.
It only means that linux driver code doing vp_reset() infinitely, will instead do it for the timeout given by device.

> 
> 
> >
> > > Third how about making e.g. 0 a special value meaning no limit wait
> forever?
> > The whole idea is to have some finite/deterministic behavior,
> 
> I guess I'm being dense, I just don't yet understand the motivation.
> The cost is difficulty in migration since each and every piece of hardware will
> have a different timeout.
>
Such timeout is already there. Advertising timeout doesn't change the LM flow.
A migrating device at the destination is anyway out of the reset when migration occurs.
 
> The idea is to have finite behaviour, I got that, but I don't really know where
> does the requirement come from, in what sense it's deterministic and why do
> we worry about reset specifically.
> 
> Recovering from failure when hardware is broken?
>
I provided few examples in previous email.
Terminating a probe sequence of the device which is already on its path of hotunplug.
 
> Can we even trust timeout values hardware gives us then?
> 
Sure, why not. Device usually going to provide the reset timeout value based on its backend implementation.
If its completely broken device (driver can always do min() of timeout value given by device and its own).

> What about a ton of other stuff like command vq commands timing out?
>
VQ commands timeout handling is indeed needed but it is a separate enhancement in itself.
This one is self-contained enough that doesn't need to combined with command timeout.
 
> > and hence the reasonable finite value.
> > When value is not specified, driver is expected to choose reasonable value
> terminate the reset operation.
> 
> No one has any idea what's reasonable.
I am sure a finite timeout in driver is reasonable than current infinite one.
This is because it is making system more resilient to occasional device errors.

> There's no reason for a driver to choose any value - it has nothing else to do.
>
At least for the devices that we are seeing, driver choosing a default value of 2 to 3 minutes is good enough and of course useful.

> > Do you propose 0 as no limit to follow current driver behavior?
> 
> Or maybe 0 will make driver poll.
Poll infinitely? Didn't follow the suggestion of 0.

> 
> --
> MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-07  3:42       ` Parav Pandit
@ 2021-10-07 16:10         ` Cornelia Huck
  2021-10-07 17:58           ` Parav Pandit
  2021-10-08 10:44         ` Michael S. Tsirkin
  1 sibling, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2021-10-07 16:10 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Thu, Oct 07 2021, Parav Pandit <parav@nvidia.com> wrote:

>> From: Michael S. Tsirkin <mst@redhat.com>
>> Sent: Thursday, October 7, 2021 2:23 AM
>> On Wed, Oct 06, 2021 at 04:11:19PM +0000, Parav Pandit wrote:
>> >
>> > > From: Michael S. Tsirkin <mst@redhat.com>
>> > > Sent: Wednesday, October 6, 2021 8:52 PM

>> > > second, I don't see a story around compatibility here.
>> > Device reset timeout is supported only when device offers the new
>> VIRTIO_DEV_RESET_TIMEOUT.
>> > When such feature is not offered, driver follows the guidance provided in
>> this proposal.
>> > Compatibility is covered with the feature bit.
>> 
>> Well hmm. Problem is, devices need to be reset before feature bits are read.
>>
> Yeah, this means that we should remove the feature bit and instead follow your suggestion that if the register value is zero, the feature is not offered,
> If it is non zero, driver should follow the timeout.

What about old devices -- they don't provide the timeout value at all,
obviously; how can they stay compliant?

>  
>> > > previously pci drivers did wait, other drivers didn't.
>> > >
>> > The newer driver that implements updated spec, will adhere to newer
>> content of the spec.
>> 
>> right but we can't just unconditionally say driver must wait, this would declare
>> old driver non-compliant and we promised not to do that.
>> 
> It is not said unconditionally. It comes with spec version, isn't it?
> This change should increment the spec version so old driver are compliant to some older version of the spec.
> Hence, there is no breakage with some old driver.

We do not really say that a driver or a device is conforming to a
certain version of the spec, though. If it has been written to an older
version of the spec, I still expect it to be compliant, it's just not
implementing some newer things.

(...)

>> > > Third how about making e.g. 0 a special value meaning no limit wait
>> forever?
>> > The whole idea is to have some finite/deterministic behavior,
>> 
>> I guess I'm being dense, I just don't yet understand the motivation.
>> The cost is difficulty in migration since each and every piece of hardware will
>> have a different timeout.
>>
> Such timeout is already there. Advertising timeout doesn't change the LM flow.
> A migrating device at the destination is anyway out of the reset when migration occurs.

But it introduces values that may be different between different devices
of the same type, no? I guess the destination would need to re-read the
value to get the current one. Not an insurmountable problem, but still
needs some care.

>  
>> The idea is to have finite behaviour, I got that, but I don't really know where
>> does the requirement come from, in what sense it's deterministic and why do
>> we worry about reset specifically.
>> 
>> Recovering from failure when hardware is broken?
>>
> I provided few examples in previous email.
> Terminating a probe sequence of the device which is already on its path of hotunplug.

For that sequence, don't we need a timeout for *any* interaction with
the device? I thought this proposal was mainly about devices that take a
long time to "warm up"...

>  
>> Can we even trust timeout values hardware gives us then?
>> 
> Sure, why not. Device usually going to provide the reset timeout value based on its backend implementation.
> If its completely broken device (driver can always do min() of timeout value given by device and its own).
>
>> What about a ton of other stuff like command vq commands timing out?
>>
> VQ commands timeout handling is indeed needed but it is a separate enhancement in itself.
> This one is self-contained enough that doesn't need to combined with command timeout.

What about e.g. config writes; I think they might trigger some actions?

>  
>> > and hence the reasonable finite value.
>> > When value is not specified, driver is expected to choose reasonable value
>> terminate the reset operation.
>> 
>> No one has any idea what's reasonable.
> I am sure a finite timeout in driver is reasonable than current infinite one.
> This is because it is making system more resilient to occasional device errors.
>
>> There's no reason for a driver to choose any value - it has nothing else to do.
>>
> At least for the devices that we are seeing, driver choosing a default value of 2 to 3 minutes is good enough and of course useful.
>
>> > Do you propose 0 as no limit to follow current driver behavior?
>> 
>> Or maybe 0 will make driver poll.
> Poll infinitely? Didn't follow the suggestion of 0.

Isn't polling vs waiting something that's inherently an implementation
choice for the driver? It's not like it can do something different if
the operation just does not finish.


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-07 16:10         ` [virtio-dev] " Cornelia Huck
@ 2021-10-07 17:58           ` Parav Pandit
  2021-10-08 10:00             ` [virtio-dev] " Cornelia Huck
  2021-10-08 10:12             ` Michael S. Tsirkin
  0 siblings, 2 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-07 17:58 UTC (permalink / raw)
  To: Cornelia Huck, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer


> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Thursday, October 7, 2021 9:41 PM
> On Thu, Oct 07 2021, Parav Pandit <parav@nvidia.com> wrote:
> 
> >> From: Michael S. Tsirkin <mst@redhat.com>
> >> Sent: Thursday, October 7, 2021 2:23 AM On Wed, Oct 06, 2021 at
> >> 04:11:19PM +0000, Parav Pandit wrote:
> >> >
> >> > > From: Michael S. Tsirkin <mst@redhat.com>
> >> > > Sent: Wednesday, October 6, 2021 8:52 PM
> 
> >> > > second, I don't see a story around compatibility here.
> >> > Device reset timeout is supported only when device offers the new
> >> VIRTIO_DEV_RESET_TIMEOUT.
> >> > When such feature is not offered, driver follows the guidance
> >> > provided in
> >> this proposal.
> >> > Compatibility is covered with the feature bit.
> >>
> >> Well hmm. Problem is, devices need to be reset before feature bits are read.
> >>
> > Yeah, this means that we should remove the feature bit and instead
> > follow your suggestion that if the register value is zero, the feature is not
> offered, If it is non zero, driver should follow the timeout.
> 
> What about old devices -- they don't provide the timeout value at all,
> obviously; how can they stay compliant?
>
If the old device which doesn't provide a timeout, the newer version of the spec suggests that driver chooses a reasonable value.
 
> >
> >> > > previously pci drivers did wait, other drivers didn't.
> >> > >
> >> > The newer driver that implements updated spec, will adhere to newer
> >> content of the spec.
> >>
> >> right but we can't just unconditionally say driver must wait, this
> >> would declare old driver non-compliant and we promised not to do that.
> >>
> > It is not said unconditionally. It comes with spec version, isn't it?
> > This change should increment the spec version so old driver are compliant to
> some older version of the spec.
> > Hence, there is no breakage with some old driver.
> 
> We do not really say that a driver or a device is conforming to a certain version
> of the spec, though. If it has been written to an older version of the spec, I still
> expect it to be compliant, it's just not implementing some newer things.
>
The only way a older driver following older spec can be compliant to newer spec, if the new spec remains same as old spec. :-)
However, the spec is suggesting to have finite timeout as "SHOULD".
Timeout is not part of the conformance _must_ clause.
So if old driver not following "SHOULD" suggestion of newer spec, does it make old driver non-compliant?
Your input on this will help me understand the conformance part.

> (...)
> 
> >> > > Third how about making e.g. 0 a special value meaning no limit
> >> > > wait
> >> forever?
> >> > The whole idea is to have some finite/deterministic behavior,
> >>
> >> I guess I'm being dense, I just don't yet understand the motivation.
> >> The cost is difficulty in migration since each and every piece of
> >> hardware will have a different timeout.
> >>
> > Such timeout is already there. Advertising timeout doesn't change the LM
> flow.
> > A migrating device at the destination is anyway out of the reset when
> migration occurs.
> 
> But it introduces values that may be different between different devices of the
> same type, no? I guess the destination would need to re-read the value to get
> the current one. Not an insurmountable problem, but still needs some care.
>
When a device migrates to destination, it starts from where the device left off on the source side.
So yes, destination side, device must be usable (out of reset), and after that its current state will be overwritten by the migrating device.
If you ask, does migration overwrite the reset timeout register value? I would say no, because how long device would take to reset is decided by the destination side implementation.

And this is probably yet another good reason to define migratable bits of a virtio device in the live migration spec extension.

> >
> >> The idea is to have finite behaviour, I got that, but I don't really
> >> know where does the requirement come from, in what sense it's
> >> deterministic and why do we worry about reset specifically.
> >>
> >> Recovering from failure when hardware is broken?
> >>
> > I provided few examples in previous email.
> > Terminating a probe sequence of the device which is already on its path of
> hotunplug.
> 
> For that sequence, don't we need a timeout for *any* interaction with the
> device? 
We do need a command timeout that will handle hotunplug scenario while under probe.
But handling command timeout has wider use, even without non racing hotunplug and plug sequences.
So it is better to drive it as separate extension in itself, which is self-contained.

> I thought this proposal was mainly about devices that take a long time
> to "warm up"...
>
Yes, certainly this proposal covers such device scenario.
In general, this proposal covers reset timeout as the register suggests.

> > VQ commands timeout handling is indeed needed but it is a separate
> enhancement in itself.
> > This one is self-contained enough that doesn't need to combined with
> command timeout.
> 
> What about e.g. config writes; I think they might trigger some actions?
>
A driver can do config writes to the PCI virtio device without reset is complete.
This is very common practice that driver follows to discover device properties.
It may trigger some action on the device, but PCI device is obligated to respond to it for host driver to function properly.

> >
> >> > and hence the reasonable finite value.
> >> > When value is not specified, driver is expected to choose
> >> > reasonable value
> >> terminate the reset operation.
> >>
> >> No one has any idea what's reasonable.
> > I am sure a finite timeout in driver is reasonable than current infinite one.
> > This is because it is making system more resilient to occasional device errors.
> >
> >> There's no reason for a driver to choose any value - it has nothing else to
> do.
> >>
> > At least for the devices that we are seeing, driver choosing a default value of
> 2 to 3 minutes is good enough and of course useful.
> >
> >> > Do you propose 0 as no limit to follow current driver behavior?
> >>
> >> Or maybe 0 will make driver poll.
> > Poll infinitely? Didn't follow the suggestion of 0.
> 
> Isn't polling vs waiting something that's inherently an implementation choice
> for the driver? It's not like it can do something different if the operation just
> does not finish.

Exactly. I read it few times, but I wasn't sure what Michael was suggesting.

Timeout hint is to break infinite loop in a deterministic way. It is really that simple.

I will wait to see if you/Michael has more comments before sending v3 that removes the feature bit.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-07 17:58           ` Parav Pandit
@ 2021-10-08 10:00             ` Cornelia Huck
  2021-10-08 10:19               ` Parav Pandit
  2021-10-08 10:12             ` Michael S. Tsirkin
  1 sibling, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2021-10-08 10:00 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Thu, Oct 07 2021, Parav Pandit <parav@nvidia.com> wrote:

>> From: Cornelia Huck <cohuck@redhat.com>
>> Sent: Thursday, October 7, 2021 9:41 PM
>> On Thu, Oct 07 2021, Parav Pandit <parav@nvidia.com> wrote:
>> 
>> >> From: Michael S. Tsirkin <mst@redhat.com>
>> >> Sent: Thursday, October 7, 2021 2:23 AM On Wed, Oct 06, 2021 at
>> >> 04:11:19PM +0000, Parav Pandit wrote:
>> >> >
>> >> > > From: Michael S. Tsirkin <mst@redhat.com>
>> >> > > Sent: Wednesday, October 6, 2021 8:52 PM
>> 
>> >> > > second, I don't see a story around compatibility here.
>> >> > Device reset timeout is supported only when device offers the new
>> >> VIRTIO_DEV_RESET_TIMEOUT.
>> >> > When such feature is not offered, driver follows the guidance
>> >> > provided in
>> >> this proposal.
>> >> > Compatibility is covered with the feature bit.
>> >>
>> >> Well hmm. Problem is, devices need to be reset before feature bits are read.
>> >>
>> > Yeah, this means that we should remove the feature bit and instead
>> > follow your suggestion that if the register value is zero, the feature is not
>> offered, If it is non zero, driver should follow the timeout.
>> 
>> What about old devices -- they don't provide the timeout value at all,
>> obviously; how can they stay compliant?
>>
> If the old device which doesn't provide a timeout, the newer version of the spec suggests that driver chooses a reasonable value.

If we don't guard the new field via a feature bit, the driver will try
to access something that does not exist for older devices. If we say
that the field exists, older devices are simply not compliant.

Now, the interaction with the feature bit is obviously a problem... it
would mean that the driver can only find out what the device offers too
late in the initial startup sequence. We could specify that the field
may or may not exist, but that might lead to other problems down the
road.

>  
>> >
>> >> > > previously pci drivers did wait, other drivers didn't.
>> >> > >
>> >> > The newer driver that implements updated spec, will adhere to newer
>> >> content of the spec.
>> >>
>> >> right but we can't just unconditionally say driver must wait, this
>> >> would declare old driver non-compliant and we promised not to do that.
>> >>
>> > It is not said unconditionally. It comes with spec version, isn't it?
>> > This change should increment the spec version so old driver are compliant to
>> some older version of the spec.
>> > Hence, there is no breakage with some old driver.
>> 
>> We do not really say that a driver or a device is conforming to a certain version
>> of the spec, though. If it has been written to an older version of the spec, I still
>> expect it to be compliant, it's just not implementing some newer things.
>>
> The only way a older driver following older spec can be compliant to newer spec, if the new spec remains same as old spec. :-)

Eh, no. We just need to make new features optional. A driver written to
an older version of the spec that does not implement e.g. packed ring is
still compliant, since packed ring is guarded by a feature bit.

> However, the spec is suggesting to have finite timeout as "SHOULD".
> Timeout is not part of the conformance _must_ clause.
> So if old driver not following "SHOULD" suggestion of newer spec, does it make old driver non-compliant?
> Your input on this will help me understand the conformance part.

This depends upon how we introduce the new feature. If (as had been
suggested upthread) we ditch the feature bit, we specify that the driver
needs to do something (read the value) which an old driver simply won't
do. The only way that can work is if we make it optional for the driver
to read the timeout value. So yes, a driver not following a SHOULD
clause is still compliant. We need to be careful as to what we actually
require; normally, feature bits solve this neatly, but it seems
problematic here.

>
>> (...)
>> 
>> >> > > Third how about making e.g. 0 a special value meaning no limit
>> >> > > wait
>> >> forever?
>> >> > The whole idea is to have some finite/deterministic behavior,
>> >>
>> >> I guess I'm being dense, I just don't yet understand the motivation.
>> >> The cost is difficulty in migration since each and every piece of
>> >> hardware will have a different timeout.
>> >>
>> > Such timeout is already there. Advertising timeout doesn't change the LM
>> flow.
>> > A migrating device at the destination is anyway out of the reset when
>> migration occurs.
>> 
>> But it introduces values that may be different between different devices of the
>> same type, no? I guess the destination would need to re-read the value to get
>> the current one. Not an insurmountable problem, but still needs some care.
>>
> When a device migrates to destination, it starts from where the device left off on the source side.
> So yes, destination side, device must be usable (out of reset), and after that its current state will be overwritten by the migrating device.
> If you ask, does migration overwrite the reset timeout register value? I would say no, because how long device would take to reset is decided by the destination side implementation.

Agreed on that, but the driver would need to know that it has to re-read
the timeout value. IOW, migration would either not be fully transparent,
or the destination device would need to have a timeout value that would
not conflict with the timeout value on the source device (i.e. if it is
significantly different).

>
> And this is probably yet another good reason to define migratable bits of a virtio device in the live migration spec extension.

Well yes, live migration is certainly a problem we need to tackle.

>
>> >
>> >> The idea is to have finite behaviour, I got that, but I don't really
>> >> know where does the requirement come from, in what sense it's
>> >> deterministic and why do we worry about reset specifically.
>> >>
>> >> Recovering from failure when hardware is broken?
>> >>
>> > I provided few examples in previous email.
>> > Terminating a probe sequence of the device which is already on its path of
>> hotunplug.
>> 
>> For that sequence, don't we need a timeout for *any* interaction with the
>> device? 
> We do need a command timeout that will handle hotunplug scenario while under probe.
> But handling command timeout has wider use, even without non racing hotunplug and plug sequences.
> So it is better to drive it as separate extension in itself, which is self-contained.

So, why is reset special in that case? As...

>
>> I thought this proposal was mainly about devices that take a long time
>> to "warm up"...
>>
> Yes, certainly this proposal covers such device scenario.
> In general, this proposal covers reset timeout as the register suggests.

...I do not really understand this, then. We want reset timeouts,
because they cover a certain scenario, but we don't want something more
general?

>
>> > VQ commands timeout handling is indeed needed but it is a separate
>> enhancement in itself.
>> > This one is self-contained enough that doesn't need to combined with
>> command timeout.
>> 
>> What about e.g. config writes; I think they might trigger some actions?
>>
> A driver can do config writes to the PCI virtio device without reset is complete.
> This is very common practice that driver follows to discover device properties.
> It may trigger some action on the device, but PCI device is obligated to respond to it for host driver to function properly.

But isn't that in violation of the spec? I understand it needs to do
reset -> (wait until complete) -> set ACKNOWLEDGE status -> set DRIVER
status -> potentially *read* the config space -> set FEATURES_OK
and only then *write* to the config space?

As an aside, not all transports even allow accessing the config space
while reset is still in progress. I would naively have expected that to
be the norm.


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] Add device reset timeout field
  2021-10-07 17:58           ` Parav Pandit
  2021-10-08 10:00             ` [virtio-dev] " Cornelia Huck
@ 2021-10-08 10:12             ` Michael S. Tsirkin
  2021-10-08 10:51               ` Parav Pandit
  1 sibling, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-08 10:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Thu, Oct 07, 2021 at 05:58:09PM +0000, Parav Pandit wrote:
> > >> > > Third how about making e.g. 0 a special value meaning no limit
> > >> > > wait
> > >> forever?
> > >> > The whole idea is to have some finite/deterministic behavior,
> > >>
> > >> I guess I'm being dense, I just don't yet understand the motivation.
> > >> The cost is difficulty in migration since each and every piece of
> > >> hardware will have a different timeout.
> > >>
> > > Such timeout is already there. Advertising timeout doesn't change the LM
> > > flow.
> > > A migrating device at the destination is anyway out of the reset when
> > > migration occurs.
> > 
> > But it introduces values that may be different between different devices of the
> > same type, no? I guess the destination would need to re-read the value to get
> > the current one. Not an insurmountable problem, but still needs some care.
> >
> When a device migrates to destination, it starts from where the device
> left off on the source side.  So yes, destination side, device must be
> usable (out of reset), and after that its current state will be
> overwritten by the migrating device.

I get what you are trying to say here but it's a hack. Nothing
prevents a reset for driver's internal reasons at any point,
and in particular reset is used e.g. for driver removal.

>  If you ask, does migration
> overwrite the reset timeout register value? I would say no, because
> how long device would take to reset is decided by the destination side
> implementation.

Problem is, driver can cache the value on source. Then it's migrated
and used on destination when driver wants to reset the device.
This can lead to a timeout if the destination does not
finish within the source timeout value.

That's why I ask: why do we bother? What's wrong with just waiting
forever or until user gets tired of this and cancels with CTRL-C?
Is there a use-case where that's not good enough?


> And this is probably yet another good reason to define migratable bits
> of a virtio device in the live migration spec extension.

"migratable bits" being what? non-guest visible device state? Sure, would
be great to have.  Don't think it will help in this instance.

-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-08 10:00             ` [virtio-dev] " Cornelia Huck
@ 2021-10-08 10:19               ` Parav Pandit
  0 siblings, 0 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-08 10:19 UTC (permalink / raw)
  To: Cornelia Huck, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer


> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Friday, October 8, 2021 3:31 PM
> 
> On Thu, Oct 07 2021, Parav Pandit <parav@nvidia.com> wrote:
> 
> >> From: Cornelia Huck <cohuck@redhat.com>
> >> Sent: Thursday, October 7, 2021 9:41 PM On Thu, Oct 07 2021, Parav
> >> Pandit <parav@nvidia.com> wrote:
> >>
> >> >> From: Michael S. Tsirkin <mst@redhat.com>
> >> >> Sent: Thursday, October 7, 2021 2:23 AM On Wed, Oct 06, 2021 at
> >> >> 04:11:19PM +0000, Parav Pandit wrote:
> >> >> >
> >> >> > > From: Michael S. Tsirkin <mst@redhat.com>
> >> >> > > Sent: Wednesday, October 6, 2021 8:52 PM
> >>
> >> >> > > second, I don't see a story around compatibility here.
> >> >> > Device reset timeout is supported only when device offers the
> >> >> > new
> >> >> VIRTIO_DEV_RESET_TIMEOUT.
> >> >> > When such feature is not offered, driver follows the guidance
> >> >> > provided in
> >> >> this proposal.
> >> >> > Compatibility is covered with the feature bit.
> >> >>
> >> >> Well hmm. Problem is, devices need to be reset before feature bits are
> read.
> >> >>
> >> > Yeah, this means that we should remove the feature bit and instead
> >> > follow your suggestion that if the register value is zero, the
> >> > feature is not
> >> offered, If it is non zero, driver should follow the timeout.
> >>
> >> What about old devices -- they don't provide the timeout value at
> >> all, obviously; how can they stay compliant?
> >>
> > If the old device which doesn't provide a timeout, the newer version of the
> spec suggests that driver chooses a reasonable value.
> 
> If we don't guard the new field via a feature bit, the driver will try to access
> something that does not exist for older devices. If we say that the field exists,
> older devices are simply not compliant.
Lets take base line that for reset timeout we cannot add the feature bit because device may not be ready when reset is started.
With that baseline, 
If the field exists, driver should honor it.
If the field doesn't exists, it is old device. So driver chooses a reasonable value of timeout.

> 
> Now, the interaction with the feature bit is obviously a problem... it would
> mean that the driver can only find out what the device offers too late in the
> initial startup sequence. We could specify that the field may or may not exist,
> but that might lead to other problems down the road.
> 
Yes, newer spec will have the field. Value can be zero or non-zero.
Zero = device doesn't support reset timeout
Non_zero = device suggested the timeout

> >
> >> >
> >> >> > > previously pci drivers did wait, other drivers didn't.
> >> >> > >
> >> >> > The newer driver that implements updated spec, will adhere to
> >> >> > newer
> >> >> content of the spec.
> >> >>
> >> >> right but we can't just unconditionally say driver must wait, this
> >> >> would declare old driver non-compliant and we promised not to do that.
> >> >>
> >> > It is not said unconditionally. It comes with spec version, isn't it?
> >> > This change should increment the spec version so old driver are
> >> > compliant to
> >> some older version of the spec.
> >> > Hence, there is no breakage with some old driver.
> >>
> >> We do not really say that a driver or a device is conforming to a
> >> certain version of the spec, though. If it has been written to an
> >> older version of the spec, I still expect it to be compliant, it's just not
> implementing some newer things.
> >>
> > The only way a older driver following older spec can be compliant to
> > newer spec, if the new spec remains same as old spec. :-)
> 
> Eh, no. We just need to make new features optional. A driver written to an
> older version of the spec that does not implement e.g. packed ring is still
> compliant, since packed ring is guarded by a feature bit.
Ok. older driver will just simply ignore the existence of device reset timeout field.
Newer driver will honor the value if the field is non zero.
If its zero, spec suggested newer driver to choose its reasonable value.
Since it is not a must/conformance statement, older driver continue to ignore the field that it doesn't know about.

> 
> > However, the spec is suggesting to have finite timeout as "SHOULD".
> > Timeout is not part of the conformance _must_ clause.
> > So if old driver not following "SHOULD" suggestion of newer spec, does it
> make old driver non-compliant?
> > Your input on this will help me understand the conformance part.
> 
> This depends upon how we introduce the new feature. If (as had been
> suggested upthread) we ditch the feature bit, we specify that the driver needs
> to do something (read the value) which an old driver simply won't do. The only
> way that can work is if we make it optional for the driver to read the timeout
> value. So yes, a driver not following a SHOULD clause is still compliant. We
> need to be careful as to what we actually require; normally, feature bits solve
> this neatly, but it seems problematic here.
> 
True, feature bit solves this neatly as you suggested but unfortunately feature bit is not useful here.
So lets proceed without feature bit as optional for driver.


> >> >> The idea is to have finite behaviour, I got that, but I don't
> >> >> really know where does the requirement come from, in what sense
> >> >> it's deterministic and why do we worry about reset specifically.
> >> >>
> >> >> Recovering from failure when hardware is broken?
> >> >>
> >> > I provided few examples in previous email.
> >> > Terminating a probe sequence of the device which is already on its
> >> > path of
> >> hotunplug.
> >>
> >> For that sequence, don't we need a timeout for *any* interaction with
> >> the device?
> > We do need a command timeout that will handle hotunplug scenario while
> under probe.
> > But handling command timeout has wider use, even without non racing
> hotunplug and plug sequences.
> > So it is better to drive it as separate extension in itself, which is self-
> contained.
> 
> So, why is reset special in that case? As...
> 
Normally composing a device takes a longer or more efforts for the backend compare to serving a command.
So device reset timeout is likely higher than a command timeout value.
So overloading or defining single field for command timeout and reset timeout looks incorrect to me.

Hence, command timeout should be a different feature bit. Given that it is own feature bit, I propose to keep such proposal self-contained and separate from this one.

> >
> >> I thought this proposal was mainly about devices that take a long
> >> time to "warm up"...
> >>
> > Yes, certainly this proposal covers such device scenario.
> > In general, this proposal covers reset timeout as the register suggests.
> 
> ...I do not really understand this, then. We want reset timeouts, because they
> cover a certain scenario, but we don't want something more general?

A command timeout is a new feature bit and a different field than reset timeout.
Lets not combine both into single proposal and single field.

> 
> >
> >> > VQ commands timeout handling is indeed needed but it is a separate
> >> enhancement in itself.
> >> > This one is self-contained enough that doesn't need to combined
> >> > with
> >> command timeout.
> >>
> >> What about e.g. config writes; I think they might trigger some actions?
> >>
> > A driver can do config writes to the PCI virtio device without reset is
> complete.
> > This is very common practice that driver follows to discover device
> properties.
> > It may trigger some action on the device, but PCI device is obligated to
> respond to it for host driver to function properly.
> 
> But isn't that in violation of the spec? I understand it needs to do reset -> (wait
> until complete) -> set ACKNOWLEDGE status -> set DRIVER status -> potentially
> *read* the config space -> set FEATURES_OK and only then *write* to the
> config space?
> 
I think I misunderstood your point about "config space".
Since I am mostly familiar with pci devices, I took "config space" as "pci config space", which must be accessible to the driver before device virtio level registers are accessed.

You mentioned virtio level config space, which as you say, has to be accessed after reset is completed.

> As an aside, not all transports even allow accessing the config space while reset
> is still in progress. I would naively have expected that to be the norm.
True. My bad on interpreting virtio config space = pci config space.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] Add device reset timeout field
  2021-10-07  3:42       ` Parav Pandit
  2021-10-07 16:10         ` [virtio-dev] " Cornelia Huck
@ 2021-10-08 10:44         ` Michael S. Tsirkin
  2021-10-08 10:59           ` Parav Pandit
  1 sibling, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-08 10:44 UTC (permalink / raw)
  To: Parav Pandit; +Cc: cohuck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Thu, Oct 07, 2021 at 03:42:24AM +0000, Parav Pandit wrote:
> > There's no reason for a driver to choose any value - it has nothing else to do.
> >
> At least for the devices that we are seeing, driver choosing a default value of 2 to 3 minutes is good enough and of course useful.

Well I don't get this of course yet. I guess it's useful for some
workloads, but VM being blocked for 2 minutes breaks lots of workloads.
Guests with 30 devices do exist, we are talking about an hour to start
these guest.

I wonder whether we should instead just provide some guidance on how long
is it ok for reset to take? Say if we want to address kata, <100ms boot
time, I guess we can recomment not more than 10ms? Anything
more would trigger at least a warning.

-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-08 10:12             ` Michael S. Tsirkin
@ 2021-10-08 10:51               ` Parav Pandit
  2021-10-08 11:18                 ` [virtio-dev] " Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-08 10:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, October 8, 2021 3:42 PM

> > When a device migrates to destination, it starts from where the device
> > left off on the source side.  So yes, destination side, device must be
> > usable (out of reset), and after that its current state will be
> > overwritten by the migrating device.
> 
> I get what you are trying to say here but it's a hack. 
What is a hack? I didn't follow.

> Nothing prevents a reset
> for driver's internal reasons at any point, and in particular reset is used e.g. for
> driver removal.
Sure nothing prevents a reset on destination. When device state restoration is going on, if the device reset, device state restoration will simply not go through.

> >  If you ask, does migration
> > overwrite the reset timeout register value? I would say no, because
> > how long device would take to reset is decided by the destination side
> > implementation.
> 
> Problem is, driver can cache the value on source. Then it's migrated and used
> on destination when driver wants to reset the device.
> This can lead to a timeout if the destination does not finish within the source
> timeout value.
>
Usually driver has interest in caching fields that it uses in data path. Device reset timeout is not close to it.
But fair enough, driver can cache the value whichever it chooses to.
Like other fields of the device migratable device, a backend needs to have same device or lower device reset timeout.
 
> That's why I ask: why do we bother? What's wrong with just waiting forever or
> until user gets tired of this and cancels with CTRL-C?

Today, device removal of the device gets stuck for the device which didn't finish the reset, because its waiting for ever.

modprobe to my knowledge cannot be Ctrl-C. In another scenario, device probing of hot plug device occurs by hotplug driver in a workqueue context.

It doesn't sound right to pass the burden to the user to invent some kind of ctrl-C cancel operation in hotplug drivers.

> Is there a use-case where that's not good enough?
A guest has got one good and another device that encountered a fault.
Due to this faulty device which is unable to reset, is blocking other operations in the system.

> 
> 
> > And this is probably yet another good reason to define migratable bits
> > of a virtio device in the live migration spec extension.
> 
> "migratable bits" being what? non-guest visible device state? Sure, would be
> great to have.  Don't think it will help in this instance.
> 
> --
> MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-08 10:44         ` Michael S. Tsirkin
@ 2021-10-08 10:59           ` Parav Pandit
  2021-10-08 11:21             ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-08 10:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: cohuck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, October 8, 2021 4:14 PM
> 
> On Thu, Oct 07, 2021 at 03:42:24AM +0000, Parav Pandit wrote:
> > > There's no reason for a driver to choose any value - it has nothing else to
> do.
> > >
> > At least for the devices that we are seeing, driver choosing a default value of
> 2 to 3 minutes is good enough and of course useful.
> 
> Well I don't get this of course yet. I guess it's useful for some workloads, but
> VM being blocked for 2 minutes breaks lots of workloads.
> Guests with 30 devices do exist, we are talking about an hour to start these
> guest.
> 
In usual flow device doesn't take so long to reset. But every single time things doesn't go well in a system.
So device reset timeout is the upper bound limit. A user can always cancel/reboot a physical server or ctrl+c a qemu/kata process.

This can be done regardless of reset timeout anyway.
However the addition here indicates programable way for the sw to recover from such device or wait enough.

> I wonder whether we should instead just provide some guidance on how long is
> it ok for reset to take? Say if we want to address kata, <100ms boot time, I
> guess we can recomment not more than 10ms? Anything more would trigger at
> least a warning.

It may be even a pre-boot environment where 100msec or 10msec may be too short interval as other extreme of VM boot time example.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [virtio-dev] Re: [PATCH v2] Add device reset timeout field
  2021-10-08 10:51               ` Parav Pandit
@ 2021-10-08 11:18                 ` Michael S. Tsirkin
  2021-10-08 12:55                   ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-08 11:18 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Fri, Oct 08, 2021 at 10:51:02AM +0000, Parav Pandit wrote:
> > That's why I ask: why do we bother? What's wrong with just waiting forever or
> > until user gets tired of this and cancels with CTRL-C?
> 
> Today, device removal of the device gets stuck for the device which didn't finish the reset, because its waiting for ever.
> 
> modprobe to my knowledge cannot be Ctrl-C. In another scenario, device probing of hot plug device occurs by hotplug driver in a workqueue context.

Frankly I'm not sure we need to worry about esoterica like
hotplug working when device can't get out of reset. If guest is
more or less alive that's already an achievement.

> It doesn't sound right to pass the burden to the user to invent some kind of ctrl-C cancel operation in hotplug drivers.

Oh interesting. Thanks for providing the motivation.

So if it's device removal you are after to fix,
then the proposed spec won't be enough I suspect, since
there's no specific time when we can be sure DMA won't
happen anymore. Just giving up on device isn't possible,
if you do you risk corrupting guest memory which seems scarier
than just blocking hotplug.

I guess with this in mind, what would be needed is a more fine-grained
approach. E.g. driver writes 0 to reset, device returns 1 to indicate
reset in progress, at that point it can promise that DMA/interrupts
won't happen any longer, so driver can go away.




> > Is there a use-case where that's not good enough?
> A guest has got one good and another device that encountered a fault.
> Due to this faulty device which is unable to reset, is blocking other operations in the system.

It's pretty uncommon for guests to actually have devices
they don't really need to function normally.

> > 
> > 
> > > And this is probably yet another good reason to define migratable bits
> > > of a virtio device in the live migration spec extension.
> > 
> > "migratable bits" being what? non-guest visible device state? Sure, would be
> > great to have.  Don't think it will help in this instance.
> > 
> > --
> > MST


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] Add device reset timeout field
  2021-10-08 10:59           ` Parav Pandit
@ 2021-10-08 11:21             ` Michael S. Tsirkin
  2021-10-08 11:45               ` Parav Pandit
  2021-10-08 11:47               ` [virtio-dev] " Cornelia Huck
  0 siblings, 2 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-08 11:21 UTC (permalink / raw)
  To: Parav Pandit; +Cc: cohuck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Fri, Oct 08, 2021 at 10:59:02AM +0000, Parav Pandit wrote:
> It may be even a pre-boot environment where 100msec or 10msec may be too short interval as other extreme of VM boot time example.

I don't really know what this means. We are talking about how long
it takes device to calm down and stop poking at the host
after it's told to reset. 10ms worst case not enough for this?

-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-08 11:21             ` Michael S. Tsirkin
@ 2021-10-08 11:45               ` Parav Pandit
  2021-10-08 11:47               ` [virtio-dev] " Cornelia Huck
  1 sibling, 0 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-08 11:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: cohuck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, October 8, 2021 4:52 PM
> 
> On Fri, Oct 08, 2021 at 10:59:02AM +0000, Parav Pandit wrote:
> > It may be even a pre-boot environment where 100msec or 10msec may be
> too short interval as other extreme of VM boot time example.
> 
> I don't really know what this means. We are talking about how long it takes
> device to calm down and stop poking at the host after it's told to reset. 10ms
> worst case not enough for this?
For the example you gave, probably yes, it is.

However, when the device is coming up for the first time, in such reset scenario, device is not even poking at the host yet.
It is still under reset.
10msec may not be enough.
And infinite wait is yet another extreme where system is optimistic that device _will_ finish the reset.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [virtio-dev] Re: [PATCH v2] Add device reset timeout field
  2021-10-08 11:21             ` Michael S. Tsirkin
  2021-10-08 11:45               ` Parav Pandit
@ 2021-10-08 11:47               ` Cornelia Huck
  2021-10-08 12:12                 ` Parav Pandit
  1 sibling, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2021-10-08 11:47 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Fri, Oct 08 2021, "Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Fri, Oct 08, 2021 at 10:59:02AM +0000, Parav Pandit wrote:
>> It may be even a pre-boot environment where 100msec or 10msec may be too short interval as other extreme of VM boot time example.
>
> I don't really know what this means. We are talking about how long
> it takes device to calm down and stop poking at the host
> after it's told to reset. 10ms worst case not enough for this?

To me, that sounded more like a physical device that needs to do
something like boot its firmware before it can perform an actual virtio
operation (and reset simply happens to be the first one.)

So, I'm getting more confused about the scope of this timeout. If it's
more a "device might not be ready yet" issue, I don't think we need a
timeout for reset specifically. Same for races with hotplugging. If it's
about "reset may take some time, because it will take some time before
all operations have quiesced", I don't see how the device can come up
with a value that isn't anything other than a wild guess, and the driver
could do wild guessing equally well.

I feel like I'm missing something.


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-08 11:47               ` [virtio-dev] " Cornelia Huck
@ 2021-10-08 12:12                 ` Parav Pandit
  2021-10-08 12:57                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-08 12:12 UTC (permalink / raw)
  To: Cornelia Huck, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer



> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Friday, October 8, 2021 5:18 PM
> 
> On Fri, Oct 08 2021, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Fri, Oct 08, 2021 at 10:59:02AM +0000, Parav Pandit wrote:
> >> It may be even a pre-boot environment where 100msec or 10msec may be
> too short interval as other extreme of VM boot time example.
> >
> > I don't really know what this means. We are talking about how long it
> > takes device to calm down and stop poking at the host after it's told
> > to reset. 10ms worst case not enough for this?
> 
> To me, that sounded more like a physical device that needs to do something
> like boot its firmware before it can perform an actual virtio operation (and
> reset simply happens to be the first one.)
> 
> So, I'm getting more confused about the scope of this timeout. If it's more a
> "device might not be ready yet" issue, I don't think we need a timeout for reset
> specifically. Same for races with hotplugging. If it's about "reset may take some
> time, because it will take some time before all operations have quiesced", I
> don't see how the device can come up with a value that isn't anything other
> than a wild guess, and the driver could do wild guessing equally well.

Device implementation has good knowledge of how a given virtio device is implemented to not do wild guess.
I will take real world examples.
1. A physical virtio device backed by a firmware will take more than 10msec of boot time to respond to the reset operation.

2. A sriov VF virtio device for our case takes a lot lesser than this, but may take anywhere between 10 msec to 250msec.
This can happen on a firmware where user enabled 500 SR-IOV VFs.
Pci spec indicates that all VFs to initialize within 100msec. This translates to 0.2msec for one VF.
In some scenario this can be a hard to initialize a VF in 0.2 msec depending on what else a firmware is doing at that time.

3. A system has one or more virtio boot devices.
One of them happens to be faulty after a firmware upgrade.
Pre-boot env is infinitely waiting. Michael suggest to do disable such PCI slot by means of abstract Ctrl+C.
If PCI slot is disabled, that device must be physically taken out for recovery.
In an alternative, if device advertised a finite timeout, that device didn't boot, system gave up after finite timeout and server picked second boot option, and booted.
Now a system admin can repair the faulty device without physically taking it out.
Will infinite timeout help here? Or a device advertising finite timeout and recovering the system more useful?

4. device was hotplug in system and before it is fully probed, a hot unplug is triggered.
Device cannot respond to reset, because its hot unplugged.
OS waits infinitely for reset to complete.
And system component is stuck just because of one device.
Would a finite timeout help to abort this operation? Yes.

So is wild guess of 10msec for all devices or an infinite time most efficient way to handle above scenarios?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-08 11:18                 ` [virtio-dev] " Michael S. Tsirkin
@ 2021-10-08 12:55                   ` Parav Pandit
  0 siblings, 0 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-08 12:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, October 8, 2021 4:48 PM
> 
> On Fri, Oct 08, 2021 at 10:51:02AM +0000, Parav Pandit wrote:
> > > That's why I ask: why do we bother? What's wrong with just waiting
> > > forever or until user gets tired of this and cancels with CTRL-C?
> >
> > Today, device removal of the device gets stuck for the device which didn't
> finish the reset, because its waiting for ever.
> >
> > modprobe to my knowledge cannot be Ctrl-C. In another scenario, device
> probing of hot plug device occurs by hotplug driver in a workqueue context.
> 
> Frankly I'm not sure we need to worry about esoterica like hotplug working
> when device can't get out of reset. 

A linux virtio driver is not even able to pass a developer's basic test where 30 devices are hotplugged and unplugged.
This is because the driver is waiting infinitely for device to finish the reset, and a device cannot tell that it has 1sec timeout.

> 
> So if it's device removal you are after to fix, then the proposed spec won't be
> enough I suspect, since there's no specific time when we can be sure DMA
> won't happen anymore. Just giving up on device isn't possible, if you do you risk
> corrupting guest memory which seems scarier than just blocking hotplug.
> 
You are right, if the device is running and reset is initiated, and if timeout expires, it is better to leave the device the state it is in.

However as explained in other examples where device is undergoing its first reset during the spec defined initialization sequence, there is no reason for it to wait infinitely when DMA/interrupts are not even ever initiated or even features negotiation didn't even finish.
Right?

> I guess with this in mind, what would be needed is a more fine-grained
> approach. E.g. driver writes 0 to reset, device returns 1 to indicate reset in
> progress, at that point it can promise that DMA/interrupts won't happen any
> longer, so driver can go away.
> 
How is this helping? This is what happens today, no?
But it doesn't indicate that DMA/interrupts won't continue anymore.
I must be missing something.
I thought device would set DEVICE_NEEDS_RESET if it encountered error during reset or otherwise.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] Add device reset timeout field
  2021-10-08 12:12                 ` Parav Pandit
@ 2021-10-08 12:57                   ` Michael S. Tsirkin
  2021-10-08 13:23                     ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-08 12:57 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Fri, Oct 08, 2021 at 12:12:35PM +0000, Parav Pandit wrote:
> 
> 
> > From: Cornelia Huck <cohuck@redhat.com>
> > Sent: Friday, October 8, 2021 5:18 PM
> > 
> > On Fri, Oct 08 2021, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Fri, Oct 08, 2021 at 10:59:02AM +0000, Parav Pandit wrote:
> > >> It may be even a pre-boot environment where 100msec or 10msec may be
> > too short interval as other extreme of VM boot time example.
> > >
> > > I don't really know what this means. We are talking about how long it
> > > takes device to calm down and stop poking at the host after it's told
> > > to reset. 10ms worst case not enough for this?
> > 
> > To me, that sounded more like a physical device that needs to do something
> > like boot its firmware before it can perform an actual virtio operation (and
> > reset simply happens to be the first one.)
> > 
> > So, I'm getting more confused about the scope of this timeout. If it's more a
> > "device might not be ready yet" issue, I don't think we need a timeout for reset
> > specifically. Same for races with hotplugging. If it's about "reset may take some
> > time, because it will take some time before all operations have quiesced", I
> > don't see how the device can come up with a value that isn't anything other
> > than a wild guess, and the driver could do wild guessing equally well.
> 
> Device implementation has good knowledge of how a given virtio device is implemented to not do wild guess.
> I will take real world examples.
> 1. A physical virtio device backed by a firmware will take more than 10msec of boot time to respond to the reset operation.
> 
> 2. A sriov VF virtio device for our case takes a lot lesser than this, but may take anywhere between 10 msec to 250msec.
> This can happen on a firmware where user enabled 500 SR-IOV VFs.
> Pci spec indicates that all VFs to initialize within 100msec. This translates to 0.2msec for one VF.
> In some scenario this can be a hard to initialize a VF in 0.2 msec depending on what else a firmware is doing at that time.

That's separate from virtio reset though. virtio reset is
much lighter weight than a VF reset, all it needs to do is
return config space to original values and stop DMA.

> 3. A system has one or more virtio boot devices.
> One of them happens to be faulty after a firmware upgrade.
> Pre-boot env is infinitely waiting. Michael suggest to do disable such PCI slot by means of abstract Ctrl+C.
> If PCI slot is disabled, that device must be physically taken out for recovery.
> In an alternative, if device advertised a finite timeout, that device didn't boot, system gave up after finite timeout and server picked second boot option, and booted.
> Now a system admin can repair the faulty device without physically taking it out.
> Will infinite timeout help here? Or a device advertising finite timeout and recovering the system more useful?
> 
> 4. device was hotplug in system and before it is fully probed, a hot unplug is triggered.


I don't get this one. Are you talking about surprise removal here?
The way to handle that is surely not a timeout, we should be able to
test for device presence.

> Device cannot respond to reset, because its hot unplugged.
> OS waits infinitely for reset to complete.
> And system component is stuck just because of one device.
> Would a finite timeout help to abort this operation? Yes.

Except if it takes minutes it is not agile enough for many workloads.

> 
> So is wild guess of 10msec for all devices or an infinite time most efficient way to handle above scenarios?

Donnu, but as I hope you begin to see, as we start digging into
actual requirements, neither does a huge reset promise by the device.
How about some "keepalive" signal then? E.g. a register where each read
needs to respond with a different value, if it's the same then
device is stuck ...

-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-08 12:57                   ` Michael S. Tsirkin
@ 2021-10-08 13:23                     ` Parav Pandit
  2021-10-08 23:09                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-08 13:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, October 8, 2021 6:27 PM

> > 2. A sriov VF virtio device for our case takes a lot lesser than this, but may
> take anywhere between 10 msec to 250msec.
> > This can happen on a firmware where user enabled 500 SR-IOV VFs.
> > Pci spec indicates that all VFs to initialize within 100msec. This translates to
> 0.2msec for one VF.
> > In some scenario this can be a hard to initialize a VF in 0.2 msec depending
> on what else a firmware is doing at that time.
> 
> That's separate from virtio reset though. virtio reset is much lighter weight
> than a VF reset, all it needs to do is return config space to original values and
> stop DMA.
Again you took the valid example to stop the DMA of already initialized device, while above case is for the first init. :-)
virtio device is going the first reset during initialization. It should be able to tell how long to wait.
A device firmware may take more than 0.2msec to finish needed initialization to serve a virtio device.
Infinite wait of today works here. Question was for wild guess by driver for 100msec vs 10msec vs 0.2 msec.
Is that enough? 

> 
> > 3. A system has one or more virtio boot devices.
> > One of them happens to be faulty after a firmware upgrade.
> > Pre-boot env is infinitely waiting. Michael suggest to do disable such PCI slot
> by means of abstract Ctrl+C.
> > If PCI slot is disabled, that device must be physically taken out for recovery.
> > In an alternative, if device advertised a finite timeout, that device didn't boot,
> system gave up after finite timeout and server picked second boot option, and
> booted.
> > Now a system admin can repair the faulty device without physically taking it
> out.
> > Will infinite timeout help here? Or a device advertising finite timeout and
> recovering the system more useful?
> >
> > 4. device was hotplug in system and before it is fully probed, a hot unplug is
> triggered.
> 
> 
> I don't get this one. Are you talking about surprise removal here?
Yes.
> The way to handle that is surely not a timeout, we should be able to test for
> device presence.
Yes, it should be possible to update device presence of device under probe while its surprised removed.
I will look into it more.
However, this is not the only place timeout is used.
> 
> > Device cannot respond to reset, because its hot unplugged.
> > OS waits infinitely for reset to complete.
> > And system component is stuck just because of one device.
> > Would a finite timeout help to abort this operation? Yes.
> 
> Except if it takes minutes it is not agile enough for many workloads.
> 
> >
> > So is wild guess of 10msec for all devices or an infinite time most efficient
> way to handle above scenarios?
> 
> Donnu, but as I hope you begin to see, as we start digging into actual
> requirements, neither does a huge reset promise by the device.

A finite reset timeout helps in making the virtio devices more predicable to use.

> How about some "keepalive" signal then? E.g. a register where each read
> needs to respond with a different value, if it's the same then device is stuck ...

A device should be out of the reset, keep alive feature negotiated to respond to a keep alive requests from host driver.
Keep alive is useful post the reset+ init stage.
(keep alive is also used by nvme devices, similar to device ready TIMEOUT with granularity of 500msec, similar to virtio device reset timeout).


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] Add device reset timeout field
  2021-10-08 13:23                     ` Parav Pandit
@ 2021-10-08 23:09                       ` Michael S. Tsirkin
  2021-10-11 14:29                         ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-08 23:09 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Fri, Oct 08, 2021 at 01:23:52PM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, October 8, 2021 6:27 PM
> 
> > > 2. A sriov VF virtio device for our case takes a lot lesser than this, but may
> > take anywhere between 10 msec to 250msec.
> > > This can happen on a firmware where user enabled 500 SR-IOV VFs.
> > > Pci spec indicates that all VFs to initialize within 100msec. This translates to
> > 0.2msec for one VF.
> > > In some scenario this can be a hard to initialize a VF in 0.2 msec depending
> > on what else a firmware is doing at that time.
> > 
> > That's separate from virtio reset though. virtio reset is much lighter weight
> > than a VF reset, all it needs to do is return config space to original values and
> > stop DMA.
> Again you took the valid example to stop the DMA of already initialized device, while above case is for the first init. :-)
> virtio device is going the first reset during initialization. It should be able to tell how long to wait.
> A device firmware may take more than 0.2msec to finish needed initialization to serve a virtio device.
> Infinite wait of today works here.

Looks like it's as Cornelia said - nothing to do with reset. E.g. it's
likely device can not even serve pci config before the init is complete.

> Question was for wild guess by driver for 100msec vs 10msec vs 0.2 msec.
> Is that enough? 

So some guidance in the spec on how long it should take will address
this I think.

> > 
> > > 3. A system has one or more virtio boot devices.
> > > One of them happens to be faulty after a firmware upgrade.
> > > Pre-boot env is infinitely waiting. Michael suggest to do disable such PCI slot
> > by means of abstract Ctrl+C.
> > > If PCI slot is disabled, that device must be physically taken out for recovery.
> > > In an alternative, if device advertised a finite timeout, that device didn't boot,
> > system gave up after finite timeout and server picked second boot option, and
> > booted.
> > > Now a system admin can repair the faulty device without physically taking it
> > out.
> > > Will infinite timeout help here? Or a device advertising finite timeout and
> > recovering the system more useful?
> > >
> > > 4. device was hotplug in system and before it is fully probed, a hot unplug is
> > triggered.
> > 
> > 
> > I don't get this one. Are you talking about surprise removal here?
> Yes.
> > The way to handle that is surely not a timeout, we should be able to test for
> > device presence.
> Yes, it should be possible to update device presence of device under probe while its surprised removed.
> I will look into it more.
> However, this is not the only place timeout is used.

As in this example, I'd be worried people will rely on timeout instead
of addressing things properly.

> > 
> > > Device cannot respond to reset, because its hot unplugged.
> > > OS waits infinitely for reset to complete.
> > > And system component is stuck just because of one device.
> > > Would a finite timeout help to abort this operation? Yes.
> > 
> > Except if it takes minutes it is not agile enough for many workloads.
> > 
> > >
> > > So is wild guess of 10msec for all devices or an infinite time most efficient
> > way to handle above scenarios?
> > 
> > Donnu, but as I hope you begin to see, as we start digging into actual
> > requirements, neither does a huge reset promise by the device.
> 
> A finite reset timeout helps in making the virtio devices more predicable to use.
> 
> > How about some "keepalive" signal then? E.g. a register where each read
> > needs to respond with a different value, if it's the same then device is stuck ...
> 
> A device should be out of the reset, keep alive feature negotiated to respond to a keep alive requests from host driver.
> Keep alive is useful post the reset+ init stage.
> (keep alive is also used by nvme devices, similar to device ready TIMEOUT with granularity of 500msec, similar to virtio device reset timeout).

Just to clarify, what I call keepalive here is a counter
providing a different value on each read.
This can thinkably work even before feature negotiation.

-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-08 23:09                       ` Michael S. Tsirkin
@ 2021-10-11 14:29                         ` Parav Pandit
  2021-10-11 14:59                           ` [virtio-dev] " Cornelia Huck
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-11 14:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Saturday, October 9, 2021 4:39 AM
> 
> On Fri, Oct 08, 2021 at 01:23:52PM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, October 8, 2021 6:27 PM
> >
> > > > 2. A sriov VF virtio device for our case takes a lot lesser than
> > > > this, but may
> > > take anywhere between 10 msec to 250msec.
> > > > This can happen on a firmware where user enabled 500 SR-IOV VFs.
> > > > Pci spec indicates that all VFs to initialize within 100msec. This
> > > > translates to
> > > 0.2msec for one VF.
> > > > In some scenario this can be a hard to initialize a VF in 0.2 msec
> > > > depending
> > > on what else a firmware is doing at that time.
> > >
> > > That's separate from virtio reset though. virtio reset is much
> > > lighter weight than a VF reset, all it needs to do is return config
> > > space to original values and stop DMA.
> > Again you took the valid example to stop the DMA of already
> > initialized device, while above case is for the first init. :-) virtio device is going
> the first reset during initialization. It should be able to tell how long to wait.
> > A device firmware may take more than 0.2msec to finish needed initialization
> to serve a virtio device.
> > Infinite wait of today works here.
> 
> Looks like it's as Cornelia said - nothing to do with reset. E.g. it's likely device
> can not even serve pci config before the init is complete.
> 
Not sure. I replied to Cornelia.
A pci device once placed on the pci bus has to respond to the config requests.
However virtio pci device virtio config registers can be accessed only after reset is completed.

More below.

> > Question was for wild guess by driver for 100msec vs 10msec vs 0.2 msec.
> > Is that enough?
> 
> So some guidance in the spec on how long it should take will address this I
> think.
What would be that guidance, what is the finite time that sw driver wait for the device reset to finish?
I still not understand, what is the problem in device passing this optional hint to sw to indicate it, instead of some guess work.
> > >
> > > > 3. A system has one or more virtio boot devices.
> > > > One of them happens to be faulty after a firmware upgrade.
> > > > Pre-boot env is infinitely waiting. Michael suggest to do disable
> > > > such PCI slot
> > > by means of abstract Ctrl+C.
> > > > If PCI slot is disabled, that device must be physically taken out for
> recovery.
> > > > In an alternative, if device advertised a finite timeout, that
> > > > device didn't boot,
> > > system gave up after finite timeout and server picked second boot
> > > option, and booted.
> > > > Now a system admin can repair the faulty device without physically
> > > > taking it
> > > out.
> > > > Will infinite timeout help here? Or a device advertising finite
> > > > timeout and
> > > recovering the system more useful?
> > > >
> > > > 4. device was hotplug in system and before it is fully probed, a
> > > > hot unplug is
> > > triggered.
> > >
> > >
> > > I don't get this one. Are you talking about surprise removal here?
> > Yes.
> > > The way to handle that is surely not a timeout, we should be able to
> > > test for device presence.
> > Yes, it should be possible to update device presence of device under probe
> while its surprised removed.
> > I will look into it more.
> > However, this is not the only place timeout is used.
> 
> As in this example, I'd be worried people will rely on timeout instead of
> addressing things properly.
> 

> > > How about some "keepalive" signal then? E.g. a register where each
> > > read needs to respond with a different value, if it's the same then device is
> stuck ...
> >
> > A device should be out of the reset, keep alive feature negotiated to respond
> to a keep alive requests from host driver.
> > Keep alive is useful post the reset+ init stage.
> > (keep alive is also used by nvme devices, similar to device ready TIMEOUT
> with granularity of 500msec, similar to virtio device reset timeout).
> 
> Just to clarify, what I call keepalive here is a counter providing a different value
> on each read.
> This can thinkably work even before feature negotiation.
> 
Feature negotiation is mainly for backward compatibility.
This is unlikely to work the reset is completed. Because a real device implementing this would prefer to do this in fw for 1000 virtio devices sitting on the physical card.
And it is very much driven by such implementation at device devel.
So it cannot update the counter value if reset is not completed for the device.

I think read only device reset timeout is most elegant option during device initialization phase that eliminates infinite loop of today.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-11 14:29                         ` Parav Pandit
@ 2021-10-11 14:59                           ` Cornelia Huck
  2021-10-11 15:44                             ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2021-10-11 14:59 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Mon, Oct 11 2021, Parav Pandit <parav@nvidia.com> wrote:

>> From: Michael S. Tsirkin <mst@redhat.com>
>> Sent: Saturday, October 9, 2021 4:39 AM
>> 
>> On Fri, Oct 08, 2021 at 01:23:52PM +0000, Parav Pandit wrote:
>> >
>> >
>> > > From: Michael S. Tsirkin <mst@redhat.com>
>> > > Sent: Friday, October 8, 2021 6:27 PM
>> >
>> > > > 2. A sriov VF virtio device for our case takes a lot lesser than
>> > > > this, but may
>> > > take anywhere between 10 msec to 250msec.
>> > > > This can happen on a firmware where user enabled 500 SR-IOV VFs.
>> > > > Pci spec indicates that all VFs to initialize within 100msec. This
>> > > > translates to
>> > > 0.2msec for one VF.
>> > > > In some scenario this can be a hard to initialize a VF in 0.2 msec
>> > > > depending
>> > > on what else a firmware is doing at that time.
>> > >
>> > > That's separate from virtio reset though. virtio reset is much
>> > > lighter weight than a VF reset, all it needs to do is return config
>> > > space to original values and stop DMA.
>> > Again you took the valid example to stop the DMA of already
>> > initialized device, while above case is for the first init. :-) virtio device is going
>> the first reset during initialization. It should be able to tell how long to wait.
>> > A device firmware may take more than 0.2msec to finish needed initialization
>> to serve a virtio device.
>> > Infinite wait of today works here.
>> 
>> Looks like it's as Cornelia said - nothing to do with reset. E.g. it's likely device
>> can not even serve pci config before the init is complete.
>> 
> Not sure. I replied to Cornelia.
> A pci device once placed on the pci bus has to respond to the config requests.
> However virtio pci device virtio config registers can be accessed only after reset is completed.

What I don't understand: if a pci device is in such a state that it
cannot complete a reset, how can it be able to serve the pci config? In
the cases I saw mentioned (hotunplug, slow device startup), I'd expect
the device simply being slow/unable to respond to anything.
>
> More below.
>
>> > Question was for wild guess by driver for 100msec vs 10msec vs 0.2 msec.
>> > Is that enough?
>> 
>> So some guidance in the spec on how long it should take will address this I
>> think.
> What would be that guidance, what is the finite time that sw driver wait for the device reset to finish?
> I still not understand, what is the problem in device passing this optional hint to sw to indicate it, instead of some guess work.
>> > >
>> > > > 3. A system has one or more virtio boot devices.
>> > > > One of them happens to be faulty after a firmware upgrade.
>> > > > Pre-boot env is infinitely waiting. Michael suggest to do disable
>> > > > such PCI slot
>> > > by means of abstract Ctrl+C.
>> > > > If PCI slot is disabled, that device must be physically taken out for
>> recovery.
>> > > > In an alternative, if device advertised a finite timeout, that
>> > > > device didn't boot,
>> > > system gave up after finite timeout and server picked second boot
>> > > option, and booted.
>> > > > Now a system admin can repair the faulty device without physically
>> > > > taking it
>> > > out.
>> > > > Will infinite timeout help here? Or a device advertising finite
>> > > > timeout and
>> > > recovering the system more useful?
>> > > >
>> > > > 4. device was hotplug in system and before it is fully probed, a
>> > > > hot unplug is
>> > > triggered.
>> > >
>> > >
>> > > I don't get this one. Are you talking about surprise removal here?
>> > Yes.
>> > > The way to handle that is surely not a timeout, we should be able to
>> > > test for device presence.
>> > Yes, it should be possible to update device presence of device under probe
>> while its surprised removed.
>> > I will look into it more.
>> > However, this is not the only place timeout is used.
>> 
>> As in this example, I'd be worried people will rely on timeout instead of
>> addressing things properly.
>> 
>
>> > > How about some "keepalive" signal then? E.g. a register where each
>> > > read needs to respond with a different value, if it's the same then device is
>> stuck ...
>> >
>> > A device should be out of the reset, keep alive feature negotiated to respond
>> to a keep alive requests from host driver.
>> > Keep alive is useful post the reset+ init stage.
>> > (keep alive is also used by nvme devices, similar to device ready TIMEOUT
>> with granularity of 500msec, similar to virtio device reset timeout).
>> 
>> Just to clarify, what I call keepalive here is a counter providing a different value
>> on each read.
>> This can thinkably work even before feature negotiation.
>> 
> Feature negotiation is mainly for backward compatibility.

Related, I found I had opened this issue some time ago:
https://github.com/oasis-tcs/virtio-spec/issues/98 Might be relevant
when considering guarding this via a feature bit.

> This is unlikely to work the reset is completed. Because a real device implementing this would prefer to do this in fw for 1000 virtio devices sitting on the physical card.
> And it is very much driven by such implementation at device devel.
> So it cannot update the counter value if reset is not completed for the device.
>
> I think read only device reset timeout is most elegant option during device initialization phase that eliminates infinite loop of today.

Why can't a driver just go ahead and do a timeout regardless? This seems
pretty much to depend on the implementation. Nothing in the virtio spec
forces the driver to reset all devices serially and wait potentially
forever. (If you deal with a large amount of subdevices you cannot
access concurrently, you likely already have problems.)


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-11 14:59                           ` [virtio-dev] " Cornelia Huck
@ 2021-10-11 15:44                             ` Parav Pandit
  2021-10-11 16:00                               ` Michael S. Tsirkin
  2021-10-11 16:22                               ` [virtio-dev] " Cornelia Huck
  0 siblings, 2 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-11 15:44 UTC (permalink / raw)
  To: Cornelia Huck, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer



> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Monday, October 11, 2021 8:29 PM
> 
> On Mon, Oct 11 2021, Parav Pandit <parav@nvidia.com> wrote:
> 
> >> From: Michael S. Tsirkin <mst@redhat.com>
> >> Sent: Saturday, October 9, 2021 4:39 AM
> >>
> >> On Fri, Oct 08, 2021 at 01:23:52PM +0000, Parav Pandit wrote:
> >> >
> >> >
> >> > > From: Michael S. Tsirkin <mst@redhat.com>
> >> > > Sent: Friday, October 8, 2021 6:27 PM
> >> >
> >> > > > 2. A sriov VF virtio device for our case takes a lot lesser
> >> > > > than this, but may
> >> > > take anywhere between 10 msec to 250msec.
> >> > > > This can happen on a firmware where user enabled 500 SR-IOV VFs.
> >> > > > Pci spec indicates that all VFs to initialize within 100msec.
> >> > > > This translates to
> >> > > 0.2msec for one VF.
> >> > > > In some scenario this can be a hard to initialize a VF in 0.2
> >> > > > msec depending
> >> > > on what else a firmware is doing at that time.
> >> > >
> >> > > That's separate from virtio reset though. virtio reset is much
> >> > > lighter weight than a VF reset, all it needs to do is return
> >> > > config space to original values and stop DMA.
> >> > Again you took the valid example to stop the DMA of already
> >> > initialized device, while above case is for the first init. :-)
> >> > virtio device is going
> >> the first reset during initialization. It should be able to tell how long to wait.
> >> > A device firmware may take more than 0.2msec to finish needed
> >> > initialization
> >> to serve a virtio device.
> >> > Infinite wait of today works here.
> >>
> >> Looks like it's as Cornelia said - nothing to do with reset. E.g.
> >> it's likely device can not even serve pci config before the init is complete.
> >>
> > Not sure. I replied to Cornelia.
> > A pci device once placed on the pci bus has to respond to the config requests.
> > However virtio pci device virtio config registers can be accessed only after
> reset is completed.
> 
> What I don't understand: if a pci device is in such a state that it cannot
> complete a reset, how can it be able to serve the pci config? In the cases I saw
> mentioned (hotunplug, slow device startup), I'd expect the device simply being
> slow/unable to respond to anything.
> >
Not necessarily.
PCI device implementation in hw/fw needs to comply to the PCI spec that defines pci level semantics.
Device implements virtio specification on top of the PCI spec.
Serving virtio functionality is more than just PCI configs, so device implementation able to service pci requests but may not be ready to serve virtio block and networking IOs.
This provides best tradeoffs in hw implementation of what is done in which layer of hw/fw etc for PCI and virtio layer.

Such implementation is very common in multiple PCI device families such as mlx5, nvme and more.
For example nvme device has controller ready timeout with granularity of 500msec and has 16-bits of timeout.

> > More below.
> >
> > Feature negotiation is mainly for backward compatibility.
> 
> Related, I found I had opened this issue some time ago:
> https://github.com/oasis-tcs/virtio-spec/issues/98 Might be relevant when
> considering guarding this via a feature bit.
> 
If you think above can see day light, we may have feature bit for device reset too.
However above github note says "read optional field before FEATURES_OK", and not read optional field _before_ device reset.
Do you plan to modify issues/98 to read optional field before device reset?

> > This is unlikely to work the reset is completed. Because a real device
> implementing this would prefer to do this in fw for 1000 virtio devices sitting on
> the physical card.
> > And it is very much driven by such implementation at device devel.
> > So it cannot update the counter value if reset is not completed for the device.
> >
> > I think read only device reset timeout is most elegant option during device
> initialization phase that eliminates infinite loop of today.
> 
> Why can't a driver just go ahead and do a timeout regardless? 
o.k. lets consider this thought exercise. What is the timeout value that driver will choose if device doesn't specify one?
I explained in previous thread and you acked that actual fw based device may take longer to initialize than pure sw implementation backend.
In second example a pre-boot device can take even longer initialization time.
Sriov VF device may initialize lot faster.
Instead of driver having such transport, and device specific checks, (or some very short or very long timeout), we propose, that let device mention such timeout value.

> This seems
> pretty much to depend on the implementation. Nothing in the virtio spec forces
> the driver to reset all devices serially and wait potentially forever. (If you deal
> with a large amount of subdevices you cannot access concurrently, you likely
> already have problems.)
Sure, nothing in driver forces serial initialization.
However in reality, PCI device scanning on the bus in hotplug driver, preboot env, sriov vf initialization is serial.
In preboot case, even the order is defined, so it is sort of serial.

In other uses, concurrent access is fine and host does that today when it disables autoprobe on the PCI Bus and probes them in parallel after enabling sriov.
I hope you understand that there are both use-cases exists which scan parallel and some serial.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] Add device reset timeout field
  2021-10-11 15:44                             ` Parav Pandit
@ 2021-10-11 16:00                               ` Michael S. Tsirkin
  2021-10-12  8:51                                 ` Parav Pandit
  2021-10-11 16:22                               ` [virtio-dev] " Cornelia Huck
  1 sibling, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-11 16:00 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> > > This is unlikely to work the reset is completed. Because a real device
> > implementing this would prefer to do this in fw for 1000 virtio devices sitting on
> > the physical card.
> > > And it is very much driven by such implementation at device devel.
> > > So it cannot update the counter value if reset is not completed for the device.
> > >
> > > I think read only device reset timeout is most elegant option during device
> > initialization phase that eliminates infinite loop of today.
> > 
> > Why can't a driver just go ahead and do a timeout regardless? 
> o.k. lets consider this thought exercise. What is the timeout value that driver will choose if device doesn't specify one?
> I explained in previous thread and you acked that actual fw based device may take longer to initialize than pure sw implementation backend.
> In second example a pre-boot device can take even longer initialization time.
> Sriov VF device may initialize lot faster.
> Instead of driver having such transport, and device specific checks, (or some very short or very long timeout), we propose, that let device mention such timeout value.

Parav I think you are conflating reset with initialization time.
initialization is just for host boot which takes seconds anyway -
but no, minutes is not reasonable their, either.
reset affects guest boot. This needs to complete in milliseconds.

This conflation is IMHO one of the problems with this proposal.

-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-11 15:44                             ` Parav Pandit
  2021-10-11 16:00                               ` Michael S. Tsirkin
@ 2021-10-11 16:22                               ` Cornelia Huck
  2021-10-12 10:35                                 ` Parav Pandit
  1 sibling, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2021-10-11 16:22 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Mon, Oct 11 2021, Parav Pandit <parav@nvidia.com> wrote:

>> From: Cornelia Huck <cohuck@redhat.com>
>> Sent: Monday, October 11, 2021 8:29 PM
>> 
>> On Mon, Oct 11 2021, Parav Pandit <parav@nvidia.com> wrote:

>> > I think read only device reset timeout is most elegant option during device
>> initialization phase that eliminates infinite loop of today.
>> 
>> Why can't a driver just go ahead and do a timeout regardless? 
> o.k. lets consider this thought exercise. What is the timeout value that driver will choose if device doesn't specify one?
> I explained in previous thread and you acked that actual fw based device may take longer to initialize than pure sw implementation backend.
> In second example a pre-boot device can take even longer initialization time.
> Sriov VF device may initialize lot faster.
> Instead of driver having such transport, and device specific checks, (or some very short or very long timeout), we propose, that let device mention such timeout value.

But does the driver really care in all cases? Say, I'm a virtio-blk
driver, and I drive all kinds of virtio-blk devices. If I choose a
timeout value that accommodates hardware devices, software devices will
just always breeze past, and nothing bad happens. If I know that I'm in
a pre-boot environment, I'll just add on a value that allows devices to
warm up. I'd say reasonable values depend way more on the driver
implementation and the environment it is running in than on the device.

>
>> This seems
>> pretty much to depend on the implementation. Nothing in the virtio spec forces
>> the driver to reset all devices serially and wait potentially forever. (If you deal
>> with a large amount of subdevices you cannot access concurrently, you likely
>> already have problems.)
> Sure, nothing in driver forces serial initialization.
> However in reality, PCI device scanning on the bus in hotplug driver, preboot env, sriov vf initialization is serial.

So, can't pci scan asynchronously? Is that a limitation in the pci spec?
The Linux ccw bus switched to asynchronous processing precisely because
of slow devices years ago, FWIW, so that may be why I don't really
understand the issue.

> In preboot case, even the order is defined, so it is sort of serial.

Again, is that somewhere in the hardware spec, and can it be fixed?
Serializing something that might take some time just seems troublesome.

> In other uses, concurrent access is fine and host does that today when it disables autoprobe on the PCI Bus and probes them in parallel after enabling sriov.
> I hope you understand that there are both use-cases exists which scan parallel and some serial.

Can at least some processing be made asynchronous? Like, you discover
the device, but then do the virtio setup out of band.


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-11 16:00                               ` Michael S. Tsirkin
@ 2021-10-12  8:51                                 ` Parav Pandit
  2021-10-12  9:01                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-12  8:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, October 11, 2021 9:30 PM
> 
> On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> > > > This is unlikely to work the reset is completed. Because a real
> > > > device
> > > implementing this would prefer to do this in fw for 1000 virtio
> > > devices sitting on the physical card.
> > > > And it is very much driven by such implementation at device devel.
> > > > So it cannot update the counter value if reset is not completed for the
> device.
> > > >
> > > > I think read only device reset timeout is most elegant option
> > > > during device
> > > initialization phase that eliminates infinite loop of today.
> > >
> > > Why can't a driver just go ahead and do a timeout regardless?
> > o.k. lets consider this thought exercise. What is the timeout value that driver
> will choose if device doesn't specify one?
> > I explained in previous thread and you acked that actual fw based device
> may take longer to initialize than pure sw implementation backend.
> > In second example a pre-boot device can take even longer initialization time.
> > Sriov VF device may initialize lot faster.
> > Instead of driver having such transport, and device specific checks, (or some
> very short or very long timeout), we propose, that let device mention such
> timeout value.
> 
> Parav I think you are conflating reset with initialization time.
> initialization is just for host boot which takes seconds anyway - but no,
> minutes is not reasonable their, either.
> reset affects guest boot. This needs to complete in milliseconds.
>
I cannot promise, but with newer generation devices usually functionality improves.
Enforcing in milliseconds doesn't look practical for type of devices.
Some of the block devices may need to establish TCP connections in the backend.
It is more useful to wait for few more seconds to initialize device after power on the system, instead of giving up booting the server completely.
For example, a nvme block device starts with a minimum timeout of 500msec.

Yes, I agree to your point that a device given to a guest VM will likely have very short reset time that should complete in milliseconds.
 
> This conflation is IMHO one of the problems with this proposal.

Device initialization consist of device reset from the spec section 3.1.1.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] Add device reset timeout field
  2021-10-12  8:51                                 ` Parav Pandit
@ 2021-10-12  9:01                                   ` Michael S. Tsirkin
  2021-10-12  9:12                                     ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-12  9:01 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Monday, October 11, 2021 9:30 PM
> > 
> > On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> > > > > This is unlikely to work the reset is completed. Because a real
> > > > > device
> > > > implementing this would prefer to do this in fw for 1000 virtio
> > > > devices sitting on the physical card.
> > > > > And it is very much driven by such implementation at device devel.
> > > > > So it cannot update the counter value if reset is not completed for the
> > device.
> > > > >
> > > > > I think read only device reset timeout is most elegant option
> > > > > during device
> > > > initialization phase that eliminates infinite loop of today.
> > > >
> > > > Why can't a driver just go ahead and do a timeout regardless?
> > > o.k. lets consider this thought exercise. What is the timeout value that driver
> > will choose if device doesn't specify one?
> > > I explained in previous thread and you acked that actual fw based device
> > may take longer to initialize than pure sw implementation backend.
> > > In second example a pre-boot device can take even longer initialization time.
> > > Sriov VF device may initialize lot faster.
> > > Instead of driver having such transport, and device specific checks, (or some
> > very short or very long timeout), we propose, that let device mention such
> > timeout value.
> > 
> > Parav I think you are conflating reset with initialization time.
> > initialization is just for host boot which takes seconds anyway - but no,
> > minutes is not reasonable their, either.
> > reset affects guest boot. This needs to complete in milliseconds.
> >
> I cannot promise, but with newer generation devices usually functionality improves.
> Enforcing in milliseconds doesn't look practical for type of devices.
> Some of the block devices may need to establish TCP connections in the backend.
> It is more useful to wait for few more seconds to initialize device after power on the system, instead of giving up booting the server completely.
> For example, a nvme block device starts with a minimum timeout of 500msec.
> 
> Yes, I agree to your point that a device given to a guest VM will likely have very short reset time that should complete in milliseconds.
>  
> > This conflation is IMHO one of the problems with this proposal.
> 
> Device initialization consist of device reset from the spec section 3.1.1.

It does. But maybe we need to create a way for driver to distinguish
between the two. When under reset, use a driver supplied timeout.
When powering up, use a longer device supplied one.
migration is not a problem for baremetal so all's good from
that point of view. And power up seems irrelevant for
ccw/mmio since these are always within VMs. So it's a pci only thing.

-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-12  9:01                                   ` Michael S. Tsirkin
@ 2021-10-12  9:12                                     ` Parav Pandit
  2021-10-14 17:35                                       ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-12  9:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, October 12, 2021 2:32 PM
> 
> On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Monday, October 11, 2021 9:30 PM
> > >
> > > On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> > > > > > This is unlikely to work the reset is completed. Because a
> > > > > > real device
> > > > > implementing this would prefer to do this in fw for 1000 virtio
> > > > > devices sitting on the physical card.
> > > > > > And it is very much driven by such implementation at device devel.
> > > > > > So it cannot update the counter value if reset is not
> > > > > > completed for the
> > > device.
> > > > > >
> > > > > > I think read only device reset timeout is most elegant option
> > > > > > during device
> > > > > initialization phase that eliminates infinite loop of today.
> > > > >
> > > > > Why can't a driver just go ahead and do a timeout regardless?
> > > > o.k. lets consider this thought exercise. What is the timeout
> > > > value that driver
> > > will choose if device doesn't specify one?
> > > > I explained in previous thread and you acked that actual fw based
> > > > device
> > > may take longer to initialize than pure sw implementation backend.
> > > > In second example a pre-boot device can take even longer initialization
> time.
> > > > Sriov VF device may initialize lot faster.
> > > > Instead of driver having such transport, and device specific
> > > > checks, (or some
> > > very short or very long timeout), we propose, that let device
> > > mention such timeout value.
> > >
> > > Parav I think you are conflating reset with initialization time.
> > > initialization is just for host boot which takes seconds anyway -
> > > but no, minutes is not reasonable their, either.
> > > reset affects guest boot. This needs to complete in milliseconds.
> > >
> > I cannot promise, but with newer generation devices usually functionality
> improves.
> > Enforcing in milliseconds doesn't look practical for type of devices.
> > Some of the block devices may need to establish TCP connections in the
> backend.
> > It is more useful to wait for few more seconds to initialize device after power
> on the system, instead of giving up booting the server completely.
> > For example, a nvme block device starts with a minimum timeout of
> 500msec.
> >
> > Yes, I agree to your point that a device given to a guest VM will likely have
> very short reset time that should complete in milliseconds.
> >
> > > This conflation is IMHO one of the problems with this proposal.
> >
> > Device initialization consist of device reset from the spec section 3.1.1.
> 
> It does. But maybe we need to create a way for driver to distinguish between
> the two. When under reset, use a driver supplied timeout.
This make sense, because as we discussed when device undergo a reset with active DMA, after timeout expires, driver still cannot cleanup.
So this can be short driver decided value as longer timeout is not useful.

> When powering up, use a longer device supplied one.
In v0, v1 I initially considered only the powering up case of the device initialization. There was text around that.
And v2 I removed the initialization text, and I totally missed the above case with active DMA.
This should work.
We should word this part of the spec accordingly.

> migration is not a problem for baremetal so all's good from that point of view.
> And power up seems irrelevant for ccw/mmio since these are always within
> VMs. So it's a pci only thing.
>
I am not 100% sure for MMIO.
There may be future MMIO device similar to PCI, not sure.
I thought someone else too (Qualcomm?) had MMIO device that took bit longer to initialize in past discussion?
 
> --
> MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-11 16:22                               ` [virtio-dev] " Cornelia Huck
@ 2021-10-12 10:35                                 ` Parav Pandit
  0 siblings, 0 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-12 10:35 UTC (permalink / raw)
  To: Cornelia Huck, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer


> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Monday, October 11, 2021 9:52 PM
> 
> On Mon, Oct 11 2021, Parav Pandit <parav@nvidia.com> wrote:
> 
> >> From: Cornelia Huck <cohuck@redhat.com>
> >> Sent: Monday, October 11, 2021 8:29 PM
> >>
> >> On Mon, Oct 11 2021, Parav Pandit <parav@nvidia.com> wrote:
> 
> >> > I think read only device reset timeout is most elegant option
> >> > during device
> >> initialization phase that eliminates infinite loop of today.
> >>
> >> Why can't a driver just go ahead and do a timeout regardless?
> > o.k. lets consider this thought exercise. What is the timeout value that driver
> will choose if device doesn't specify one?
> > I explained in previous thread and you acked that actual fw based device
> may take longer to initialize than pure sw implementation backend.
> > In second example a pre-boot device can take even longer initialization time.
> > Sriov VF device may initialize lot faster.
> > Instead of driver having such transport, and device specific checks, (or some
> very short or very long timeout), we propose, that let device mention such
> timeout value.
> 
> But does the driver really care in all cases? 
Idea is that driver shouldn't care based on device type; rather just care what device provided.

> Say, I'm a virtio-blk driver, and I
> drive all kinds of virtio-blk devices. If I choose a timeout value that
> accommodates hardware devices, software devices will just always breeze
> past, and nothing bad happens. 
I had hard time defining a universal timeout that fits all device types of hw and sw.
We have 3 variants of the hw based devices and they have range from milliseconds to ten of seconds (in worst case).
Hence, the request is to let device feed the value based on the internal logic.

> If I know that I'm in a pre-boot environment,
> I'll just add on a value that allows devices to warm up. I'd say reasonable
> values depend way more on the driver implementation and the environment
> it is running in than on the device.
> 
It will be usually min_of (env.max_timeout, device.reset_timeout) if env wants to enforce its upper limit.

> >
> >> This seems
> >> pretty much to depend on the implementation. Nothing in the virtio
> >> spec forces the driver to reset all devices serially and wait
> >> potentially forever. (If you deal with a large amount of subdevices
> >> you cannot access concurrently, you likely already have problems.)
> > Sure, nothing in driver forces serial initialization.
> > However in reality, PCI device scanning on the bus in hotplug driver, preboot
> env, sriov vf initialization is serial.
> 
> So, can't pci scan asynchronously? Is that a limitation in the pci spec?
There isn't limitation from pci spec to my knowledge.
Linux device discovery to my knowledge is not parallel across or within a subsystem.

> The Linux ccw bus switched to asynchronous processing precisely because of
> slow devices years ago, FWIW, so that may be why I don't really understand
> the issue.
> 
> > In preboot case, even the order is defined, so it is sort of serial.
> 
> Again, is that somewhere in the hardware spec, and can it be fixed?
It is not in the hardware spec.
Preboot env usually runs with limited text and data memory without the OS.
User defined the boot order at the BIOS level. It is the way some systems work.
virtio device is fitting in the existing eco-system of physical servers.

> Serializing something that might take some time just seems troublesome.
> 
True, which is why as Michael mentioned, device resetting within short interval is better.
I cannot promise, but yes, with newer generation usually such timing and functionality should improves.
There is still finite timeout though.

> > In other uses, concurrent access is fine and host does that today when it
> disables autoprobe on the PCI Bus and probes them in parallel after enabling
> sriov.
> > I hope you understand that there are both use-cases exists which scan
> parallel and some serial.
> 
> Can at least some processing be made asynchronous? Like, you discover the
> device, but then do the virtio setup out of band.
Do you mean differ virtio device setup in a workqueue context or similar?

This may be possible, but when user does sysfs operations, it adds uncertainly to when the device's netdevice will actually appear even though sysfs operation completed for probe.
It requires users to listen to udev events to know that for a given PCI virtio pci device, now a net/block device is available.
This makes the whole flow complex to use.
May be certain device omit exposing device reset timeout, and workqueue/differed context can be avoided, but again now the user sw is unaware when to use udev to discover vs inline.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-12  9:12                                     ` Parav Pandit
@ 2021-10-14 17:35                                       ` Parav Pandit
  2021-10-14 22:28                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-14 17:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

Hi Michael, Cornelia,	

> From: Parav Pandit
> Sent: Tuesday, October 12, 2021 2:42 PM
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Tuesday, October 12, 2021 2:32 PM
> >
> > On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Monday, October 11, 2021 9:30 PM
> > > >
> > > > On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> > > > > > > This is unlikely to work the reset is completed. Because a
> > > > > > > real device
> > > > > > implementing this would prefer to do this in fw for 1000
> > > > > > virtio devices sitting on the physical card.
> > > > > > > And it is very much driven by such implementation at device devel.
> > > > > > > So it cannot update the counter value if reset is not
> > > > > > > completed for the
> > > > device.
> > > > > > >
> > > > > > > I think read only device reset timeout is most elegant
> > > > > > > option during device
> > > > > > initialization phase that eliminates infinite loop of today.
> > > > > >
> > > > > > Why can't a driver just go ahead and do a timeout regardless?
> > > > > o.k. lets consider this thought exercise. What is the timeout
> > > > > value that driver
> > > > will choose if device doesn't specify one?
> > > > > I explained in previous thread and you acked that actual fw
> > > > > based device
> > > > may take longer to initialize than pure sw implementation backend.
> > > > > In second example a pre-boot device can take even longer
> > > > > initialization
> > time.
> > > > > Sriov VF device may initialize lot faster.
> > > > > Instead of driver having such transport, and device specific
> > > > > checks, (or some
> > > > very short or very long timeout), we propose, that let device
> > > > mention such timeout value.
> > > >
> > > > Parav I think you are conflating reset with initialization time.
> > > > initialization is just for host boot which takes seconds anyway -
> > > > but no, minutes is not reasonable their, either.
> > > > reset affects guest boot. This needs to complete in milliseconds.
> > > >
> > > I cannot promise, but with newer generation devices usually
> > > functionality
> > improves.
> > > Enforcing in milliseconds doesn't look practical for type of devices.
> > > Some of the block devices may need to establish TCP connections in
> > > the
> > backend.
> > > It is more useful to wait for few more seconds to initialize device
> > > after power
> > on the system, instead of giving up booting the server completely.
> > > For example, a nvme block device starts with a minimum timeout of
> > 500msec.
> > >
> > > Yes, I agree to your point that a device given to a guest VM will
> > > likely have
> > very short reset time that should complete in milliseconds.
> > >
> > > > This conflation is IMHO one of the problems with this proposal.
> > >
> > > Device initialization consist of device reset from the spec section 3.1.1.
> >
> > It does. But maybe we need to create a way for driver to distinguish
> > between the two. When under reset, use a driver supplied timeout.
> This make sense, because as we discussed when device undergo a reset with
> active DMA, after timeout expires, driver still cannot cleanup.
> So this can be short driver decided value as longer timeout is not useful.
> 
> > When powering up, use a longer device supplied one.
> In v0, v1 I initially considered only the powering up case of the device
> initialization. There was text around that.
> And v2 I removed the initialization text, and I totally missed the above case with
> active DMA.
> This should work.
> We should word this part of the spec accordingly.

Below changes are good for v3?
1. driver should use device reset time during initialization stage
2. remove feature bit as feature bits are only readable after reset is completed
3. device reset timeout field of zero indicates that device doesn't support it.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] Add device reset timeout field
  2021-10-14 17:35                                       ` Parav Pandit
@ 2021-10-14 22:28                                         ` Michael S. Tsirkin
  2021-10-15  4:36                                           ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-14 22:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
> Hi Michael, Cornelia,	
> 
> > From: Parav Pandit
> > Sent: Tuesday, October 12, 2021 2:42 PM
> > 
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Tuesday, October 12, 2021 2:32 PM
> > >
> > > On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Monday, October 11, 2021 9:30 PM
> > > > >
> > > > > On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> > > > > > > > This is unlikely to work the reset is completed. Because a
> > > > > > > > real device
> > > > > > > implementing this would prefer to do this in fw for 1000
> > > > > > > virtio devices sitting on the physical card.
> > > > > > > > And it is very much driven by such implementation at device devel.
> > > > > > > > So it cannot update the counter value if reset is not
> > > > > > > > completed for the
> > > > > device.
> > > > > > > >
> > > > > > > > I think read only device reset timeout is most elegant
> > > > > > > > option during device
> > > > > > > initialization phase that eliminates infinite loop of today.
> > > > > > >
> > > > > > > Why can't a driver just go ahead and do a timeout regardless?
> > > > > > o.k. lets consider this thought exercise. What is the timeout
> > > > > > value that driver
> > > > > will choose if device doesn't specify one?
> > > > > > I explained in previous thread and you acked that actual fw
> > > > > > based device
> > > > > may take longer to initialize than pure sw implementation backend.
> > > > > > In second example a pre-boot device can take even longer
> > > > > > initialization
> > > time.
> > > > > > Sriov VF device may initialize lot faster.
> > > > > > Instead of driver having such transport, and device specific
> > > > > > checks, (or some
> > > > > very short or very long timeout), we propose, that let device
> > > > > mention such timeout value.
> > > > >
> > > > > Parav I think you are conflating reset with initialization time.
> > > > > initialization is just for host boot which takes seconds anyway -
> > > > > but no, minutes is not reasonable their, either.
> > > > > reset affects guest boot. This needs to complete in milliseconds.
> > > > >
> > > > I cannot promise, but with newer generation devices usually
> > > > functionality
> > > improves.
> > > > Enforcing in milliseconds doesn't look practical for type of devices.
> > > > Some of the block devices may need to establish TCP connections in
> > > > the
> > > backend.
> > > > It is more useful to wait for few more seconds to initialize device
> > > > after power
> > > on the system, instead of giving up booting the server completely.
> > > > For example, a nvme block device starts with a minimum timeout of
> > > 500msec.
> > > >
> > > > Yes, I agree to your point that a device given to a guest VM will
> > > > likely have
> > > very short reset time that should complete in milliseconds.
> > > >
> > > > > This conflation is IMHO one of the problems with this proposal.
> > > >
> > > > Device initialization consist of device reset from the spec section 3.1.1.
> > >
> > > It does. But maybe we need to create a way for driver to distinguish
> > > between the two. When under reset, use a driver supplied timeout.
> > This make sense, because as we discussed when device undergo a reset with
> > active DMA, after timeout expires, driver still cannot cleanup.
> > So this can be short driver decided value as longer timeout is not useful.
> > 
> > > When powering up, use a longer device supplied one.
> > In v0, v1 I initially considered only the powering up case of the device
> > initialization. There was text around that.
> > And v2 I removed the initialization text, and I totally missed the above case with
> > active DMA.
> > This should work.
> > We should word this part of the spec accordingly.
> 
> Below changes are good for v3?
> 1. driver should use device reset time during initialization stage

How does driver identify this though?

> 2. remove feature bit as feature bits are only readable after reset is completed
> 3. device reset timeout field of zero indicates that device doesn't support it.

I'm not sure about 3. I think each transport will need its own way to do it.

So I propose: maybe a capability like this, with a timeout field?
And within VMs, we can just do without, since it got out of reset once
it will surely get out of reset again...

-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-14 22:28                                         ` Michael S. Tsirkin
@ 2021-10-15  4:36                                           ` Parav Pandit
  2021-10-15  5:15                                             ` [virtio-dev] " Jason Wang
  2021-10-15  6:51                                             ` Cornelia Huck
  0 siblings, 2 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-15  4:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, October 15, 2021 3:59 AM
> 
> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
> > Hi Michael, Cornelia,
> >
> > > From: Parav Pandit
> > > Sent: Tuesday, October 12, 2021 2:42 PM
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Tuesday, October 12, 2021 2:32 PM
> > > >
> > > > On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Monday, October 11, 2021 9:30 PM
> > > > > >
> > > > > > On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> > > > > > > > > This is unlikely to work the reset is completed. Because
> > > > > > > > > a real device
> > > > > > > > implementing this would prefer to do this in fw for 1000
> > > > > > > > virtio devices sitting on the physical card.
> > > > > > > > > And it is very much driven by such implementation at device
> devel.
> > > > > > > > > So it cannot update the counter value if reset is not
> > > > > > > > > completed for the
> > > > > > device.
> > > > > > > > >
> > > > > > > > > I think read only device reset timeout is most elegant
> > > > > > > > > option during device
> > > > > > > > initialization phase that eliminates infinite loop of today.
> > > > > > > >
> > > > > > > > Why can't a driver just go ahead and do a timeout regardless?
> > > > > > > o.k. lets consider this thought exercise. What is the
> > > > > > > timeout value that driver
> > > > > > will choose if device doesn't specify one?
> > > > > > > I explained in previous thread and you acked that actual fw
> > > > > > > based device
> > > > > > may take longer to initialize than pure sw implementation backend.
> > > > > > > In second example a pre-boot device can take even longer
> > > > > > > initialization
> > > > time.
> > > > > > > Sriov VF device may initialize lot faster.
> > > > > > > Instead of driver having such transport, and device specific
> > > > > > > checks, (or some
> > > > > > very short or very long timeout), we propose, that let device
> > > > > > mention such timeout value.
> > > > > >
> > > > > > Parav I think you are conflating reset with initialization time.
> > > > > > initialization is just for host boot which takes seconds
> > > > > > anyway - but no, minutes is not reasonable their, either.
> > > > > > reset affects guest boot. This needs to complete in milliseconds.
> > > > > >
> > > > > I cannot promise, but with newer generation devices usually
> > > > > functionality
> > > > improves.
> > > > > Enforcing in milliseconds doesn't look practical for type of devices.
> > > > > Some of the block devices may need to establish TCP connections
> > > > > in the
> > > > backend.
> > > > > It is more useful to wait for few more seconds to initialize
> > > > > device after power
> > > > on the system, instead of giving up booting the server completely.
> > > > > For example, a nvme block device starts with a minimum timeout
> > > > > of
> > > > 500msec.
> > > > >
> > > > > Yes, I agree to your point that a device given to a guest VM
> > > > > will likely have
> > > > very short reset time that should complete in milliseconds.
> > > > >
> > > > > > This conflation is IMHO one of the problems with this proposal.
> > > > >
> > > > > Device initialization consist of device reset from the spec section 3.1.1.
> > > >
> > > > It does. But maybe we need to create a way for driver to
> > > > distinguish between the two. When under reset, use a driver supplied
> timeout.
> > > This make sense, because as we discussed when device undergo a reset
> > > with active DMA, after timeout expires, driver still cannot cleanup.
> > > So this can be short driver decided value as longer timeout is not useful.
> > >
> > > > When powering up, use a longer device supplied one.
> > > In v0, v1 I initially considered only the powering up case of the
> > > device initialization. There was text around that.
> > > And v2 I removed the initialization text, and I totally missed the
> > > above case with active DMA.
> > > This should work.
> > > We should word this part of the spec accordingly.
> >
> > Below changes are good for v3?
> > 1. driver should use device reset time during initialization stage
> 
> How does driver identify this though?
Existence of device_reset_timeout field in struct virtio_pci_common_cfg indicates that this field exists.
If device support it, it will place non zero value and driver knows that this field should be used.

> 
> > 2. remove feature bit as feature bits are only readable after reset is
> > completed 3. device reset timeout field of zero indicates that device doesn't
> support it.
> 
> I'm not sure about 3. I think each transport will need its own way to do it.
> 
For pci a value of zero indicates it isn't supported.
For mmio DeviceResetTimeout at offset 0x04c indicates same.
Currently only these 2 transports have the use.

> So I propose: maybe a capability like this, with a timeout field?
Do you mean a new capability like say VIRTIO_PCI_DEVICE_TIMEOUT like VIRTIO_PCI_CAP_COMMON_CFG?
This will contain one or more timeout? For example with his proposal it contains only device reset timeout.
Later same capability will be further extended to contains command timeout too? Yes?

> And within VMs, we can just do without, since it got out of reset once it will
> surely get out of reset again...
Yes, VM might not need it. It is really the HV's choice to implement and not part of the virtio spec.
Our internal cloud passthrough a PF to the VM.
It is probably better to let HV to choose if they want to do ctrl+c or have timeout.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  4:36                                           ` Parav Pandit
@ 2021-10-15  5:15                                             ` Jason Wang
  2021-10-15  5:20                                               ` Parav Pandit
  2021-10-15  6:51                                             ` Cornelia Huck
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Wang @ 2021-10-15  5:15 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer


在 2021/10/15 下午12:36, Parav Pandit 写道:
>
>> From: Michael S. Tsirkin <mst@redhat.com>
>> Sent: Friday, October 15, 2021 3:59 AM
>>
>> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
>>> Hi Michael, Cornelia,
>>>
>>>> From: Parav Pandit
>>>> Sent: Tuesday, October 12, 2021 2:42 PM
>>>>
>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>> Sent: Tuesday, October 12, 2021 2:32 PM
>>>>>
>>>>> On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
>>>>>>
>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Sent: Monday, October 11, 2021 9:30 PM
>>>>>>>
>>>>>>> On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
>>>>>>>>>> This is unlikely to work the reset is completed. Because
>>>>>>>>>> a real device
>>>>>>>>> implementing this would prefer to do this in fw for 1000
>>>>>>>>> virtio devices sitting on the physical card.
>>>>>>>>>> And it is very much driven by such implementation at device
>> devel.
>>>>>>>>>> So it cannot update the counter value if reset is not
>>>>>>>>>> completed for the
>>>>>>> device.
>>>>>>>>>> I think read only device reset timeout is most elegant
>>>>>>>>>> option during device
>>>>>>>>> initialization phase that eliminates infinite loop of today.
>>>>>>>>>
>>>>>>>>> Why can't a driver just go ahead and do a timeout regardless?
>>>>>>>> o.k. lets consider this thought exercise. What is the
>>>>>>>> timeout value that driver
>>>>>>> will choose if device doesn't specify one?
>>>>>>>> I explained in previous thread and you acked that actual fw
>>>>>>>> based device
>>>>>>> may take longer to initialize than pure sw implementation backend.
>>>>>>>> In second example a pre-boot device can take even longer
>>>>>>>> initialization
>>>>> time.
>>>>>>>> Sriov VF device may initialize lot faster.
>>>>>>>> Instead of driver having such transport, and device specific
>>>>>>>> checks, (or some
>>>>>>> very short or very long timeout), we propose, that let device
>>>>>>> mention such timeout value.
>>>>>>>
>>>>>>> Parav I think you are conflating reset with initialization time.
>>>>>>> initialization is just for host boot which takes seconds
>>>>>>> anyway - but no, minutes is not reasonable their, either.
>>>>>>> reset affects guest boot. This needs to complete in milliseconds.
>>>>>>>
>>>>>> I cannot promise, but with newer generation devices usually
>>>>>> functionality
>>>>> improves.
>>>>>> Enforcing in milliseconds doesn't look practical for type of devices.
>>>>>> Some of the block devices may need to establish TCP connections
>>>>>> in the
>>>>> backend.
>>>>>> It is more useful to wait for few more seconds to initialize
>>>>>> device after power
>>>>> on the system, instead of giving up booting the server completely.
>>>>>> For example, a nvme block device starts with a minimum timeout
>>>>>> of
>>>>> 500msec.
>>>>>> Yes, I agree to your point that a device given to a guest VM
>>>>>> will likely have
>>>>> very short reset time that should complete in milliseconds.
>>>>>>> This conflation is IMHO one of the problems with this proposal.
>>>>>> Device initialization consist of device reset from the spec section 3.1.1.
>>>>> It does. But maybe we need to create a way for driver to
>>>>> distinguish between the two. When under reset, use a driver supplied
>> timeout.
>>>> This make sense, because as we discussed when device undergo a reset
>>>> with active DMA, after timeout expires, driver still cannot cleanup.
>>>> So this can be short driver decided value as longer timeout is not useful.
>>>>
>>>>> When powering up, use a longer device supplied one.
>>>> In v0, v1 I initially considered only the powering up case of the
>>>> device initialization. There was text around that.
>>>> And v2 I removed the initialization text, and I totally missed the
>>>> above case with active DMA.
>>>> This should work.
>>>> We should word this part of the spec accordingly.
>>> Below changes are good for v3?
>>> 1. driver should use device reset time during initialization stage
>> How does driver identify this though?
> Existence of device_reset_timeout field in struct virtio_pci_common_cfg indicates that this field exists.
> If device support it, it will place non zero value and driver knows that this field should be used.
>
>>> 2. remove feature bit as feature bits are only readable after reset is
>>> completed 3. device reset timeout field of zero indicates that device doesn't
>> support it.
>>
>> I'm not sure about 3. I think each transport will need its own way to do it.
>>
> For pci a value of zero indicates it isn't supported.
> For mmio DeviceResetTimeout at offset 0x04c indicates same.
> Currently only these 2 transports have the use.
>
>> So I propose: maybe a capability like this, with a timeout field?
> Do you mean a new capability like say VIRTIO_PCI_DEVICE_TIMEOUT like VIRTIO_PCI_CAP_COMMON_CFG?
> This will contain one or more timeout? For example with his proposal it contains only device reset timeout.
> Later same capability will be further extended to contains command timeout too? Yes?
>
>> And within VMs, we can just do without, since it got out of reset once it will
>> surely get out of reset again...
> Yes, VM might not need it. It is really the HV's choice to implement and not part of the virtio spec.


Well, this will break the migration between HW virtio and SW virtio.

Thanks


> Our internal cloud passthrough a PF to the VM.
> It is probably better to let HV to choose if they want to do ctrl+c or have timeout.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  5:15                                             ` [virtio-dev] " Jason Wang
@ 2021-10-15  5:20                                               ` Parav Pandit
  2021-10-15  6:40                                                 ` Jason Wang
  2021-10-15  6:42                                                 ` Jason Wang
  0 siblings, 2 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-15  5:20 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer



> From: Jason Wang <jasowang@redhat.com>
> Sent: Friday, October 15, 2021 10:46 AM
> 
> 
> 在 2021/10/15 下午12:36, Parav Pandit 写道:
> >
> >> From: Michael S. Tsirkin <mst@redhat.com>
> >> Sent: Friday, October 15, 2021 3:59 AM
> >>
> >> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
> >>> Hi Michael, Cornelia,
> >>>
> >>>> From: Parav Pandit
> >>>> Sent: Tuesday, October 12, 2021 2:42 PM
> >>>>
> >>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>> Sent: Tuesday, October 12, 2021 2:32 PM
> >>>>>
> >>>>> On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
> >>>>>>
> >>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>> Sent: Monday, October 11, 2021 9:30 PM
> >>>>>>>
> >>>>>>> On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> >>>>>>>>>> This is unlikely to work the reset is completed. Because a
> >>>>>>>>>> real device
> >>>>>>>>> implementing this would prefer to do this in fw for 1000
> >>>>>>>>> virtio devices sitting on the physical card.
> >>>>>>>>>> And it is very much driven by such implementation at device
> >> devel.
> >>>>>>>>>> So it cannot update the counter value if reset is not
> >>>>>>>>>> completed for the
> >>>>>>> device.
> >>>>>>>>>> I think read only device reset timeout is most elegant option
> >>>>>>>>>> during device
> >>>>>>>>> initialization phase that eliminates infinite loop of today.
> >>>>>>>>>
> >>>>>>>>> Why can't a driver just go ahead and do a timeout regardless?
> >>>>>>>> o.k. lets consider this thought exercise. What is the timeout
> >>>>>>>> value that driver
> >>>>>>> will choose if device doesn't specify one?
> >>>>>>>> I explained in previous thread and you acked that actual fw
> >>>>>>>> based device
> >>>>>>> may take longer to initialize than pure sw implementation backend.
> >>>>>>>> In second example a pre-boot device can take even longer
> >>>>>>>> initialization
> >>>>> time.
> >>>>>>>> Sriov VF device may initialize lot faster.
> >>>>>>>> Instead of driver having such transport, and device specific
> >>>>>>>> checks, (or some
> >>>>>>> very short or very long timeout), we propose, that let device
> >>>>>>> mention such timeout value.
> >>>>>>>
> >>>>>>> Parav I think you are conflating reset with initialization time.
> >>>>>>> initialization is just for host boot which takes seconds anyway
> >>>>>>> - but no, minutes is not reasonable their, either.
> >>>>>>> reset affects guest boot. This needs to complete in milliseconds.
> >>>>>>>
> >>>>>> I cannot promise, but with newer generation devices usually
> >>>>>> functionality
> >>>>> improves.
> >>>>>> Enforcing in milliseconds doesn't look practical for type of devices.
> >>>>>> Some of the block devices may need to establish TCP connections
> >>>>>> in the
> >>>>> backend.
> >>>>>> It is more useful to wait for few more seconds to initialize
> >>>>>> device after power
> >>>>> on the system, instead of giving up booting the server completely.
> >>>>>> For example, a nvme block device starts with a minimum timeout of
> >>>>> 500msec.
> >>>>>> Yes, I agree to your point that a device given to a guest VM will
> >>>>>> likely have
> >>>>> very short reset time that should complete in milliseconds.
> >>>>>>> This conflation is IMHO one of the problems with this proposal.
> >>>>>> Device initialization consist of device reset from the spec section 3.1.1.
> >>>>> It does. But maybe we need to create a way for driver to
> >>>>> distinguish between the two. When under reset, use a driver
> >>>>> supplied
> >> timeout.
> >>>> This make sense, because as we discussed when device undergo a
> >>>> reset with active DMA, after timeout expires, driver still cannot cleanup.
> >>>> So this can be short driver decided value as longer timeout is not useful.
> >>>>
> >>>>> When powering up, use a longer device supplied one.
> >>>> In v0, v1 I initially considered only the powering up case of the
> >>>> device initialization. There was text around that.
> >>>> And v2 I removed the initialization text, and I totally missed the
> >>>> above case with active DMA.
> >>>> This should work.
> >>>> We should word this part of the spec accordingly.
> >>> Below changes are good for v3?
> >>> 1. driver should use device reset time during initialization stage
> >> How does driver identify this though?
> > Existence of device_reset_timeout field in struct virtio_pci_common_cfg
> indicates that this field exists.
> > If device support it, it will place non zero value and driver knows that this
> field should be used.
> >
> >>> 2. remove feature bit as feature bits are only readable after reset
> >>> is completed 3. device reset timeout field of zero indicates that
> >>> device doesn't
> >> support it.
> >>
> >> I'm not sure about 3. I think each transport will need its own way to do it.
> >>
> > For pci a value of zero indicates it isn't supported.
> > For mmio DeviceResetTimeout at offset 0x04c indicates same.
> > Currently only these 2 transports have the use.
> >
> >> So I propose: maybe a capability like this, with a timeout field?
> > Do you mean a new capability like say VIRTIO_PCI_DEVICE_TIMEOUT like
> VIRTIO_PCI_CAP_COMMON_CFG?
> > This will contain one or more timeout? For example with his proposal it
> contains only device reset timeout.
> > Later same capability will be further extended to contains command timeout
> too? Yes?
> >
> >> And within VMs, we can just do without, since it got out of reset
> >> once it will surely get out of reset again...
> > Yes, VM might not need it. It is really the HV's choice to implement and not
> part of the virtio spec.
> 
> 
> Well, this will break the migration between HW virtio and SW virtio.
> 
How does it break? Can you please explain?
SW virtio will emulate what HW virtio does. This field is exposing only read-only field to driver to not wait infinitely.
If src is hw and dst is sw, sw will likely have similar capabilities as hw, not just this particular one but many other. Isn't it?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  5:20                                               ` Parav Pandit
@ 2021-10-15  6:40                                                 ` Jason Wang
  2021-10-15  6:42                                                 ` Jason Wang
  1 sibling, 0 replies; 55+ messages in thread
From: Jason Wang @ 2021-10-15  6:40 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Cornelia Huck, virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer


在 2021/10/15 下午1:20, Parav Pandit 写道:
>
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Friday, October 15, 2021 10:46 AM
>>
>>
>> 在 2021/10/15 下午12:36, Parav Pandit 写道:
>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>> Sent: Friday, October 15, 2021 3:59 AM
>>>>
>>>> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
>>>>> Hi Michael, Cornelia,
>>>>>
>>>>>> From: Parav Pandit
>>>>>> Sent: Tuesday, October 12, 2021 2:42 PM
>>>>>>
>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Sent: Tuesday, October 12, 2021 2:32 PM
>>>>>>>
>>>>>>> On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>> Sent: Monday, October 11, 2021 9:30 PM
>>>>>>>>>
>>>>>>>>> On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
>>>>>>>>>>>> This is unlikely to work the reset is completed. Because a
>>>>>>>>>>>> real device
>>>>>>>>>>> implementing this would prefer to do this in fw for 1000
>>>>>>>>>>> virtio devices sitting on the physical card.
>>>>>>>>>>>> And it is very much driven by such implementation at device
>>>> devel.
>>>>>>>>>>>> So it cannot update the counter value if reset is not
>>>>>>>>>>>> completed for the
>>>>>>>>> device.
>>>>>>>>>>>> I think read only device reset timeout is most elegant option
>>>>>>>>>>>> during device
>>>>>>>>>>> initialization phase that eliminates infinite loop of today.
>>>>>>>>>>>
>>>>>>>>>>> Why can't a driver just go ahead and do a timeout regardless?
>>>>>>>>>> o.k. lets consider this thought exercise. What is the timeout
>>>>>>>>>> value that driver
>>>>>>>>> will choose if device doesn't specify one?
>>>>>>>>>> I explained in previous thread and you acked that actual fw
>>>>>>>>>> based device
>>>>>>>>> may take longer to initialize than pure sw implementation backend.
>>>>>>>>>> In second example a pre-boot device can take even longer
>>>>>>>>>> initialization
>>>>>>> time.
>>>>>>>>>> Sriov VF device may initialize lot faster.
>>>>>>>>>> Instead of driver having such transport, and device specific
>>>>>>>>>> checks, (or some
>>>>>>>>> very short or very long timeout), we propose, that let device
>>>>>>>>> mention such timeout value.
>>>>>>>>>
>>>>>>>>> Parav I think you are conflating reset with initialization time.
>>>>>>>>> initialization is just for host boot which takes seconds anyway
>>>>>>>>> - but no, minutes is not reasonable their, either.
>>>>>>>>> reset affects guest boot. This needs to complete in milliseconds.
>>>>>>>>>
>>>>>>>> I cannot promise, but with newer generation devices usually
>>>>>>>> functionality
>>>>>>> improves.
>>>>>>>> Enforcing in milliseconds doesn't look practical for type of devices.
>>>>>>>> Some of the block devices may need to establish TCP connections
>>>>>>>> in the
>>>>>>> backend.
>>>>>>>> It is more useful to wait for few more seconds to initialize
>>>>>>>> device after power
>>>>>>> on the system, instead of giving up booting the server completely.
>>>>>>>> For example, a nvme block device starts with a minimum timeout of
>>>>>>> 500msec.
>>>>>>>> Yes, I agree to your point that a device given to a guest VM will
>>>>>>>> likely have
>>>>>>> very short reset time that should complete in milliseconds.
>>>>>>>>> This conflation is IMHO one of the problems with this proposal.
>>>>>>>> Device initialization consist of device reset from the spec section 3.1.1.
>>>>>>> It does. But maybe we need to create a way for driver to
>>>>>>> distinguish between the two. When under reset, use a driver
>>>>>>> supplied
>>>> timeout.
>>>>>> This make sense, because as we discussed when device undergo a
>>>>>> reset with active DMA, after timeout expires, driver still cannot cleanup.
>>>>>> So this can be short driver decided value as longer timeout is not useful.
>>>>>>
>>>>>>> When powering up, use a longer device supplied one.
>>>>>> In v0, v1 I initially considered only the powering up case of the
>>>>>> device initialization. There was text around that.
>>>>>> And v2 I removed the initialization text, and I totally missed the
>>>>>> above case with active DMA.
>>>>>> This should work.
>>>>>> We should word this part of the spec accordingly.
>>>>> Below changes are good for v3?
>>>>> 1. driver should use device reset time during initialization stage
>>>> How does driver identify this though?
>>> Existence of device_reset_timeout field in struct virtio_pci_common_cfg
>> indicates that this field exists.
>>> If device support it, it will place non zero value and driver knows that this
>> field should be used.
>>>>> 2. remove feature bit as feature bits are only readable after reset
>>>>> is completed 3. device reset timeout field of zero indicates that
>>>>> device doesn't
>>>> support it.
>>>>
>>>> I'm not sure about 3. I think each transport will need its own way to do it.
>>>>
>>> For pci a value of zero indicates it isn't supported.
>>> For mmio DeviceResetTimeout at offset 0x04c indicates same.
>>> Currently only these 2 transports have the use.
>>>
>>>> So I propose: maybe a capability like this, with a timeout field?
>>> Do you mean a new capability like say VIRTIO_PCI_DEVICE_TIMEOUT like
>> VIRTIO_PCI_CAP_COMMON_CFG?
>>> This will contain one or more timeout? For example with his proposal it
>> contains only device reset timeout.
>>> Later same capability will be further extended to contains command timeout
>> too? Yes?
>>>> And within VMs, we can just do without, since it got out of reset
>>>> once it will surely get out of reset again...
>>> Yes, VM might not need it. It is really the HV's choice to implement and not
>> part of the virtio spec.
>>
>>
>> Well, this will break the migration between HW virtio and SW virtio.
>>
> How does it break? Can you please explain?
> SW virtio will emulate what HW virtio does. This field is exposing only read-only field to driver to not wait infinitely.


As discussed, can we do transport layer reset in this case?


> If src is hw and dst is sw, sw will likely have similar capabilities as hw, not just this particular one but many other. Isn't it?


But how about the case when src is software without those capabilities?

Thanks



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  5:20                                               ` Parav Pandit
  2021-10-15  6:40                                                 ` Jason Wang
@ 2021-10-15  6:42                                                 ` Jason Wang
  2021-10-15  6:48                                                   ` Parav Pandit
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Wang @ 2021-10-15  6:42 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer

On Fri, Oct 15, 2021 at 1:20 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Friday, October 15, 2021 10:46 AM
> >
> >
> > 在 2021/10/15 下午12:36, Parav Pandit 写道:
> > >
> > >> From: Michael S. Tsirkin <mst@redhat.com>
> > >> Sent: Friday, October 15, 2021 3:59 AM
> > >>
> > >> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
> > >>> Hi Michael, Cornelia,
> > >>>
> > >>>> From: Parav Pandit
> > >>>> Sent: Tuesday, October 12, 2021 2:42 PM
> > >>>>
> > >>>>> From: Michael S. Tsirkin <mst@redhat.com>
> > >>>>> Sent: Tuesday, October 12, 2021 2:32 PM
> > >>>>>
> > >>>>> On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
> > >>>>>>
> > >>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> > >>>>>>> Sent: Monday, October 11, 2021 9:30 PM
> > >>>>>>>
> > >>>>>>> On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> > >>>>>>>>>> This is unlikely to work the reset is completed. Because a
> > >>>>>>>>>> real device
> > >>>>>>>>> implementing this would prefer to do this in fw for 1000
> > >>>>>>>>> virtio devices sitting on the physical card.
> > >>>>>>>>>> And it is very much driven by such implementation at device
> > >> devel.
> > >>>>>>>>>> So it cannot update the counter value if reset is not
> > >>>>>>>>>> completed for the
> > >>>>>>> device.
> > >>>>>>>>>> I think read only device reset timeout is most elegant option
> > >>>>>>>>>> during device
> > >>>>>>>>> initialization phase that eliminates infinite loop of today.
> > >>>>>>>>>
> > >>>>>>>>> Why can't a driver just go ahead and do a timeout regardless?
> > >>>>>>>> o.k. lets consider this thought exercise. What is the timeout
> > >>>>>>>> value that driver
> > >>>>>>> will choose if device doesn't specify one?
> > >>>>>>>> I explained in previous thread and you acked that actual fw
> > >>>>>>>> based device
> > >>>>>>> may take longer to initialize than pure sw implementation backend.
> > >>>>>>>> In second example a pre-boot device can take even longer
> > >>>>>>>> initialization
> > >>>>> time.
> > >>>>>>>> Sriov VF device may initialize lot faster.
> > >>>>>>>> Instead of driver having such transport, and device specific
> > >>>>>>>> checks, (or some
> > >>>>>>> very short or very long timeout), we propose, that let device
> > >>>>>>> mention such timeout value.
> > >>>>>>>
> > >>>>>>> Parav I think you are conflating reset with initialization time.
> > >>>>>>> initialization is just for host boot which takes seconds anyway
> > >>>>>>> - but no, minutes is not reasonable their, either.
> > >>>>>>> reset affects guest boot. This needs to complete in milliseconds.
> > >>>>>>>
> > >>>>>> I cannot promise, but with newer generation devices usually
> > >>>>>> functionality
> > >>>>> improves.
> > >>>>>> Enforcing in milliseconds doesn't look practical for type of devices.
> > >>>>>> Some of the block devices may need to establish TCP connections
> > >>>>>> in the
> > >>>>> backend.
> > >>>>>> It is more useful to wait for few more seconds to initialize
> > >>>>>> device after power
> > >>>>> on the system, instead of giving up booting the server completely.
> > >>>>>> For example, a nvme block device starts with a minimum timeout of
> > >>>>> 500msec.
> > >>>>>> Yes, I agree to your point that a device given to a guest VM will
> > >>>>>> likely have
> > >>>>> very short reset time that should complete in milliseconds.
> > >>>>>>> This conflation is IMHO one of the problems with this proposal.
> > >>>>>> Device initialization consist of device reset from the spec section 3.1.1.
> > >>>>> It does. But maybe we need to create a way for driver to
> > >>>>> distinguish between the two. When under reset, use a driver
> > >>>>> supplied
> > >> timeout.
> > >>>> This make sense, because as we discussed when device undergo a
> > >>>> reset with active DMA, after timeout expires, driver still cannot cleanup.
> > >>>> So this can be short driver decided value as longer timeout is not useful.
> > >>>>
> > >>>>> When powering up, use a longer device supplied one.
> > >>>> In v0, v1 I initially considered only the powering up case of the
> > >>>> device initialization. There was text around that.
> > >>>> And v2 I removed the initialization text, and I totally missed the
> > >>>> above case with active DMA.
> > >>>> This should work.
> > >>>> We should word this part of the spec accordingly.
> > >>> Below changes are good for v3?
> > >>> 1. driver should use device reset time during initialization stage
> > >> How does driver identify this though?
> > > Existence of device_reset_timeout field in struct virtio_pci_common_cfg
> > indicates that this field exists.
> > > If device support it, it will place non zero value and driver knows that this
> > field should be used.
> > >
> > >>> 2. remove feature bit as feature bits are only readable after reset
> > >>> is completed 3. device reset timeout field of zero indicates that
> > >>> device doesn't
> > >> support it.
> > >>
> > >> I'm not sure about 3. I think each transport will need its own way to do it.
> > >>
> > > For pci a value of zero indicates it isn't supported.
> > > For mmio DeviceResetTimeout at offset 0x04c indicates same.
> > > Currently only these 2 transports have the use.
> > >
> > >> So I propose: maybe a capability like this, with a timeout field?
> > > Do you mean a new capability like say VIRTIO_PCI_DEVICE_TIMEOUT like
> > VIRTIO_PCI_CAP_COMMON_CFG?
> > > This will contain one or more timeout? For example with his proposal it
> > contains only device reset timeout.
> > > Later same capability will be further extended to contains command timeout
> > too? Yes?
> > >
> > >> And within VMs, we can just do without, since it got out of reset
> > >> once it will surely get out of reset again...
> > > Yes, VM might not need it. It is really the HV's choice to implement and not
> > part of the virtio spec.
> >
> >
> > Well, this will break the migration between HW virtio and SW virtio.
> >
> How does it break? Can you please explain?
> SW virtio will emulate what HW virtio does. This field is exposing only read-only field to driver to not wait infinitely.

As discussed previously, can we use transport level reset in this case?

> If src is hw and dst is sw, sw will likely have similar capabilities as hw, not just this particular one but many other. Isn't it?

The problem is the when src is software backend without this capability.

Thanks


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  6:42                                                 ` Jason Wang
@ 2021-10-15  6:48                                                   ` Parav Pandit
  2021-10-15  7:02                                                     ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-15  6:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer



> From: Jason Wang <jasowang@redhat.com>
> Sent: Friday, October 15, 2021 12:12 PM
> 
> On Fri, Oct 15, 2021 at 1:20 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Friday, October 15, 2021 10:46 AM
> > >
> > >
> > > 在 2021/10/15 下午12:36, Parav Pandit 写道:
> > > >
> > > >> From: Michael S. Tsirkin <mst@redhat.com>
> > > >> Sent: Friday, October 15, 2021 3:59 AM
> > > >>
> > > >> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
> > > >>> Hi Michael, Cornelia,
> > > >>>
> > > >>>> From: Parav Pandit
> > > >>>> Sent: Tuesday, October 12, 2021 2:42 PM
> > > >>>>
> > > >>>>> From: Michael S. Tsirkin <mst@redhat.com>
> > > >>>>> Sent: Tuesday, October 12, 2021 2:32 PM
> > > >>>>>
> > > >>>>> On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
> > > >>>>>>
> > > >>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> > > >>>>>>> Sent: Monday, October 11, 2021 9:30 PM
> > > >>>>>>>
> > > >>>>>>> On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> > > >>>>>>>>>> This is unlikely to work the reset is completed. Because
> > > >>>>>>>>>> a real device
> > > >>>>>>>>> implementing this would prefer to do this in fw for 1000
> > > >>>>>>>>> virtio devices sitting on the physical card.
> > > >>>>>>>>>> And it is very much driven by such implementation at
> > > >>>>>>>>>> device
> > > >> devel.
> > > >>>>>>>>>> So it cannot update the counter value if reset is not
> > > >>>>>>>>>> completed for the
> > > >>>>>>> device.
> > > >>>>>>>>>> I think read only device reset timeout is most elegant
> > > >>>>>>>>>> option during device
> > > >>>>>>>>> initialization phase that eliminates infinite loop of today.
> > > >>>>>>>>>
> > > >>>>>>>>> Why can't a driver just go ahead and do a timeout regardless?
> > > >>>>>>>> o.k. lets consider this thought exercise. What is the
> > > >>>>>>>> timeout value that driver
> > > >>>>>>> will choose if device doesn't specify one?
> > > >>>>>>>> I explained in previous thread and you acked that actual fw
> > > >>>>>>>> based device
> > > >>>>>>> may take longer to initialize than pure sw implementation
> backend.
> > > >>>>>>>> In second example a pre-boot device can take even longer
> > > >>>>>>>> initialization
> > > >>>>> time.
> > > >>>>>>>> Sriov VF device may initialize lot faster.
> > > >>>>>>>> Instead of driver having such transport, and device
> > > >>>>>>>> specific checks, (or some
> > > >>>>>>> very short or very long timeout), we propose, that let
> > > >>>>>>> device mention such timeout value.
> > > >>>>>>>
> > > >>>>>>> Parav I think you are conflating reset with initialization time.
> > > >>>>>>> initialization is just for host boot which takes seconds
> > > >>>>>>> anyway
> > > >>>>>>> - but no, minutes is not reasonable their, either.
> > > >>>>>>> reset affects guest boot. This needs to complete in milliseconds.
> > > >>>>>>>
> > > >>>>>> I cannot promise, but with newer generation devices usually
> > > >>>>>> functionality
> > > >>>>> improves.
> > > >>>>>> Enforcing in milliseconds doesn't look practical for type of devices.
> > > >>>>>> Some of the block devices may need to establish TCP
> > > >>>>>> connections in the
> > > >>>>> backend.
> > > >>>>>> It is more useful to wait for few more seconds to initialize
> > > >>>>>> device after power
> > > >>>>> on the system, instead of giving up booting the server completely.
> > > >>>>>> For example, a nvme block device starts with a minimum
> > > >>>>>> timeout of
> > > >>>>> 500msec.
> > > >>>>>> Yes, I agree to your point that a device given to a guest VM
> > > >>>>>> will likely have
> > > >>>>> very short reset time that should complete in milliseconds.
> > > >>>>>>> This conflation is IMHO one of the problems with this proposal.
> > > >>>>>> Device initialization consist of device reset from the spec section
> 3.1.1.
> > > >>>>> It does. But maybe we need to create a way for driver to
> > > >>>>> distinguish between the two. When under reset, use a driver
> > > >>>>> supplied
> > > >> timeout.
> > > >>>> This make sense, because as we discussed when device undergo a
> > > >>>> reset with active DMA, after timeout expires, driver still cannot
> cleanup.
> > > >>>> So this can be short driver decided value as longer timeout is not
> useful.
> > > >>>>
> > > >>>>> When powering up, use a longer device supplied one.
> > > >>>> In v0, v1 I initially considered only the powering up case of
> > > >>>> the device initialization. There was text around that.
> > > >>>> And v2 I removed the initialization text, and I totally missed
> > > >>>> the above case with active DMA.
> > > >>>> This should work.
> > > >>>> We should word this part of the spec accordingly.
> > > >>> Below changes are good for v3?
> > > >>> 1. driver should use device reset time during initialization
> > > >>> stage
> > > >> How does driver identify this though?
> > > > Existence of device_reset_timeout field in struct
> > > > virtio_pci_common_cfg
> > > indicates that this field exists.
> > > > If device support it, it will place non zero value and driver
> > > > knows that this
> > > field should be used.
> > > >
> > > >>> 2. remove feature bit as feature bits are only readable after
> > > >>> reset is completed 3. device reset timeout field of zero
> > > >>> indicates that device doesn't
> > > >> support it.
> > > >>
> > > >> I'm not sure about 3. I think each transport will need its own way to do it.
> > > >>
> > > > For pci a value of zero indicates it isn't supported.
> > > > For mmio DeviceResetTimeout at offset 0x04c indicates same.
> > > > Currently only these 2 transports have the use.
> > > >
> > > >> So I propose: maybe a capability like this, with a timeout field?
> > > > Do you mean a new capability like say VIRTIO_PCI_DEVICE_TIMEOUT
> > > > like
> > > VIRTIO_PCI_CAP_COMMON_CFG?
> > > > This will contain one or more timeout? For example with his
> > > > proposal it
> > > contains only device reset timeout.
> > > > Later same capability will be further extended to contains command
> > > > timeout
> > > too? Yes?
> > > >
> > > >> And within VMs, we can just do without, since it got out of reset
> > > >> once it will surely get out of reset again...
> > > > Yes, VM might not need it. It is really the HV's choice to
> > > > implement and not
> > > part of the virtio spec.
> > >
> > >
> > > Well, this will break the migration between HW virtio and SW virtio.
> > >
> > How does it break? Can you please explain?
> > SW virtio will emulate what HW virtio does. This field is exposing only read-
> only field to driver to not wait infinitely.
> 
> As discussed previously, can we use transport level reset in this case?
> 
Do you have pointer to it? I do not understand transport level reset.
If you are asking can PCI reset can be used to reset the device? Yes.
But here what we are talking about is, when virtio layer issue the device reset, how long should it wait for that reset to complete.
Typical example is virtio driver loading on physical virtio device agnostic of the transport and waiting for the device to come out of reset.

> > If src is hw and dst is sw, sw will likely have similar capabilities as hw, not just
> this particular one but many other. Isn't it?
> 
> The problem is the when src is software backend without this capability.

It isn’t any different than any other RW field of the device.
For example sw did virtio pci device emulation with 30 msix vectors and hw has only 2.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  4:36                                           ` Parav Pandit
  2021-10-15  5:15                                             ` [virtio-dev] " Jason Wang
@ 2021-10-15  6:51                                             ` Cornelia Huck
  2021-10-15  8:09                                               ` Parav Pandit
  1 sibling, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2021-10-15  6:51 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Fri, Oct 15 2021, Parav Pandit <parav@nvidia.com> wrote:

>> From: Michael S. Tsirkin <mst@redhat.com>
>> Sent: Friday, October 15, 2021 3:59 AM
>> 
>> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
>> > Below changes are good for v3?
>> > 1. driver should use device reset time during initialization stage
>> 
>> How does driver identify this though?
> Existence of device_reset_timeout field in struct virtio_pci_common_cfg indicates that this field exists.
> If device support it, it will place non zero value and driver knows that this field should be used.

But how does the driver know? The very first step in the initialization
sequence is "reset the device", before it may read the config space.

>
>> 
>> > 2. remove feature bit as feature bits are only readable after reset is
>> > completed 3. device reset timeout field of zero indicates that device doesn't
>> support it.
>> 
>> I'm not sure about 3. I think each transport will need its own way to do it.
>> 
> For pci a value of zero indicates it isn't supported.
> For mmio DeviceResetTimeout at offset 0x04c indicates same.
> Currently only these 2 transports have the use.

I don't think we want to force all transports, including not yet
existing ones, to use that mechanism. They might as well use a command
to retrieve the value and fail the command, for example.

>
>> So I propose: maybe a capability like this, with a timeout field?
> Do you mean a new capability like say VIRTIO_PCI_DEVICE_TIMEOUT like VIRTIO_PCI_CAP_COMMON_CFG?
> This will contain one or more timeout? For example with his proposal it contains only device reset timeout.
> Later same capability will be further extended to contains command timeout too? Yes?
>
>> And within VMs, we can just do without, since it got out of reset once it will
>> surely get out of reset again...
> Yes, VM might not need it. It is really the HV's choice to implement and not part of the virtio spec.
> Our internal cloud passthrough a PF to the VM.
> It is probably better to let HV to choose if they want to do ctrl+c or have timeout.

But you want it in the spec, don't you? Confused.


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  6:48                                                   ` Parav Pandit
@ 2021-10-15  7:02                                                     ` Jason Wang
  2021-10-15  8:21                                                       ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2021-10-15  7:02 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer

On Fri, Oct 15, 2021 at 2:48 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Friday, October 15, 2021 12:12 PM
> >
> > On Fri, Oct 15, 2021 at 1:20 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Friday, October 15, 2021 10:46 AM
> > > >
> > > >
> > > > 在 2021/10/15 下午12:36, Parav Pandit 写道:
> > > > >
> > > > >> From: Michael S. Tsirkin <mst@redhat.com>
> > > > >> Sent: Friday, October 15, 2021 3:59 AM
> > > > >>
> > > > >> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
> > > > >>> Hi Michael, Cornelia,
> > > > >>>
> > > > >>>> From: Parav Pandit
> > > > >>>> Sent: Tuesday, October 12, 2021 2:42 PM
> > > > >>>>
> > > > >>>>> From: Michael S. Tsirkin <mst@redhat.com>
> > > > >>>>> Sent: Tuesday, October 12, 2021 2:32 PM
> > > > >>>>>
> > > > >>>>> On Tue, Oct 12, 2021 at 08:51:34AM +0000, Parav Pandit wrote:
> > > > >>>>>>
> > > > >>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> > > > >>>>>>> Sent: Monday, October 11, 2021 9:30 PM
> > > > >>>>>>>
> > > > >>>>>>> On Mon, Oct 11, 2021 at 03:44:14PM +0000, Parav Pandit wrote:
> > > > >>>>>>>>>> This is unlikely to work the reset is completed. Because
> > > > >>>>>>>>>> a real device
> > > > >>>>>>>>> implementing this would prefer to do this in fw for 1000
> > > > >>>>>>>>> virtio devices sitting on the physical card.
> > > > >>>>>>>>>> And it is very much driven by such implementation at
> > > > >>>>>>>>>> device
> > > > >> devel.
> > > > >>>>>>>>>> So it cannot update the counter value if reset is not
> > > > >>>>>>>>>> completed for the
> > > > >>>>>>> device.
> > > > >>>>>>>>>> I think read only device reset timeout is most elegant
> > > > >>>>>>>>>> option during device
> > > > >>>>>>>>> initialization phase that eliminates infinite loop of today.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Why can't a driver just go ahead and do a timeout regardless?
> > > > >>>>>>>> o.k. lets consider this thought exercise. What is the
> > > > >>>>>>>> timeout value that driver
> > > > >>>>>>> will choose if device doesn't specify one?
> > > > >>>>>>>> I explained in previous thread and you acked that actual fw
> > > > >>>>>>>> based device
> > > > >>>>>>> may take longer to initialize than pure sw implementation
> > backend.
> > > > >>>>>>>> In second example a pre-boot device can take even longer
> > > > >>>>>>>> initialization
> > > > >>>>> time.
> > > > >>>>>>>> Sriov VF device may initialize lot faster.
> > > > >>>>>>>> Instead of driver having such transport, and device
> > > > >>>>>>>> specific checks, (or some
> > > > >>>>>>> very short or very long timeout), we propose, that let
> > > > >>>>>>> device mention such timeout value.
> > > > >>>>>>>
> > > > >>>>>>> Parav I think you are conflating reset with initialization time.
> > > > >>>>>>> initialization is just for host boot which takes seconds
> > > > >>>>>>> anyway
> > > > >>>>>>> - but no, minutes is not reasonable their, either.
> > > > >>>>>>> reset affects guest boot. This needs to complete in milliseconds.
> > > > >>>>>>>
> > > > >>>>>> I cannot promise, but with newer generation devices usually
> > > > >>>>>> functionality
> > > > >>>>> improves.
> > > > >>>>>> Enforcing in milliseconds doesn't look practical for type of devices.
> > > > >>>>>> Some of the block devices may need to establish TCP
> > > > >>>>>> connections in the
> > > > >>>>> backend.
> > > > >>>>>> It is more useful to wait for few more seconds to initialize
> > > > >>>>>> device after power
> > > > >>>>> on the system, instead of giving up booting the server completely.
> > > > >>>>>> For example, a nvme block device starts with a minimum
> > > > >>>>>> timeout of
> > > > >>>>> 500msec.
> > > > >>>>>> Yes, I agree to your point that a device given to a guest VM
> > > > >>>>>> will likely have
> > > > >>>>> very short reset time that should complete in milliseconds.
> > > > >>>>>>> This conflation is IMHO one of the problems with this proposal.
> > > > >>>>>> Device initialization consist of device reset from the spec section
> > 3.1.1.
> > > > >>>>> It does. But maybe we need to create a way for driver to
> > > > >>>>> distinguish between the two. When under reset, use a driver
> > > > >>>>> supplied
> > > > >> timeout.
> > > > >>>> This make sense, because as we discussed when device undergo a
> > > > >>>> reset with active DMA, after timeout expires, driver still cannot
> > cleanup.
> > > > >>>> So this can be short driver decided value as longer timeout is not
> > useful.
> > > > >>>>
> > > > >>>>> When powering up, use a longer device supplied one.
> > > > >>>> In v0, v1 I initially considered only the powering up case of
> > > > >>>> the device initialization. There was text around that.
> > > > >>>> And v2 I removed the initialization text, and I totally missed
> > > > >>>> the above case with active DMA.
> > > > >>>> This should work.
> > > > >>>> We should word this part of the spec accordingly.
> > > > >>> Below changes are good for v3?
> > > > >>> 1. driver should use device reset time during initialization
> > > > >>> stage
> > > > >> How does driver identify this though?
> > > > > Existence of device_reset_timeout field in struct
> > > > > virtio_pci_common_cfg
> > > > indicates that this field exists.
> > > > > If device support it, it will place non zero value and driver
> > > > > knows that this
> > > > field should be used.
> > > > >
> > > > >>> 2. remove feature bit as feature bits are only readable after
> > > > >>> reset is completed 3. device reset timeout field of zero
> > > > >>> indicates that device doesn't
> > > > >> support it.
> > > > >>
> > > > >> I'm not sure about 3. I think each transport will need its own way to do it.
> > > > >>
> > > > > For pci a value of zero indicates it isn't supported.
> > > > > For mmio DeviceResetTimeout at offset 0x04c indicates same.
> > > > > Currently only these 2 transports have the use.
> > > > >
> > > > >> So I propose: maybe a capability like this, with a timeout field?
> > > > > Do you mean a new capability like say VIRTIO_PCI_DEVICE_TIMEOUT
> > > > > like
> > > > VIRTIO_PCI_CAP_COMMON_CFG?
> > > > > This will contain one or more timeout? For example with his
> > > > > proposal it
> > > > contains only device reset timeout.
> > > > > Later same capability will be further extended to contains command
> > > > > timeout
> > > > too? Yes?
> > > > >
> > > > >> And within VMs, we can just do without, since it got out of reset
> > > > >> once it will surely get out of reset again...
> > > > > Yes, VM might not need it. It is really the HV's choice to
> > > > > implement and not
> > > > part of the virtio spec.
> > > >
> > > >
> > > > Well, this will break the migration between HW virtio and SW virtio.
> > > >
> > > How does it break? Can you please explain?
> > > SW virtio will emulate what HW virtio does. This field is exposing only read-
> > only field to driver to not wait infinitely.
> >
> > As discussed previously, can we use transport level reset in this case?
> >
> Do you have pointer to it? I do not understand transport level reset.

https://lore.kernel.org/virtualization/7ebb9ba0-69a0-2279-9b9e-60c50db06e94@redhat.com/

> If you are asking can PCI reset can be used to reset the device? Yes.

Yes, I meant PCI reset.

> But here what we are talking about is, when virtio layer issue the device reset, how long should it wait for that reset to complete.

A question is that what do we expect the driver to do if there's a
timeout on the device reset?

> Typical example is virtio driver loading on physical virtio device agnostic of the transport and waiting for the device to come out of reset.

I'm not sure how hard we can meet a device/transport agnostic timeout.
E.g how can we know the timeout satisfy for the requirement of all the
transport?

>
> > > If src is hw and dst is sw, sw will likely have similar capabilities as hw, not just
> > this particular one but many other. Isn't it?
> >
> > The problem is the when src is software backend without this capability.
>
> It isn’t any different than any other RW field of the device.
> For example sw did virtio pci device emulation with 30 msix vectors and hw has only 2.

In the case of MSI-X, the migration can't be done directly from src to
dest. The only choice is some kind of software mediation.

In the case of timeout, the mediation won't even help, since we don't
even present the timeout to guest. Even if the hypervisor can see the
timeout interface.

Thanks


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-15  6:51                                             ` Cornelia Huck
@ 2021-10-15  8:09                                               ` Parav Pandit
  2021-10-15  9:25                                                 ` [virtio-dev] " Cornelia Huck
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-15  8:09 UTC (permalink / raw)
  To: Cornelia Huck, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer



> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Friday, October 15, 2021 12:22 PM
> 
> On Fri, Oct 15 2021, Parav Pandit <parav@nvidia.com> wrote:
> 
> >> From: Michael S. Tsirkin <mst@redhat.com>
> >> Sent: Friday, October 15, 2021 3:59 AM
> >>
> >> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
> >> > Below changes are good for v3?
> >> > 1. driver should use device reset time during initialization stage
> >>
> >> How does driver identify this though?
> > Existence of device_reset_timeout field in struct virtio_pci_common_cfg
> indicates that this field exists.
> > If device support it, it will place non zero value and driver knows that this
> field should be used.
> 
> But how does the driver know? The very first step in the initialization sequence
> is "reset the device", before it may read the config space.
PCI configuration space is always accessible when a PCI device exists at PCI level even before virtio layer can operate it.
So struct virtio_pci_common_cfg is accessible to the virtio transport layer before resetting the device.
It needs to contains what the timeout is, just like reset of the fields.

I made mistake in understanding last time between virtio config space such as (struct virtio_net_config) vs PCI config space.
So want to double check, did you mean pci config space and struct virtio_pci_common_cfg or something else when you mentioned _config_space_?
Virtio_net_config certainly accessible only after reset.
But transport level pci capabilities are accessible before reset by the PCI spec.

> 
> >
> >>
> >> > 2. remove feature bit as feature bits are only readable after reset
> >> > is completed 3. device reset timeout field of zero indicates that
> >> > device doesn't
> >> support it.
> >>
> >> I'm not sure about 3. I think each transport will need its own way to do it.
> >>
> > For pci a value of zero indicates it isn't supported.
> > For mmio DeviceResetTimeout at offset 0x04c indicates same.
> > Currently only these 2 transports have the use.
> 
> I don't think we want to force all transports, including not yet existing ones, to
> use that mechanism. They might as well use a command to retrieve the value
> and fail the command, for example.
> 
ok. it is interesting for me to understand that virtio spec care for even an undefined transport.

So when such undefined transport gets defined in virtio spec, a spec will get updated anyway to describes that new_transport performs reset via cmd interface.

Though I do not well understand how driver can reset a device via a command without establishing a command channel.
A device usually needs to respond/accept the command channel bits to accept the command.
But there may be some transport that behave differently that I didn't anticipate yet.

So let's define such transport specific changes when such transport will be done for real in virtio spec.

> >
> >> So I propose: maybe a capability like this, with a timeout field?
> > Do you mean a new capability like say VIRTIO_PCI_DEVICE_TIMEOUT like
> VIRTIO_PCI_CAP_COMMON_CFG?
> > This will contain one or more timeout? For example with his proposal it
> contains only device reset timeout.
> > Later same capability will be further extended to contains command timeout
> too? Yes?
> >
> >> And within VMs, we can just do without, since it got out of reset
> >> once it will surely get out of reset again...
> > Yes, VM might not need it. It is really the HV's choice to implement and not
> part of the virtio spec.
> > Our internal cloud passthrough a PF to the VM.
> > It is probably better to let HV to choose if they want to do ctrl+c or have
> timeout.
> 
> But you want it in the spec, don't you? Confused.
It is a SHOULD area of the proposal that recommends to honor the timeout honored.
What I meant to say, that I agree to Michaels point that "VM can do without timeout" -> that a HV can build a device or pass a device to the VM without device reset timeout functionality.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  7:02                                                     ` Jason Wang
@ 2021-10-15  8:21                                                       ` Parav Pandit
  2021-10-15  8:42                                                         ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-15  8:21 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer


> From: Jason Wang <jasowang@redhat.com>
> Sent: Friday, October 15, 2021 12:33 PM
> > >
> > Do you have pointer to it? I do not understand transport level reset.
> 
> https://lore.kernel.org/virtualization/7ebb9ba0-69a0-2279-9b9e-
> 60c50db06e94@redhat.com/
> 
I read it. PCI level reset to recover a faulty device is useful.
But we are not talking about resetting a PCI level here.

It is very straight forward. A virtio level is reseting the device and infinitely waiting for it to come up.
Instead of doing so, device is providing hint to let it wait for the time driver should.

> > If you are asking can PCI reset can be used to reset the device? Yes.
> 
> Yes, I meant PCI reset.
> 
> > But here what we are talking about is, when virtio layer issue the device
> reset, how long should it wait for that reset to complete.
> 
> A question is that what do we expect the driver to do if there's a timeout on the
> device reset?
Oh ok. I understand your question now.

As described in this proposal, when timeout expires during initialization probe time, this device cannot be operated by the virtio layer, 
So it simply aborts further initialization sequence of the spec in section 3.1.1 after step_1.

Device probe has failed. So HV has several options to act on it.
For example, do device specific debug, or do transport specific reset, recreate the device.

> 
> > Typical example is virtio driver loading on physical virtio device agnostic of
> the transport and waiting for the device to come out of reset.
> 
> I'm not sure how hard we can meet a device/transport agnostic timeout.
> E.g how can we know the timeout satisfy for the requirement of all the
> transport?
> 
> >
> > > > If src is hw and dst is sw, sw will likely have similar
> > > > capabilities as hw, not just
> > > this particular one but many other. Isn't it?
> > >
> > > The problem is the when src is software backend without this capability.
> >
> > It isn’t any different than any other RW field of the device.
> > For example sw did virtio pci device emulation with 30 msix vectors and hw
> has only 2.
> 
> In the case of MSI-X, the migration can't be done directly from src to dest. The
> only choice is some kind of software mediation.
> 
> In the case of timeout, the mediation won't even help, since we don't even
> present the timeout to guest. Even if the hypervisor can see the timeout
> interface.
Since sw didn’t presented the timeout to the guest, it will find a HW device with similar capability (without timeout).
Or may be it can find a device that already passed the device reset state and attach such device.
HV has its options.

Still not any different than rest of the RW capabilities/registers.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  8:21                                                       ` Parav Pandit
@ 2021-10-15  8:42                                                         ` Jason Wang
  2021-10-22  7:20                                                           ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2021-10-15  8:42 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer

On Fri, Oct 15, 2021 at 4:21 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Friday, October 15, 2021 12:33 PM
> > > >
> > > Do you have pointer to it? I do not understand transport level reset.
> >
> > https://lore.kernel.org/virtualization/7ebb9ba0-69a0-2279-9b9e-
> > 60c50db06e94@redhat.com/
> >
> I read it. PCI level reset to recover a faulty device is useful.
> But we are not talking about resetting a PCI level here.
>
> It is very straight forward. A virtio level is reseting the device and infinitely waiting for it to come up.
> Instead of doing so, device is providing hint to let it wait for the time driver should.

See the question below. But what I want to say is something like:

1) transport driver may have a transport specific timeout
2) if the transport driver meet that timeout during virtio reset,
perform the transport specific reset

It looks to me it will guarantee the success of the reset?

>
> > > If you are asking can PCI reset can be used to reset the device? Yes.
> >
> > Yes, I meant PCI reset.
> >
> > > But here what we are talking about is, when virtio layer issue the device
> > reset, how long should it wait for that reset to complete.
> >
> > A question is that what do we expect the driver to do if there's a timeout on the
> > device reset?
> Oh ok. I understand your question now.
>
> As described in this proposal, when timeout expires during initialization probe time, this device cannot be operated by the virtio layer,
> So it simply aborts further initialization sequence of the spec in section 3.1.1 after step_1.
>
> Device probe has failed. So HV has several options to act on it.
> For example, do device specific debug, or do transport specific reset, recreate the device.

But how about the device removal?

>
> >
> > > Typical example is virtio driver loading on physical virtio device agnostic of
> > the transport and waiting for the device to come out of reset.
> >
> > I'm not sure how hard we can meet a device/transport agnostic timeout.
> > E.g how can we know the timeout satisfy for the requirement of all the
> > transport?
> >
> > >
> > > > > If src is hw and dst is sw, sw will likely have similar
> > > > > capabilities as hw, not just
> > > > this particular one but many other. Isn't it?
> > > >
> > > > The problem is the when src is software backend without this capability.
> > >
> > > It isn’t any different than any other RW field of the device.
> > > For example sw did virtio pci device emulation with 30 msix vectors and hw
> > has only 2.
> >
> > In the case of MSI-X, the migration can't be done directly from src to dest. The
> > only choice is some kind of software mediation.
> >
> > In the case of timeout, the mediation won't even help, since we don't even
> > present the timeout to guest. Even if the hypervisor can see the timeout
> > interface.
> Since sw didn’t presented the timeout to the guest, it will find a HW device with similar capability (without timeout).
> Or may be it can find a device that already passed the device reset state and attach such device.
> HV has its options.
>
> Still not any different than rest of the RW capabilities/registers.

So the timeout doesn't work if we can enforce that. Seems sub optimal
than doing a transport reset if possible.

Thanks


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  8:09                                               ` Parav Pandit
@ 2021-10-15  9:25                                                 ` Cornelia Huck
  2021-10-22  6:29                                                   ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2021-10-15  9:25 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

On Fri, Oct 15 2021, Parav Pandit <parav@nvidia.com> wrote:

>> From: Cornelia Huck <cohuck@redhat.com>
>> Sent: Friday, October 15, 2021 12:22 PM
>> 
>> On Fri, Oct 15 2021, Parav Pandit <parav@nvidia.com> wrote:
>> 
>> >> From: Michael S. Tsirkin <mst@redhat.com>
>> >> Sent: Friday, October 15, 2021 3:59 AM
>> >>
>> >> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
>> >> > Below changes are good for v3?
>> >> > 1. driver should use device reset time during initialization stage
>> >>
>> >> How does driver identify this though?
>> > Existence of device_reset_timeout field in struct virtio_pci_common_cfg
>> indicates that this field exists.
>> > If device support it, it will place non zero value and driver knows that this
>> field should be used.
>> 
>> But how does the driver know? The very first step in the initialization sequence
>> is "reset the device", before it may read the config space.
> PCI configuration space is always accessible when a PCI device exists at PCI level even before virtio layer can operate it.
> So struct virtio_pci_common_cfg is accessible to the virtio transport layer before resetting the device.
> It needs to contains what the timeout is, just like reset of the fields.
>
> I made mistake in understanding last time between virtio config space such as (struct virtio_net_config) vs PCI config space.
> So want to double check, did you mean pci config space and struct virtio_pci_common_cfg or something else when you mentioned _config_space_?
> Virtio_net_config certainly accessible only after reset.
> But transport level pci capabilities are accessible before reset by the PCI spec.

So, how does that work for non-pci? Can mmio access its config space
like that?

>
>> 
>> >
>> >>
>> >> > 2. remove feature bit as feature bits are only readable after reset
>> >> > is completed 3. device reset timeout field of zero indicates that
>> >> > device doesn't
>> >> support it.
>> >>
>> >> I'm not sure about 3. I think each transport will need its own way to do it.
>> >>
>> > For pci a value of zero indicates it isn't supported.
>> > For mmio DeviceResetTimeout at offset 0x04c indicates same.
>> > Currently only these 2 transports have the use.
>> 
>> I don't think we want to force all transports, including not yet existing ones, to
>> use that mechanism. They might as well use a command to retrieve the value
>> and fail the command, for example.
>> 
> ok. it is interesting for me to understand that virtio spec care for even an undefined transport.
>
> So when such undefined transport gets defined in virtio spec, a spec will get updated anyway to describes that new_transport performs reset via cmd interface.
>
> Though I do not well understand how driver can reset a device via a command without establishing a command channel.
> A device usually needs to respond/accept the command channel bits to accept the command.

[As a side note: s390 channel I/O works just like that. You have the
channel subsystem (and the control units) take care of the mechanism, no
device involvement required until the command is actually processed, and
even that is not needed for all commands.]

> But there may be some transport that behave differently that I didn't anticipate yet.
>
> So let's define such transport specific changes when such transport will be done for real in virtio spec.

My point is: don't make it harder for future transports than really
neccessary. If you look at today's transports, ccw is already
substantially different from pci in many ways, and yet both are
functional virtio transports. Take the concept of "config space": ccw
devices do not have such a thing. I created two channel commands to
basically emulate that concept for *virtio* *only*. Should we want to
introduce the timeout value for ccw (which is not likely, but let's go
for it for the sake of the argument), we cannot put it into a
non-existing config space; it needs to use something else, like a
command to retrieve the value. A future transport might be in a similar
position, and "put it into the config space" may simply not be possible.

Look at it that way: You come from the PCI perspective, and you have an
idea how to model things there. I come from a channel I/O background,
and I have an idea how to model things there. Both are quite different;
if we can come up with something that can work for both transports,
chances are good that a future transport can also work with it. If we
say that the driver retrieves that information in a transport-specific
way without any mention of config spaces, we should be fine.


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v2] Add device reset timeout field
  2021-10-15  9:25                                                 ` [virtio-dev] " Cornelia Huck
@ 2021-10-22  6:29                                                   ` Parav Pandit
  0 siblings, 0 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-22  6:29 UTC (permalink / raw)
  To: Cornelia Huck, Michael S. Tsirkin
  Cc: virtio-dev, Max Gurtovoy, Shahaf Shuler, Oren Duer

Hi Cornelia,

I wasn't able to respond last few days.
Sorry for the delayed response below.

> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Friday, October 15, 2021 2:56 PM

> On Fri, Oct 15 2021, Parav Pandit <parav@nvidia.com> wrote:
> 
> >> From: Cornelia Huck <cohuck@redhat.com>
> >> Sent: Friday, October 15, 2021 12:22 PM
> >>
> >> On Fri, Oct 15 2021, Parav Pandit <parav@nvidia.com> wrote:
> >>
> >> >> From: Michael S. Tsirkin <mst@redhat.com>
> >> >> Sent: Friday, October 15, 2021 3:59 AM
> >> >>
> >> >> On Thu, Oct 14, 2021 at 05:35:37PM +0000, Parav Pandit wrote:
> >> >> > Below changes are good for v3?
> >> >> > 1. driver should use device reset time during initialization
> >> >> > stage
> >> >>
> >> >> How does driver identify this though?
> >> > Existence of device_reset_timeout field in struct
> >> > virtio_pci_common_cfg
> >> indicates that this field exists.
> >> > If device support it, it will place non zero value and driver knows
> >> > that this
> >> field should be used.
> >>
> >> But how does the driver know? The very first step in the
> >> initialization sequence is "reset the device", before it may read the config
> space.
> > PCI configuration space is always accessible when a PCI device exists at PCI
> level even before virtio layer can operate it.
> > So struct virtio_pci_common_cfg is accessible to the virtio transport layer
> before resetting the device.
> > It needs to contains what the timeout is, just like reset of the fields.
> >
> > I made mistake in understanding last time between virtio config space such
> as (struct virtio_net_config) vs PCI config space.
> > So want to double check, did you mean pci config space and struct
> virtio_pci_common_cfg or something else when you mentioned
> _config_space_?
> > Virtio_net_config certainly accessible only after reset.
> > But transport level pci capabilities are accessible before reset by the PCI
> spec.
> 
> So, how does that work for non-pci? Can mmio access its config space like
> that?
Yes. MMIO device initialization starts in transport layer.
This device init starts by reading the magicvalue and version registers from the MMIO config space.
After this is done, it follows the device init flow described in the spec section 4.2.3.1.1.

> 
> My point is: don't make it harder for future transports than really neccessary. If
> you look at today's transports, ccw is already substantially different from pci in
> many ways, and yet both are functional virtio transports. Take the concept of
> "config space": ccw devices do not have such a thing. I created two channel
> commands to basically emulate that concept for *virtio* *only*. Should we
> want to introduce the timeout value for ccw (which is not likely, but let's go for
> it for the sake of the argument), we cannot put it into a non-existing config
> space; it needs to use something else, like a command to retrieve the value. A
> future transport might be in a similar position, and "put it into the config
> space" may simply not be possible.
> 
> Look at it that way: You come from the PCI perspective, and you have an idea
> how to model things there. I come from a channel I/O background, and I have
> an idea how to model things there. Both are quite different; if we can come up
> with something that can work for both transports, chances are good that a
> future transport can also work with it. 

In this proposal, for pci and mmio transports, the new timeout register is located in the transport specific device config space.
This is because these two transports initialization is based on this config space.
Both the transport clearly describe that they follow the spec 3.1 device initialization flow.

For ccw as you well described above and as part of the spec there is no link to spec 3.1 device initialization flow.

> If we say that the driver retrieves that
> information in a transport-specific way without any mention of config spaces,
> we should be fine.
Yes. I totally agree with you.
Its transport specific way.
In the current proposal, in the device reset timeout description section 2.4.1, 2.4.2, there is no mention of config space.
Each transport (pci, mmio) extends their own config space and describes what is the transport specific extension it is to interact with device.

So if new non config space based transport or ccw wants to do this in future, they will use their transport specific way (as you described channel command interface).

I didn't muddy the device reset description for ccw exceptions because for rest the device reset piece, it doesn't have any link to ccw differences.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-15  8:42                                                         ` Jason Wang
@ 2021-10-22  7:20                                                           ` Parav Pandit
  2021-10-25  5:41                                                             ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-22  7:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer



> From: Jason Wang <jasowang@redhat.com>
> Sent: Friday, October 15, 2021 2:12 PM
> 
> On Fri, Oct 15, 2021 at 4:21 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Friday, October 15, 2021 12:33 PM
> > > > >
> > > > Do you have pointer to it? I do not understand transport level reset.
> > >
> > > https://lore.kernel.org/virtualization/7ebb9ba0-69a0-2279-9b9e-
> > > 60c50db06e94@redhat.com/
> > >
> > I read it. PCI level reset to recover a faulty device is useful.
> > But we are not talking about resetting a PCI level here.
> >
> > It is very straight forward. A virtio level is reseting the device and infinitely
> waiting for it to come up.
> > Instead of doing so, device is providing hint to let it wait for the time driver
> should.
> 
> See the question below. But what I want to say is something like:
> 
> 1) transport driver may have a transport specific timeout
Yes. virtio PCI transport has transport specific timeout provided by the PCI device in this proposal.

> 2) if the transport driver meet that timeout during virtio reset, perform the
> transport specific reset
>
This is the implementation choice when a device specified timeout expires, what to do.
Should system abort initializing the device further?
Should qemu user do ctrl+c?
Should it retry device init at virtio level?
Should it call transport (in this case PCI FLR) to reset the device?
This depends on where virtio device is used.
This proposal doesn't define what a system should do when reset timer expires. It out of scope and implementation choice based on the env.

> It looks to me it will guarantee the success of the reset?

You are describing the timeout expiration handling.
Yes, an implementation can choose to perform transport level reset when timeout expires.

> 
> >
> > > > If you are asking can PCI reset can be used to reset the device? Yes.
> > >
> > > Yes, I meant PCI reset.
> > >
> > > > But here what we are talking about is, when virtio layer issue the
> > > > device
> > > reset, how long should it wait for that reset to complete.
> > >
> > > A question is that what do we expect the driver to do if there's a
> > > timeout on the device reset?
> > Oh ok. I understand your question now.
> >
> > As described in this proposal, when timeout expires during
> > initialization probe time, this device cannot be operated by the virtio layer,
> So it simply aborts further initialization sequence of the spec in section 3.1.1
> after step_1.
> >
> > Device probe has failed. So HV has several options to act on it.
> > For example, do device specific debug, or do transport specific reset, recreate
> the device.
> 
> But how about the device removal?
> 
> >
> > >
> > > > Typical example is virtio driver loading on physical virtio device
> > > > agnostic of
> > > the transport and waiting for the device to come out of reset.
> > >
> > > I'm not sure how hard we can meet a device/transport agnostic timeout.
> > > E.g how can we know the timeout satisfy for the requirement of all
> > > the transport?
> > >
> > > >
> > > > > > If src is hw and dst is sw, sw will likely have similar
> > > > > > capabilities as hw, not just
> > > > > this particular one but many other. Isn't it?
> > > > >
> > > > > The problem is the when src is software backend without this
> capability.
> > > >
> > > > It isn’t any different than any other RW field of the device.
> > > > For example sw did virtio pci device emulation with 30 msix
> > > > vectors and hw
> > > has only 2.
> > >
> > > In the case of MSI-X, the migration can't be done directly from src
> > > to dest. The only choice is some kind of software mediation.
> > >
> > > In the case of timeout, the mediation won't even help, since we
> > > don't even present the timeout to guest. Even if the hypervisor can
> > > see the timeout interface.
> > Since sw didn’t presented the timeout to the guest, it will find a HW device
> with similar capability (without timeout).
> > Or may be it can find a device that already passed the device reset state and
> attach such device.
> > HV has its options.
> >
> > Still not any different than rest of the RW capabilities/registers.
> 
> So the timeout doesn't work if we can enforce that. 
Didn’t follow your comment. Enforce what?

> Seems sub optimal than doing a transport reset if possible.

Transport level reset is not helpful for two main reasons.
1. A device under reset has nothing wrong at the PCI level.
Device is just still under boot/resetting to service virtio level commands.
So doing PCI level reset is actually adversely affect it further because PCI level reset will trigger more layers of unwanted reset.

2. PCI level reset must complete in 100msec.

I still don't understand mixing transport level reset with virtio level reset in context of LM.

If I read your original concern, that a sw device was exposed on HV-1 without device reset timeout.
And now user wants to migrate this VM to HV-2 which has a hw device with device reset timeout.
And now this device cannot be passed to the HV-2 because older sw device was attached to the VM.
I am frankly not sure, if this is really a case to consider.
It still similar to any other migration case where two devices have different capabilities.
For example, SW device supported 16 RSS queues. HW has 4 RSS queues....
So now emulate that as well? :)
In another MSI-X example you provided emulated MSIX alterative was proposed. This is again not mapping the full hw device.

So it is still sw device on other hand. User might as well just continue in sw mode.

Let's assume for a moment that, this (rare) case should be considered.

Once way to overcome this is below.
(a) Do not place new device_reset_timeout field in struct virtio_pci_common_cfg.
This is because this map be in memory mapped BAR directly accessible to the guest and future version of the spec has more fields after device reset timeout.
Such fields are somehow implemented by the sw device on HV-1.

(b) Instead, define new cfg_type VIRTIO_PCI_CAP_DEV_TIMEOUT_CFG.
This points to the struct containing device reset timeout.
Exposing this capability can be skipped when mapping the hw virtio device to VM on HV-2 to match sw virtio device of HV-1.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-22  7:20                                                           ` Parav Pandit
@ 2021-10-25  5:41                                                             ` Jason Wang
  2021-10-25  6:11                                                               ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2021-10-25  5:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer


在 2021/10/22 下午3:20, Parav Pandit 写道:
>
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Friday, October 15, 2021 2:12 PM
>>
>> On Fri, Oct 15, 2021 at 4:21 PM Parav Pandit <parav@nvidia.com> wrote:
>>>
>>>> From: Jason Wang <jasowang@redhat.com>
>>>> Sent: Friday, October 15, 2021 12:33 PM
>>>>> Do you have pointer to it? I do not understand transport level reset.
>>>> https://lore.kernel.org/virtualization/7ebb9ba0-69a0-2279-9b9e-
>>>> 60c50db06e94@redhat.com/
>>>>
>>> I read it. PCI level reset to recover a faulty device is useful.
>>> But we are not talking about resetting a PCI level here.
>>>
>>> It is very straight forward. A virtio level is reseting the device and infinitely
>> waiting for it to come up.
>>> Instead of doing so, device is providing hint to let it wait for the time driver
>> should.
>>
>> See the question below. But what I want to say is something like:
>>
>> 1) transport driver may have a transport specific timeout
> Yes. virtio PCI transport has transport specific timeout provided by the PCI device in this proposal.
>
>> 2) if the transport driver meet that timeout during virtio reset, perform the
>> transport specific reset
>>
> This is the implementation choice when a device specified timeout expires, what to do.
> Should system abort initializing the device further?
> Should qemu user do ctrl+c?
> Should it retry device init at virtio level?
> Should it call transport (in this case PCI FLR) to reset the device?
> This depends on where virtio device is used.
> This proposal doesn't define what a system should do when reset timer expires. It out of scope and implementation choice based on the env.


So what I meant is, if PCI gives us an upper limit of the reset (e.g 
100ms as you said), can virtio device perform reset more than that? If 
not, can we simply use 100ms for virtio-pci or can we gain from having a 
smaller per device timeout? If yes, it looks like a violation for PCI.

It looks to there're many choices

1) driver reset timeout via module parameter or sysfs
2) transport driver reset timeout (e.g via virtio-pci)
3) device reset timeout

If we can solve the issue in 1) and 2), there's no need for 3) consider 
3) requires a changes is various places and may bring extra troubles.


>
>> It looks to me it will guarantee the success of the reset?
> You are describing the timeout expiration handling.


This looks more important than just having a device hint of the timeout. 
We don't want to block thing like module unload or device removal etc.


> Yes, an implementation can choose to perform transport level reset when timeout expires.
>
>>>>> If you are asking can PCI reset can be used to reset the device? Yes.
>>>> Yes, I meant PCI reset.
>>>>
>>>>> But here what we are talking about is, when virtio layer issue the
>>>>> device
>>>> reset, how long should it wait for that reset to complete.
>>>>
>>>> A question is that what do we expect the driver to do if there's a
>>>> timeout on the device reset?
>>> Oh ok. I understand your question now.
>>>
>>> As described in this proposal, when timeout expires during
>>> initialization probe time, this device cannot be operated by the virtio layer,
>> So it simply aborts further initialization sequence of the spec in section 3.1.1
>> after step_1.
>>> Device probe has failed. So HV has several options to act on it.
>>> For example, do device specific debug, or do transport specific reset, recreate
>> the device.
>>
>> But how about the device removal?
>>
>>>>> Typical example is virtio driver loading on physical virtio device
>>>>> agnostic of
>>>> the transport and waiting for the device to come out of reset.
>>>>
>>>> I'm not sure how hard we can meet a device/transport agnostic timeout.
>>>> E.g how can we know the timeout satisfy for the requirement of all
>>>> the transport?
>>>>
>>>>>>> If src is hw and dst is sw, sw will likely have similar
>>>>>>> capabilities as hw, not just
>>>>>> this particular one but many other. Isn't it?
>>>>>>
>>>>>> The problem is the when src is software backend without this
>> capability.
>>>>> It isn’t any different than any other RW field of the device.
>>>>> For example sw did virtio pci device emulation with 30 msix
>>>>> vectors and hw
>>>> has only 2.
>>>>
>>>> In the case of MSI-X, the migration can't be done directly from src
>>>> to dest. The only choice is some kind of software mediation.
>>>>
>>>> In the case of timeout, the mediation won't even help, since we
>>>> don't even present the timeout to guest. Even if the hypervisor can
>>>> see the timeout interface.
>>> Since sw didn’t presented the timeout to the guest, it will find a HW device
>> with similar capability (without timeout).
>>> Or may be it can find a device that already passed the device reset state and
>> attach such device.
>>> HV has its options.
>>>
>>> Still not any different than rest of the RW capabilities/registers.
>> So the timeout doesn't work if we can enforce that.
> Didn’t follow your comment. Enforce what?


I meant the timeout value doesn't work here since guest can't see that.


>
>> Seems sub optimal than doing a transport reset if possible.
> Transport level reset is not helpful for two main reasons.
> 1. A device under reset has nothing wrong at the PCI level.
> Device is just still under boot/resetting to service virtio level commands.
> So doing PCI level reset is actually adversely affect it further because PCI level reset will trigger more layers of unwanted reset.


Well, looking at the driver codes, there're various places that the 
reset() is not expected to be fail. My understanding is that your 
proposal only solve the issue when the reset can fail but not other. 
That's why we need seek help for transport layer support.


>
> 2. PCI level reset must complete in 100msec.


Did you mean 100ms is too long? If yes, please explain why.


>
> I still don't understand mixing transport level reset with virtio level reset in context of LM.
>
> If I read your original concern, that a sw device was exposed on HV-1 without device reset timeout.
> And now user wants to migrate this VM to HV-2 which has a hw device with device reset timeout.
> And now this device cannot be passed to the HV-2 because older sw device was attached to the VM.
> I am frankly not sure, if this is really a case to consider.


Actually, it's a choice of the cloud vendors. But consider there are 
very huge number of instances which is based on software virtio, it's a 
valid use case. But what I want to say is that, if we touch the spec, it 
will requires changes in the management layer as well. E.g if we can not 
migrate from hardware that support reset timeout to software that 
doesn't support reset timeout. It will be better if we can do that only 
in the driver.


> It still similar to any other migration case where two devices have different capabilities.
> For example, SW device supported 16 RSS queues. HW has 4 RSS queues....
> So now emulate that as well? :)


Similar but one more burden for the management if a simple 100ms timeout 
can work for virtio-pci.


> In another MSI-X example you provided emulated MSIX alterative was proposed. This is again not mapping the full hw device.
>
> So it is still sw device on other hand. User might as well just continue in sw mode.
>
> Let's assume for a moment that, this (rare) case should be considered.
>
> Once way to overcome this is below.
> (a) Do not place new device_reset_timeout field in struct virtio_pci_common_cfg.
> This is because this map be in memory mapped BAR directly accessible to the guest and future version of the spec has more fields after device reset timeout.
> Such fields are somehow implemented by the sw device on HV-1.
>
> (b) Instead, define new cfg_type VIRTIO_PCI_CAP_DEV_TIMEOUT_CFG.
> This points to the struct containing device reset timeout.
> Exposing this capability can be skipped when mapping the hw virtio device to VM on HV-2 to match sw virtio device of HV-1.


This will ease the hypervisor but anyhow hypervisor can choose to not 
expose the capability directly to guest (which will soon became a burden 
to support cross vendor live migration).

Thanks



^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-25  5:41                                                             ` Jason Wang
@ 2021-10-25  6:11                                                               ` Parav Pandit
  2021-10-26  4:03                                                                 ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-25  6:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer



> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, October 25, 2021 11:12 AM
[..] 
> So what I meant is, if PCI gives us an upper limit of the reset (e.g 100ms as you
> said), can virtio device perform reset more than that? If not, can we simply use
> 100ms for virtio-pci or can we gain from having a smaller per device timeout?
> If yes, it looks like a violation for PCI.
No. Not at all a violation of PCI spec.
PCI spec is transport layer.
Virtio io is layer above transport.

PCI device can be reset, but not necessary, it is ready to service upper layer protocols who may need more time to finish the reset.

There are several examples of it in industry today.
nvme, mlx5, efa, intel ice and more who may need more time to reset the upper layer device tan 100msec.

second point that I already explained, is there is nothing wrong at the PCI level that _requires_ further reset.
PCI device is just fine, it is under reset by virtio. Depending on the device implementation it just takes more time.
In few responses in the thread, I already described those different types of devices in response to Cornelia and Michael.
Please go through the thread. Lets  not repeat same discussion again.

> 
> It looks to there're many choices
> 
> 1) driver reset timeout via module parameter or sysfs
> 2) transport driver reset timeout (e.g via virtio-pci)
> 3) device reset timeout
> 
> If we can solve the issue in 1) and 2), there's no need for 3) consider
> 3) requires a changes is various places and may bring extra troubles.
> 
What are those extra troubles?
Device reset timeout is addressing #1, #2 and #3 without any of the extra user config knobs. This delivers less painful systems to user and also vendor agnostic method.

Actually sysfs is more troublesome, where user needs to reboot the system and do scripts to configure these values even before kernel-builtin modules are loaded.
Some pre-boot env, doesn’t have such sysfs.

> 
> >
> >> It looks to me it will guarantee the success of the reset?
> > You are describing the timeout expiration handling.
> 
> 
> This looks more important than just having a device hint of the timeout.
> We don't want to block thing like module unload or device removal etc.
> 
We already discussed in this thread with Michael, that a device which is already reseted before, a second reset should be fast to service anyway.
Please read the discussion in this thread. Lets not repeat same discussion again.
> >>> Still not any different than rest of the RW capabilities/registers.
> >> So the timeout doesn't work if we can enforce that.
> > Didn’t follow your comment. Enforce what?
> 
> 
> I meant the timeout value doesn't work here since guest can't see that.
Your example was for VM migration from sw_without_device_reset to hw_with_device_reset.
And here, you wanted to hide the device reset timeout.
So guest _should_not_ be able to see it. Right?

> 
> 
> >
> >> Seems sub optimal than doing a transport reset if possible.
> > Transport level reset is not helpful for two main reasons.
> > 1. A device under reset has nothing wrong at the PCI level.
> > Device is just still under boot/resetting to service virtio level commands.
> > So doing PCI level reset is actually adversely affect it further because PCI
> level reset will trigger more layers of unwanted reset.
> 
> 
> Well, looking at the driver codes, there're various places that the
> reset() is not expected to be fail. My understanding is that your
> proposal only solve the issue when the reset can fail but not other.
> That's why we need seek help for transport layer support.
> 
Driver can issue transport level reset when reset timeout expires.
First we need the reset timeout to be defined and let it expire. :)
Once it expires, sure, transport level resets can be applied.

I think you missed this part, that we do not reset to wait infinitely in all places as done today in linux driver.
Wait finitely, and when finite time expires, sure, driver can issue transport level reset.

> 
> >
> > 2. PCI level reset must complete in 100msec.
> 
> 
> Did you mean 100ms is too long? If yes, please explain why.
This is discussed and Cornelia gave the example also in this thread.
i.e. a fw based hw virtio device may not have finished its initialization in its 100 msec.
so virtio layer may need to wait longer.

> 
> 
> >
> > I still don't understand mixing transport level reset with virtio level reset in
> context of LM.
> >
> > If I read your original concern, that a sw device was exposed on HV-1 without
> device reset timeout.
> > And now user wants to migrate this VM to HV-2 which has a hw device with
> device reset timeout.
> > And now this device cannot be passed to the HV-2 because older sw device
> was attached to the VM.
> > I am frankly not sure, if this is really a case to consider.
> 
> 
> Actually, it's a choice of the cloud vendors. But consider there are
> very huge number of instances which is based on software virtio, it's a
> valid use case. But what I want to say is that, if we touch the spec, it
> will requires changes in the management layer as well. E.g if we can not
> migrate from hardware that support reset timeout to software that
> doesn't support reset timeout. It will be better if we can do that only
> in the driver.
Above explanation chooses now a different example.
In this example, you mention hw to sw migration.
Previously your example was from inferior sw to newer hw.
 
> 
> Similar but one more burden for the management if a simple 100ms timeout
> can work for virtio-pci.
Resilient spec and resilient sw, hw stack is not a burden to management stack.
It is just any other hw/sw attribute of the virtio device to take care.

> 
> > In another MSI-X example you provided emulated MSIX alterative was
> proposed. This is again not mapping the full hw device.
> >
> > So it is still sw device on other hand. User might as well just continue in sw
> mode.
> >
> > Let's assume for a moment that, this (rare) case should be considered.
> >
> > Once way to overcome this is below.
> > (a) Do not place new device_reset_timeout field in struct
> virtio_pci_common_cfg.
> > This is because this map be in memory mapped BAR directly accessible to the
> guest and future version of the spec has more fields after device reset timeout.
> > Such fields are somehow implemented by the sw device on HV-1.
> >
> > (b) Instead, define new cfg_type VIRTIO_PCI_CAP_DEV_TIMEOUT_CFG.
> > This points to the struct containing device reset timeout.
> > Exposing this capability can be skipped when mapping the hw virtio device to
> VM on HV-2 to match sw virtio device of HV-1.
> 
> 
> This will ease the hypervisor but anyhow hypervisor can choose to not
> expose the capability directly to guest (which will soon became a burden
> to support cross vendor live migration).
Again it is not a burden.
If a live migration sw needs the ability to migrate between vendors, right approach is to have attributes in the spec.
So that sw has the ability to understand where can device migrate.

As I explained, there are many RW features in the virtio spec that live migration sw needs to take care to migrate across cross vendor.
A new RO field that improves the sw stack cannot be attributed as burden to live migration sw.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-25  6:11                                                               ` Parav Pandit
@ 2021-10-26  4:03                                                                 ` Jason Wang
  2021-10-27  8:04                                                                   ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2021-10-26  4:03 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer

On Mon, Oct 25, 2021 at 2:11 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, October 25, 2021 11:12 AM
> [..]
> > So what I meant is, if PCI gives us an upper limit of the reset (e.g 100ms as you
> > said), can virtio device perform reset more than that? If not, can we simply use
> > 100ms for virtio-pci or can we gain from having a smaller per device timeout?
> > If yes, it looks like a violation for PCI.
> No. Not at all a violation of PCI spec.
> PCI spec is transport layer.
> Virtio io is layer above transport.
>
> PCI device can be reset, but not necessary, it is ready to service upper layer protocols who may need more time to finish the reset.

Isn't transport level reset implies a virtio reset? I think this is
the question we need to answer or clarify it in the spec. All my
suggestions were based on the answer "yes".

Another question is that we don't define reset actually, do we need to
do that before trying to define reset timeout? To me the reset means
the device can be used by the driver which means all the basic device
facilities are ready. It means the device is ready from the
perspective of a "driver", if the device requires a connection to a
remote subsystem, whether or not such connection is established is out
of the scope of virtio reset. Devices can choose not to process the
requests from the virtqueue.

Another question is that, can we present a PCI device if the
underlayer device logic is not ready? If the question is yes, do we
need a new facility to tell the driver that the device is being
initialized?

>
> There are several examples of it in industry today.
> nvme, mlx5, efa, intel ice and more who may need more time to reset the upper layer device tan 100msec.

Virtio supports more types than the above list, if it's the timeout
that is specific to some types of the devices (e.g block), we need to
use device specific methods.

>
> second point that I already explained, is there is nothing wrong at the PCI level that _requires_ further reset.
> PCI device is just fine, it is under reset by virtio. Depending on the device implementation it just takes more time.
> In few responses in the thread, I already described those different types of devices in response to Cornelia and Michael.
> Please go through the thread. Let's not repeat the same discussion again.

So my point is to avoid exposing any implementation specific to the
driver. It has a different movitiaton than the discussion you refer to
here.

>
> >
> > It looks to there're many choices
> >
> > 1) driver reset timeout via module parameter or sysfs
> > 2) transport driver reset timeout (e.g via virtio-pci)
> > 3) device reset timeout
> >
> > If we can solve the issue in 1) and 2), there's no need for 3) consider
> > 3) requires a changes is various places and may bring extra troubles.
> >
> What are those extra troubles?
> Device reset timeout is addressing #1, #2 and #3 without any of the extra user config knobs. This delivers less painful systems to user and also vendor agnostic method.
>
> Actually sysfs is more troublesome, where user needs to reboot the system and do scripts to configure these values even before kernel-builtin modules are loaded.
> Some pre-boot env, doesn’t have such sysfs.

It's just an example, it could be a hard coded value. If 100ms is not
enough we can use something longer.

>
> >
> > >
> > >> It looks to me it will guarantee the success of the reset?
> > > You are describing the timeout expiration handling.
> >
> >
> > This looks more important than just having a device hint of the timeout.
> > We don't want to block thing like module unload or device removal etc.
> >
> We already discussed in this thread with Michael, that a device which is already reseted before, a second reset should be fast to service anyway.

Well, this part I don't understand. If you meant the time spent on the
"first" reset is implementation dependent but time spent on the
"second" reset is fast (not much implementation dependent). It looks
like a conclusion that applies only to your specific implementation or
hardware.

> Please read the discussion in this thread. Let's not repeat the same discussion again.

It's not a repeat of the discussion, we don't want to end up with a
solution that works only for a single vendor or implementation.

> > >>> Still not any different than rest of the RW capabilities/registers.
> > >> So the timeout doesn't work if we can enforce that.
> > > Didn’t follow your comment. Enforce what?
> >
> >
> > I meant the timeout value doesn't work here since guest can't see that.
> Your example was for VM migration from sw_without_device_reset to hw_with_device_reset.
> And here, you wanted to hide the device reset timeout.
> So guest _should_not_ be able to see it. Right?

Yes, so if the hardware can work with the infinite waiting in the
guest, there's no much value for the device specific reset timeout
actually. Isn't it?

>
> >
> >
> > >
> > >> Seems sub optimal than doing a transport reset if possible.
> > > Transport level reset is not helpful for two main reasons.
> > > 1. A device under reset has nothing wrong at the PCI level.
> > > Device is just still under boot/resetting to service virtio level commands.
> > > So doing PCI level reset is actually adversely affect it further because PCI
> > level reset will trigger more layers of unwanted reset.
> >
> >
> > Well, looking at the driver codes, there're various places that the
> > reset() is not expected to be fail. My understanding is that your
> > proposal only solve the issue when the reset can fail but not other.
> > That's why we need seek help for transport layer support.
> >
> Driver can issue transport level reset when reset timeout expires.
> First we need the reset timeout to be defined and let it expire. :)
> Once it expires, sure, transport level resets can be applied.

Yes, but I'm not sure how device specific timeout can help more:

1) if device works well, there won't be an infinite wait on the reset
2) if device is buggy,
2.1) wait for ever
2.2) doing a transport specific reset to prevent data corruption after a timeout

Using 2.2 is tricky since as discussed, the hypervisor may choose to
hide this and let guests just use infinite wait for most of the cases.

>
> I think you missed this part, that we do not reset to wait infinitely in all places as done today in linux driver.

The reason is that virtio drivers are mostly used in the guest where
the device is emulated by the software. It's almost impossible to
enforce a timeout for any specific operation, setting device status is
just one of the possible operations that may wait forever.

> Wait finitely, and when finite time expires, sure, driver can issue transport level reset.

Ok, so I think if we finally go with the way of the device advertised
timeout, we need to clarify the transport level reset is a must after
timeout in the spec anyhow.

>
> >
> > >
> > > 2. PCI level reset must complete in 100msec.
> >
> >
> > Did you mean 100ms is too long? If yes, please explain why.
> This is discussed and Cornelia gave the example also in this thread.
> i.e. a fw based hw virtio device may not have finished its initialization in its 100 msec.
> so virtio layer may need to wait longer.

Yes, but current code without timeout works in this case.

>
> >
> >
> > >
> > > I still don't understand mixing transport level reset with virtio level reset in
> > context of LM.
> > >
> > > If I read your original concern, that a sw device was exposed on HV-1 without
> > device reset timeout.
> > > And now user wants to migrate this VM to HV-2 which has a hw device with
> > device reset timeout.
> > > And now this device cannot be passed to the HV-2 because older sw device
> > was attached to the VM.
> > > I am frankly not sure, if this is really a case to consider.
> >
> >
> > Actually, it's a choice of the cloud vendors. But consider there are
> > very huge number of instances which is based on software virtio, it's a
> > valid use case. But what I want to say is that, if we touch the spec, it
> > will requires changes in the management layer as well. E.g if we can not
> > migrate from hardware that support reset timeout to software that
> > doesn't support reset timeout. It will be better if we can do that only
> > in the driver.
> Above explanation chooses now a different example.
> In this example, you mention hw to sw migration.
> Previously your example was from inferior sw to newer hw.

It just to demonstrate possible troubles:

1) migrate from sw(no timeout) to hw (timeout)
2) migrate from hw(timeout) to sw(no timeout)
3) migrate from hw_vendorA(timeoutA) to hw_vendorB(timeoutB)
4) migrate from hw_vendorA(timeoutA_implementationA) to
hw_vendorA(timeoutB_implementationB)
5) migrate from hw_vendorA(timeoutA_implementationA) to
hw_vendorA(timeoutA_implementationA)

The hypervisor can only choose to expose timeout value in the case 5)
(and it can only do this if it knows all the migration destinations
have the same timeout value) and it will let the 1 to 4 to simply use
infinite wait by hiding the capability. The management stack should
have virtio specific knowledge to poke into the PCIE capabilities list
to find and report the timeout and we need cross machine co-ordination
with the migration destination or clusters to allow the migration.
Hypervisors like Qemu maintain the migration compatibility by using
machine type, but this implementation specific attributes makes it
almost impossible to be a part of the machine type which means virtio
ad-hoc codes were needed.

>
> >
> > Similar but one more burden for the management if a simple 100ms timeout
> > can work for virtio-pci.
> Resilient spec and resilient sw, hw stack is not a burden to management stack.
> It is just any other hw/sw attribute of the virtio device to take care.

This is only true if you don't expose any vendor or implementation
specific attributes. See above.

>
> >
> > > In another MSI-X example you provided emulated MSIX alterative was
> > proposed. This is again not mapping the full hw device.
> > >
> > > So it is still sw device on other hand. User might as well just continue in sw
> > mode.
> > >
> > > Let's assume for a moment that, this (rare) case should be considered.
> > >
> > > Once way to overcome this is below.
> > > (a) Do not place new device_reset_timeout field in struct
> > virtio_pci_common_cfg.
> > > This is because this map be in memory mapped BAR directly accessible to the
> > guest and future version of the spec has more fields after device reset timeout.
> > > Such fields are somehow implemented by the sw device on HV-1.
> > >
> > > (b) Instead, define new cfg_type VIRTIO_PCI_CAP_DEV_TIMEOUT_CFG.
> > > This points to the struct containing device reset timeout.
> > > Exposing this capability can be skipped when mapping the hw virtio device to
> > VM on HV-2 to match sw virtio device of HV-1.
> >
> >
> > This will ease the hypervisor but anyhow hypervisor can choose to not
> > expose the capability directly to guest (which will soon became a burden
> > to support cross vendor live migration).
> Again it is not a burden.
> If a live migration sw needs the ability to migrate between vendors, right approach is to have attributes in the spec.
> So that sw has the ability to understand where can device migrate.

So waiting forever can make sure the guest can migrate without caring
about the timeout.

>
> As I explained, there are many RW features in the virtio spec that live migration sw needs to take care to migrate across cross vendor.

Note that those features are virtio features not the transport
specific features. Spec only cares about how to migrate virtio,
instead of migrating transport specific stuff. Another thing is that,
unlike #queue_paris or #msi_vectors, the device rest timeout is not
something that can be provisioned or mediated before/after doing live
migration. Hypervisor may just choose to disable them for the guest
for migration compatibility.

Thanks

> A new RO field that improves the sw stack cannot be attributed as burden to live migration sw.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-26  4:03                                                                 ` Jason Wang
@ 2021-10-27  8:04                                                                   ` Parav Pandit
  2021-10-27  8:26                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-27  8:04 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 26, 2021 9:34 AM
> 
> On Mon, Oct 25, 2021 at 2:11 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, October 25, 2021 11:12 AM
> > [..]
> > > So what I meant is, if PCI gives us an upper limit of the reset (e.g
> > > 100ms as you said), can virtio device perform reset more than that?
> > > If not, can we simply use 100ms for virtio-pci or can we gain from having a
> smaller per device timeout?
> > > If yes, it looks like a violation for PCI.
> > No. Not at all a violation of PCI spec.
> > PCI spec is transport layer.
> > Virtio io is layer above transport.
> >
> > PCI device can be reset, but not necessary, it is ready to service upper layer
> protocols who may need more time to finish the reset.
> 
> Isn't transport level reset implies a virtio reset? I think this is the question we
> need to answer or clarify it in the spec. All my suggestions were based on the
> answer "yes".
Transport level resets are often too short. It ensures that reset occurred, hence, notifications and DMA activities won't occur after transport reset.
However, it may not have finished the upper layer reset to re-service the device after transport reset.

> 
> Another question is that we don't define reset actually, 
It has reasonably well definition in the spec in commit: a306bf467850e ("clarify device reset")

> do we need to do that
> before trying to define reset timeout? To me the reset means the device can be
> used by the driver which means all the basic device facilities are ready. 
Reset is well in spec 1.0 compare to 0.95 indicating that driver need to wait reset to finish.

i.e. that driver should wait for device status to turn zero during reset and must do reset.

Unlike 0.95 that explicitly stated " Reset the device. This is not required on initial startup." In section 2.2.1.

> It
> means the device is ready from the perspective of a "driver", if the device
> requires a connection to a remote subsystem, whether or not such connection
> is established is out of the scope of virtio reset. 
Sure such implementation detail about remote system connection is outside the scope of spec.

> Devices can choose not to process the requests from the virtqueue.
Yes, that is right, but it is not limited to only vq processing.
As mentioned in the spec section 3.1.1 before VQs are created feature bits, virtio config space values can be read only after reset step_1 is completed.

> 
> Another question is that, can we present a PCI device if the underlayer device
> logic is not ready? 
Yes, this is very common for hw based devices, unlike synthetic devices for guest VMs.
This is also discussed in this thread with sriov example.

> If the question is yes, do we need a new facility to tell the
> driver that the device is being initialized?
> 
May be yes, a different register than the reset can be helpful to indicate that.
Though it shares lot of functionalities common to the reseting the device.

Given that this is too early in device initialization sequence,
Device doesn’t know if should return fake reset_completed status and if driver is going to wait for initializing register or not.

So let reset register indicate if the reset is completed or not seems more appropriate and less complex solution.

> >
> > There are several examples of it in industry today.
> > nvme, mlx5, efa, intel ice and more who may need more time to reset the
> upper layer device tan 100msec.
> 
> Virtio supports more types than the above list, if it's the timeout that is specific
> to some types of the devices (e.g block), we need to use device specific
> methods.
Block is most common example, but it is not limited to it.
There are non-synthetic, hw PCI devices of block and non-block types that may need remote connection or may take more time to reset.

As you described below, traditionally virtio devices were exposed to guest VM by the hypervisor and hence they were always there.
So resiliency was implicitly present on such guest VM case.

I gave example of above devices which are used in guest VM devices and as physical devices.
There isn't the question about widely popular and used virtio devices for guest VM. :-)

> >
> > second point that I already explained, is there is nothing wrong at the PCI
> level that _requires_ further reset.
> > PCI device is just fine, it is under reset by virtio. Depending on the device
> implementation it just takes more time.
> > In few responses in the thread, I already described those different types of
> devices in response to Cornelia and Michael.
> > Please go through the thread. Let's not repeat the same discussion again.
> 
> So my point is to avoid exposing any implementation specific to the driver. It
> has a different movitiaton than the discussion you refer to here.
> 
There are different type (physical, sriov) and class (net/blk/gpu) of hw PCI devices in addition to sw based devices.
Single timeout (either infinite or some arbitrary 1 msec) doesn’t fit all class of the devices.

This proposal provides generic, optional, vendor agnostic, platform agnostic RO register to driver.

> It's just an example, it could be a hard coded value. If 100ms is not enough we
> can use something longer.

It doesn’t make sense to have too short or too high timeout single value by the driver that fits all.
For example sw synthetic device to guest VM doesn’t need timeout. Guest VM can wait infinitely as device didn't expose the field.

> > >
> > > >
> > > >> It looks to me it will guarantee the success of the reset?
> > > > You are describing the timeout expiration handling.
> > >
> > >
> > > This looks more important than just having a device hint of the timeout.
> > > We don't want to block thing like module unload or device removal etc.
> > >
> > We already discussed in this thread with Michael, that a device which is
> already reseted before, a second reset should be fast to service anyway.
> 
> Well, this part I don't understand. If you meant the time spent on the "first"
> reset is implementation dependent but time spent on the "second" reset is fast
> (not much implementation dependent). It looks like a conclusion that applies
> only to your specific implementation or hardware.

No. it is not a specific hw implementation. Usually a firmware/hw based devices takes finite time come up.
It still have to present the physical PCI device which exists there even before virtio layer shows up for various reasons.
In such case a first device reset may take longer after power-on.

A device under control of driver, and running VQs, if being reset is likely to take shorter time.
This is because it doesn't need to undergo initialization that was already completed on initial first power-on.

One can still implement to do second reset also as slow as first, but such would be sub-optimal and not useful.
Not sure, why one would do that.

> 
> > Please read the discussion in this thread. Let's not repeat the same discussion
> again.
> 
> It's not a repeat of the discussion, we don't want to end up with a solution that
> works only for a single vendor or implementation.
Yes, I don't see any vendor specific implementation pushed here.
The proposal is making it platform, vendor and device type agnostic and continue to remain optional for sw based devices.

> 
> > > >>> Still not any different than rest of the RW capabilities/registers.
> > > >> So the timeout doesn't work if we can enforce that.
> > > > Didn’t follow your comment. Enforce what?
> > >
> > >
> > > I meant the timeout value doesn't work here since guest can't see that.
> > Your example was for VM migration from sw_without_device_reset to
> hw_with_device_reset.
> > And here, you wanted to hide the device reset timeout.
> > So guest _should_not_ be able to see it. Right?
> 
> Yes, so if the hardware can work with the infinite waiting in the guest, there's
> no much value for the device specific reset timeout actually. Isn't it?
Unlike HV sw exposing device, hw device may experience failures sometimes.
And a system to recover/resume in finite time from it is useful.

> > >
> > Driver can issue transport level reset when reset timeout expires.
> > First we need the reset timeout to be defined and let it expire. :)
> > Once it expires, sure, transport level resets can be applied.
> 
> Yes, but I'm not sure how device specific timeout can help more:
> 
> 1) if device works well, there won't be an infinite wait on the reset
> 2) if device is buggy,
> 2.1) wait for ever
A system might not even boot if it waits for ever just because of single faulty device.
A user may not even have chance to recover the system/upgrade fw of this faulty device.

A pre-boot environment using physical device cannot wait forever just because one device in whole system is buggy.

> 2.2) doing a transport specific reset to prevent data corruption after a timeout
> 
> Using 2.2 is tricky since as discussed, the hypervisor may choose to hide this
> and let guests just use infinite wait for most of the cases.
> 
Sure, some hypervisor can hide this for guest VM if it wish to.

> >
> > I think you missed this part, that we do not reset to wait infinitely in all places
> as done today in linux driver.
> 
> The reason is that virtio drivers are mostly used in the guest where the device
> is emulated by the software. It's almost impossible to enforce a timeout for any
> specific operation, setting device status is just one of the possible operations
> that may wait forever.

virtio drivers are having wider uses than just guest env, including preboot, physical devices, sriov etc.

RO timeout is not about enforcement, it is to set upper bound limit.

Guest VM that doesn’t follow the spec can do,
(a) not honor the timeout, and wait forever. 
(b) ignores the timeout and gets non-operational device.

And incorrect behavior can happen even with current drivers regardless of the timeout.

> 
> > Wait finitely, and when finite time expires, sure, driver can issue transport
> level reset.
> 
> Ok, so I think if we finally go with the way of the device advertised timeout, we
> need to clarify the transport level reset is a must after timeout in the spec
> anyhow.
> 
I am not sure it is in the scope of the virtio spec.
I understand it may be good to reset the device, but this is really implementation choice.
Why cant platform/sw ignore the device with single retry to reset the device without transport level reset?
I would put that as guidance to driver and not _must_ requirement in the spec.

Guiding driver to issue transport level reset is very useful to indicate driver that such transport level reset will ensure that notifications and DMA will stop.
So yes, we should document this instruction for drivers.

> > This is discussed and Cornelia gave the example also in this thread.
> > i.e. a fw based hw virtio device may not have finished its initialization in its
> 100 msec.
> > so virtio layer may need to wait longer.
> 
> Yes, but current code without timeout works in this case.
It works only if the device didn’t fail to reset.

> It just to demonstrate possible troubles:
> 
> 1) migrate from sw(no timeout) to hw (timeout)
> 2) migrate from hw(timeout) to sw(no timeout)
Sw should be enhanced to match the migration from hw.

> 3) migrate from hw_vendorA(timeoutA) to hw_vendorB(timeoutB)
Again, this is no different than any other features between two vendors.
Any emulation, mediation is not A to B migration, it is in between of #2 and #3.
So perf wise it is still not same.

> 4) migrate from hw_vendorA(timeoutA_implementationA) to
> hw_vendorA(timeoutB_implementationB)
Frankly, this is more practical scenario compare to #3.

> 5) migrate from hw_vendorA(timeoutA_implementationA) to
> hw_vendorA(timeoutA_implementationA)
This doesn’t need any extra care. I think you wanted to say,

hw_vendorA(timeoutA_implementationA) to hw_vendorA(timeoutA_implementationB)
                                                                                                                                                    ^^^^^
> 
> The hypervisor can only choose to expose timeout value in the case 5) (and it
> can only do this if it knows all the migration destinations have the same
> timeout value) and it will let the 1 to 4 to simply use infinite wait by hiding the
> capability. The management stack should have virtio specific knowledge to
> poke into the PCIE capabilities list to find and report the timeout and we need
> cross machine co-ordination with the migration destination or clusters to allow
> the migration.
> Hypervisors like Qemu maintain the migration compatibility by using machine
> type, but this implementation specific attributes makes it almost impossible to
> be a part of the machine type which means virtio ad-hoc codes were needed.

With above explanation, I don’t see how QEMU can migrate a virtio device from 4 RSS queues to 2 RSS queues based device with just machine type.
Does machine type encompass all the device level details too?

A HV needs to compose (and/or find) the device compatible to the source anyway.
It is less difficult to find a virtio spec defined device compare to some ad-hoc device.

> > >
> > > Similar but one more burden for the management if a simple 100ms
> > > timeout can work for virtio-pci.
> > Resilient spec and resilient sw, hw stack is not a burden to management
> stack.
> > It is just any other hw/sw attribute of the virtio device to take care.
> 
> This is only true if you don't expose any vendor or implementation specific
> attributes. See above.
> 
It is no different than any other virtio features. If source virtio device support of vendor-A supports control VQ,
And destination virtio device of vendor X (A/B) implementation Y (A/B), doesn't support control VQ,
They may not be able to migrate.

HV or other migration sw needs to compatibility check or sometime composing device on destination that matches source attributes.

It is important to keep LM eyes open, but this is no different than other existing or new virtio features (say statistics of net device in guest vm supported by VF).

> >
> > >
> > > > In another MSI-X example you provided emulated MSIX alterative was
> > > proposed. This is again not mapping the full hw device.
> > > >
> > > > So it is still sw device on other hand. User might as well just
> > > > continue in sw
> > > mode.
> > > >
> > > > Let's assume for a moment that, this (rare) case should be considered.
> > > >
> > > > Once way to overcome this is below.
> > > > (a) Do not place new device_reset_timeout field in struct
> > > virtio_pci_common_cfg.
> > > > This is because this map be in memory mapped BAR directly
> > > > accessible to the
> > > guest and future version of the spec has more fields after device reset
> timeout.
> > > > Such fields are somehow implemented by the sw device on HV-1.
> > > >
> > > > (b) Instead, define new cfg_type VIRTIO_PCI_CAP_DEV_TIMEOUT_CFG.
> > > > This points to the struct containing device reset timeout.
> > > > Exposing this capability can be skipped when mapping the hw virtio
> > > > device to
> > > VM on HV-2 to match sw virtio device of HV-1.
> > >
> > >
> > > This will ease the hypervisor but anyhow hypervisor can choose to
> > > not expose the capability directly to guest (which will soon became
> > > a burden to support cross vendor live migration).
> > Again it is not a burden.
> > If a live migration sw needs the ability to migrate between vendors, right
> approach is to have attributes in the spec.
> > So that sw has the ability to understand where can device migrate.
> 
> So waiting forever can make sure the guest can migrate without caring about
> the timeout.
> 
> >
> > As I explained, there are many RW features in the virtio spec that live
> migration sw needs to take care to migrate across cross vendor.
> 
> Note that those features are virtio features not the transport specific features.
> Spec only cares about how to migrate virtio, instead of migrating transport
> specific stuff. 
Transport is part of the virtio spec and equally part of the conformance clauses in section 7.
I don't see how unwritten virtio LM spec won't care about it.

> Another thing is that, unlike #queue_paris or #msi_vectors, the
> device rest timeout is not something that can be provisioned or mediated
> before/after doing live migration. 
We should work towards it so that if this is really needed, we can mediate it.
I proposed the option in this thread, that we can do as device reset timeout register via VIRTIO_PCI_CAP_PCI_DEV_TIMEOUT.

This also enables to not expose in different migration cases you listed above.

> Hypervisor may just choose to disable them
> for the guest for migration compatibility.
> 
Yes, but this may need disabling many things in a virtio device.
For any to any migration, a device with lowest common denominator features in the given network will work
Such device may miss on non-compatible features of the device.
virtio level control VQ can be one example of it.
Not sure how one can mediate control VQ while keeping other VQs to device after migration, if it was supported by source device without mediation.

An LM implementation needs to find/compose destination device either in device or through mediation that match source device.
Virtio spec addition that makes the mediation easier is a good goal, but not sure all extensions can reach there.
Some may not, and hence not enhancing the spec for it, doesn’t seem right. Though in case of device reset time, we have alternative proposal to 
expose via new VIRTIO_PCI_CAP_PCI_DEV_TIMEOUT instead of placement inside struct virtio_pci_common_cfg.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-27  8:04                                                                   ` Parav Pandit
@ 2021-10-27  8:26                                                                     ` Michael S. Tsirkin
  2021-10-28  4:01                                                                       ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-27  8:26 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer

On Wed, Oct 27, 2021 at 08:04:10AM +0000, Parav Pandit wrote:
> 
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 26, 2021 9:34 AM
> > 
> > On Mon, Oct 25, 2021 at 2:11 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, October 25, 2021 11:12 AM
> > > [..]
> > > > So what I meant is, if PCI gives us an upper limit of the reset (e.g
> > > > 100ms as you said), can virtio device perform reset more than that?
> > > > If not, can we simply use 100ms for virtio-pci or can we gain from having a
> > smaller per device timeout?
> > > > If yes, it looks like a violation for PCI.
> > > No. Not at all a violation of PCI spec.
> > > PCI spec is transport layer.
> > > Virtio io is layer above transport.
> > >
> > > PCI device can be reset, but not necessary, it is ready to service upper layer
> > protocols who may need more time to finish the reset.
> > 
> > Isn't transport level reset implies a virtio reset? I think this is the question we
> > need to answer or clarify it in the spec. All my suggestions were based on the
> > answer "yes".
> Transport level resets are often too short. It ensures that reset occurred, hence, notifications and DMA activities won't occur after transport reset.
> However, it may not have finished the upper layer reset to re-service the device after transport reset.
> 
> > 
> > Another question is that we don't define reset actually, 
> It has reasonably well definition in the spec in commit: a306bf467850e ("clarify device reset")
> 
> > do we need to do that
> > before trying to define reset timeout? To me the reset means the device can be
> > used by the driver which means all the basic device facilities are ready. 
> Reset is well in spec 1.0 compare to 0.95 indicating that driver need to wait reset to finish.
> 
> i.e. that driver should wait for device status to turn zero during reset and must do reset.
> 
> Unlike 0.95 that explicitly stated " Reset the device. This is not required on initial startup." In section 2.2.1.


Right. So again, I'm trying to help you here. I think the problem with
your approach so far is that it conflates the two things,
physical device startup might take a lot of time and it's
ok since physical host boot takes very long.

This does not seem to be related to a driver init at all
which is what the spec describes.


A reset timeout which can run into seconds will break usecases such as
kata containers. This is why there's so much resistance to adding such a
feature to the spec. reset should be short, in fact it seems reasonable
for spec to give guidance that it should take around ~10ms tops.











> > It
> > means the device is ready from the perspective of a "driver", if the device
> > requires a connection to a remote subsystem, whether or not such connection
> > is established is out of the scope of virtio reset. 
> Sure such implementation detail about remote system connection is outside the scope of spec.
> 
> > Devices can choose not to process the requests from the virtqueue.
> Yes, that is right, but it is not limited to only vq processing.
> As mentioned in the spec section 3.1.1 before VQs are created feature bits, virtio config space values can be read only after reset step_1 is completed.
> 
> > 
> > Another question is that, can we present a PCI device if the underlayer device
> > logic is not ready? 
> Yes, this is very common for hw based devices, unlike synthetic devices for guest VMs.
> This is also discussed in this thread with sriov example.
> 
> > If the question is yes, do we need a new facility to tell the
> > driver that the device is being initialized?
> > 
> May be yes, a different register than the reset can be helpful to indicate that.
> Though it shares lot of functionalities common to the reseting the device.
> 
> Given that this is too early in device initialization sequence,
> Device doesn’t know if should return fake reset_completed status and if driver is going to wait for initializing register or not.
> 
> So let reset register indicate if the reset is completed or not seems more appropriate and less complex solution.

It's not reset though. What is not completed is startup. virtio reset
can't take seconds.

> > >
> > > There are several examples of it in industry today.
> > > nvme, mlx5, efa, intel ice and more who may need more time to reset the
> > upper layer device tan 100msec.
> > 
> > Virtio supports more types than the above list, if it's the timeout that is specific
> > to some types of the devices (e.g block), we need to use device specific
> > methods.
> Block is most common example, but it is not limited to it.
> There are non-synthetic, hw PCI devices of block and non-block types that may need remote connection or may take more time to reset.
> 
> As you described below, traditionally virtio devices were exposed to guest VM by the hypervisor and hence they were always there.
> So resiliency was implicitly present on such guest VM case.
> 
> I gave example of above devices which are used in guest VM devices and as physical devices.
> There isn't the question about widely popular and used virtio devices for guest VM. :-)
> 
> > >
> > > second point that I already explained, is there is nothing wrong at the PCI
> > level that _requires_ further reset.
> > > PCI device is just fine, it is under reset by virtio. Depending on the device
> > implementation it just takes more time.
> > > In few responses in the thread, I already described those different types of
> > devices in response to Cornelia and Michael.
> > > Please go through the thread. Let's not repeat the same discussion again.
> > 
> > So my point is to avoid exposing any implementation specific to the driver. It
> > has a different movitiaton than the discussion you refer to here.
> > 
> There are different type (physical, sriov) and class (net/blk/gpu) of hw PCI devices in addition to sw based devices.
> Single timeout (either infinite or some arbitrary 1 msec) doesn’t fit all class of the devices.
> 
> This proposal provides generic, optional, vendor agnostic, platform agnostic RO register to driver.
> 
> > It's just an example, it could be a hard coded value. If 100ms is not enough we
> > can use something longer.
> 
> It doesn’t make sense to have too short or too high timeout single value by the driver that fits all.
> For example sw synthetic device to guest VM doesn’t need timeout. Guest VM can wait infinitely as device didn't expose the field.
> 
> > > >
> > > > >
> > > > >> It looks to me it will guarantee the success of the reset?
> > > > > You are describing the timeout expiration handling.
> > > >
> > > >
> > > > This looks more important than just having a device hint of the timeout.
> > > > We don't want to block thing like module unload or device removal etc.
> > > >
> > > We already discussed in this thread with Michael, that a device which is
> > already reseted before, a second reset should be fast to service anyway.
> > 
> > Well, this part I don't understand. If you meant the time spent on the "first"
> > reset is implementation dependent but time spent on the "second" reset is fast
> > (not much implementation dependent). It looks like a conclusion that applies
> > only to your specific implementation or hardware.
> 
> No. it is not a specific hw implementation. Usually a firmware/hw based devices takes finite time come up.
> It still have to present the physical PCI device which exists there even before virtio layer shows up for various reasons.
> In such case a first device reset may take longer after power-on.
> 
> A device under control of driver, and running VQs, if being reset is likely to take shorter time.
> This is because it doesn't need to undergo initialization that was already completed on initial first power-on.
> 
> One can still implement to do second reset also as slow as first, but such would be sub-optimal and not useful.
> Not sure, why one would do that.
> 
> > 
> > > Please read the discussion in this thread. Let's not repeat the same discussion
> > again.
> > 
> > It's not a repeat of the discussion, we don't want to end up with a solution that
> > works only for a single vendor or implementation.
> Yes, I don't see any vendor specific implementation pushed here.
> The proposal is making it platform, vendor and device type agnostic and continue to remain optional for sw based devices.
> 
> > 
> > > > >>> Still not any different than rest of the RW capabilities/registers.
> > > > >> So the timeout doesn't work if we can enforce that.
> > > > > Didn’t follow your comment. Enforce what?
> > > >
> > > >
> > > > I meant the timeout value doesn't work here since guest can't see that.
> > > Your example was for VM migration from sw_without_device_reset to
> > hw_with_device_reset.
> > > And here, you wanted to hide the device reset timeout.
> > > So guest _should_not_ be able to see it. Right?
> > 
> > Yes, so if the hardware can work with the infinite waiting in the guest, there's
> > no much value for the device specific reset timeout actually. Isn't it?
> Unlike HV sw exposing device, hw device may experience failures sometimes.
> And a system to recover/resume in finite time from it is useful.

Hmm given software after all runs on the hardware declaring that
it's fundamentally more robust seems controversial ;)

> > > >
> > > Driver can issue transport level reset when reset timeout expires.
> > > First we need the reset timeout to be defined and let it expire. :)
> > > Once it expires, sure, transport level resets can be applied.
> > 
> > Yes, but I'm not sure how device specific timeout can help more:
> > 
> > 1) if device works well, there won't be an infinite wait on the reset
> > 2) if device is buggy,
> > 2.1) wait for ever
> A system might not even boot if it waits for ever just because of single faulty device.
> A user may not even have chance to recover the system/upgrade fw of this faulty device.
> 
> A pre-boot environment using physical device cannot wait forever just because one device in whole system is buggy.
> 
> > 2.2) doing a transport specific reset to prevent data corruption after a timeout
> > 
> > Using 2.2 is tricky since as discussed, the hypervisor may choose to hide this
> > and let guests just use infinite wait for most of the cases.
> > 
> Sure, some hypervisor can hide this for guest VM if it wish to.
> 
> > >
> > > I think you missed this part, that we do not reset to wait infinitely in all places
> > as done today in linux driver.
> > 
> > The reason is that virtio drivers are mostly used in the guest where the device
> > is emulated by the software. It's almost impossible to enforce a timeout for any
> > specific operation, setting device status is just one of the possible operations
> > that may wait forever.
> 
> virtio drivers are having wider uses than just guest env, including preboot, physical devices, sriov etc.
> 
> RO timeout is not about enforcement, it is to set upper bound limit.
> 
> Guest VM that doesn’t follow the spec can do,
> (a) not honor the timeout, and wait forever. 
> (b) ignores the timeout and gets non-operational device.
> 
> And incorrect behavior can happen even with current drivers regardless of the timeout.
> 
> > 
> > > Wait finitely, and when finite time expires, sure, driver can issue transport
> > level reset.
> > 
> > Ok, so I think if we finally go with the way of the device advertised timeout, we
> > need to clarify the transport level reset is a must after timeout in the spec
> > anyhow.
> > 
> I am not sure it is in the scope of the virtio spec.
> I understand it may be good to reset the device, but this is really implementation choice.
> Why cant platform/sw ignore the device with single retry to reset the device without transport level reset?
> I would put that as guidance to driver and not _must_ requirement in the spec.
> 
> Guiding driver to issue transport level reset is very useful to indicate driver that such transport level reset will ensure that notifications and DMA will stop.
> So yes, we should document this instruction for drivers.
> 
> > > This is discussed and Cornelia gave the example also in this thread.
> > > i.e. a fw based hw virtio device may not have finished its initialization in its
> > 100 msec.
> > > so virtio layer may need to wait longer.
> > 
> > Yes, but current code without timeout works in this case.
> It works only if the device didn’t fail to reset.
> 
> > It just to demonstrate possible troubles:
> > 
> > 1) migrate from sw(no timeout) to hw (timeout)
> > 2) migrate from hw(timeout) to sw(no timeout)
> Sw should be enhanced to match the migration from hw.
> 
> > 3) migrate from hw_vendorA(timeoutA) to hw_vendorB(timeoutB)
> Again, this is no different than any other features between two vendors.
> Any emulation, mediation is not A to B migration, it is in between of #2 and #3.
> So perf wise it is still not same.
> 
> > 4) migrate from hw_vendorA(timeoutA_implementationA) to
> > hw_vendorA(timeoutB_implementationB)
> Frankly, this is more practical scenario compare to #3.
> 
> > 5) migrate from hw_vendorA(timeoutA_implementationA) to
> > hw_vendorA(timeoutA_implementationA)
> This doesn’t need any extra care. I think you wanted to say,
> 
> hw_vendorA(timeoutA_implementationA) to hw_vendorA(timeoutA_implementationB)
>                                                                                                                                                     ^^^^^
> > 
> > The hypervisor can only choose to expose timeout value in the case 5) (and it
> > can only do this if it knows all the migration destinations have the same
> > timeout value) and it will let the 1 to 4 to simply use infinite wait by hiding the
> > capability. The management stack should have virtio specific knowledge to
> > poke into the PCIE capabilities list to find and report the timeout and we need
> > cross machine co-ordination with the migration destination or clusters to allow
> > the migration.
> > Hypervisors like Qemu maintain the migration compatibility by using machine
> > type, but this implementation specific attributes makes it almost impossible to
> > be a part of the machine type which means virtio ad-hoc codes were needed.
> 
> With above explanation, I don’t see how QEMU can migrate a virtio device from 4 RSS queues to 2 RSS queues based device with just machine type.
> Does machine type encompass all the device level details too?
> 
> A HV needs to compose (and/or find) the device compatible to the source anyway.
> It is less difficult to find a virtio spec defined device compare to some ad-hoc device.
> 
> > > >
> > > > Similar but one more burden for the management if a simple 100ms
> > > > timeout can work for virtio-pci.
> > > Resilient spec and resilient sw, hw stack is not a burden to management
> > stack.
> > > It is just any other hw/sw attribute of the virtio device to take care.
> > 
> > This is only true if you don't expose any vendor or implementation specific
> > attributes. See above.
> > 
> It is no different than any other virtio features. If source virtio device support of vendor-A supports control VQ,
> And destination virtio device of vendor X (A/B) implementation Y (A/B), doesn't support control VQ,
> They may not be able to migrate.
> 
> HV or other migration sw needs to compatibility check or sometime composing device on destination that matches source attributes.
> 
> It is important to keep LM eyes open, but this is no different than other existing or new virtio features (say statistics of net device in guest vm supported by VF).
> 
> > >
> > > >
> > > > > In another MSI-X example you provided emulated MSIX alterative was
> > > > proposed. This is again not mapping the full hw device.
> > > > >
> > > > > So it is still sw device on other hand. User might as well just
> > > > > continue in sw
> > > > mode.
> > > > >
> > > > > Let's assume for a moment that, this (rare) case should be considered.
> > > > >
> > > > > Once way to overcome this is below.
> > > > > (a) Do not place new device_reset_timeout field in struct
> > > > virtio_pci_common_cfg.
> > > > > This is because this map be in memory mapped BAR directly
> > > > > accessible to the
> > > > guest and future version of the spec has more fields after device reset
> > timeout.
> > > > > Such fields are somehow implemented by the sw device on HV-1.
> > > > >
> > > > > (b) Instead, define new cfg_type VIRTIO_PCI_CAP_DEV_TIMEOUT_CFG.
> > > > > This points to the struct containing device reset timeout.
> > > > > Exposing this capability can be skipped when mapping the hw virtio
> > > > > device to
> > > > VM on HV-2 to match sw virtio device of HV-1.
> > > >
> > > >
> > > > This will ease the hypervisor but anyhow hypervisor can choose to
> > > > not expose the capability directly to guest (which will soon became
> > > > a burden to support cross vendor live migration).
> > > Again it is not a burden.
> > > If a live migration sw needs the ability to migrate between vendors, right
> > approach is to have attributes in the spec.
> > > So that sw has the ability to understand where can device migrate.
> > 
> > So waiting forever can make sure the guest can migrate without caring about
> > the timeout.
> > 
> > >
> > > As I explained, there are many RW features in the virtio spec that live
> > migration sw needs to take care to migrate across cross vendor.
> > 
> > Note that those features are virtio features not the transport specific features.
> > Spec only cares about how to migrate virtio, instead of migrating transport
> > specific stuff. 
> Transport is part of the virtio spec and equally part of the conformance clauses in section 7.
> I don't see how unwritten virtio LM spec won't care about it.
> 
> > Another thing is that, unlike #queue_paris or #msi_vectors, the
> > device rest timeout is not something that can be provisioned or mediated
> > before/after doing live migration. 
> We should work towards it so that if this is really needed, we can mediate it.
> I proposed the option in this thread, that we can do as device reset timeout register via VIRTIO_PCI_CAP_PCI_DEV_TIMEOUT.
> 
> This also enables to not expose in different migration cases you listed above.
> 
> > Hypervisor may just choose to disable them
> > for the guest for migration compatibility.
> > 
> Yes, but this may need disabling many things in a virtio device.
> For any to any migration, a device with lowest common denominator features in the given network will work
> Such device may miss on non-compatible features of the device.
> virtio level control VQ can be one example of it.
> Not sure how one can mediate control VQ while keeping other VQs to device after migration, if it was supported by source device without mediation.
> 
> An LM implementation needs to find/compose destination device either in device or through mediation that match source device.
> Virtio spec addition that makes the mediation easier is a good goal, but not sure all extensions can reach there.
> Some may not, and hence not enhancing the spec for it, doesn’t seem right. Though in case of device reset time, we have alternative proposal to 
> expose via new VIRTIO_PCI_CAP_PCI_DEV_TIMEOUT instead of placement inside struct virtio_pci_common_cfg.


And just to make things clear this is not a reset timeout.
It's a readiness timeout unaffected by reset.

Here's one way I see it working that will likely cause much less
contention than your current proposals:

- new capability that allows reporting device readiness.
  how it looks like, whether there is a timeout is
  a separate question. E.g. a read counter sounds
  more generic, if device is stuck it can just stop
  incrementing it.

- clarify that it's unaffected by virtio reset

- if device is accessed while not ready (e.g. with existing drivers),
  device will not exit reset. For legacy/mmio it will have to block read of
  features until it is ready.


Above is just a vague outlike that should make all this much less
controversial.


-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-27  8:26                                                                     ` Michael S. Tsirkin
@ 2021-10-28  4:01                                                                       ` Parav Pandit
  2021-10-28  5:50                                                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Parav Pandit @ 2021-10-28  4:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 27, 2021 1:56 PM
> 
> On Wed, Oct 27, 2021 at 08:04:10AM +0000, Parav Pandit wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, October 26, 2021 9:34 AM
> > >
> > > On Mon, Oct 25, 2021 at 2:11 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Monday, October 25, 2021 11:12 AM
> > > > [..]
> > > > > So what I meant is, if PCI gives us an upper limit of the reset
> > > > > (e.g 100ms as you said), can virtio device perform reset more than that?
> > > > > If not, can we simply use 100ms for virtio-pci or can we gain
> > > > > from having a
> > > smaller per device timeout?
> > > > > If yes, it looks like a violation for PCI.
> > > > No. Not at all a violation of PCI spec.
> > > > PCI spec is transport layer.
> > > > Virtio io is layer above transport.
> > > >
> > > > PCI device can be reset, but not necessary, it is ready to service
> > > > upper layer
> > > protocols who may need more time to finish the reset.
> > >
> > > Isn't transport level reset implies a virtio reset? I think this is
> > > the question we need to answer or clarify it in the spec. All my
> > > suggestions were based on the answer "yes".
> > Transport level resets are often too short. It ensures that reset occurred,
> hence, notifications and DMA activities won't occur after transport reset.
> > However, it may not have finished the upper layer reset to re-service the
> device after transport reset.
> >
> > >
> > > Another question is that we don't define reset actually,
> > It has reasonably well definition in the spec in commit: a306bf467850e
> > ("clarify device reset")
> >
> > > do we need to do that
> > > before trying to define reset timeout? To me the reset means the
> > > device can be used by the driver which means all the basic device facilities
> are ready.
> > Reset is well in spec 1.0 compare to 0.95 indicating that driver need to wait
> reset to finish.
> >
> > i.e. that driver should wait for device status to turn zero during reset and
> must do reset.
> >
> > Unlike 0.95 that explicitly stated " Reset the device. This is not required on
> initial startup." In section 2.2.1.
> 
> 
> Right. So again, I'm trying to help you here. I think the problem with your
> approach so far is that it conflates the two things, physical device startup
> might take a lot of time and it's ok since physical host boot takes very long.
> 
It is the preboot env driver that may need to use the physical virtio device before the OS driver uses it.
Physical host booting before OS takes lot less than complete physical server booting.

> This does not seem to be related to a driver init at all which is what the spec
> describes.

> 
> 
> A reset timeout which can run into seconds will break usecases such as kata
> containers. This is why there's so much resistance to adding such a feature to
> the spec. reset should be short, in fact it seems reasonable for spec to give
> guidance that it should take around ~10ms tops.
It doesn’t make sense to put such short timing requirements on the device implementation.
One size doesn't fit all.
Those use cases that requires extremely short device reset timing, will not use physical devices or device with higher reset timeout anyway.
So there is no reason to limit device timing by just one use case, which is even shorter than PCI reset timing!
 
> 
> > > It
> > > means the device is ready from the perspective of a "driver", if the
> > > device requires a connection to a remote subsystem, whether or not
> > > such connection is established is out of the scope of virtio reset.
> > Sure such implementation detail about remote system connection is outside
> the scope of spec.
> >
> > > Devices can choose not to process the requests from the virtqueue.
> > Yes, that is right, but it is not limited to only vq processing.
> > As mentioned in the spec section 3.1.1 before VQs are created feature bits,
> virtio config space values can be read only after reset step_1 is completed.
> >
> > >
> > > Another question is that, can we present a PCI device if the
> > > underlayer device logic is not ready?
> > Yes, this is very common for hw based devices, unlike synthetic devices for
> guest VMs.
> > This is also discussed in this thread with sriov example.
> >
> > > If the question is yes, do we need a new facility to tell the driver
> > > that the device is being initialized?
> > >
> > May be yes, a different register than the reset can be helpful to indicate that.
> > Though it shares lot of functionalities common to the reseting the device.
> >
> > Given that this is too early in device initialization sequence, Device
> > doesn’t know if should return fake reset_completed status and if driver is
> going to wait for initializing register or not.
> >
> > So let reset register indicate if the reset is completed or not seems more
> appropriate and less complex solution.
> 
> It's not reset though. What is not completed is startup. virtio reset can't take
> seconds.
> 
One use case as lowest common denominator doesn't drive the spec. it should be able to cover wide range of use cases.
And this proposal covers it. Kata container use case can even use the device that doesn't have timeout.

> 
> And just to make things clear this is not a reset timeout.
> It's a readiness timeout unaffected by reset.
>
Yes.
 
> Here's one way I see it working that will likely cause much less contention than
> your current proposals:
> 
> - new capability that allows reporting device readiness.

Yes. I will explore this option.

>   how it looks like, whether there is a timeout is
>   a separate question.

> E.g. a read counter sounds
>   more generic, if device is stuck it can just stop
>   incrementing it.
This is not scalable solution for hundreds of devices and it breaks the layering too.
A virtio layer itself is not in up_state to perform counter update.
A non virtio layer is expected to implement this counter. This is not right.

But this timer is not useful for publishing device readiness. Having device readiness register is enough to indicate that.
For health monitoring such counting register is useful. But its out of scope to address this problem.
So will skip this.

I will explore the device readiness register and update the proposal.
Thanks.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-28  4:01                                                                       ` Parav Pandit
@ 2021-10-28  5:50                                                                         ` Michael S. Tsirkin
  2021-10-28  6:06                                                                           ` Parav Pandit
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-28  5:50 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer

On Thu, Oct 28, 2021 at 04:01:52AM +0000, Parav Pandit wrote:
> >   how it looks like, whether there is a timeout is
> >   a separate question.
> 
> > E.g. a read counter sounds
> >   more generic, if device is stuck it can just stop
> >   incrementing it.
> This is not scalable solution for hundreds of devices and it breaks the layering too.
> A virtio layer itself is not in up_state to perform counter update.
> A non virtio layer is expected to implement this counter. This is not right.

Donnu what does the above mean. My point is simple: there's little
difference between device being stuck during firmware boot and
device being stuck right after boot.

> But this timer is not useful for publishing device readiness. Having device readiness register is enough to indicate that.
> For health monitoring such counting register is useful. But its out of scope to address this problem.
> So will skip this.
> 
> I will explore the device readiness register and update the proposal.
> Thanks.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [virtio-dev] RE: [PATCH v2] Add device reset timeout field
  2021-10-28  5:50                                                                         ` Michael S. Tsirkin
@ 2021-10-28  6:06                                                                           ` Parav Pandit
  0 siblings, 0 replies; 55+ messages in thread
From: Parav Pandit @ 2021-10-28  6:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Cornelia Huck, virtio-dev, Max Gurtovoy,
	Shahaf Shuler, Oren Duer


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 28, 2021 11:21 AM
> 
> On Thu, Oct 28, 2021 at 04:01:52AM +0000, Parav Pandit wrote:
> > >   how it looks like, whether there is a timeout is
> > >   a separate question.
> >
> > > E.g. a read counter sounds
> > >   more generic, if device is stuck it can just stop
> > >   incrementing it.
> > This is not scalable solution for hundreds of devices and it breaks the layering
> too.
> > A virtio layer itself is not in up_state to perform counter update.
> > A non virtio layer is expected to implement this counter. This is not right.
> 
> Donnu what does the above mean. My point is simple: there's little difference
> between device being stuck during firmware boot and device being stuck right
> after boot.
Virtio functionality is something to be implemented by the virtio layer in the device.
When the device is booting, virtio layer may not have yet booted to update this counter. (device is not stuck).
Some other layer before the virtio layer within the device is asked to implement virtio counters.

And hence it incorrect to have such counter to indicate that the device is booting.

A init bit is good enough to indicate that initialization has not yet finished. When the virtio layer is booted in the device, it can connect to this layer to update about its init status.
This is far more scalable and don't break layers.

Secondly, a free flowing counter implementation for tens of hundreds of devices doesn't scale in pre-init stage.


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2021-10-28  6:06 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-06 14:10 [PATCH v2] Add device reset timeout field Parav Pandit
2021-10-06 15:22 ` Michael S. Tsirkin
2021-10-06 16:11   ` Parav Pandit
2021-10-06 20:53     ` Michael S. Tsirkin
2021-10-07  3:42       ` Parav Pandit
2021-10-07 16:10         ` [virtio-dev] " Cornelia Huck
2021-10-07 17:58           ` Parav Pandit
2021-10-08 10:00             ` [virtio-dev] " Cornelia Huck
2021-10-08 10:19               ` Parav Pandit
2021-10-08 10:12             ` Michael S. Tsirkin
2021-10-08 10:51               ` Parav Pandit
2021-10-08 11:18                 ` [virtio-dev] " Michael S. Tsirkin
2021-10-08 12:55                   ` Parav Pandit
2021-10-08 10:44         ` Michael S. Tsirkin
2021-10-08 10:59           ` Parav Pandit
2021-10-08 11:21             ` Michael S. Tsirkin
2021-10-08 11:45               ` Parav Pandit
2021-10-08 11:47               ` [virtio-dev] " Cornelia Huck
2021-10-08 12:12                 ` Parav Pandit
2021-10-08 12:57                   ` Michael S. Tsirkin
2021-10-08 13:23                     ` Parav Pandit
2021-10-08 23:09                       ` Michael S. Tsirkin
2021-10-11 14:29                         ` Parav Pandit
2021-10-11 14:59                           ` [virtio-dev] " Cornelia Huck
2021-10-11 15:44                             ` Parav Pandit
2021-10-11 16:00                               ` Michael S. Tsirkin
2021-10-12  8:51                                 ` Parav Pandit
2021-10-12  9:01                                   ` Michael S. Tsirkin
2021-10-12  9:12                                     ` Parav Pandit
2021-10-14 17:35                                       ` Parav Pandit
2021-10-14 22:28                                         ` Michael S. Tsirkin
2021-10-15  4:36                                           ` Parav Pandit
2021-10-15  5:15                                             ` [virtio-dev] " Jason Wang
2021-10-15  5:20                                               ` Parav Pandit
2021-10-15  6:40                                                 ` Jason Wang
2021-10-15  6:42                                                 ` Jason Wang
2021-10-15  6:48                                                   ` Parav Pandit
2021-10-15  7:02                                                     ` Jason Wang
2021-10-15  8:21                                                       ` Parav Pandit
2021-10-15  8:42                                                         ` Jason Wang
2021-10-22  7:20                                                           ` Parav Pandit
2021-10-25  5:41                                                             ` Jason Wang
2021-10-25  6:11                                                               ` Parav Pandit
2021-10-26  4:03                                                                 ` Jason Wang
2021-10-27  8:04                                                                   ` Parav Pandit
2021-10-27  8:26                                                                     ` Michael S. Tsirkin
2021-10-28  4:01                                                                       ` Parav Pandit
2021-10-28  5:50                                                                         ` Michael S. Tsirkin
2021-10-28  6:06                                                                           ` Parav Pandit
2021-10-15  6:51                                             ` Cornelia Huck
2021-10-15  8:09                                               ` Parav Pandit
2021-10-15  9:25                                                 ` [virtio-dev] " Cornelia Huck
2021-10-22  6:29                                                   ` Parav Pandit
2021-10-11 16:22                               ` [virtio-dev] " Cornelia Huck
2021-10-12 10:35                                 ` Parav Pandit

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.