virtio-fs.lists.linux.dev archive mirror
* [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration
@ 2023-10-04 12:58 Hanna Czenczek
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS Hanna Czenczek
                   ` (8 more replies)
  0 siblings, 9 replies; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-04 12:58 UTC
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	German Maglione, Eugenio Pérez, Anton Kuchin

RFC:
https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04263.html

v1:
https://lists.nongnu.org/archive/html/qemu-devel/2023-04/msg01575.html

v2:
https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02604.html

v3:
https://lists.nongnu.org/archive/html/qemu-devel/2023-09/msg03750.html


Based-on: <20231004014532.1228637-1-stefanha@redhat.com>
          ([PATCH v2 0/3] vhost: clean up device reset)


Hi,

This v4 includes largely unchanged patches from v3.  The main
addition/change is what came out of the discussion between Stefan and me
around how to proceed without SUSPEND/RESUME, which is that this series
is now based on his reset fix, and it includes more documentation
changes.

Changes in detail:

- Patch 1: Fall-out from the reset fix: Currently, the status byte is
  effectively unused (qemu only uses it for resetting, which all
  back-ends ignore; DPDK uses it to announce potential feature
  negotiation failure, which qemu ignores).  Nor is it defined what
  exactly the front-end or back-end should do with this byte; the
  documentation only points to the virtio spec, which naturally does not
  say how this integrates with vhost-user’s RESET_DEVICE or
  [GS]ET_FEATURES.
  Furthermore, there does not seem to be a use for this; we have
  RESET_DEVICE for resetting, and we have [GS]ET_FEATURES (and
  REPLY_ACK, which can be used on SET_FEATURES) for feature
  negotiation.
  Therefore, deprecate the status byte, pointing to those other commands
  instead.

- Patch 2: Patch 4 defines a suspended state for the whole back-end if
  all vrings are stopped.  I think this should be mentioned in
  GET_VRING_BASE, but upon trying to add it, I found that it does not
  even mention that it stops the vring (mentioned only in the Ring
  States section), and remembered that the whole description of both
  GET_VRING_BASE and SET_VRING_BASE really was not helpful when trying
  to implement a vhost-user back-end.  Took the opportunity to overhaul
  both.

- Patch 3: This one’s from v3, but quite heavily modified.  Stefan
  suggested consistently defining the started/stopped and
  enabled/disabled states to be independent, and indeed doing so
  simplifies a whole lot of stuff.  Specifically, it makes the magic
  “enabled/disabled when started” go away.  Basically, I found this
  change alone is enough to remove the confusion I had with the existing
  documentation.

- Patch 4: As suggested by Stefan, just define a suspended state without
  introducing SUSPEND.  vDPA needs SUSPEND because its GET_VRING_BASE
  does not stop the vring, but vhost-user’s does, so we can define the
  suspended state to be when all vrings are stopped.

- Patch 5: Reference the suspended state.

- Patches 6 through 8: Unmodified, except for being rebased on
  Stefan’s series.


Hanna Czenczek (8):
  vhost-user.rst: Deprecate [GS]ET_STATUS
  vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  vhost-user.rst: Clarify enabling/disabling vrings
  vhost-user.rst: Introduce suspended state
  vhost-user.rst: Migrating back-end-internal state
  vhost-user: Interface for migration state transfer
  vhost: Add high-level state save/load functions
  vhost-user-fs: Implement internal migration

 docs/interop/vhost-user.rst       | 318 +++++++++++++++++++++++++++---
 include/hw/virtio/vhost-backend.h |  24 +++
 include/hw/virtio/vhost.h         | 113 +++++++++++
 hw/virtio/vhost-user-fs.c         | 101 +++++++++-
 hw/virtio/vhost-user.c            | 148 ++++++++++++++
 hw/virtio/vhost.c                 | 241 ++++++++++++++++++++++
 6 files changed, 917 insertions(+), 28 deletions(-)

-- 
2.41.0



* [Virtio-fs] [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS
  2023-10-04 12:58 [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek
@ 2023-10-04 12:58 ` Hanna Czenczek
  2023-10-05 17:08   ` Stefan Hajnoczi
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc Hanna Czenczek
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-04 12:58 UTC
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	German Maglione, Eugenio Pérez, Anton Kuchin

There is no clearly defined purpose for the virtio status byte in
vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
protocol extension, it is possible for SET_FEATURES to return errors
(SET_PROTOCOL_FEATURES may be called before SET_FEATURES).

As for implementations, SET_STATUS is not widely implemented.  dpdk does
implement it, but only uses it to signal feature negotiation failure.
While it does log reset requests (SET_STATUS 0) as such, it effectively
ignores them, in contrast to RESET_OWNER (which is deprecated, and today
means the same thing as RESET_DEVICE).

While qemu superficially has support for [GS]ET_STATUS, it does not
forward the guest-set status byte, but instead just makes it up
internally, and actually completely ignores what the back-end returns,
only using it as the template for a subsequent SET_STATUS to add single
bits to it.  Notably, after setting FEATURES_OK, it never reads it back
to see whether the flag is still set, which is the only way in which
dpdk uses the status byte.

As-is, no front-end or back-end can rely on the other side handling this
field in a useful manner, and it also provides no practical use over
other mechanisms the vhost-user protocol has, which are more clearly
defined.  Deprecate it.

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 5a070adbc1..2f68e67a1a 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1424,21 +1424,35 @@ Front-end message types
   :request payload: ``u64``
   :reply payload: N/A
 
-  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
-  successfully negotiated, this message is submitted by the front-end to
-  notify the back-end with updated device status as defined in the Virtio
+.. admonition:: Deprecated
+
+  This is no longer used. Used to be sent by the front-end to notify the
+  back-end with updated device status as defined in the Virtio
   specification.
 
+  However, its purpose in vhost-user was never well-defined; for
+  example, how or if it would replace VHOST_USER_RESET_DEVICE, or how it
+  integrates with the feature negotiation phase.  Therefore,
+  implementations in practice were less than strict in how the status
+  value was handled, which means there was actually no protocol between
+  front-end and back-end on the use of the status value.
+
+  For resetting, use VHOST_USER_RESET_DEVICE instead.  For feature
+  negotiation with acknowledgment from the device, use
+  VHOST_USER_SET_FEATURES with the :ref:`REPLY_ACK <reply_ack>` feature
+  instead.
+
 ``VHOST_USER_GET_STATUS``
   :id: 40
   :equivalent ioctl: VHOST_VDPA_GET_STATUS
   :request payload: N/A
   :reply payload: ``u64``
 
-  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
-  successfully negotiated, this message is submitted by the front-end to
-  query the back-end for its device status as defined in the Virtio
-  specification.
+.. admonition:: Deprecated
+
+  This is no longer used. Used to be sent by the front-end to query the
+  back-end for its device status as defined in the Virtio specification.
+  Deprecated together with VHOST_USER_SET_STATUS.
 
 
 Back-end message types
-- 
2.41.0



* [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  2023-10-04 12:58 [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS Hanna Czenczek
@ 2023-10-04 12:58 ` Hanna Czenczek
  2023-10-05 17:38   ` Stefan Hajnoczi
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings Hanna Czenczek
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-04 12:58 UTC
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	German Maglione, Eugenio Pérez, Anton Kuchin

GET_VRING_BASE does not mention that it stops the respective ring.  Fix
that.

Furthermore, it is not fully clear what the "base offset" these
commands' documentation refers to is; an offset could be many things.
Be more precise and verbose about it, especially given that these
commands use different payload structures depending on whether the vring
is split or packed.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++---
 1 file changed, 62 insertions(+), 4 deletions(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 2f68e67a1a..50f5acebe5 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -108,6 +108,37 @@ A vring state description
 
 :num: a 32-bit number
 
+A vring descriptor index for split virtqueues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++-------------+---------------------+
+| vring index | index in avail ring |
++-------------+---------------------+
+
+:vring index: 32-bit index of the respective virtqueue
+
+:index in avail ring: 32-bit value, of which currently only the lower 16
+  bits are used:
+
+  - Bits 0–15: Next descriptor index in the *Available Ring*
+  - Bits 16–31: Reserved (set to zero)
+
+Vring descriptor indices for packed virtqueues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++-------------+--------------------+
+| vring index | descriptor indices |
++-------------+--------------------+
+
+:vring index: 32-bit index of the respective virtqueue
+
+:descriptor indices: 32-bit value:
+
+  - Bits 0–14: Index in the *Available Ring*
+  - Bit 15: Driver (Available) Ring Wrap Counter
+  - Bits 16–30: Index in the *Used Ring*
+  - Bit 31: Device (Used) Ring Wrap Counter
+
 A vring address description
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -1031,18 +1062,45 @@ Front-end message types
 ``VHOST_USER_SET_VRING_BASE``
   :id: 10
   :equivalent ioctl: ``VHOST_SET_VRING_BASE``
-  :request payload: vring state description
+  :request payload: vring descriptor index/indices
   :reply payload: N/A
 
-  Sets the base offset in the available vring.
+  Sets the next index to use for descriptors in this vring:
+
+  * For a split virtqueue, sets only the next descriptor index in the
+    *Available Ring*.  The device is supposed to read the next index in
+    the *Used Ring* from the respective vring structure in guest memory.
+
+  * For a packed virtqueue, both indices are supplied, as they are not
+    explicitly available in memory.
+
+  Consequently, the payload type is specific to the type of virtqueue
+  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
+  indices for packed virtqueues*).
 
 ``VHOST_USER_GET_VRING_BASE``
   :id: 11
   :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE``
   :request payload: vring state description
-  :reply payload: vring state description
+  :reply payload: vring descriptor index/indices
+
+  Stops the vring and returns the current descriptor index or indices:
+
+    * For a split virtqueue, returns only the 16-bit next descriptor
+      index in the *Available Ring*.  The index in the *Used Ring* is
+      controlled by the guest driver and can be read from the vring
+      structure in memory, so is not covered.
+
+    * For a packed virtqueue, neither index is explicitly available to
+      read from memory, so both indices (as maintained by the device) are
+      returned.
+
+  Consequently, the payload type is specific to the type of virtqueue
+  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
+  indices for packed virtqueues*).
 
-  Get the available vring base offset.
+  The request payload’s *num* field is currently reserved and must be
+  set to 0.
 
 ``VHOST_USER_SET_VRING_KICK``
   :id: 12
-- 
2.41.0
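
For illustration only, the packed-virtqueue payload documented in this
patch can be packed and unpacked as in the following C sketch.  The
struct and function names are hypothetical and do not come from qemu or
any back-end; only the bit layout (bits 0 to 14, bit 15, bits 16 to 30,
bit 31) is taken from the documentation added above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container; the names are illustrative, not from qemu. */
struct example_packed_indices {
    uint16_t avail_idx;   /* bits 0-14: index in the Available Ring */
    bool     avail_wrap;  /* bit 15: driver (available) ring wrap counter */
    uint16_t used_idx;    /* bits 16-30: index in the Used Ring */
    bool     used_wrap;   /* bit 31: device (used) ring wrap counter */
};

/* Build the 32-bit "descriptor indices" value for a packed virtqueue. */
static inline uint32_t example_packed_encode(struct example_packed_indices v)
{
    return (uint32_t)(v.avail_idx & 0x7fff)
         | ((uint32_t)v.avail_wrap << 15)
         | ((uint32_t)(v.used_idx & 0x7fff) << 16)
         | ((uint32_t)v.used_wrap << 31);
}

/* Split a 32-bit "descriptor indices" value back into its four fields. */
static inline struct example_packed_indices example_packed_decode(uint32_t num)
{
    struct example_packed_indices v = {
        .avail_idx  = (uint16_t)(num & 0x7fff),
        .avail_wrap = (num >> 15) & 1,
        .used_idx   = (uint16_t)((num >> 16) & 0x7fff),
        .used_wrap  = (num >> 31) & 1,
    };
    return v;
}

/* For a split virtqueue, only the lower 16 bits carry the index. */
static inline uint16_t example_split_decode(uint32_t num)
{
    return (uint16_t)(num & 0xffff);
}
```

A back-end would fill such a value into its GET_VRING_BASE reply, and a
front-end would pass the same value back via SET_VRING_BASE on the
destination; how the value is embedded in the message is not shown here.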



* [Virtio-fs] [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings
  2023-10-04 12:58 [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS Hanna Czenczek
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc Hanna Czenczek
@ 2023-10-04 12:58 ` Hanna Czenczek
  2023-10-05 17:43   ` Stefan Hajnoczi
  2023-10-18 12:14   ` Michael S. Tsirkin
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 4/8] vhost-user.rst: Introduce suspended state Hanna Czenczek
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-04 12:58 UTC
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	German Maglione, Eugenio Pérez, Anton Kuchin

Currently, the vhost-user documentation says that rings are to be
initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is
negotiated.  However, by the time of feature negotiation, all rings have
already been initialized, so it is not entirely clear what this means.

At least the vhost-user-backend Rust crate's implementation interpreted
it to mean that whenever this feature is negotiated, all rings are to
be put into a disabled state, which means that every SET_FEATURES call
would disable all rings, effectively halting the device.  This is
problematic because the VHOST_F_LOG_ALL feature is also set or cleared
this way, which happens during migration.  Doing so should not halt the
device.

Other implementations have interpreted this to mean that the device is
to be initialized with all rings disabled, and a subsequent SET_FEATURES
call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of
them.  Here, SET_FEATURES will never disable any ring.

This interpretation does not suffer the problem of unintentionally
halting the device whenever features are set or cleared, so it seems
better and more reasonable.

We can clarify this in the documentation by making it explicit that the
enabled/disabled state is tracked even while the vring is stopped.
Every vring is initialized in a disabled state, and SET_FEATURES without
VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all
vrings.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 docs/interop/vhost-user.rst | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 50f5acebe5..9f4940a036 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -395,31 +395,33 @@ negotiation.
 Ring states
 -----------
 
-Rings can be in one of three states:
+Rings have two independent states: started/stopped, and enabled/disabled.
 
-* stopped: the back-end must not process the ring at all.
+* While a ring is stopped, the back-end must not process the ring at
+  all, regardless of whether it is enabled or disabled.  The
+  enabled/disabled state should still be tracked, though, so it can come
+  into effect once the ring is started.
 
-* started but disabled: the back-end must process the ring without
+* started and disabled: The back-end must process the ring without
   causing any side effects.  For example, for a networking device,
   in the disabled state the back-end must not supply any new RX packets,
   but must process and discard any TX packets.
 
-* started and enabled.
+* started and enabled: The back-end must process the ring normally, i.e.
+  process all requests and execute them.
 
-Each ring is initialized in a stopped state.  The back-end must start
-ring upon receiving a kick (that is, detecting that file descriptor is
-readable) on the descriptor specified by ``VHOST_USER_SET_VRING_KICK``
-or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated,
-and stop ring upon receiving ``VHOST_USER_GET_VRING_BASE``.
+Each ring is initialized in a stopped and disabled state.  The back-end
+must start a ring upon receiving a kick (that is, detecting that file
+descriptor is readable) on the descriptor specified by
+``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
+``VHOST_USER_VRING_KICK`` if negotiated, and stop a ring upon receiving
+``VHOST_USER_GET_VRING_BASE``.
 
 Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
 
-If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
-ring starts directly in the enabled state.
-
-If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
-initialized in a disabled state and is enabled by
-``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.
+In addition, upon receiving a ``VHOST_USER_SET_FEATURES`` message from
+the front-end without ``VHOST_USER_F_PROTOCOL_FEATURES`` set, the
+back-end must enable all rings immediately.
 
 While processing the rings (whether they are enabled or not), the back-end
 must support changing some configuration aspects on the fly.
-- 
2.41.0
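
For illustration only, the two independent ring states described in this
patch could be tracked in a back-end roughly as in the C sketch below.
All names (including EXAMPLE_NUM_RINGS and its value) are hypothetical;
this is a sketch of the documented rules, not how qemu, DPDK, or the
vhost-user-backend crate implement them.

```c
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_NUM_RINGS 4  /* arbitrary value for this sketch */

struct example_ring {
    bool started;  /* started/stopped */
    bool enabled;  /* enabled/disabled, tracked even while stopped */
};

struct example_backend {
    struct example_ring rings[EXAMPLE_NUM_RINGS];
};

/* Every ring is initialized in a stopped and disabled state. */
static inline void example_backend_init(struct example_backend *b)
{
    for (size_t i = 0; i < EXAMPLE_NUM_RINGS; i++) {
        b->rings[i].started = false;
        b->rings[i].enabled = false;
    }
}

/* A kick starts the ring; the enabled/disabled state is untouched. */
static inline void example_ring_kick(struct example_ring *r)
{
    r->started = true;
}

/* GET_VRING_BASE stops the ring; the enabled/disabled state is untouched. */
static inline void example_ring_get_base(struct example_ring *r)
{
    r->started = false;
}

/* SET_VRING_ENABLE changes only the enabled/disabled state. */
static inline void example_ring_set_enable(struct example_ring *r, bool enable)
{
    r->enabled = enable;
}

/* SET_FEATURES without F_PROTOCOL_FEATURES enables all rings; it never
 * disables any ring. */
static inline void example_backend_set_features(struct example_backend *b,
                                                bool protocol_features_set)
{
    if (!protocol_features_set) {
        for (size_t i = 0; i < EXAMPLE_NUM_RINGS; i++) {
            b->rings[i].enabled = true;
        }
    }
}

/* Started rings are processed, whether enabled or not. */
static inline bool example_ring_processes(const struct example_ring *r)
{
    return r->started;
}

/* Only started and enabled rings execute requests with side effects. */
static inline bool example_ring_executes(const struct example_ring *r)
{
    return r->started && r->enabled;
}
```

Keeping the two flags separate is what lets SET_FEATURES be sent during
migration (e.g. to toggle VHOST_F_LOG_ALL) without halting the device.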



* [Virtio-fs] [PATCH v4 4/8] vhost-user.rst: Introduce suspended state
  2023-10-04 12:58 [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek
                   ` (2 preceding siblings ...)
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings Hanna Czenczek
@ 2023-10-04 12:59 ` Hanna Czenczek
  2023-10-05 17:44   ` Stefan Hajnoczi
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state Hanna Czenczek
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-04 12:59 UTC
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	German Maglione, Eugenio Pérez, Anton Kuchin

In vDPA, GET_VRING_BASE does not stop the queried vring, which is why
SUSPEND was introduced so that the returned index would be stable.  In
vhost-user, it does stop the vring, so under the same reasoning, it can
get away without SUSPEND.

Still, we do want to clarify that if the device is completely stopped,
i.e. all vrings are stopped, the back-end should cease to modify any
state relating to the guest.  Do this by calling it "suspended".

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 docs/interop/vhost-user.rst | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 9f4940a036..d282155562 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -426,6 +426,19 @@ back-end must enable all rings immediately.
 While processing the rings (whether they are enabled or not), the back-end
 must support changing some configuration aspects on the fly.
 
+.. _suspended_device_state:
+
+Suspended device state
+^^^^^^^^^^^^^^^^^^^^^^
+
+While all vrings are stopped, the device is *suspended*.  In addition to
+not processing any vring (because they are stopped), the device must:
+
+* not write to any guest memory regions,
+* not send any notifications to the guest,
+* not send any messages to the front-end,
+* still process and reply to messages from the front-end.
+
 Multiple queue support
 ----------------------
 
@@ -513,7 +526,8 @@ ancillary data, it may be used to inform the front-end that the log has
 been modified.
 
 Once the source has finished migration, rings will be stopped by the
-source. No further update must be done before rings are restarted.
+source (:ref:`Suspended device state <suspended_device_state>`). No
+further update must be done before rings are restarted.
 
 In postcopy migration the back-end is started before all the memory has
 been received from the source host, and care must be taken to avoid
@@ -1101,6 +1115,10 @@ Front-end message types
   (*a vring descriptor index for split virtqueues* vs. *vring descriptor
   indices for packed virtqueues*).
 
+  When and as long as all of a device’s vrings are stopped, it is
+  *suspended*, see :ref:`Suspended device state
+  <suspended_device_state>`.
+
   The request payload’s *num* field is currently reserved and must be
   set to 0.
 
-- 
2.41.0
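
For illustration only, the definition of the suspended state in this
patch reduces to a check over the per-ring started flags.  The function
name and the array-of-flags representation are assumptions made for this
sketch, not part of the protocol or of any implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * The device is suspended while all of its vrings are stopped.
 * ring_started[i] is true if ring i is currently started.
 * (With num_rings == 0 this sketch reports "suspended".)
 */
static inline bool example_device_is_suspended(const bool *ring_started,
                                               size_t num_rings)
{
    for (size_t i = 0; i < num_rings; i++) {
        if (ring_started[i]) {
            return false;
        }
    }
    return true;
}
```

A back-end could consult such a check before writing to guest memory,
notifying the guest, or sending a message to the front-end, none of
which is allowed while suspended; replying to front-end messages remains
required.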



* [Virtio-fs] [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state
  2023-10-04 12:58 [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek
                   ` (3 preceding siblings ...)
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 4/8] vhost-user.rst: Introduce suspended state Hanna Czenczek
@ 2023-10-04 12:59 ` Hanna Czenczek
  2023-10-05 17:46   ` Stefan Hajnoczi
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 6/8] vhost-user: Interface for migration state transfer Hanna Czenczek
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-04 12:59 UTC
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	German Maglione, Eugenio Pérez, Anton Kuchin

For vhost-user devices, qemu can migrate the virtio state, but not the
back-end's internal state.  To do so, we need to be able to transfer
this internal state between front-end (qemu) and back-end.

At this point, this new feature is added for the purpose of virtio-fs
migration.  Because virtiofsd's internal state will not be too large, we
believe it is best to transfer it as a single binary blob after the
streaming phase.

These are the additions to the protocol:
- New vhost-user protocol feature VHOST_USER_PROTOCOL_F_DEVICE_STATE
- SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a file
  descriptor over which to transfer the state.
- CHECK_DEVICE_STATE: After the state has been transferred through the
  file descriptor, the front-end invokes this function to verify
  success.  There is no in-band way (through the file descriptor) to
  indicate failure, so we need to check explicitly.

Once the transfer FD has been established via SET_DEVICE_STATE_FD
(which includes establishing the direction of transfer and migration
phase), the sending side writes its data into it, and the reading side
reads it until it sees an EOF.  Then, the front-end will check for
success via CHECK_DEVICE_STATE, which on the destination side includes
checking for integrity (i.e. errors during deserialization).
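
For illustration only, the pipe-like transfer described above (writer
writes everything and closes, reader reads until EOF) might look like
the following C sketch using plain POSIX read()/write().  The function
names are hypothetical, and the sketch does not model the vhost-user
messages themselves; in particular, the CHECK_DEVICE_STATE step that
follows the transfer is not shown.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Writing end: write the whole state, then close the FD to signal the
 * end of the transfer.  Returns 0 on success, -1 on error.
 */
static inline int example_write_state(int fd, const unsigned char *data,
                                      size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, data, len);

        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            close(fd);
            return -1;
        }
        data += n;
        len -= (size_t)n;
    }
    return close(fd);
}

/*
 * Reading end: read until EOF.  Returns a malloc()ed buffer (the caller
 * frees it) and stores its length in *len_out; returns NULL on error.
 */
static inline unsigned char *example_read_state(int fd, size_t *len_out)
{
    size_t cap = 4096, len = 0;
    unsigned char *buf = malloc(cap);

    if (!buf) {
        return NULL;
    }
    for (;;) {
        ssize_t n;

        if (len == cap) {
            /* Buffer is full: double it before reading more. */
            unsigned char *tmp = realloc(buf, cap * 2);

            if (!tmp) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        n = read(fd, buf + len, cap - len);
        if (n == 0) {
            break; /* EOF: the writer closed its end */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            free(buf);
            return NULL;
        }
        len += (size_t)n;
    }
    *len_out = len;
    return buf;
}
```

Since EOF carries no success or failure information, a reader that gets
a complete-looking buffer still cannot tell whether the writer finished
cleanly, which is why the explicit check is needed.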

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 docs/interop/vhost-user.rst | 172 ++++++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index d282155562..aa91e2b34e 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -306,6 +306,32 @@ Inflight description
 
 :queue size: a 16-bit size of virtqueues
 
+Device state transfer parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++--------------------+-----------------+
+| transfer direction | migration phase |
++--------------------+-----------------+
+
+:transfer direction: a 32-bit enum, describing the direction in which
+  the state is transferred:
+
+  - 0: Save: Transfer the state from the back-end to the front-end,
+    which happens on the source side of migration
+  - 1: Load: Transfer the state from the front-end to the back-end,
+    which happens on the destination side of migration
+
+:migration phase: a 32-bit enum, describing the state in which the VM
+  guest and devices are:
+
+  - 0: Stopped (in the period after the transfer of memory-mapped
+    regions before switch-over to the destination): The VM guest is
+    stopped, and the vhost-user device is suspended (see
+    :ref:`Suspended device state <suspended_device_state>`).
+
+  In the future, additional phases might be added e.g. to allow
+  iterative migration while the device is running.
+
 C structure
 -----------
 
@@ -365,6 +391,7 @@ in the ancillary data:
 * ``VHOST_USER_SET_VRING_ERR``
 * ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
 * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
+* ``VHOST_USER_SET_DEVICE_STATE_FD``
 
 If *front-end* is unable to send the full message or receives a wrong
 reply it will close the connection. An optional reconnection mechanism
@@ -539,6 +566,80 @@ it performs WAKE ioctl's on the userfaultfd to wake the stalled
 back-end.  The front-end indicates support for this via the
 ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.
 
+.. _migrating_backend_state:
+
+Migrating back-end state
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Migrating device state involves transferring the state from one
+back-end, called the source, to another back-end, called the
+destination.  After migration, the destination transparently resumes
+operation without requiring the driver to re-initialize the device at
+the VIRTIO level.  If the migration fails, then the source can
+transparently resume operation until another migration attempt is made.
+
+Generally, the front-end is connected to a virtual machine guest (which
+contains the driver), which has its own state to transfer between source
+and destination, and therefore will have an implementation-specific
+mechanism to do so.  The ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature
+provides functionality to have the front-end include the back-end's
+state in this transfer operation so the back-end does not need to
+implement its own mechanism, and so the virtual machine may have its
+complete state, including vhost-user devices' states, contained within a
+single stream of data.
+
+To do this, the back-end state is transferred from back-end to front-end
+on the source side, and vice versa on the destination side.  This
+transfer happens over a channel that is negotiated using the
+``VHOST_USER_SET_DEVICE_STATE_FD`` message.  This message has two
+parameters:
+
+* Direction of transfer: On the source, the data is saved, transferring
+  it from the back-end to the front-end.  On the destination, the data
+  is loaded, transferring it from the front-end to the back-end.
+
+* Migration phase: Currently, the only supported phase is the period
+  after the transfer of memory-mapped regions before switch-over to the
+  destination, when both the source and destination devices are
+  suspended (:ref:`Suspended device state <suspended_device_state>`).
+  In the future, additional phases might be supported to allow iterative
+  migration while the device is running.
+
+The nature of the channel is implementation-defined, but it must
+generally behave like a pipe: The writing end will write all the data it
+has into it, signalling the end of data by closing its end.  The reading
+end must read all of this data (until encountering the end of file) and
+process it.
+
+* When saving, the writing end is the source back-end, and the reading
+  end is the source front-end.  After reading the state data from the
+  channel, the source front-end must transfer it to the destination
+  front-end through an implementation-defined mechanism.
+
+* When loading, the writing end is the destination front-end, and the
+  reading end is the destination back-end.  After reading the state data
+  from the channel, the destination back-end must deserialize its
+  internal state from that data and set itself up to allow the driver to
+  seamlessly resume operation on the VIRTIO level.
+
+Seamlessly resuming operation means that the migration must be
+transparent to the guest driver, which operates on the VIRTIO level.
+This driver will not perform any re-initialization steps, but continue
+to use the device as if no migration had occurred.  The vhost-user
+front-end, however, will re-initialize the vhost state on the
+destination, following the usual protocol for establishing a connection
+to a vhost-user back-end: This includes, for example, setting up memory
+mappings and kick and call FDs as necessary, negotiating protocol
+features, or setting the initial vring base indices (to the same value
+as on the source side, so that operation can resume).
+
+Both on the source and on the destination side, after the respective
+front-end has seen all data transferred (when the transfer FD has been
+closed), it sends the ``VHOST_USER_CHECK_DEVICE_STATE`` message to
+verify that data transfer was successful in the back-end, too.  The
+back-end responds once it knows whether the transfer and processing was
+successful or not.
+
 Memory access
 -------------
 
@@ -932,6 +1033,7 @@ Protocol features
   #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
   #define VHOST_USER_PROTOCOL_F_STATUS               16
   #define VHOST_USER_PROTOCOL_F_XEN_MMAP             17
+  #define VHOST_USER_PROTOCOL_F_DEVICE_STATE         18
 
 Front-end message types
 -----------------------
@@ -1532,6 +1634,76 @@ Front-end message types
   back-end for its device status as defined in the Virtio specification.
   Deprecated together with VHOST_USER_SET_STATUS.
 
+``VHOST_USER_SET_DEVICE_STATE_FD``
+  :id: 41
+  :equivalent ioctl: N/A
+  :request payload: device state transfer parameters
+  :reply payload: ``u64``
+
+  Front-end and back-end negotiate a channel over which to transfer the
+  back-end’s internal state during migration.  Either side (front-end or
+  back-end) may create the channel.  The nature of this channel is not
+  restricted or defined in this document, but whichever side creates it
+  must create a file descriptor that is provided to the respectively
+  other side, allowing access to the channel.  This FD must behave as
+  follows:
+
+  * For the writing end, it must allow writing the whole back-end state
+    sequentially.  Closing the file descriptor signals the end of
+    transfer.
+
+  * For the reading end, it must allow reading the whole back-end state
+    sequentially.  The end of file signals the end of the transfer.
+
+  For example, the channel may be a pipe, in which case the two ends of
+  the pipe fulfill these requirements respectively.
+
+  Initially, the front-end creates a channel along with such an FD.  It
+  passes the FD to the back-end as ancillary data of a
+  ``VHOST_USER_SET_DEVICE_STATE_FD`` message.  The back-end may create a
+  different transfer channel, passing the respective FD back to the
+  front-end as ancillary data of the reply.  If so, the front-end must
+  then discard its channel and use the one provided by the back-end.
+
+  Whether the back-end should decide to use its own channel is decided
+  based on efficiency: If the channel is a pipe, both ends will most
+  likely need to copy data into and out of it.  Any channel that allows
+  for more efficient processing on at least one end, e.g. through
+  zero-copy, is considered more efficient and thus preferred.  If the
+  back-end can provide such a channel, it should decide to use it.
+
+  The request payload contains parameters for the subsequent data
+  transfer, as described in the :ref:`Migrating back-end state
+  <migrating_backend_state>` section.
+
+  The value returned indicates both success or failure and whether a
+  file descriptor for a back-end-provided channel is returned: Bits 0–7
+  are 0 on success, and non-zero on error.  Bit 8 is the invalid FD
+  flag; this flag is set when there is no file descriptor returned.
+  When this flag is not set, the front-end must use the returned file
+  descriptor as its end of the transfer channel.  The back-end must not
+  both indicate an error and return a file descriptor.
+
+  Using this function requires prior negotiation of the
+  ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature.
+
+``VHOST_USER_CHECK_DEVICE_STATE``
+  :id: 42
+  :equivalent ioctl: N/A
+  :request payload: N/A
+  :reply payload: ``u64``
+
+  After transferring the back-end’s internal state during migration (see
+  the :ref:`Migrating back-end state <migrating_backend_state>`
+  section), check whether the back-end was able to process the state
+  successfully and completely.
+
+  The value returned indicates success or error; 0 is success, any
+  non-zero value is an error.
+
+  Using this function requires prior negotiation of the
+  ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature.
+
 
 Back-end message types
 ----------------------
-- 
2.41.0



* [Virtio-fs] [PATCH v4 6/8] vhost-user: Interface for migration state transfer
  2023-10-04 12:58 [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek
                   ` (4 preceding siblings ...)
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state Hanna Czenczek
@ 2023-10-04 12:59 ` Hanna Czenczek
  2023-10-05 17:46   ` Stefan Hajnoczi
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 7/8] vhost: Add high-level state save/load functions Hanna Czenczek
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-04 12:59 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	German Maglione, Eugenio Pérez, Anton Kuchin

Add the interface for transferring the back-end's state during migration
as defined previously in vhost-user.rst.
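
For illustration, the u64 reply to SET_DEVICE_STATE_FD described in
vhost-user.rst (bits 0-7: status, bit 8: invalid FD flag) could be
decoded roughly as follows.  This is a minimal standalone sketch, not
the code added by this patch; the macro, enum and function names are
hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 0-7 carry the status code: 0 on success, non-zero on error */
#define STATE_FD_REPLY_ERROR_MASK  UINT64_C(0xff)
/* Bit 8 is the "invalid FD" flag: set when the reply carries no FD */
#define STATE_FD_REPLY_NOFD_FLAG   (UINT64_C(1) << 8)

/* Hypothetical classification of a SET_DEVICE_STATE_FD reply */
typedef enum {
    STATE_FD_REPLY_ERROR,        /* back-end rejected the transfer */
    STATE_FD_REPLY_KEEP_OWN_FD,  /* success, keep using our own channel */
    STATE_FD_REPLY_USE_PEER_FD,  /* success, switch to the returned FD */
} StateFdReplyKind;

static StateFdReplyKind classify_state_fd_reply(uint64_t reply)
{
    /* An error takes precedence: no FD may accompany an error */
    if ((reply & STATE_FD_REPLY_ERROR_MASK) != 0) {
        return STATE_FD_REPLY_ERROR;
    }

    /* Flag set: the back-end did not provide a channel of its own */
    if (reply & STATE_FD_REPLY_NOFD_FLAG) {
        return STATE_FD_REPLY_KEEP_OWN_FD;
    }

    /* Flag clear: an FD is attached as ancillary data of the reply */
    return STATE_FD_REPLY_USE_PEER_FD;
}
```

A front-end would only try to fetch the ancillary FD in the
USE_PEER_FD case, and would discard its own end of the channel then.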

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 include/hw/virtio/vhost-backend.h |  24 +++++
 include/hw/virtio/vhost.h         |  78 ++++++++++++++++
 hw/virtio/vhost-user.c            | 148 ++++++++++++++++++++++++++++++
 hw/virtio/vhost.c                 |  37 ++++++++
 4 files changed, 287 insertions(+)

diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index 31a251a9f5..b6eee7e9fd 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
     VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
 } VhostSetConfigType;
 
+typedef enum VhostDeviceStateDirection {
+    /* Transfer state from back-end (device) to front-end */
+    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
+    /* Transfer state from front-end to back-end (device) */
+    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
+} VhostDeviceStateDirection;
+
+typedef enum VhostDeviceStatePhase {
+    /* The device (and all its vrings) is stopped */
+    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
+} VhostDeviceStatePhase;
+
 struct vhost_inflight;
 struct vhost_dev;
 struct vhost_log;
@@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
 
 typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
 
+typedef bool (*vhost_supports_device_state_op)(struct vhost_dev *dev);
+typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
+                                            VhostDeviceStateDirection direction,
+                                            VhostDeviceStatePhase phase,
+                                            int fd,
+                                            int *reply_fd,
+                                            Error **errp);
+typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -181,6 +202,9 @@ typedef struct VhostOps {
     vhost_force_iommu_op vhost_force_iommu;
     vhost_set_config_call_op vhost_set_config_call;
     vhost_reset_status_op vhost_reset_status;
+    vhost_supports_device_state_op vhost_supports_device_state;
+    vhost_set_device_state_fd_op vhost_set_device_state_fd;
+    vhost_check_device_state_op vhost_check_device_state;
 } VhostOps;
 
 int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 14621f9e79..a0d03c9fdf 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -348,4 +348,82 @@ static inline int vhost_reset_device(struct vhost_dev *hdev)
 }
 #endif /* CONFIG_VHOST */
 
+/**
+ * vhost_supports_device_state(): Checks whether the back-end supports
+ * transferring internal device state for the purpose of migration.
+ * Support for this feature is required for vhost_set_device_state_fd()
+ * and vhost_check_device_state().
+ *
+ * @dev: The vhost device
+ *
+ * Returns true if the device supports these commands, and false if it
+ * does not.
+ */
+bool vhost_supports_device_state(struct vhost_dev *dev);
+
+/**
+ * vhost_set_device_state_fd(): Begin transfer of internal state from/to
+ * the back-end for the purpose of migration.  Data is to be transferred
+ * over a pipe according to @direction and @phase.  The sending end must
+ * only write to the pipe, and the receiving end must only read from it.
+ * Once the sending end is done, it closes its FD.  The receiving end
+ * must take this as the end-of-transfer signal and close its FD, too.
+ *
+ * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
+ * read FD for LOAD.  This function transfers ownership of @fd to the
+ * back-end, i.e. closes it in the front-end.
+ *
+ * The back-end may optionally reply with an FD of its own, if this
+ * improves efficiency on its end.  In this case, the returned FD is
+ * stored in *reply_fd.  The back-end will discard the FD sent to it,
+ * and the front-end must use *reply_fd for transferring state to/from
+ * the back-end.
+ *
+ * @dev: The vhost device
+ * @direction: The direction in which the state is to be transferred.
+ *             For outgoing migrations, this is SAVE, and data is read
+ *             from the back-end and stored by the front-end in the
+ *             migration stream.
+ *             For incoming migrations, this is LOAD, and data is read
+ *             by the front-end from the migration stream and sent to
+ *             the back-end to restore the saved state.
+ * @phase: Which migration phase we are in.  Currently, there is only
+ *         STOPPED (device and all vrings are stopped); in the future,
+ *         more phases such as PRE_COPY or POST_COPY may be added.
+ * @fd: Back-end's end of the pipe through which to transfer state; note
+ *      that ownership is transferred to the back-end, so this function
+ *      closes @fd in the front-end.
+ * @reply_fd: If the back-end wishes to use a different pipe for state
+ *            transfer, this will contain an FD for the front-end to
+ *            use.  Otherwise, -1 is stored here.
+ * @errp: Potential error description
+ *
+ * Returns 0 on success, and -errno on failure.
+ */
+int vhost_set_device_state_fd(struct vhost_dev *dev,
+                              VhostDeviceStateDirection direction,
+                              VhostDeviceStatePhase phase,
+                              int fd,
+                              int *reply_fd,
+                              Error **errp);
+
+/**
+ * vhost_check_device_state(): After transferring state from/to the
+ * back-end via vhost_set_device_state_fd(), i.e. once the sending end
+ * has closed the pipe, inquire the back-end to report any potential
+ * errors that have occurred on its side.  This allows to sense errors
+ * like:
+ * - During outgoing migration, when the source side had already started
+ *   to produce its state, something went wrong and it failed to finish
+ * - During incoming migration, when the received state is somehow
+ *   invalid and cannot be processed by the back-end
+ *
+ * @dev: The vhost device
+ * @errp: Potential error description
+ *
+ * Returns 0 when the back-end reports successful state transfer and
+ * processing, and -errno when an error occurred somewhere.
+ */
+int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
+
 #endif
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 7bed9ad7d5..7096b148a9 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -74,6 +74,8 @@ enum VhostUserProtocolFeature {
     /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
     VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
     VHOST_USER_PROTOCOL_F_STATUS = 16,
+    /* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP. */
+    VHOST_USER_PROTOCOL_F_DEVICE_STATE = 18,
     VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -121,6 +123,8 @@ typedef enum VhostUserRequest {
     VHOST_USER_REM_MEM_REG = 38,
     VHOST_USER_SET_STATUS = 39,
     VHOST_USER_GET_STATUS = 40,
+    VHOST_USER_SET_DEVICE_STATE_FD = 41,
+    VHOST_USER_CHECK_DEVICE_STATE = 42,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -212,6 +216,12 @@ typedef struct {
     uint32_t size; /* the following payload size */
 } QEMU_PACKED VhostUserHeader;
 
+/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
+typedef struct VhostUserTransferDeviceState {
+    uint32_t direction;
+    uint32_t phase;
+} VhostUserTransferDeviceState;
+
 typedef union {
 #define VHOST_USER_VRING_IDX_MASK   (0xff)
 #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
@@ -226,6 +236,7 @@ typedef union {
         VhostUserCryptoSession session;
         VhostUserVringArea area;
         VhostUserInflight inflight;
+        VhostUserTransferDeviceState transfer_state;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -2746,6 +2757,140 @@ static void vhost_user_reset_status(struct vhost_dev *dev)
     }
 }
 
+static bool vhost_user_supports_device_state(struct vhost_dev *dev)
+{
+    return virtio_has_feature(dev->protocol_features,
+                              VHOST_USER_PROTOCOL_F_DEVICE_STATE);
+}
+
+static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
+                                          VhostDeviceStateDirection direction,
+                                          VhostDeviceStatePhase phase,
+                                          int fd,
+                                          int *reply_fd,
+                                          Error **errp)
+{
+    int ret;
+    struct vhost_user *vu = dev->opaque;
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_SET_DEVICE_STATE_FD,
+            .flags = VHOST_USER_VERSION,
+            .size = sizeof(msg.payload.transfer_state),
+        },
+        .payload.transfer_state = {
+            .direction = direction,
+            .phase = phase,
+        },
+    };
+
+    *reply_fd = -1;
+
+    if (!vhost_user_supports_device_state(dev)) {
+        close(fd);
+        error_setg(errp, "Back-end does not support migration state transfer");
+        return -ENOTSUP;
+    }
+
+    ret = vhost_user_write(dev, &msg, &fd, 1);
+    close(fd);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to send SET_DEVICE_STATE_FD message");
+        return ret;
+    }
+
+    ret = vhost_user_read(dev, &msg);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to receive SET_DEVICE_STATE_FD reply");
+        return ret;
+    }
+
+    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
+        error_setg(errp,
+                   "Received unexpected message type, expected %d, received %d",
+                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
+        return -EPROTO;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.u64)) {
+        error_setg(errp,
+                   "Received bad message size, expected %zu, received %" PRIu32,
+                   sizeof(msg.payload.u64), msg.hdr.size);
+        return -EPROTO;
+    }
+
+    if ((msg.payload.u64 & 0xff) != 0) {
+        error_setg(errp, "Back-end did not accept migration state transfer");
+        return -EIO;
+    }
+
+    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
+        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
+        if (*reply_fd < 0) {
+            error_setg(errp,
+                       "Failed to get back-end-provided transfer pipe FD");
+            *reply_fd = -1;
+            return -EIO;
+        }
+    }
+
+    return 0;
+}
+
+static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
+{
+    int ret;
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_CHECK_DEVICE_STATE,
+            .flags = VHOST_USER_VERSION,
+            .size = 0,
+        },
+    };
+
+    if (!vhost_user_supports_device_state(dev)) {
+        error_setg(errp, "Back-end does not support migration state transfer");
+        return -ENOTSUP;
+    }
+
+    ret = vhost_user_write(dev, &msg, NULL, 0);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to send CHECK_DEVICE_STATE message");
+        return ret;
+    }
+
+    ret = vhost_user_read(dev, &msg);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to receive CHECK_DEVICE_STATE reply");
+        return ret;
+    }
+
+    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
+        error_setg(errp,
+                   "Received unexpected message type, expected %d, received %d",
+                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
+        return -EPROTO;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.u64)) {
+        error_setg(errp,
+                   "Received bad message size, expected %zu, received %" PRIu32,
+                   sizeof(msg.payload.u64), msg.hdr.size);
+        return -EPROTO;
+    }
+
+    if (msg.payload.u64 != 0) {
+        error_setg(errp, "Back-end failed to process its internal state");
+        return -EIO;
+    }
+
+    return 0;
+}
+
 const VhostOps user_ops = {
         .backend_type = VHOST_BACKEND_TYPE_USER,
         .vhost_backend_init = vhost_user_backend_init,
@@ -2782,4 +2927,7 @@ const VhostOps user_ops = {
         .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
         .vhost_dev_start = vhost_user_dev_start,
         .vhost_reset_status = vhost_user_reset_status,
+        .vhost_supports_device_state = vhost_user_supports_device_state,
+        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
+        .vhost_check_device_state = vhost_user_check_device_state,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 6003e50e83..85e199f0aa 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -2096,3 +2096,40 @@ int vhost_reset_device(struct vhost_dev *hdev)
 
     return -ENOSYS;
 }
+
+bool vhost_supports_device_state(struct vhost_dev *dev)
+{
+    if (dev->vhost_ops->vhost_supports_device_state) {
+        return dev->vhost_ops->vhost_supports_device_state(dev);
+    }
+
+    return false;
+}
+
+int vhost_set_device_state_fd(struct vhost_dev *dev,
+                              VhostDeviceStateDirection direction,
+                              VhostDeviceStatePhase phase,
+                              int fd,
+                              int *reply_fd,
+                              Error **errp)
+{
+    if (dev->vhost_ops->vhost_set_device_state_fd) {
+        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
+                                                         fd, reply_fd, errp);
+    }
+
+    error_setg(errp,
+               "vhost transport does not support migration state transfer");
+    return -ENOSYS;
+}
+
+int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
+{
+    if (dev->vhost_ops->vhost_check_device_state) {
+        return dev->vhost_ops->vhost_check_device_state(dev, errp);
+    }
+
+    error_setg(errp,
+               "vhost transport does not support migration state transfer");
+    return -ENOSYS;
+}
-- 
2.41.0



* [Virtio-fs] [PATCH v4 7/8] vhost: Add high-level state save/load functions
  2023-10-04 12:58 [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek
                   ` (5 preceding siblings ...)
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 6/8] vhost-user: Interface for migration state transfer Hanna Czenczek
@ 2023-10-04 12:59 ` Hanna Czenczek
  2023-10-05 17:46   ` Stefan Hajnoczi
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 8/8] vhost-user-fs: Implement internal migration Hanna Czenczek
  2023-10-05 17:48 ` [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Stefan Hajnoczi
  8 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-04 12:59 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	German Maglione, Eugenio Pérez, Anton Kuchin

vhost_save_backend_state() and vhost_load_backend_state() can be used by
vhost front-ends to easily save and load the back-end's state to/from
the migration stream.

Because we do not know the full state size ahead of time,
vhost_save_backend_state() simply reads the data in 1 MB chunks, and
writes each chunk consecutively into the migration stream, prefixed by
its length.  EOF is indicated by a 0-length chunk.
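
To make the stream layout concrete, here is a minimal standalone
sketch of the framing described above: each chunk is a big-endian
32-bit length followed by that many bytes, and a zero length ends the
stream.  It works on plain memory buffers rather than a QEMUFile, and
all helper names are hypothetical, picked only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value in big-endian byte order */
static void put_be32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)(v >> 24);
    dst[1] = (uint8_t)(v >> 16);
    dst[2] = (uint8_t)(v >> 8);
    dst[3] = (uint8_t)v;
}

/* Load a 32-bit value stored in big-endian byte order */
static uint32_t get_be32(const uint8_t *src)
{
    return ((uint32_t)src[0] << 24) | ((uint32_t)src[1] << 16) |
           ((uint32_t)src[2] << 8) | (uint32_t)src[3];
}

/*
 * Append one chunk (length prefix plus payload) to @out at offset @pos
 * and return the new offset.  A chunk with @len == 0 is the end marker.
 * The caller must ensure @out has room for 4 + @len more bytes.
 */
static size_t frame_chunk(uint8_t *out, size_t pos,
                          const void *data, uint32_t len)
{
    assert(len == 0 || data != NULL);

    put_be32(out + pos, len);
    if (len > 0) {
        memcpy(out + pos + 4, data, len);
    }
    return pos + 4 + len;
}

/*
 * Walk a framed stream and return the total number of payload bytes,
 * or -1 if the stream is truncated or lacks the 0-length end marker.
 */
static long framed_payload_size(const uint8_t *in, size_t in_len)
{
    size_t pos = 0;
    long total = 0;

    while (in_len - pos >= 4) {
        uint32_t len = get_be32(in + pos);

        pos += 4;
        if (len == 0) {
            return total;   /* end marker reached */
        }
        if (in_len - pos < len) {
            return -1;      /* payload cut short */
        }
        pos += len;
        total += (long)len;
    }

    return -1;              /* ran out of data before the end marker */
}
```

Because every chunk carries its own length, the writer never needs to
know the total state size in advance, which is the reason given above
for choosing this format.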

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 include/hw/virtio/vhost.h |  35 +++++++
 hw/virtio/vhost.c         | 204 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 239 insertions(+)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index a0d03c9fdf..100fcc874d 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -426,4 +426,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev,
  */
 int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
 
+/**
+ * vhost_save_backend_state(): High-level function to receive a vhost
+ * back-end's state, and save it in @f.  Uses
+ * `vhost_set_device_state_fd()` to get the data from the back-end, and
+ * stores it in consecutive chunks that are each prefixed by their
+ * respective length (be32).  The end is marked by a 0-length chunk.
+ *
+ * Must only be called while the device and all its vrings are stopped
+ * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
+ *
+ * @dev: The vhost device from which to save the state
+ * @f: Migration stream in which to save the state
+ * @errp: Potential error message
+ *
+ * Returns 0 on success, and -errno otherwise.
+ */
+int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
+
+/**
+ * vhost_load_backend_state(): High-level function to load a vhost
+ * back-end's state from @f, and send it over to the back-end.  Reads
+ * the data from @f in the format used by `vhost_save_backend_state()`,
+ * and uses `vhost_set_device_state_fd()` to transfer it to the back-end.
+ *
+ * Must only be called while the device and all its vrings are stopped
+ * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
+ *
+ * @dev: The vhost device to which to send the state
+ * @f: Migration stream from which to load the state
+ * @errp: Potential error message
+ *
+ * Returns 0 on success, and -errno otherwise.
+ */
+int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
+
 #endif
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 85e199f0aa..1465adf13a 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -2133,3 +2133,207 @@ int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
                "vhost transport does not support migration state transfer");
     return -ENOSYS;
 }
+
+int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
+{
+    /* Maximum chunk size in which to transfer the state */
+    const size_t chunk_size = 1 * 1024 * 1024;
+    g_autofree void *transfer_buf = NULL;
+    g_autoptr(GError) g_err = NULL;
+    int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
+    int ret;
+
+    /* [0] for reading (our end), [1] for writing (back-end's end) */
+    if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
+        error_setg(errp, "Failed to set up state transfer pipe: %s",
+                   g_err->message);
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    read_fd = pipe_fds[0];
+    write_fd = pipe_fds[1];
+
+    /*
+     * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped.
+     * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist for
+     * vhost-user, so just check that it is stopped at all.
+     */
+    assert(!dev->started);
+
+    /* Transfer ownership of write_fd to the back-end */
+    ret = vhost_set_device_state_fd(dev,
+                                    VHOST_TRANSFER_STATE_DIRECTION_SAVE,
+                                    VHOST_TRANSFER_STATE_PHASE_STOPPED,
+                                    write_fd,
+                                    &reply_fd,
+                                    errp);
+    if (ret < 0) {
+        error_prepend(errp, "Failed to initiate state transfer: ");
+        goto fail;
+    }
+
+    /* If the back-end wishes to use a different pipe, switch over */
+    if (reply_fd >= 0) {
+        close(read_fd);
+        read_fd = reply_fd;
+    }
+
+    transfer_buf = g_malloc(chunk_size);
+
+    while (true) {
+        ssize_t read_ret;
+
+        read_ret = RETRY_ON_EINTR(read(read_fd, transfer_buf, chunk_size));
+        if (read_ret < 0) {
+            ret = -errno;
+            error_setg_errno(errp, -ret, "Failed to receive state");
+            goto fail;
+        }
+
+        assert(read_ret <= chunk_size);
+        qemu_put_be32(f, read_ret);
+
+        if (read_ret == 0) {
+            /* EOF */
+            break;
+        }
+
+        qemu_put_buffer(f, transfer_buf, read_ret);
+    }
+
+    /*
+     * Back-end will not really care, but be clean and close our end of the pipe
+     * before inquiring the back-end about whether transfer was successful
+     */
+    close(read_fd);
+    read_fd = -1;
+
+    /* Also, verify that the device is still stopped */
+    assert(!dev->started);
+
+    ret = vhost_check_device_state(dev, errp);
+    if (ret < 0) {
+        goto fail;
+    }
+
+    ret = 0;
+fail:
+    if (read_fd >= 0) {
+        close(read_fd);
+    }
+
+    return ret;
+}
+
+int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
+{
+    size_t transfer_buf_size = 0;
+    g_autofree void *transfer_buf = NULL;
+    g_autoptr(GError) g_err = NULL;
+    int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
+    int ret;
+
+    /* [0] for reading (back-end's end), [1] for writing (our end) */
+    if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
+        error_setg(errp, "Failed to set up state transfer pipe: %s",
+                   g_err->message);
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    read_fd = pipe_fds[0];
+    write_fd = pipe_fds[1];
+
+    /*
+     * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped.
+     * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist for
+     * vhost-user, so just check that it is stopped at all.
+     */
+    assert(!dev->started);
+
+    /* Transfer ownership of read_fd to the back-end */
+    ret = vhost_set_device_state_fd(dev,
+                                    VHOST_TRANSFER_STATE_DIRECTION_LOAD,
+                                    VHOST_TRANSFER_STATE_PHASE_STOPPED,
+                                    read_fd,
+                                    &reply_fd,
+                                    errp);
+    if (ret < 0) {
+        error_prepend(errp, "Failed to initiate state transfer: ");
+        goto fail;
+    }
+
+    /* If the back-end wishes to use a different pipe, switch over */
+    if (reply_fd >= 0) {
+        close(write_fd);
+        write_fd = reply_fd;
+    }
+
+    while (true) {
+        size_t this_chunk_size = qemu_get_be32(f);
+        ssize_t write_ret;
+        const uint8_t *transfer_pointer;
+
+        if (this_chunk_size == 0) {
+            /* End of state */
+            break;
+        }
+
+        if (transfer_buf_size < this_chunk_size) {
+            transfer_buf = g_realloc(transfer_buf, this_chunk_size);
+            transfer_buf_size = this_chunk_size;
+        }
+
+        if (qemu_get_buffer(f, transfer_buf, this_chunk_size) <
+                this_chunk_size)
+        {
+            error_setg(errp, "Failed to read state");
+            ret = -EINVAL;
+            goto fail;
+        }
+
+        transfer_pointer = transfer_buf;
+        while (this_chunk_size > 0) {
+            write_ret = RETRY_ON_EINTR(
+                write(write_fd, transfer_pointer, this_chunk_size)
+            );
+            if (write_ret < 0) {
+                ret = -errno;
+                error_setg_errno(errp, -ret, "Failed to send state");
+                goto fail;
+            } else if (write_ret == 0) {
+                error_setg(errp, "Failed to send state: Connection is closed");
+                ret = -ECONNRESET;
+                goto fail;
+            }
+
+            assert(write_ret <= this_chunk_size);
+            this_chunk_size -= write_ret;
+            transfer_pointer += write_ret;
+        }
+    }
+
+    /*
+     * Close our end, thus ending transfer, before inquiring the back-end about
+     * whether transfer was successful
+     */
+    close(write_fd);
+    write_fd = -1;
+
+    /* Also, verify that the device is still stopped */
+    assert(!dev->started);
+
+    ret = vhost_check_device_state(dev, errp);
+    if (ret < 0) {
+        goto fail;
+    }
+
+    ret = 0;
+fail:
+    if (write_fd >= 0) {
+        close(write_fd);
+    }
+
+    return ret;
+}
-- 
2.41.0



* [Virtio-fs] [PATCH v4 8/8] vhost-user-fs: Implement internal migration
  2023-10-04 12:58 [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek
                   ` (6 preceding siblings ...)
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 7/8] vhost: Add high-level state save/load functions Hanna Czenczek
@ 2023-10-04 12:59 ` Hanna Czenczek
  2023-10-05 17:46   ` Stefan Hajnoczi
  2023-10-05 17:48 ` [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Stefan Hajnoczi
  8 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-04 12:59 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	German Maglione, Eugenio Pérez, Anton Kuchin

A virtio-fs device's VM state consists of:
- the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
- the back-end's (virtiofsd's) internal state

We get/set the latter via the new vhost operations to transfer migratory
state.  It is its own dedicated subsection, so that for external
migration, it can be disabled.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 hw/virtio/vhost-user-fs.c | 101 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index 49d699ffc2..eb91723855 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -298,9 +298,108 @@ static struct vhost_dev *vuf_get_vhost(VirtIODevice *vdev)
     return &fs->vhost_dev;
 }
 
+/**
+ * Fetch the internal state from virtiofsd and save it to `f`.
+ */
+static int vuf_save_state(QEMUFile *f, void *pv, size_t size,
+                          const VMStateField *field, JSONWriter *vmdesc)
+{
+    VirtIODevice *vdev = pv;
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+    Error *local_error = NULL;
+    int ret;
+
+    ret = vhost_save_backend_state(&fs->vhost_dev, f, &local_error);
+    if (ret < 0) {
+        error_reportf_err(local_error,
+                          "Error saving back-end state of %s device %s "
+                          "(tag: \"%s\"): ",
+                          vdev->name, vdev->parent_obj.canonical_path,
+                          fs->conf.tag ?: "<none>");
+        return ret;
+    }
+
+    return 0;
+}
+
+/**
+ * Load virtiofsd's internal state from `f` and send it over to virtiofsd.
+ */
+static int vuf_load_state(QEMUFile *f, void *pv, size_t size,
+                          const VMStateField *field)
+{
+    VirtIODevice *vdev = pv;
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+    Error *local_error = NULL;
+    int ret;
+
+    ret = vhost_load_backend_state(&fs->vhost_dev, f, &local_error);
+    if (ret < 0) {
+        error_reportf_err(local_error,
+                          "Error loading back-end state of %s device %s "
+                          "(tag: \"%s\"): ",
+                          vdev->name, vdev->parent_obj.canonical_path,
+                          fs->conf.tag ?: "<none>");
+        return ret;
+    }
+
+    return 0;
+}
+
+static bool vuf_is_internal_migration(void *opaque)
+{
+    /* TODO: Return false when an external migration is requested */
+    return true;
+}
+
+static int vuf_check_migration_support(void *opaque)
+{
+    VirtIODevice *vdev = opaque;
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+
+    if (!vhost_supports_device_state(&fs->vhost_dev)) {
+        error_report("Back-end of %s device %s (tag: \"%s\") does not support "
+                     "migration through qemu",
+                     vdev->name, vdev->parent_obj.canonical_path,
+                     fs->conf.tag ?: "<none>");
+        return -ENOTSUP;
+    }
+
+    return 0;
+}
+
+static const VMStateDescription vuf_backend_vmstate;
+
 static const VMStateDescription vuf_vmstate = {
     .name = "vhost-user-fs",
-    .unmigratable = 1,
+    .version_id = 0,
+    .fields = (VMStateField[]) {
+        VMSTATE_VIRTIO_DEVICE,
+        VMSTATE_END_OF_LIST()
+    },
+    .subsections = (const VMStateDescription * []) {
+        &vuf_backend_vmstate,
+        NULL,
+    }
+};
+
+static const VMStateDescription vuf_backend_vmstate = {
+    .name = "vhost-user-fs-backend",
+    .version_id = 0,
+    .needed = vuf_is_internal_migration,
+    .pre_load = vuf_check_migration_support,
+    .pre_save = vuf_check_migration_support,
+    .fields = (VMStateField[]) {
+        {
+            .name = "back-end",
+            .info = &(const VMStateInfo) {
+                .name = "virtio-fs back-end state",
+                .get = vuf_load_state,
+                .put = vuf_save_state,
+            },
+        },
+        VMSTATE_END_OF_LIST()
+    },
 };
 
 static Property vuf_properties[] = {
-- 
2.41.0



* Re: [Virtio-fs] [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS Hanna Czenczek
@ 2023-10-05 17:08   ` Stefan Hajnoczi
  2023-10-05 17:15     ` [Virtio-fs] (no subject) Michael S. Tsirkin
  0 siblings, 1 reply; 53+ messages in thread
From: Stefan Hajnoczi @ 2023-10-05 17:08 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione,
	Eugenio Pérez, Anton Kuchin


On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> There is no clearly defined purpose for the virtio status byte in
> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> protocol extension, it is possible for SET_FEATURES to return errors
> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> 
> As for implementations, SET_STATUS is not widely implemented.  dpdk does
> implement it, but only uses it to signal feature negotiation failure.
> While it does log reset requests (SET_STATUS 0) as such, it effectively
> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> means the same thing as RESET_DEVICE).
> 
> While qemu superficially has support for [GS]ET_STATUS, it does not
> forward the guest-set status byte, but instead just makes it up
> internally, and actually completely ignores what the back-end returns,
> only using it as the template for a subsequent SET_STATUS to add single
> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> to see whether the flag is still set, which is the only way in which
> dpdk uses the status byte.
> 
> As-is, no front-end or back-end can rely on the other side handling this
> field in a useful manner, and it also provides no practical use over
> other mechanisms the vhost-user protocol has, which are more clearly
> defined.  Deprecate it.
> 
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Virtio-fs] (no subject)
  2023-10-05 17:08   ` Stefan Hajnoczi
@ 2023-10-05 17:15     ` Michael S. Tsirkin
  2023-10-06  7:48       ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Michael S. Tsirkin @ 2023-10-05 17:15 UTC (permalink / raw)
  Cc: Hanna Czenczek, qemu-devel, virtio-fs, German Maglione,
	Eugenio Pérez, Anton Kuchin

On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> > There is no clearly defined purpose for the virtio status byte in
> > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> > feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> > protocol extension, it is possible for SET_FEATURES to return errors
> > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> > 
> > As for implementations, SET_STATUS is not widely implemented.  dpdk does
> > implement it, but only uses it to signal feature negotiation failure.
> > While it does log reset requests (SET_STATUS 0) as such, it effectively
> > ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> > means the same thing as RESET_DEVICE).
> > 
> > While qemu superficially has support for [GS]ET_STATUS, it does not
> > forward the guest-set status byte, but instead just makes it up
> > internally, and actually completely ignores what the back-end returns,
> > only using it as the template for a subsequent SET_STATUS to add single
> > bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> > to see whether the flag is still set, which is the only way in which
> > dpdk uses the status byte.
> > 
> > As-is, no front-end or back-end can rely on the other side handling this
> > field in a useful manner, and it also provides no practical use over
> > other mechanisms the vhost-user protocol has, which are more clearly
> > defined.  Deprecate it.
> > 
> > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > ---
> >  docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> >  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>


SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
The fact current backends never check errors does not mean they never
will. So no, not applying this.

-- 
MST


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc Hanna Czenczek
@ 2023-10-05 17:38   ` Stefan Hajnoczi
  2023-10-06  7:53     ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Stefan Hajnoczi @ 2023-10-05 17:38 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione,
	Eugenio Pérez, Anton Kuchin

[-- Attachment #1: Type: text/plain, Size: 4758 bytes --]

On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote:
> GET_VRING_BASE does not mention that it stops the respective ring.  Fix
> that.
> 
> Furthermore, it is not fully clear what the "base offset" these
> commands' documentation refers to is; an offset could be many things.
> Be more precise and verbose about it, especially given that these
> commands use different payload structures depending on whether the vring
> is split or packed.
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++---
>  1 file changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 2f68e67a1a..50f5acebe5 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -108,6 +108,37 @@ A vring state description
>  
>  :num: a 32-bit number
>  
> +A vring descriptor index for split virtqueues
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> ++-------------+---------------------+
> +| vring index | index in avail ring |
> ++-------------+---------------------+
> +
> +:vring index: 32-bit index of the respective virtqueue
> +
> +:index in avail ring: 32-bit value, of which currently only the lower 16
> +  bits are used:
> +
> +  - Bits 0–15: Next descriptor index in the *Available Ring*

I think we need to say more to make this implementable just by reading
the spec:

  Index of the next *Available Ring* descriptor that the back-end will
  process. This is a free-running index that is not wrapped by the ring
  size.

Feel free to rephrase.

> +  - Bits 16–31: Reserved (set to zero)
> +
> +Vring descriptor indices for packed virtqueues
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> ++-------------+--------------------+
> +| vring index | descriptor indices |
> ++-------------+--------------------+
> +
> +:vring index: 32-bit index of the respective virtqueue
> +
> +:descriptor indices: 32-bit value:
> +
> +  - Bits 0–14: Index in the *Available Ring*

Same here.

> +  - Bit 15: Driver (Available) Ring Wrap Counter
> +  - Bits 16–30: Index in the *Used Ring*

Same here.

> +  - Bit 31: Device (Used) Ring Wrap Counter
> +
>  A vring address description
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  
> @@ -1031,18 +1062,45 @@ Front-end message types
>  ``VHOST_USER_SET_VRING_BASE``
>    :id: 10
>    :equivalent ioctl: ``VHOST_SET_VRING_BASE``
> -  :request payload: vring state description
> +  :request payload: vring descriptor index/indices
>    :reply payload: N/A
>  
> -  Sets the base offset in the available vring.
> +  Sets the next index to use for descriptors in this vring:
> +
> +  * For a split virtqueue, sets only the next descriptor index in the
> +    *Available Ring*.  The device is supposed to read the next index in
> +    the *Used Ring* from the respective vring structure in guest memory.
> +
> +  * For a packed virtqueue, both indices are supplied, as they are not
> +    explicitly available in memory.
> +
> +  Consequently, the payload type is specific to the type of virt queue
> +  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
> +  indices for packed virtqueues*).
>  
>  ``VHOST_USER_GET_VRING_BASE``
>    :id: 11
>    :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE``
>    :request payload: vring state description
> -  :reply payload: vring state description
> +  :reply payload: vring descriptor index/indices
> +
> +  Stops the vring and returns the current descriptor index or indices:
> +
> +    * For a split virtqueue, returns only the 16-bit next descriptor
> +      index in the *Available Ring*.  The index in the *Used Ring* is
> +      controlled by the guest driver and can be read from the vring

I find "is controlled by the guest driver" confusing. The device writes
the Used Ring index. The driver only reads it. The device is the active
party here.

The sentence can be shortened to omit the "controlled by the guest
driver" part.

> +      structure in memory, so is not covered.
> +
> +    * For a packed virtqueue, neither index is explicitly available to
> +      read from memory, so both indices (as maintained by the device) are
> +      returned.
> +
> +  Consequently, the payload type is specific to the type of virt queue
> +  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
> +  indices for packed virtqueues*).
>  
> -  Get the available vring base offset.
> +  The request payload’s *num* field is currently reserved and must be
> +  set to 0.
>  
>  ``VHOST_USER_SET_VRING_KICK``
>    :id: 12
> -- 
> 2.41.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
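
The payload layouts proposed in the quoted patch can be sketched in C as below. This is only an illustration of the bit positions listed in the patch text; the struct and function names are hypothetical and do not come from QEMU or the vhost-user specification. The split-ring helper assumes the free-running index semantics suggested in the review comment, and that the queue size is non-zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holder for the packed-virtqueue "descriptor indices" value. */
struct packed_vring_indices {
    uint16_t avail_index;  /* bits 0-14 */
    bool     avail_wrap;   /* bit 15: Driver (Available) Ring Wrap Counter */
    uint16_t used_index;   /* bits 16-30 */
    bool     used_wrap;    /* bit 31: Device (Used) Ring Wrap Counter */
};

/* Split the 32-bit payload value into its four fields. */
static struct packed_vring_indices decode_packed_indices(uint32_t value)
{
    struct packed_vring_indices idx = {
        .avail_index = (uint16_t)(value & 0x7fff),
        .avail_wrap  = ((value >> 15) & 1) != 0,
        .used_index  = (uint16_t)((value >> 16) & 0x7fff),
        .used_wrap   = ((value >> 31) & 1) != 0,
    };
    return idx;
}

/* Inverse of decode_packed_indices(). */
static uint32_t encode_packed_indices(struct packed_vring_indices idx)
{
    return (uint32_t)(idx.avail_index & 0x7fff)
         | ((uint32_t)idx.avail_wrap << 15)
         | ((uint32_t)(idx.used_index & 0x7fff) << 16)
         | ((uint32_t)idx.used_wrap << 31);
}

/*
 * Split virtqueue: only bits 0-15 are used. Assuming the index is
 * free-running (not wrapped by the ring size), the slot in the
 * Available Ring is the index modulo the queue size.
 */
static uint16_t split_avail_slot(uint32_t value, uint16_t queue_size)
{
    uint16_t next_avail = (uint16_t)(value & 0xffff);
    return (uint16_t)(next_avail % queue_size);
}
```

For a split virtqueue of size 256, a payload value of 0x0105 would refer to ring slot 5 under these assumptions.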

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings Hanna Czenczek
@ 2023-10-05 17:43   ` Stefan Hajnoczi
  2023-10-18 12:14   ` Michael S. Tsirkin
  1 sibling, 0 replies; 53+ messages in thread
From: Stefan Hajnoczi @ 2023-10-05 17:43 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione,
	Eugenio Pérez, Anton Kuchin

[-- Attachment #1: Type: text/plain, Size: 1824 bytes --]

On Wed, Oct 04, 2023 at 02:58:59PM +0200, Hanna Czenczek wrote:
> Currently, the vhost-user documentation says that rings are to be
> initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is
> negotiated.  However, by the time of feature negotiation, all rings have
> already been initialized, so it is not entirely clear what this means.
> 
> At least the vhost-user-backend Rust crate's implementation interpreted
> it to mean that whenever this feature is negotiated, all rings are to
> be put into a disabled state, which means that every SET_FEATURES call
> would disable all rings, effectively halting the device.  This is
> problematic because the VHOST_F_LOG_ALL feature is also set or cleared
> this way, which happens during migration.  Doing so should not halt the
> device.
> 
> Other implementations have interpreted this to mean that the device is
> to be initialized with all rings disabled, and a subsequent SET_FEATURES
> call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of
> them.  Here, SET_FEATURES will never disable any ring.
> 
> This interpretation does not suffer the problem of unintentionally
> halting the device whenever features are set or cleared, so it seems
> better and more reasonable.
> 
> We can clarify this in the documentation by making it explicit that the
> enabled/disabled state is tracked even while the vring is stopped.
> Every vring is initialized in a disabled state, and SET_FEATURES without
> VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all
> vrings.
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  docs/interop/vhost-user.rst | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
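
The interpretation favoured in the quoted commit message can be sketched as a small state model. This is a minimal sketch, not the code of any existing back-end: the struct and function names are hypothetical, and only the two feature bit numbers are taken from the vhost-user and vhost headers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VHOST_F_LOG_ALL                 26  /* dirty-logging feature bit */
#define VHOST_USER_F_PROTOCOL_FEATURES  30  /* vhost-user feature bit */

/* Hypothetical per-ring state: both flags are tracked independently. */
struct ring_state {
    bool started;
    bool enabled;
};

/* Every ring starts out stopped and disabled. */
static void rings_init(struct ring_state *rings, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        rings[i].started = false;
        rings[i].enabled = false;
    }
}

/*
 * SET_FEATURES never disables a ring. Without
 * VHOST_USER_F_PROTOCOL_FEATURES it enables all of them; with it,
 * the enabled state is left exactly as it was.
 */
static void handle_set_features(struct ring_state *rings, size_t n,
                                uint64_t features)
{
    if (features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES)) {
        return;
    }
    for (size_t i = 0; i < n; i++) {
        rings[i].enabled = true;
    }
}
```

With this model, toggling VHOST_F_LOG_ALL during migration through another SET_FEATURES call leaves previously enabled rings enabled, so the device is not halted.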

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] [PATCH v4 4/8] vhost-user.rst: Introduce suspended state
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 4/8] vhost-user.rst: Introduce suspended state Hanna Czenczek
@ 2023-10-05 17:44   ` Stefan Hajnoczi
  0 siblings, 0 replies; 53+ messages in thread
From: Stefan Hajnoczi @ 2023-10-05 17:44 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione,
	Eugenio Pérez, Anton Kuchin

[-- Attachment #1: Type: text/plain, Size: 817 bytes --]

On Wed, Oct 04, 2023 at 02:59:00PM +0200, Hanna Czenczek wrote:
> In vDPA, GET_VRING_BASE does not stop the queried vring, which is why
> SUSPEND was introduced so that the returned index would be stable.  In
> vhost-user, it does stop the vring, so under the same reasoning, it can
> get away without SUSPEND.
> 
> Still, we do want to clarify that if the device is completely stopped,
> i.e. all vrings are stopped, the back-end should cease to modify any
> state relating to the guest.  Do this by calling it "suspended".
> 
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  docs/interop/vhost-user.rst | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state Hanna Czenczek
@ 2023-10-05 17:46   ` Stefan Hajnoczi
  0 siblings, 0 replies; 53+ messages in thread
From: Stefan Hajnoczi @ 2023-10-05 17:46 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione,
	Eugenio Pérez, Anton Kuchin

[-- Attachment #1: Type: text/plain, Size: 1689 bytes --]

On Wed, Oct 04, 2023 at 02:59:01PM +0200, Hanna Czenczek wrote:
> For vhost-user devices, qemu can migrate the virtio state, but not the
> back-end's internal state.  To do so, we need to be able to transfer
> this internal state between front-end (qemu) and back-end.
> 
> At this point, this new feature is added for the purpose of virtio-fs
> migration.  Because virtiofsd's internal state will not be too large, we
> believe it is best to transfer it as a single binary blob after the
> streaming phase.
> 
> These are the additions to the protocol:
> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_DEVICE_STATE
> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a file
>   descriptor over which to transfer the state.
> - CHECK_DEVICE_STATE: After the state has been transferred through the
>   file descriptor, the front-end invokes this function to verify
>   success.  There is no in-band way (through the file descriptor) to
>   indicate failure, so we need to check explicitly.
> 
> Once the transfer FD has been established via SET_DEVICE_STATE_FD
> (which includes establishing the direction of transfer and migration
> phase), the sending side writes its data into it, and the reading side
> reads it until it sees an EOF.  Then, the front-end will check for
> success via CHECK_DEVICE_STATE, which on the destination side includes
> checking for integrity (i.e. errors during deserialization).
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  docs/interop/vhost-user.rst | 172 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 172 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] [PATCH v4 6/8] vhost-user: Interface for migration state transfer
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 6/8] vhost-user: Interface for migration state transfer Hanna Czenczek
@ 2023-10-05 17:46   ` Stefan Hajnoczi
  0 siblings, 0 replies; 53+ messages in thread
From: Stefan Hajnoczi @ 2023-10-05 17:46 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione,
	Eugenio Pérez, Anton Kuchin

[-- Attachment #1: Type: text/plain, Size: 635 bytes --]

On Wed, Oct 04, 2023 at 02:59:02PM +0200, Hanna Czenczek wrote:
> Add the interface for transferring the back-end's state during migration
> as defined previously in vhost-user.rst.
> 
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |  24 +++++
>  include/hw/virtio/vhost.h         |  78 ++++++++++++++++
>  hw/virtio/vhost-user.c            | 148 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c                 |  37 ++++++++
>  4 files changed, 287 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] [PATCH v4 7/8] vhost: Add high-level state save/load functions
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 7/8] vhost: Add high-level state save/load functions Hanna Czenczek
@ 2023-10-05 17:46   ` Stefan Hajnoczi
  0 siblings, 0 replies; 53+ messages in thread
From: Stefan Hajnoczi @ 2023-10-05 17:46 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione,
	Eugenio Pérez, Anton Kuchin

[-- Attachment #1: Type: text/plain, Size: 834 bytes --]

On Wed, Oct 04, 2023 at 02:59:03PM +0200, Hanna Czenczek wrote:
> vhost_save_backend_state() and vhost_load_backend_state() can be used by
> vhost front-ends to easily save and load the back-end's state to/from
> the migration stream.
> 
> Because we do not know the full state size ahead of time,
> vhost_save_backend_state() simply reads the data in 1 MB chunks, and
> writes each chunk consecutively into the migration stream, prefixed by
> its length.  EOF is indicated by a 0-length chunk.
> 
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost.h |  35 +++++++
>  hw/virtio/vhost.c         | 204 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 239 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
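
The chunked framing described in the quoted commit message (length-prefixed chunks, terminated by a 0-length chunk) can be illustrated with an in-memory sketch. The 32-bit big-endian length prefix is an assumption made for this example, as are the function names; the commit message itself only specifies 1 MB chunks prefixed by their length.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store v as 4 big-endian bytes (assumed prefix encoding). */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Hypothetical helper: frame state[0..len) into out as a sequence of
 * <length><data> chunks of at most chunk_size bytes, followed by a
 * 0-length chunk marking EOF. Returns the number of bytes written,
 * or 0 if the arguments are invalid or out is too small.
 */
static size_t frame_state(const uint8_t *state, size_t len, size_t chunk_size,
                          uint8_t *out, size_t out_cap)
{
    size_t pos = 0;

    if (chunk_size == 0 || chunk_size > UINT32_MAX) {
        return 0;
    }
    while (len > 0) {
        size_t n = len < chunk_size ? len : chunk_size;

        if (out_cap - pos < 4 || out_cap - pos - 4 < n) {
            return 0;
        }
        put_be32(out + pos, (uint32_t)n);
        memcpy(out + pos + 4, state, n);
        pos += 4 + n;
        state += n;
        len -= n;
    }
    if (out_cap - pos < 4) {
        return 0;
    }
    put_be32(out + pos, 0);             /* EOF marker */
    return pos + 4;
}
```

Because the terminator is itself a chunk, the reader does not need to know the total state size in advance; it loops until it sees a length of zero.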

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] [PATCH v4 8/8] vhost-user-fs: Implement internal migration
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 8/8] vhost-user-fs: Implement internal migration Hanna Czenczek
@ 2023-10-05 17:46   ` Stefan Hajnoczi
  0 siblings, 0 replies; 53+ messages in thread
From: Stefan Hajnoczi @ 2023-10-05 17:46 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione,
	Eugenio Pérez, Anton Kuchin

[-- Attachment #1: Type: text/plain, Size: 703 bytes --]

On Wed, Oct 04, 2023 at 02:59:04PM +0200, Hanna Czenczek wrote:
> A virtio-fs device's VM state consists of:
> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
> - the back-end's (virtiofsd's) internal state
> 
> We get/set the latter via the new vhost operations to transfer migratory
> state.  It is its own dedicated subsection, so that for external
> migration, it can be disabled.
> 
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  hw/virtio/vhost-user-fs.c | 101 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 100 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration
  2023-10-04 12:58 [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek
                   ` (7 preceding siblings ...)
  2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 8/8] vhost-user-fs: Implement internal migration Hanna Czenczek
@ 2023-10-05 17:48 ` Stefan Hajnoczi
  8 siblings, 0 replies; 53+ messages in thread
From: Stefan Hajnoczi @ 2023-10-05 17:48 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione,
	Eugenio Pérez, Anton Kuchin

[-- Attachment #1: Type: text/plain, Size: 4046 bytes --]

On Wed, Oct 04, 2023 at 02:58:56PM +0200, Hanna Czenczek wrote:
> RFC:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04263.html
> 
> v1:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-04/msg01575.html
> 
> v2:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02604.html
> 
> v3:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-09/msg03750.html
> 
> 
> Based-on: <20231004014532.1228637-1-stefanha@redhat.com>
>           ([PATCH v2 0/3] vhost: clean up device reset)
> 
> 
> Hi,
> 
> This v4 includes largely unchanged patches from v3.  The main
> addition/change is what came out of the discussion between Stefan and me
> around how to proceed without SUSPEND/RESUME, which is that this series
> is now based on his reset fix, and it includes more documentation
> changes.

This looks good. I posted some minor comments on the new patches.

Stefan

> 
> Changes in detail:
> 
> - Patch 1: Fall-out from the reset fix: Currently, the status byte is
>   effectively unused (qemu only uses it for resetting, which all
>   back-ends ignore; DPDK uses it to announce potential feature
>   negotiation failure, which qemu ignores).  It is also not defined what
>   exactly front-end or back-end should do with this byte, except
>   pointing at the virtio spec, which however naturally does not say how
>   this integrates with vhost-user’s RESET_DEVICE or [GS]ET_FEATURES.
>   Furthermore, there does not seem to be a use for this; we have
>   RESET_DEVICE for resetting, and we have [GS]ET_FEATURES (and
>   REPLY_ACK, which can be used on SET_FEATURES) for feature
>   negotiation.
>   Therefore, deprecate the status byte, pointing to those other commands
>   instead.
> 
> - Patch 2: Patch 4 defines a suspended state for the whole back-end if
>   all vrings are stopped.  I think this should be mentioned in
>   GET_VRING_BASE, but upon trying to add it, I found that it does not
>   even mention that it stops the vring (mentioned only in the Ring
>   States section), and remembered that the whole description of both
>   GET_VRING_BASE and SET_VRING_BASE really was not helpful when trying
>   to implement a vhost-user back-end.  Took the opportunity to overhaul
>   both.
> 
> - Patch 3: This one’s from v3, but quite heavily modified.  Stefan
>   suggested consistently defining the started/stopped and
>   enabled/disabled states to be independent, and indeed doing so
>   simplifies a whole lot of stuff.  Specifically, it makes the magic
>   “enabled/disabled when started” go away.  Basically, I found this
>   change alone is enough to remove the confusion I had with the existing
>   documentation.
> 
> - Patch 4: As suggested by Stefan, just define a suspended state without
>   introducing SUSPEND.  vDPA needs SUSPEND because its GET_VRING_BASE
>   does not stop the vring, but vhost-user’s does, so we can define the
>   suspended state to be when all vrings are stopped.
> 
> - Patch 5: Reference the suspended state.
> 
> - Patches 6 through 8: Unmodified, except for them being rebased on
>   Stefan’s series.
> 
> 
> Hanna Czenczek (8):
>   vhost-user.rst: Deprecate [GS]ET_STATUS
>   vhost-user.rst: Improve [GS]ET_VRING_BASE doc
>   vhost-user.rst: Clarify enabling/disabling vrings
>   vhost-user.rst: Introduce suspended state
>   vhost-user.rst: Migrating back-end-internal state
>   vhost-user: Interface for migration state transfer
>   vhost: Add high-level state save/load functions
>   vhost-user-fs: Implement internal migration
> 
>  docs/interop/vhost-user.rst       | 318 +++++++++++++++++++++++++++---
>  include/hw/virtio/vhost-backend.h |  24 +++
>  include/hw/virtio/vhost.h         | 113 +++++++++++
>  hw/virtio/vhost-user-fs.c         | 101 +++++++++-
>  hw/virtio/vhost-user.c            | 148 ++++++++++++++
>  hw/virtio/vhost.c                 | 241 ++++++++++++++++++++++
>  6 files changed, 917 insertions(+), 28 deletions(-)
> 
> -- 
> 2.41.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-05 17:15     ` [Virtio-fs] (no subject) Michael S. Tsirkin
@ 2023-10-06  7:48       ` Hanna Czenczek
  2023-10-06  8:45         ` Michael S. Tsirkin
  0 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-06  7:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin

On 05.10.23 19:15, Michael S. Tsirkin wrote:
> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>> There is no clearly defined purpose for the virtio status byte in
>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>> protocol extension, it is possible for SET_FEATURES to return errors
>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>
>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>> implement it, but only uses it to signal feature negotiation failure.
>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>> means the same thing as RESET_DEVICE).
>>>
>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>> forward the guest-set status byte, but instead just makes it up
>>> internally, and actually completely ignores what the back-end returns,
>>> only using it as the template for a subsequent SET_STATUS to add single
>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>> to see whether the flag is still set, which is the only way in which
>>> dpdk uses the status byte.
>>>
>>> As-is, no front-end or back-end can rely on the other side handling this
>>> field in a useful manner, and it also provides no practical use over
>>> other mechanisms the vhost-user protocol has, which are more clearly
>>> defined.  Deprecate it.
>>>
>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>> ---
>>>   docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>   1 file changed, 21 insertions(+), 7 deletions(-)
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>
> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> The fact current backends never check errors does not mean they never
> will. So no, not applying this.

Can this not be done with REPLY_ACK?  I.e., with the following message 
order:

1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is 
present
2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
4. SET_FEATURES with need_reply

If not, the problem is that qemu has sent SET_STATUS 0 for a while when 
the vCPUs are stopped, which generally seems to request a device reset.  
If we don’t state at least that SET_STATUS 0 is to be ignored, back-ends 
that will implement SET_STATUS later may break with at least these qemu 
versions.  But documenting that a particular use of the status byte is 
to be ignored would be really strange.

Hanna
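
The four-step message order above boils down to a decision the front-end makes before sending SET_FEATURES. The sketch below shows only that decision, under the assumption that the constants match the vhost-user specification (feature bit 30, protocol feature bit 3, need_reply as bit 3 of the message flags, protocol version 1); the function name is hypothetical and no messages are actually sent.

```c
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES   30         /* virtio feature bit */
#define VHOST_USER_PROTOCOL_F_REPLY_ACK  3          /* protocol feature bit */
#define VHOST_USER_VERSION               0x1u       /* flags bits 0-1 */
#define VHOST_USER_NEED_REPLY_MASK       (1u << 3)  /* flags bit 3 */

/*
 * Hypothetical helper: compute the header flags for SET_FEATURES.
 * backend_features is the GET_FEATURES reply (step 1) and
 * backend_protocol_features the GET_PROTOCOL_FEATURES reply (step 2).
 * need_reply is only set if REPLY_ACK can be negotiated (step 3),
 * so that the back-end can report an error for SET_FEATURES (step 4).
 */
static uint32_t set_features_flags(uint64_t backend_features,
                                   uint64_t backend_protocol_features)
{
    bool has_protocol_features =
        (backend_features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES)) != 0;
    bool has_reply_ack = has_protocol_features &&
        (backend_protocol_features &
         (1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK)) != 0;
    uint32_t flags = VHOST_USER_VERSION;

    if (has_reply_ack) {
        flags |= VHOST_USER_NEED_REPLY_MASK;
    }
    return flags;
}
```

If either bit is missing, the front-end in this sketch falls back to a plain SET_FEATURES and has no way to learn about a failure through this path.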


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  2023-10-05 17:38   ` Stefan Hajnoczi
@ 2023-10-06  7:53     ` Hanna Czenczek
  2023-10-06  8:49       ` Michael S. Tsirkin
  0 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-06  7:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S . Tsirkin, qemu-devel, virtio-fs, Eugenio Pérez,
	Anton Kuchin

On 05.10.23 19:38, Stefan Hajnoczi wrote:
> On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote:
>> GET_VRING_BASE does not mention that it stops the respective ring.  Fix
>> that.
>>
>> Furthermore, it is not fully clear what the "base offset" these
>> commands' documentation refers to is; an offset could be many things.
>> Be more precise and verbose about it, especially given that these
>> commands use different payload structures depending on whether the vring
>> is split or packed.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++---
>>   1 file changed, 62 insertions(+), 4 deletions(-)
>>
>> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
>> index 2f68e67a1a..50f5acebe5 100644
>> --- a/docs/interop/vhost-user.rst
>> +++ b/docs/interop/vhost-user.rst
>> @@ -108,6 +108,37 @@ A vring state description
>>   
>>   :num: a 32-bit number
>>   
>> +A vring descriptor index for split virtqueues
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> ++-------------+---------------------+
>> +| vring index | index in avail ring |
>> ++-------------+---------------------+
>> +
>> +:vring index: 32-bit index of the respective virtqueue
>> +
>> +:index in avail ring: 32-bit value, of which currently only the lower 16
>> +  bits are used:
>> +
>> +  - Bits 0–15: Next descriptor index in the *Available Ring*
> I think we need to say more to make this implementable just by reading
> the spec:
>
>    Index of the next *Available Ring* descriptor that the back-end will
>    process. This is a free-running index that is not wrapped by the ring
>    size.

Sure, thanks.

> Feel free to rephrase.
>
>> +  - Bits 16–31: Reserved (set to zero)
>> +
>> +Vring descriptor indices for packed virtqueues
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> ++-------------+--------------------+
>> +| vring index | descriptor indices |
>> ++-------------+--------------------+
>> +
>> +:vring index: 32-bit index of the respective virtqueue
>> +
>> +:descriptor indices: 32-bit value:
>> +
>> +  - Bits 0–14: Index in the *Available Ring*
> Same here.
>
>> +  - Bit 15: Driver (Available) Ring Wrap Counter
>> +  - Bits 16–30: Index in the *Used Ring*
> Same here.
>
>> +  - Bit 31: Device (Used) Ring Wrap Counter
>> +
>>   A vring address description
>>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>   
>> @@ -1031,18 +1062,45 @@ Front-end message types
>>   ``VHOST_USER_SET_VRING_BASE``
>>     :id: 10
>>     :equivalent ioctl: ``VHOST_SET_VRING_BASE``
>> -  :request payload: vring state description
>> +  :request payload: vring descriptor index/indices
>>     :reply payload: N/A
>>   
>> -  Sets the base offset in the available vring.
>> +  Sets the next index to use for descriptors in this vring:
>> +
>> +  * For a split virtqueue, sets only the next descriptor index in the
>> +    *Available Ring*.  The device is supposed to read the next index in
>> +    the *Used Ring* from the respective vring structure in guest memory.
>> +
>> +  * For a packed virtqueue, both indices are supplied, as they are not
>> +    explicitly available in memory.
>> +
>> +  Consequently, the payload type is specific to the type of virt queue
>> +  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
>> +  indices for packed virtqueues*).
>>   
>>   ``VHOST_USER_GET_VRING_BASE``
>>     :id: 11
>>     :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE``
>>     :request payload: vring state description
>> -  :reply payload: vring state description
>> +  :reply payload: vring descriptor index/indices
>> +
>> +  Stops the vring and returns the current descriptor index or indices:
>> +
>> +    * For a split virtqueue, returns only the 16-bit next descriptor
>> +      index in the *Available Ring*.  The index in the *Used Ring* is
>> +      controlled by the guest driver and can be read from the vring
> I find "is controlled by the guest driver" confusing. The device writes
> the Used Ring index. The driver only reads it. The device is the active
> party here.

Er, good point.  That breaks the whole reasoning.  Then I don’t 
understand why we do get/set the available ring index and not the used 
ring index.  Do you know why?

> The sentence can be shortened to omit the "controlled by the guest
> driver" part.

I don’t want to shorten it, because I would like to know why we don’t 
get/set both indices for split virtqueues, too.

Hanna

>> +      structure in memory, so is not covered.
>> +
>> +    * For a packed virtqueue, neither index is explicitly available to
>> +      read from memory, so both indices (as maintained by the device) are
>> +      returned.
>> +
>> +  Consequently, the payload type is specific to the type of virt queue
>> +  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
>> +  indices for packed virtqueues*).
>>   
>> -  Get the available vring base offset.
>> +  The request payload’s *num* field is currently reserved and must be
>> +  set to 0.
>>   
>>   ``VHOST_USER_SET_VRING_KICK``
>>     :id: 12
>> -- 
>> 2.41.0
>>
>>
>> _______________________________________________
>> Virtio-fs mailing list
>> Virtio-fs@redhat.com
>> https://listman.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-06  7:48       ` Hanna Czenczek
@ 2023-10-06  8:45         ` Michael S. Tsirkin
  2023-10-06  9:15           ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Michael S. Tsirkin @ 2023-10-06  8:45 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu

On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
> On 05.10.23 19:15, Michael S. Tsirkin wrote:
> > On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> > > On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> > > > There is no clearly defined purpose for the virtio status byte in
> > > > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> > > > feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> > > > protocol extension, it is possible for SET_FEATURES to return errors
> > > > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> > > > 
> > > > As for implementations, SET_STATUS is not widely implemented.  dpdk does
> > > > implement it, but only uses it to signal feature negotiation failure.
> > > > While it does log reset requests (SET_STATUS 0) as such, it effectively
> > > > ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> > > > means the same thing as RESET_DEVICE).
> > > > 
> > > > While qemu superficially has support for [GS]ET_STATUS, it does not
> > > > forward the guest-set status byte, but instead just makes it up
> > > > internally, and actually completely ignores what the back-end returns,
> > > > only using it as the template for a subsequent SET_STATUS to add single
> > > > bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> > > > to see whether the flag is still set, which is the only way in which
> > > > dpdk uses the status byte.
> > > > 
> > > > As-is, no front-end or back-end can rely on the other side handling this
> > > > field in a useful manner, and it also provides no practical use over
> > > > other mechanisms the vhost-user protocol has, which are more clearly
> > > > defined.  Deprecate it.
> > > > 
> > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > ---
> > > >   docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> > > >   1 file changed, 21 insertions(+), 7 deletions(-)
> > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> > The fact current backends never check errors does not mean they never
> > will. So no, not applying this.
> 
> Can this not be done with REPLY_ACK?  I.e., with the following message
> order:
> 
> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
> present
> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
> 4. SET_FEATURES with need_reply
> 
> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
> vCPUs are stopped, which generally seems to request a device reset.  If we
> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
> implement SET_STATUS later may break with at least these qemu versions.  But
> documenting that a particular use of the status byte is to be ignored would
> be really strange.
> 
> Hanna

Hmm I guess. Though just following virtio spec seems cleaner to me...
vhost-user reconfigures the state fully on start. I guess symmetry was the
point. So I don't see why SET_STATUS 0 has to be ignored.


SET_STATUS was introduced by:

commit 923b8921d210763359e96246a58658ac0db6c645
Author: Yajun Wu <yajunw@nvidia.com>
Date:   Mon Oct 17 14:44:52 2022 +0800

    vhost-user: Support vhost_dev_start

CC the author.

-- 
MST


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  2023-10-06  7:53     ` Hanna Czenczek
@ 2023-10-06  8:49       ` Michael S. Tsirkin
  2023-10-06 13:55         ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Michael S. Tsirkin @ 2023-10-06  8:49 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin

On Fri, Oct 06, 2023 at 09:53:53AM +0200, Hanna Czenczek wrote:
> On 05.10.23 19:38, Stefan Hajnoczi wrote:
> > On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote:
> > > GET_VRING_BASE does not mention that it stops the respective ring.  Fix
> > > that.
> > > 
> > > Furthermore, it is not fully clear what the "base offset" these
> > > commands' documentation refers to is; an offset could be many things.
> > > Be more precise and verbose about it, especially given that these
> > > commands use different payload structures depending on whether the vring
> > > is split or packed.
> > > 
> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > ---
> > >   docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++---
> > >   1 file changed, 62 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > > index 2f68e67a1a..50f5acebe5 100644
> > > --- a/docs/interop/vhost-user.rst
> > > +++ b/docs/interop/vhost-user.rst
> > > @@ -108,6 +108,37 @@ A vring state description
> > >   :num: a 32-bit number
> > > +A vring descriptor index for split virtqueues
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-------------+---------------------+
> > > +| vring index | index in avail ring |
> > > ++-------------+---------------------+
> > > +
> > > +:vring index: 32-bit index of the respective virtqueue
> > > +
> > > +:index in avail ring: 32-bit value, of which currently only the lower 16
> > > +  bits are used:
> > > +
> > > +  - Bits 0–15: Next descriptor index in the *Available Ring*
> > I think we need to say more to make this implementable just by reading
> > the spec:
> > 
> >    Index of the next *Available Ring* descriptor that the back-end will
> >    process. This is a free-running index that is not wrapped by the ring
> >    size.
> 
> Sure, thanks.
> 
> > Feel free to rephrase.
> > 
> > > +  - Bits 16–31: Reserved (set to zero)
> > > +
> > > +Vring descriptor indices for packed virtqueues
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-------------+--------------------+
> > > +| vring index | descriptor indices |
> > > ++-------------+--------------------+
> > > +
> > > +:vring index: 32-bit index of the respective virtqueue
> > > +
> > > +:descriptor indices: 32-bit value:
> > > +
> > > +  - Bits 0–14: Index in the *Available Ring*
> > Same here.
> > 
> > > +  - Bit 15: Driver (Available) Ring Wrap Counter
> > > +  - Bits 16–30: Index in the *Used Ring*
> > Same here.
> > 
> > > +  - Bit 31: Device (Used) Ring Wrap Counter
> > > +
> > >   A vring address description
> > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > @@ -1031,18 +1062,45 @@ Front-end message types
> > >   ``VHOST_USER_SET_VRING_BASE``
> > >     :id: 10
> > >     :equivalent ioctl: ``VHOST_SET_VRING_BASE``
> > > -  :request payload: vring state description
> > > +  :request payload: vring descriptor index/indices
> > >     :reply payload: N/A
> > > -  Sets the base offset in the available vring.
> > > +  Sets the next index to use for descriptors in this vring:
> > > +
> > > +  * For a split virtqueue, sets only the next descriptor index in the
> > > +    *Available Ring*.  The device is supposed to read the next index in
> > > +    the *Used Ring* from the respective vring structure in guest memory.
> > > +
> > > +  * For a packed virtqueue, both indices are supplied, as they are not
> > > +    explicitly available in memory.
> > > +
> > > +  Consequently, the payload type is specific to the type of virt queue
> > > +  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
> > > +  indices for packed virtqueues*).
> > >   ``VHOST_USER_GET_VRING_BASE``
> > >     :id: 11
> > >     :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE``
> > >     :request payload: vring state description
> > > -  :reply payload: vring state description
> > > +  :reply payload: vring descriptor index/indices
> > > +
> > > +  Stops the vring and returns the current descriptor index or indices:
> > > +
> > > +    * For a split virtqueue, returns only the 16-bit next descriptor
> > > +      index in the *Available Ring*.  The index in the *Used Ring* is
> > > +      controlled by the guest driver and can be read from the vring
> > I find "is controlled by the guest driver" confusing. The device writes
> > the Used Ring index. The driver only reads it. The device is the active
> > party here.
> 
> Er, good point.  That breaks the whole reasoning.  Then I don’t understand
> why we do get/set the available ring index and not the used ring index.  Do
> you know why?

It's simple. The used ring index in memory is controlled by the device
and reflects device state, so the device can just read it back to
restore. The available ring index in memory is controlled by the driver
and does not reflect device state.
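
For illustration, here is a minimal Python sketch of the two 32-bit
payload layouts from the quoted patch. The function names are
hypothetical and come from neither QEMU nor any back-end; only the bit
positions are taken from the patch text.

```python
def decode_split_base(value):
    """Split virtqueue: bits 0-15 hold the next Available Ring index."""
    if value >> 16:
        raise ValueError("bits 16-31 are reserved and must be zero")
    return value & 0xFFFF


def ring_slot(free_running_idx, queue_size):
    """The index is free-running; the slot is only derived on ring access."""
    return free_running_idx % queue_size


def encode_packed_base(avail_idx, avail_wrap, used_idx, used_wrap):
    """Packed virtqueue: two 15-bit indices, each followed by a wrap bit."""
    return (
        (avail_idx & 0x7FFF)            # bits 0-14
        | ((avail_wrap & 1) << 15)      # bit 15
        | ((used_idx & 0x7FFF) << 16)   # bits 16-30
        | ((used_wrap & 1) << 31)       # bit 31
    )


def decode_packed_base(value):
    """Inverse of encode_packed_base()."""
    return {
        "avail_idx": value & 0x7FFF,
        "avail_wrap": (value >> 15) & 1,
        "used_idx": (value >> 16) & 0x7FFF,
        "used_wrap": (value >> 31) & 1,
    }
```

In the split case only the available index is carried in the message,
because the device can re-read its own used index from guest memory. In
the packed case both indices and their wrap counters have to travel in
the message.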

> > The sentence can be shortened to omit the "controlled by the guest
> > driver" part.
> 
> I don’t want to shorten it, because I would like to know why we don’t
> get/set both indices for split virtqueues, too.
> 
> Hanna
> 
> > > +      structure in memory, so is not covered.
> > > +
> > > +    * For a packed virtqueue, neither index is explicitly available to
> > > +      read from memory, so both indices (as maintained by the device) are
> > > +      returned.
> > > +
> > > +  Consequently, the payload type is specific to the type of virt queue
> > > +  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
> > > +  indices for packed virtqueues*).
> > > -  Get the available vring base offset.
> > > +  The request payload’s *num* field is currently reserved and must be
> > > +  set to 0.
> > >   ``VHOST_USER_SET_VRING_KICK``
> > >     :id: 12
> > > -- 
> > > 2.41.0
> > > 
> > > 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-06  8:45         ` Michael S. Tsirkin
@ 2023-10-06  9:15           ` Hanna Czenczek
  2023-10-06  9:26             ` Michael S. Tsirkin
  0 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-06  9:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu

On 06.10.23 10:45, Michael S. Tsirkin wrote:
> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>
>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>> means the same thing as RESET_DEVICE).
>>>>>
>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>> forward the guest-set status byte, but instead just makes it up
>>>>> internally, and actually completely ignores what the back-end returns,
>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>> to see whether the flag is still set, which is the only way in which
>>>>> dpdk uses the status byte.
>>>>>
>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>> field in a useful manner, and it also provides no practical use over
>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>> defined.  Deprecate it.
>>>>>
>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>> ---
>>>>>    docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>    1 file changed, 21 insertions(+), 7 deletions(-)
>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>> The fact current backends never check errors does not mean they never
>>> will. So no, not applying this.
>> Can this not be done with REPLY_ACK?  I.e., with the following message
>> order:
>>
>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>> present
>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>> 4. SET_FEATURES with need_reply
>>
>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>> vCPUs are stopped, which generally seems to request a device reset.  If we
>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>> implement SET_STATUS later may break with at least these qemu versions.  But
>> documenting that a particular use of the status byte is to be ignored would
>> be really strange.
>>
>> Hanna
> Hmm I guess. Though just following virtio spec seems cleaner to me...
> vhost-user reconfigures the state fully on start.

Not the internal device state, though.  virtiofsd has internal state, 
and other devices like vhost-gpu back-ends would probably, too.

Stefan has recently sent a series 
(https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) 
to put the reset (RESET_DEVICE) into virtio_reset() (when we really need 
a reset).

I really don’t like our current approach with the status byte. Following 
the virtio specification to me would mean that the guest directly 
controls this byte, which it does not.  qemu makes up values as it deems 
appropriate, and this includes sending a SET_STATUS 0 when the guest is 
just paused, i.e. when the guest really doesn’t want a device reset.

That means that qemu does not treat this as a virtio device field 
(because that would mean exposing it to the guest driver), but instead 
treats it as part of the vhost(-user) protocol.  It doesn’t feel right 
to me that we use a virtio-defined feature for communication on the 
vhost level, i.e. between front-end and back-end, and not between guest 
driver and device.  I think all vhost-level protocol features should be 
fully defined in the vhost-user specification, which REPLY_ACK is.
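
For illustration, the four-step message order quoted above can be
sketched in Python against a toy back-end. The class and function names
are hypothetical. The bit positions (feature bit 30, protocol feature
bit 3, header flag bit 3 for need_reply) are those from the vhost-user
specification. The acceptance rule in set_features() is an assumption
made for the example.

```python
VHOST_USER_F_PROTOCOL_FEATURES = 30   # virtio feature bit
VHOST_USER_PROTOCOL_F_REPLY_ACK = 3   # protocol feature bit
NEED_REPLY = 1 << 3                   # need_reply flag in the message header


class FeatureBackend:
    """Toy back-end that rejects feature bits it did not offer."""

    def __init__(self, supported_features):
        self.supported = supported_features | (1 << VHOST_USER_F_PROTOCOL_FEATURES)
        self.protocol_features = 0

    def get_features(self):
        return self.supported

    def get_protocol_features(self):
        return 1 << VHOST_USER_PROTOCOL_F_REPLY_ACK

    def set_protocol_features(self, features):
        self.protocol_features = features

    def set_features(self, features, flags):
        """Return the REPLY_ACK payload (0 = success), or None if not requested."""
        ok = (features & ~self.supported) == 0
        acked = self.protocol_features & (1 << VHOST_USER_PROTOCOL_F_REPLY_ACK)
        if acked and (flags & NEED_REPLY):
            return 0 if ok else 1
        return None


def negotiate(backend, wanted_features):
    """Run steps 1-4; True if the back-end acknowledged SET_FEATURES."""
    if not backend.get_features() & (1 << VHOST_USER_F_PROTOCOL_FEATURES):
        raise RuntimeError("back-end lacks VHOST_USER_F_PROTOCOL_FEATURES")
    if not backend.get_protocol_features() & (1 << VHOST_USER_PROTOCOL_F_REPLY_ACK):
        raise RuntimeError("back-end lacks REPLY_ACK")
    backend.set_protocol_features(1 << VHOST_USER_PROTOCOL_F_REPLY_ACK)
    return backend.set_features(wanted_features, NEED_REPLY) == 0
```

With this order the front-end learns about a rejected feature set from
the SET_FEATURES reply itself, without reading back a status byte.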

Now, we could hand full control of the status byte to the guest, and 
that would make me content.  But I feel like that doesn’t really work, 
because qemu needs to intercept the status byte anyway (it needs to know 
when there is a reset, probably wants to know when the device is 
configured, etc.), so I don’t think having the status byte in vhost-user 
really gains us much when qemu could translate status byte changes 
to/from other vhost-user commands.

Hanna

> I guess symmetry was the
> point. So I don't see why SET_STATUS 0 has to be ignored.
>
>
> SET_STATUS was introduced by:
>
> commit 923b8921d210763359e96246a58658ac0db6c645
> Author: Yajun Wu <yajunw@nvidia.com>
> Date:   Mon Oct 17 14:44:52 2022 +0800
>
>      vhost-user: Support vhost_dev_start
>
> CC the author.
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-06  9:15           ` Hanna Czenczek
@ 2023-10-06  9:26             ` Michael S. Tsirkin
  2023-10-06  9:47               ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Michael S. Tsirkin @ 2023-10-06  9:26 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu

On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
> On 06.10.23 10:45, Michael S. Tsirkin wrote:
> > On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
> > > On 05.10.23 19:15, Michael S. Tsirkin wrote:
> > > > On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> > > > > On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> > > > > > There is no clearly defined purpose for the virtio status byte in
> > > > > > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> > > > > > feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> > > > > > protocol extension, it is possible for SET_FEATURES to return errors
> > > > > > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> > > > > > 
> > > > > > As for implementations, SET_STATUS is not widely implemented.  dpdk does
> > > > > > implement it, but only uses it to signal feature negotiation failure.
> > > > > > While it does log reset requests (SET_STATUS 0) as such, it effectively
> > > > > > ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> > > > > > means the same thing as RESET_DEVICE).
> > > > > > 
> > > > > > While qemu superficially has support for [GS]ET_STATUS, it does not
> > > > > > forward the guest-set status byte, but instead just makes it up
> > > > > > internally, and actually completely ignores what the back-end returns,
> > > > > > only using it as the template for a subsequent SET_STATUS to add single
> > > > > > bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> > > > > > to see whether the flag is still set, which is the only way in which
> > > > > > dpdk uses the status byte.
> > > > > > 
> > > > > > As-is, no front-end or back-end can rely on the other side handling this
> > > > > > field in a useful manner, and it also provides no practical use over
> > > > > > other mechanisms the vhost-user protocol has, which are more clearly
> > > > > > defined.  Deprecate it.
> > > > > > 
> > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > ---
> > > > > >    docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> > > > > >    1 file changed, 21 insertions(+), 7 deletions(-)
> > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> > > > The fact current backends never check errors does not mean they never
> > > > will. So no, not applying this.
> > > Can this not be done with REPLY_ACK?  I.e., with the following message
> > > order:
> > > 
> > > 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
> > > present
> > > 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > 4. SET_FEATURES with need_reply
> > > 
> > > If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
> > > vCPUs are stopped, which generally seems to request a device reset.  If we
> > > don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
> > > implement SET_STATUS later may break with at least these qemu versions.  But
> > > documenting that a particular use of the status byte is to be ignored would
> > > be really strange.
> > > 
> > > Hanna
> > Hmm I guess. Though just following virtio spec seems cleaner to me...
> > vhost-user reconfigures the state fully on start.
> 
> Not the internal device state, though.  virtiofsd has internal state, and
> other devices like vhost-gpu back-ends would probably, too.
> 
> Stefan has recently sent a series
> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
> reset).
> 
> I really don’t like our current approach with the status byte. Following the
> virtio specification to me would mean that the guest directly controls this
> byte, which it does not.  qemu makes up values as it deems appropriate, and
> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
> when the guest really doesn’t want a device reset.
> 
> That means that qemu does not treat this as a virtio device field (because
> that would mean exposing it to the guest driver), but instead treats it as
> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
> a virtio-defined feature for communication on the vhost level, i.e. between
> front-end and back-end, and not between guest driver and device.  I think
> all vhost-level protocol features should be fully defined in the vhost-user
> specification, which REPLY_ACK is.

Hmm that makes sense. Maybe we should have done what stefan's patch
is doing.

Do look at the original commit that introduced it to understand why
it was added.

> Now, we could hand full control of the status byte to the guest, and that
> would make me content.  But I feel like that doesn’t really work, because
> qemu needs to intercept the status byte anyway (it needs to know when there
> is a reset, probably wants to know when the device is configured, etc.), so
> I don’t think having the status byte in vhost-user really gains us much when
> qemu could translate status byte changes to/from other vhost-user commands.
> 
> Hanna

well it intercepts it but I think it could pass it on unchanged.


> > I guess symmetry was the
> > point. So I don't see why SET_STATUS 0 has to be ignored.
> > 
> > 
> > SET_STATUS was introduced by:
> > 
> > commit 923b8921d210763359e96246a58658ac0db6c645
> > Author: Yajun Wu <yajunw@nvidia.com>
> > Date:   Mon Oct 17 14:44:52 2022 +0800
> > 
> >      vhost-user: Support vhost_dev_start
> > 
> > CC the author.
> > 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-06  9:26             ` Michael S. Tsirkin
@ 2023-10-06  9:47               ` Hanna Czenczek
  2023-10-06 10:34                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-06  9:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu

On 06.10.23 11:26, Michael S. Tsirkin wrote:
> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>>>
>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>>>> means the same thing as RESET_DEVICE).
>>>>>>>
>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>>>> forward the guest-set status byte, but instead just makes it up
>>>>>>> internally, and actually completely ignores what the back-end returns,
>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>>>> to see whether the flag is still set, which is the only way in which
>>>>>>> dpdk uses the status byte.
>>>>>>>
>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>>>> field in a useful manner, and it also provides no practical use over
>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>>>> defined.  Deprecate it.
>>>>>>>
>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>> ---
>>>>>>>     docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>>>     1 file changed, 21 insertions(+), 7 deletions(-)
>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>>>> The fact current backends never check errors does not mean they never
>>>>> will. So no, not applying this.
>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
>>>> order:
>>>>
>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>>>> present
>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>> 4. SET_FEATURES with need_reply
>>>>
>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>>>> implement SET_STATUS later may break with at least these qemu versions.  But
>>>> documenting that a particular use of the status byte is to be ignored would
>>>> be really strange.
>>>>
>>>> Hanna
>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
>>> vhost-user reconfigures the state fully on start.
>> Not the internal device state, though.  virtiofsd has internal state, and
>> other devices like vhost-gpu back-ends would probably, too.
>>
>> Stefan has recently sent a series
>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
>> reset).
>>
>> I really don’t like our current approach with the status byte. Following the
>> virtio specification to me would mean that the guest directly controls this
>> byte, which it does not.  qemu makes up values as it deems appropriate, and
>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
>> when the guest really doesn’t want a device reset.
>>
>> That means that qemu does not treat this as a virtio device field (because
>> that would mean exposing it to the guest driver), but instead treats it as
>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
>> a virtio-defined feature for communication on the vhost level, i.e. between
>> front-end and back-end, and not between guest driver and device.  I think
>> all vhost-level protocol features should be fully defined in the vhost-user
>> specification, which REPLY_ACK is.
> Hmm that makes sense. Maybe we should have done what stefan's patch
> is doing.
>
> Do look at the original commit that introduced it to understand why
> it was added.

I don’t understand why this was added to the stop/cont code, though.  If
it is time consuming to make these changes, why are they done every time
the VM is paused and resumed?  It makes sense that this would be done
for the initial configuration (where a reset also wouldn’t hurt), but
here it seems wrong.

(To be clear, a reset in the stop/cont code is wrong, because it breaks 
stateful devices.)

Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as 
originally introduced was wrong even for non-stateful devices, because 
it occurred before we fetched the state (vring indices) so we could 
restore it later.  I don’t know how 923b8921d21 was tested, but if the 
back-end used for testing implemented SET_STATUS 0 as a reset, it could 
not have survived either migration or a stop/cont in general, because 
the vring indices would have been reset to 0.
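
For illustration, the ordering problem can be shown with a toy Python
model. This is not QEMU or back-end code: the class and function names
are hypothetical, and the back-end is assumed to implement SET_STATUS 0
as a reset, as the virtio spec would suggest.

```python
class ResettingBackend:
    """Toy back-end that treats SET_STATUS 0 as a full device reset."""

    def __init__(self):
        self.next_avail_idx = 0

    def process_requests(self, count):
        # Free-running 16-bit index, as for a split virtqueue.
        self.next_avail_idx = (self.next_avail_idx + count) & 0xFFFF

    def set_status(self, status):
        if status == 0:
            self.next_avail_idx = 0  # the reset discards the vring position

    def get_vring_base(self):
        return self.next_avail_idx


def stop_device(backend, reset_first):
    """Return the index the front-end would save for a later SET_VRING_BASE."""
    if reset_first:
        backend.set_status(0)
        return backend.get_vring_base()
    base = backend.get_vring_base()
    backend.set_status(0)
    return base
```

Fetching the index first preserves the vring position, but the reset
that follows would still discard any internal state of a stateful
device.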

What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke 
all devices that would implement them as per virtio spec, and even today 
it’s broken for stateful devices.  The mentioned performance issue is 
likely real, but we can’t address it by making up SET_STATUS calls that 
are wrong.

I concede that I didn’t think about DRIVER_OK.  Personally, I would do 
all final configuration that would happen upon a DRIVER_OK once the 
first vring is started (i.e. receives a kick).  That has the added 
benefit of being asynchronous because it doesn’t block any vhost-user 
messages (which are synchronous, and thus block downtime).

Hanna

>> Now, we could hand full control of the status byte to the guest, and that
>> would make me content.  But I feel like that doesn’t really work, because
>> qemu needs to intercept the status byte anyway (it needs to know when there
>> is a reset, probably wants to know when the device is configured, etc.), so
>> I don’t think having the status byte in vhost-user really gains us much when
>> qemu could translate status byte changes to/from other vhost-user commands.
>>
>> Hanna
> well it intercepts it but I think it could pass it on unchanged.
>
>
>>> I guess symmetry was the
>>> point. So I don't see why SET_STATUS 0 has to be ignored.
>>>
>>>
>>> SET_STATUS was introduced by:
>>>
>>> commit 923b8921d210763359e96246a58658ac0db6c645
>>> Author: Yajun Wu <yajunw@nvidia.com>
>>> Date:   Mon Oct 17 14:44:52 2022 +0800
>>>
>>>       vhost-user: Support vhost_dev_start
>>>
>>> CC the author.
>>>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-06  9:47               ` Hanna Czenczek
@ 2023-10-06 10:34                 ` Michael S. Tsirkin
  2023-10-06 11:42                   ` Hanna Czenczek
  2023-10-07  2:22                   ` Yajun Wu
  0 siblings, 2 replies; 53+ messages in thread
From: Michael S. Tsirkin @ 2023-10-06 10:34 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu

On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
> On 06.10.23 11:26, Michael S. Tsirkin wrote:
> > On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
> > > On 06.10.23 10:45, Michael S. Tsirkin wrote:
> > > > On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
> > > > > On 05.10.23 19:15, Michael S. Tsirkin wrote:
> > > > > > On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> > > > > > > On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> > > > > > > > There is no clearly defined purpose for the virtio status byte in
> > > > > > > > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> > > > > > > > feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> > > > > > > > protocol extension, it is possible for SET_FEATURES to return errors
> > > > > > > > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> > > > > > > > 
> > > > > > > > As for implementations, SET_STATUS is not widely implemented.  dpdk does
> > > > > > > > implement it, but only uses it to signal feature negotiation failure.
> > > > > > > > While it does log reset requests (SET_STATUS 0) as such, it effectively
> > > > > > > > ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> > > > > > > > means the same thing as RESET_DEVICE).
> > > > > > > > 
> > > > > > > > While qemu superficially has support for [GS]ET_STATUS, it does not
> > > > > > > > forward the guest-set status byte, but instead just makes it up
> > > > > > > > internally, and actually completely ignores what the back-end returns,
> > > > > > > > only using it as the template for a subsequent SET_STATUS to add single
> > > > > > > > bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> > > > > > > > to see whether the flag is still set, which is the only way in which
> > > > > > > > dpdk uses the status byte.
> > > > > > > > 
> > > > > > > > As-is, no front-end or back-end can rely on the other side handling this
> > > > > > > > field in a useful manner, and it also provides no practical use over
> > > > > > > > other mechanisms the vhost-user protocol has, which are more clearly
> > > > > > > > defined.  Deprecate it.
> > > > > > > > 
> > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > ---
> > > > > > > >     docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> > > > > > > >     1 file changed, 21 insertions(+), 7 deletions(-)
> > > > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> > > > > > The fact current backends never check errors does not mean they never
> > > > > > will. So no, not applying this.
> > > > > Can this not be done with REPLY_ACK?  I.e., with the following message
> > > > > order:
> > > > > 
> > > > > 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
> > > > > present
> > > > > 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > > > 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > > > 4. SET_FEATURES with need_reply
> > > > > 
> > > > > If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
> > > > > vCPUs are stopped, which generally seems to request a device reset.  If we
> > > > > don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
> > > > > implement SET_STATUS later may break with at least these qemu versions.  But
> > > > > documenting that a particular use of the status byte is to be ignored would
> > > > > be really strange.
> > > > > 
> > > > > Hanna
> > > > Hmm I guess. Though just following virtio spec seems cleaner to me...
> > > > vhost-user reconfigures the state fully on start.
> > > Not the internal device state, though.  virtiofsd has internal state, and
> > > other devices like vhost-gpu back-ends would probably, too.
> > > 
> > > Stefan has recently sent a series
> > > (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
> > > put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
> > > reset).
> > > 
> > > I really don’t like our current approach with the status byte. Following the
> > > virtio specification to me would mean that the guest directly controls this
> > > byte, which it does not.  qemu makes up values as it deems appropriate, and
> > > this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
> > > when the guest really doesn’t want a device reset.
> > > 
> > > That means that qemu does not treat this as a virtio device field (because
> > > that would mean exposing it to the guest driver), but instead treats it as
> > > part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
> > > a virtio-defined feature for communication on the vhost level, i.e. between
> > > front-end and back-end, and not between guest driver and device.  I think
> > > all vhost-level protocol features should be fully defined in the vhost-user
> > > specification, which REPLY_ACK is.
> > Hmm that makes sense. Maybe we should have done what stefan's patch
> > is doing.
> > 
> > Do look at the original commit that introduced it to understand why
> > it was added.
> 
> I don’t understand why this was added to the stop/cont code, though.  If it
> is time consuming to make these changes, why are they done every time the VM
> is paused and resumed?  It makes sense that this would be done for the initial
> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
> 
> (To be clear, a reset in the stop/cont code is wrong, because it breaks
> stateful devices.)
> 
> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
> originally introduced was wrong even for non-stateful devices, because it
> occurred before we fetched the state (vring indices) so we could restore it
> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
> for testing implemented SET_STATUS 0 as a reset, it could not have survived
> either migration or a stop/cont in general, because the vring indices would
> have been reset to 0.
> 
> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
> devices that would implement them as per virtio spec, and even today it’s
> broken for stateful devices.  The mentioned performance issue is likely
> real, but we can’t address it by making up SET_STATUS calls that are wrong.
> 
> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
> final configuration that would happen upon a DRIVER_OK once the first vring
> is started (i.e. receives a kick).  That has the added benefit of being
> asynchronous because it doesn’t block any vhost-user messages (which are
> synchronous, and thus block downtime).
> 
> Hanna


For better or worse kick is per ring. It's out of spec to start rings
that were not kicked but I guess you could do configuration ...
Seems somewhat asymmetrical though.

Let's wait until next week, hopefully Yajun Wu will answer.

> > > Now, we could hand full control of the status byte to the guest, and that
> > > would make me content.  But I feel like that doesn’t really work, because
> > > qemu needs to intercept the status byte anyway (it needs to know when there
> > > is a reset, probably wants to know when the device is configured, etc.), so
> > > I don’t think having the status byte in vhost-user really gains us much when
> > > qemu could translate status byte changes to/from other vhost-user commands.
> > > 
> > > Hanna
> > well it intercepts it but I think it could pass it on unchanged.
> > 
> > 
> > > > I guess symmetry was the
> > > > point. So I don't see why SET_STATUS 0 has to be ignored.
> > > > 
> > > > 
> > > > SET_STATUS was introduced by:
> > > > 
> > > > commit 923b8921d210763359e96246a58658ac0db6c645
> > > > Author: Yajun Wu <yajunw@nvidia.com>
> > > > Date:   Mon Oct 17 14:44:52 2022 +0800
> > > > 
> > > >       vhost-user: Support vhost_dev_start
> > > > 
> > > > CC the author.
> > > > 



* Re: [Virtio-fs] (no subject)
  2023-10-06 10:34                 ` Michael S. Tsirkin
@ 2023-10-06 11:42                   ` Hanna Czenczek
  2023-10-06 15:17                     ` Alex Bennée
  2023-10-07  2:22                   ` Yajun Wu
  1 sibling, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-06 11:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu

On 06.10.23 12:34, Michael S. Tsirkin wrote:
> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>>>>>
>>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>>>>>> means the same thing as RESET_DEVICE).
>>>>>>>>>
>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>>>>>> forward the guest-set status byte, but instead just makes it up
>>>>>>>>> internally, and actually completely ignores what the back-end returns,
>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>>>>>> to see whether the flag is still set, which is the only way in which
>>>>>>>>> dpdk uses the status byte.
>>>>>>>>>
>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>>>>>> field in a useful manner, and it also provides no practical use over
>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>>>>>> defined.  Deprecate it.
>>>>>>>>>
>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>> ---
>>>>>>>>>      docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>>>>>      1 file changed, 21 insertions(+), 7 deletions(-)
>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>>>>>> The fact current backends never check errors does not mean they never
>>>>>>> will. So no, not applying this.
>>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
>>>>>> order:
>>>>>>
>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>>>>>> present
>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>> 4. SET_FEATURES with need_reply
>>>>>>
>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
>>>>>> documenting that a particular use of the status byte is to be ignored would
>>>>>> be really strange.
>>>>>>
>>>>>> Hanna
>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
>>>>> vhost-user reconfigures the state fully on start.
>>>> Not the internal device state, though.  virtiofsd has internal state, and
>>>> other devices like vhost-gpu back-ends would probably, too.
>>>>
>>>> Stefan has recently sent a series
>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
>>>> reset).
>>>>
>>>> I really don’t like our current approach with the status byte. Following the
>>>> virtio specification to me would mean that the guest directly controls this
>>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
>>>> when the guest really doesn’t want a device reset.
>>>>
>>>> That means that qemu does not treat this as a virtio device field (because
>>>> that would mean exposing it to the guest driver), but instead treats it as
>>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
>>>> a virtio-defined feature for communication on the vhost level, i.e. between
>>>> front-end and back-end, and not between guest driver and device.  I think
>>>> all vhost-level protocol features should be fully defined in the vhost-user
>>>> specification, which REPLY_ACK is.
>>> Hmm that makes sense. Maybe we should have done what stefan's patch
>>> is doing.
>>>
>>> Do look at the original commit that introduced it to understand why
>>> it was added.
>> I don’t understand why this was added to the stop/cont code, though.  If it
>> is time consuming to make these changes, why are they done every time the VM
>> is paused and resumed?  It makes sense that this would be done for the initial
>> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
>>
>> (To be clear, a reset in the stop/cont code is wrong, because it breaks
>> stateful devices.)
>>
>> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
>> originally introduced was wrong even for non-stateful devices, because it
>> occurred before we fetched the state (vring indices) so we could restore it
>> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
>> for testing implemented SET_STATUS 0 as a reset, it could not have survived
>> either migration or a stop/cont in general, because the vring indices would
>> have been reset to 0.
>>
>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>> devices that would implement them as per virtio spec, and even today it’s
>> broken for stateful devices.  The mentioned performance issue is likely
>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>
>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>> final configuration that would happen upon a DRIVER_OK once the first vring
>> is started (i.e. receives a kick).  That has the added benefit of being
>> asynchronous because it doesn’t block any vhost-user messages (which are
>> synchronous, and thus block downtime).
>>
>> Hanna
>
> For better or worse kick is per ring. It's out of spec to start rings
> that were not kicked but I guess you could do configuration ...
> Seems somewhat asymmetrical though.

I meant to take the first ring being started as the signal to do the 
global configuration, i.e. not do this once per vring, but once globally.
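
A minimal sketch of that idea in C, assuming a hypothetical back-end 
structure (struct backend, backend_global_configure() and 
backend_vring_start() are made-up names, not taken from qemu or any 
existing back-end):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VRINGS 8   /* arbitrary limit for this sketch */

struct backend {
    bool globally_configured;       /* one-time setup already done? */
    bool vring_started[MAX_VRINGS];
    unsigned global_config_runs;    /* how often the one-time setup ran */
};

/* Hypothetical one-time setup, i.e. what DRIVER_OK would otherwise trigger. */
static void backend_global_configure(struct backend *b)
{
    b->global_config_runs++;
    b->globally_configured = true;
}

/*
 * Called when vring `idx` is started (e.g. on its first kick).  The first
 * started ring triggers the global configuration; later rings skip it.
 */
static bool backend_vring_start(struct backend *b, size_t idx)
{
    assert(b != NULL);
    if (idx >= MAX_VRINGS) {
        return false;
    }
    if (!b->globally_configured) {
        backend_global_configure(b);
    }
    b->vring_started[idx] = true;
    return true;
}
```

Whichever ring happens to start first pays for the global setup, and the 
setup runs exactly once no matter how many rings follow.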

> Let's wait until next week, hopefully Yajun Wu will answer.

I mean, personally I don’t really care about the whole SET_STATUS 
thing.  It’s clear that it’s broken for stateful devices.  The fact that 
it took until 6f8be29ec17d to fix it for just any device that would 
implement it according to spec is, to me, a strong indication that nobody 
does implement it according to spec, and that it is currently only used 
to signal to some specific back-end that all rings have been set up and 
should be configured in a single block.

(By the way, our SET_STATUS call that adds ACKNOWLEDGE | DRIVER | 
DRIVER_OK is also completely against the spec, and any well-behaving 
device should reject it.  These flags must be set one after another, and 
specifically, features must be read and set after setting DRIVER, but 
before setting FEATURES_OK, and FEATURES_OK must be set before setting 
DRIVER_OK.  Any well-behaving device should error out when DRIVER_OK is 
set without FEATURES_OK set, or when FEATURES_OK is set without 
ACKNOWLEDGE | DRIVER set.)
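
To illustrate the ordering described above, here is a rough sketch in C 
of the check a strict device could apply to each status write.  The bit 
values are the ones from the virtio specification; the function name 
status_write_is_valid() is made up, the sketch assumes a non-legacy 
device (one that uses FEATURES_OK), and it ignores FAILED and 
DEVICE_NEEDS_RESET:

```c
#include <stdbool.h>
#include <stdint.h>

/* Device status bits as defined by the virtio specification. */
#define VIRTIO_STATUS_ACKNOWLEDGE 0x01
#define VIRTIO_STATUS_DRIVER      0x02
#define VIRTIO_STATUS_DRIVER_OK   0x04
#define VIRTIO_STATUS_FEATURES_OK 0x08

/*
 * Strict reading of the initialization sequence: every write may add at
 * most one of the four initialization bits, and only if the bits that
 * must precede it are already set.  Writing 0 is a reset and always OK.
 */
static bool status_write_is_valid(uint8_t old_status, uint8_t new_status)
{
    const uint8_t init_bits = VIRTIO_STATUS_ACKNOWLEDGE |
                              VIRTIO_STATUS_DRIVER |
                              VIRTIO_STATUS_FEATURES_OK |
                              VIRTIO_STATUS_DRIVER_OK;
    const uint8_t ack_driver = VIRTIO_STATUS_ACKNOWLEDGE |
                               VIRTIO_STATUS_DRIVER;
    uint8_t added;

    if (new_status == 0) {
        return true;                /* reset */
    }
    if (old_status & ~new_status) {
        return false;               /* bits may only be cleared by a reset */
    }

    added = new_status & ~old_status & init_bits;
    switch (added) {
    case 0:                         /* no initialization bit changed */
    case VIRTIO_STATUS_ACKNOWLEDGE:
        return true;
    case VIRTIO_STATUS_DRIVER:
        return old_status & VIRTIO_STATUS_ACKNOWLEDGE;
    case VIRTIO_STATUS_FEATURES_OK:
        return (old_status & ack_driver) == ack_driver;
    case VIRTIO_STATUS_DRIVER_OK:
        return old_status & VIRTIO_STATUS_FEATURES_OK;
    default:
        return false;               /* more than one bit added at once */
    }
}
```

Under this check, a single write that goes from 0 to ACKNOWLEDGE | DRIVER | 
DRIVER_OK is rejected, and so is DRIVER_OK without a preceding 
FEATURES_OK.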

I can just drop this patch from the migration series, because in my 
opinion it doesn’t affect it whatsoever (although I understood Stefan 
disagrees).  But honestly, I think any vhost-user back-end developer is 
well-advised to completely ignore the status byte.  Not ignoring it 
means relying on qemu’s implementation-defined behavior.

Hanna



* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  2023-10-06  8:49       ` Michael S. Tsirkin
@ 2023-10-06 13:55         ` Hanna Czenczek
  2023-10-06 13:58           ` Hanna Czenczek
  2023-10-07 21:27           ` Michael S. Tsirkin
  0 siblings, 2 replies; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-06 13:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-fs, Eugenio Pérez, Anton Kuchin, qemu-devel, Stefan Hajnoczi

On 06.10.23 10:49, Michael S. Tsirkin wrote:
> On Fri, Oct 06, 2023 at 09:53:53AM +0200, Hanna Czenczek wrote:
>> On 05.10.23 19:38, Stefan Hajnoczi wrote:
>>> On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote:
>>>> GET_VRING_BASE does not mention that it stops the respective ring.  Fix
>>>> that.
>>>>
>>>> Furthermore, it is not fully clear what the "base offset" these
>>>> commands' documentation refers to is; an offset could be many things.
>>>> Be more precise and verbose about it, especially given that these
>>>> commands use different payload structures depending on whether the vring
>>>> is split or packed.
>>>>
>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>> ---
>>>>    docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++---
>>>>    1 file changed, 62 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
>>>> index 2f68e67a1a..50f5acebe5 100644
>>>> --- a/docs/interop/vhost-user.rst
>>>> +++ b/docs/interop/vhost-user.rst
>>>> @@ -108,6 +108,37 @@ A vring state description
>>>>    :num: a 32-bit number
>>>> +A vring descriptor index for split virtqueues
>>>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> +
>>>> ++-------------+---------------------+
>>>> +| vring index | index in avail ring |
>>>> ++-------------+---------------------+
>>>> +
>>>> +:vring index: 32-bit index of the respective virtqueue
>>>> +
>>>> +:index in avail ring: 32-bit value, of which currently only the lower 16
>>>> +  bits are used:
>>>> +
>>>> +  - Bits 0–15: Next descriptor index in the *Available Ring*
>>> I think we need to say more to make this implementable just by reading
>>> the spec:
>>>
>>>     Index of the next *Available Ring* descriptor that the back-end will
>>>     process. This is a free-running index that is not wrapped by the ring
>>>     size.
>> Sure, thanks.
>>
>>> Feel free to rephrase.
>>>
>>>> +  - Bits 16–31: Reserved (set to zero)
>>>> +
>>>> +Vring descriptor indices for packed virtqueues
>>>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> +
>>>> ++-------------+--------------------+
>>>> +| vring index | descriptor indices |
>>>> ++-------------+--------------------+
>>>> +
>>>> +:vring index: 32-bit index of the respective virtqueue
>>>> +
>>>> +:descriptor indices: 32-bit value:
>>>> +
>>>> +  - Bits 0–14: Index in the *Available Ring*
>>> Same here.
>>>
>>>> +  - Bit 15: Driver (Available) Ring Wrap Counter
>>>> +  - Bits 16–30: Index in the *Used Ring*
>>> Same here.
>>>
>>>> +  - Bit 31: Device (Used) Ring Wrap Counter
>>>> +
>>>>    A vring address description
>>>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> @@ -1031,18 +1062,45 @@ Front-end message types
>>>>    ``VHOST_USER_SET_VRING_BASE``
>>>>      :id: 10
>>>>      :equivalent ioctl: ``VHOST_SET_VRING_BASE``
>>>> -  :request payload: vring state description
>>>> +  :request payload: vring descriptor index/indices
>>>>      :reply payload: N/A
>>>> -  Sets the base offset in the available vring.
>>>> +  Sets the next index to use for descriptors in this vring:
>>>> +
>>>> +  * For a split virtqueue, sets only the next descriptor index in the
>>>> +    *Available Ring*.  The device is supposed to read the next index in
>>>> +    the *Used Ring* from the respective vring structure in guest memory.
>>>> +
>>>> +  * For a packed virtqueue, both indices are supplied, as they are not
>>>> +    explicitly available in memory.
>>>> +
>>>> +  Consequently, the payload type is specific to the type of virt queue
>>>> +  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
>>>> +  indices for packed virtqueues*).
>>>>    ``VHOST_USER_GET_VRING_BASE``
>>>>      :id: 11
>>>>      :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE``
>>>>      :request payload: vring state description
>>>> -  :reply payload: vring state description
>>>> +  :reply payload: vring descriptor index/indices
>>>> +
>>>> +  Stops the vring and returns the current descriptor index or indices:
>>>> +
>>>> +    * For a split virtqueue, returns only the 16-bit next descriptor
>>>> +      index in the *Available Ring*.  The index in the *Used Ring* is
>>>> +      controlled by the guest driver and can be read from the vring
>>> I find "is controlled by the guest driver" confusing. The device writes
>>> the Used Ring index. The driver only reads it. The device is the active
>>> party here.
>> Er, good point.  That breaks the whole reasoning.  Then I don’t understand
>> why we do get/set the available ring index and not the used ring index.  Do
>> you know why?
> It's simple. used ring index in memory is controlled by the device and
> reflects device state.

Exactly, it’s device state, that’s why I thought the front-end needs to 
ensure it’s read and restored around the reset we currently have in 
vhost_dev_stop()/start().

> device can just read it back to restore.

I find it strange that the device is supposed to read its own state from 
memory.

> available ring index in memory is controlled by driver and does
> not reflect device state.

Why can’t the device read the available index from memory?  That value 
is put into memory by the driver precisely so the device can read it 
from there.

Hanna



* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  2023-10-06 13:55         ` Hanna Czenczek
@ 2023-10-06 13:58           ` Hanna Czenczek
  2023-10-07 21:29             ` Michael S. Tsirkin
  2023-10-07 21:27           ` Michael S. Tsirkin
  1 sibling, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-06 13:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-fs, Eugenio Pérez, Anton Kuchin, qemu-devel, Stefan Hajnoczi

On 06.10.23 15:55, Hanna Czenczek wrote:
> On 06.10.23 10:49, Michael S. Tsirkin wrote:
>> On Fri, Oct 06, 2023 at 09:53:53AM +0200, Hanna Czenczek wrote:
>>> On 05.10.23 19:38, Stefan Hajnoczi wrote:
>>>> On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote:

[...]

>>>>    ``VHOST_USER_GET_VRING_BASE``
>>>>      :id: 11
>>>>      :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE``
>>>>      :request payload: vring state description
>>>> -  :reply payload: vring state description
>>>> +  :reply payload: vring descriptor index/indices
>>>> +
>>>> +  Stops the vring and returns the current descriptor index or indices:
>>>> +
>>>> +    * For a split virtqueue, returns only the 16-bit next descriptor
>>>> +      index in the *Available Ring*.  The index in the *Used Ring* is
>>>> +      controlled by the guest driver and can be read from the vring
>>>> I find "is controlled by the guest driver" confusing. The device writes
>>>> the Used Ring index. The driver only reads it. The device is the active
>>>> party here.
>>> Er, good point.  That breaks the whole reasoning.  Then I don’t understand
>>> why we do get/set the available ring index and not the used ring index.  Do
>>> you know why?
>> It's simple. used ring index in memory is controlled by the device and
>> reflects device state.
>
> Exactly, it’s device state, that’s why I thought the front-end needs 
> to ensure its read and restored around the reset we currently have in 
> vhost_dev_stop()/start().
>
>> device can just read it back to restore.
>
> I find it strange that the device is supposed to read its own state 
> from memory.
>
>> available ring index in memory is controlled by driver and does
>> not reflect device state.
>
> Why can’t the device read the available index from memory?  That value 
> is put into memory by the driver precisely so the device can read it 
> from there.

Ah, wait, is the idea that the device may have an internal available 
index counter that reflects which descriptors it has already fetched?  I.e. 
this index will lag behind the one in memory, and the difference is the new 
descriptors that the device still needs to read?  If that internal 
counter is the index that is get/set here, then yes, that makes a lot of 
sense.
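
If that reading is right, the relationship for a split virtqueue could be 
sketched like this in C.  All names are made up for illustration; 
avail_idx_in_memory stands in for the idx field the driver writes to the 
Available Ring in guest memory, and last_avail_idx for the back-end’s 
internal counter:

```c
#include <stdint.h>

struct split_vring_state {
    uint16_t avail_idx_in_memory;  /* written by the driver, in guest memory */
    uint16_t last_avail_idx;       /* internal: next entry the back-end fetches */
};

/*
 * Entries the driver has made available but the back-end has not fetched
 * yet.  Both indices are free-running 16-bit counters, so the subtraction
 * is done modulo 2^16 and stays correct across wraparound.
 */
static uint16_t pending_descriptors(const struct split_vring_state *s)
{
    return (uint16_t)(s->avail_idx_in_memory - s->last_avail_idx);
}

/* Value a GET_VRING_BASE reply would carry: bits 0-15 used, 16-31 zero. */
static uint32_t get_vring_base_split(const struct split_vring_state *s)
{
    return s->last_avail_idx;
}

/* SET_VRING_BASE restores the internal counter from the lower 16 bits. */
static void set_vring_base_split(struct split_vring_state *s, uint32_t num)
{
    s->last_avail_idx = (uint16_t)(num & 0xffff);
}
```

The value in memory alone cannot tell the device how far it got before it 
was stopped, which is why the internal counter is the one that has to be 
saved and restored.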

Hanna



* Re: [Virtio-fs] (no subject)
  2023-10-06 11:42                   ` Hanna Czenczek
@ 2023-10-06 15:17                     ` Alex Bennée
  2023-10-06 15:47                       ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Alex Bennée @ 2023-10-06 15:17 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Michael S. Tsirkin, virtio-fs, Eugenio Pérez, Anton Kuchin,
	Yajun Wu, qemu-devel


Hanna Czenczek <hreitz@redhat.com> writes:

> On 06.10.23 12:34, Michael S. Tsirkin wrote:
>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
<snip>
>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>>> devices that would implement them as per virtio spec, and even today it’s
>>> broken for stateful devices.  The mentioned performance issue is likely
>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>>
>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>>> final configuration that would happen upon a DRIVER_OK once the first vring
>>> is started (i.e. receives a kick).  That has the added benefit of being
>>> asynchronous because it doesn’t block any vhost-user messages (which are
>>> synchronous, and thus block downtime).
>>>
>>> Hanna
>>
>> For better or worse kick is per ring. It's out of spec to start rings
>> that were not kicked but I guess you could do configuration ...
>> Seems somewhat asymmetrical though.
>
> I meant to take the first ring being started as the signal to do the
> global configuration, i.e. not do this once per vring, but once
> globally.
>
>> Let's wait until next week, hopefully Yajun Wu will answer.
>
> I mean, personally I don’t really care about the whole SET_STATUS
> thing.  It’s clear that it’s broken for stateful devices.  The fact
> that it took until 6f8be29ec17d to fix it for just any device that
> would implement it according to spec to me is a strong indication that
> nobody does implement it according to spec, and is currently only used
> to signal to some specific back-end that all rings have been set up
> and should be configured in a single block.

I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT
extensions where everything is off-loaded to the vhost-user backend.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: [Virtio-fs] (no subject)
  2023-10-06 15:17                     ` Alex Bennée
@ 2023-10-06 15:47                       ` Hanna Czenczek
  2023-10-06 20:49                         ` Alex Bennée
  0 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-06 15:47 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Michael S. Tsirkin, virtio-fs, Eugenio Pérez, Anton Kuchin,
	Yajun Wu, qemu-devel

On 06.10.23 17:17, Alex Bennée wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
>> On 06.10.23 12:34, Michael S. Tsirkin wrote:
>>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> <snip>
>>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>>>> devices that would implement them as per virtio spec, and even today it’s
>>>> broken for stateful devices.  The mentioned performance issue is likely
>>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>>>
>>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>>>> final configuration that would happen upon a DRIVER_OK once the first vring
>>>> is started (i.e. receives a kick).  That has the added benefit of being
>>>> asynchronous because it doesn’t block any vhost-user messages (which are
>>>> synchronous, and thus block downtime).
>>>>
>>>> Hanna
>>> For better or worse kick is per ring. It's out of spec to start rings
>>> that were not kicked but I guess you could do configuration ...
>>> Seems somewhat asymmetrical though.
>> I meant to take the first ring being started as the signal to do the
>> global configuration, i.e. not do this once per vring, but once
>> globally.
>>
>>> Let's wait until next week, hopefully Yajun Wu will answer.
>> I mean, personally I don’t really care about the whole SET_STATUS
>> thing.  It’s clear that it’s broken for stateful devices.  The fact
>> that it took until 6f8be29ec17d to fix it for just any device that
>> would implement it according to spec to me is a strong indication that
>> nobody does implement it according to spec, and is currently only used
>> to signal to some specific back-end that all rings have been set up
>> and should be configured in a single block.
> I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT
> extensions where everything is off-loaded to the vhost-user backend.

How do these back-ends work with the fact that qemu uses SET_STATUS 
incorrectly when not offloading?  Do you plan on fixing that?

(I.e. that we send SET_STATUS 0 when the VM is paused, potentially 
resetting state that is not recoverable, and that we set DRIVER and 
DRIVER_OK simultaneously.)

Hanna



* Re: [Virtio-fs] (no subject)
  2023-10-06 15:47                       ` Hanna Czenczek
@ 2023-10-06 20:49                         ` Alex Bennée
  2023-10-09  8:07                           ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Alex Bennée @ 2023-10-06 20:49 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Michael S. Tsirkin, virtio-fs, Eugenio Pérez, Anton Kuchin,
	Yajun Wu, qemu-devel


Hanna Czenczek <hreitz@redhat.com> writes:

> On 06.10.23 17:17, Alex Bennée wrote:
>> Hanna Czenczek <hreitz@redhat.com> writes:
>>
>>> On 06.10.23 12:34, Michael S. Tsirkin wrote:
>>>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>> <snip>
>>>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>>>>> devices that would implement them as per virtio spec, and even today it’s
>>>>> broken for stateful devices.  The mentioned performance issue is likely
>>>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>>>>
>>>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>>>>> final configuration that would happen upon a DRIVER_OK once the first vring
>>>>> is started (i.e. receives a kick).  That has the added benefit of being
>>>>> asynchronous because it doesn’t block any vhost-user messages (which are
>>>>> synchronous, and thus block downtime).
>>>>>
>>>>> Hanna
>>>> For better or worse kick is per ring. It's out of spec to start rings
>>>> that were not kicked but I guess you could do configuration ...
>>>> Seems somewhat asymmetrical though.
>>> I meant to take the first ring being started as the signal to do the
>>> global configuration, i.e. not do this once per vring, but once
>>> globally.
>>>
>>>> Let's wait until next week, hopefully Yajun Wu will answer.
>>> I mean, personally I don’t really care about the whole SET_STATUS
>>> thing.  It’s clear that it’s broken for stateful devices.  The fact
>>> that it took until 6f8be29ec17d to fix it for just any device that
>>> would implement it according to spec to me is a strong indication that
>>> nobody does implement it according to spec, and is currently only used
>>> to signal to some specific back-end that all rings have been set up
>>> and should be configured in a single block.
>> I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT
>> extensions where everything is off-loaded to the vhost-user backend.
>
> How do these back-ends work with the fact that qemu uses SET_STATUS
> incorrectly when not offloading?  Do you plan on fixing that?

Mainly by having a common base implementation which does it right, and
having very lightweight derivations for the legacy stubs that use it. The
aim is to eliminate the need for QEMU stubs entirely by fully specifying
the device from the vhost-user API.

> (I.e. that we send SET_STATUS 0 when the VM is paused, potentially
> resetting state that is not recoverable, and that we set DRIVER and
> DRIVER_OK simultaneously.)

This is QEMU simulating a SET_STATUS rather than the guest triggering
it?

>
> Hanna


-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: [Virtio-fs] (no subject)
  2023-10-06 10:34                 ` Michael S. Tsirkin
  2023-10-06 11:42                   ` Hanna Czenczek
@ 2023-10-07  2:22                   ` Yajun Wu
  2023-10-09  8:21                     ` Hanna Czenczek
  2023-10-09 10:28                     ` German Maglione
  1 sibling, 2 replies; 53+ messages in thread
From: Yajun Wu @ 2023-10-07  2:22 UTC (permalink / raw)
  To: Michael S. Tsirkin, Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, parav,
	maxime.coquelin


On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:
> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>>>>>
>>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>>>>>> means the same thing as RESET_DEVICE).
>>>>>>>>>
>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>>>>>> forward the guest-set status byte, but instead just makes it up
>>>>>>>>> internally, and actually completely ignores what the back-end returns,
>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>>>>>> to see whether the flag is still set, which is the only way in which
>>>>>>>>> dpdk uses the status byte.
>>>>>>>>>
>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>>>>>> field in a useful manner, and it also provides no practical use over
>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>>>>>> defined.  Deprecate it.
>>>>>>>>>
>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>> ---
>>>>>>>>>      docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>>>>>      1 file changed, 21 insertions(+), 7 deletions(-)
>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>>>>>> The fact current backends never check errors does not mean they never
>>>>>>> will. So no, not applying this.
>>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
>>>>>> order:
>>>>>>
>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>>>>>> present
>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>> 4. SET_FEATURES with need_reply
>>>>>>
>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
>>>>>> documenting that a particular use of the status byte is to be ignored would
>>>>>> be really strange.
>>>>>>
>>>>>> Hanna
>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
>>>>> vhost-user reconfigures the state fully on start.
>>>> Not the internal device state, though.  virtiofsd has internal state, and
>>>> other devices like vhost-gpu back-ends would probably, too.
>>>>
>>>> Stefan has recently sent a series
>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
>>>> reset).
>>>>
>>>> I really don’t like our current approach with the status byte. Following the
>>>> virtio specification to me would mean that the guest directly controls this
>>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
>>>> when the guest really doesn’t want a device reset.
>>>>
>>>> That means that qemu does not treat this as a virtio device field (because
>>>> that would mean exposing it to the guest driver), but instead treats it as
>>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
>>>> a virtio-defined feature for communication on the vhost level, i.e. between
>>>> front-end and back-end, and not between guest driver and device.  I think
>>>> all vhost-level protocol features should be fully defined in the vhost-user
>>>> specification, which REPLY_ACK is.
>>> Hmm that makes sense. Maybe we should have done what stefan's patch
>>> is doing.
>>>
>>> Do look at the original commit that introduced it to understand why
>>> it was added.
>> I don’t understand why this was added to the stop/cont code, though.  If it
>> is time consuming to make these changes, why are they done every time the VM
>> is paused and resumed?  It makes sense that this would be done for the initial
>> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
>>
>> (To be clear, a reset in the stop/cont code is wrong, because it breaks
>> stateful devices.)
>>
>> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
>> originally introduced was wrong even for non-stateful devices, because it
>> occurred before we fetched the state (vring indices) so we could restore it
>> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
>> for testing implemented SET_STATUS 0 as a reset, it could not have survived
>> either migration or a stop/cont in general, because the vring indices would
>> have been reset to 0.
>>
>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>> devices that would implement them as per virtio spec, and even today it’s
>> broken for stateful devices.  The mentioned performance issue is likely
>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>
>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>> final configuration that would happen upon a DRIVER_OK once the first vring
>> is started (i.e. receives a kick).  That has the added benefit of being
>> asynchronous because it doesn’t block any vhost-user messages (which are
>> synchronous, and thus block downtime).
>>
>> Hanna
>
> For better or worse kick is per ring. It's out of spec to start rings
> that were not kicked but I guess you could do configuration ...
> Seems somewhat asymmetrical though.
>
> Let's wait until next week, hopefully Yajun Wu will answer.
The main motivation for adding VHOST_USER_SET_STATUS is to let the DPDK
back-end know when the DRIVER_OK bit is set.  It indicates that all VQ
configuration has been sent; otherwise DPDK has to rely on the first queue
pair being ready, and then receive and apply the VQ configuration one by one.

During live migration, configuring the VQs one by one is very time consuming.
For VIRTIO net vDPA, the HW needs to know how many VQs are enabled in order
to set up RSS (Receive-Side Scaling).

If you don’t want the SET_STATUS message, the back-end can remove the protocol
feature bit VHOST_USER_PROTOCOL_F_STATUS.
DPDK ignores SET_STATUS 0, and instead uses GET_VRING_BASE to do the device
close/reset.

I'm not involved in the discussion about adding SET_STATUS to the vhost
protocol.  This feature is essential for vDPA (in the same way that vhost-vdpa
implements VHOST_VDPA_SET_STATUS).
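
To illustrate the batching described above, here is a minimal sketch in C
of a back-end that only records per-VQ configuration as it arrives, and
applies everything in a single step once SET_STATUS carries DRIVER_OK.  All
names in it (struct backend, backend_set_vring_enabled(),
backend_set_status(), MAX_VQS) are hypothetical and are not DPDK or QEMU
API; only the DRIVER_OK bit value (4) is taken from the virtio specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_STATUS_DRIVER_OK 4u   /* value from the virtio spec */
#define MAX_VQS 64                   /* hypothetical limit for this sketch */

/* Hypothetical back-end state; not a DPDK or QEMU structure. */
struct backend {
    bool vq_enabled[MAX_VQS];  /* recorded, but not yet applied */
    unsigned applied_vqs;      /* number of VQs programmed into the "HW" */
    unsigned apply_calls;      /* how often the expensive step ran */
    uint8_t status;
};

/* Record one VQ's state; deliberately cheap, nothing is applied here. */
static void backend_set_vring_enabled(struct backend *b, unsigned vq, bool on)
{
    if (vq < MAX_VQS) {
        b->vq_enabled[vq] = on;
    }
}

/* Apply the whole configuration in one block (e.g. to size RSS). */
static void backend_apply_all(struct backend *b)
{
    unsigned n = 0;
    for (unsigned i = 0; i < MAX_VQS; i++) {
        n += b->vq_enabled[i] ? 1 : 0;
    }
    b->applied_vqs = n;
    b->apply_calls++;
}

/* On the DRIVER_OK 0 -> 1 transition, all VQ configuration has been sent. */
static void backend_set_status(struct backend *b, uint8_t status)
{
    bool was_ok = b->status & VIRTIO_STATUS_DRIVER_OK;
    bool now_ok = status & VIRTIO_STATUS_DRIVER_OK;

    b->status = status;
    if (!was_ok && now_ok) {
        backend_apply_all(b);
    }
}
```

With this shape, enabling eight VQs and then sending DRIVER_OK runs the
expensive step once with the full VQ count known, instead of once per VQ.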

Thanks,
Yajun
>
>>>> Now, we could hand full control of the status byte to the guest, and that
>>>> would make me content.  But I feel like that doesn’t really work, because
>>>> qemu needs to intercept the status byte anyway (it needs to know when there
>>>> is a reset, probably wants to know when the device is configured, etc.), so
>>>> I don’t think having the status byte in vhost-user really gains us much when
>>>> qemu could translate status byte changes to/from other vhost-user commands.
>>>>
>>>> Hanna
>>> well it intercepts it but I think it could pass it on unchanged.
>>>
>>>
>>>>> I guess symmetry was the
>>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
>>>>>
>>>>>
>>>>> SET_STATUS was introduced by:
>>>>>
>>>>> commit 923b8921d210763359e96246a58658ac0db6c645
>>>>> Author: Yajun Wu <yajunw@nvidia.com>
>>>>> Date:   Mon Oct 17 14:44:52 2022 +0800
>>>>>
>>>>>        vhost-user: Support vhost_dev_start
>>>>>
>>>>> CC the author.
>>>>>



* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  2023-10-06 13:55         ` Hanna Czenczek
  2023-10-06 13:58           ` Hanna Czenczek
@ 2023-10-07 21:27           ` Michael S. Tsirkin
  1 sibling, 0 replies; 53+ messages in thread
From: Michael S. Tsirkin @ 2023-10-07 21:27 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: virtio-fs, Eugenio Pérez, Anton Kuchin, qemu-devel, Stefan Hajnoczi

On Fri, Oct 06, 2023 at 03:55:56PM +0200, Hanna Czenczek wrote:
> On 06.10.23 10:49, Michael S. Tsirkin wrote:
> > On Fri, Oct 06, 2023 at 09:53:53AM +0200, Hanna Czenczek wrote:
> > > On 05.10.23 19:38, Stefan Hajnoczi wrote:
> > > > On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote:
> > > > > GET_VRING_BASE does not mention that it stops the respective ring.  Fix
> > > > > that.
> > > > > 
> > > > > Furthermore, it is not fully clear what the "base offset" these
> > > > > commands' documentation refers to is; an offset could be many things.
> > > > > Be more precise and verbose about it, especially given that these
> > > > > commands use different payload structures depending on whether the vring
> > > > > is split or packed.
> > > > > 
> > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > ---
> > > > >    docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++---
> > > > >    1 file changed, 62 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > > > > index 2f68e67a1a..50f5acebe5 100644
> > > > > --- a/docs/interop/vhost-user.rst
> > > > > +++ b/docs/interop/vhost-user.rst
> > > > > @@ -108,6 +108,37 @@ A vring state description
> > > > >    :num: a 32-bit number
> > > > > +A vring descriptor index for split virtqueues
> > > > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > +
> > > > > ++-------------+---------------------+
> > > > > +| vring index | index in avail ring |
> > > > > ++-------------+---------------------+
> > > > > +
> > > > > +:vring index: 32-bit index of the respective virtqueue
> > > > > +
> > > > > +:index in avail ring: 32-bit value, of which currently only the lower 16
> > > > > +  bits are used:
> > > > > +
> > > > > +  - Bits 0–15: Next descriptor index in the *Available Ring*
> > > > I think we need to say more to make this implementable just by reading
> > > > the spec:
> > > > 
> > > >     Index of the next *Available Ring* descriptor that the back-end will
> > > >     process. This is a free-running index that is not wrapped by the ring
> > > >     size.
> > > Sure, thanks.
> > > 
> > > > Feel free to rephrase.
> > > > 
> > > > > +  - Bits 16–31: Reserved (set to zero)
> > > > > +
> > > > > +Vring descriptor indices for packed virtqueues
> > > > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > +
> > > > > ++-------------+--------------------+
> > > > > +| vring index | descriptor indices |
> > > > > ++-------------+--------------------+
> > > > > +
> > > > > +:vring index: 32-bit index of the respective virtqueue
> > > > > +
> > > > > +:descriptor indices: 32-bit value:
> > > > > +
> > > > > +  - Bits 0–14: Index in the *Available Ring*
> > > > Same here.
> > > > 
> > > > > +  - Bit 15: Driver (Available) Ring Wrap Counter
> > > > > +  - Bits 16–30: Index in the *Used Ring*
> > > > Same here.
> > > > 
> > > > > +  - Bit 31: Device (Used) Ring Wrap Counter
> > > > > +
> > > > >    A vring address description
> > > > >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > @@ -1031,18 +1062,45 @@ Front-end message types
> > > > >    ``VHOST_USER_SET_VRING_BASE``
> > > > >      :id: 10
> > > > >      :equivalent ioctl: ``VHOST_SET_VRING_BASE``
> > > > > -  :request payload: vring state description
> > > > > +  :request payload: vring descriptor index/indices
> > > > >      :reply payload: N/A
> > > > > -  Sets the base offset in the available vring.
> > > > > +  Sets the next index to use for descriptors in this vring:
> > > > > +
> > > > > +  * For a split virtqueue, sets only the next descriptor index in the
> > > > > +    *Available Ring*.  The device is supposed to read the next index in
> > > > > +    the *Used Ring* from the respective vring structure in guest memory.
> > > > > +
> > > > > +  * For a packed virtqueue, both indices are supplied, as they are not
> > > > > +    explicitly available in memory.
> > > > > +
> > > > > +  Consequently, the payload type is specific to the type of virt queue
> > > > > +  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
> > > > > +  indices for packed virtqueues*).
> > > > >    ``VHOST_USER_GET_VRING_BASE``
> > > > >      :id: 11
> > > > >      :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE``
> > > > >      :request payload: vring state description
> > > > > -  :reply payload: vring state description
> > > > > +  :reply payload: vring descriptor index/indices
> > > > > +
> > > > > +  Stops the vring and returns the current descriptor index or indices:
> > > > > +
> > > > > +    * For a split virtqueue, returns only the 16-bit next descriptor
> > > > > +      index in the *Available Ring*.  The index in the *Used Ring* is
> > > > > +      controlled by the guest driver and can be read from the vring
> > > > I find "is controlled by the guest driver" confusing. The device writes
> > > > the Used Ring index. The driver only reads it. The device is the active
> > > > party here.
> > > Er, good point.  That breaks the whole reasoning.  Then I don’t understand
> > > why we do get/set the available ring index and not the used ring index.  Do
> > > you know why?
> > It's simple. used ring index in memory is controlled by the device and
> > reflects device state.
> 
> Exactly, it’s device state, that’s why I thought the front-end needs to
> ensure its read and restored around the reset we currently have in
> vhost_dev_stop()/start().
> 
> > device can just read it back to restore.
> 
> I find it strange that the device is supposed to read its own state from
> memory.

/me shrugs. It puts it there, why not read it back. Duplicating state
is not usually a good idea - leads to bugs.

> > available ring index in memory is controlled by driver and does
> > not reflect device state.
> 
> Why can’t the device read the available index from memory?  That value is
> put into memory by the driver precisely so the device can read it from
> there.
> 
> Hanna

Consider the example of an RX ring for a net device: buffers might be
available, but the device does not use them until packets arrive.  What I
think you could say is that actually just the used index should be
sufficient.  So I think the main thing GET_BASE does is stop the ring.  As
for the value returned, we can, if we want to, validate that it matches the
used ring index.
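
For reference, the payload layouts from the patch quoted above can be
sketched in C as follows.  The bit positions are the ones given in the
patch; the helper names (split_base_avail_idx(), packed_base_pack(),
packed_base_unpack(), struct packed_base) are made up for this illustration
and are not part of QEMU or of the vhost-user specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Split virtqueue: bits 0-15 carry the next avail index, 16-31 reserved. */
static uint16_t split_base_avail_idx(uint32_t num)
{
    return (uint16_t)(num & 0xffffu);
}

/* Hypothetical unpacked form of the packed-virtqueue value. */
struct packed_base {
    uint16_t avail_idx;   /* bits 0-14  */
    bool avail_wrap;      /* bit 15     */
    uint16_t used_idx;    /* bits 16-30 */
    bool used_wrap;       /* bit 31     */
};

/* Combine both indices and both wrap counters into one 32-bit value. */
static uint32_t packed_base_pack(struct packed_base b)
{
    return ((uint32_t)(b.avail_idx & 0x7fffu))
         | ((uint32_t)b.avail_wrap << 15)
         | ((uint32_t)(b.used_idx & 0x7fffu) << 16)
         | ((uint32_t)b.used_wrap << 31);
}

/* Inverse of packed_base_pack(). */
static struct packed_base packed_base_unpack(uint32_t num)
{
    struct packed_base b = {
        .avail_idx = (uint16_t)(num & 0x7fffu),
        .avail_wrap = (num >> 15) & 1u,
        .used_idx = (uint16_t)((num >> 16) & 0x7fffu),
        .used_wrap = (num >> 31) & 1u,
    };
    return b;
}
```

For example, an available index of 5 with wrap counter 1 and a used index
of 3 with wrap counter 0 pack to 0x00038005.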

-- 
MST



* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  2023-10-06 13:58           ` Hanna Czenczek
@ 2023-10-07 21:29             ` Michael S. Tsirkin
  0 siblings, 0 replies; 53+ messages in thread
From: Michael S. Tsirkin @ 2023-10-07 21:29 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: virtio-fs, Eugenio Pérez, Anton Kuchin, qemu-devel, Stefan Hajnoczi

On Fri, Oct 06, 2023 at 03:58:44PM +0200, Hanna Czenczek wrote:
> On 06.10.23 15:55, Hanna Czenczek wrote:
> > On 06.10.23 10:49, Michael S. Tsirkin wrote:
> > > On Fri, Oct 06, 2023 at 09:53:53AM +0200, Hanna Czenczek wrote:
> > > > On 05.10.23 19:38, Stefan Hajnoczi wrote:
> > > > > On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote:
> 
> [...]
> 
> > > > >    ``VHOST_USER_GET_VRING_BASE``
> > > > >      :id: 11
> > > > >      :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE``
> > > > >      :request payload: vring state description
> > > > > -  :reply payload: vring state description
> > > > > +  :reply payload: vring descriptor index/indices
> > > > > +
> > > > > +  Stops the vring and returns the current descriptor index
> > > > > or indices:
> > > > > +
> > > > > +    * For a split virtqueue, returns only the 16-bit next descriptor
> > > > > +      index in the *Available Ring*.  The index in the *Used Ring* is
> > > > > +      controlled by the guest driver and can be read from the vring
> > > > > I find "is controlled by the guest driver" confusing. The
> > > > > device writes
> > > > > the Used Ring index. The driver only reads it. The device is
> > > > > the active
> > > > > party here.
> > > > Er, good point.  That breaks the whole reasoning.  Then I don’t
> > > > understand
> > > > why we do get/set the available ring index and not the used ring
> > > > index.  Do
> > > > you know why?
> > > It's simple. used ring index in memory is controlled by the device and
> > > reflects device state.
> > 
> > Exactly, it’s device state, that’s why I thought the front-end needs to
> > ensure its read and restored around the reset we currently have in
> > vhost_dev_stop()/start().
> > 
> > > device can just read it back to restore.
> > 
> > I find it strange that the device is supposed to read its own state from
> > memory.
> > 
> > > available ring index in memory is controlled by driver and does
> > > not reflect device state.
> > 
> > Why can’t the device read the available index from memory?  That value
> > is put into memory by the driver precisely so the device can read it
> > from there.
> 
> Ah, wait, is the idea that the device may have an internal available index
> counter that reflects what descriptor it has already fetched? I.e. this
> index will lag behind the one in memory, and the difference are new
> descriptors that the device still needs to read? If that internal counter is
> the index that’s get/set here, then yes, that makes a lot of sense.
> 
> Hanna

Exactly. And this eventually gets written out as the used index.
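
A small C sketch of that idea for a split virtqueue, under the assumption
that the device keeps a private free-running 16-bit counter (called
last_avail_idx here; the struct and function names are invented for
illustration and are not taken from any implementation).  The driver's
index in memory runs ahead of that counter, and the difference is what the
device still has to fetch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-side view of one split virtqueue. */
struct split_vq_state {
    uint16_t avail_idx_mem;   /* avail->idx as written by the driver */
    uint16_t used_idx_mem;    /* used->idx as written by the device */
    uint16_t last_avail_idx;  /* internal: next avail entry to fetch */
};

/* Entries made available by the driver that the device has not fetched.
 * The indices are free-running, so 16-bit wrap-around is intended. */
static uint16_t vq_unfetched(const struct split_vq_state *vq)
{
    return (uint16_t)(vq->avail_idx_mem - vq->last_avail_idx);
}

/* Requests fetched but not yet completed (not yet in the used ring). */
static uint16_t vq_in_flight(const struct split_vq_state *vq)
{
    return (uint16_t)(vq->last_avail_idx - vq->used_idx_mem);
}

/* Once the ring is stopped and drained, the internal index has been
 * written out as the used index, so the two can be cross-checked. */
static bool vq_base_matches_used(const struct split_vq_state *vq)
{
    return vq_in_flight(vq) == 0;
}
```

In this picture, last_avail_idx is the value that [GS]ET_VRING_BASE would
carry, and the final check corresponds to validating the returned value
against the used ring index.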

-- 
MST



* Re: [Virtio-fs] (no subject)
  2023-10-06 20:49                         ` Alex Bennée
@ 2023-10-09  8:07                           ` Hanna Czenczek
  0 siblings, 0 replies; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-09  8:07 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Michael S. Tsirkin, qemu-devel, virtio-fs, Eugenio Pérez,
	Anton Kuchin, Yajun Wu

On 06.10.23 22:49, Alex Bennée wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
>> On 06.10.23 17:17, Alex Bennée wrote:
>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>
>>>> On 06.10.23 12:34, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>>>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>> <snip>
>>>>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>>>>>> devices that would implement them as per virtio spec, and even today it’s
>>>>>> broken for stateful devices.  The mentioned performance issue is likely
>>>>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>>>>>
>>>>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>>>>>> final configuration that would happen upon a DRIVER_OK once the first vring
>>>>>> is started (i.e. receives a kick).  That has the added benefit of being
>>>>>> asynchronous because it doesn’t block any vhost-user messages (which are
>>>>>> synchronous, and thus block downtime).
>>>>>>
>>>>>> Hanna
>>>>> For better or worse kick is per ring. It's out of spec to start rings
>>>>> that were not kicked but I guess you could do configuration ...
>>>>> Seems somewhat asymmetrical though.
>>>> I meant to take the first ring being started as the signal to do the
>>>> global configuration, i.e. not do this once per vring, but once
>>>> globally.
>>>>
>>>>> Let's wait until next week, hopefully Yajun Wu will answer.
>>>> I mean, personally I don’t really care about the whole SET_STATUS
>>>> thing.  It’s clear that it’s broken for stateful devices.  The fact
>>>> that it took until 6f8be29ec17d to fix it for just any device that
>>>> would implement it according to spec to me is a strong indication that
>>>> nobody does implement it according to spec, and is currently only used
>>>> to signal to some specific back-end that all rings have been set up
>>>> and should be configured in a single block.
>>> I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT
>>> extensions where everything is off-loaded to the vhost-user backend.
>> How do these back-ends work with the fact that qemu uses SET_STATUS
>> incorrectly when not offloading?  Do you plan on fixing that?
> Mainly having a common base implementation which does it right and
> having very lightweight derivations for legacy stubs using it. The
> aim is to eliminate the need for QEMU stubs entirely by fully specifying
> the device from the vhost-user API.

If the current SET_STATUS use is overhauled, too, that would be good.  I 
wonder why you need the status byte, though.

>> (I.e. that we send SET_STATUS 0 when the VM is paused, potentially
>> resetting state that is not recoverable, and that we set DRIVER and
>> DRIVER_OK simultaneously.)
> This is QEMU simulating a SET_STATUS rather than the guest triggering
> it?

Yes, and the fact that we simulate it when the guest will not have 
triggered it, i.e. we reset the device (SET_STATUS 0) when the VM is 
paused.  Effectively, qemu injects virtio commands that the guest has 
never requested, which generally feels like a bad idea, because qemu 
will need to get the device back to its previous state before the guest 
is resumed, which may or may not work.  Specifically, it won’t work for 
devices that have internal state.

Furthermore, we use SET_STATUS to set ACKNOWLEDGE | DRIVER | DRIVER_OK 
simultaneously, which is wrong.  ACKNOWLEDGE | DRIVER may perhaps be set 
simultaneously, but then comes feature negotiation (setting and checking 
FEATURES_OK), and then DRIVER_OK.
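
The ordering described here can be written down as a small check in C.
This is only a sketch of the sequence as laid out in this mail, and it
assumes a non-legacy device (i.e. one that uses FEATURES_OK); the function
name status_step_ok() is hypothetical, while the bit values are those from
the virtio specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device status bits as defined by the virtio specification. */
#define VIRTIO_STATUS_ACKNOWLEDGE 1u
#define VIRTIO_STATUS_DRIVER      2u
#define VIRTIO_STATUS_DRIVER_OK   4u
#define VIRTIO_STATUS_FEATURES_OK 8u

/* Returns true if going from `old` to `new_status` respects the ordering
 * ACKNOWLEDGE/DRIVER, then FEATURES_OK, then DRIVER_OK, with the latter
 * two each in a separate step.  Writing 0 (reset) is always accepted. */
static bool status_step_ok(uint8_t old, uint8_t new_status)
{
    uint8_t added = new_status & (uint8_t)~old;

    if (new_status == 0) {
        return true;                   /* device reset */
    }
    if ((old & ~new_status) != 0) {
        return false;                  /* bits are only cleared by a reset */
    }
    if ((added & VIRTIO_STATUS_DRIVER) &&
        !(new_status & VIRTIO_STATUS_ACKNOWLEDGE)) {
        return false;                  /* DRIVER needs ACKNOWLEDGE */
    }
    if ((added & VIRTIO_STATUS_FEATURES_OK) &&
        !(old & VIRTIO_STATUS_DRIVER)) {
        return false;                  /* negotiate only after DRIVER */
    }
    if ((added & VIRTIO_STATUS_DRIVER_OK) &&
        !(old & VIRTIO_STATUS_FEATURES_OK)) {
        return false;                  /* FEATURES_OK must come first */
    }
    return true;
}
```

Under these rules, going from 0 straight to ACKNOWLEDGE | DRIVER | DRIVER_OK
is rejected, whereas 0 to ACKNOWLEDGE | DRIVER, then adding FEATURES_OK, then
adding DRIVER_OK is accepted.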

Finally, how the status byte is to be used is not noted in the 
vhost-user specification, which instead points to the virtio 
specification.  I think if we keep SET_STATUS, it must be documented how 
it interacts with other vhost-user commands.  For example, how the 
FEATURES_OK protocol described in the virtio specification interacts 
with GET_FEATURES/SET_FEATURES, or whether SET_STATUS 0 and RESET_DEVICE 
are equivalent.  Currently, the only implementation of SET_STATUS I know 
(DPDK) ignores SET_STATUS 0, i.e. doesn’t do a reset.  To me that 
indicates that the spec must be clear on what these status values mean
with regard to the vhost-user protocol as a whole.

So every software implementation with STATUS support that I know 
implements SET_STATUS wrongly right now, and that’s a problem, because 
it prevents implementations like virtiofsd from doing so correctly.

Hanna



* Re: [Virtio-fs] (no subject)
  2023-10-07  2:22                   ` Yajun Wu
@ 2023-10-09  8:21                     ` Hanna Czenczek
  2023-10-09  9:07                       ` Hanna Czenczek
  2023-10-09 10:28                     ` German Maglione
  1 sibling, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-09  8:21 UTC (permalink / raw)
  To: Yajun Wu, Michael S. Tsirkin
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, parav,
	maxime.coquelin

On 07.10.23 04:22, Yajun Wu wrote:
>
> On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:
>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>>>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>>>>>>
>>>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>>>>>>> means the same thing as RESET_DEVICE).
>>>>>>>>>>
>>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>>>>>>> forward the guest-set status byte, but instead just makes it up
>>>>>>>>>> internally, and actually completely ignores what the back-end returns,
>>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>>>>>>> to see whether the flag is still set, which is the only way in which
>>>>>>>>>> dpdk uses the status byte.
>>>>>>>>>>
>>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>>>>>>> field in a useful manner, and it also provides no practical use over
>>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>>>>>>> defined.  Deprecate it.
>>>>>>>>>>
>>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>      docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>>>>>>      1 file changed, 21 insertions(+), 7 deletions(-)
>>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>>>>>>> The fact current backends never check errors does not mean they never
>>>>>>>> will. So no, not applying this.
>>>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
>>>>>>> order:
>>>>>>>
>>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>>>>>>> present
>>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>>> 4. SET_FEATURES with need_reply
>>>>>>>
>>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>>>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
>>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>>>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
>>>>>>> documenting that a particular use of the status byte is to be ignored would
>>>>>>> be really strange.
>>>>>>>
>>>>>>> Hanna
>>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
>>>>>> vhost-user reconfigures the state fully on start.
>>>>> Not the internal device state, though.  virtiofsd has internal state, and
>>>>> other devices like vhost-gpu back-ends would probably, too.
>>>>>
>>>>> Stefan has recently sent a series
>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
>>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
>>>>> reset).
>>>>>
>>>>> I really don’t like our current approach with the status byte. Following the
>>>>> virtio specification to me would mean that the guest directly controls this
>>>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
>>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
>>>>> when the guest really doesn’t want a device reset.
>>>>>
>>>>> That means that qemu does not treat this as a virtio device field (because
>>>>> that would mean exposing it to the guest driver), but instead treats it as
>>>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
>>>>> a virtio-defined feature for communication on the vhost level, i.e. between
>>>>> front-end and back-end, and not between guest driver and device.  I think
>>>>> all vhost-level protocol features should be fully defined in the vhost-user
>>>>> specification, which REPLY_ACK is.
>>>> Hmm that makes sense. Maybe we should have done what stefan's patch
>>>> is doing.
>>>>
>>>> Do look at the original commit that introduced it to understand why
>>>> it was added.
>>> I don’t understand why this was added to the stop/cont code, 
>>> though.  If it
>>> is time consuming to make these changes, why are they done every 
>>> time the VM
>>> is paused
>>> and resumed?  It makes sense that this would be done for the initial
>>> configuration (where a reset also wouldn’t hurt), but here it seems 
>>> wrong.
>>>
>>> (To be clear, a reset in the stop/cont code is wrong, because it breaks
>>> stateful devices.)
>>>
>>> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
>>> originally introduced was wrong even for non-stateful devices, 
>>> because it
>>> occurred before we fetched the state (vring indices) so we could 
>>> restore it
>>> later.  I don’t know how 923b8921d21 was tested, but if the back-end 
>>> used
>>> for testing implemented SET_STATUS 0 as a reset, it could not have 
>>> survived
>>> either migration or a stop/cont in general, because the vring 
>>> indices would
>>> have been reset to 0.
>>>
>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that 
>>> broke all
>>> devices that would implement them as per virtio spec, and even today 
>>> it’s
>>> broken for stateful devices.  The mentioned performance issue is likely
>>> real, but we can’t address it by making up SET_STATUS calls that are 
>>> wrong.
>>>
>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would 
>>> do all
>>> final configuration that would happen upon a DRIVER_OK once the 
>>> first vring
>>> is started (i.e. receives a kick).  That has the added benefit of being
>>> asynchronous because it doesn’t block any vhost-user messages (which 
>>> are
>>> synchronous, and thus block downtime).
>>>
>>> Hanna
>>
>> For better or worse kick is per ring. It's out of spec to start rings
>> that were not kicked but I guess you could do configuration ...
>> Seems somewhat asymmetrical though.
>>
>> Let's wait until next week, hopefully Yajun Wu will answer.
> The main motivation of adding VHOST_USER_SET_STATUS is to let backend 
> DPDK know
> when DRIVER_OK bit is valid. It's an indication of all VQ 
> configuration has sent,
> otherwise DPDK has to rely on first queue pair is ready, then 
> receiving/applying
> VQ configuration one by one.
>
> During live migration, configuring VQ one by one is very time consuming.

One question I have here is why it wasn’t then introduced in the live 
migration code, but in the general VM stop/cont code instead. It does 
seem time-consuming to do this every time the VM is paused and resumed.

> For VIRTIO
> net vDPA, HW needs to know how many VQs are enabled to set 
> RSS(Receive-Side Scaling).
>
> If you don’t want SET_STATUS message, backend can remove protocol 
> feature bit
> VHOST_USER_PROTOCOL_F_STATUS.

The problem isn’t back-ends that don’t want the message, the problem is 
that qemu uses the message wrongly, which prevents well-behaving 
back-ends from implementing the message.

> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device 
> close/reset.

So the right thing to do for back-ends is to announce STATUS support and 
then not implement it correctly?

GET_VRING_BASE should not close or reset the device, by the way.  It 
should stop that one vring, not more.  We have a RESET_DEVICE command 
for resetting.
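
To make that distinction concrete, here is a minimal sketch in C.  It is 
not taken from DPDK, qemu, or any real back-end; the struct layout, the 
function names, and NUM_VRINGS are all made up for illustration.  It only 
shows the semantics I would expect: GET_VRING_BASE stops one vring and 
reports its index, while only RESET_DEVICE discards device state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_VRINGS 2 /* hypothetical queue count for this sketch */

struct vring_state {
    bool started;
    uint16_t last_avail_idx; /* the "base" reported by GET_VRING_BASE */
};

struct backend {
    struct vring_state vrings[NUM_VRINGS];
    uint64_t internal_state; /* stands in for state like virtiofsd's */
};

/* GET_VRING_BASE: stop exactly one vring and return its index.
 * Other vrings and the internal device state are left untouched. */
uint16_t backend_get_vring_base(struct backend *b, unsigned int idx)
{
    assert(idx < NUM_VRINGS);
    b->vrings[idx].started = false;
    return b->vrings[idx].last_avail_idx;
}

/* RESET_DEVICE: the only handler in this sketch that throws away
 * everything, vring indices and internal state alike. */
void backend_reset_device(struct backend *b)
{
    memset(b, 0, sizeof(*b));
}
```

With handlers like these, a stop/cont cycle that only issues 
GET_VRING_BASE per vring keeps both the vring indices and the internal 
state, which is what stateful devices need.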

> I'm not involved in discussion about adding SET_STATUS in Vhost 
> protocol. This feature
> is essential for vDPA(same as vhost-vdpa implements 
> VHOST_VDPA_SET_STATUS).

So what I gather from your response is that there is only a single 
use for SET_STATUS, which is the DRIVER_OK bit.  If so, documenting that 
all other bits are to be ignored by both back-end and front-end would be 
fine by me.

I’m not fully serious about that suggestion, but I hear the strong 
implication that nothing but DRIVER_OK was of any concern, and this is 
really important to note when we talk about the status of the STATUS 
feature in vhost today.  It seems to me now that it was not intended to 
be the virtio-level status byte, but just a DRIVER_OK signalling path 
from front-end to back-end.  That makes it a vhost-level protocol 
feature to me.
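
As a rough sketch of what "just a DRIVER_OK signalling path" would mean 
for a back-end (again hypothetical C, not any existing implementation; 
only the status bit values come from the virtio specification, the 
struct and function names are invented here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Device status bits as defined by the virtio specification. */
#define VIRTIO_CONFIG_S_ACKNOWLEDGE 0x01
#define VIRTIO_CONFIG_S_DRIVER      0x02
#define VIRTIO_CONFIG_S_DRIVER_OK   0x04
#define VIRTIO_CONFIG_S_FEATURES_OK 0x08

struct status_backend {
    bool configured;          /* final configuration has been applied */
    unsigned int reset_count; /* stays 0: SET_STATUS never resets here */
};

/* SET_STATUS handler that treats the message purely as a DRIVER_OK
 * signal: apply the final configuration once, ignore everything else. */
void backend_set_status(struct status_backend *b, uint8_t status)
{
    assert(b != NULL);
    if ((status & VIRTIO_CONFIG_S_DRIVER_OK) && !b->configured) {
        b->configured = true; /* e.g. apply all VQ configuration at once */
    }
    /* All other bits, and status == 0 in particular, are deliberately
     * ignored; a reset would have to come through RESET_DEVICE. */
}
```

A handler like this would tolerate the SET_STATUS 0 that qemu sends on 
stop, but it is clearly not an implementation of the virtio status byte.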

Hanna

>
> Thanks,
> Yajun
>>
>>>>> Now, we could hand full control of the status byte to the guest, 
>>>>> and that
>>>>> would make me content.  But I feel like that doesn’t really work, 
>>>>> because
>>>>> qemu needs to intercept the status byte anyway (it needs to know 
>>>>> when there
>>>>> is a reset, probably wants to know when the device is configured, 
>>>>> etc.), so
>>>>> I don’t think having the status byte in vhost-user really gains us 
>>>>> much when
>>>>> qemu could translate status byte changes to/from other vhost-user 
>>>>> commands.
>>>>>
>>>>> Hanna
>>>> well it intercepts it but I think it could pass it on unchanged.
>>>>
>>>>
>>>>>> I guess symmetry was the
>>>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
>>>>>>
>>>>>>
>>>>>> SET_STATUS was introduced by:
>>>>>>
>>>>>> commit 923b8921d210763359e96246a58658ac0db6c645
>>>>>> Author: Yajun Wu <yajunw@nvidia.com>
>>>>>> Date:   Mon Oct 17 14:44:52 2022 +0800
>>>>>>
>>>>>>        vhost-user: Support vhost_dev_start
>>>>>>
>>>>>> CC the author.
>>>>>>
>



* Re: [Virtio-fs] (no subject)
  2023-10-09  8:21                     ` Hanna Czenczek
@ 2023-10-09  9:07                       ` Hanna Czenczek
  2023-10-09  9:13                         ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-09  9:07 UTC (permalink / raw)
  To: Yajun Wu, Michael S. Tsirkin
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, parav,
	maxime.coquelin, Alex Bennée

On 09.10.23 10:21, Hanna Czenczek wrote:
> On 07.10.23 04:22, Yajun Wu wrote:

[...]

>> The main motivation of adding VHOST_USER_SET_STATUS is to let backend 
>> DPDK know
>> when DRIVER_OK bit is valid. It's an indication of all VQ 
>> configuration has sent,
>> otherwise DPDK has to rely on first queue pair is ready, then 
>> receiving/applying
>> VQ configuration one by one.
>>
>> During live migration, configuring VQ one by one is very time consuming.
>
> One question I have here is why it wasn’t then introduced in the live 
> migration code, but in the general VM stop/cont code instead. It does 
> seem time-consuming to do this every time the VM is paused and resumed.
>
>> For VIRTIO
>> net vDPA, HW needs to know how many VQs are enabled to set 
>> RSS(Receive-Side Scaling).
>>
>> If you don’t want SET_STATUS message, backend can remove protocol 
>> feature bit
>> VHOST_USER_PROTOCOL_F_STATUS.
>
> The problem isn’t back-ends that don’t want the message, the problem 
> is that qemu uses the message wrongly, which prevents well-behaving 
> back-ends from implementing the message.
>
>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device 
>> close/reset.
>
> So the right thing to do for back-ends is to announce STATUS support 
> and then not implement it correctly?
>
> GET_VRING_BASE should not close or reset the device, by the way.  It 
> should stop that one vring, not more.  We have a RESET_DEVICE command 
> for resetting.
>
>> I'm not involved in discussion about adding SET_STATUS in Vhost 
>> protocol. This feature
>> is essential for vDPA(same as vhost-vdpa implements 
>> VHOST_VDPA_SET_STATUS).
>
> So what I gather from your response is that there is only a 
> single use for SET_STATUS, which is the DRIVER_OK bit.  If so, 
> documenting that all other bits are to be ignored by both back-end and 
> front-end would be fine by me.
>
> I’m not fully serious about that suggestion, but I hear the strong 
> implication that nothing but DRIVER_OK was of any concern, and this is 
> really important to note when we talk about the status of the STATUS 
> feature in vhost today.  It seems to me now that it was not intended 
> to be the virtio-level status byte, but just a DRIVER_OK signalling 
> path from front-end to back-end.  That makes it a vhost-level protocol 
> feature to me.

On second thought, it just is a pure vhost-level protocol feature, and 
has nothing to do with the virtio status byte as-is.  The only stated 
purpose is for the front-end to send DRIVER_OK after migration, but 
migration is transparent to the guest, so the guest would never change 
the status byte during migration.  Therefore, if this feature is 
essential, we will never be able to have a status byte that is 
transparently shared between guest and back-end device, i.e. the virtio 
status byte.

Cc-ing Alex on this mail, because to me, this seems like an important 
detail when he plans on using the byte in the future.  If we need a 
virtio status byte, I can’t see how we could use the existing F_STATUS 
for it.

Hanna



* Re: [Virtio-fs] (no subject)
  2023-10-09  9:07                       ` Hanna Czenczek
@ 2023-10-09  9:13                         ` Hanna Czenczek
  2023-10-10  4:00                           ` Yajun Wu
  0 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-09  9:13 UTC (permalink / raw)
  To: Yajun Wu, Michael S. Tsirkin
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, parav,
	maxime.coquelin, Alex Bennée

On 09.10.23 11:07, Hanna Czenczek wrote:
> On 09.10.23 10:21, Hanna Czenczek wrote:
>> On 07.10.23 04:22, Yajun Wu wrote:
>
> [...]
>
>>> The main motivation of adding VHOST_USER_SET_STATUS is to let 
>>> backend DPDK know
>>> when DRIVER_OK bit is valid. It's an indication of all VQ 
>>> configuration has sent,
>>> otherwise DPDK has to rely on first queue pair is ready, then 
>>> receiving/applying
>>> VQ configuration one by one.
>>>
>>> During live migration, configuring VQ one by one is very time 
>>> consuming.
>>
>> One question I have here is why it wasn’t then introduced in the live 
>> migration code, but in the general VM stop/cont code instead. It does 
>> seem time-consuming to do this every time the VM is paused and resumed.
>>
>>> For VIRTIO
>>> net vDPA, HW needs to know how many VQs are enabled to set 
>>> RSS(Receive-Side Scaling).
>>>
>>> If you don’t want SET_STATUS message, backend can remove protocol 
>>> feature bit
>>> VHOST_USER_PROTOCOL_F_STATUS.
>>
>> The problem isn’t back-ends that don’t want the message, the problem 
>> is that qemu uses the message wrongly, which prevents well-behaving 
>> back-ends from implementing the message.
>>
>>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device 
>>> close/reset.
>>
>> So the right thing to do for back-ends is to announce STATUS support 
>> and then not implement it correctly?
>>
>> GET_VRING_BASE should not close or reset the device, by the way.  It 
>> should stop that one vring, not more.  We have a RESET_DEVICE command 
>> for resetting.
>>
>>> I'm not involved in discussion about adding SET_STATUS in Vhost 
>>> protocol. This feature
>>> is essential for vDPA(same as vhost-vdpa implements 
>>> VHOST_VDPA_SET_STATUS).
>>
>> So what I gather from your response is that there is only a 
>> single use for SET_STATUS, which is the DRIVER_OK bit.  If so, 
>> documenting that all other bits are to be ignored by both back-end 
>> and front-end would be fine by me.
>>
>> I’m not fully serious about that suggestion, but I hear the strong 
>> implication that nothing but DRIVER_OK was of any concern, and this 
>> is really important to note when we talk about the status of the 
>> STATUS feature in vhost today.  It seems to me now that it was not 
>> intended to be the virtio-level status byte, but just a DRIVER_OK 
>> signalling path from front-end to back-end.  That makes it a 
>> vhost-level protocol feature to me.
>
> On second thought, it just is a pure vhost-level protocol feature, and 
> has nothing to do with the virtio status byte as-is.  The only stated 
> purpose is for the front-end to send DRIVER_OK after migration, but 
> migration is transparent to the guest, so the guest would never change 
> the status byte during migration.  Therefore, if this feature is 
> essential, we will never be able to have a status byte that is 
> transparently shared between guest and back-end device, i.e. the 
> virtio status byte.

On third thought, scratch that.  The guest wouldn’t set it, but 
naturally, after migration, the front-end will need to restore the 
status byte from the source, so the front-end will always need to set 
it, even if it were otherwise controlled only by the guest and the 
back-end device.  So technically, this doesn’t prevent such a use case.  
(In practice, it isn’t controlled by the guest right now, but that could 
be fixed.)

> Cc-ing Alex on this mail, because to me, this seems like an important 
> detail when he plans on using the byte in the future. If we need a 
> virtio status byte, I can’t see how we could use the existing F_STATUS 
> for it.
>
> Hanna



* Re: [Virtio-fs] (no subject)
  2023-10-07  2:22                   ` Yajun Wu
  2023-10-09  8:21                     ` Hanna Czenczek
@ 2023-10-09 10:28                     ` German Maglione
  2023-10-10  2:56                       ` Yajun Wu
  1 sibling, 1 reply; 53+ messages in thread
From: German Maglione @ 2023-10-09 10:28 UTC (permalink / raw)
  To: Yajun Wu
  Cc: Michael S. Tsirkin, Hanna Czenczek, qemu-devel, virtio-fs,
	Eugenio Pérez, maxime.coquelin, parav, Anton Kuchin

On Sat, Oct 7, 2023 at 4:23 AM Yajun Wu <yajunw@nvidia.com> wrote:
>
>
> On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:
> > On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
> >> On 06.10.23 11:26, Michael S. Tsirkin wrote:
> >>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
> >>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
> >>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
> >>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
> >>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> >>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> >>>>>>>>> There is no clearly defined purpose for the virtio status byte in
> >>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> >>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> >>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
> >>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> >>>>>>>>>
> >>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
> >>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
> >>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
> >>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> >>>>>>>>> means the same thing as RESET_DEVICE).
> >>>>>>>>>
> >>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
> >>>>>>>>> forward the guest-set status byte, but instead just makes it up
> >>>>>>>>> internally, and actually completely ignores what the back-end returns,
> >>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
> >>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> >>>>>>>>> to see whether the flag is still set, which is the only way in which
> >>>>>>>>> dpdk uses the status byte.
> >>>>>>>>>
> >>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
> >>>>>>>>> field in a useful manner, and it also provides no practical use over
> >>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
> >>>>>>>>> defined.  Deprecate it.
> >>>>>>>>>
> >>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>>>>>>> ---
> >>>>>>>>>      docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> >>>>>>>>>      1 file changed, 21 insertions(+), 7 deletions(-)
> >>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> >>>>>>> The fact current backends never check errors does not mean they never
> >>>>>>> will. So no, not applying this.
> >>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
> >>>>>> order:
> >>>>>>
> >>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
> >>>>>> present
> >>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
> >>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
> >>>>>> 4. SET_FEATURES with need_reply
> >>>>>>
> >>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
> >>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
> >>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
> >>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
> >>>>>> documenting that a particular use of the status byte is to be ignored would
> >>>>>> be really strange.
> >>>>>>
> >>>>>> Hanna
> >>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
> >>>>> vhost-user reconfigures the state fully on start.
> >>>> Not the internal device state, though.  virtiofsd has internal state, and
> >>>> other devices like vhost-gpu back-ends would probably, too.
> >>>>
> >>>> Stefan has recently sent a series
> >>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
> >>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
> >>>> reset).
> >>>>
> >>>> I really don’t like our current approach with the status byte. Following the
> >>>> virtio specification to me would mean that the guest directly controls this
> >>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
> >>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
> >>>> when the guest really doesn’t want a device reset.
> >>>>
> >>>> That means that qemu does not treat this as a virtio device field (because
> >>>> that would mean exposing it to the guest driver), but instead treats it as
> >>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
> >>>> a virtio-defined feature for communication on the vhost level, i.e. between
> >>>> front-end and back-end, and not between guest driver and device.  I think
> >>>> all vhost-level protocol features should be fully defined in the vhost-user
> >>>> specification, which REPLY_ACK is.
> >>> Hmm that makes sense. Maybe we should have done what stefan's patch
> >>> is doing.
> >>>
> >>> Do look at the original commit that introduced it to understand why
> >>> it was added.
> >> I don’t understand why this was added to the stop/cont code, though.  If it
> >> is time consuming to make these changes, why are they done every time the VM
> >> is paused
> >> and resumed?  It makes sense that this would be done for the initial
> >> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
> >>
> >> (To be clear, a reset in the stop/cont code is wrong, because it breaks
> >> stateful devices.)
> >>
> >> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
> >> originally introduced was wrong even for non-stateful devices, because it
> >> occurred before we fetched the state (vring indices) so we could restore it
> >> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
> >> for testing implemented SET_STATUS 0 as a reset, it could not have survived
> >> either migration or a stop/cont in general, because the vring indices would
> >> have been reset to 0.
> >>
> >> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
> >> devices that would implement them as per virtio spec, and even today it’s
> >> broken for stateful devices.  The mentioned performance issue is likely
> >> real, but we can’t address it by making up SET_STATUS calls that are wrong.
> >>
> >> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
> >> final configuration that would happen upon a DRIVER_OK once the first vring
> >> is started (i.e. receives a kick).  That has the added benefit of being
> >> asynchronous because it doesn’t block any vhost-user messages (which are
> >> synchronous, and thus block downtime).
> >>
> >> Hanna
> >
> > For better or worse kick is per ring. It's out of spec to start rings
> > that were not kicked but I guess you could do configuration ...
> > Seems somewhat asymmetrical though.
> >
> > Let's wait until next week, hopefully Yajun Wu will answer.
> The main motivation of adding VHOST_USER_SET_STATUS is to let backend
> DPDK know
> when DRIVER_OK bit is valid. It's an indication of all VQ configuration
> has sent,
> otherwise DPDK has to rely on first queue pair is ready, then
> receiving/applying
> VQ configuration one by one.
>
> During live migration, configuring VQ one by one is very time consuming.
> For VIRTIO
> net vDPA, HW needs to know how many VQs are enabled to set
> RSS(Receive-Side Scaling).
>
> If you don’t want SET_STATUS message, backend can remove protocol
> feature bit
> VHOST_USER_PROTOCOL_F_STATUS.
> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
> close/reset.

This is incorrect: resetting the device on GET_VRING_BASE breaks
stop/cont, because you don't want to reset the VQs on stop/cont.

>
> I'm not involved in discussion about adding SET_STATUS in Vhost
> protocol. This feature
> is essential for vDPA(same as vhost-vdpa implements VHOST_VDPA_SET_STATUS).
>
> Thanks,
> Yajun
> >
> >>>> Now, we could hand full control of the status byte to the guest, and that
> >>>> would make me content.  But I feel like that doesn’t really work, because
> >>>> qemu needs to intercept the status byte anyway (it needs to know when there
> >>>> is a reset, probably wants to know when the device is configured, etc.), so
> >>>> I don’t think having the status byte in vhost-user really gains us much when
> >>>> qemu could translate status byte changes to/from other vhost-user commands.
> >>>>
> >>>> Hanna
> >>> well it intercepts it but I think it could pass it on unchanged.
> >>>
> >>>
> >>>>> I guess symmetry was the
> >>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
> >>>>>
> >>>>>
> >>>>> SET_STATUS was introduced by:
> >>>>>
> >>>>> commit 923b8921d210763359e96246a58658ac0db6c645
> >>>>> Author: Yajun Wu <yajunw@nvidia.com>
> >>>>> Date:   Mon Oct 17 14:44:52 2022 +0800
> >>>>>
> >>>>>        vhost-user: Support vhost_dev_start
> >>>>>
> >>>>> CC the author.
> >>>>>
>
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://listman.redhat.com/mailman/listinfo/virtio-fs



-- 
German



* Re: [Virtio-fs] (no subject)
  2023-10-09 10:28                     ` German Maglione
@ 2023-10-10  2:56                       ` Yajun Wu
  2023-10-10 10:04                         ` German Maglione
  0 siblings, 1 reply; 53+ messages in thread
From: Yajun Wu @ 2023-10-10  2:56 UTC (permalink / raw)
  To: German Maglione
  Cc: Michael S. Tsirkin, Hanna Czenczek, qemu-devel, virtio-fs,
	Eugenio Pérez, maxime.coquelin, Parav Pandit, Anton Kuchin


On 10/9/2023 6:28 PM, German Maglione wrote:
> On Sat, Oct 7, 2023 at 4:23 AM Yajun Wu <yajunw@nvidia.com> wrote:
>>
>> On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:
>>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>>>>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>>>>>>>
>>>>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>>>>>>>> means the same thing as RESET_DEVICE).
>>>>>>>>>>>
>>>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>>>>>>>> forward the guest-set status byte, but instead just makes it up
>>>>>>>>>>> internally, and actually completely ignores what the back-end returns,
>>>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>>>>>>>> to see whether the flag is still set, which is the only way in which
>>>>>>>>>>> dpdk uses the status byte.
>>>>>>>>>>>
>>>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>>>>>>>> field in a useful manner, and it also provides no practical use over
>>>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>>>>>>>> defined.  Deprecate it.
>>>>>>>>>>>
>>>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>>>> ---
>>>>>>>>>>>       docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>>>>>>>       1 file changed, 21 insertions(+), 7 deletions(-)
>>>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>>>>>>>> The fact current backends never check errors does not mean they never
>>>>>>>>> will. So no, not applying this.
>>>>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
>>>>>>>> order:
>>>>>>>>
>>>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>>>>>>>> present
>>>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>>>> 4. SET_FEATURES with need_reply
>>>>>>>>
>>>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>>>>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
>>>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>>>>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
>>>>>>>> documenting that a particular use of the status byte is to be ignored would
>>>>>>>> be really strange.
>>>>>>>>
>>>>>>>> Hanna
>>>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
>>>>>>> vhost-user reconfigures the state fully on start.
>>>>>> Not the internal device state, though.  virtiofsd has internal state, and
>>>>>> other devices like vhost-gpu back-ends would probably, too.
>>>>>>
>>>>>> Stefan has recently sent a series
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
>>>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
>>>>>> reset).
>>>>>>
>>>>>> I really don’t like our current approach with the status byte. Following the
>>>>>> virtio specification to me would mean that the guest directly controls this
>>>>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
>>>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
>>>>>> when the guest really doesn’t want a device reset.
>>>>>>
>>>>>> That means that qemu does not treat this as a virtio device field (because
>>>>>> that would mean exposing it to the guest driver), but instead treats it as
>>>>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
>>>>>> a virtio-defined feature for communication on the vhost level, i.e. between
>>>>>> front-end and back-end, and not between guest driver and device.  I think
>>>>>> all vhost-level protocol features should be fully defined in the vhost-user
>>>>>> specification, which REPLY_ACK is.
>>>>> Hmm that makes sense. Maybe we should have done what stefan's patch
>>>>> is doing.
>>>>>
>>>>> Do look at the original commit that introduced it to understand why
>>>>> it was added.
>>>> I don’t understand why this was added to the stop/cont code, though.  If it
>>>> is time consuming to make these changes, why are they done every time the VM
>>>> is paused
>>>> and resumed?  It makes sense that this would be done for the initial
>>>> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
>>>>
>>>> (To be clear, a reset in the stop/cont code is wrong, because it breaks
>>>> stateful devices.)
>>>>
>>>> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
>>>> originally introduced was wrong even for non-stateful devices, because it
>>>> occurred before we fetched the state (vring indices) so we could restore it
>>>> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
>>>> for testing implemented SET_STATUS 0 as a reset, it could not have survived
>>>> either migration or a stop/cont in general, because the vring indices would
>>>> have been reset to 0.
>>>>
>>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>>>> devices that would implement them as per virtio spec, and even today it’s
>>>> broken for stateful devices.  The mentioned performance issue is likely
>>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>>>
>>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>>>> final configuration that would happen upon a DRIVER_OK once the first vring
>>>> is started (i.e. receives a kick).  That has the added benefit of being
>>>> asynchronous because it doesn’t block any vhost-user messages (which are
>>>> synchronous, and thus block downtime).
>>>>
>>>> Hanna
>>> For better or worse kick is per ring. It's out of spec to start rings
>>> that were not kicked but I guess you could do configuration ...
>>> Seems somewhat asymmetrical though.
>>>
>>> Let's wait until next week, hopefully Yajun Wu will answer.
>> The main motivation of adding VHOST_USER_SET_STATUS is to let backend DPDK
>> know when DRIVER_OK bit is valid. It's an indication of all VQ
>> configuration has sent, otherwise DPDK has to rely on first queue pair is
>> ready, then receiving/applying VQ configuration one by one.
>>
>> During live migration, configuring VQ one by one is very time consuming.
>> For VIRTIO net vDPA, HW needs to know how many VQs are enabled to set
>> RSS(Receive-Side Scaling).
>>
>> If you don’t want SET_STATUS message, backend can remove protocol feature
>> bit VHOST_USER_PROTOCOL_F_STATUS.
>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
>> close/reset.
> This is incorrect, resetting the device on GET_VRING_BASE breaks
> the stop/cont. Since you don't want to reset the VQs on stop/cont.
Sorry for the misunderstanding. The dpdk vhost backend framework doesn't
have a RESET concept (only the device-level .dev_conf and .dev_close). On
receiving DRIVER_OK it does dev_conf; on receiving GET_VRING_BASE it does
dev_close. For every VM suspend/resume, dpdk issues dev_close then dev_conf.
>
>> I'm not involved in discussion about adding SET_STATUS in Vhost
>> protocol. This feature
>> is essential for vDPA(same as vhost-vdpa implements VHOST_VDPA_SET_STATUS).
>>
>> Thanks,
>> Yajun
>>>>>> Now, we could hand full control of the status byte to the guest, and that
>>>>>> would make me content.  But I feel like that doesn’t really work, because
>>>>>> qemu needs to intercept the status byte anyway (it needs to know when there
>>>>>> is a reset, probably wants to know when the device is configured, etc.), so
>>>>>> I don’t think having the status byte in vhost-user really gains us much when
>>>>>> qemu could translate status byte changes to/from other vhost-user commands.
>>>>>>
>>>>>> Hanna
>>>>> well it intercepts it but I think it could pass it on unchanged.
>>>>>
>>>>>
>>>>>>> I guess symmetry was the
>>>>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
>>>>>>>
>>>>>>>
>>>>>>> SET_STATUS was introduced by:
>>>>>>>
>>>>>>> commit 923b8921d210763359e96246a58658ac0db6c645
>>>>>>> Author: Yajun Wu <yajunw@nvidia.com>
>>>>>>> Date:   Mon Oct 17 14:44:52 2022 +0800
>>>>>>>
>>>>>>>         vhost-user: Support vhost_dev_start
>>>>>>>
>>>>>>> CC the author.
>>>>>>>
>> _______________________________________________
>> Virtio-fs mailing list
>> Virtio-fs@redhat.com
>> https://listman.redhat.com/mailman/listinfo/virtio-fs
>
>
> --
> German
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-09  9:13                         ` Hanna Czenczek
@ 2023-10-10  4:00                           ` Yajun Wu
  2023-10-10  8:18                             ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Yajun Wu @ 2023-10-10  4:00 UTC (permalink / raw)
  To: Hanna Czenczek, Michael S. Tsirkin
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin,
	Parav Pandit, maxime.coquelin, Alex Bennée


On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
> External email: Use caution opening links or attachments
>
>
> On 09.10.23 11:07, Hanna Czenczek wrote:
>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>> On 07.10.23 04:22, Yajun Wu wrote:
>> [...]
>>
>>>> The main motivation of adding VHOST_USER_SET_STATUS is to let backend
>>>> DPDK know when DRIVER_OK bit is valid. It's an indication of all VQ
>>>> configuration has sent, otherwise DPDK has to rely on first queue pair
>>>> is ready, then receiving/applying VQ configuration one by one.
>>>>
>>>> During live migration, configuring VQ one by one is very time
>>>> consuming.
>>> One question I have here is why it wasn’t then introduced in the live
>>> migration code, but in the general VM stop/cont code instead. It does
>>> seem time-consuming to do this every time the VM is paused and resumed.

Yes, VM stop/cont will call vhost_net_stop/vhost_net_start. Maybe 
because there's no device level stop/cont vhost message?

>>>
>>>> For VIRTIO
>>>> net vDPA, HW needs to know how many VQs are enabled to set
>>>> RSS(Receive-Side Scaling).
>>>>
>>>> If you don’t want SET_STATUS message, backend can remove protocol
>>>> feature bit
>>>> VHOST_USER_PROTOCOL_F_STATUS.
>>> The problem isn’t back-ends that don’t want the message, the problem
>>> is that qemu uses the message wrongly, which prevents well-behaving
>>> back-ends from implementing the message.
>>>
>>>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
>>>> close/reset.
>>> So the right thing to do for back-ends is to announce STATUS support
>>> and then not implement it correctly?
>>>
>>> GET_VRING_BASE should not close or reset the device, by the
>>> way.  It should stop that one vring, not more.  We have a
>>> RESET_DEVICE command for resetting.
I believe dpdk used GET_VRING_BASE long before qemu had RESET_DEVICE?
It's a compatibility issue. For new backend implementations, we can have a
better solution, right?
>>>> I'm not involved in discussion about adding SET_STATUS in Vhost
>>>> protocol. This feature
>>>> is essential for vDPA(same as vhost-vdpa implements
>>>> VHOST_VDPA_SET_STATUS).
>>> So from what I gather from your response is that there is only a
>>> single use for SET_STATUS, which is the DRIVER_OK bit.  If so,
>>> documenting that all other bits are to be ignored by both back-end
>>> and front-end would be fine by me.
>>>
>>> I’m not fully serious about that suggestion, but I hear the strong
>>> implication that nothing but DRIVER_OK was of any concern, and this
>>> is really important to note when we talk about the status of the
>>> STATUS feature in vhost today.  It seems to me now that it was not
>>> intended to be the virtio-level status byte, but just a DRIVER_OK
>>> signalling path from front-end to back-end.  That makes it a
>>> vhost-level protocol feature to me.
>> On second thought, it just is a pure vhost-level protocol feature, and
>> has nothing to do with the virtio status byte as-is.  The only stated
>> purpose is for the front-end to send DRIVER_OK after migration, but
>> migration is transparent to the guest, so the guest would never change
>> the status byte during migration.  Therefore, if this feature is
>> essential, we will never be able to have a status byte that is
>> transparently shared between guest and back-end device, i.e. the
>> virtio status byte.
> On third thought, scratch that.  The guest wouldn’t set it, but
> naturally, after migration, the front-end will need to restore the
> status byte from the source, so the front-end will always need to set
> it, even if it were otherwise controlled only by the guest and the
> back-end device.  So technically, this doesn’t prevent such a use case.
> (In practice, it isn’t controlled by the guest right now, but that could
> be fixed.)
I only tested the feature with DPDK (the only backend that uses it today?).
Max defined the protocol and added the corresponding code in DPDK before I
added QEMU support. If other backends or different device types want to
use this, we can have further discussion?
>> Cc-ing Alex on this mail, because to me, this seems like an important
>> detail when he plans on using the byte in the future. If we need a
>> virtio status byte, I can’t see how we could use the existing F_STATUS
>> for it.
>>
>> Hanna


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-10  4:00                           ` Yajun Wu
@ 2023-10-10  8:18                             ` Hanna Czenczek
  2023-10-10 10:36                               ` Alex Bennée
  0 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-10  8:18 UTC (permalink / raw)
  To: Yajun Wu, Michael S. Tsirkin
  Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin,
	Parav Pandit, maxime.coquelin, Alex Bennée

On 10.10.23 06:00, Yajun Wu wrote:
>
> On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
>> On 09.10.23 11:07, Hanna Czenczek wrote:
>>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>>> On 07.10.23 04:22, Yajun Wu wrote:
>>> [...]
>>>
>>>>> The main motivation of adding VHOST_USER_SET_STATUS is to let backend
>>>>> DPDK know when DRIVER_OK bit is valid. It's an indication of all VQ
>>>>> configuration has sent, otherwise DPDK has to rely on first queue pair
>>>>> is ready, then receiving/applying VQ configuration one by one.
>>>>>
>>>>> During live migration, configuring VQ one by one is very time
>>>>> consuming.
>>>> One question I have here is why it wasn’t then introduced in the live
>>>> migration code, but in the general VM stop/cont code instead. It does
>>>> seem time-consuming to do this every time the VM is paused and 
>>>> resumed.
>
> Yes, VM stop/cont will call vhost_net_stop/vhost_net_start. Maybe 
> because there's no device level stop/cont vhost message?

No, it is because qemu will reset the status in stop/cont*, which it 
should not do.  Aside from guest-initiated resets, the only thing where 
a reset comes into play is when the back-end is changed, e.g. during 
migration.  In that case, the source back-end will see a disconnect on 
the vhost-user socket and can then do whatever uninitialization it needs 
to do, and the destination back-end will need to be reconfigured by 
qemu anyway, because it’s just a case of the destination qemu initiating 
a fresh connection to a new back-end (except that it will need to 
restore the state from the source).

*Yes, technically, dpdk will ignore that reset, but it still stops the 
device on a different message (when it should just pause processing 
vrings), so the outcome is the same.

>>>>
>>>>> For VIRTIO
>>>>> net vDPA, HW needs to know how many VQs are enabled to set
>>>>> RSS(Receive-Side Scaling).
>>>>>
>>>>> If you don’t want SET_STATUS message, backend can remove protocol
>>>>> feature bit
>>>>> VHOST_USER_PROTOCOL_F_STATUS.
>>>> The problem isn’t back-ends that don’t want the message, the problem
>>>> is that qemu uses the message wrongly, which prevents well-behaving
>>>> back-ends from implementing the message.
>>>>
>>>>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
>>>>> close/reset.
>>>> So the right thing to do for back-ends is to announce STATUS support
>>>> and then not implement it correctly?
>>>>
>>>> GET_VRING_BASE should not close or reset the device, by the
>>>> way.  It should stop that one vring, not more.  We have a
>>>> RESET_DEVICE command for resetting.
> I believe dpdk used GET_VRING_BASE long before qemu had RESET_DEVICE?

I don’t think it matters who came first.  What matters is the 
specification, and that dpdk decided to rely on implementation-specific 
behavior without having all involved parties agree by means of putting 
that in the specification.  And now dpdk clearly deviates from the 
specification as a result of that action, which can result in problems 
if the front-end doesn’t do what qemu always used to do.  (E.g. the 
front-end might just send GET_VRING_BASE for all vrings when suspending 
the guest, and then only send kicks on resume to re-start the vrings.  
dpdk would most likely be left in a state where the whole device is 
stopped, expecting DRIVER_OK.  Same thing in general for front-ends that 
don’t support F_STATUS.)
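
To make that scenario concrete, here is a toy Python model.  It is not
dpdk or qemu code; every class and function name in it is made up for
illustration.  It contrasts a back-end that stops only the requested
vring on GET_VRING_BASE with a hypothetical one that closes the whole
device on GET_VRING_BASE and only reopens it on DRIVER_OK, as described
for dpdk earlier in this thread.  The front-end suspends and resumes
without ever sending SET_STATUS.

```python
DRIVER_OK = 0x04  # virtio device status bit


class SpecBackend:
    """Stops only the requested vring on GET_VRING_BASE."""

    def __init__(self, num_vrings):
        self.running = [False] * num_vrings
        self.last_avail = [0] * num_vrings

    def kick(self, idx):
        # A kick (re)starts just this vring.
        self.running[idx] = True

    def get_vring_base(self, idx):
        # Stop this vring and report where processing stopped.
        self.running[idx] = False
        return self.last_avail[idx]

    def processes_requests(self, idx):
        return self.running[idx]


class CloseOnGetBaseBackend(SpecBackend):
    """Hypothetical model: GET_VRING_BASE closes the device,
    and only DRIVER_OK reopens it."""

    def __init__(self, num_vrings):
        super().__init__(num_vrings)
        self.device_open = False

    def set_status(self, status):
        if status & DRIVER_OK:
            self.device_open = True  # stands in for "dev_conf"

    def get_vring_base(self, idx):
        self.device_open = False  # stands in for "dev_close"
        return super().get_vring_base(idx)

    def processes_requests(self, idx):
        return self.device_open and self.running[idx]


def suspend_resume_without_status(backend, num_vrings):
    """Front-end that does not use F_STATUS: stop all vrings, then kick."""
    for i in range(num_vrings):
        backend.get_vring_base(i)
    for i in range(num_vrings):
        backend.kick(i)
    return [backend.processes_requests(i) for i in range(num_vrings)]


spec = SpecBackend(2)
closing = CloseOnGetBaseBackend(2)
closing.set_status(DRIVER_OK)  # initial device start

print(suspend_resume_without_status(spec, 2))     # [True, True]
print(suspend_resume_without_status(closing, 2))  # [False, False]
```

In this model the second back-end never processes requests again after
the resume, because nothing ever sends it DRIVER_OK.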

> It's a compatibility issue. For new backend implementations, we can have a
> better solution, right?

The fact that dpdk and qemu deviate from the specification is a problem 
as-is.

>>>>> I'm not involved in discussion about adding SET_STATUS in Vhost
>>>>> protocol. This feature
>>>>> is essential for vDPA(same as vhost-vdpa implements
>>>>> VHOST_VDPA_SET_STATUS).
>>>> So from what I gather from your response is that there is only a
>>>> single use for SET_STATUS, which is the DRIVER_OK bit.  If so,
>>>> documenting that all other bits are to be ignored by both back-end
>>>> and front-end would be fine by me.
>>>>
>>>> I’m not fully serious about that suggestion, but I hear the strong
>>>> implication that nothing but DRIVER_OK was of any concern, and this
>>>> is really important to note when we talk about the status of the
>>>> STATUS feature in vhost today.  It seems to me now that it was not
>>>> intended to be the virtio-level status byte, but just a DRIVER_OK
>>>> signalling path from front-end to back-end.  That makes it a
>>>> vhost-level protocol feature to me.
>>> On second thought, it just is a pure vhost-level protocol feature, and
>>> has nothing to do with the virtio status byte as-is.  The only stated
>>> purpose is for the front-end to send DRIVER_OK after migration, but
>>> migration is transparent to the guest, so the guest would never change
>>> the status byte during migration.  Therefore, if this feature is
>>> essential, we will never be able to have a status byte that is
>>> transparently shared between guest and back-end device, i.e. the
>>> virtio status byte.
>> On third thought, scratch that.  The guest wouldn’t set it, but
>> naturally, after migration, the front-end will need to restore the
>> status byte from the source, so the front-end will always need to set
>> it, even if it were otherwise controlled only by the guest and the
>> back-end device.  So technically, this doesn’t prevent such a use case.
>> (In practice, it isn’t controlled by the guest right now, but that could
>> be fixed.)
> I only tested the feature with DPDK (the only backend that uses it today?).
> Max defined the protocol and added the corresponding code in DPDK
> before I added QEMU support. If other backends or different device types
> want to use this, we can have further discussion?

So as far as I understand, the feature is supposed to rely on 
implementation-specific behavior between specifically qemu as a 
front-end and dpdk as a back-end, nothing else.  Honestly, that to me is 
a very good reason to deprecate it.  That would make it clear that any 
implementation that implements it does so because it relies on 
implementation-specific behavior from other implementations.

Option 2 is to fix it.  It is not right to use this broadly defined 
feature with its clear protocol as given in the virtio specification 
just to set and clear a single bit (DRIVER_OK).  The vhost-user 
specification points to that virtio protocol.  We must adhere to the 
protocol.  And note that we must not reset devices just because the VM 
is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so 
that Stefan’s series would introduce RESET_DEVICE where we need it, and 
we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)

Option 3 would be to just be honest in the specification, and limit the 
scope of F_STATUS to say the only bit that matters is DRIVER_OK.  I 
would say this is not really different from deprecating, though it 
wouldn’t affect your case.  However, I understand Alex relies on a full 
status byte.  I’m still interested to know why that is.

Option 4 is of course not to do anything, and leave everything as-is, 
waiting for the next person to stir the hornet’s nest.

>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>> detail when he plans on using the byte in the future. If we need a
>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>> for it.
>>>
>>> Hanna
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-10  2:56                       ` Yajun Wu
@ 2023-10-10 10:04                         ` German Maglione
  0 siblings, 0 replies; 53+ messages in thread
From: German Maglione @ 2023-10-10 10:04 UTC (permalink / raw)
  To: Yajun Wu
  Cc: Michael S. Tsirkin, Hanna Czenczek, qemu-devel, virtio-fs,
	Eugenio Pérez, maxime.coquelin, Parav Pandit, Anton Kuchin

On Tue, Oct 10, 2023 at 4:57 AM Yajun Wu <yajunw@nvidia.com> wrote:
>
>
> On 10/9/2023 6:28 PM, German Maglione wrote:
> > On Sat, Oct 7, 2023 at 4:23 AM Yajun Wu <yajunw@nvidia.com> wrote:
> >>
> >> On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:
> >>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
> >>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
> >>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
> >>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
> >>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
> >>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
> >>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> >>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> >>>>>>>>>>> There is no clearly defined purpose for the virtio status byte in
> >>>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> >>>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> >>>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
> >>>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> >>>>>>>>>>>
> >>>>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
> >>>>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
> >>>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
> >>>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> >>>>>>>>>>> means the same thing as RESET_DEVICE).
> >>>>>>>>>>>
> >>>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
> >>>>>>>>>>> forward the guest-set status byte, but instead just makes it up
> >>>>>>>>>>> internally, and actually completely ignores what the back-end returns,
> >>>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
> >>>>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> >>>>>>>>>>> to see whether the flag is still set, which is the only way in which
> >>>>>>>>>>> dpdk uses the status byte.
> >>>>>>>>>>>
> >>>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
> >>>>>>>>>>> field in a useful manner, and it also provides no practical use over
> >>>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
> >>>>>>>>>>> defined.  Deprecate it.
> >>>>>>>>>>>
> >>>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>       docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> >>>>>>>>>>>       1 file changed, 21 insertions(+), 7 deletions(-)
> >>>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> >>>>>>>>> The fact current backends never check errors does not mean they never
> >>>>>>>>> will. So no, not applying this.
> >>>>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
> >>>>>>>> order:
> >>>>>>>>
> >>>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
> >>>>>>>> present
> >>>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
> >>>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
> >>>>>>>> 4. SET_FEATURES with need_reply
> >>>>>>>>
> >>>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
> >>>>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
> >>>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
> >>>>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
> >>>>>>>> documenting that a particular use of the status byte is to be ignored would
> >>>>>>>> be really strange.
> >>>>>>>>
> >>>>>>>> Hanna
> >>>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
> >>>>>>> vhost-user reconfigures the state fully on start.
> >>>>>> Not the internal device state, though.  virtiofsd has internal state, and
> >>>>>> other devices like vhost-gpu back-ends would probably, too.
> >>>>>>
> >>>>>> Stefan has recently sent a series
> >>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
> >>>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
> >>>>>> reset).
> >>>>>>
> >>>>>> I really don’t like our current approach with the status byte. Following the
> >>>>>> virtio specification to me would mean that the guest directly controls this
> >>>>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
> >>>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
> >>>>>> when the guest really doesn’t want a device reset.
> >>>>>>
> >>>>>> That means that qemu does not treat this as a virtio device field (because
> >>>>>> that would mean exposing it to the guest driver), but instead treats it as
> >>>>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
> >>>>>> a virtio-defined feature for communication on the vhost level, i.e. between
> >>>>>> front-end and back-end, and not between guest driver and device.  I think
> >>>>>> all vhost-level protocol features should be fully defined in the vhost-user
> >>>>>> specification, which REPLY_ACK is.
> >>>>> Hmm that makes sense. Maybe we should have done what stefan's patch
> >>>>> is doing.
> >>>>>
> >>>>> Do look at the original commit that introduced it to understand why
> >>>>> it was added.
> >>>> I don’t understand why this was added to the stop/cont code, though.  If it
> >>>> is time consuming to make these changes, why are they done every time the VM
> >>>> is paused and resumed?  It makes sense that this would be done for the
> >>>> initial configuration (where a reset also wouldn’t hurt), but here it seems
> >>>> wrong.
> >>>>
> >>>> (To be clear, a reset in the stop/cont code is wrong, because it breaks
> >>>> stateful devices.)
> >>>>
> >>>> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
> >>>> originally introduced was wrong even for non-stateful devices, because it
> >>>> occurred before we fetched the state (vring indices) so we could restore it
> >>>> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
> >>>> for testing implemented SET_STATUS 0 as a reset, it could not have survived
> >>>> either migration or a stop/cont in general, because the vring indices would
> >>>> have been reset to 0.
> >>>>
> >>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
> >>>> devices that would implement them as per virtio spec, and even today it’s
> >>>> broken for stateful devices.  The mentioned performance issue is likely
> >>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
> >>>>
> >>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
> >>>> final configuration that would happen upon a DRIVER_OK once the first vring
> >>>> is started (i.e. receives a kick).  That has the added benefit of being
> >>>> asynchronous because it doesn’t block any vhost-user messages (which are
> >>>> synchronous, and thus block downtime).
> >>>>
> >>>> Hanna
> >>> For better or worse kick is per ring. It's out of spec to start rings
> >>> that were not kicked but I guess you could do configuration ...
> >>> Seems somewhat asymmetrical though.
> >>>
> >>> Let's wait until next week, hopefully Yajun Wu will answer.
> >> The main motivation of adding VHOST_USER_SET_STATUS is to let backend DPDK
> >> know when DRIVER_OK bit is valid. It's an indication of all VQ
> >> configuration has sent, otherwise DPDK has to rely on first queue pair is
> >> ready, then receiving/applying VQ configuration one by one.
> >>
> >> During live migration, configuring VQ one by one is very time consuming.
> >> For VIRTIO net vDPA, HW needs to know how many VQs are enabled to set
> >> RSS(Receive-Side Scaling).
> >>
> >> If you don’t want SET_STATUS message, backend can remove protocol feature
> >> bit VHOST_USER_PROTOCOL_F_STATUS.
> >> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
> >> close/reset.
> > This is incorrect, resetting the device on GET_VRING_BASE breaks
> > the stop/cont. Since you don't want to reset the VQs on stop/cont.
> Sorry for the misunderstanding. The dpdk vhost backend framework doesn't
> have a RESET concept (only the device-level .dev_conf and .dev_close). On
> receiving DRIVER_OK it does dev_conf; on receiving GET_VRING_BASE it does
> dev_close. For every VM suspend/resume, dpdk issues dev_close then dev_conf.

(sorry I did not explain myself well)
I meant that resetting the VQs upon receiving GET_VRING_BASE makes the
backend fail if qemu continues after a "stop". I notice that in dpdk,
when it receives a GET_VRING_BASE[0], it calls 'vring_invalidate(dev, vq);'[1],
resetting the VQ[2]; doing that is incorrect.

[0] https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2135
[1] https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2201
[2] https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost.c#L580

> >
> >> I'm not involved in discussion about adding SET_STATUS in Vhost
> >> protocol. This feature
> >> is essential for vDPA(same as vhost-vdpa implements VHOST_VDPA_SET_STATUS).
> >>
> >> Thanks,
> >> Yajun
> >>>>>> Now, we could hand full control of the status byte to the guest, and that
> >>>>>> would make me content.  But I feel like that doesn’t really work, because
> >>>>>> qemu needs to intercept the status byte anyway (it needs to know when there
> >>>>>> is a reset, probably wants to know when the device is configured, etc.), so
> >>>>>> I don’t think having the status byte in vhost-user really gains us much when
> >>>>>> qemu could translate status byte changes to/from other vhost-user commands.
> >>>>>>
> >>>>>> Hanna
> >>>>> well it intercepts it but I think it could pass it on unchanged.
> >>>>>
> >>>>>
> >>>>>>> I guess symmetry was the
> >>>>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
> >>>>>>>
> >>>>>>>
> >>>>>>> SET_STATUS was introduced by:
> >>>>>>>
> >>>>>>> commit 923b8921d210763359e96246a58658ac0db6c645
> >>>>>>> Author: Yajun Wu <yajunw@nvidia.com>
> >>>>>>> Date:   Mon Oct 17 14:44:52 2022 +0800
> >>>>>>>
> >>>>>>>         vhost-user: Support vhost_dev_start
> >>>>>>>
> >>>>>>> CC the author.
> >>>>>>>
> >
> >
> > --
> > German
> >
>


-- 
German


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-10  8:18                             ` Hanna Czenczek
@ 2023-10-10 10:36                               ` Alex Bennée
  2023-10-10 13:18                                 ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Alex Bennée @ 2023-10-10 10:36 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Yajun Wu, Michael S. Tsirkin, qemu-devel, virtio-fs,
	Eugenio Pérez, Anton Kuchin, Parav Pandit, maxime.coquelin


Hanna Czenczek <hreitz@redhat.com> writes:

> On 10.10.23 06:00, Yajun Wu wrote:
>>
>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
>>> On 09.10.23 11:07, Hanna Czenczek wrote:
>>>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>>>> On 07.10.23 04:22, Yajun Wu wrote:
>>>> [...]
>>>>
<snip>
> So as far as I understand, the feature is supposed to rely on
> implementation-specific behavior between specifically qemu as a
> front-end and dpdk as a back-end, nothing else.  Honestly, that to me
> is a very good reason to deprecate it.  That would make it clear that
> any implementation that implements it does so because it relies on
> implementation-specific behavior from other implementations.
>
> Option 2 is to fix it.  It is not right to use this broadly defined
> feature with its clear protocol as given in the virtio specification
> just to set and clear a single bit (DRIVER_OK).  The vhost-user
> specification points to that virtio protocol.  We must adhere to the
> protocol.  And note that we must not reset devices just because the VM
> is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so
> that Stefan’s series would introduce RESET_DEVICE where we need it,
> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)
>
> Option 3 would be to just be honest in the specification, and limit
> the scope of F_STATUS to say the only bit that matters is DRIVER_OK. 
> I would say this is not really different from deprecating, though it
> wouldn’t affect your case.  However, I understand Alex relies on a
> full status byte.  I’m still interested to know why that is.

For an F_TRANSPORT backend (or whatever the final name ends up being) we
need the backend to have full control of the status byte because all the
handling of VirtIO is deferred to it. Therefore it has to handle all the
feature negotiation and indicate when the device needs resetting.
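
For reference, the bits such a back-end would be reading and writing are
the Device Status bits from the virtio specification.  A small Python
sketch of decoding them follows; the helper name describe_status is made
up for illustration and is not part of any existing code base.

```python
# Device status bits as defined by the virtio specification.
STATUS_BITS = {
    0x01: "ACKNOWLEDGE",
    0x02: "DRIVER",
    0x04: "DRIVER_OK",
    0x08: "FEATURES_OK",
    0x40: "DEVICE_NEEDS_RESET",
    0x80: "FAILED",
}


def describe_status(status):
    """Return the names of all status bits set in `status`.

    A status of 0 is how the driver requests a device reset.
    """
    if status == 0:
        return ["RESET"]
    return [name for bit, name in STATUS_BITS.items() if status & bit]


# Driver finished setup, then the device flagged an unrecoverable error.
print(describe_status(0x0F))
print(describe_status(0x0F | 0x40))
```

DEVICE_NEEDS_RESET is the one bit here that the device sets on its own
initiative, which is why the back-end needs a way to report the byte and
not only receive it.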

(side note: feature negotiation is another slippery area when QEMU gets
involved in gating which feature bits may or may not be exposed to the
backend. The only one it should ever mask is F_UNUSED which is used
(sic) to trigger the vhost protocol negotiation)

> Option 4 is of course not to do anything, and leave everything as-is,
> waiting for the next person to stir the hornet’s nest.
>
>>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>>> detail when he plans on using the byte in the future. If we need a
>>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>>> for it.

What would we use instead of F_STATUS to query the Device Status field?

>>>>
>>>> Hanna
>>


-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Virtio-fs] (no subject)
  2023-10-10 10:36                               ` Alex Bennée
@ 2023-10-10 13:18                                 ` Hanna Czenczek
  2023-10-10 14:35                                   ` Alex Bennée
  0 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-10 13:18 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Michael S. Tsirkin, qemu-devel, virtio-fs, Eugenio Pérez,
	maxime.coquelin, Parav Pandit, Anton Kuchin, Yajun Wu

On 10.10.23 12:36, Alex Bennée wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
>> On 10.10.23 06:00, Yajun Wu wrote:
>>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
>>>> On 09.10.23 11:07, Hanna Czenczek wrote:
>>>>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>>>>> On 07.10.23 04:22, Yajun Wu wrote:
>>>>> [...]
>>>>>
> <snip>
>> So as far as I understand, the feature is supposed to rely on
>> implementation-specific behavior between specifically qemu as a
>> front-end and dpdk as a back-end, nothing else.  Honestly, that to me
>> is a very good reason to deprecate it.  That would make it clear that
>> any implementation that implements it does so because it relies on
>> implementation-specific behavior from other implementations.
>>
>> Option 2 is to fix it.  It is not right to use this broadly defined
>> feature with its clear protocol as given in the virtio specification
>> just to set and clear a single bit (DRIVER_OK).  The vhost-user
>> specification points to that virtio protocol.  We must adhere to the
>> protocol.  And note that we must not reset devices just because the VM
>> is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so
>> that Stefan’s series would introduce RESET_DEVICE where we need it,
>> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)
>>
>> Option 3 would be to just be honest in the specification, and limit
>> the scope of F_STATUS to say the only bit that matters is DRIVER_OK.
>> I would say this is not really different from deprecating, though it
>> wouldn’t affect your case.  However, I understand Alex relies on a
>> full status byte.  I’m still interested to know why that is.
> For an F_TRANSPORT backend (or whatever the final name ends up being) we
> need the backend to have full control of the status byte because all the
> handling of VirtIO is deferred to it. Therefore it has to handle all the
> feature negotiation and indicate when the device needs resetting.
>
> (side note: feature negotiation is another slippery area when QEMU gets
> involved in gating which feature bits may or may not be exposed to the
> backend. The only one it should ever mask is F_UNUSED which is used
> (sic) to trigger the vhost protocol negotiation)

That’s the thing, feature negotiation is done with GET_FEATURES and 
SET_FEATURES.  Configuring F_REPLY_ACK lets SET_FEATURES return errors.

Indicating that the device needs reset is a good point, there is no
other feature to do that.  (And something qemu currently ignores, just
like any value the device returns through GET_STATUS, but that’s beside
the point.)

>> Option 4 is of course not to do anything, and leave everything as-is,
>> waiting for the next person to stir the hornet’s nest.
>>
>>>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>>>> detail when he plans on using the byte in the future. If we need a
>>>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>>>> for it.
> What would we use instead of F_STATUS to query the Device Status field?

We would emulate it in the front-end, just like we need to do for 
back-ends without F_STATUS.  We can’t emulate the DEVICE_NEEDS_RESET 
bit, though, that’s correct.

Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 % 
convinced that your use case has a hard dependency on F_STATUS. However, 
this still does make a fair point in general that it would be useful to 
keep it.

That still leaves us with the situation that currently, the only 
implementations with F_STATUS support are qemu and dpdk, which both 
handle it incorrectly.  Furthermore, the specification leaves much to be 
desired, specifically in how F_STATUS interacts with other vhost-user 
commands (which is something I cited as a reason for my original patch), 
i.e. whether RESET_DEVICE and SET_STATUS 0 are equivalent, and whether 
failures in feature negotiation must result in both SET_FEATURES 
returning an error (with F_REPLY_ACK), and FEATURES_OK being reset in 
the status byte, or whether either is sufficient.  What happens when 
DEVICE_NEEDS_RESET is set, i.e. do we just need RESET_DEVICE / 
SET_STATUS 0, or do we also need to reset some protocol state?  (This is 
also connected to the fact that what happens on RESET_DEVICE is largely 
undefined, which I said on Stefan’s series.)

In general, because we have our own transport, we should document how
it interacts with the status negotiation phases, i.e. that GET_FEATURES 
must not be called before S_ACKNOWLEDGE | S_DRIVER are set, that 
FEATURES_OK must be set after the SET_FEATURES call, and that DRIVER_OK 
must not be set without FEATURES_OK set / SET_FEATURES having returned 
success.  Here we would also answer the question about the interaction 
of F_REPLY_ACK+SET_FEATURES with F_STATUS, specifically whether an 
implementation with F_REPLY_ACK even needs to read back the status byte 
after setting FEATURES_OK because it could have got the feature 
negotiation result already as a result of the SET_FEATURES call.
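
A minimal sketch of the ordering rules listed above, written as a
hypothetical front-end helper.  The bit values are the Device Status
bits from the virtio specification; the class and method names are made
up for illustration and do not correspond to any existing qemu or
rust-vmm API.  The sketch assumes that the result of SET_FEATURES (as
reported through F_REPLY_ACK) is what gates FEATURES_OK, which is one of
the open questions here, not something the specification settles.

```python
# Device Status bits as defined in the virtio specification.
ACKNOWLEDGE = 1
DRIVER = 2
DRIVER_OK = 4
FEATURES_OK = 8
DEVICE_NEEDS_RESET = 64
FAILED = 128


class StatusOrderChecker:
    """Hypothetical helper that rejects out-of-order status transitions."""

    def __init__(self):
        self.status = 0
        self.features_accepted = False

    def _require(self, bits, what):
        if self.status & bits != bits:
            raise RuntimeError(what + " called before required status bits")

    def get_features(self):
        # GET_FEATURES must not be called before ACKNOWLEDGE | DRIVER.
        self._require(ACKNOWLEDGE | DRIVER, "GET_FEATURES")

    def set_features(self, accepted_by_backend):
        # accepted_by_backend models the F_REPLY_ACK result (assumption).
        self._require(ACKNOWLEDGE | DRIVER, "SET_FEATURES")
        self.features_accepted = accepted_by_backend

    def set_status(self, new_status):
        added = new_status & ~self.status
        # FEATURES_OK only after a successful SET_FEATURES.
        if added & FEATURES_OK and not self.features_accepted:
            raise RuntimeError("FEATURES_OK without successful SET_FEATURES")
        # DRIVER_OK only together with or after FEATURES_OK.
        if added & DRIVER_OK and not new_status & FEATURES_OK:
            raise RuntimeError("DRIVER_OK without FEATURES_OK")
        self.status = new_status
```

A front-end following the step-by-step protocol would call
set_status(ACKNOWLEDGE), then set_status(ACKNOWLEDGE | DRIVER), then
get_features() and set_features(), and only then add FEATURES_OK and
finally DRIVER_OK.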

After migration, can you just set all flags immediately or do we need to 
follow this step-by-step protocol?  I think we do need to do it 
step-by-step, mostly for simplicity in the back-end, i.e. that it just 
sees a normal device start-up.

We should also clarify whether SET_STATUS can fail, i.e. whether setting
an invalid status (is it invalid to set FEATURES_OK when the device has
not accepted the features?) makes SET_STATUS fail (with F_REPLY_ACK)
and/or immediately puts the device into DEVICE_NEEDS_RESET.

We should clarify whether SET_STATUS can block.  The current use of 
DRIVER_OK seems to indicate to me that dpdk does do time-consuming 
operations when it sees DRIVER_OK (code looks like it, too) and only 
returns when that’s done, but naïvely, I would expect SET_STATUS to be 
just setting some value and doing whatever needs to be done in the 
background, not actually launching and blocking on an operation.

I think it is dangerous to just push ahead with using F_STATUS without 
acknowledging that its implementation is broken right now, and that it 
is so *on purpose* because the DRIVER_OK bit is the only thing that it 
was supposed to be used for.  Using it for its purported original use 
(actually the virtio status byte) is contradictory to that.  It’s 
probably fixable, but I think it requires taking a step back and seeing 
what needs to be done to remedy the conflict.  If you rip out all the 
existing STATUS code and replace it such that qemu will let the guest 
have full control over the status byte (except for migration, where we 
restore it on the destination, which will result in DRIVER_OK being set 
at the end, fulfilling that requirement), that will fix the 
implementation in qemu.  I think.  But the specification should be 
amended to think about all these corner cases, not least because I think 
they will also affect your implementation.

(The answers to many of the questions I raise for documentation may be 
obvious to you, based on “in virtio, it’s just an MMIO byte that’s 
written and read, so the rest follows from there”.  But evidently the 
implementation we have kind of ignores that e.g. SET_STATUS 0 is a reset 
(6f8be29ec17d44496b9ed67599bceaaba72d1864 is a work-around, not much 
more) or that there is actually a protocol to setting the status flags 
and you can’t just set them all at once, so I don’t think the answers 
are immediately obvious, and should be documented.)

As for me and the original patch: I claimed nobody really needs 
F_STATUS, you say you do, so plainly, I assumed wrong and will naturally 
take my hands off of F_STATUS and just ensure not to implement it in any 
back-end until you’ve fixed it, as Yajun has advised.  I’d still prefer 
mentioning this advice in the documentation until it’s fixed, but, you 
know, I wouldn’t be the first one to say “I now know about the quirk, so 
I can work around it, no need to tell anyone else as long as my stuff 
works”.

Hanna


* Re: [Virtio-fs] (no subject)
  2023-10-10 13:18                                 ` Hanna Czenczek
@ 2023-10-10 14:35                                   ` Alex Bennée
  2023-10-13 18:02                                     ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Alex Bennée @ 2023-10-10 14:35 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Michael S. Tsirkin, qemu-devel, virtio-fs, Eugenio Pérez,
	maxime.coquelin, Parav Pandit, Anton Kuchin, Yajun Wu,
	Viresh Kumar


Hanna Czenczek <hreitz@redhat.com> writes:

(adding Viresh to CC for Xen Vhost questions)

> On 10.10.23 12:36, Alex Bennée wrote:
>> Hanna Czenczek <hreitz@redhat.com> writes:
>>
>>> On 10.10.23 06:00, Yajun Wu wrote:
>>>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> On 09.10.23 11:07, Hanna Czenczek wrote:
>>>>>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>>>>>> On 07.10.23 04:22, Yajun Wu wrote:
>>>>>> [...]
>>>>>>
>> <snip>
>>> So as far as I understand, the feature is supposed to rely on
>>> implementation-specific behavior between specifically qemu as a
>>> front-end and dpdk as a back-end, nothing else.  Honestly, that to me
>>> is a very good reason to deprecate it.  That would make it clear that
>>> any implementation that implements it does so because it relies on
>>> implementation-specific behavior from other implementations.
>>>
>>> Option 2 is to fix it.  It is not right to use this broadly defined
>>> feature with its clear protocol as given in the virtio specification
>>> just to set and clear a single bit (DRIVER_OK).  The vhost-user
>>> specification points to that virtio protocol.  We must adhere to the
>>> protocol.  And note that we must not reset devices just because the VM
>>> is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so
>>> that Stefan’s series would introduce RESET_DEVICE where we need it,
>>> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)
>>>
>>> Option 3 would be to just be honest in the specification, and limit
>>> the scope of F_STATUS to say the only bit that matters is DRIVER_OK.
>>> I would say this is not really different from deprecating, though it
>>> wouldn’t affect your case.  However, I understand Alex relies on a
>>> full status byte.  I’m still interested to know why that is.
>> For an F_TRANSPORT backend (or whatever the final name ends up being) we
>> need the backend to have full control of the status byte because all the
>> handling of VirtIO is deferred to it. Therefore it has to handle all the
>> feature negotiation and indicate when the device needs resetting.
>>
>> (side note: feature negotiation is another slippery area when QEMU gets
>> involved in gating which feature bits may or may not be exposed to the
>> backend. The only one it should ever mask is F_UNUSED which is used
>> (sic) to trigger the vhost protocol negotiation)
>
> That’s the thing, feature negotiation is done with GET_FEATURES and
> SET_FEATURES.  Configuring F_REPLY_ACK lets SET_FEATURES return
> errors.

OK but then what - QEMU fakes up FEATURES_OK in the Device Status field
on the behalf of the backend?

I should point out QEMU doesn't exist in some of these use cases. When
using the rust-vmm backends with Xen for example there is no VMM to talk
to so we have a Xen Vhost Frontend which is entirely concerned with
setup and then once connected up leaves the backend to do its thing. I'd
rather leave the frontend as dumb as possible rather than splitting
logic between the two.

> Indicating that the device needs reset is a good point, there is no
> other feature to do that.  (And something qemu currently ignores, just
> like any value the device returns through GET_STATUS, but that’s
> beside the point.)
>
>>> Option 4 is of course not to do anything, and leave everything as-is,
>>> waiting for the next person to stir the hornet’s nest.
>>>
>>>>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>>>>> detail when he plans on using the byte in the future. If we need a
>>>>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>>>>> for it.
>> What would we use instead of F_STATUS to query the Device Status field?
>
> We would emulate it in the front-end, just like we need to do for
> back-ends without F_STATUS.  We can’t emulate the DEVICE_NEEDS_RESET
> bit, though, that’s correct.
>
> Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 %
> convinced that your use case has a hard dependency on F_STATUS.
> However, this still does make a fair point in general that it would be
> useful to keep it.

OK.

> That still leaves us with the situation that currently, the only
> implementations with F_STATUS support are qemu and dpdk, which both
> handle it incorrectly. 

I was going to say there is also the rust-vmm vhost-user-master crates
which we've imported:

  https://github.com/vireshk/vhost

for the Xen Vhost Frontend:

  https://github.com/vireshk/xen-vhost-frontend

but I can't actually see any handling for GET/SET_STATUS at all which
makes me wonder how we actually work. Viresh?

> Furthermore, the specification leaves much to
> be desired, specifically in how F_STATUS interacts with other
> vhost-user commands (which is something I cited as a reason for my
> original patch), i.e. whether RESET_DEVICE and SET_STATUS 0 are
> equivalent, and whether failures in feature negotiation must result in
> both SET_FEATURES returning an error (with F_REPLY_ACK), and
> FEATURES_OK being reset in the status byte, or whether either is
> sufficient.  What happens when DEVICE_NEEDS_RESET is set, i.e. do we
> just need RESET_DEVICE / SET_STATUS 0, or do we also need to reset
> some protocol state?  (This is also connected to the fact that what
> happens on RESET_DEVICE is largely undefined, which I said on Stefan’s
> series.)

I'm all for strengthening the vhost-user protocol definitions. I'm just
wary of encoding QEMU<->backend implementation details.

> In general, because we have our own transport, we should make a note
> how it interacts with the status negotiation phases, i.e. that
> GET_FEATURES must not be called before S_ACKNOWLEDGE | S_DRIVER are
> set, that FEATURES_OK must be set after the SET_FEATURES call, and
> that DRIVER_OK must not be set without FEATURES_OK set / SET_FEATURES
> having returned success.  Here we would also answer the question about
> the interaction of F_REPLY_ACK+SET_FEATURES with F_STATUS,
> specifically whether an implementation with F_REPLY_ACK even needs to
> read back the status byte after setting FEATURES_OK because it could
> have got the feature negotiation result already as a result of the
> SET_FEATURES call.

Some sequence diagrams would remove a lot of the ambiguity from parsing
the words. I wonder if there is a pretty way to do that to render nicely
in our published docs?

> After migration, can you just set all flags immediately or do we need
> to follow this step-by-step protocol?  I think we do need to do it
> step-by-step, mostly for simplicity in the back-end, i.e. that it just
> sees a normal device start-up.

Makes sense.

> We should also clarify whether SET_STATUS can fail, i.e. whether
> setting an invalid status (is setting FEATURES_OK when the device
> doesn’t think so invalid?) has SET_STATUS fail (with F_REPLY_ACK)
> and/or immediately gets the device into DEVICE_NEEDS_RESET.
>
> We should clarify whether SET_STATUS can block.  The current use of
> DRIVER_OK seems to indicate to me that dpdk does do time-consuming
> operations when it sees DRIVER_OK (code looks like it, too) and only
> returns when that’s done, but naïvely, I would expect SET_STATUS to be
> just setting some value and doing whatever needs to be done in the
> background, not actually launching and blocking on an operation.

Shouldn't the guest driver be reading the status bit until it flips? So
potentially there could be multiple GET_STATUS calls.
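
A rough sketch of what I mean, assuming the front-end forwards each
guest read of the status field as a GET_STATUS call.  Everything here is
hypothetical (the SlowBackend class just simulates a back-end that needs
a few reads before it reports DRIVER_OK); it is not taken from any
existing implementation.

```python
DRIVER_OK = 4  # Device Status bit from the virtio specification


class SlowBackend:
    """Hypothetical back-end that reports DRIVER_OK only once ready."""

    def __init__(self, ready_after):
        self.reads = 0
        self.ready_after = ready_after
        self.requested = 0

    def set_status(self, value):
        # Returns immediately; the start-up work is assumed to happen
        # in the background.
        self.requested = value

    def get_status(self):
        self.reads += 1
        if self.reads >= self.ready_after:
            return self.requested
        # Not ready yet: hide DRIVER_OK from the reader.
        return self.requested & ~DRIVER_OK


def wait_for_driver_ok(get_status, max_polls):
    """Poll get_status until DRIVER_OK shows up; return the poll count."""
    for polls in range(1, max_polls + 1):
        if get_status() & DRIVER_OK:
            return polls
    return None
```

With that model SET_STATUS never blocks, and the cost is that the reader
may issue several GET_STATUS calls before the bit flips.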

> I think it is dangerous to just push ahead with using F_STATUS without
> acknowledging that its implementation is broken right now, and that it
> is so *on purpose* because the DRIVER_OK bit is the only thing that it
> was supposed to be used for.  Using it for its purported original use
> (actually the virtio status byte) is contradictory to that.  It’s
> probably fixable, but I think it requires taking a step back and
> seeing what needs to be done to remedy the conflict.  If you rip out
> all the existing STATUS code and replace it such that qemu will let
> the guest have full control over the status byte (except for
> migration, where we restore it on the destination, which will result
> in DRIVER_OK being set at the end, fulfilling that requirement), that
> will fix the implementation in qemu.  I think.  But the specification
> should be amended to think about all these corner cases, not least
> because I think they will also affect your implementation.
>
> (The answers to many of the questions I raise for documentation may be
> obvious to you, based on “in virtio, it’s just an MMIO byte that’s
> written and read, so the rest follows from there”.  But evidently the
> implementation we have kind of ignores that e.g. SET_STATUS 0 is a
> reset (6f8be29ec17d44496b9ed67599bceaaba72d1864 is a work-around, not
> much more) or that there is actually a protocol to setting the status
> flags and you can’t just set them all at once, so I don’t think the
> answers are immediately obvious, and should be documented.)
>
> As for me and the original patch: I claimed nobody really needs
> F_STATUS, you say you do, so plainly, I assumed wrong and will
> naturally take my hands off of F_STATUS and just ensure not to
> implement it in any back-end until you’ve fixed it, as Yajun has
> advised.  I’d still prefer mentioning this advice in the documentation
> until it’s fixed, but, you know, I wouldn’t be the first one to say “I
> now know about the quirk, so I can work around it, no need to tell
> anyone else as long as my stuff works”.
>
> Hanna


-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


* Re: [Virtio-fs] (no subject)
  2023-10-10 14:35                                   ` Alex Bennée
@ 2023-10-13 18:02                                     ` Hanna Czenczek
  2023-10-17  7:49                                       ` Viresh Kumar
  0 siblings, 1 reply; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-13 18:02 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Michael S. Tsirkin, Viresh Kumar, qemu-devel, virtio-fs,
	Eugenio Pérez, maxime.coquelin, Parav Pandit, Anton Kuchin,
	Yajun Wu

On 10.10.23 16:35, Alex Bennée wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
> (adding Viresh to CC for Xen Vhost questions)
>
>> On 10.10.23 12:36, Alex Bennée wrote:
>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>
>>>> On 10.10.23 06:00, Yajun Wu wrote:
>>>>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
>>>>>> External email: Use caution opening links or attachments
>>>>>>
>>>>>>
>>>>>> On 09.10.23 11:07, Hanna Czenczek wrote:
>>>>>>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>>>>>>> On 07.10.23 04:22, Yajun Wu wrote:
>>>>>>> [...]
>>>>>>>
>>> <snip>
>>>> So as far as I understand, the feature is supposed to rely on
>>>> implementation-specific behavior between specifically qemu as a
>>>> front-end and dpdk as a back-end, nothing else.  Honestly, that to me
>>>> is a very good reason to deprecate it.  That would make it clear that
>>>> any implementation that implements it does so because it relies on
>>>> implementation-specific behavior from other implementations.
>>>>
>>>> Option 2 is to fix it.  It is not right to use this broadly defined
>>>> feature with its clear protocol as given in the virtio specification
>>>> just to set and clear a single bit (DRIVER_OK).  The vhost-user
>>>> specification points to that virtio protocol.  We must adhere to the
>>>> protocol.  And note that we must not reset devices just because the VM
>>>> is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so
>>>> that Stefan’s series would introduce RESET_DEVICE where we need it,
>>>> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)
>>>>
>>>> Option 3 would be to just be honest in the specification, and limit
>>>> the scope of F_STATUS to say the only bit that matters is DRIVER_OK.
>>>> I would say this is not really different from deprecating, though it
>>>> wouldn’t affect your case.  However, I understand Alex relies on a
>>>> full status byte.  I’m still interested to know why that is.
>>> For an F_TRANSPORT backend (or whatever the final name ends up being) we
>>> need the backend to have full control of the status byte because all the
>>> handling of VirtIO is deferred to it. Therefore it has to handle all the
>>> feature negotiation and indicate when the device needs resetting.
>>>
>>> (side note: feature negotiation is another slippery area when QEMU gets
>>> involved in gating which feature bits may or may not be exposed to the
>>> backend. The only one it should ever mask is F_UNUSED which is used
>>> (sic) to trigger the vhost protocol negotiation)
>> That’s the thing, feature negotiation is done with GET_FEATURES and
>> SET_FEATURES.  Configuring F_REPLY_ACK lets SET_FEATURES return
>> errors.
> OK but then what - QEMU fakes up FEATURES_OK in the Device Status field
> on the behalf of the backend?

It does that right now.  When using qemu, the vhost-user status byte is
not exposed to the guest at all.  qemu makes it up completely, and
effectively ignores the response from GET_STATUS.

(Right now, the only use of GET_STATUS is this: There is a function to
set a flag in the status byte, which calls GET_STATUS, ORs the flag in,
and calls SET_STATUS with the result.)
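
In other words, the pattern is a plain read-modify-write.  A minimal
sketch, with hypothetical names (this is not qemu's actual code, and
FakeBackend only stands in for whatever answers GET_STATUS and
SET_STATUS):

```python
class FakeBackend:
    """Hypothetical stand-in for a back-end with a one-byte status."""

    def __init__(self):
        self.status = 0

    def get_status(self):
        return self.status

    def set_status(self, value):
        self.status = value & 0xFF


def add_status(backend, flag):
    """Set one flag: GET_STATUS, OR the flag in, SET_STATUS the result."""
    current = backend.get_status()
    backend.set_status(current | flag)
```

Note that the value read back is only used as input for the OR; nothing
in this pattern checks whether the back-end cleared FEATURES_OK or set
DEVICE_NEEDS_RESET.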

> I should point out QEMU doesn't exist in some of these use cases. When
> using the rust-vmm backends with Xen for example there is no VMM to talk
> to so we have a Xen Vhost Frontend which is entirely concerned with
> setup and then once connected up leaves the backend to do its thing. I'd
> rather leave the frontend as dumb as possible rather than splitting
> logic between the two.
>
>> Indicating that the device needs reset is a good point, there is no
>> other feature to do that.  (And something qemu currently ignores, just
>> like any value the device returns through GET_STATUS, but that’s
>> beside the point.)
>>
>>>> Option 4 is of course not to do anything, and leave everything as-is,
>>>> waiting for the next person to stir the hornet’s nest.
>>>>
>>>>>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>>>>>> detail when he plans on using the byte in the future. If we need a
>>>>>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>>>>>> for it.
>>> What would we use instead of F_STATUS to query the Device Status field?
>> We would emulate it in the front-end, just like we need to do for
>> back-ends without F_STATUS.  We can’t emulate the DEVICE_NEEDS_RESET
>> bit, though, that’s correct.
>>
>> Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 %
>> convinced that your use case has a hard dependency on F_STATUS.
>> However, this still does make a fair point in general that it would be
>> useful to keep it.
> OK.
>
>> That still leaves us with the situation that currently, the only
>> implementations with F_STATUS support are qemu and dpdk, which both
>> handle it incorrectly.
> I was going to say there is also the rust-vmm vhost-user-master crates
> which we've imported:
>
>    https://github.com/vireshk/vhost
>
> for the Xen Vhost Frontend:
>
>    https://github.com/vireshk/xen-vhost-frontend
>
> but I can't actually see any handling for GET/SET_STATUS at all which
> makes me wonder how we actually work. Viresh?

As far as I know the only back-end implementation of F_STATUS is in 
DPDK.  As I said, if anyone else implemented it right now, that would be 
dangerous, because qemu doesn’t adhere to the virtio protocol when it 
comes to the status byte.

>> Furthermore, the specification leaves much to
>> be desired, specifically in how F_STATUS interacts with other
>> vhost-user commands (which is something I cited as a reason for my
>> original patch), i.e. whether RESET_DEVICE and SET_STATUS 0 are
>> equivalent, and whether failures in feature negotiation must result in
>> both SET_FEATURES returning an error (with F_REPLY_ACK), and
>> FEATURES_OK being reset in the status byte, or whether either is
>> sufficient.  What happens when DEVICE_NEEDS_RESET is set, i.e. do we
>> just need RESET_DEVICE / SET_STATUS 0, or do we also need to reset
>> some protocol state?  (This is also connected to the fact that what
>> happens on RESET_DEVICE is largely undefined, which I said on Stefan’s
>> series.)
> I'm all for strengthening the vhost-user protocol definitions. I'm just
> wary of encoding QEMU<->backend implementation details.
>
>> In general, because we have our own transport, we should make a note
>> how it interacts with the status negotiation phases, i.e. that
>> GET_FEATURES must not be called before S_ACKNOWLEDGE | S_DRIVER are
>> set, that FEATURES_OK must be set after the SET_FEATURES call, and
>> that DRIVER_OK must not be set without FEATURES_OK set / SET_FEATURES
>> having returned success.  Here we would also answer the question about
>> the interaction of F_REPLY_ACK+SET_FEATURES with F_STATUS,
>> specifically whether an implementation with F_REPLY_ACK even needs to
>> read back the status byte after setting FEATURES_OK because it could
>> have got the feature negotiation result already as a result of the
>> SET_FEATURES call.
> Some sequence diagrams would remove a lot of the ambiguity from parsing
> the words. I wonder if there is a pretty way to do that to render nicely
> in our published docs?

I’m sure some form of SVG will work.  Somehow.  If not, it should. :)

>> After migration, can you just set all flags immediately or do we need
>> to follow this step-by-step protocol?  I think we do need to do it
>> step-by-step, mostly for simplicity in the back-end, i.e. that it just
>> sees a normal device start-up.
> Makes sense.
>
>> We should also clarify whether SET_STATUS can fail, i.e. whether
>> setting an invalid status (is setting FEATURES_OK when the device
>> doesn’t think so invalid?) has SET_STATUS fail (with F_REPLY_ACK)
>> and/or immediately gets the device into DEVICE_NEEDS_RESET.
>>
>> We should clarify whether SET_STATUS can block.  The current use of
>> DRIVER_OK seems to indicate to me that dpdk does do time-consuming
>> operations when it sees DRIVER_OK (code looks like it, too) and only
>> returns when that’s done, but naïvely, I would expect SET_STATUS to be
>> just setting some value and doing whatever needs to be done in the
>> background, not actually launching and blocking on an operation.
> Shouldn't the guest driver be reading the status bit until it flips? So
> potentially there could be multiple GET_STATUS calls.

Ah, so the device will only show DRIVER_OK as set once it is ready to
serve the driver?

Hanna


* Re: [Virtio-fs] (no subject)
  2023-10-13 18:02                                     ` Hanna Czenczek
@ 2023-10-17  7:49                                       ` Viresh Kumar
  2023-10-17  8:13                                         ` Hanna Czenczek
  0 siblings, 1 reply; 53+ messages in thread
From: Viresh Kumar @ 2023-10-17  7:49 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Alex Bennée, Michael S. Tsirkin, qemu-devel, virtio-fs,
	Eugenio Pérez, maxime.coquelin, Parav Pandit, Anton Kuchin,
	Yajun Wu

On 13-10-23, 20:02, Hanna Czenczek wrote:
> On 10.10.23 16:35, Alex Bennée wrote:
> > I was going to say there is also the rust-vmm vhost-user-master crates
> > which we've imported:
> > 
> >    https://github.com/vireshk/vhost
> > 
> > for the Xen Vhost Frontend:
> > 
> >    https://github.com/vireshk/xen-vhost-frontend
> > 
> > but I can't actually see any handling for GET/SET_STATUS at all which
> > makes me wonder how we actually work. Viresh?
> 
> As far as I know the only back-end implementation of F_STATUS is in DPDK. 
> As I said, if anyone else implemented it right now, that would be dangerous,
> because qemu doesn’t adhere to the virtio protocol when it comes to the
> status byte.

Yeah, none of the Rust-based Virtio backends enable `STATUS` in
`VhostUserProtocolFeatures` and so these messages are never exchanged.

The generic Rust code for the backends doesn't even implement them.
Not sure if it should or not.

-- 
viresh


* Re: [Virtio-fs] (no subject)
  2023-10-17  7:49                                       ` Viresh Kumar
@ 2023-10-17  8:13                                         ` Hanna Czenczek
  0 siblings, 0 replies; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-17  8:13 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Alex Bennée, Michael S. Tsirkin, qemu-devel, virtio-fs,
	Eugenio Pérez, maxime.coquelin, Parav Pandit, Anton Kuchin,
	Yajun Wu

On 17.10.23 09:49, Viresh Kumar wrote:
> On 13-10-23, 20:02, Hanna Czenczek wrote:
>> On 10.10.23 16:35, Alex Bennée wrote:
>>> I was going to say there is also the rust-vmm vhost-user-master crates
>>> which we've imported:
>>>
>>>     https://github.com/vireshk/vhost
>>>
>>> for the Xen Vhost Frontend:
>>>
>>>     https://github.com/vireshk/xen-vhost-frontend
>>>
>>> but I can't actually see any handling for GET/SET_STATUS at all which
>>> makes me wonder how we actually work. Viresh?
>> As far as I know the only back-end implementation of F_STATUS is in DPDK.
>> As I said, if anyone else implemented it right now, that would be dangerous,
>> because qemu doesn’t adhere to the virtio protocol when it comes to the
>> status byte.
> Yeah, none of the Rust-based Virtio backends enable `STATUS` in
> `VhostUserProtocolFeatures` and so these messages are never exchanged.
>
> The generic Rust code for the backends doesn't even implement them.
> Not sure if it should or not.

It absolutely should not, for evidence see this whole thread.  qemu
sends a SET_STATUS 0, which amounts to a reset, when the VM is merely
paused[1], and when it sets the status byte, it does not set it
according to the virtio specification.  Implementing it right now means
relying on and working around qemu’s implementation-defined
spec-breaking behavior.  Also, note that qemu ignores both the feature
negotiation response given through FEATURES_OK and the
DEVICE_NEEDS_RESET bit, so unless it’s worth working around the problems
just to get some form of DRIVER_OK information (note this information
does not come from the driver, but qemu makes it up), I absolutely would
not implement it.

[1] Notably, it does restore the virtio state to the best of its 
abilities when the VM is resumed, but this is all still wrong (there is 
no point in doing so much on a pause/resume, it needlessly costs time) 
and any implementation that does a reset then will rely on the 
implementation-defined behavior that qemu is actually able to restore 
all the state that the back-end would lose during a reset. Notably, 
reset is not even well-defined in the vhost-user specification.  It was 
argued, in this thread, that DPDK works just fine with this, precisely 
because it ignores SET_STATUS 0.  Finally, if virtiofsd in particular, 
as a user of the Rust crates, is reset, it would lose its internal 
state, which qemu cannot restore short of using the upcoming migration 
facilities.


* Re: [Virtio-fs] [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings
  2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings Hanna Czenczek
  2023-10-05 17:43   ` Stefan Hajnoczi
@ 2023-10-18 12:14   ` Michael S. Tsirkin
  2023-10-18 16:17     ` Hanna Czenczek
  1 sibling, 1 reply; 53+ messages in thread
From: Michael S. Tsirkin @ 2023-10-18 12:14 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Eugenio Pérez, Anton Kuchin

On Wed, Oct 04, 2023 at 02:58:59PM +0200, Hanna Czenczek wrote:
> Currently, the vhost-user documentation says that rings are to be
> initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is
> negotiated.  However, by the time of feature negotiation, all rings have
> already been initialized, so it is not entirely clear what this means.
> 
> At least the vhost-user-backend Rust crate's implementation interpreted
> it to mean that whenever this feature is negotiated, all rings are to
> be put into a disabled state, which means that every SET_FEATURES call
> would disable all rings, effectively halting the device.  This is
> problematic because the VHOST_F_LOG_ALL feature is also set or cleared
> this way, which happens during migration.  Doing so should not halt the
> device.
> 
> Other implementations have interpreted this to mean that the device is
> to be initialized with all rings disabled, and a subsequent SET_FEATURES
> call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of
> them.  Here, SET_FEATURES will never disable any ring.
> 
> This interpretation does not suffer the problem of unintentionally
> halting the device whenever features are set or cleared, so it seems
> better and more reasonable.
> 
> We can clarify this in the documentation by making it explicit that the
> enabled/disabled state is tracked even while the vring is stopped.
> Every vring is initialized in a disabled state, and SET_FEATURES without
> VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all
> vrings.
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>


OK, so I am expecting a v5. My advice is to move patch 1 to the end of
the patchset so we can defer it if we want to.

> ---
>  docs/interop/vhost-user.rst | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 50f5acebe5..9f4940a036 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -395,31 +395,33 @@ negotiation.
>  Ring states
>  -----------
>  
> -Rings can be in one of three states:
> +Rings have two independent states: started/stopped, and enabled/disabled.
>  
> -* stopped: the back-end must not process the ring at all.
> +* While a ring is stopped, the back-end must not process the ring at
> +  all, regardless of whether it is enabled or disabled.  The
> +  enabled/disabled state should still be tracked, though, so it can come
> +  into effect once the ring is started.
>  
> -* started but disabled: the back-end must process the ring without
> +* started and disabled: The back-end must process the ring without
>    causing any side effects.  For example, for a networking device,
>    in the disabled state the back-end must not supply any new RX packets,
>    but must process and discard any TX packets.
>  
> -* started and enabled.
> +* started and enabled: The back-end must process the ring normally, i.e.
> +  process all requests and execute them.
>  
> -Each ring is initialized in a stopped state.  The back-end must start
> -ring upon receiving a kick (that is, detecting that file descriptor is
> -readable) on the descriptor specified by ``VHOST_USER_SET_VRING_KICK``
> -or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated,
> -and stop ring upon receiving ``VHOST_USER_GET_VRING_BASE``.
> +Each ring is initialized in a stopped and disabled state.  The back-end
> +must start a ring upon receiving a kick (that is, detecting that file
> +descriptor is readable) on the descriptor specified by
> +``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
> +``VHOST_USER_VRING_KICK`` if negotiated, and stop a ring upon receiving
> +``VHOST_USER_GET_VRING_BASE``.
>  
>  Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
>  
> -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
> -ring starts directly in the enabled state.
> -
> -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
> -initialized in a disabled state and is enabled by
> -``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.
> +In addition, upon receiving a ``VHOST_USER_SET_FEATURES`` message from
> +the front-end without ``VHOST_USER_F_PROTOCOL_FEATURES`` set, the
> +back-end must enable all rings immediately.
>  
>  While processing the rings (whether they are enabled or not), the back-end
>  must support changing some configuration aspects on the fly.
> -- 
> 2.41.0
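
For illustration, the ring-state rules from the quoted documentation
change can be modelled as a small state machine.  This is only a minimal
sketch under the reading described in the commit message, not code from
qemu or from any back-end; every name in it (RingState, RingAction,
ring_on_kick() and so on) is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: two independent per-ring states. */
typedef struct RingState {
    bool started;   /* set by a kick, cleared by GET_VRING_BASE */
    bool enabled;   /* tracked even while the ring is stopped */
} RingState;

typedef enum RingAction {
    RING_IGNORE,    /* stopped: must not process the ring at all */
    RING_DISCARD,   /* started and disabled: process without side effects */
    RING_PROCESS,   /* started and enabled: process normally */
} RingAction;

/* Each ring is initialized in a stopped and disabled state. */
static void ring_init(RingState *r)
{
    r->started = false;
    r->enabled = false;
}

static void ring_on_kick(RingState *r)
{
    r->started = true;
}

static void ring_on_get_vring_base(RingState *r)
{
    r->started = false;     /* the enabled flag is left untouched */
}

static void ring_on_set_vring_enable(RingState *r, bool enable)
{
    r->enabled = enable;
}

/*
 * SET_FEATURES without VHOST_USER_F_PROTOCOL_FEATURES enables all rings.
 * With the bit set, the call never changes any ring's enabled state.
 */
static void rings_on_set_features(RingState *rings, size_t n,
                                  bool protocol_features_set)
{
    if (protocol_features_set) {
        return;
    }
    for (size_t i = 0; i < n; i++) {
        rings[i].enabled = true;
    }
}

static RingAction ring_action(const RingState *r)
{
    if (!r->started) {
        return RING_IGNORE;
    }
    return r->enabled ? RING_PROCESS : RING_DISCARD;
}
```

In this model, toggling VHOST_F_LOG_ALL through SET_FEATURES during
migration (with the protocol features bit set) leaves every ring as it
was, which is the behavior the commit message argues for.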

On Wed, Oct 04, 2023 at 02:59:00PM +0200, Hanna Czenczek wrote:
> In vDPA, GET_VRING_BASE does not stop the queried vring, which is why
> SUSPEND was introduced so that the returned index would be stable.  In
> vhost-user, it does stop the vring, so under the same reasoning, it can
> get away without SUSPEND.
> 
> Still, we do want to clarify that if the device is completely stopped,
> i.e. all vrings are stopped, the back-end should cease to modify any
> state relating to the guest.  Do this by calling it "suspended".
> 
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  docs/interop/vhost-user.rst | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 9f4940a036..d282155562 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -426,6 +426,19 @@ back-end must enable all rings immediately.
>  While processing the rings (whether they are enabled or not), the back-end
>  must support changing some configuration aspects on the fly.
>  
> +.. _suspended_device_state:
> +
> +Suspended device state
> +^^^^^^^^^^^^^^^^^^^^^^
> +
> +While all vrings are stopped, the device is *suspended*.  In addition to
> +not processing any vring (because they are stopped), the device must:
> +
> +* not write to any guest memory regions,
> +* not send any notifications to the guest,
> +* not send any messages to the front-end,
> +* still process and reply to messages from the front-end.
> +
>  Multiple queue support
>  ----------------------
>  
> @@ -513,7 +526,8 @@ ancillary data, it may be used to inform the front-end that the log has
>  been modified.
>  
>  Once the source has finished migration, rings will be stopped by the
> -source. No further update must be done before rings are restarted.
> +source (:ref:`Suspended device state <suspended_device_state>`). No
> +further update must be done before rings are restarted.
>  
>  In postcopy migration the back-end is started before all the memory has
>  been received from the source host, and care must be taken to avoid
> @@ -1101,6 +1115,10 @@ Front-end message types
>    (*a vring descriptor index for split virtqueues* vs. *vring descriptor
>    indices for packed virtqueues*).
>  
> +  When and as long as all of a device’s vrings are stopped, it is
> +  *suspended*, see :ref:`Suspended device state
> +  <suspended_device_state>`.
> +
>    The request payload’s *num* field is currently reserved and must be
>    set to 0.
>  
> -- 
> 2.41.0
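
As a rough illustration of the "suspended" definition quoted above, a
back-end could gate its actions on whether any vring is started.  This is
a sketch only; the enum and both helpers are hypothetical names, and it
assumes that "still process and reply to messages from the front-end"
means replies stay allowed while unsolicited back-end messages do not.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of what a back-end might want to do. */
typedef enum DeviceAction {
    ACT_WRITE_GUEST_MEMORY,
    ACT_NOTIFY_GUEST,
    ACT_SEND_MESSAGE_TO_FRONT_END,  /* unsolicited back-end message */
    ACT_REPLY_TO_FRONT_END,         /* reply to a front-end message */
} DeviceAction;

/* The device is suspended while all of its vrings are stopped. */
static bool device_is_suspended(const bool *ring_started, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ring_started[i]) {
            return false;
        }
    }
    return true;
}

/* While suspended, only replying to front-end messages remains allowed. */
static bool action_allowed(bool suspended, DeviceAction action)
{
    if (!suspended) {
        return true;
    }
    return action == ACT_REPLY_TO_FRONT_END;
}
```

Note that starting a single vring is enough to leave the suspended state
in this model, since the definition only covers the case where all
vrings are stopped.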

On Wed, Oct 04, 2023 at 02:59:01PM +0200, Hanna Czenczek wrote:
> For vhost-user devices, qemu can migrate the virtio state, but not the
> back-end's internal state.  To do so, we need to be able to transfer
> this internal state between front-end (qemu) and back-end.
> 
> At this point, this new feature is added for the purpose of virtio-fs
> migration.  Because virtiofsd's internal state will not be too large, we
> believe it is best to transfer it as a single binary blob after the
> streaming phase.
> 
> These are the additions to the protocol:
> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_DEVICE_STATE
> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a file
>   descriptor over which to transfer the state.
> - CHECK_DEVICE_STATE: After the state has been transferred through the
>   file descriptor, the front-end invokes this function to verify
>   success.  There is no in-band way (through the file descriptor) to
>   indicate failure, so we need to check explicitly.
> 
> Once the transfer FD has been established via SET_DEVICE_STATE_FD
> (which includes establishing the direction of transfer and migration
> phase), the sending side writes its data into it, and the reading side
> reads it until it sees an EOF.  Then, the front-end will check for
> success via CHECK_DEVICE_STATE, which on the destination side includes
> checking for integrity (i.e. errors during deserialization).
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  docs/interop/vhost-user.rst | 172 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 172 insertions(+)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index d282155562..aa91e2b34e 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -306,6 +306,32 @@ Inflight description
>  
>  :queue size: a 16-bit size of virtqueues
>  
> +Device state transfer parameters
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> ++--------------------+-----------------+
> +| transfer direction | migration phase |
> ++--------------------+-----------------+
> +
> +:transfer direction: a 32-bit enum, describing the direction in which
> +  the state is transferred:
> +
> +  - 0: Save: Transfer the state from the back-end to the front-end,
> +    which happens on the source side of migration
> +  - 1: Load: Transfer the state from the front-end to the back-end,
> +    which happens on the destination side of migration
> +
> +:migration phase: a 32-bit enum, describing the state in which the VM
> +  guest and devices are:
> +
> +  - 0: Stopped (in the period after the transfer of memory-mapped
> +    regions before switch-over to the destination): The VM guest is
> +    stopped, and the vhost-user device is suspended (see
> +    :ref:`Suspended device state <suspended_device_state>`).
> +
> +  In the future, additional phases might be added e.g. to allow
> +  iterative migration while the device is running.
> +
>  C structure
>  -----------
>  
> @@ -365,6 +391,7 @@ in the ancillary data:
>  * ``VHOST_USER_SET_VRING_ERR``
>  * ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
>  * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
> +* ``VHOST_USER_SET_DEVICE_STATE_FD``
>  
>  If *front-end* is unable to send the full message or receives a wrong
>  reply it will close the connection. An optional reconnection mechanism
> @@ -539,6 +566,80 @@ it performs WAKE ioctl's on the userfaultfd to wake the stalled
>  back-end.  The front-end indicates support for this via the
>  ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.
>  
> +.. _migrating_backend_state:
> +
> +Migrating back-end state
> +^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Migrating device state involves transferring the state from one
> +back-end, called the source, to another back-end, called the
> +destination.  After migration, the destination transparently resumes
> +operation without requiring the driver to re-initialize the device at
> +the VIRTIO level.  If the migration fails, then the source can
> +transparently resume operation until another migration attempt is made.
> +
> +Generally, the front-end is connected to a virtual machine guest (which
> +contains the driver), which has its own state to transfer between source
> +and destination, and therefore will have an implementation-specific
> +mechanism to do so.  The ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature
> +provides functionality to have the front-end include the back-end's
> +state in this transfer operation so the back-end does not need to
> +implement its own mechanism, and so the virtual machine may have its
> +complete state, including vhost-user devices' states, contained within a
> +single stream of data.
> +
> +To do this, the back-end state is transferred from back-end to front-end
> +on the source side, and vice versa on the destination side.  This
> +transfer happens over a channel that is negotiated using the
> +``VHOST_USER_SET_DEVICE_STATE_FD`` message.  This message has two
> +parameters:
> +
> +* Direction of transfer: On the source, the data is saved, transferring
> +  it from the back-end to the front-end.  On the destination, the data
> +  is loaded, transferring it from the front-end to the back-end.
> +
> +* Migration phase: Currently, the only supported phase is the period
> +  after the transfer of memory-mapped regions before switch-over to the
> +  destination, when both the source and destination devices are
> +  suspended (:ref:`Suspended device state <suspended_device_state>`).
> +  In the future, additional phases might be supported to allow iterative
> +  migration while the device is running.
> +
> +The nature of the channel is implementation-defined, but it must
> +generally behave like a pipe: The writing end will write all the data it
> +has into it, signalling the end of data by closing its end.  The reading
> +end must read all of this data (until encountering the end of file) and
> +process it.
> +
> +* When saving, the writing end is the source back-end, and the reading
> +  end is the source front-end.  After reading the state data from the
> +  channel, the source front-end must transfer it to the destination
> +  front-end through an implementation-defined mechanism.
> +
> +* When loading, the writing end is the destination front-end, and the
> +  reading end is the destination back-end.  After reading the state data
> +  from the channel, the destination back-end must deserialize its
> +  internal state from that data and set itself up to allow the driver to
> +  seamlessly resume operation on the VIRTIO level.
> +
> +Seamlessly resuming operation means that the migration must be
> +transparent to the guest driver, which operates on the VIRTIO level.
> +This driver will not perform any re-initialization steps, but continue
> +to use the device as if no migration had occurred.  The vhost-user
> +front-end, however, will re-initialize the vhost state on the
> +destination, following the usual protocol for establishing a connection
> +to a vhost-user back-end: This includes, for example, setting up memory
> +mappings and kick and call FDs as necessary, negotiating protocol
> +features, or setting the initial vring base indices (to the same value
> +as on the source side, so that operation can resume).
> +
> +Both on the source and on the destination side, after the respective
> +front-end has seen all data transferred (when the transfer FD has been
> +closed), it sends the ``VHOST_USER_CHECK_DEVICE_STATE`` message to
> +verify that data transfer was successful in the back-end, too.  The
> +back-end responds once it knows whether the transfer and processing was
> +successful or not.
> +
>  Memory access
>  -------------
>  
> @@ -932,6 +1033,7 @@ Protocol features
>    #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
>    #define VHOST_USER_PROTOCOL_F_STATUS               16
>    #define VHOST_USER_PROTOCOL_F_XEN_MMAP             17
> +  #define VHOST_USER_PROTOCOL_F_DEVICE_STATE         18
>  
>  Front-end message types
>  -----------------------
> @@ -1532,6 +1634,76 @@ Front-end message types
>    back-end for its device status as defined in the Virtio specification.
>    Deprecated together with VHOST_USER_SET_STATUS.
>  
> +``VHOST_USER_SET_DEVICE_STATE_FD``
> +  :id: 41
> +  :equivalent ioctl: N/A
> +  :request payload: device state transfer parameters
> +  :reply payload: ``u64``
> +
> +  Front-end and back-end negotiate a channel over which to transfer the
> +  back-end’s internal state during migration.  Either side (front-end or
> +  back-end) may create the channel.  The nature of this channel is not
> +  restricted or defined in this document, but whichever side creates it
> +  must create a file descriptor that is provided to the other side,
> +  allowing access to the channel.  This FD must behave as
> +  follows:
> +
> +  * For the writing end, it must allow writing the whole back-end state
> +    sequentially.  Closing the file descriptor signals the end of
> +    transfer.
> +
> +  * For the reading end, it must allow reading the whole back-end state
> +    sequentially.  The end of file signals the end of the transfer.
> +
> +  For example, the channel may be a pipe, in which case the two ends of
> +  the pipe fulfill these requirements respectively.
> +
> +  Initially, the front-end creates a channel along with such an FD.  It
> +  passes the FD to the back-end as ancillary data of a
> +  ``VHOST_USER_SET_DEVICE_STATE_FD`` message.  The back-end may create a
> +  different transfer channel, passing the respective FD back to the
> +  front-end as ancillary data of the reply.  If so, the front-end must
> +  then discard its channel and use the one provided by the back-end.
> +
> +  Whether the back-end should use its own channel depends on
> +  efficiency: If the channel is a pipe, both ends will most
> +  likely need to copy data into and out of it.  Any channel that allows
> +  for more efficient processing on at least one end, e.g. through
> +  zero-copy, is considered more efficient and thus preferred.  If the
> +  back-end can provide such a channel, it should decide to use it.
> +
> +  The request payload contains parameters for the subsequent data
> +  transfer, as described in the :ref:`Migrating back-end state
> +  <migrating_backend_state>` section.
> +
> +  The value returned indicates both success or failure, and whether a
> +  file descriptor for a back-end-provided channel is returned: Bits 0–7
> +  are 0 on success, and non-zero on error.  Bit 8 is the invalid FD
> +  flag; this flag is set when there is no file descriptor returned.
> +  When this flag is not set, the front-end must use the returned file
> +  descriptor as its end of the transfer channel.  The back-end must not
> +  both indicate an error and return a file descriptor.
> +
> +  Using this function requires prior negotiation of the
> +  ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature.
> +
> +``VHOST_USER_CHECK_DEVICE_STATE``
> +  :id: 42
> +  :equivalent ioctl: N/A
> +  :request payload: N/A
> +  :reply payload: ``u64``
> +
> +  After transferring the back-end’s internal state during migration (see
> +  the :ref:`Migrating back-end state <migrating_backend_state>`
> +  section), check whether the back-end was able to fully process the
> +  state.
> +
> +  The value returned indicates success or error; 0 is success, any
> +  non-zero value is an error.
> +
> +  Using this function requires prior negotiation of the
> +  ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature.
> +
>  
>  Back-end message types
>  ----------------------
> -- 
> 2.41.0
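
To make the reply encoding of SET_DEVICE_STATE_FD concrete, here is a
small decoding sketch based only on the quoted text: bits 0-7 carry the
error indication and bit 8 is the invalid FD flag.  The macro, enum and
function names are made up for this example, and treating "error plus
FD" as a malformed reply is an assumption; the documentation only says
the back-end must not do that.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the two fields of the u64 reply. */
#define STATE_FD_REPLY_ERROR_MASK  0xffULL      /* bits 0-7 */
#define STATE_FD_REPLY_NOFD_FLAG   (1ULL << 8)  /* bit 8: no FD returned */

typedef enum StateFdReply {
    STATE_FD_ERROR,           /* back-end refused the transfer */
    STATE_FD_USE_OWN_FD,      /* success, keep the front-end's channel */
    STATE_FD_USE_BACKEND_FD,  /* success, switch to the returned FD */
    STATE_FD_MALFORMED,       /* error and FD at once (assumed invalid) */
} StateFdReply;

static StateFdReply decode_state_fd_reply(uint64_t value)
{
    bool error = (value & STATE_FD_REPLY_ERROR_MASK) != 0;
    bool has_fd = (value & STATE_FD_REPLY_NOFD_FLAG) == 0;

    if (error && has_fd) {
        return STATE_FD_MALFORMED;
    }
    if (error) {
        return STATE_FD_ERROR;
    }
    return has_fd ? STATE_FD_USE_BACKEND_FD : STATE_FD_USE_OWN_FD;
}
```

One consequence worth spelling out: a reply of plain 0 means success
with a back-end-provided FD attached, so a back-end that keeps the
front-end's channel has to reply with bit 8 set.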

On Wed, Oct 04, 2023 at 02:59:02PM +0200, Hanna Czenczek wrote:
> Add the interface for transferring the back-end's state during migration
> as defined previously in vhost-user.rst.
> 
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |  24 +++++
>  include/hw/virtio/vhost.h         |  78 ++++++++++++++++
>  hw/virtio/vhost-user.c            | 148 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c                 |  37 ++++++++
>  4 files changed, 287 insertions(+)
> 
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index 31a251a9f5..b6eee7e9fd 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>  } VhostSetConfigType;
>  
> +typedef enum VhostDeviceStateDirection {
> +    /* Transfer state from back-end (device) to front-end */
> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> +    /* Transfer state from front-end to back-end (device) */
> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> +} VhostDeviceStateDirection;
> +
> +typedef enum VhostDeviceStatePhase {
> +    /* The device (and all its vrings) is stopped */
> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> +} VhostDeviceStatePhase;
> +
>  struct vhost_inflight;
>  struct vhost_dev;
>  struct vhost_log;
> @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
>  
>  typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
>  
> +typedef bool (*vhost_supports_device_state_op)(struct vhost_dev *dev);
> +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
> +                                            VhostDeviceStateDirection direction,
> +                                            VhostDeviceStatePhase phase,
> +                                            int fd,
> +                                            int *reply_fd,
> +                                            Error **errp);
> +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
> +
>  typedef struct VhostOps {
>      VhostBackendType backend_type;
>      vhost_backend_init vhost_backend_init;
> @@ -181,6 +202,9 @@ typedef struct VhostOps {
>      vhost_force_iommu_op vhost_force_iommu;
>      vhost_set_config_call_op vhost_set_config_call;
>      vhost_reset_status_op vhost_reset_status;
> +    vhost_supports_device_state_op vhost_supports_device_state;
> +    vhost_set_device_state_fd_op vhost_set_device_state_fd;
> +    vhost_check_device_state_op vhost_check_device_state;
>  } VhostOps;
>  
>  int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index 14621f9e79..a0d03c9fdf 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -348,4 +348,82 @@ static inline int vhost_reset_device(struct vhost_dev *hdev)
>  }
>  #endif /* CONFIG_VHOST */
>  
> +/**
> + * vhost_supports_device_state(): Checks whether the back-end supports
> + * transferring internal device state for the purpose of migration.
> + * Support for this feature is required for vhost_set_device_state_fd()
> + * and vhost_check_device_state().
> + *
> + * @dev: The vhost device
> + *
> + * Returns true if the device supports these commands, and false if it
> + * does not.
> + */
> +bool vhost_supports_device_state(struct vhost_dev *dev);
> +
> +/**
> + * vhost_set_device_state_fd(): Begin transfer of internal state from/to
> + * the back-end for the purpose of migration.  Data is to be transferred
> + * over a pipe according to @direction and @phase.  The sending end must
> + * only write to the pipe, and the receiving end must only read from it.
> + * Once the sending end is done, it closes its FD.  The receiving end
> + * must take this as the end-of-transfer signal and close its FD, too.
> + *
> + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
> + * read FD for LOAD.  This function transfers ownership of @fd to the
> + * back-end, i.e. closes it in the front-end.
> + *
> + * The back-end may optionally reply with an FD of its own, if this
> + * improves efficiency on its end.  In this case, the returned FD is
> + * stored in *reply_fd.  The back-end will discard the FD sent to it,
> + * and the front-end must use *reply_fd for transferring state to/from
> + * the back-end.
> + *
> + * @dev: The vhost device
> + * @direction: The direction in which the state is to be transferred.
> + *             For outgoing migrations, this is SAVE, and data is read
> + *             from the back-end and stored by the front-end in the
> + *             migration stream.
> + *             For incoming migrations, this is LOAD, and data is read
> + *             by the front-end from the migration stream and sent to
> + *             the back-end to restore the saved state.
> + * @phase: Which migration phase we are in.  Currently, there is only
> + *         STOPPED (device and all vrings are stopped); in the future,
> + *         more phases such as PRE_COPY or POST_COPY may be added.
> + * @fd: Back-end's end of the pipe through which to transfer state; note
> + *      that ownership is transferred to the back-end, so this function
> + *      closes @fd in the front-end.
> + * @reply_fd: If the back-end wishes to use a different pipe for state
> + *            transfer, this will contain an FD for the front-end to
> + *            use.  Otherwise, -1 is stored here.
> + * @errp: Potential error description
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_set_device_state_fd(struct vhost_dev *dev,
> +                              VhostDeviceStateDirection direction,
> +                              VhostDeviceStatePhase phase,
> +                              int fd,
> +                              int *reply_fd,
> +                              Error **errp);
> +
> +/**
> + * vhost_check_device_state(): After transferring state from/to the
> + * back-end via vhost_set_device_state_fd(), i.e. once the sending end
> + * has closed the pipe, ask the back-end to report any potential
> + * errors that have occurred on its side.  This allows detecting errors
> + * like:
> + * - During outgoing migration, when the source side had already started
> + *   to produce its state, something went wrong and it failed to finish
> + * - During incoming migration, when the received state is somehow
> + *   invalid and cannot be processed by the back-end
> + *
> + * @dev: The vhost device
> + * @errp: Potential error description
> + *
> + * Returns 0 when the back-end reports successful state transfer and
> + * processing, and -errno when an error occurred somewhere.
> + */
> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
> +
>  #endif
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 7bed9ad7d5..7096b148a9 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -74,6 +74,8 @@ enum VhostUserProtocolFeature {
>      /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
>      VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
>      VHOST_USER_PROTOCOL_F_STATUS = 16,
> +    /* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP. */
> +    VHOST_USER_PROTOCOL_F_DEVICE_STATE = 18,
>      VHOST_USER_PROTOCOL_F_MAX
>  };
>  
> @@ -121,6 +123,8 @@ typedef enum VhostUserRequest {
>      VHOST_USER_REM_MEM_REG = 38,
>      VHOST_USER_SET_STATUS = 39,
>      VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_SET_DEVICE_STATE_FD = 41,
> +    VHOST_USER_CHECK_DEVICE_STATE = 42,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -212,6 +216,12 @@ typedef struct {
>      uint32_t size; /* the following payload size */
>  } QEMU_PACKED VhostUserHeader;
>  
> +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
> +typedef struct VhostUserTransferDeviceState {
> +    uint32_t direction;
> +    uint32_t phase;
> +} VhostUserTransferDeviceState;
> +
>  typedef union {
>  #define VHOST_USER_VRING_IDX_MASK   (0xff)
>  #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> @@ -226,6 +236,7 @@ typedef union {
>          VhostUserCryptoSession session;
>          VhostUserVringArea area;
>          VhostUserInflight inflight;
> +        VhostUserTransferDeviceState transfer_state;
>  } VhostUserPayload;
>  
>  typedef struct VhostUserMsg {
> @@ -2746,6 +2757,140 @@ static void vhost_user_reset_status(struct vhost_dev *dev)
>      }
>  }
>  
> +static bool vhost_user_supports_device_state(struct vhost_dev *dev)
> +{
> +    return virtio_has_feature(dev->protocol_features,
> +                              VHOST_USER_PROTOCOL_F_DEVICE_STATE);
> +}
> +
> +static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
> +                                          VhostDeviceStateDirection direction,
> +                                          VhostDeviceStatePhase phase,
> +                                          int fd,
> +                                          int *reply_fd,
> +                                          Error **errp)
> +{
> +    int ret;
> +    struct vhost_user *vu = dev->opaque;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_SET_DEVICE_STATE_FD,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.transfer_state),
> +        },
> +        .payload.transfer_state = {
> +            .direction = direction,
> +            .phase = phase,
> +        },
> +    };
> +
> +    *reply_fd = -1;
> +
> +    if (!vhost_user_supports_device_state(dev)) {
> +        close(fd);
> +        error_setg(errp, "Back-end does not support migration state transfer");
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, &fd, 1);
> +    close(fd);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to send SET_DEVICE_STATE_FD message");
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to receive SET_DEVICE_STATE_FD reply");
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
> +        error_setg(errp,
> +                   "Received unexpected message type, expected %d, received %d",
> +                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp,
> +                   "Received bad message size, expected %zu, received %" PRIu32,
> +                   sizeof(msg.payload.u64), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if ((msg.payload.u64 & 0xff) != 0) {
> +        error_setg(errp, "Back-end did not accept migration state transfer");
> +        return -EIO;
> +    }
> +
> +    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
> +        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
> +        if (*reply_fd < 0) {
> +            error_setg(errp,
> +                       "Failed to get back-end-provided transfer pipe FD");
> +            *reply_fd = -1;
> +            return -EIO;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
> +{
> +    int ret;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_CHECK_DEVICE_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = 0,
> +        },
> +    };
> +
> +    if (!vhost_user_supports_device_state(dev)) {
> +        error_setg(errp, "Back-end does not support migration state transfer");
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to send CHECK_DEVICE_STATE message");
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to receive CHECK_DEVICE_STATE reply");
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
> +        error_setg(errp,
> +                   "Received unexpected message type, expected %d, received %d",
> +                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp,
> +                   "Received bad message size, expected %zu, received %" PRIu32,
> +                   sizeof(msg.payload.u64), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.payload.u64 != 0) {
> +        error_setg(errp, "Back-end failed to process its internal state");
> +        return -EIO;
> +    }
> +
> +    return 0;
> +}
> +
>  const VhostOps user_ops = {
>          .backend_type = VHOST_BACKEND_TYPE_USER,
>          .vhost_backend_init = vhost_user_backend_init,
> @@ -2782,4 +2927,7 @@ const VhostOps user_ops = {
>          .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
>          .vhost_dev_start = vhost_user_dev_start,
>          .vhost_reset_status = vhost_user_reset_status,
> +        .vhost_supports_device_state = vhost_user_supports_device_state,
> +        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
> +        .vhost_check_device_state = vhost_user_check_device_state,
>  };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 6003e50e83..85e199f0aa 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2096,3 +2096,40 @@ int vhost_reset_device(struct vhost_dev *hdev)
>  
>      return -ENOSYS;
>  }
> +
> +bool vhost_supports_device_state(struct vhost_dev *dev)
> +{
> +    if (dev->vhost_ops->vhost_supports_device_state) {
> +        return dev->vhost_ops->vhost_supports_device_state(dev);
> +    }
> +
> +    return false;
> +}
> +
> +int vhost_set_device_state_fd(struct vhost_dev *dev,
> +                              VhostDeviceStateDirection direction,
> +                              VhostDeviceStatePhase phase,
> +                              int fd,
> +                              int *reply_fd,
> +                              Error **errp)
> +{
> +    if (dev->vhost_ops->vhost_set_device_state_fd) {
> +        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
> +                                                         fd, reply_fd, errp);
> +    }
> +
> +    error_setg(errp,
> +               "vhost transport does not support migration state transfer");
> +    return -ENOSYS;
> +}
> +
> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
> +{
> +    if (dev->vhost_ops->vhost_check_device_state) {
> +        return dev->vhost_ops->vhost_check_device_state(dev, errp);
> +    }
> +
> +    error_setg(errp,
> +               "vhost transport does not support migration state transfer");
> +    return -ENOSYS;
> +}
> -- 
> 2.41.0

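The reply handling in vhost_user_check_device_state() above boils down
to three checks: the reply must carry the expected request code, its
payload must be exactly one u64, and that u64 must be zero.  Below is a
minimal standalone sketch of the same checks.  The struct, the function
name, and the request code value are hypothetical stand-ins for
illustration, not QEMU's VhostUserMsg or the constant from the spec.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request code, for illustration only (not the spec value) */
#define EXAMPLE_CHECK_DEVICE_STATE 100u

/* Simplified stand-in for a vhost-user message carrying a u64 payload */
struct example_msg {
    uint32_t request;
    uint32_t flags;
    uint32_t size;      /* payload size in bytes */
    uint64_t payload;
};

/* Returns 0 if the reply reports success, and -errno otherwise */
static int example_check_reply(const struct example_msg *reply)
{
    assert(reply != NULL);

    if (reply->request != EXAMPLE_CHECK_DEVICE_STATE) {
        return -EPROTO;     /* reply belongs to some other request */
    }

    if (reply->size != sizeof(reply->payload)) {
        return -EPROTO;     /* payload is not exactly one u64 */
    }

    if (reply->payload != 0) {
        return -EIO;        /* back-end reported a failure */
    }

    return 0;
}
```

Keeping protocol violations (-EPROTO) apart from a failure reported by
the back-end (-EIO) mirrors the distinction the patch makes.
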
On Wed, Oct 04, 2023 at 02:59:04PM +0200, Hanna Czenczek wrote:
> A virtio-fs device's VM state consists of:
> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
> - the back-end's (virtiofsd's) internal state
> 
> We get/set the latter via the new vhost operations to transfer migratory
> state.  It is its own dedicated subsection, so that for external
> migration, it can be disabled.
> 
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  hw/virtio/vhost-user-fs.c | 101 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 100 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index 49d699ffc2..eb91723855 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -298,9 +298,108 @@ static struct vhost_dev *vuf_get_vhost(VirtIODevice *vdev)
>      return &fs->vhost_dev;
>  }
>  
> +/**
> + * Fetch the internal state from virtiofsd and save it to `f`.
> + */
> +static int vuf_save_state(QEMUFile *f, void *pv, size_t size,
> +                          const VMStateField *field, JSONWriter *vmdesc)
> +{
> +    VirtIODevice *vdev = pv;
> +    VHostUserFS *fs = VHOST_USER_FS(vdev);
> +    Error *local_error = NULL;
> +    int ret;
> +
> +    ret = vhost_save_backend_state(&fs->vhost_dev, f, &local_error);
> +    if (ret < 0) {
> +        error_reportf_err(local_error,
> +                          "Error saving back-end state of %s device %s "
> +                          "(tag: \"%s\"): ",
> +                          vdev->name, vdev->parent_obj.canonical_path,
> +                          fs->conf.tag ?: "<none>");
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * Load virtiofsd's internal state from `f` and send it over to virtiofsd.
> + */
> +static int vuf_load_state(QEMUFile *f, void *pv, size_t size,
> +                          const VMStateField *field)
> +{
> +    VirtIODevice *vdev = pv;
> +    VHostUserFS *fs = VHOST_USER_FS(vdev);
> +    Error *local_error = NULL;
> +    int ret;
> +
> +    ret = vhost_load_backend_state(&fs->vhost_dev, f, &local_error);
> +    if (ret < 0) {
> +        error_reportf_err(local_error,
> +                          "Error loading back-end state of %s device %s "
> +                          "(tag: \"%s\"): ",
> +                          vdev->name, vdev->parent_obj.canonical_path,
> +                          fs->conf.tag ?: "<none>");
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static bool vuf_is_internal_migration(void *opaque)
> +{
> +    /* TODO: Return false when an external migration is requested */
> +    return true;
> +}
> +
> +static int vuf_check_migration_support(void *opaque)
> +{
> +    VirtIODevice *vdev = opaque;
> +    VHostUserFS *fs = VHOST_USER_FS(vdev);
> +
> +    if (!vhost_supports_device_state(&fs->vhost_dev)) {
> +        error_report("Back-end of %s device %s (tag: \"%s\") does not support "
> +                     "migration through qemu",
> +                     vdev->name, vdev->parent_obj.canonical_path,
> +                     fs->conf.tag ?: "<none>");
> +        return -ENOTSUP;
> +    }
> +
> +    return 0;
> +}
> +
> +static const VMStateDescription vuf_backend_vmstate;
> +
>  static const VMStateDescription vuf_vmstate = {
>      .name = "vhost-user-fs",
> -    .unmigratable = 1,
> +    .version_id = 0,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_VIRTIO_DEVICE,
> +        VMSTATE_END_OF_LIST()
> +    },
> +    .subsections = (const VMStateDescription * []) {
> +        &vuf_backend_vmstate,
> +        NULL,
> +    }
> +};
> +
> +static const VMStateDescription vuf_backend_vmstate = {
> +    .name = "vhost-user-fs-backend",
> +    .version_id = 0,
> +    .needed = vuf_is_internal_migration,
> +    .pre_load = vuf_check_migration_support,
> +    .pre_save = vuf_check_migration_support,
> +    .fields = (VMStateField[]) {
> +        {
> +            .name = "back-end",
> +            .info = &(const VMStateInfo) {
> +                .name = "virtio-fs back-end state",
> +                .get = vuf_load_state,
> +                .put = vuf_save_state,
> +            },
> +        },
> +        VMSTATE_END_OF_LIST()
> +    },
>  };
>  
>  static Property vuf_properties[] = {
> -- 
> 2.41.0

On Wed, Oct 04, 2023 at 02:59:03PM +0200, Hanna Czenczek wrote:
> vhost_save_backend_state() and vhost_load_backend_state() can be used by
> vhost front-ends to easily save and load the back-end's state to/from
> the migration stream.
> 
> Because we do not know the full state size ahead of time,
> vhost_save_backend_state() simply reads the data in 1 MB chunks, and
> writes each chunk consecutively into the migration stream, prefixed by
> its length.  EOF is indicated by a 0-length chunk.
> 
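
To illustrate the stream format described above (each chunk prefixed by
its be32 length, end marked by a 0-length chunk), here is a minimal
standalone sketch that encodes and decodes it in memory.  The function
names are made up for this example.  Unlike the patch, it works on a
buffer of known size, whereas vhost_save_backend_state() frames whatever
each read() from the pipe returns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append one chunk: big-endian 32-bit length, then the payload bytes */
static size_t put_chunk(uint8_t *out, const void *data, uint32_t len)
{
    out[0] = (uint8_t)(len >> 24);
    out[1] = (uint8_t)(len >> 16);
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)len;
    if (len > 0) {
        memcpy(out + 4, data, len);
    }
    return 4 + (size_t)len;
}

/*
 * Encode `state` as chunks of at most `chunk_size` bytes, followed by a
 * 0-length terminator.  The caller must provide a large enough `out`.
 * Returns the number of bytes written.
 */
static size_t encode_state(uint8_t *out, const uint8_t *state,
                           size_t state_len, uint32_t chunk_size)
{
    size_t off = 0;

    assert(chunk_size > 0);
    while (state_len > 0) {
        uint32_t n = state_len < chunk_size ? (uint32_t)state_len
                                            : chunk_size;
        off += put_chunk(out + off, state, n);
        state += n;
        state_len -= n;
    }
    off += put_chunk(out + off, NULL, 0);
    return off;
}

/*
 * Decode a chunk stream into `out`.  Returns the payload length, or -1
 * if the input is truncated or `out` is too small.
 */
static long decode_state(const uint8_t *in, size_t in_len,
                         uint8_t *out, size_t out_cap)
{
    size_t off = 0, total = 0;

    for (;;) {
        uint32_t n;

        if (in_len - off < 4) {
            return -1;      /* missing length prefix */
        }
        n = ((uint32_t)in[off] << 24) | ((uint32_t)in[off + 1] << 16) |
            ((uint32_t)in[off + 2] << 8) | (uint32_t)in[off + 3];
        off += 4;

        if (n == 0) {
            return (long)total;     /* 0-length chunk: end of state */
        }
        if (in_len - off < n || out_cap - total < n) {
            return -1;
        }
        memcpy(out + total, in + off, n);
        off += n;
        total += n;
    }
}
```

The explicit terminator is what lets the reader detect a truncated
stream, which is why the decoder fails if it runs out of input before
seeing a 0-length chunk.
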
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost.h |  35 +++++++
>  hw/virtio/vhost.c         | 204 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 239 insertions(+)
> 
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index a0d03c9fdf..100fcc874d 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -426,4 +426,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev,
>   */
>  int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
>  
> +/**
> + * vhost_save_backend_state(): High-level function to receive a vhost
> + * back-end's state, and save it in @f.  Uses
> + * `vhost_set_device_state_fd()` to get the data from the back-end, and
> + * stores it in consecutive chunks that are each prefixed by their
> + * respective length (be32).  The end is marked by a 0-length chunk.
> + *
> + * Must only be called while the device and all its vrings are stopped
> + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
> + *
> + * @dev: The vhost device from which to save the state
> + * @f: Migration stream in which to save the state
> + * @errp: Potential error message
> + *
> + * Returns 0 on success, and -errno otherwise.
> + */
> +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
> +
> +/**
> + * vhost_load_backend_state(): High-level function to load a vhost
> + * back-end's state from @f, and send it over to the back-end.  Reads
> + * the data from @f in the format used by `vhost_save_backend_state()`, and uses
> + * `vhost_set_device_state_fd()` to transfer it to the back-end.
> + *
> + * Must only be called while the device and all its vrings are stopped
> + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
> + *
> + * @dev: The vhost device to which to send the state
> + * @f: Migration stream from which to load the state
> + * @errp: Potential error message
> + *
> + * Returns 0 on success, and -errno otherwise.
> + */
> +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
> +
>  #endif
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 85e199f0aa..1465adf13a 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2133,3 +2133,207 @@ int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
>                 "vhost transport does not support migration state transfer");
>      return -ENOSYS;
>  }
> +
> +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
> +{
> +    /* Maximum chunk size in which to transfer the state */
> +    const size_t chunk_size = 1 * 1024 * 1024;
> +    g_autofree void *transfer_buf = NULL;
> +    g_autoptr(GError) g_err = NULL;
> +    int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
> +    int ret;
> +
> +    /* [0] for reading (our end), [1] for writing (back-end's end) */
> +    if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
> +        error_setg(errp, "Failed to set up state transfer pipe: %s",
> +                   g_err->message);
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    read_fd = pipe_fds[0];
> +    write_fd = pipe_fds[1];
> +
> +    /*
> +     * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped.
> +     * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist for
> +     * vhost-user, so just check that it is stopped at all.
> +     */
> +    assert(!dev->started);
> +
> +    /* Transfer ownership of write_fd to the back-end */
> +    ret = vhost_set_device_state_fd(dev,
> +                                    VHOST_TRANSFER_STATE_DIRECTION_SAVE,
> +                                    VHOST_TRANSFER_STATE_PHASE_STOPPED,
> +                                    write_fd,
> +                                    &reply_fd,
> +                                    errp);
> +    if (ret < 0) {
> +        error_prepend(errp, "Failed to initiate state transfer: ");
> +        goto fail;
> +    }
> +
> +    /* If the back-end wishes to use a different pipe, switch over */
> +    if (reply_fd >= 0) {
> +        close(read_fd);
> +        read_fd = reply_fd;
> +    }
> +
> +    transfer_buf = g_malloc(chunk_size);
> +
> +    while (true) {
> +        ssize_t read_ret;
> +
> +        read_ret = RETRY_ON_EINTR(read(read_fd, transfer_buf, chunk_size));
> +        if (read_ret < 0) {
> +            ret = -errno;
> +            error_setg_errno(errp, -ret, "Failed to receive state");
> +            goto fail;
> +        }
> +
> +        assert(read_ret <= chunk_size);
> +        qemu_put_be32(f, read_ret);
> +
> +        if (read_ret == 0) {
> +            /* EOF */
> +            break;
> +        }
> +
> +        qemu_put_buffer(f, transfer_buf, read_ret);
> +    }
> +
> +    /*
> +     * Back-end will not really care, but be clean and close our end of the pipe
> +     * before inquiring the back-end about whether transfer was successful
> +     */
> +    close(read_fd);
> +    read_fd = -1;
> +
> +    /* Also, verify that the device is still stopped */
> +    assert(!dev->started);
> +
> +    ret = vhost_check_device_state(dev, errp);
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +
> +    ret = 0;
> +fail:
> +    if (read_fd >= 0) {
> +        close(read_fd);
> +    }
> +
> +    return ret;
> +}
> +
> +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
> +{
> +    size_t transfer_buf_size = 0;
> +    g_autofree void *transfer_buf = NULL;
> +    g_autoptr(GError) g_err = NULL;
> +    int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
> +    int ret;
> +
> +    /* [0] for reading (back-end's end), [1] for writing (our end) */
> +    if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
> +        error_setg(errp, "Failed to set up state transfer pipe: %s",
> +                   g_err->message);
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    read_fd = pipe_fds[0];
> +    write_fd = pipe_fds[1];
> +
> +    /*
> +     * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped.
> +     * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist for
> +     * vhost-user, so just check that it is stopped at all.
> +     */
> +    assert(!dev->started);
> +
> +    /* Transfer ownership of read_fd to the back-end */
> +    ret = vhost_set_device_state_fd(dev,
> +                                    VHOST_TRANSFER_STATE_DIRECTION_LOAD,
> +                                    VHOST_TRANSFER_STATE_PHASE_STOPPED,
> +                                    read_fd,
> +                                    &reply_fd,
> +                                    errp);
> +    if (ret < 0) {
> +        error_prepend(errp, "Failed to initiate state transfer: ");
> +        goto fail;
> +    }
> +
> +    /* If the back-end wishes to use a different pipe, switch over */
> +    if (reply_fd >= 0) {
> +        close(write_fd);
> +        write_fd = reply_fd;
> +    }
> +
> +    while (true) {
> +        size_t this_chunk_size = qemu_get_be32(f);
> +        ssize_t write_ret;
> +        const uint8_t *transfer_pointer;
> +
> +        if (this_chunk_size == 0) {
> +            /* End of state */
> +            break;
> +        }
> +
> +        if (transfer_buf_size < this_chunk_size) {
> +            transfer_buf = g_realloc(transfer_buf, this_chunk_size);
> +            transfer_buf_size = this_chunk_size;
> +        }
> +
> +        if (qemu_get_buffer(f, transfer_buf, this_chunk_size) <
> +                this_chunk_size)
> +        {
> +            error_setg(errp, "Failed to read state");
> +            ret = -EINVAL;
> +            goto fail;
> +        }
> +
> +        transfer_pointer = transfer_buf;
> +        while (this_chunk_size > 0) {
> +            write_ret = RETRY_ON_EINTR(
> +                write(write_fd, transfer_pointer, this_chunk_size)
> +            );
> +            if (write_ret < 0) {
> +                ret = -errno;
> +                error_setg_errno(errp, -ret, "Failed to send state");
> +                goto fail;
> +            } else if (write_ret == 0) {
> +                error_setg(errp, "Failed to send state: Connection is closed");
> +                ret = -ECONNRESET;
> +                goto fail;
> +            }
> +
> +            assert(write_ret <= this_chunk_size);
> +            this_chunk_size -= write_ret;
> +            transfer_pointer += write_ret;
> +        }
> +    }
> +
> +    /*
> +     * Close our end, thus ending transfer, before inquiring the back-end about
> +     * whether transfer was successful
> +     */
> +    close(write_fd);
> +    write_fd = -1;
> +
> +    /* Also, verify that the device is still stopped */
> +    assert(!dev->started);
> +
> +    ret = vhost_check_device_state(dev, errp);
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +
> +    ret = 0;
> +fail:
> +    if (write_fd >= 0) {
> +        close(write_fd);
> +    }
> +
> +    return ret;
> +}
> -- 
> 2.41.0

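The two loops in this patch rely on ordinary pipe semantics: write() may
be short or interrupted, and the reader only sees EOF once the write end
has been closed.  A minimal standalone sketch of both sides follows,
assuming plain POSIX.  The helper names are hypothetical, and EINTR is
handled with an explicit check here rather than QEMU's RETRY_ON_EINTR().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Write all of `buf`, retrying on EINTR and continuing after short writes */
static int write_all(int fd, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -errno;
        }
        if (n == 0) {
            return -ECONNRESET;     /* no progress possible */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Read from `fd` until EOF.  `cap` must be strictly larger than the
 * data, so that EOF can be told apart from a full buffer.
 * Returns the number of bytes read, or -errno.
 */
static ssize_t read_until_eof(int fd, void *buf, size_t cap)
{
    uint8_t *p = buf;
    size_t total = 0;

    for (;;) {
        ssize_t n;

        if (total == cap) {
            return -ENOSPC;
        }
        n = read(fd, p + total, cap - total);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -errno;
        }
        if (n == 0) {
            return (ssize_t)total;  /* writer closed its end */
        }
        total += (size_t)n;
    }
}
```

This is also why vhost_load_backend_state() closes write_fd before
calling vhost_check_device_state(): the close is what signals the end of
the transfer to the reading side.
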


* Re: [Virtio-fs] [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings
  2023-10-18 12:14   ` Michael S. Tsirkin
@ 2023-10-18 16:17     ` Hanna Czenczek
  0 siblings, 0 replies; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-18 16:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Eugenio Pérez, Anton Kuchin

On 18.10.23 14:14, Michael S. Tsirkin wrote:
> On Wed, Oct 04, 2023 at 02:58:59PM +0200, Hanna Czenczek wrote:
>> Currently, the vhost-user documentation says that rings are to be
>> initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is
>> negotiated.  However, by the time of feature negotiation, all rings have
>> already been initialized, so it is not entirely clear what this means.
>>
>> At least the vhost-user-backend Rust crate's implementation interpreted
>> it to mean that whenever this feature is negotiated, all rings are to
>> be put into a disabled state, which means that every SET_FEATURES call
>> would disable all rings, effectively halting the device.  This is
>> problematic because the VHOST_F_LOG_ALL feature is also set or cleared
>> this way, which happens during migration.  Doing so should not halt the
>> device.
>>
>> Other implementations have interpreted this to mean that the device is
>> to be initialized with all rings disabled, and a subsequent SET_FEATURES
>> call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of
>> them.  Here, SET_FEATURES will never disable any ring.
>>
>> This interpretation does not suffer the problem of unintentionally
>> halting the device whenever features are set or cleared, so it seems
>> better and more reasonable.
>>
>> We can clarify this in the documentation by making it explicit that the
>> enabled/disabled state is tracked even while the vring is stopped.
>> Every vring is initialized in a disabled state, and SET_FEATURES without
>> VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all
>> vrings.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>
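
The interpretation proposed in the commit message above can be sketched
as a small state model: every vring starts disabled, SET_FEATURES
without VHOST_USER_F_PROTOCOL_FEATURES enables all vrings, and
SET_FEATURES never disables any.  The struct and function names are
hypothetical.  The two feature bit numbers are the ones I understand the
vhost-user specification to define, so treat them as assumptions of this
sketch rather than as part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VHOST_F_LOG_ALL                 26  /* assumed bit number */
#define VHOST_USER_F_PROTOCOL_FEATURES  30  /* assumed bit number */
#define EXAMPLE_NUM_VRINGS              2   /* hypothetical */

struct example_backend {
    uint64_t features;
    bool vring_enabled[EXAMPLE_NUM_VRINGS];
};

/* Every vring is initialized in a disabled state */
static void example_init(struct example_backend *b)
{
    b->features = 0;
    for (size_t i = 0; i < EXAMPLE_NUM_VRINGS; i++) {
        b->vring_enabled[i] = false;
    }
}

/* SET_FEATURES: may enable all vrings, but never disables any */
static void example_set_features(struct example_backend *b,
                                 uint64_t features)
{
    b->features = features;
    if (!(features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES))) {
        for (size_t i = 0; i < EXAMPLE_NUM_VRINGS; i++) {
            b->vring_enabled[i] = true;
        }
    }
    /* Otherwise, the enabled/disabled state is left untouched */
}

/* SET_VRING_ENABLE: the only way to change a single vring's state */
static void example_set_vring_enable(struct example_backend *b,
                                     size_t idx, bool enable)
{
    assert(idx < EXAMPLE_NUM_VRINGS);
    b->vring_enabled[idx] = enable;
}
```

With this model, toggling VHOST_F_LOG_ALL during migration through
another SET_FEATURES call leaves previously enabled vrings enabled, so
the device is not halted.
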
> OK so I am expecting v5. My advice is to move patch 1 to end of patchset
> so we can defer it if we want to.

Already sent – I’ve just dropped patch 1, since it doesn’t add anything 
to the objective of the patch series itself:

https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg04727.html

Hanna



end of thread, other threads:[~2023-10-18 16:17 UTC | newest]

Thread overview: 53+ messages
2023-10-04 12:58 [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek
2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS Hanna Czenczek
2023-10-05 17:08   ` Stefan Hajnoczi
2023-10-05 17:15     ` [Virtio-fs] (no subject) Michael S. Tsirkin
2023-10-06  7:48       ` Hanna Czenczek
2023-10-06  8:45         ` Michael S. Tsirkin
2023-10-06  9:15           ` Hanna Czenczek
2023-10-06  9:26             ` Michael S. Tsirkin
2023-10-06  9:47               ` Hanna Czenczek
2023-10-06 10:34                 ` Michael S. Tsirkin
2023-10-06 11:42                   ` Hanna Czenczek
2023-10-06 15:17                     ` Alex Bennée
2023-10-06 15:47                       ` Hanna Czenczek
2023-10-06 20:49                         ` Alex Bennée
2023-10-09  8:07                           ` Hanna Czenczek
2023-10-07  2:22                   ` Yajun Wu
2023-10-09  8:21                     ` Hanna Czenczek
2023-10-09  9:07                       ` Hanna Czenczek
2023-10-09  9:13                         ` Hanna Czenczek
2023-10-10  4:00                           ` Yajun Wu
2023-10-10  8:18                             ` Hanna Czenczek
2023-10-10 10:36                               ` Alex Bennée
2023-10-10 13:18                                 ` Hanna Czenczek
2023-10-10 14:35                                   ` Alex Bennée
2023-10-13 18:02                                     ` Hanna Czenczek
2023-10-17  7:49                                       ` Viresh Kumar
2023-10-17  8:13                                         ` Hanna Czenczek
2023-10-09 10:28                     ` German Maglione
2023-10-10  2:56                       ` Yajun Wu
2023-10-10 10:04                         ` German Maglione
2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc Hanna Czenczek
2023-10-05 17:38   ` Stefan Hajnoczi
2023-10-06  7:53     ` Hanna Czenczek
2023-10-06  8:49       ` Michael S. Tsirkin
2023-10-06 13:55         ` Hanna Czenczek
2023-10-06 13:58           ` Hanna Czenczek
2023-10-07 21:29             ` Michael S. Tsirkin
2023-10-07 21:27           ` Michael S. Tsirkin
2023-10-04 12:58 ` [Virtio-fs] [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings Hanna Czenczek
2023-10-05 17:43   ` Stefan Hajnoczi
2023-10-18 12:14   ` Michael S. Tsirkin
2023-10-18 16:17     ` Hanna Czenczek
2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 4/8] vhost-user.rst: Introduce suspended state Hanna Czenczek
2023-10-05 17:44   ` Stefan Hajnoczi
2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state Hanna Czenczek
2023-10-05 17:46   ` Stefan Hajnoczi
2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 6/8] vhost-user: Interface for migration state transfer Hanna Czenczek
2023-10-05 17:46   ` Stefan Hajnoczi
2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 7/8] vhost: Add high-level state save/load functions Hanna Czenczek
2023-10-05 17:46   ` Stefan Hajnoczi
2023-10-04 12:59 ` [Virtio-fs] [PATCH v4 8/8] vhost-user-fs: Implement internal migration Hanna Czenczek
2023-10-05 17:46   ` Stefan Hajnoczi
2023-10-05 17:48 ` [Virtio-fs] [PATCH v4 0/8] vhost-user: Back-end state migration Stefan Hajnoczi
