* [RFC 0/2] vhost-user-fs: Stateful migration
@ 2023-03-13 17:48 ` Hanna Czenczek
  0 siblings, 0 replies; 23+ messages in thread
From: Hanna Czenczek @ 2023-03-13 17:48 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	Dr . David Alan Gilbert, Juan Quintela

Hi,

Patch 1 of this RFC series adds virtio-fs-specific operations to vhost
for transferring a binary blob of back-end-internal state, and
implements those for vhost-user.

Patch 2 uses those operations to implement stateful migration for
vhost-user-fs devices, assuming the back-end (virtiofsd) supports it.

This is an RFC for multiple reasons, most notably:
- Patch 1 proposes as-yet-undiscussed changes to the vhost protocol,
  which makes it an RFC by default.
- I do not have much experience with migration or vhost (on the qemu
  side), so I hope marking this as an RFC leads to extra scrutiny from
  reviewers. O:)


Hanna Czenczek (2):
  vhost-user: Add interface for virtio-fs migration
  vhost-user-fs: Implement stateful migration

 include/hw/virtio/vhost-backend.h |   9 ++
 include/hw/virtio/vhost.h         |  68 ++++++++++++
 hw/virtio/vhost-user-fs.c         | 171 +++++++++++++++++++++++++++++-
 hw/virtio/vhost-user.c            | 138 ++++++++++++++++++++++++
 hw/virtio/vhost.c                 |  29 +++++
 5 files changed, 414 insertions(+), 1 deletion(-)

-- 
2.39.1




* [RFC 1/2] vhost-user: Add interface for virtio-fs migration
  2023-03-13 17:48 ` [Virtio-fs] " Hanna Czenczek
@ 2023-03-13 17:48   ` Hanna Czenczek
  -1 siblings, 0 replies; 23+ messages in thread
From: Hanna Czenczek @ 2023-03-13 17:48 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	Dr . David Alan Gilbert, Juan Quintela

Add a virtio-fs-specific vhost-user interface to facilitate migrating
back-end-internal state.  We plan to migrate the internal state simply
as a binary blob after the streaming phase, so all we need is a way to
transfer such a blob from and to the back-end.  We do so by using a
dedicated area of shared memory through which the blob is transferred in
chunks.

This patch adds the following vhost operations (and implements them for
vhost-user):

- FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
  to the back-end.  This area will be used to transfer state via the
  other two operations.
  (After the transfer, FS_SET_STATE_FD is invoked again, without a file
  descriptor, to detach the shared memory area.)

- FS_GET_STATE: The front-end asks the back-end to place a chunk of
  internal state into the shared memory area.

- FS_SET_STATE: The front-end puts a chunk of internal state into the
  shared memory area, and asks the back-end to fetch it.

On the source side, the back-end is expected to serialize its internal
state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
invoked the first time.  On subsequent FS_GET_STATE calls, it memcpy()s
parts of that serialized state into the shared memory area.
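
For illustration only -- this is not part of the patch, and all names
are made up -- a source-side back-end that has already serialized its
state into a flat buffer could serve FS_GET_STATE roughly like this
(assuming the usual <stdint.h>, <string.h>, and <errno.h> includes):

    /*
     * Sketch of a back-end FS_GET_STATE handler: state_buf/state_len
     * hold the pre-serialized state, shm_buf/shm_size describe the
     * area registered via FS_SET_STATE_FD.  Returns the number of
     * bytes remaining from state_offset onwards, or -errno.
     */
    static ssize_t handle_fs_get_state(const uint8_t *state_buf,
                                       size_t state_len,
                                       uint8_t *shm_buf, size_t shm_size,
                                       uint64_t state_offset, uint64_t size)
    {
        size_t remaining, chunk;

        if (state_offset > state_len || size > shm_size) {
            return -EINVAL;
        }

        remaining = state_len - state_offset;
        chunk = remaining < size ? remaining : size;
        memcpy(shm_buf, state_buf + state_offset, chunk);

        return remaining;
    }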

On the destination side, the back-end is expected to collect the state
blob over all FS_SET_STATE calls, and then deserialize and apply it once
FS_SET_STATE_FD detaches the shared memory area.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 include/hw/virtio/vhost-backend.h |   9 ++
 include/hw/virtio/vhost.h         |  68 +++++++++++++++
 hw/virtio/vhost-user.c            | 138 ++++++++++++++++++++++++++++++
 hw/virtio/vhost.c                 |  29 +++++++
 4 files changed, 244 insertions(+)

diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index ec3fbae58d..fa3bd19386 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -42,6 +42,12 @@ typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque,
 typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
 typedef int (*vhost_backend_memslots_limit)(struct vhost_dev *dev);
 
+typedef ssize_t (*vhost_fs_get_state_op)(struct vhost_dev *dev,
+                                         uint64_t state_offset, size_t size);
+typedef int (*vhost_fs_set_state_op)(struct vhost_dev *dev,
+                                     uint64_t state_offset, size_t size);
+typedef int (*vhost_fs_set_state_fd_op)(struct vhost_dev *dev, int memfd,
+                                        size_t size);
 typedef int (*vhost_net_set_backend_op)(struct vhost_dev *dev,
                                 struct vhost_vring_file *file);
 typedef int (*vhost_net_set_mtu_op)(struct vhost_dev *dev, uint16_t mtu);
@@ -138,6 +144,9 @@ typedef struct VhostOps {
     vhost_backend_init vhost_backend_init;
     vhost_backend_cleanup vhost_backend_cleanup;
     vhost_backend_memslots_limit vhost_backend_memslots_limit;
+    vhost_fs_get_state_op vhost_fs_get_state;
+    vhost_fs_set_state_op vhost_fs_set_state;
+    vhost_fs_set_state_fd_op vhost_fs_set_state_fd;
     vhost_net_set_backend_op vhost_net_set_backend;
     vhost_net_set_mtu_op vhost_net_set_mtu;
     vhost_scsi_set_endpoint_op vhost_scsi_set_endpoint;
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index a52f273347..b1ad9785dd 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -336,4 +336,72 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
                            struct vhost_inflight *inflight);
 int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
                            struct vhost_inflight *inflight);
+
+/**
+ * vhost_fs_set_state_fd(): Share memory with a virtio-fs vhost
+ * back-end for transferring internal state for the purpose of
+ * migration.  Calling this function again will have the back-end
+ * unregister (free) the previously shared memory area.
+ *
+ * @dev: The vhost device
+ * @memfd: File descriptor associated with the shared memory to share.
+ *         If negative, no memory area is shared, only releasing the
+ *         previously shared area, and announcing the end of transfer
+ *         (which, on the destination side, should lead to the
+ *         back-end deserializing and applying the received state).
+ * @size: Size of the shared memory area
+ *
+ * Returns 0 on success, and -errno on failure.
+ */
+int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size);
+
+/**
+ * vhost_fs_get_state(): Request the virtio-fs vhost back-end to place
+ * a chunk of migration state into the shared memory area negotiated
+ * through vhost_fs_set_state_fd().  May only be used for migration,
+ * and only by the source side.
+ *
+ * The back-end-internal migration state is treated as a binary blob,
+ * which is transferred in chunks to fit into the shared memory area.
+ *
+ * @dev: The vhost device
+ * @state_offset: Offset into the state blob of the first byte to be
+ *                transferred
+ * @size: Number of bytes to transfer at most; must not exceed the
+ *        size of the shared memory area
+ *
+ * On success, returns the number of bytes that remain in the full
+ * state blob from the beginning of this chunk (i.e. the full size of
+ * the blob, minus @state_offset).  When transferring the final chunk,
+ * this may be less than @size.  The shared memory will contain the
+ * requested data, starting at offset 0 into the SHM, and counting
+ * `MIN(@size, returned value)` bytes.
+ *
+ * On failure, returns -errno.
+ */
+ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
+                           uint64_t size);
+
+/**
+ * vhost_fs_set_state(): Request the virtio-fs vhost back-end to fetch
+ * a chunk of migration state from the shared memory area negotiated
+ * through vhost_fs_set_state_fd().  May only be used for migration,
+ * and only by the destination side.
+ *
+ * The back-end-internal migration state is treated as a binary blob,
+ * which is transferred in chunks to fit into the shared memory area.
+ *
+ * The front-end (i.e. the caller) must transfer the whole state to
+ * the back-end, without holes.
+ *
+ * @dev: The vhost device
+ * @state_offset: Offset into the state blob of the first byte to be
+ *                transferred
+ * @size: Length of the chunk to transfer; must not exceed the size of
+ *        the shared memory area
+ *
+ * Returns 0 on success, and -errno on failure.
+ */
+int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset,
+                       uint64_t size);
 #endif
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index e5285df4ba..7fd1fb1ed3 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -130,6 +130,9 @@ typedef enum VhostUserRequest {
     VHOST_USER_REM_MEM_REG = 38,
     VHOST_USER_SET_STATUS = 39,
     VHOST_USER_GET_STATUS = 40,
+    VHOST_USER_FS_SET_STATE_FD = 41,
+    VHOST_USER_FS_GET_STATE = 42,
+    VHOST_USER_FS_SET_STATE = 43,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -210,6 +213,15 @@ typedef struct {
     uint32_t size; /* the following payload size */
 } QEMU_PACKED VhostUserHeader;
 
+/*
+ * Request and reply payloads of VHOST_USER_FS_GET_STATE, and request
+ * payload of VHOST_USER_FS_SET_STATE.
+ */
+typedef struct VhostUserFsState {
+    uint64_t state_offset;
+    uint64_t size;
+} VhostUserFsState;
+
 typedef union {
 #define VHOST_USER_VRING_IDX_MASK   (0xff)
 #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
@@ -224,6 +236,7 @@ typedef union {
         VhostUserCryptoSession session;
         VhostUserVringArea area;
         VhostUserInflight inflight;
+        VhostUserFsState fs_state;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -2240,6 +2253,128 @@ static int vhost_user_net_set_mtu(struct vhost_dev *dev, uint16_t mtu)
     return 0;
 }
 
+static int vhost_user_fs_set_state_fd(struct vhost_dev *dev, int memfd,
+                                      size_t size)
+{
+    int ret;
+    bool reply_supported = virtio_has_feature(dev->protocol_features,
+                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_FS_SET_STATE_FD,
+            .flags = VHOST_USER_VERSION,
+            .size = sizeof(msg.payload.u64),
+        },
+        .payload.u64 = size,
+    };
+
+    if (reply_supported) {
+        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+    }
+
+    if (memfd < 0) {
+        assert(size == 0);
+        ret = vhost_user_write(dev, &msg, NULL, 0);
+    } else {
+        ret = vhost_user_write(dev, &msg, &memfd, 1);
+    }
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (reply_supported) {
+        return process_message_reply(dev, &msg);
+    }
+
+    return 0;
+}
+
+static ssize_t vhost_user_fs_get_state(struct vhost_dev *dev,
+                                       uint64_t state_offset,
+                                       size_t size)
+{
+    int ret;
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_FS_GET_STATE,
+            .flags = VHOST_USER_VERSION,
+            .size = sizeof(msg.payload.fs_state),
+        },
+        .payload.fs_state = {
+            .state_offset = state_offset,
+            .size = size,
+        },
+    };
+
+    ret = vhost_user_write(dev, &msg, NULL, 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = vhost_user_read(dev, &msg);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (msg.hdr.request != VHOST_USER_FS_GET_STATE) {
+        error_report("Received unexpected message type: "
+                     "Expected %d, received %d",
+                     VHOST_USER_FS_GET_STATE, msg.hdr.request);
+        return -EPROTO;
+    }
+
+    if (msg.hdr.size != sizeof(VhostUserFsState)) {
+        error_report("Received unexpected message length: "
+                     "Expected %" PRIu32 ", received %zu",
+                     msg.hdr.size, sizeof(VhostUserFsState));
+        return -EPROTO;
+    }
+
+    if (msg.payload.fs_state.size > SSIZE_MAX) {
+        error_report("Remaining state size returned by back end is too high: "
+                     "Expected up to %zd, reported %" PRIu64,
+                     SSIZE_MAX, msg.payload.fs_state.size);
+        return -EPROTO;
+    }
+
+    return msg.payload.fs_state.size;
+}
+
+static int vhost_user_fs_set_state(struct vhost_dev *dev,
+                                   uint64_t state_offset,
+                                   size_t size)
+{
+    int ret;
+    bool reply_supported = virtio_has_feature(dev->protocol_features,
+                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_FS_SET_STATE,
+            .flags = VHOST_USER_VERSION,
+            .size = sizeof(msg.payload.fs_state),
+        },
+        .payload.fs_state = {
+            .state_offset = state_offset,
+            .size = size,
+        },
+    };
+
+    if (reply_supported) {
+        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+    }
+
+    ret = vhost_user_write(dev, &msg, NULL, 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (reply_supported) {
+        return process_message_reply(dev, &msg);
+    }
+
+    return 0;
+}
+
 static int vhost_user_send_device_iotlb_msg(struct vhost_dev *dev,
                                             struct vhost_iotlb_msg *imsg)
 {
@@ -2716,4 +2851,7 @@ const VhostOps user_ops = {
         .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
         .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
         .vhost_dev_start = vhost_user_dev_start,
+        .vhost_fs_set_state_fd = vhost_user_fs_set_state_fd,
+        .vhost_fs_get_state = vhost_user_fs_get_state,
+        .vhost_fs_set_state = vhost_user_fs_set_state,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index a266396576..ef8252c90e 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -2075,3 +2075,32 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
 
     return -ENOSYS;
 }
+
+int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size)
+{
+    if (dev->vhost_ops->vhost_fs_set_state_fd) {
+        return dev->vhost_ops->vhost_fs_set_state_fd(dev, memfd, size);
+    }
+
+    return -ENOSYS;
+}
+
+ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
+                           uint64_t size)
+{
+    if (dev->vhost_ops->vhost_fs_get_state) {
+        return dev->vhost_ops->vhost_fs_get_state(dev, state_offset, size);
+    }
+
+    return -ENOSYS;
+}
+
+int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset,
+                       uint64_t size)
+{
+    if (dev->vhost_ops->vhost_fs_set_state) {
+        return dev->vhost_ops->vhost_fs_set_state(dev, state_offset, size);
+    }
+
+    return -ENOSYS;
+}
-- 
2.39.1




* [RFC 2/2] vhost-user-fs: Implement stateful migration
  2023-03-13 17:48 ` [Virtio-fs] " Hanna Czenczek
@ 2023-03-13 17:48   ` Hanna Czenczek
  -1 siblings, 0 replies; 23+ messages in thread
From: Hanna Czenczek @ 2023-03-13 17:48 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	Dr . David Alan Gilbert, Juan Quintela

A virtio-fs device's VM state consists of:
- the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
- the back-end's (virtiofsd's) internal state

We get/set the latter via the new vhost-user operations FS_SET_STATE_FD,
FS_GET_STATE, and FS_SET_STATE.
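
For reference, vuf_save_state() below stores the back-end state in the
migration stream as a simple length-prefixed blob:

    be64   total size of the back-end state blob
    u8[n]  the blob itself, treated as opaque by qemu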

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 hw/virtio/vhost-user-fs.c | 171 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 170 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index 83fc20e49e..df1fb02acc 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -20,8 +20,10 @@
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
 #include "qemu/error-report.h"
+#include "qemu/memfd.h"
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/vhost-user-fs.h"
+#include "migration/qemu-file-types.h"
 #include "monitor/monitor.h"
 #include "sysemu/sysemu.h"
 
@@ -298,9 +300,176 @@ static struct vhost_dev *vuf_get_vhost(VirtIODevice *vdev)
     return &fs->vhost_dev;
 }
 
+/**
+ * Fetch the internal state from the back-end (virtiofsd) and save it
+ * to `f`.
+ */
+static int vuf_save_state(QEMUFile *f, void *pv, size_t size,
+                          const VMStateField *field, JSONWriter *vmdesc)
+{
+    VirtIODevice *vdev = pv;
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+    int memfd = -1;
+    /* Size of the shared memory through which to transfer the state */
+    const size_t chunk_size = 4 * 1024 * 1024;
+    size_t state_offset;
+    ssize_t remaining;
+    void *shm_buf;
+    Error *local_err = NULL;
+    int ret, ret2;
+
+    /* Set up shared memory through which to receive the state from virtiofsd */
+    shm_buf = qemu_memfd_alloc("vhost-fs-state", chunk_size,
+                               F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW,
+                               &memfd, &local_err);
+    if (!shm_buf) {
+        error_report_err(local_err);
+        ret = -ENOMEM;
+        goto early_fail;
+    }
+
+    /* Share the SHM area with virtiofsd */
+    ret = vhost_fs_set_state_fd(&fs->vhost_dev, memfd, chunk_size);
+    if (ret < 0) {
+        goto early_fail;
+    }
+
+    /* Receive the virtiofsd state in chunks, and write them to `f` */
+    state_offset = 0;
+    do {
+        size_t this_chunk_size;
+
+        remaining = vhost_fs_get_state(&fs->vhost_dev, state_offset,
+                                       chunk_size);
+        if (remaining < 0) {
+            ret = remaining;
+            goto fail;
+        }
+
+        /* Prefix the whole state by its total length */
+        if (state_offset == 0) {
+            qemu_put_be64(f, remaining);
+        }
+
+        this_chunk_size = MIN(remaining, chunk_size);
+        qemu_put_buffer(f, shm_buf, this_chunk_size);
+        state_offset += this_chunk_size;
+    } while (remaining >= chunk_size);
+
+    ret = 0;
+fail:
+    /* Have virtiofsd close the shared memory */
+    ret2 = vhost_fs_set_state_fd(&fs->vhost_dev, -1, 0);
+    if (ret2 < 0) {
+        error_report("Failed to remove state FD from the vhost-user-fs back "
+                     "end: %s", strerror(-ret));
+        if (ret == 0) {
+            ret = ret2;
+        }
+    }
+
+early_fail:
+    if (shm_buf) {
+        qemu_memfd_free(shm_buf, chunk_size, memfd);
+    }
+
+    return ret;
+}
+
+/**
+ * Load the back-end's (virtiofsd's) internal state from `f` and send
+ * it over to that back-end.
+ */
+static int vuf_load_state(QEMUFile *f, void *pv, size_t size,
+                          const VMStateField *field)
+{
+    VirtIODevice *vdev = pv;
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+    int memfd = -1;
+    /* Size of the shared memory through which to transfer the state */
+    const size_t chunk_size = 4 * 1024 * 1024;
+    size_t state_offset;
+    uint64_t remaining;
+    void *shm_buf;
+    Error *local_err = NULL;
+    int ret, ret2;
+
+    /* The state is prefixed by its total length, read that first */
+    remaining = qemu_get_be64(f);
+
+    /* Set up shared memory through which to send the state to virtiofsd */
+    shm_buf = qemu_memfd_alloc("vhost-fs-state", chunk_size,
+                               F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW,
+                               &memfd, &local_err);
+    if (!shm_buf) {
+        error_report_err(local_err);
+        ret = -ENOMEM;
+        goto early_fail;
+    }
+
+    /* Share the SHM area with virtiofsd */
+    ret = vhost_fs_set_state_fd(&fs->vhost_dev, memfd, chunk_size);
+    if (ret < 0) {
+        goto early_fail;
+    }
+
+    /*
+     * Read the virtiofsd state in chunks from `f`, and send them over
+     * to virtiofsd
+     */
+    state_offset = 0;
+    do {
+        size_t this_chunk_size = MIN(remaining, chunk_size);
+
+        if (qemu_get_buffer(f, shm_buf, this_chunk_size) < this_chunk_size) {
+            ret = -EINVAL;
+            goto fail;
+        }
+
+        ret = vhost_fs_set_state(&fs->vhost_dev, state_offset, this_chunk_size);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        state_offset += this_chunk_size;
+        remaining -= this_chunk_size;
+    } while (remaining > 0);
+
+    ret = 0;
+fail:
+    ret2 = vhost_fs_set_state_fd(&fs->vhost_dev, -1, 0);
+    if (ret2 < 0) {
+        error_report("Failed to remove state FD from the vhost-user-fs back "
+                     "end -- perhaps it failed to deserialize/apply the state: "
+                     "%s", strerror(-ret2));
+        if (ret == 0) {
+            ret = ret2;
+        }
+    }
+
+early_fail:
+    if (shm_buf) {
+        qemu_memfd_free(shm_buf, chunk_size, memfd);
+    }
+
+    return ret;
+}
+
 static const VMStateDescription vuf_vmstate = {
     .name = "vhost-user-fs",
-    .unmigratable = 1,
+    .version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_VIRTIO_DEVICE,
+        {
+            .name = "back-end",
+            .info = &(const VMStateInfo) {
+                .name = "virtio-fs back-end state",
+                .get = vuf_load_state,
+                .put = vuf_save_state,
+            },
+        },
+        VMSTATE_END_OF_LIST()
+    },
 };
 
 static Property vuf_properties[] = {
-- 
2.39.1




* Re: [RFC 1/2] vhost-user: Add interface for virtio-fs migration
  2023-03-13 17:48   ` [Virtio-fs] " Hanna Czenczek
@ 2023-03-15 13:58     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2023-03-15 13:58 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin,
	Dr . David Alan Gilbert, Juan Quintela

On Mon, Mar 13, 2023 at 06:48:32PM +0100, Hanna Czenczek wrote:
> Add a virtio-fs-specific vhost-user interface to facilitate migrating
> back-end-internal state.  We plan to migrate the internal state simply

Luckily the interface does not need to be virtiofs-specific since it
only transfers opaque data. Any stateful device can use this for
migration. Please make it generic both at the vhost-user protocol
message level and at the QEMU vhost API level.

> as a binary blob after the streaming phase, so all we need is a way to
> transfer such a blob from and to the back-end.  We do so by using a
> dedicated area of shared memory through which the blob is transferred in
> chunks.

Keeping the migration data transfer separate from the vhost-user UNIX
domain socket is a good idea since the amount of data could be large and
may congest the UNIX domain socket. The shared memory interface solves
this.

Where I get lost is why it needs to be shared memory instead of simply
an fd? On the source, the front-end could read the fd until EOF and
transfer the opaque data. On the destination, the front-end could write
to the fd and then close it. I think that would be simpler than the
shared memory interface and could potentially support zero-copy via
splice(2) (QEMU doesn't need to look at the data being transferred!).
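
As a rough sketch (untested, and the helper name is made up), the
source side could then be as simple as:

    /* Drain the back-end-provided state fd until EOF and append the
     * data to the migration stream. */
    static int drain_device_state_fd(QEMUFile *f, int fd)
    {
        uint8_t buf[64 * 1024];
        ssize_t n;

        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            qemu_put_buffer(f, buf, n);
        }

        return n < 0 ? -errno : 0;
    }

(How the blob then gets framed inside the migration stream is a
separate question from the vhost-user interface itself.)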

Here is an outline of an fd-based interface:

- SET_DEVICE_STATE_FD: The front-end passes a file descriptor for
  transferring device state.

  The @direction argument:
  - SAVE: the back-end transfers an outgoing device state over the fd.
  - LOAD: the back-end transfers an incoming device state over the fd.

  The @phase argument:
  - STOPPED: the device is stopped.
  - PRE_COPY: reserved for future use.
  - POST_COPY: reserved for future use.

  The back-end transfers data over the fd according to @direction and
  @phase upon receiving the SET_DEVICE_STATE_FD message.
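
  For concreteness, the payload could look something like this (the
  struct name and numeric values are made up, nothing is specified
  yet); the fd itself would be passed as ancillary data alongside the
  message, as for other fd-passing vhost-user requests:

      typedef struct VhostUserDeviceState {
          uint32_t direction;  /* 0 = SAVE, 1 = LOAD */
          uint32_t phase;      /* 0 = STOPPED; PRE_COPY/POST_COPY reserved */
      } VhostUserDeviceState;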

There are loose ends like how the message interacts with the virtqueue
enabled state, what happens if multiple SET_DEVICE_STATE_FD messages are
sent, etc. I have ignored them for now.

What I wanted to mention about the fd-based interface is:

- It's just one message. The I/O activity happens via the fd and does
  not involve GET_STATE/SET_STATE messages over the vhost-user domain
  socket.

- Buffer management is up to the front-end and back-end implementations
  and a bit simpler than the shared memory interface.

Did you choose the shared memory approach because it has certain
advantages?

> 
> This patch adds the following vhost operations (and implements them for
> vhost-user):
> 
> - FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
>   to the back-end.  This area will be used to transfer state via the
>   other two operations.
>   (After the transfer FS_SET_STATE_FD detaches the shared memory area
>   again.)
>
> - FS_GET_STATE: The front-end asks the back-end to place a chunk of
>   internal state into the shared memory area.
> 
> - FS_SET_STATE: The front-end puts a chunk of internal state into the
>   shared memory area, and asks the back-end to fetch it.
>
> On the source side, the back-end is expected to serialize its internal
> state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
> invoked the first time.  On subsequent FS_GET_STATE calls, it memcpy()s
> parts of that serialized state into the shared memory area.
> 
> On the destination side, the back-end is expected to collect the state
> blob over all FS_SET_STATE calls, and then deserialize and apply it once
> FS_SET_STATE_FD detaches the shared memory area.

What is the rationale for waiting to receive the entire incoming state
before parsing it rather than parsing it in a streaming fashion? Can
this be left as an implementation detail of the vhost-user back-end so
that there's freedom in choosing either approach?

> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |   9 ++
>  include/hw/virtio/vhost.h         |  68 +++++++++++++++
>  hw/virtio/vhost-user.c            | 138 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c                 |  29 +++++++
>  4 files changed, 244 insertions(+)
> 
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index ec3fbae58d..fa3bd19386 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -42,6 +42,12 @@ typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque,
>  typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
>  typedef int (*vhost_backend_memslots_limit)(struct vhost_dev *dev);
>  
> +typedef ssize_t (*vhost_fs_get_state_op)(struct vhost_dev *dev,
> +                                         uint64_t state_offset, size_t size);
> +typedef int (*vhost_fs_set_state_op)(struct vhost_dev *dev,
> +                                     uint64_t state_offset, size_t size);
> +typedef int (*vhost_fs_set_state_fd_op)(struct vhost_dev *dev, int memfd,
> +                                        size_t size);
>  typedef int (*vhost_net_set_backend_op)(struct vhost_dev *dev,
>                                  struct vhost_vring_file *file);
>  typedef int (*vhost_net_set_mtu_op)(struct vhost_dev *dev, uint16_t mtu);
> @@ -138,6 +144,9 @@ typedef struct VhostOps {
>      vhost_backend_init vhost_backend_init;
>      vhost_backend_cleanup vhost_backend_cleanup;
>      vhost_backend_memslots_limit vhost_backend_memslots_limit;
> +    vhost_fs_get_state_op vhost_fs_get_state;
> +    vhost_fs_set_state_op vhost_fs_set_state;
> +    vhost_fs_set_state_fd_op vhost_fs_set_state_fd;
>      vhost_net_set_backend_op vhost_net_set_backend;
>      vhost_net_set_mtu_op vhost_net_set_mtu;
>      vhost_scsi_set_endpoint_op vhost_scsi_set_endpoint;
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index a52f273347..b1ad9785dd 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -336,4 +336,72 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
>                             struct vhost_inflight *inflight);
>  int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>                             struct vhost_inflight *inflight);
> +
> +/**
> + * vhost_fs_set_state_fd(): Share memory with a virtio-fs vhost
> + * back-end for transferring internal state for the purpose of
> + * migration.  Calling this function again will have the back-end
> + * unregister (free) the previously shared memory area.
> + *
> + * @dev: The vhost device
> + * @memfd: File descriptor associated with the shared memory to share.
> + *         If negative, no memory area is shared, only releasing the
> + *         previously shared area, and announcing the end of transfer
> + *         (which, on the destination side, should lead to the
> + *         back-end deserializing and applying the received state).
> + * @size: Size of the shared memory area
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size);
> +
> +/**
> + * vhost_fs_get_state(): Request the virtio-fs vhost back-end to place
> + * a chunk of migration state into the shared memory area negotiated
> + * through vhost_fs_set_state_fd().  May only be used for migration,
> + * and only by the source side.
> + *
> + * The back-end-internal migration state is treated as a binary blob,
> + * which is transferred in chunks to fit into the shared memory area.
> + *
> + * @dev: The vhost device
> + * @state_offset: Offset into the state blob of the first byte to be
> + *                transferred
> + * @size: Number of bytes to transfer at most; must not exceed the
> + *        size of the shared memory area
> + *
> + * On success, returns the number of bytes that remain in the full
> + * state blob from the beginning of this chunk (i.e. the full size of
> + * the blob, minus @state_offset).  When transferring the final chunk,
> + * this may be less than @size.  The shared memory will contain the
> + * requested data, starting at offset 0 into the SHM, and counting
> + * `MIN(@size, returned value)` bytes.
> + *
> + * On failure, returns -errno.
> + */
> +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
> +                           uint64_t size);
> +
> +/**
> + * vhost_fs_set_state(): Request the virtio-fs vhost back-end to fetch
> + * a chunk of migration state from the shared memory area negotiated
> + * through vhost_fs_set_state_fd().  May only be used for migration,
> + * and only by the destination side.
> + *
> + * The back-end-internal migration state is treated as a binary blob,
> + * which is transferred in chunks to fit into the shared memory area.
> + *
> + * The front-end (i.e. the caller) must transfer the whole state to
> + * the back-end, without holes.
> + *
> + * @vdev: the VirtIODevice structure
> + * @state_offset: Offset into the state blob of the first byte to be
> + *                transferred
> + * @size: Length of the chunk to transfer; must not exceed the size of
> + *        the shared memory area
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset,
> +                       uint64_t size);
>  #endif
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index e5285df4ba..7fd1fb1ed3 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -130,6 +130,9 @@ typedef enum VhostUserRequest {
>      VHOST_USER_REM_MEM_REG = 38,
>      VHOST_USER_SET_STATUS = 39,
>      VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_FS_SET_STATE_FD = 41,
> +    VHOST_USER_FS_GET_STATE = 42,
> +    VHOST_USER_FS_SET_STATE = 43,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -210,6 +213,15 @@ typedef struct {
>      uint32_t size; /* the following payload size */
>  } QEMU_PACKED VhostUserHeader;
>  
> +/*
> + * Request and reply payloads of VHOST_USER_FS_GET_STATE, and request
> + * payload of VHOST_USER_FS_SET_STATE.
> + */
> +typedef struct VhostUserFsState {
> +    uint64_t state_offset;
> +    uint64_t size;
> +} VhostUserFsState;
> +
>  typedef union {
>  #define VHOST_USER_VRING_IDX_MASK   (0xff)
>  #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> @@ -224,6 +236,7 @@ typedef union {
>          VhostUserCryptoSession session;
>          VhostUserVringArea area;
>          VhostUserInflight inflight;
> +        VhostUserFsState fs_state;
>  } VhostUserPayload;
>  
>  typedef struct VhostUserMsg {
> @@ -2240,6 +2253,128 @@ static int vhost_user_net_set_mtu(struct vhost_dev *dev, uint16_t mtu)
>      return 0;
>  }
>  
> +static int vhost_user_fs_set_state_fd(struct vhost_dev *dev, int memfd,
> +                                      size_t size)
> +{
> +    int ret;
> +    bool reply_supported = virtio_has_feature(dev->protocol_features,
> +                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_FS_SET_STATE_FD,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.u64),
> +        },
> +        .payload.u64 = size,
> +    };
> +
> +    if (reply_supported) {
> +        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
> +    }
> +
> +    if (memfd < 0) {
> +        assert(size == 0);
> +        ret = vhost_user_write(dev, &msg, NULL, 0);
> +    } else {
> +        ret = vhost_user_write(dev, &msg, &memfd, 1);
> +    }
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (reply_supported) {
> +        return process_message_reply(dev, &msg);
> +    }
> +
> +    return 0;
> +}
> +
> +static ssize_t vhost_user_fs_get_state(struct vhost_dev *dev,
> +                                       uint64_t state_offset,
> +                                       size_t size)
> +{
> +    int ret;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_FS_GET_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.fs_state),
> +        },
> +        .payload.fs_state = {
> +            .state_offset = state_offset,
> +            .size = size,
> +        },
> +    };
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_FS_GET_STATE) {
> +        error_report("Received unexpected message type: "
> +                     "Expected %d, received %d",
> +                     VHOST_USER_FS_GET_STATE, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(VhostUserFsState)) {
> +        error_report("Received unexpected message length: "
> +                     "Expected %" PRIu32 ", received %zu",
> +                     msg.hdr.size, sizeof(VhostUserFsState));
> +        return -EPROTO;
> +    }
> +
> +    if (msg.payload.fs_state.size > SSIZE_MAX) {
> +        error_report("Remaining state size returned by back end is too high: "
> +                     "Expected up to %zd, reported %" PRIu64,
> +                     SSIZE_MAX, msg.payload.fs_state.size);
> +        return -EPROTO;
> +    }
> +
> +    return msg.payload.fs_state.size;
> +}
> +
> +static int vhost_user_fs_set_state(struct vhost_dev *dev,
> +                                   uint64_t state_offset,
> +                                   size_t size)
> +{
> +    int ret;
> +    bool reply_supported = virtio_has_feature(dev->protocol_features,
> +                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_FS_SET_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.fs_state),
> +        },
> +        .payload.fs_state = {
> +            .state_offset = state_offset,
> +            .size = size,
> +        },
> +    };
> +
> +    if (reply_supported) {
> +        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (reply_supported) {
> +        return process_message_reply(dev, &msg);
> +    }
> +
> +    return 0;
> +}
> +
>  static int vhost_user_send_device_iotlb_msg(struct vhost_dev *dev,
>                                              struct vhost_iotlb_msg *imsg)
>  {
> @@ -2716,4 +2851,7 @@ const VhostOps user_ops = {
>          .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
>          .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
>          .vhost_dev_start = vhost_user_dev_start,
> +        .vhost_fs_set_state_fd = vhost_user_fs_set_state_fd,
> +        .vhost_fs_get_state = vhost_user_fs_get_state,
> +        .vhost_fs_set_state = vhost_user_fs_set_state,
>  };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index a266396576..ef8252c90e 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2075,3 +2075,32 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>  
>      return -ENOSYS;
>  }
> +
> +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size)
> +{
> +    if (dev->vhost_ops->vhost_fs_set_state_fd) {
> +        return dev->vhost_ops->vhost_fs_set_state_fd(dev, memfd, size);
> +    }
> +
> +    return -ENOSYS;
> +}
> +
> +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
> +                           uint64_t size)
> +{
> +    if (dev->vhost_ops->vhost_fs_get_state) {
> +        return dev->vhost_ops->vhost_fs_get_state(dev, state_offset, size);
> +    }
> +
> +    return -ENOSYS;
> +}
> +
> +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset,
> +                       uint64_t size)
> +{
> +    if (dev->vhost_ops->vhost_fs_set_state) {
> +        return dev->vhost_ops->vhost_fs_set_state(dev, state_offset, size);
> +    }
> +
> +    return -ENOSYS;
> +}
> -- 
> 2.39.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Virtio-fs] [RFC 1/2] vhost-user: Add interface for virtio-fs migration
@ 2023-03-15 13:58     ` Stefan Hajnoczi
  0 siblings, 0 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2023-03-15 13:58 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin,
	Dr . David Alan Gilbert, Juan Quintela

On Mon, Mar 13, 2023 at 06:48:32PM +0100, Hanna Czenczek wrote:
> Add a virtio-fs-specific vhost-user interface to facilitate migrating
> back-end-internal state.  We plan to migrate the internal state simply

Luckily the interface does not need to be virtiofs-specific since it
only transfers opaque data. Any stateful device can use this for
migration. Please make it generic both at the vhost-user protocol
message level and at the QEMU vhost API level.

> as a binary blob after the streaming phase, so all we need is a way to
> transfer such a blob from and to the back-end.  We do so by using a
> dedicated area of shared memory through which the blob is transferred in
> chunks.

Keeping the migration data transfer separate from the vhost-user UNIX
domain socket is a good idea since the amount of data could be large and
may congest the UNIX domain socket. The shared memory interface solves
this.

Where I get lost is why it needs to be shared memory instead of simply
an fd? On the source, the front-end could read the fd until EOF and
transfer the opaque data. On the destination, the front-end could write
to the fd and then close it. I think that would be simpler than the
shared memory interface and could potentially support zero-copy via
splice(2) (QEMU doesn't need to look at the data being transferred!).

Here is an outline of an fd-based interface:

- SET_DEVICE_STATE_FD: The front-end passes a file descriptor for
  transferring device state.

  The @direction argument:
  - SAVE: the back-end transfers an outgoing device state over the fd.
  - LOAD: the back-end transfers an incoming device state over the fd.

  The @phase argument:
  - STOPPED: the device is stopped.
  - PRE_COPY: reserved for future use.
  - POST_COPY: reserved for future use.

  The back-end transfers data over the fd according to @direction and
  @phase upon receiving the SET_DEVICE_STATE_FD message.
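
To make the shape of this concrete, a rough sketch of the payload and of
the source-side draining loop (all names and values below are made up for
illustration; none of this is defined in the vhost-user spec yet):

    /* Hypothetical payload for SET_DEVICE_STATE_FD -- illustrative only */
    typedef struct VhostUserTransferDeviceState {
        uint32_t direction; /* 0 = SAVE: back-end writes its state to the fd,
                               1 = LOAD: back-end reads its state from the fd */
        uint32_t phase;     /* 0 = STOPPED; further values reserved for
                               pre-copy/post-copy */
    } VhostUserTransferDeviceState;

    /*
     * On the source, the front-end would then simply drain the fd into the
     * migration stream until EOF, roughly:
     *
     *     while ((len = read(fd, buf, sizeof(buf))) > 0) {
     *         qemu_put_buffer(f, buf, len);
     *     }
     */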

There are loose ends like how the message interacts with the virtqueue
enabled state, what happens if multiple SET_DEVICE_STATE_FD messages are
sent, etc. I have ignored them for now.

What I wanted to mention about the fd-based interface is:

- It's just one message. The I/O activity happens via the fd and does
  not involve GET_STATE/SET_STATE messages over the vhost-user domain
  socket.

- Buffer management is up to the front-end and back-end implementations
  and a bit simpler than the shared memory interface.

Did you choose the shared memory approach because it has certain
advantages?

> 
> This patch adds the following vhost operations (and implements them for
> vhost-user):
> 
> - FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
>   to the back-end.  This area will be used to transfer state via the
>   other two operations.
>   (After the transfer FS_SET_STATE_FD detaches the shared memory area
>   again.)
>
> - FS_GET_STATE: The front-end asks the back-end to place a chunk of
>   internal state into the shared memory area.
> 
> - FS_SET_STATE: The front-end puts a chunk of internal state into the
>   shared memory area, and asks the back-end to fetch it.
>
> On the source side, the back-end is expected to serialize its internal
> state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
> invoked the first time.  On subsequent FS_GET_STATE calls, it memcpy()s
> parts of that serialized state into the shared memory area.
> 
> On the destination side, the back-end is expected to collect the state
> blob over all FS_SET_STATE calls, and then deserialize and apply it once
> FS_SET_STATE_FD detaches the shared memory area.

What is the rationale for waiting to receive the entire incoming state
before parsing it rather than parsing it in a streaming fashion? Can
this be left as an implementation detail of the vhost-user back-end so
that there's freedom in choosing either approach?

> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |   9 ++
>  include/hw/virtio/vhost.h         |  68 +++++++++++++++
>  hw/virtio/vhost-user.c            | 138 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c                 |  29 +++++++
>  4 files changed, 244 insertions(+)
> 
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index ec3fbae58d..fa3bd19386 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -42,6 +42,12 @@ typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque,
>  typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
>  typedef int (*vhost_backend_memslots_limit)(struct vhost_dev *dev);
>  
> +typedef ssize_t (*vhost_fs_get_state_op)(struct vhost_dev *dev,
> +                                         uint64_t state_offset, size_t size);
> +typedef int (*vhost_fs_set_state_op)(struct vhost_dev *dev,
> +                                     uint64_t state_offset, size_t size);
> +typedef int (*vhost_fs_set_state_fd_op)(struct vhost_dev *dev, int memfd,
> +                                        size_t size);
>  typedef int (*vhost_net_set_backend_op)(struct vhost_dev *dev,
>                                  struct vhost_vring_file *file);
>  typedef int (*vhost_net_set_mtu_op)(struct vhost_dev *dev, uint16_t mtu);
> @@ -138,6 +144,9 @@ typedef struct VhostOps {
>      vhost_backend_init vhost_backend_init;
>      vhost_backend_cleanup vhost_backend_cleanup;
>      vhost_backend_memslots_limit vhost_backend_memslots_limit;
> +    vhost_fs_get_state_op vhost_fs_get_state;
> +    vhost_fs_set_state_op vhost_fs_set_state;
> +    vhost_fs_set_state_fd_op vhost_fs_set_state_fd;
>      vhost_net_set_backend_op vhost_net_set_backend;
>      vhost_net_set_mtu_op vhost_net_set_mtu;
>      vhost_scsi_set_endpoint_op vhost_scsi_set_endpoint;
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index a52f273347..b1ad9785dd 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -336,4 +336,72 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
>                             struct vhost_inflight *inflight);
>  int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>                             struct vhost_inflight *inflight);
> +
> +/**
> + * vhost_fs_set_state_fd(): Share memory with a virtio-fs vhost
> + * back-end for transferring internal state for the purpose of
> + * migration.  Calling this function again will have the back-end
> + * unregister (free) the previously shared memory area.
> + *
> + * @dev: The vhost device
> + * @memfd: File descriptor associated with the shared memory to share.
> + *         If negative, no memory area is shared, only releasing the
> + *         previously shared area, and announcing the end of transfer
> + *         (which, on the destination side, should lead to the
> + *         back-end deserializing and applying the received state).
> + * @size: Size of the shared memory area
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size);
> +
> +/**
> + * vhost_fs_get_state(): Request the virtio-fs vhost back-end to place
> + * a chunk of migration state into the shared memory area negotiated
> + * through vhost_fs_set_state_fd().  May only be used for migration,
> + * and only by the source side.
> + *
> + * The back-end-internal migration state is treated as a binary blob,
> + * which is transferred in chunks to fit into the shared memory area.
> + *
> + * @dev: The vhost device
> + * @state_offset: Offset into the state blob of the first byte to be
> + *                transferred
> + * @size: Number of bytes to transfer at most; must not exceed the
> + *        size of the shared memory area
> + *
> + * On success, returns the number of bytes that remain in the full
> + * state blob from the beginning of this chunk (i.e. the full size of
> + * the blob, minus @state_offset).  When transferring the final chunk,
> + * this may be less than @size.  The shared memory will contain the
> + * requested data, starting at offset 0 into the SHM, and counting
> + * `MIN(@size, returned value)` bytes.
> + *
> + * On failure, returns -errno.
> + */
> +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
> +                           uint64_t size);
> +
> +/**
> + * vhost_fs_set_state(): Request the virtio-fs vhost back-end to fetch
> + * a chunk of migration state from the shared memory area negotiated
> + * through vhost_fs_set_state_fd().  May only be used for migration,
> + * and only by the destination side.
> + *
> + * The back-end-internal migration state is treated as a binary blob,
> + * which is transferred in chunks to fit into the shared memory area.
> + *
> + * The front-end (i.e. the caller) must transfer the whole state to
> + * the back-end, without holes.
> + *
> + * @dev: The vhost device
> + * @state_offset: Offset into the state blob of the first byte to be
> + *                transferred
> + * @size: Length of the chunk to transfer; must not exceed the size of
> + *        the shared memory area
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset,
> +                       uint64_t size);
>  #endif
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index e5285df4ba..7fd1fb1ed3 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -130,6 +130,9 @@ typedef enum VhostUserRequest {
>      VHOST_USER_REM_MEM_REG = 38,
>      VHOST_USER_SET_STATUS = 39,
>      VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_FS_SET_STATE_FD = 41,
> +    VHOST_USER_FS_GET_STATE = 42,
> +    VHOST_USER_FS_SET_STATE = 43,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -210,6 +213,15 @@ typedef struct {
>      uint32_t size; /* the following payload size */
>  } QEMU_PACKED VhostUserHeader;
>  
> +/*
> + * Request and reply payloads of VHOST_USER_FS_GET_STATE, and request
> + * payload of VHOST_USER_FS_SET_STATE.
> + */
> +typedef struct VhostUserFsState {
> +    uint64_t state_offset;
> +    uint64_t size;
> +} VhostUserFsState;
> +
>  typedef union {
>  #define VHOST_USER_VRING_IDX_MASK   (0xff)
>  #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> @@ -224,6 +236,7 @@ typedef union {
>          VhostUserCryptoSession session;
>          VhostUserVringArea area;
>          VhostUserInflight inflight;
> +        VhostUserFsState fs_state;
>  } VhostUserPayload;
>  
>  typedef struct VhostUserMsg {
> @@ -2240,6 +2253,128 @@ static int vhost_user_net_set_mtu(struct vhost_dev *dev, uint16_t mtu)
>      return 0;
>  }
>  
> +static int vhost_user_fs_set_state_fd(struct vhost_dev *dev, int memfd,
> +                                      size_t size)
> +{
> +    int ret;
> +    bool reply_supported = virtio_has_feature(dev->protocol_features,
> +                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_FS_SET_STATE_FD,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.u64),
> +        },
> +        .payload.u64 = size,
> +    };
> +
> +    if (reply_supported) {
> +        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
> +    }
> +
> +    if (memfd < 0) {
> +        assert(size == 0);
> +        ret = vhost_user_write(dev, &msg, NULL, 0);
> +    } else {
> +        ret = vhost_user_write(dev, &msg, &memfd, 1);
> +    }
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (reply_supported) {
> +        return process_message_reply(dev, &msg);
> +    }
> +
> +    return 0;
> +}
> +
> +static ssize_t vhost_user_fs_get_state(struct vhost_dev *dev,
> +                                       uint64_t state_offset,
> +                                       size_t size)
> +{
> +    int ret;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_FS_GET_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.fs_state),
> +        },
> +        .payload.fs_state = {
> +            .state_offset = state_offset,
> +            .size = size,
> +        },
> +    };
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_FS_GET_STATE) {
> +        error_report("Received unexpected message type: "
> +                     "Expected %d, received %d",
> +                     VHOST_USER_FS_GET_STATE, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(VhostUserFsState)) {
> +        error_report("Received unexpected message length: "
> +                     "Expected %zu, received %" PRIu32,
> +                     sizeof(VhostUserFsState), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.payload.fs_state.size > SSIZE_MAX) {
> +        error_report("Remaining state size returned by back end is too high: "
> +                     "Expected up to %zd, reported %" PRIu64,
> +                     SSIZE_MAX, msg.payload.fs_state.size);
> +        return -EPROTO;
> +    }
> +
> +    return msg.payload.fs_state.size;
> +}
> +
> +static int vhost_user_fs_set_state(struct vhost_dev *dev,
> +                                   uint64_t state_offset,
> +                                   size_t size)
> +{
> +    int ret;
> +    bool reply_supported = virtio_has_feature(dev->protocol_features,
> +                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_FS_SET_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.fs_state),
> +        },
> +        .payload.fs_state = {
> +            .state_offset = state_offset,
> +            .size = size,
> +        },
> +    };
> +
> +    if (reply_supported) {
> +        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (reply_supported) {
> +        return process_message_reply(dev, &msg);
> +    }
> +
> +    return 0;
> +}
> +
>  static int vhost_user_send_device_iotlb_msg(struct vhost_dev *dev,
>                                              struct vhost_iotlb_msg *imsg)
>  {
> @@ -2716,4 +2851,7 @@ const VhostOps user_ops = {
>          .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
>          .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
>          .vhost_dev_start = vhost_user_dev_start,
> +        .vhost_fs_set_state_fd = vhost_user_fs_set_state_fd,
> +        .vhost_fs_get_state = vhost_user_fs_get_state,
> +        .vhost_fs_set_state = vhost_user_fs_set_state,
>  };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index a266396576..ef8252c90e 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2075,3 +2075,32 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>  
>      return -ENOSYS;
>  }
> +
> +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size)
> +{
> +    if (dev->vhost_ops->vhost_fs_set_state_fd) {
> +        return dev->vhost_ops->vhost_fs_set_state_fd(dev, memfd, size);
> +    }
> +
> +    return -ENOSYS;
> +}
> +
> +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
> +                           uint64_t size)
> +{
> +    if (dev->vhost_ops->vhost_fs_get_state) {
> +        return dev->vhost_ops->vhost_fs_get_state(dev, state_offset, size);
> +    }
> +
> +    return -ENOSYS;
> +}
> +
> +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset,
> +                       uint64_t size)
> +{
> +    if (dev->vhost_ops->vhost_fs_set_state) {
> +        return dev->vhost_ops->vhost_fs_set_state(dev, state_offset, size);
> +    }
> +
> +    return -ENOSYS;
> +}
> -- 
> 2.39.1
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 1/2] vhost-user: Add interface for virtio-fs migration
  2023-03-15 13:58     ` [Virtio-fs] " Stefan Hajnoczi
@ 2023-03-15 15:55       ` Hanna Czenczek
  -1 siblings, 0 replies; 23+ messages in thread
From: Hanna Czenczek @ 2023-03-15 15:55 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin,
	Dr . David Alan Gilbert, Juan Quintela

On 15.03.23 14:58, Stefan Hajnoczi wrote:
> On Mon, Mar 13, 2023 at 06:48:32PM +0100, Hanna Czenczek wrote:
>> Add a virtio-fs-specific vhost-user interface to facilitate migrating
>> back-end-internal state.  We plan to migrate the internal state simply
> Luckily the interface does not need to be virtiofs-specific since it
> only transfers opaque data. Any stateful device can use this for
> migration. Please make it generic both at the vhost-user protocol
> message level and at the QEMU vhost API level.

OK, sure.

>> as a binary blob after the streaming phase, so all we need is a way to
>> transfer such a blob from and to the back-end.  We do so by using a
>> dedicated area of shared memory through which the blob is transferred in
>> chunks.
> Keeping the migration data transfer separate from the vhost-user UNIX
> domain socket is a good idea since the amount of data could be large and
> may congest the UNIX domain socket. The shared memory interface solves
> this.
>
> Where I get lost is why it needs to be shared memory instead of simply
> an fd? On the source, the front-end could read the fd until EOF and
> transfer the opaque data. On the destination, the front-end could write
> to the fd and then close it. I think that would be simpler than the
> shared memory interface and could potentially support zero-copy via
> splice(2) (QEMU doesn't need to look at the data being transferred!).
>
> Here is an outline of an fd-based interface:
>
> - SET_DEVICE_STATE_FD: The front-end passes a file descriptor for
>    transferring device state.
>
>    The @direction argument:
>    - SAVE: the back-end transfers an outgoing device state over the fd.
>    - LOAD: the back-end transfers an incoming device state over the fd.
>
>    The @phase argument:
>    - STOPPED: the device is stopped.
>    - PRE_COPY: reserved for future use.
>    - POST_COPY: reserved for future use.
>
>    The back-end transfers data over the fd according to @direction and
>    @phase upon receiving the SET_DEVICE_STATE_FD message.
>
> There are loose ends like how the message interacts with the virtqueue
> enabled state, what happens if multiple SET_DEVICE_STATE_FD messages are
> sent, etc. I have ignored them for now.
>
> What I wanted to mention about the fd-based interface is:
>
> - It's just one message. The I/O activity happens via the fd and does
>    not involve GET_STATE/SET_STATE messages over the vhost-user domain
>    socket.
>
> - Buffer management is up to the front-end and back-end implementations
>    and a bit simpler than the shared memory interface.
>
> Did you choose the shared memory approach because it has certain
> advantages?

I simply chose it because I didn’t think of anything else. :)

Using just an FD for a pipe-like interface sounds perfect to me.  I 
expect that to make the code simpler and, as you point out, it’s just 
better in general.  Thanks!

>> This patch adds the following vhost operations (and implements them for
>> vhost-user):
>>
>> - FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
>>    to the back-end.  This area will be used to transfer state via the
>>    other two operations.
>>    (After the transfer FS_SET_STATE_FD detaches the shared memory area
>>    again.)
>>
>> - FS_GET_STATE: The front-end asks the back-end to place a chunk of
>>    internal state into the shared memory area.
>>
>> - FS_SET_STATE: The front-end puts a chunk of internal state into the
>>    shared memory area, and asks the back-end to fetch it.
>>
>> On the source side, the back-end is expected to serialize its internal
>> state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
>> invoked the first time.  On subsequent FS_GET_STATE calls, it memcpy()s
>> parts of that serialized state into the shared memory area.
>>
>> On the destination side, the back-end is expected to collect the state
>> blob over all FS_SET_STATE calls, and then deserialize and apply it once
>> FS_SET_STATE_FD detaches the shared memory area.
> What is the rationale for waiting to receive the entire incoming state
> before parsing it rather than parsing it in a streaming fashion? Can
> this be left as an implementation detail of the vhost-user back-end so
> that there's freedom in choosing either approach?

The rationale was that when using the shared memory approach, you need 
to specify the offset into the state of the chunk that you’re currently 
transferring.  So to allow streaming, you’d need to make the front-end 
transfer the chunks in a streaming fashion, so that these offsets are 
continuously increasing.  Definitely possible, and reasonable, I just 
thought it’d be easier not to define it at this point and just state 
that decoding at the end is always safe.

When using a pipe/splicing, however, that won’t be a concern anymore, so 
yes, then we can definitely allow the back-end to decode its state while 
it’s still being received.
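
(For reference, with a plain fd the destination side in QEMU could then be
reduced to roughly the following.  This is just a sketch, assuming we keep
the length prefix and use qemu_write_full() to push each chunk into the fd:

    static int vuf_load_state_fd(QEMUFile *f, int fd)
    {
        uint8_t buf[4096];
        uint64_t remaining = qemu_get_be64(f);

        while (remaining > 0) {
            size_t this_chunk = MIN(remaining, sizeof(buf));

            if (qemu_get_buffer(f, buf, this_chunk) < this_chunk) {
                return -EINVAL;
            }
            if (qemu_write_full(fd, buf, this_chunk) != this_chunk) {
                return -errno;
            }
            remaining -= this_chunk;
        }

        close(fd); /* EOF tells the back-end that the state is complete */
        return 0;
    }

No shared memory management and no offset bookkeeping on either side.)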

Hanna



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 1/2] vhost-user: Add interface for virtio-fs migration
  2023-03-15 15:55       ` [Virtio-fs] " Hanna Czenczek
@ 2023-03-15 16:33         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2023-03-15 16:33 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Michael S . Tsirkin,
	Dr . David Alan Gilbert, Juan Quintela

On Wed, 15 Mar 2023 at 11:56, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 15.03.23 14:58, Stefan Hajnoczi wrote:
> > On Mon, Mar 13, 2023 at 06:48:32PM +0100, Hanna Czenczek wrote:
> >> Add a virtio-fs-specific vhost-user interface to facilitate migrating
> >> back-end-internal state.  We plan to migrate the internal state simply
> > Luckily the interface does not need to be virtiofs-specific since it
> > only transfers opaque data. Any stateful device can use this for
> > migration. Please make it generic both at the vhost-user protocol
> > message level and at the QEMU vhost API level.
>
> OK, sure.
>
> >> as a binary blob after the streaming phase, so all we need is a way to
> >> transfer such a blob from and to the back-end.  We do so by using a
> >> dedicated area of shared memory through which the blob is transferred in
> >> chunks.
> > Keeping the migration data transfer separate from the vhost-user UNIX
> > domain socket is a good idea since the amount of data could be large and
> > may congest the UNIX domain socket. The shared memory interface solves
> > this.
> >
> > Where I get lost is why it needs to be shared memory instead of simply
> > an fd? On the source, the front-end could read the fd until EOF and
> > transfer the opaque data. On the destination, the front-end could write
> > to the fd and then close it. I think that would be simpler than the
> > shared memory interface and could potentially support zero-copy via
> > splice(2) (QEMU doesn't need to look at the data being transferred!).
> >
> > Here is an outline of an fd-based interface:
> >
> > - SET_DEVICE_STATE_FD: The front-end passes a file descriptor for
> >    transferring device state.
> >
> >    The @direction argument:
> >    - SAVE: the back-end transfers an outgoing device state over the fd.
> >    - LOAD: the back-end transfers an incoming device state over the fd.
> >
> >    The @phase argument:
> >    - STOPPED: the device is stopped.
> >    - PRE_COPY: reserved for future use.
> >    - POST_COPY: reserved for future use.
> >
> >    The back-end transfers data over the fd according to @direction and
> >    @phase upon receiving the SET_DEVICE_STATE_FD message.
> >
> > There are loose ends like how the message interacts with the virtqueue
> > enabled state, what happens if multiple SET_DEVICE_STATE_FD messages are
> > sent, etc. I have ignored them for now.
> >
> > What I wanted to mention about the fd-based interface is:
> >
> > - It's just one message. The I/O activity happens via the fd and does
> >    not involve GET_STATE/SET_STATE messages over the vhost-user domain
> >    socket.
> >
> > - Buffer management is up to the front-end and back-end implementations
> >    and a bit simpler than the shared memory interface.
> >
> > Did you choose the shared memory approach because it has certain
> > advantages?
>
> I simply chose it because I didn’t think of anything else. :)
>
> Using just an FD for a pipe-like interface sounds perfect to me.  I
> expect that to make the code simpler and, as you point out, it’s just
> better in general.  Thanks!

The Linux VFIO Migration v2 API could be interesting to look at too:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/vfio.h#n814

It has a state machine that puts the device into
pre-copy/saving/loading/etc states.
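
(From memory, and abridged, the core of it is an enum along these lines;
the header linked above is the authoritative reference:

    enum vfio_device_mig_state {
        VFIO_DEVICE_STATE_ERROR = 0,
        VFIO_DEVICE_STATE_STOP = 1,
        VFIO_DEVICE_STATE_RUNNING = 2,
        VFIO_DEVICE_STATE_STOP_COPY = 3, /* outgoing state is read from an fd */
        VFIO_DEVICE_STATE_RESUMING = 4,  /* incoming state is written to an fd */
        VFIO_DEVICE_STATE_RUNNING_P2P = 5,
        /* newer kernels add pre-copy states on top of these */
    };

The actual state bytes are transferred through an fd there, too.)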

> > What is the rationale for waiting to receive the entire incoming state
> > before parsing it rather than parsing it in a streaming fashion? Can
> > this be left as an implementation detail of the vhost-user back-end so
> > that there's freedom in choosing either approach?
>
> The rationale was that when using the shared memory approach, you need
> to specify the offset into the state of the chunk that you’re currently
> transferring.  So to allow streaming, you’d need to make the front-end
> transfer the chunks in a streaming fashion, so that these offsets are
> continuously increasing.  Definitely possible, and reasonable, I just
> thought it’d be easier not to define it at this point and just state
> that decoding at the end is always safe.
>
> When using a pipe/splicing, however, that won’t be a concern anymore, so
> yes, then we can definitely allow the back-end to decode its state while
> it’s still being received.

I see.

Stefan


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/2] vhost-user-fs: Implement stateful migration
  2023-03-13 17:48   ` [Virtio-fs] " Hanna Czenczek
@ 2023-03-17 17:19     ` Anton Kuchin
  -1 siblings, 0 replies; 23+ messages in thread
From: Anton Kuchin @ 2023-03-17 17:19 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, Stefan Hajnoczi,
	Dr . David Alan Gilbert, Juan Quintela

On 13/03/2023 19:48, Hanna Czenczek wrote:
> A virtio-fs device's VM state consists of:
> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
> - the back-end's (virtiofsd's) internal state
>
> We get/set the latter via the new vhost-user operations FS_SET_STATE_FD,
> FS_GET_STATE, and FS_SET_STATE.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>   hw/virtio/vhost-user-fs.c | 171 +++++++++++++++++++++++++++++++++++++-
>   1 file changed, 170 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index 83fc20e49e..df1fb02acc 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -20,8 +20,10 @@
>   #include "hw/virtio/virtio-bus.h"
>   #include "hw/virtio/virtio-access.h"
>   #include "qemu/error-report.h"
> +#include "qemu/memfd.h"
>   #include "hw/virtio/vhost.h"
>   #include "hw/virtio/vhost-user-fs.h"
> +#include "migration/qemu-file-types.h"
>   #include "monitor/monitor.h"
>   #include "sysemu/sysemu.h"
>   
> @@ -298,9 +300,176 @@ static struct vhost_dev *vuf_get_vhost(VirtIODevice *vdev)
>       return &fs->vhost_dev;
>   }
>   
> +/**
> + * Fetch the internal state from the back-end (virtiofsd) and save it
> + * to `f`.
> + */
> +static int vuf_save_state(QEMUFile *f, void *pv, size_t size,
> +                          const VMStateField *field, JSONWriter *vmdesc)
> +{
> +    VirtIODevice *vdev = pv;
> +    VHostUserFS *fs = VHOST_USER_FS(vdev);
> +    int memfd = -1;
> +    /* Size of the shared memory through which to transfer the state */
> +    const size_t chunk_size = 4 * 1024 * 1024;
> +    size_t state_offset;
> +    ssize_t remaining;
> +    void *shm_buf;
> +    Error *local_err = NULL;
> +    int ret, ret2;
> +
> +    /* Set up shared memory through which to receive the state from virtiofsd */
> +    shm_buf = qemu_memfd_alloc("vhost-fs-state", chunk_size,
> +                               F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW,
> +                               &memfd, &local_err);
> +    if (!shm_buf) {
> +        error_report_err(local_err);
> +        ret = -ENOMEM;
> +        goto early_fail;
> +    }
> +
> +    /* Share the SHM area with virtiofsd */
> +    ret = vhost_fs_set_state_fd(&fs->vhost_dev, memfd, chunk_size);
> +    if (ret < 0) {
> +        goto early_fail;

Don't we need some log message here too?
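
Something along these lines would do (just a suggestion, wording up to you):

    error_report("Failed to set up state transfer memory with the "
                 "vhost-user-fs back end: %s", strerror(-ret));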

> +    }
> +
> +    /* Receive the virtiofsd state in chunks, and write them to `f` */
> +    state_offset = 0;
> +    do {
> +        size_t this_chunk_size;
> +
> +        remaining = vhost_fs_get_state(&fs->vhost_dev, state_offset,
> +                                       chunk_size);
> +        if (remaining < 0) {
> +            ret = remaining;
> +            goto fail;
> +        }
> +
> +        /* Prefix the whole state by its total length */
> +        if (state_offset == 0) {
> +            qemu_put_be64(f, remaining);
> +        }
> +
> +        this_chunk_size = MIN(remaining, chunk_size);
> +        qemu_put_buffer(f, shm_buf, this_chunk_size);
> +        state_offset += this_chunk_size;
> +    } while (remaining >= chunk_size);
> +
> +    ret = 0;
> +fail:
> +    /* Have virtiofsd close the shared memory */
> +    ret2 = vhost_fs_set_state_fd(&fs->vhost_dev, -1, 0);
> +    if (ret2 < 0) {
> +        error_report("Failed to remove state FD from the vhost-user-fs back "
> +                     "end: %s", strerror(-ret2));
> +        if (ret == 0) {
> +            ret = ret2;
> +        }
> +    }
> +
> +early_fail:
> +    if (shm_buf) {
> +        qemu_memfd_free(shm_buf, chunk_size, memfd);
> +    }
> +
> +    return ret;
> +}
> +
> +/**
> + * Load the back-end's (virtiofsd's) internal state from `f` and send
> + * it over to that back-end.
> + */
> +static int vuf_load_state(QEMUFile *f, void *pv, size_t size,
> +                          const VMStateField *field)
> +{
> +    VirtIODevice *vdev = pv;
> +    VHostUserFS *fs = VHOST_USER_FS(vdev);
> +    int memfd = -1;
> +    /* Size of the shared memory through which to transfer the state */
> +    const size_t chunk_size = 4 * 1024 * 1024;
> +    size_t state_offset;
> +    uint64_t remaining;
> +    void *shm_buf;
> +    Error *local_err = NULL;
> +    int ret, ret2;
> +
> +    /* The state is prefixed by its total length, read that first */
> +    remaining = qemu_get_be64(f);
> +
> +    /* Set up shared memory through which to send the state to virtiofsd */
> +    shm_buf = qemu_memfd_alloc("vhost-fs-state", chunk_size,
> +                               F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW,
> +                               &memfd, &local_err);
> +    if (!shm_buf) {
> +        error_report_err(local_err);
> +        ret = -ENOMEM;
> +        goto early_fail;
> +    }
> +
> +    /* Share the SHM area with virtiofsd */
> +    ret = vhost_fs_set_state_fd(&fs->vhost_dev, memfd, chunk_size);
> +    if (ret < 0) {
> +        goto early_fail;
> +    }
> +
> +    /*
> +     * Read the virtiofsd state in chunks from `f`, and send them over
> +     * to virtiofsd
> +     */
> +    state_offset = 0;
> +    do {
> +        size_t this_chunk_size = MIN(remaining, chunk_size);
> +
> +        if (qemu_get_buffer(f, shm_buf, this_chunk_size) < this_chunk_size) {
> +            ret = -EINVAL;
> +            goto fail;
> +        }
> +
> +        ret = vhost_fs_set_state(&fs->vhost_dev, state_offset, this_chunk_size);
> +        if (ret < 0) {
> +            goto fail;
> +        }
> +
> +        state_offset += this_chunk_size;
> +        remaining -= this_chunk_size;
> +    } while (remaining > 0);
> +
> +    ret = 0;
> +fail:
> +    ret2 = vhost_fs_set_state_fd(&fs->vhost_dev, -1, 0);
> +    if (ret2 < 0) {
> +        error_report("Failed to remove state FD from the vhost-user-fs back "
> +                     "end -- perhaps it failed to deserialize/apply the state: "
> +                     "%s", strerror(-ret2));
> +        if (ret == 0) {
> +            ret = ret2;
> +        }
> +    }
> +
> +early_fail:
> +    if (shm_buf) {
> +        qemu_memfd_free(shm_buf, chunk_size, memfd);
> +    }
> +
> +    return ret;
> +}
> +
>   static const VMStateDescription vuf_vmstate = {
>       .name = "vhost-user-fs",
> -    .unmigratable = 1,
> +    .version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_VIRTIO_DEVICE,
> +        {
> +            .name = "back-end",
> +            .info = &(const VMStateInfo) {
> +                .name = "virtio-fs back-end state",
> +                .get = vuf_load_state,
> +                .put = vuf_save_state,
> +            },
> +        },

I've been working on a stateless migration patch [1], and it was discussed
there that we need to keep some kind of blocker in place by default,
because orchestrators rely on the unmigratable field in the virtio-fs
vmstate to block migration.
For this purpose I've implemented a flag that selects "none" or "external"
and is checked in pre_save, so it could be extended with an "internal"
option.
We didn't come to a conclusion on whether we also need to check incoming
migration; the discussion has stalled for a while, but I'm getting back to
it now.

I would appreciate it if you had time to take a look at that discussion,
and to consider the idea proposed there of storing the internal state as a
subsection of the vmstate, so that it is an option rather than mandatory.

[1] https://patchew.org/QEMU/20230217170038.1273710-1-antonkuchin@yandex-team.ru/
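
Roughly what I mean, as a sketch only (the property name and the enum value
here are made up; the actual proposal in [1] differs in the details):

    /* Hypothetical pre_save hook gating migration on a device property */
    static int vuf_pre_save(void *opaque)
    {
        VirtIODevice *vdev = opaque;
        VHostUserFS *fs = VHOST_USER_FS(vdev);

        if (fs->migration_type == VHOST_USER_MIGRATION_TYPE_NONE) {
            error_report("Migration of vhost-user-fs devices is disabled; "
                         "set the 'migration' property to allow it");
            return -EINVAL;
        }

        return 0;
    }

An "internal" value of that property could then additionally enable the
back-end state subsection.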

> +        VMSTATE_END_OF_LIST()
> +    },
>   };
>   
>   static Property vuf_properties[] = {


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 1/2] vhost-user: Add interface for virtio-fs migration
  2023-03-13 17:48   ` [Virtio-fs] " Hanna Czenczek
@ 2023-03-17 17:39     ` Anton Kuchin
  -1 siblings, 0 replies; 23+ messages in thread
From: Anton Kuchin @ 2023-03-17 17:39 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, Stefan Hajnoczi,
	Dr . David Alan Gilbert, Juan Quintela

On 13/03/2023 19:48, Hanna Czenczek wrote:
> Add a virtio-fs-specific vhost-user interface to facilitate migrating
> back-end-internal state.  We plan to migrate the internal state simply
> as a binary blob after the streaming phase, so all we need is a way to
> transfer such a blob from and to the back-end.  We do so by using a
> dedicated area of shared memory through which the blob is transferred in
> chunks.
>
> This patch adds the following vhost operations (and implements them for
> vhost-user):
>
> - FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
>    to the back-end.  This area will be used to transfer state via the
>    other two operations.
>    (After the transfer FS_SET_STATE_FD detaches the shared memory area
>    again.)
>
> - FS_GET_STATE: The front-end asks the back-end to place a chunk of
>    internal state into the shared memory area.
>
> - FS_SET_STATE: The front-end puts a chunk of internal state into the
>    shared memory area, and asks the back-end to fetch it.
>
> On the source side, the back-end is expected to serialize its internal
> state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
> invoked the first time.  On subsequent FS_GET_STATE calls, it memcpy()s
> parts of that serialized state into the shared memory area.
>
> On the destination side, the back-end is expected to collect the state
> blob over all FS_SET_STATE calls, and then deserialize and apply it once
> FS_SET_STATE_FD detaches the shared memory area.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>   include/hw/virtio/vhost-backend.h |   9 ++
>   include/hw/virtio/vhost.h         |  68 +++++++++++++++
>   hw/virtio/vhost-user.c            | 138 ++++++++++++++++++++++++++++++
>   hw/virtio/vhost.c                 |  29 +++++++
>   4 files changed, 244 insertions(+)
>
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index ec3fbae58d..fa3bd19386 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -42,6 +42,12 @@ typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque,
>   typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
>   typedef int (*vhost_backend_memslots_limit)(struct vhost_dev *dev);
>   
> +typedef ssize_t (*vhost_fs_get_state_op)(struct vhost_dev *dev,
> +                                         uint64_t state_offset, size_t size);
> +typedef int (*vhost_fs_set_state_op)(struct vhost_dev *dev,
> +                                     uint64_t state_offset, size_t size);
> +typedef int (*vhost_fs_set_state_fd_op)(struct vhost_dev *dev, int memfd,
> +                                        size_t size);
>   typedef int (*vhost_net_set_backend_op)(struct vhost_dev *dev,
>                                   struct vhost_vring_file *file);
>   typedef int (*vhost_net_set_mtu_op)(struct vhost_dev *dev, uint16_t mtu);
> @@ -138,6 +144,9 @@ typedef struct VhostOps {
>       vhost_backend_init vhost_backend_init;
>       vhost_backend_cleanup vhost_backend_cleanup;
>       vhost_backend_memslots_limit vhost_backend_memslots_limit;
> +    vhost_fs_get_state_op vhost_fs_get_state;
> +    vhost_fs_set_state_op vhost_fs_set_state;
> +    vhost_fs_set_state_fd_op vhost_fs_set_state_fd;
>       vhost_net_set_backend_op vhost_net_set_backend;
>       vhost_net_set_mtu_op vhost_net_set_mtu;
>       vhost_scsi_set_endpoint_op vhost_scsi_set_endpoint;
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index a52f273347..b1ad9785dd 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -336,4 +336,72 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
>                              struct vhost_inflight *inflight);
>   int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>                              struct vhost_inflight *inflight);
> +
> +/**
> + * vhost_fs_set_state_fd(): Share memory with a virtio-fs vhost
> + * back-end for transferring internal state for the purpose of
> + * migration.  Calling this function again will have the back-end
> + * unregister (free) the previously shared memory area.
> + *
> + * @dev: The vhost device
> + * @memfd: File descriptor associated with the shared memory to share.
> + *         If negative, no memory area is shared, only releasing the
> + *         previously shared area, and announcing the end of transfer
> + *         (which, on the destination side, should lead to the
> + *         back-end deserializing and applying the received state).
> + * @size: Size of the shared memory area
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size);
> +
> +/**
> + * vhost_fs_get_state(): Request the virtio-fs vhost back-end to place
> + * a chunk of migration state into the shared memory area negotiated
> + * through vhost_fs_set_state_fd().  May only be used for migration,
> + * and only by the source side.
> + *
> + * The back-end-internal migration state is treated as a binary blob,
> + * which is transferred in chunks to fit into the shared memory area.
> + *
> + * @dev: The vhost device
> + * @state_offset: Offset into the state blob of the first byte to be
> + *                transferred
> + * @size: Number of bytes to transfer at most; must not exceed the
> + *        size of the shared memory area
> + *
> + * On success, returns the number of bytes that remain in the full
> + * state blob from the beginning of this chunk (i.e. the full size of
> + * the blob, minus @state_offset).  When transferring the final chunk,
> + * this may be less than @size.  The shared memory will contain the
> + * requested data, starting at offset 0 into the SHM, and counting
> + * `MIN(@size, returned value)` bytes.
> + *
> + * On failure, returns -errno.
> + */
> +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
> +                           uint64_t size);
> +
> +/**
> + * vhost_fs_set_state(): Request the virtio-fs vhost back-end to fetch
> + * a chunk of migration state from the shared memory area negotiated
> + * through vhost_fs_set_state_fd().  May only be used for migration,
> + * and only by the destination side.
> + *
> + * The back-end-internal migration state is treated as a binary blob,
> + * which is transferred in chunks to fit into the shared memory area.
> + *
> + * The front-end (i.e. the caller) must transfer the whole state to
> + * the back-end, without holes.
> + *
> + * @vdev: the VirtIODevice structure
> + * @state_offset: Offset into the state blob of the first byte to be
> + *                transferred
> + * @size: Length of the chunk to transfer; must not exceed the size of
> + *        the shared memory area
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset,
> +                       uint64_t size);
>   #endif
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index e5285df4ba..7fd1fb1ed3 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -130,6 +130,9 @@ typedef enum VhostUserRequest {
>       VHOST_USER_REM_MEM_REG = 38,
>       VHOST_USER_SET_STATUS = 39,
>       VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_FS_SET_STATE_FD = 41,
> +    VHOST_USER_FS_GET_STATE = 42,
> +    VHOST_USER_FS_SET_STATE = 43,
>       VHOST_USER_MAX
>   } VhostUserRequest;
>   
> @@ -210,6 +213,15 @@ typedef struct {
>       uint32_t size; /* the following payload size */
>   } QEMU_PACKED VhostUserHeader;
>   
> +/*
> + * Request and reply payloads of VHOST_USER_FS_GET_STATE, and request
> + * payload of VHOST_USER_FS_SET_STATE.
> + */
> +typedef struct VhostUserFsState {
> +    uint64_t state_offset;
> +    uint64_t size;
> +} VhostUserFsState;
> +
>   typedef union {
>   #define VHOST_USER_VRING_IDX_MASK   (0xff)
>   #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> @@ -224,6 +236,7 @@ typedef union {
>           VhostUserCryptoSession session;
>           VhostUserVringArea area;
>           VhostUserInflight inflight;
> +        VhostUserFsState fs_state;
>   } VhostUserPayload;
>   
>   typedef struct VhostUserMsg {
> @@ -2240,6 +2253,128 @@ static int vhost_user_net_set_mtu(struct vhost_dev *dev, uint16_t mtu)
>       return 0;
>   }
>   
> +static int vhost_user_fs_set_state_fd(struct vhost_dev *dev, int memfd,
> +                                      size_t size)
> +{
> +    int ret;
> +    bool reply_supported = virtio_has_feature(dev->protocol_features,
> +                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_FS_SET_STATE_FD,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.u64),
> +        },
> +        .payload.u64 = size,
> +    };
> +
> +    if (reply_supported) {
> +        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
> +    }
> +
> +    if (memfd < 0) {
> +        assert(size == 0);
> +        ret = vhost_user_write(dev, &msg, NULL, 0);
> +    } else {
> +        ret = vhost_user_write(dev, &msg, &memfd, 1);
> +    }
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (reply_supported) {
> +        return process_message_reply(dev, &msg);
> +    }
> +
> +    return 0;
> +}
> +
> +static ssize_t vhost_user_fs_get_state(struct vhost_dev *dev,
> +                                       uint64_t state_offset,
> +                                       size_t size)
> +{

Am I right to assume that the vrings are supposed to be stopped when
getting/setting the state?
If so, shouldn't we check that before sending the message, or do you
think it is better to have the back-end return an error in these cases?

> +    int ret;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_FS_GET_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.fs_state),
> +        },
> +        .payload.fs_state = {
> +            .state_offset = state_offset,
> +            .size = size,
> +        },
> +    };
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_FS_GET_STATE) {
> +        error_report("Received unexpected message type: "
> +                     "Expected %d, received %d",
> +                     VHOST_USER_FS_GET_STATE, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(VhostUserFsState)) {
> +        error_report("Received unexpected message length: "
> +                     "Expected %" PRIu32 ", received %zu",
> +                     msg.hdr.size, sizeof(VhostUserFsState));
> +        return -EPROTO;
> +    }
> +
> +    if (msg.payload.fs_state.size > SSIZE_MAX) {
> +        error_report("Remaining state size returned by back end is too high: "
> +                     "Expected up to %zd, reported %" PRIu64,
> +                     SSIZE_MAX, msg.payload.fs_state.size);
> +        return -EPROTO;
> +    }
> +
> +    return msg.payload.fs_state.size;
> +}
> +
> +static int vhost_user_fs_set_state(struct vhost_dev *dev,
> +                                   uint64_t state_offset,
> +                                   size_t size)
> +{
> +    int ret;
> +    bool reply_supported = virtio_has_feature(dev->protocol_features,
> +                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_FS_SET_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.fs_state),
> +        },
> +        .payload.fs_state = {
> +            .state_offset = state_offset,
> +            .size = size,
> +        },
> +    };
> +
> +    if (reply_supported) {
> +        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (reply_supported) {
> +        return process_message_reply(dev, &msg);
> +    }
> +
> +    return 0;
> +}
> +
>   static int vhost_user_send_device_iotlb_msg(struct vhost_dev *dev,
>                                               struct vhost_iotlb_msg *imsg)
>   {
> @@ -2716,4 +2851,7 @@ const VhostOps user_ops = {
>           .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
>           .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
>           .vhost_dev_start = vhost_user_dev_start,
> +        .vhost_fs_set_state_fd = vhost_user_fs_set_state_fd,
> +        .vhost_fs_get_state = vhost_user_fs_get_state,
> +        .vhost_fs_set_state = vhost_user_fs_set_state,
>   };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index a266396576..ef8252c90e 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2075,3 +2075,32 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>   
>       return -ENOSYS;
>   }
> +
> +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size)
> +{
> +    if (dev->vhost_ops->vhost_fs_set_state_fd) {
> +        return dev->vhost_ops->vhost_fs_set_state_fd(dev, memfd, size);
> +    }
> +
> +    return -ENOSYS;
> +}
> +
> +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
> +                           uint64_t size)
> +{
> +    if (dev->vhost_ops->vhost_fs_get_state) {
> +        return dev->vhost_ops->vhost_fs_get_state(dev, state_offset, size);
> +    }
> +
> +    return -ENOSYS;
> +}
> +
> +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset,
> +                       uint64_t size)
> +{
> +    if (dev->vhost_ops->vhost_fs_set_state) {
> +        return dev->vhost_ops->vhost_fs_set_state(dev, state_offset, size);
> +    }
> +
> +    return -ENOSYS;
> +}


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Virtio-fs] [RFC 2/2] vhost-user-fs: Implement stateful migration
  2023-03-17 17:19     ` [Virtio-fs] " Anton Kuchin
  (?)
@ 2023-03-17 17:52     ` Hanna Czenczek
  2023-03-17 18:37       ` Anton Kuchin
  -1 siblings, 1 reply; 23+ messages in thread
From: Hanna Czenczek @ 2023-03-17 17:52 UTC (permalink / raw)
  To: Anton Kuchin
  Cc: Juan Quintela, Michael S . Tsirkin, qemu-devel, virtio-fs,
	Stefan Hajnoczi

On 17.03.23 18:19, Anton Kuchin wrote:
> On 13/03/2023 19:48, Hanna Czenczek wrote:
>> A virtio-fs device's VM state consists of:
>> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
>> - the back-end's (virtiofsd's) internal state
>>
>> We get/set the latter via the new vhost-user operations FS_SET_STATE_FD,
>> FS_GET_STATE, and FS_SET_STATE.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   hw/virtio/vhost-user-fs.c | 171 +++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 170 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
>> index 83fc20e49e..df1fb02acc 100644
>> --- a/hw/virtio/vhost-user-fs.c
>> +++ b/hw/virtio/vhost-user-fs.c
>> @@ -20,8 +20,10 @@
>>   #include "hw/virtio/virtio-bus.h"
>>   #include "hw/virtio/virtio-access.h"
>>   #include "qemu/error-report.h"
>> +#include "qemu/memfd.h"
>>   #include "hw/virtio/vhost.h"
>>   #include "hw/virtio/vhost-user-fs.h"
>> +#include "migration/qemu-file-types.h"
>>   #include "monitor/monitor.h"
>>   #include "sysemu/sysemu.h"
>>   @@ -298,9 +300,176 @@ static struct vhost_dev 
>> *vuf_get_vhost(VirtIODevice *vdev)
>>       return &fs->vhost_dev;
>>   }
>>   +/**
>> + * Fetch the internal state from the back-end (virtiofsd) and save it
>> + * to `f`.
>> + */
>> +static int vuf_save_state(QEMUFile *f, void *pv, size_t size,
>> +                          const VMStateField *field, JSONWriter 
>> *vmdesc)
>> +{
>> +    VirtIODevice *vdev = pv;
>> +    VHostUserFS *fs = VHOST_USER_FS(vdev);
>> +    int memfd = -1;
>> +    /* Size of the shared memory through which to transfer the state */
>> +    const size_t chunk_size = 4 * 1024 * 1024;
>> +    size_t state_offset;
>> +    ssize_t remaining;
>> +    void *shm_buf;
>> +    Error *local_err = NULL;
>> +    int ret, ret2;
>> +
>> +    /* Set up shared memory through which to receive the state from 
>> virtiofsd */
>> +    shm_buf = qemu_memfd_alloc("vhost-fs-state", chunk_size,
>> +                               F_SEAL_SEAL | F_SEAL_SHRINK | 
>> F_SEAL_GROW,
>> +                               &memfd, &local_err);
>> +    if (!shm_buf) {
>> +        error_report_err(local_err);
>> +        ret = -ENOMEM;
>> +        goto early_fail;
>> +    }
>> +
>> +    /* Share the SHM area with virtiofsd */
>> +    ret = vhost_fs_set_state_fd(&fs->vhost_dev, memfd, chunk_size);
>> +    if (ret < 0) {
>> +        goto early_fail;
>
> Don't we need some log message here too?

Sure, why not.  There are other places in this patch that just return
-errno but print no error; I think they could all use a verbose error
message.
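
For instance (purely illustrative, not taken from the patch), the call
quoted above could report its failure along these lines:

    /* Share the SHM area with virtiofsd */
    ret = vhost_fs_set_state_fd(&fs->vhost_dev, memfd, chunk_size);
    if (ret < 0) {
        error_report("Failed to pass the state FD to the vhost-user-fs "
                     "back end: %s", strerror(-ret));
        goto early_fail;
    }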

>> +    }
>> +
>> +    /* Receive the virtiofsd state in chunks, and write them to `f` */
>> +    state_offset = 0;
>> +    do {
>> +        size_t this_chunk_size;
>> +
>> +        remaining = vhost_fs_get_state(&fs->vhost_dev, state_offset,
>> +                                       chunk_size);
>> +        if (remaining < 0) {
>> +            ret = remaining;
>> +            goto fail;
>> +        }
>> +
>> +        /* Prefix the whole state by its total length */
>> +        if (state_offset == 0) {
>> +            qemu_put_be64(f, remaining);
>> +        }
>> +
>> +        this_chunk_size = MIN(remaining, chunk_size);
>> +        qemu_put_buffer(f, shm_buf, this_chunk_size);
>> +        state_offset += this_chunk_size;
>> +    } while (remaining >= chunk_size);
>> +
>> +    ret = 0;
>> +fail:
>> +    /* Have virtiofsd close the shared memory */
>> +    ret2 = vhost_fs_set_state_fd(&fs->vhost_dev, -1, 0);
>> +    if (ret2 < 0) {
>> +        error_report("Failed to remove state FD from the 
>> vhost-user-fs back "
>> +                     "end: %s", strerror(-ret));
>> +        if (ret == 0) {
>> +            ret = ret2;
>> +        }
>> +    }
>> +
>> +early_fail:
>> +    if (shm_buf) {
>> +        qemu_memfd_free(shm_buf, chunk_size, memfd);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +/**
>> + * Load the back-end's (virtiofsd's) internal state from `f` and send
>> + * it over to that back-end.
>> + */
>> +static int vuf_load_state(QEMUFile *f, void *pv, size_t size,
>> +                          const VMStateField *field)
>> +{
>> +    VirtIODevice *vdev = pv;
>> +    VHostUserFS *fs = VHOST_USER_FS(vdev);
>> +    int memfd = -1;
>> +    /* Size of the shared memory through which to transfer the state */
>> +    const size_t chunk_size = 4 * 1024 * 1024;
>> +    size_t state_offset;
>> +    uint64_t remaining;
>> +    void *shm_buf;
>> +    Error *local_err = NULL;
>> +    int ret, ret2;
>> +
>> +    /* The state is prefixed by its total length, read that first */
>> +    remaining = qemu_get_be64(f);
>> +
>> +    /* Set up shared memory through which to send the state to 
>> virtiofsd */
>> +    shm_buf = qemu_memfd_alloc("vhost-fs-state", chunk_size,
>> +                               F_SEAL_SEAL | F_SEAL_SHRINK | 
>> F_SEAL_GROW,
>> +                               &memfd, &local_err);
>> +    if (!shm_buf) {
>> +        error_report_err(local_err);
>> +        ret = -ENOMEM;
>> +        goto early_fail;
>> +    }
>> +
>> +    /* Share the SHM area with virtiofsd */
>> +    ret = vhost_fs_set_state_fd(&fs->vhost_dev, memfd, chunk_size);
>> +    if (ret < 0) {
>> +        goto early_fail;
>> +    }
>> +
>> +    /*
>> +     * Read the virtiofsd state in chunks from `f`, and send them over
>> +     * to virtiofsd
>> +     */
>> +    state_offset = 0;
>> +    do {
>> +        size_t this_chunk_size = MIN(remaining, chunk_size);
>> +
>> +        if (qemu_get_buffer(f, shm_buf, this_chunk_size) < 
>> this_chunk_size) {
>> +            ret = -EINVAL;
>> +            goto fail;
>> +        }
>> +
>> +        ret = vhost_fs_set_state(&fs->vhost_dev, state_offset, 
>> this_chunk_size);
>> +        if (ret < 0) {
>> +            goto fail;
>> +        }
>> +
>> +        state_offset += this_chunk_size;
>> +        remaining -= this_chunk_size;
>> +    } while (remaining > 0);
>> +
>> +    ret = 0;
>> +fail:
>> +    ret2 = vhost_fs_set_state_fd(&fs->vhost_dev, -1, 0);
>> +    if (ret2 < 0) {
>> +        error_report("Failed to remove state FD from the 
>> vhost-user-fs back "
>> +                     "end -- perhaps it failed to deserialize/apply 
>> the state: "
>> +                     "%s", strerror(-ret2));
>> +        if (ret == 0) {
>> +            ret = ret2;
>> +        }
>> +    }
>> +
>> +early_fail:
>> +    if (shm_buf) {
>> +        qemu_memfd_free(shm_buf, chunk_size, memfd);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   static const VMStateDescription vuf_vmstate = {
>>       .name = "vhost-user-fs",
>> -    .unmigratable = 1,
>> +    .version_id = 1,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_VIRTIO_DEVICE,
>> +        {
>> +            .name = "back-end",
>> +            .info = &(const VMStateInfo) {
>> +                .name = "virtio-fs back-end state",
>> +                .get = vuf_load_state,
>> +                .put = vuf_save_state,
>> +            },
>> +        },
>
> I've been working on stateless migration patch [1] and there was 
> discussed that we
> need to keep some kind of blocker by default if orchestrators rely on 
> unmigratable
> field in virtio-fs vmstate to block the migration.
> For this purpose I've implemented flag that selects "none" or 
> "external" and is checked
> in pre_save, so it could be extended with "internal" option.
> We didn't come to conclusion if we also need to check incoming 
> migration, the discussion
> has stopped for a while but I'm going back to it now.
>
> I would appreciate if you have time to take a look at the discussion 
> and consider the idea
> proposed there to store internal state as a subsection of vmstate to 
> make it as an option
> but not mandatory.
>
> [1] 
> https://patchew.org/QEMU/20230217170038.1273710-1-antonkuchin@yandex-team.ru/

So far I’ve mostly considered these issues orthogonal.  If your 
stateless migration goes in first, then state is optional and I’ll 
adjust this series.  If stateful migration goes in first, then your 
series can simply make state optional by introducing the external 
option, no?

But maybe we could also consider making stateless migration a special 
case of stateful migration; if we had stateful migration, can’t we just 
implement stateless migration by telling virtiofsd that it should submit 
a special “I have no state” pseudo-state, i.e. by having a switch on 
virtiofsd instead?

Off the top of my head, some downsides of that approach would be (1) 
it’d need a switch on the virtiofsd side, not on the qemu side (not sure 
if that’s a downside, but a difference for sure), and (2) we’d need at 
least some support for this on the virtiofsd side, i.e. practically 
can’t come quicker than stateful migration support.

Hanna
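
To make the “I have no state” pseudo-state idea a bit more concrete,
here is a minimal, self-contained sketch of a back-end FS_GET_STATE
handler with such a switch.  The BackEnd structure,
handle_fs_get_state(), and the migrate_internal_state flag are invented
for this sketch and do not correspond to any existing virtiofsd or
libvhost-user API:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Invented stand-in for the back-end's migration context */
typedef struct BackEnd {
    bool migrate_internal_state; /* hypothetical back-end-side switch */
    const uint8_t *state;        /* serialized internal state (source) */
    uint64_t state_len;
    uint8_t *shm_buf;            /* area registered via FS_SET_STATE_FD */
} BackEnd;

/*
 * Compute the FS_GET_STATE reply: the number of state bytes remaining
 * from state_offset on (0 means there is no, or no more, state).
 */
static uint64_t handle_fs_get_state(BackEnd *be, uint64_t state_offset,
                                    uint64_t size)
{
    if (!be->migrate_internal_state || state_offset >= be->state_len) {
        return 0; /* stateless mode, or the blob has been fully sent */
    }

    uint64_t remaining = be->state_len - state_offset;
    memcpy(be->shm_buf, be->state + state_offset, MIN(size, remaining));
    return remaining;
}

With such a switch, the stateless case would show up on the qemu side
simply as a zero-length blob.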


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Virtio-fs] [RFC 1/2] vhost-user: Add interface for virtio-fs migration
  2023-03-17 17:39     ` [Virtio-fs] " Anton Kuchin
  (?)
@ 2023-03-17 17:53     ` Hanna Czenczek
  -1 siblings, 0 replies; 23+ messages in thread
From: Hanna Czenczek @ 2023-03-17 17:53 UTC (permalink / raw)
  To: Anton Kuchin
  Cc: Juan Quintela, Michael S . Tsirkin, qemu-devel, virtio-fs,
	Stefan Hajnoczi

On 17.03.23 18:39, Anton Kuchin wrote:
> On 13/03/2023 19:48, Hanna Czenczek wrote:
>> Add a virtio-fs-specific vhost-user interface to facilitate migrating
>> back-end-internal state.  We plan to migrate the internal state simply
>> as a binary blob after the streaming phase, so all we need is a way to
>> transfer such a blob from and to the back-end.  We do so by using a
>> dedicated area of shared memory through which the blob is transferred in
>> chunks.
>>
>> This patch adds the following vhost operations (and implements them for
>> vhost-user):
>>
>> - FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
>>    to the back-end.  This area will be used to transfer state via the
>>    other two operations.
>>    (After the transfer FS_SET_STATE_FD detaches the shared memory area
>>    again.)
>>
>> - FS_GET_STATE: The front-end asks the back-end to place a chunk of
>>    internal state into the shared memory area.
>>
>> - FS_SET_STATE: The front-end puts a chunk of internal state into the
>>    shared memory area, and asks the back-end to fetch it.
>>
>> On the source side, the back-end is expected to serialize its internal
>> state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
>> invoked the first time.  On subsequent FS_GET_STATE calls, it memcpy()s
>> parts of that serialized state into the shared memory area.
>>
>> On the destination side, the back-end is expected to collect the state
>> blob over all FS_SET_STATE calls, and then deserialize and apply it once
>> FS_SET_STATE_FD detaches the shared memory area.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   include/hw/virtio/vhost-backend.h |   9 ++
>>   include/hw/virtio/vhost.h         |  68 +++++++++++++++
>>   hw/virtio/vhost-user.c            | 138 ++++++++++++++++++++++++++++++
>>   hw/virtio/vhost.c                 |  29 +++++++
>>   4 files changed, 244 insertions(+)
>>

[...]

>> +static ssize_t vhost_user_fs_get_state(struct vhost_dev *dev,
>> +                                       uint64_t state_offset,
>> +                                       size_t size)
>> +{
>
> Am I right to assume that the vrings are supposed to be stopped when
> getting/setting the state?
> If so, shouldn't we check that before sending the message, or do you
> think it is better to have the back-end return an error in these cases?

Good point!  Yes, the vrings should be stopped.  I don’t verify this 
currently, but it makes sense to do so both on the qemu and the 
virtiofsd side.

Hanna
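
A minimal sketch of such a check on the qemu side, assuming that
vhost_dev.started is an adequate indicator for “the back-end may still
be processing the rings” (this is not part of the posted patch):

ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
                           uint64_t size)
{
    if (dev->started) {
        /* Refuse to transfer state while the vrings may still change it */
        return -EBUSY;
    }

    if (dev->vhost_ops->vhost_fs_get_state) {
        return dev->vhost_ops->vhost_fs_get_state(dev, state_offset, size);
    }

    return -ENOSYS;
}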


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Virtio-fs] [RFC 2/2] vhost-user-fs: Implement stateful migration
  2023-03-17 17:52     ` Hanna Czenczek
@ 2023-03-17 18:37       ` Anton Kuchin
  2023-03-20  9:33         ` Hanna Czenczek
  0 siblings, 1 reply; 23+ messages in thread
From: Anton Kuchin @ 2023-03-17 18:37 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Juan Quintela, Michael S . Tsirkin, qemu-devel, virtio-fs,
	Stefan Hajnoczi

On 17/03/2023 19:52, Hanna Czenczek wrote:
> On 17.03.23 18:19, Anton Kuchin wrote:
>> On 13/03/2023 19:48, Hanna Czenczek wrote:
>>> A virtio-fs device's VM state consists of:
>>> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
>>> - the back-end's (virtiofsd's) internal state
>>>
>>> We get/set the latter via the new vhost-user operations 
>>> FS_SET_STATE_FD,
>>> FS_GET_STATE, and FS_SET_STATE.
>>>
>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>> ---
>>>   hw/virtio/vhost-user-fs.c | 171 
>>> +++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 170 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
>>> index 83fc20e49e..df1fb02acc 100644
>>> --- a/hw/virtio/vhost-user-fs.c
>>> +++ b/hw/virtio/vhost-user-fs.c
>>> @@ -20,8 +20,10 @@
>>>   #include "hw/virtio/virtio-bus.h"
>>>   #include "hw/virtio/virtio-access.h"
>>>   #include "qemu/error-report.h"
>>> +#include "qemu/memfd.h"
>>>   #include "hw/virtio/vhost.h"
>>>   #include "hw/virtio/vhost-user-fs.h"
>>> +#include "migration/qemu-file-types.h"
>>>   #include "monitor/monitor.h"
>>>   #include "sysemu/sysemu.h"
>>>   @@ -298,9 +300,176 @@ static struct vhost_dev 
>>> *vuf_get_vhost(VirtIODevice *vdev)
>>>       return &fs->vhost_dev;
>>>   }
>>>   +/**
>>> + * Fetch the internal state from the back-end (virtiofsd) and save it
>>> + * to `f`.
>>> + */
>>> +static int vuf_save_state(QEMUFile *f, void *pv, size_t size,
>>> +                          const VMStateField *field, JSONWriter 
>>> *vmdesc)
>>> +{
>>> +    VirtIODevice *vdev = pv;
>>> +    VHostUserFS *fs = VHOST_USER_FS(vdev);
>>> +    int memfd = -1;
>>> +    /* Size of the shared memory through which to transfer the 
>>> state */
>>> +    const size_t chunk_size = 4 * 1024 * 1024;
>>> +    size_t state_offset;
>>> +    ssize_t remaining;
>>> +    void *shm_buf;
>>> +    Error *local_err = NULL;
>>> +    int ret, ret2;
>>> +
>>> +    /* Set up shared memory through which to receive the state from 
>>> virtiofsd */
>>> +    shm_buf = qemu_memfd_alloc("vhost-fs-state", chunk_size,
>>> +                               F_SEAL_SEAL | F_SEAL_SHRINK | 
>>> F_SEAL_GROW,
>>> +                               &memfd, &local_err);
>>> +    if (!shm_buf) {
>>> +        error_report_err(local_err);
>>> +        ret = -ENOMEM;
>>> +        goto early_fail;
>>> +    }
>>> +
>>> +    /* Share the SHM area with virtiofsd */
>>> +    ret = vhost_fs_set_state_fd(&fs->vhost_dev, memfd, chunk_size);
>>> +    if (ret < 0) {
>>> +        goto early_fail;
>>
>> Don't we need some log message here too?
>
> Sure, why not.  There are other places in this patch that just return 
> -errno but print no error, I think they could all use a verbose error 
> message.
>
>>> +    }
>>> +
>>> +    /* Receive the virtiofsd state in chunks, and write them to `f` */
>>> +    state_offset = 0;
>>> +    do {
>>> +        size_t this_chunk_size;
>>> +
>>> +        remaining = vhost_fs_get_state(&fs->vhost_dev, state_offset,
>>> +                                       chunk_size);
>>> +        if (remaining < 0) {
>>> +            ret = remaining;
>>> +            goto fail;
>>> +        }
>>> +
>>> +        /* Prefix the whole state by its total length */
>>> +        if (state_offset == 0) {
>>> +            qemu_put_be64(f, remaining);
>>> +        }
>>> +
>>> +        this_chunk_size = MIN(remaining, chunk_size);
>>> +        qemu_put_buffer(f, shm_buf, this_chunk_size);
>>> +        state_offset += this_chunk_size;
>>> +    } while (remaining >= chunk_size);
>>> +
>>> +    ret = 0;
>>> +fail:
>>> +    /* Have virtiofsd close the shared memory */
>>> +    ret2 = vhost_fs_set_state_fd(&fs->vhost_dev, -1, 0);
>>> +    if (ret2 < 0) {
>>> +        error_report("Failed to remove state FD from the 
>>> vhost-user-fs back "
>>> +                     "end: %s", strerror(-ret));
>>> +        if (ret == 0) {
>>> +            ret = ret2;
>>> +        }
>>> +    }
>>> +
>>> +early_fail:
>>> +    if (shm_buf) {
>>> +        qemu_memfd_free(shm_buf, chunk_size, memfd);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +/**
>>> + * Load the back-end's (virtiofsd's) internal state from `f` and send
>>> + * it over to that back-end.
>>> + */
>>> +static int vuf_load_state(QEMUFile *f, void *pv, size_t size,
>>> +                          const VMStateField *field)
>>> +{
>>> +    VirtIODevice *vdev = pv;
>>> +    VHostUserFS *fs = VHOST_USER_FS(vdev);
>>> +    int memfd = -1;
>>> +    /* Size of the shared memory through which to transfer the 
>>> state */
>>> +    const size_t chunk_size = 4 * 1024 * 1024;
>>> +    size_t state_offset;
>>> +    uint64_t remaining;
>>> +    void *shm_buf;
>>> +    Error *local_err = NULL;
>>> +    int ret, ret2;
>>> +
>>> +    /* The state is prefixed by its total length, read that first */
>>> +    remaining = qemu_get_be64(f);
>>> +
>>> +    /* Set up shared memory through which to send the state to 
>>> virtiofsd */
>>> +    shm_buf = qemu_memfd_alloc("vhost-fs-state", chunk_size,
>>> +                               F_SEAL_SEAL | F_SEAL_SHRINK | 
>>> F_SEAL_GROW,
>>> +                               &memfd, &local_err);
>>> +    if (!shm_buf) {
>>> +        error_report_err(local_err);
>>> +        ret = -ENOMEM;
>>> +        goto early_fail;
>>> +    }
>>> +
>>> +    /* Share the SHM area with virtiofsd */
>>> +    ret = vhost_fs_set_state_fd(&fs->vhost_dev, memfd, chunk_size);
>>> +    if (ret < 0) {
>>> +        goto early_fail;
>>> +    }
>>> +
>>> +    /*
>>> +     * Read the virtiofsd state in chunks from `f`, and send them over
>>> +     * to virtiofsd
>>> +     */
>>> +    state_offset = 0;
>>> +    do {
>>> +        size_t this_chunk_size = MIN(remaining, chunk_size);
>>> +
>>> +        if (qemu_get_buffer(f, shm_buf, this_chunk_size) < 
>>> this_chunk_size) {
>>> +            ret = -EINVAL;
>>> +            goto fail;
>>> +        }
>>> +
>>> +        ret = vhost_fs_set_state(&fs->vhost_dev, state_offset, 
>>> this_chunk_size);
>>> +        if (ret < 0) {
>>> +            goto fail;
>>> +        }
>>> +
>>> +        state_offset += this_chunk_size;
>>> +        remaining -= this_chunk_size;
>>> +    } while (remaining > 0);
>>> +
>>> +    ret = 0;
>>> +fail:
>>> +    ret2 = vhost_fs_set_state_fd(&fs->vhost_dev, -1, 0);
>>> +    if (ret2 < 0) {
>>> +        error_report("Failed to remove state FD from the 
>>> vhost-user-fs back "
>>> +                     "end -- perhaps it failed to deserialize/apply 
>>> the state: "
>>> +                     "%s", strerror(-ret2));
>>> +        if (ret == 0) {
>>> +            ret = ret2;
>>> +        }
>>> +    }
>>> +
>>> +early_fail:
>>> +    if (shm_buf) {
>>> +        qemu_memfd_free(shm_buf, chunk_size, memfd);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>   static const VMStateDescription vuf_vmstate = {
>>>       .name = "vhost-user-fs",
>>> -    .unmigratable = 1,
>>> +    .version_id = 1,
>>> +    .fields = (VMStateField[]) {
>>> +        VMSTATE_VIRTIO_DEVICE,
>>> +        {
>>> +            .name = "back-end",
>>> +            .info = &(const VMStateInfo) {
>>> +                .name = "virtio-fs back-end state",
>>> +                .get = vuf_load_state,
>>> +                .put = vuf_save_state,
>>> +            },
>>> +        },
>>
>> I've been working on stateless migration patch [1] and there was 
>> discussed that we
>> need to keep some kind of blocker by default if orchestrators rely on 
>> unmigratable
>> field in virtio-fs vmstate to block the migration.
>> For this purpose I've implemented flag that selects "none" or 
>> "external" and is checked
>> in pre_save, so it could be extended with "internal" option.
>> We didn't come to conclusion if we also need to check incoming 
>> migration, the discussion
>> has stopped for a while but I'm going back to it now.
>>
>> I would appreciate if you have time to take a look at the discussion 
>> and consider the idea
>> proposed there to store internal state as a subsection of vmstate to 
>> make it as an option
>> but not mandatory.
>>
>> [1] 
>> https://patchew.org/QEMU/20230217170038.1273710-1-antonkuchin@yandex-team.ru/
>
> So far I’ve mostly considered these issues orthogonal.  If your 
> stateless migration goes in first, then state is optional and I’ll 
> adjust this series.
> If stateful migration goes in first, then your series can simply make 
> state optional by introducing the external option, no?

Not really. The state can easily be extended by subsections, but not
trimmed. Maybe this can be worked around by defining two types of
vmstate and selecting the correct one at migration time, but I'm not
sure.

>
> But maybe we could also consider making stateless migration a special 
> case of stateful migration; if we had stateful migration, can’t we 
> just implement stateless migration by telling virtiofsd that it should 
> submit a special “I have no state” pseudo-state, i.e. by having a 
> switch on virtiofsd instead?

Sure. The back-end can send an empty state (your patch treats a length
of 0 as a valid response rather than an error) or a dummy state that can
be recognized as stateless. The only potential problem is that the
back-end then needs to support the new commands even just to return a
dummy state, and if a back-end can support both types, we'll need some
switch in the back-end to decide whether to reply with the real or the
empty state.

>
> Off the top of my head, some downsides of that approach would be
> (1) it’d need a switch on the virtiofsd side, not on the qemu side 
> (not sure if that’s a downside, but a difference for sure),

Why would you? It seems to me that this only affects how qemu treats
the device's vmstate. If the state is requested, the back-end sends it
to qemu. If the state subsection is present in the stream, qemu sends
it to the back-end for loading. The stateless case simply doesn't
request any state from the back-end. Or am I missing something?

> and (2) we’d need at least some support for this on the virtiofsd 
> side, i.e. practically can’t come quicker than stateful migration 
> support.

Not much; essentially this is just a reconnect. I've sent a draft of a
reconnect patch for the old C virtiofsd; the Rust version is taking
much longer because I'm still learning Rust and not really good at it
yet.

>
> Hanna
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Virtio-fs] [RFC 2/2] vhost-user-fs: Implement stateful migration
  2023-03-17 18:37       ` Anton Kuchin
@ 2023-03-20  9:33         ` Hanna Czenczek
  2023-03-20 12:39           ` Anton Kuchin
  0 siblings, 1 reply; 23+ messages in thread
From: Hanna Czenczek @ 2023-03-20  9:33 UTC (permalink / raw)
  To: Anton Kuchin
  Cc: virtio-fs, Michael S . Tsirkin, qemu-devel, Stefan Hajnoczi,
	Juan Quintela

On 17.03.23 19:37, Anton Kuchin wrote:
> On 17/03/2023 19:52, Hanna Czenczek wrote:
>> On 17.03.23 18:19, Anton Kuchin wrote:
>>> On 13/03/2023 19:48, Hanna Czenczek wrote:
>>>> A virtio-fs device's VM state consists of:
>>>> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
>>>> - the back-end's (virtiofsd's) internal state
>>>>
>>>> We get/set the latter via the new vhost-user operations 
>>>> FS_SET_STATE_FD,
>>>> FS_GET_STATE, and FS_SET_STATE.
>>>>

[...]

>>>>   static const VMStateDescription vuf_vmstate = {
>>>>       .name = "vhost-user-fs",
>>>> -    .unmigratable = 1,
>>>> +    .version_id = 1,
>>>> +    .fields = (VMStateField[]) {
>>>> +        VMSTATE_VIRTIO_DEVICE,
>>>> +        {
>>>> +            .name = "back-end",
>>>> +            .info = &(const VMStateInfo) {
>>>> +                .name = "virtio-fs back-end state",
>>>> +                .get = vuf_load_state,
>>>> +                .put = vuf_save_state,
>>>> +            },
>>>> +        },
>>>
>>> I've been working on stateless migration patch [1] and there was 
>>> discussed that we
>>> need to keep some kind of blocker by default if orchestrators rely 
>>> on unmigratable
>>> field in virtio-fs vmstate to block the migration.
>>> For this purpose I've implemented flag that selects "none" or 
>>> "external" and is checked
>>> in pre_save, so it could be extended with "internal" option.
>>> We didn't come to conclusion if we also need to check incoming 
>>> migration, the discussion
>>> has stopped for a while but I'm going back to it now.
>>>
>>> I would appreciate if you have time to take a look at the discussion 
>>> and consider the idea
>>> proposed there to store internal state as a subsection of vmstate to 
>>> make it as an option
>>> but not mandatory.
>>>
>>> [1] 
>>> https://patchew.org/QEMU/20230217170038.1273710-1-antonkuchin@yandex-team.ru/
>>
>> So far I’ve mostly considered these issues orthogonal.  If your 
>> stateless migration goes in first, then state is optional and I’ll 
>> adjust this series.
>> If stateful migration goes in first, then your series can simply make 
>> state optional by introducing the external option, no?
>
> Not really. State can be easily extended by subsections but not 
> trimmed. Maybe this can be worked around by defining two types of 
> vmstate and selecting the correct one at migration, but I'm not sure.

I thought your patch included a switch on the vhost-user-fs device (on 
the qemu side) to tell qemu what migration data to expect.  Can we not 
transfer a 0-length state for 'external', and assert this on the 
destination side?
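
As a sketch of that idea on the destination side (again assuming a
hypothetical migration_type property), vuf_load_state() could start
with something like:

    /* The state is prefixed by its total length, read that first */
    remaining = qemu_get_be64(f);

    if (fs->migration_type == VHOST_USER_FS_MIGRATION_EXTERNAL) {
        if (remaining != 0) {
            error_report("Back-end state present in the migration stream, "
                         "but this device is configured for external "
                         "(stateless) migration");
            return -EINVAL;
        }
        /* Nothing to forward to the back-end */
        return 0;
    }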

>>
>> But maybe we could also consider making stateless migration a special 
>> case of stateful migration; if we had stateful migration, can’t we 
>> just implement stateless migration by telling virtiofsd that it 
>> should submit a special “I have no state” pseudo-state, i.e. by 
>> having a switch on virtiofsd instead?
>
> Sure. Backend can send empty state (as your patch treats 0 length as a 
> valid response and not error) or dummy state that can be recognized as 
> stateless. The only potential problem is that then we need support in 
> backend for new commands even to return dummy state, and if backend 
> can support both types then we'll need some switch in backend to reply 
> with real or empty state.

Yes, exactly.

>>
>> Off the top of my head, some downsides of that approach would be
>> (1) it’d need a switch on the virtiofsd side, not on the qemu side 
>> (not sure if that’s a downside, but a difference for sure),
>
> Why would you? It seems to me that this affects only how qemu treats 
> the vmstate of device. If the state was requested backend sends it to 
> qemu. If state subsection is present in stream qemu sends it to the 
> backend for loading. Stateless one just doesn't request state from the 
> backend. Or am I missing something?
>
>> and (2) we’d need at least some support for this on the virtiofsd 
>> side, i.e. practically can’t come quicker than stateful migration 
>> support.
>
> Not much, essentially this is just a reconnect. I've sent a draft of a 
> reconnect patch for old C-virtiofsd, for rust version it takes much 
> longer because I'm learning rust and I'm not really good at it yet.

I meant these two downsides not for your proposal, but instead if we 
implemented stateless migration only in the back-end; i.e. the front-end 
would only implement stateful migration, and the back-end would send and 
accept an empty state.

Then, qemu would always request a “state” (even if it turns out empty 
for stateless migration), and qemu would always treat it the same, so 
there wouldn’t be a switch on the qemu side, but only on the virtiofsd 
side.  Doesn’t really matter, but what does matter is that we’d need to 
implement the migration interface in virtiofsd, even if in the end no 
state is transferred.

So exactly what you’ve said above (“The only potential problem is […]”). :)

Hanna


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Virtio-fs] [RFC 2/2] vhost-user-fs: Implement stateful migration
  2023-03-20  9:33         ` Hanna Czenczek
@ 2023-03-20 12:39           ` Anton Kuchin
  2023-03-21 17:49             ` Hanna Czenczek
  0 siblings, 1 reply; 23+ messages in thread
From: Anton Kuchin @ 2023-03-20 12:39 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: virtio-fs, Michael S . Tsirkin, qemu-devel, Stefan Hajnoczi,
	Juan Quintela

On 20/03/2023 11:33, Hanna Czenczek wrote:
> On 17.03.23 19:37, Anton Kuchin wrote:
>> On 17/03/2023 19:52, Hanna Czenczek wrote:
>>> On 17.03.23 18:19, Anton Kuchin wrote:
>>>> On 13/03/2023 19:48, Hanna Czenczek wrote:
>>>>> A virtio-fs device's VM state consists of:
>>>>> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
>>>>> - the back-end's (virtiofsd's) internal state
>>>>>
>>>>> We get/set the latter via the new vhost-user operations 
>>>>> FS_SET_STATE_FD,
>>>>> FS_GET_STATE, and FS_SET_STATE.
>>>>>
>
> [...]
>
>>>>>   static const VMStateDescription vuf_vmstate = {
>>>>>       .name = "vhost-user-fs",
>>>>> -    .unmigratable = 1,
>>>>> +    .version_id = 1,
>>>>> +    .fields = (VMStateField[]) {
>>>>> +        VMSTATE_VIRTIO_DEVICE,
>>>>> +        {
>>>>> +            .name = "back-end",
>>>>> +            .info = &(const VMStateInfo) {
>>>>> +                .name = "virtio-fs back-end state",
>>>>> +                .get = vuf_load_state,
>>>>> +                .put = vuf_save_state,
>>>>> +            },
>>>>> +        },
>>>>
>>>> I've been working on stateless migration patch [1] and there was 
>>>> discussed that we
>>>> need to keep some kind of blocker by default if orchestrators rely 
>>>> on unmigratable
>>>> field in virtio-fs vmstate to block the migration.
>>>> For this purpose I've implemented flag that selects "none" or 
>>>> "external" and is checked
>>>> in pre_save, so it could be extended with "internal" option.
>>>> We didn't come to conclusion if we also need to check incoming 
>>>> migration, the discussion
>>>> has stopped for a while but I'm going back to it now.
>>>>
>>>> I would appreciate if you have time to take a look at the 
>>>> discussion and consider the idea
>>>> proposed there to store internal state as a subsection of vmstate 
>>>> to make it as an option
>>>> but not mandatory.
>>>>
>>>> [1] 
>>>> https://patchew.org/QEMU/20230217170038.1273710-1-antonkuchin@yandex-team.ru/
>>>
>>> So far I’ve mostly considered these issues orthogonal.  If your 
>>> stateless migration goes in first, then state is optional and I’ll 
>>> adjust this series.
>>> If stateful migration goes in first, then your series can simply 
>>> make state optional by introducing the external option, no?
>>
>> Not really. State can be easily extended by subsections but not 
>> trimmed. Maybe this can be worked around by defining two types of 
>> vmstate and selecting the correct one at migration, but I'm not sure.
>
> I thought your patch included a switch on the vhost-user-fs device (on 
> the qemu side) to tell qemu what migration data to expect. Can we not 
> transfer a 0-length state for 'external', and assert this on the 
> destination side?

Looks like I really need to make the description of my patch and the 
documentation clearer :) My patch proposes a switch on the _source_ side 
to select which data to save in the stream, mostly to protect old 
orchestrators that don't expect virtio-fs to be migratable (and for the 
internal case it can be extended to select whether qemu needs to request 
state from the backend). Michael insists that we also need a check on 
the destination, but I disagree, because I believe we can figure this 
out from the stream data without additional device flags.
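
Just to make it concrete (a rough sketch only, not the actual patch; the 
vuf_pre_save name and the migration_type field are made up here for 
illustration), the source-side check is essentially a pre_save hook like 
this:

static int vuf_pre_save(void *opaque)
{
    VHostUserFS *fs = opaque;

    /*
     * Refuse to save unless the orchestrator has explicitly opted in
     * to a migration type.
     */
    if (fs->migration_type == VHOST_USER_MIGRATION_TYPE_NONE) {
        error_report("Migration of vhost-user-fs is blocked; set the "
                     "device's migration property to allow it");
        return -EINVAL;
    }

    return 0;
}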

>
>>>
>>> But maybe we could also consider making stateless migration a 
>>> special case of stateful migration; if we had stateful migration, 
>>> can’t we just implement stateless migration by telling virtiofsd 
>>> that it should submit a special “I have no state” pseudo-state, i.e. 
>>> by having a switch on virtiofsd instead?
>>
>> Sure. Backend can send empty state (as your patch treats 0 length as 
>> a valid response and not error) or dummy state that can be recognized 
>> as stateless. The only potential problem is that then we need support 
>> in backend for new commands even to return dummy state, and if 
>> backend can support both types then we'll need some switch in backend 
>> to reply with real or empty state.
>
> Yes, exactly.
>
>>>
>>> Off the top of my head, some downsides of that approach would be
>>> (1) it’d need a switch on the virtiofsd side, not on the qemu side 
>>> (not sure if that’s a downside, but a difference for sure),
>>
>> Why would you? It seems to me that this affects only how qemu treats 
>> the vmstate of device. If the state was requested backend sends it to 
>> qemu. If state subsection is present in stream qemu sends it to the 
>> backend for loading. Stateless one just doesn't request state from 
>> the backend. Or am I missing something?
>>
>>> and (2) we’d need at least some support for this on the virtiofsd 
>>> side, i.e. practically can’t come quicker than stateful migration 
>>> support.
>>
>> Not much, essentially this is just a reconnect. I've sent a draft of 
>> a reconnect patch for old C-virtiofsd, for rust version it takes much 
>> longer because I'm learning rust and I'm not really good at it yet.
>
> I meant these two downsides not for your proposal, but instead if we 
> implemented stateless migration only in the back-end; i.e. the 
> front-end would only implement stateful migration, and the back-end 
> would send and accept an empty state.
>
> Then, qemu would always request a “state” (even if it turns out empty 
> for stateless migration), and qemu would always treat it the same, so 
> there wouldn’t be a switch on the qemu side, but only on the virtiofsd 
> side.  Doesn’t really matter, but what does matter is that we’d need 
> to implement the migration interface in virtiofsd, even if in the end 
> no state is transferred.
>
> So exactly what you’ve said above (“The only potential problem is 
> […]”). :)
>
> Hanna
>

Oh, yes, we were talking about the same thing. So do you agree that 
storing the internal state data in a subsection will allow the backend 
code to be more straightforward, without additional switches?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Virtio-fs] [RFC 2/2] vhost-user-fs: Implement stateful migration
  2023-03-20 12:39           ` Anton Kuchin
@ 2023-03-21 17:49             ` Hanna Czenczek
  2023-03-22 11:18               ` Anton Kuchin
  0 siblings, 1 reply; 23+ messages in thread
From: Hanna Czenczek @ 2023-03-21 17:49 UTC (permalink / raw)
  To: Anton Kuchin
  Cc: virtio-fs, Michael S . Tsirkin, qemu-devel, Stefan Hajnoczi,
	Juan Quintela

On 20.03.23 13:39, Anton Kuchin wrote:
> On 20/03/2023 11:33, Hanna Czenczek wrote:
>> On 17.03.23 19:37, Anton Kuchin wrote:
>>> On 17/03/2023 19:52, Hanna Czenczek wrote:
>>>> On 17.03.23 18:19, Anton Kuchin wrote:
>>>>> On 13/03/2023 19:48, Hanna Czenczek wrote:
>>>>>> A virtio-fs device's VM state consists of:
>>>>>> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
>>>>>> - the back-end's (virtiofsd's) internal state
>>>>>>
>>>>>> We get/set the latter via the new vhost-user operations 
>>>>>> FS_SET_STATE_FD,
>>>>>> FS_GET_STATE, and FS_SET_STATE.
>>>>>>
>>
>> [...]
>>
>>>>>>   static const VMStateDescription vuf_vmstate = {
>>>>>>       .name = "vhost-user-fs",
>>>>>> -    .unmigratable = 1,
>>>>>> +    .version_id = 1,
>>>>>> +    .fields = (VMStateField[]) {
>>>>>> +        VMSTATE_VIRTIO_DEVICE,
>>>>>> +        {
>>>>>> +            .name = "back-end",
>>>>>> +            .info = &(const VMStateInfo) {
>>>>>> +                .name = "virtio-fs back-end state",
>>>>>> +                .get = vuf_load_state,
>>>>>> +                .put = vuf_save_state,
>>>>>> +            },
>>>>>> +        },
>>>>>
>>>>> I've been working on stateless migration patch [1] and there was 
>>>>> discussed that we
>>>>> need to keep some kind of blocker by default if orchestrators rely 
>>>>> on unmigratable
>>>>> field in virtio-fs vmstate to block the migration.
>>>>> For this purpose I've implemented flag that selects "none" or 
>>>>> "external" and is checked
>>>>> in pre_save, so it could be extended with "internal" option.
>>>>> We didn't come to conclusion if we also need to check incoming 
>>>>> migration, the discussion
>>>>> has stopped for a while but I'm going back to it now.
>>>>>
>>>>> I would appreciate if you have time to take a look at the 
>>>>> discussion and consider the idea
>>>>> proposed there to store internal state as a subsection of vmstate 
>>>>> to make it as an option
>>>>> but not mandatory.
>>>>>
>>>>> [1] 
>>>>> https://patchew.org/QEMU/20230217170038.1273710-1-antonkuchin@yandex-team.ru/
>>>>
>>>> So far I’ve mostly considered these issues orthogonal.  If your 
>>>> stateless migration goes in first, then state is optional and I’ll 
>>>> adjust this series.
>>>> If stateful migration goes in first, then your series can simply 
>>>> make state optional by introducing the external option, no?
>>>
>>> Not really. State can be easily extended by subsections but not 
>>> trimmed. Maybe this can be worked around by defining two types of 
>>> vmstate and selecting the correct one at migration, but I'm not sure.
>>
>> I thought your patch included a switch on the vhost-user-fs device 
>> (on the qemu side) to tell qemu what migration data to expect. Can we 
>> not transfer a 0-length state for 'external', and assert this on the 
>> destination side?
>
> Looks like I really need to make the description of my patch and the 
> documentation more clear :) My patch proposes switch on _source_ side 
> to select which data to save in the stream mostly to protect old 
> orchestrators that don't expect virtio-fs to be migratable (and for 
> internal case it can be extended to select if qemu needs to request 
> state from backend), Michael insists that we also need to check on 
> destination but I disagree because I believe that we can figure this 
> out from stream data without additional device flags.
>
>>
>>>>
>>>> But maybe we could also consider making stateless migration a 
>>>> special case of stateful migration; if we had stateful migration, 
>>>> can’t we just implement stateless migration by telling virtiofsd 
>>>> that it should submit a special “I have no state” pseudo-state, 
>>>> i.e. by having a switch on virtiofsd instead?
>>>
>>> Sure. Backend can send empty state (as your patch treats 0 length as 
>>> a valid response and not error) or dummy state that can be 
>>> recognized as stateless. The only potential problem is that then we 
>>> need support in backend for new commands even to return dummy state, 
>>> and if backend can support both types then we'll need some switch in 
>>> backend to reply with real or empty state.
>>
>> Yes, exactly.
>>
>>>>
>>>> Off the top of my head, some downsides of that approach would be
>>>> (1) it’d need a switch on the virtiofsd side, not on the qemu side 
>>>> (not sure if that’s a downside, but a difference for sure),
>>>
>>> Why would you? It seems to me that this affects only how qemu treats 
>>> the vmstate of device. If the state was requested backend sends it 
>>> to qemu. If state subsection is present in stream qemu sends it to 
>>> the backend for loading. Stateless one just doesn't request state 
>>> from the backend. Or am I missing something?
>>>
>>>> and (2) we’d need at least some support for this on the virtiofsd 
>>>> side, i.e. practically can’t come quicker than stateful migration 
>>>> support.
>>>
>>> Not much, essentially this is just a reconnect. I've sent a draft of 
>>> a reconnect patch for old C-virtiofsd, for rust version it takes 
>>> much longer because I'm learning rust and I'm not really good at it 
>>> yet.
>>
>> I meant these two downsides not for your proposal, but instead if we 
>> implemented stateless migration only in the back-end; i.e. the 
>> front-end would only implement stateful migration, and the back-end 
>> would send and accept an empty state.
>>
>> Then, qemu would always request a “state” (even if it turns out empty 
>> for stateless migration), and qemu would always treat it the same, so 
>> there wouldn’t be a switch on the qemu side, but only on the 
>> virtiofsd side.  Doesn’t really matter, but what does matter is that 
>> we’d need to implement the migration interface in virtiofsd, even if 
>> in the end no state is transferred.
>>
>> So exactly what you’ve said above (“The only potential problem is 
>> […]”). :)
>>
>> Hanna
>>
>
> Oh, yes, we were talking about the same thing. So do you agree that 
> storing internal state data in subsection will allow backend code to 
> be more straightforward without additional switches?

Sounds good.  I think we should rename VHOST_USER_MIGRATION_TYPE_NONE to 
..._INTERNAL, though, and then use that (default) mode for stateful 
migration (via VMState), because I think that should be the default 
migration type (and while there’s no support for it, it will just 
continue to block migration).

So I suppose you mean something like this, where 
vuf_stateful_migration() basically returns `fs->migration_type == 
VHOST_USER_MIGRATION_TYPE_INTERNAL`, and on the destination, you’d check 
whether the subsection is present to decide which kind of migration it 
is, right?

static const VMStateDescription vuf_backend_vmstate;

static const VMStateDescription vuf_vmstate = {
     .name = "vhost-user-fs",
     .version_id = 0,
     .fields = (VMStateField[]) {
         VMSTATE_VIRTIO_DEVICE,
         VMSTATE_END_OF_LIST()
     },
     .subsections = (const VMStateDescription*[]) {
         &vuf_backend_vmstate,
         NULL,
     }
};

static const VMStateDescription vuf_backend_vmstate = {
     .name = "vhost-user-fs-backend",
     .version_id = 0,
     .needed = vuf_stateful_migration,
     .fields = (VMStateField[]) {
         {
             .name = "back-end",
             .info = &(const VMStateInfo) {
                 .name = "virtio-fs back-end state",
                 .get = vuf_load_state,
                 .put = vuf_save_state,
             },
         },
         VMSTATE_END_OF_LIST()
     },
};
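
And the .needed callback would then just be (sketch; the migration_type 
field name is the one used above):

static bool vuf_stateful_migration(void *opaque)
{
    VHostUserFS *fs = opaque;

    /* Emit the back-end subsection only for internal migration */
    return fs->migration_type == VHOST_USER_MIGRATION_TYPE_INTERNAL;
}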


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Virtio-fs] [RFC 2/2] vhost-user-fs: Implement stateful migration
  2023-03-21 17:49             ` Hanna Czenczek
@ 2023-03-22 11:18               ` Anton Kuchin
  0 siblings, 0 replies; 23+ messages in thread
From: Anton Kuchin @ 2023-03-22 11:18 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: virtio-fs, Michael S . Tsirkin, qemu-devel, Stefan Hajnoczi,
	Juan Quintela

On 21/03/2023 19:49, Hanna Czenczek wrote:
> On 20.03.23 13:39, Anton Kuchin wrote:
>> On 20/03/2023 11:33, Hanna Czenczek wrote:
>>> On 17.03.23 19:37, Anton Kuchin wrote:
>>>> On 17/03/2023 19:52, Hanna Czenczek wrote:
>>>>> On 17.03.23 18:19, Anton Kuchin wrote:
>>>>>> On 13/03/2023 19:48, Hanna Czenczek wrote:
>>>>>>> A virtio-fs device's VM state consists of:
>>>>>>> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
>>>>>>> - the back-end's (virtiofsd's) internal state
>>>>>>>
>>>>>>> We get/set the latter via the new vhost-user operations 
>>>>>>> FS_SET_STATE_FD,
>>>>>>> FS_GET_STATE, and FS_SET_STATE.
>>>>>>>
>>>
>>> [...]
>>>
>>>>>>>   static const VMStateDescription vuf_vmstate = {
>>>>>>>       .name = "vhost-user-fs",
>>>>>>> -    .unmigratable = 1,
>>>>>>> +    .version_id = 1,
>>>>>>> +    .fields = (VMStateField[]) {
>>>>>>> +        VMSTATE_VIRTIO_DEVICE,
>>>>>>> +        {
>>>>>>> +            .name = "back-end",
>>>>>>> +            .info = &(const VMStateInfo) {
>>>>>>> +                .name = "virtio-fs back-end state",
>>>>>>> +                .get = vuf_load_state,
>>>>>>> +                .put = vuf_save_state,
>>>>>>> +            },
>>>>>>> +        },
>>>>>>
>>>>>> I've been working on stateless migration patch [1] and there was 
>>>>>> discussed that we
>>>>>> need to keep some kind of blocker by default if orchestrators 
>>>>>> rely on unmigratable
>>>>>> field in virtio-fs vmstate to block the migration.
>>>>>> For this purpose I've implemented flag that selects "none" or 
>>>>>> "external" and is checked
>>>>>> in pre_save, so it could be extended with "internal" option.
>>>>>> We didn't come to conclusion if we also need to check incoming 
>>>>>> migration, the discussion
>>>>>> has stopped for a while but I'm going back to it now.
>>>>>>
>>>>>> I would appreciate if you have time to take a look at the 
>>>>>> discussion and consider the idea
>>>>>> proposed there to store internal state as a subsection of vmstate 
>>>>>> to make it as an option
>>>>>> but not mandatory.
>>>>>>
>>>>>> [1] 
>>>>>> https://patchew.org/QEMU/20230217170038.1273710-1-antonkuchin@yandex-team.ru/
>>>>>
>>>>> So far I’ve mostly considered these issues orthogonal.  If your 
>>>>> stateless migration goes in first, then state is optional and I’ll 
>>>>> adjust this series.
>>>>> If stateful migration goes in first, then your series can simply 
>>>>> make state optional by introducing the external option, no?
>>>>
>>>> Not really. State can be easily extended by subsections but not 
>>>> trimmed. Maybe this can be worked around by defining two types of 
>>>> vmstate and selecting the correct one at migration, but I'm not sure.
>>>
>>> I thought your patch included a switch on the vhost-user-fs device 
>>> (on the qemu side) to tell qemu what migration data to expect. Can 
>>> we not transfer a 0-length state for 'external', and assert this on 
>>> the destination side?
>>
>> Looks like I really need to make the description of my patch and the 
>> documentation more clear :) My patch proposes switch on _source_ side 
>> to select which data to save in the stream mostly to protect old 
>> orchestrators that don't expect virtio-fs to be migratable (and for 
>> internal case it can be extended to select if qemu needs to request 
>> state from backend), Michael insists that we also need to check on 
>> destination but I disagree because I believe that we can figure this 
>> out from stream data without additional device flags.
>>
>>>
>>>>>
>>>>> But maybe we could also consider making stateless migration a 
>>>>> special case of stateful migration; if we had stateful migration, 
>>>>> can’t we just implement stateless migration by telling virtiofsd 
>>>>> that it should submit a special “I have no state” pseudo-state, 
>>>>> i.e. by having a switch on virtiofsd instead?
>>>>
>>>> Sure. Backend can send empty state (as your patch treats 0 length 
>>>> as a valid response and not error) or dummy state that can be 
>>>> recognized as stateless. The only potential problem is that then we 
>>>> need support in backend for new commands even to return dummy 
>>>> state, and if backend can support both types then we'll need some 
>>>> switch in backend to reply with real or empty state.
>>>
>>> Yes, exactly.
>>>
>>>>>
>>>>> Off the top of my head, some downsides of that approach would be
>>>>> (1) it’d need a switch on the virtiofsd side, not on the qemu side 
>>>>> (not sure if that’s a downside, but a difference for sure),
>>>>
>>>> Why would you? It seems to me that this affects only how qemu 
>>>> treats the vmstate of device. If the state was requested backend 
>>>> sends it to qemu. If state subsection is present in stream qemu 
>>>> sends it to the backend for loading. Stateless one just doesn't 
>>>> request state from the backend. Or am I missing something?
>>>>
>>>>> and (2) we’d need at least some support for this on the virtiofsd 
>>>>> side, i.e. practically can’t come quicker than stateful migration 
>>>>> support.
>>>>
>>>> Not much, essentially this is just a reconnect. I've sent a draft 
>>>> of a reconnect patch for old C-virtiofsd, for rust version it takes 
>>>> much longer because I'm learning rust and I'm not really good at it 
>>>> yet.
>>>
>>> I meant these two downsides not for your proposal, but instead if we 
>>> implemented stateless migration only in the back-end; i.e. the 
>>> front-end would only implement stateful migration, and the back-end 
>>> would send and accept an empty state.
>>>
>>> Then, qemu would always request a “state” (even if it turns out 
>>> empty for stateless migration), and qemu would always treat it the 
>>> same, so there wouldn’t be a switch on the qemu side, but only on 
>>> the virtiofsd side.  Doesn’t really matter, but what does matter is 
>>> that we’d need to implement the migration interface in virtiofsd, 
>>> even if in the end no state is transferred.
>>>
>>> So exactly what you’ve said above (“The only potential problem is 
>>> […]”). :)
>>>
>>> Hanna
>>>
>>
>> Oh, yes, we were talking about the same thing. So do you agree that 
>> storing internal state data in subsection will allow backend code to 
>> be more straightforward without additional switches?
>
> Sounds good.  I think we should rename VHOST_USER_MIGRATION_TYPE_NONE 
> to ..._INTERNAL, though, and then use that (default) mode for stateful 
> migration (via VMState), because I think that should be the default 
> migration type (and while there’s no support for it, it will just 
> continue to block migration).

My plan was to add a new ..._INTERNAL option and keep ..._NONE as the 
default that blocks migration unless the orchestrator explicitly selects 
a migration type, to avoid accidental migrations with old orchestrators 
that expect a blocker for virtio-fs. But if INTERNAL blocks migration 
when the backend doesn't support the new commands, your idea of making 
it the default may be even better.
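
Roughly what I have in mind for the property values (a sketch only; the 
exact names in the patch may differ):

typedef enum VhostUserMigrationType {
    VHOST_USER_MIGRATION_TYPE_NONE,     /* keep the migration blocker */
    VHOST_USER_MIGRATION_TYPE_EXTERNAL, /* orchestrator moves the state */
    VHOST_USER_MIGRATION_TYPE_INTERNAL, /* state goes into the vmstate */
} VhostUserMigrationType;

Which of NONE or INTERNAL ends up as the default is then exactly the 
question above.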

>
>
> So I suppose you mean something like this, where 
> vuf_stateful_migration() basically returns `fs->migration_type == 
> VHOST_USER_MIGRATION_TYPE_INTERNAL`, and on the destination, you’d 
> check whether the subsection is present to decide which kind of 
> migration it is, right?

Yes, exactly.

>
>
> static const VMStateDescription vuf_backend_vmstate;
>
> static const VMStateDescription vuf_vmstate = {
>     .name = "vhost-user-fs",
>     .version_id = 0,
>     .fields = (VMStateField[]) {
>         VMSTATE_VIRTIO_DEVICE,
>         VMSTATE_END_OF_LIST()
>     },
>     .subsections = (const VMStateDescription*[]) {
>         &vuf_backend_vmstate,
>         NULL,
>     }
> };
>
> static const VMStateDescription vuf_backend_vmstate = {
>     .name = "vhost-user-fs-backend",
>     .version_id = 0,
>     .needed = vuf_stateful_migration,
>     .fields = (VMStateField[]) {
>         {
>             .name = "back-end",
>             .info = &(const VMStateInfo) {
>                 .name = "virtio-fs back-end state",
>                 .get = vuf_load_state,
>                 .put = vuf_save_state,
>             },
>         },
>         VMSTATE_END_OF_LIST()
>     },
> };
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-03-22 11:18 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-13 17:48 [RFC 0/2] vhost-user-fs: Stateful migration Hanna Czenczek
2023-03-13 17:48 ` [Virtio-fs] " Hanna Czenczek
2023-03-13 17:48 ` [RFC 1/2] vhost-user: Add interface for virtio-fs migration Hanna Czenczek
2023-03-13 17:48   ` [Virtio-fs] " Hanna Czenczek
2023-03-15 13:58   ` Stefan Hajnoczi
2023-03-15 13:58     ` [Virtio-fs] " Stefan Hajnoczi
2023-03-15 15:55     ` Hanna Czenczek
2023-03-15 15:55       ` [Virtio-fs] " Hanna Czenczek
2023-03-15 16:33       ` Stefan Hajnoczi
2023-03-15 16:33         ` [Virtio-fs] " Stefan Hajnoczi
2023-03-17 17:39   ` Anton Kuchin
2023-03-17 17:39     ` [Virtio-fs] " Anton Kuchin
2023-03-17 17:53     ` Hanna Czenczek
2023-03-13 17:48 ` [RFC 2/2] vhost-user-fs: Implement stateful migration Hanna Czenczek
2023-03-13 17:48   ` [Virtio-fs] " Hanna Czenczek
2023-03-17 17:19   ` Anton Kuchin
2023-03-17 17:19     ` [Virtio-fs] " Anton Kuchin
2023-03-17 17:52     ` Hanna Czenczek
2023-03-17 18:37       ` Anton Kuchin
2023-03-20  9:33         ` Hanna Czenczek
2023-03-20 12:39           ` Anton Kuchin
2023-03-21 17:49             ` Hanna Czenczek
2023-03-22 11:18               ` Anton Kuchin
