* [RFC v2 00/13] vDPA software assisted live migration
@ 2021-03-15 19:48 Eugenio Pérez
  2021-03-15 19:48 ` [RFC v2 01/13] virtio: Add virtio_queue_is_host_notifier_enabled Eugenio Pérez
                   ` (13 more replies)
  0 siblings, 14 replies; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

This series enables shadow virtqueue for vhost-net devices. This is a
new method of vhost device migration: instead of relying on the vhost
device's dirty logging capability, SW assisted LM intercepts the
dataplane, forwarding the descriptors between VM and device. It is
intended for vDPA devices with no logging capability, but it provides
the basic platform to build that support on.

In this migration mode, qemu offers a new vring to the device to
read and write into, disables the vhost notifiers, and processes guest
and vhost notifications in qemu. On used buffer relay, qemu marks the
dirty memory the same way plain virtio-net devices do. This way, the
device does not need to have dirty page logging capability.
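
A minimal sketch of that used buffer relay (the helper name is
hypothetical; the real code lands in patch 11):

static void relay_used_buffer(VhostShadowVirtqueue *svq,
                              VirtQueueElement *elem, uint32_t len)
{
    /* Return the buffer to the guest's VirtQueue */
    virtqueue_fill(svq->vq, elem, len, 0);
    virtqueue_flush(svq->vq, 1);
    /*
     * virtqueue_fill/virtqueue_flush write the used ring through the
     * memory API, so the pages get marked dirty exactly as with an
     * emulated virtio-net device, and regular migration picks them up.
     */
}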

This series is a POC doing SW LM for vhost-net devices, which already
have dirty page logging capabilities. For qemu to use shadow virtqueues,
these vhost-net devices need to be instantiated (see the example
command line after this list):
* With IOMMU (iommu_platform=on,ats=on)
* Without event_idx (event_idx=off)
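
For example, a command line along these lines meets both requirements
(illustrative only: the netdev id and tap settings are placeholders,
and a vIOMMU such as intel-iommu with device-iotlb=on is assumed):

qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
    -device intel-iommu,device-iotlb=on \
    -netdev tap,id=net0,vhost=on \
    -device virtio-net-pci,netdev=net0,iommu_platform=on,ats=on,event_idx=off \
    ...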

And the shadow virtqueue needs to be enabled for them with a QMP
command like:

{
  "execute": "x-vhost-enable-shadow-vq",
  "arguments": {
    "name": "virtio-net",
    "enable": false
  }
}

Just the notification forwarding (with no descriptor relay) can be
achieved with patches 5 and 6, and starting SVQ. The previous commits
are cleanups and the declaration of the QMP command.

Commit 11 introduces the buffer forwarding. The previous ones are
again preparations, and the later ones enable some obvious
optimizations.

It is based on the ideas of DPDK SW assisted LM, in the series at
https://patchwork.dpdk.org/cover/48370/ . However, this series does
not map the shadow vq in the guest's VA space, but in qemu's.

Comments are welcome! Especially on:
* Different/improved ways of synchronization, particularly around the
  masking race.

TODO:
* Event, indirect, packed, and other features of virtio - Waiting for
  confirmation of the big picture.
* vDPA devices: Developing solutions for tracking the available IOVA
  space for all devices. A small POC is available, skipping the get/set
  status (since vDPA does not support it) and just allocating more and
  more IOVA addresses in a hardcoded range available for the device.
* Separate buffer forwarding into its own AIO context, so we can
  throw more threads at that task and don't need to stop the main
  event loop.
* IOMMU optimizations, so batching and bigger chunks of IOVA can be
  sent to the device.
* Automatic kick-in on live-migration.
* Proper documentation.

Thanks!

Changes from v1 RFC:
  * Use QMP instead of migration to start SVQ mode.
  * Only accepting IOMMU devices, for closer behavior to the target
    (vDPA) devices.
  * Fix invalid masking/unmasking of vhost call fd.
  * Use of proper methods for synchronization.
  * No need to modify VirtIO device code, all of the changes are
    contained in vhost code.
  * Delete superfluous code.
  * An intermediate RFC was sent with only the notification forwarding
    changes. It can be seen at
    https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
  * v1 at
    https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html

Eugenio Pérez (13):
  virtio: Add virtio_queue_is_host_notifier_enabled
  vhost: Save masked_notifier state
  vhost: Add VhostShadowVirtqueue
  vhost: Add x-vhost-enable-shadow-vq qmp
  vhost: Route guest->host notification through shadow virtqueue
  vhost: Route host->guest notification through shadow virtqueue
  vhost: Avoid re-set masked notifier in shadow vq
  virtio: Add vhost_shadow_vq_get_vring_addr
  virtio: Add virtio_queue_full
  vhost: add vhost_kernel_set_vring_enable
  vhost: Shadow virtqueue buffers forwarding
  vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue
    kick
  vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow
    virtqueue

 qapi/net.json                      |  22 ++
 hw/virtio/vhost-shadow-virtqueue.h |  36 ++
 include/hw/virtio/vhost.h          |   6 +
 include/hw/virtio/virtio.h         |   3 +
 hw/virtio/vhost-backend.c          |  29 ++
 hw/virtio/vhost-shadow-virtqueue.c | 551 +++++++++++++++++++++++++++++
 hw/virtio/vhost.c                  | 283 +++++++++++++++
 hw/virtio/virtio.c                 |  23 +-
 hw/virtio/meson.build              |   2 +-
 9 files changed, 952 insertions(+), 3 deletions(-)
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.c

-- 
2.27.0





* [RFC v2 01/13] virtio: Add virtio_queue_is_host_notifier_enabled
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-15 19:48 ` [RFC v2 02/13] vhost: Save masked_notifier state Eugenio Pérez
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

This allows the shadow virtqueue code to assert the queue status before
making changes.
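
For example, the SVQ start path added later in this series uses it to
check that guest notifications still go directly to the vhost device:

    assert(virtio_queue_is_host_notifier_enabled(svq->vq));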

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/virtio.h | 1 +
 hw/virtio/virtio.c         | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b7ece7a6a8..c2c7cee993 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -316,6 +316,7 @@ void virtio_device_release_ioeventfd(VirtIODevice *vdev);
 bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev);
 EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
 void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled);
+bool virtio_queue_is_host_notifier_enabled(const VirtQueue *vq);
 void virtio_queue_host_notifier_read(EventNotifier *n);
 void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
                                                 VirtIOHandleAIOOutput handle_output);
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 07f4e60b30..a86b3f9c26 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3594,6 +3594,11 @@ EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq)
     return &vq->host_notifier;
 }
 
+bool virtio_queue_is_host_notifier_enabled(const VirtQueue *vq)
+{
+    return vq->host_notifier_enabled;
+}
+
 void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled)
 {
     vq->host_notifier_enabled = enabled;
-- 
2.27.0




* [RFC v2 02/13] vhost: Save masked_notifier state
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
  2021-03-15 19:48 ` [RFC v2 01/13] virtio: Add virtio_queue_is_host_notifier_enabled Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-15 19:48 ` [RFC v2 03/13] vhost: Add VhostShadowVirtqueue Eugenio Pérez
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

It will be used to configure the shadow virtqueue. The shadow
virtqueue will relay the device->guest notifications, so vhost needs to
be able to tell the masking status.
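
Patch 06 later consults this saved state when starting or stopping the
shadow virtqueue, e.g.:

    vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
                         dev->vqs[idx].notifier_is_masked);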

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost.h | 1 +
 hw/virtio/vhost.c         | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 4a8bc75415..ac963bf23d 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -28,6 +28,7 @@ struct vhost_virtqueue {
     unsigned avail_size;
     unsigned long long used_phys;
     unsigned used_size;
+    bool notifier_is_masked;
     EventNotifier masked_notifier;
     struct vhost_dev *dev;
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e2163a0d63..4680c0cfcf 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1527,6 +1527,8 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, VirtIODevice *vdev, int n,
     r = hdev->vhost_ops->vhost_set_vring_call(hdev, &file);
     if (r < 0) {
         VHOST_OPS_DEBUG("vhost_set_vring_call failed");
+    } else {
+        hdev->vqs[index].notifier_is_masked = mask;
     }
 }
 
-- 
2.27.0




* [RFC v2 03/13] vhost: Add VhostShadowVirtqueue
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
  2021-03-15 19:48 ` [RFC v2 01/13] virtio: Add virtio_queue_is_host_notifier_enabled Eugenio Pérez
  2021-03-15 19:48 ` [RFC v2 02/13] vhost: Save masked_notifier state Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-15 19:48 ` [RFC v2 04/13] vhost: Add x-vhost-enable-shadow-vq qmp Eugenio Pérez
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

Vhost shadow virtqueue (SVQ) is an intermediate jump for virtqueue
notifications and buffers, allowing qemu to track them. While qemu is
forwarding the buffers and virtqueue changes, it is able to track the
memory that is being dirtied, the same way regular qemu VirtIO devices
do.

This commit only exposes basic SVQ allocation and free, so changes
regarding different aspects of SVQ (notifications forwarding, buffer
forwarding, starting/stopping) are more isolated and easier to bisect.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h | 24 ++++++++++++
 hw/virtio/vhost-shadow-virtqueue.c | 63 ++++++++++++++++++++++++++++++
 hw/virtio/meson.build              |  2 +-
 3 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.c

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
new file mode 100644
index 0000000000..6cc18d6acb
--- /dev/null
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -0,0 +1,24 @@
+/*
+ * vhost software live migration ring
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef VHOST_SHADOW_VIRTQUEUE_H
+#define VHOST_SHADOW_VIRTQUEUE_H
+
+#include "qemu/osdep.h"
+
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/vhost.h"
+
+typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
+
+VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
+
+void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
+
+#endif
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
new file mode 100644
index 0000000000..4512e5b058
--- /dev/null
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -0,0 +1,63 @@
+/*
+ * vhost software live migration ring
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "hw/virtio/vhost-shadow-virtqueue.h"
+
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+
+/* Shadow virtqueue to relay notifications */
+typedef struct VhostShadowVirtqueue {
+    /* Shadow kick notifier, sent to vhost */
+    EventNotifier kick_notifier;
+    /* Shadow call notifier, sent to vhost */
+    EventNotifier call_notifier;
+} VhostShadowVirtqueue;
+
+/*
+ * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
+ * methods and file descriptors.
+ */
+VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
+{
+    g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
+    int r;
+
+    r = event_notifier_init(&svq->kick_notifier, 0);
+    if (r != 0) {
+        error_report("Couldn't create kick event notifier: %s",
+                     strerror(errno));
+        goto err_init_kick_notifier;
+    }
+
+    r = event_notifier_init(&svq->call_notifier, 0);
+    if (r != 0) {
+        error_report("Couldn't create call event notifier: %s",
+                     strerror(errno));
+        goto err_init_call_notifier;
+    }
+
+    return g_steal_pointer(&svq);
+
+err_init_call_notifier:
+    event_notifier_cleanup(&svq->kick_notifier);
+
+err_init_kick_notifier:
+    return NULL;
+}
+
+/*
+ * Free the resources of the shadow virtqueue.
+ */
+void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
+{
+    event_notifier_cleanup(&vq->kick_notifier);
+    event_notifier_cleanup(&vq->call_notifier);
+    g_free(vq);
+}
diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index fbff9bc9d4..8b5a0225fe 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
 
 virtio_ss = ss.source_set()
 virtio_ss.add(files('virtio.c'))
-virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c'))
+virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
 virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
-- 
2.27.0




* [RFC v2 04/13] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (2 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 03/13] vhost: Add VhostShadowVirtqueue Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-16 13:37   ` Eric Blake
  2021-03-15 19:48 ` [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue Eugenio Pérez
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

Command to enable shadow virtqueue looks like:

{ "execute": "x-vhost-enable-shadow-vq", "arguments": { "name": "dev0", "enable": true } }

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 qapi/net.json     | 22 ++++++++++++++++++++++
 hw/virtio/vhost.c |  6 ++++++
 2 files changed, 28 insertions(+)

diff --git a/qapi/net.json b/qapi/net.json
index c31748c87f..4c5f65d021 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -77,6 +77,28 @@
 ##
 { 'command': 'netdev_del', 'data': {'id': 'str'} }
 
+##
+# @x-vhost-enable-shadow-vq:
+#
+# Use vhost shadow virtqueue.
+#
+# @name: the device name of the VirtIO device
+#
+# @enable: true to use the alternate shadow VQ notification path
+#
+# Returns: Error if failure, or 'no error' for success. Not found if vhost is not enabled.
+#
+# Since: 6.0
+#
+# Example:
+#
+# -> { "execute": "x-vhost-enable-shadow-vq", "arguments": { "name": "virtio-net", "enable": false } }
+#
+##
+{ 'command': 'x-vhost-enable-shadow-vq',
+  'data': {'name': 'str', 'enable': 'bool'},
+  'if': 'defined(CONFIG_VHOST_KERNEL)' }
+
 ##
 # @NetLegacyNicOptions:
 #
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 4680c0cfcf..97f1bcfa42 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -15,6 +15,7 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "qapi/qapi-commands-net.h"
 #include "hw/virtio/vhost.h"
 #include "qemu/atomic.h"
 #include "qemu/range.h"
@@ -1831,3 +1832,8 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
 
     return -1;
 }
+
+void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
+{
+    error_setg(errp, "Shadow virtqueue still not implemented");
+}
-- 
2.27.0




* [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (3 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 04/13] vhost: Add x-vhost-enable-shadow-vq qmp Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-16  7:18   ` Jason Wang
  2021-03-15 19:48 ` [RFC v2 06/13] vhost: Route host->guest " Eugenio Pérez
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

Shadow virtqueue notification forwarding is disabled when the
vhost_dev stops, so the code flow follows the usual cleanup path.
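
The core of the forwarding is just relaying the guest's kick into the
shadow kick notifier, condensed from the patch below:

    static void vhost_handle_guest_kick(EventNotifier *n)
    {
        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
                                                 host_notifier);

        if (event_notifier_test_and_clear(n)) {
            event_notifier_set(&svq->kick_notifier);
        }
    }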

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |   7 ++
 include/hw/virtio/vhost.h          |   4 +
 hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
 hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
 4 files changed, 265 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 6cc18d6acb..c891c6510d 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -17,6 +17,13 @@
 
 typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
+bool vhost_shadow_vq_start(struct vhost_dev *dev,
+                           unsigned idx,
+                           VhostShadowVirtqueue *svq);
+void vhost_shadow_vq_stop(struct vhost_dev *dev,
+                          unsigned idx,
+                          VhostShadowVirtqueue *svq);
+
 VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
 
 void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index ac963bf23d..7ffdf9aea0 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -55,6 +55,8 @@ struct vhost_iommu {
     QLIST_ENTRY(vhost_iommu) iommu_next;
 };
 
+typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
+
 typedef struct VhostDevConfigOps {
     /* Vhost device config space changed callback
      */
@@ -83,7 +85,9 @@ struct vhost_dev {
     uint64_t backend_cap;
     bool started;
     bool log_enabled;
+    bool shadow_vqs_enabled;
     uint64_t log_size;
+    VhostShadowVirtqueue **shadow_vqs;
     Error *migration_blocker;
     const VhostOps *vhost_ops;
     void *opaque;
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 4512e5b058..3e43399e9c 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -8,9 +8,12 @@
  */
 
 #include "hw/virtio/vhost-shadow-virtqueue.h"
+#include "hw/virtio/vhost.h"
+
+#include "standard-headers/linux/vhost_types.h"
 
 #include "qemu/error-report.h"
-#include "qemu/event_notifier.h"
+#include "qemu/main-loop.h"
 
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
@@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
     EventNotifier kick_notifier;
     /* Shadow call notifier, sent to vhost */
     EventNotifier call_notifier;
+
+    /*
+     * Borrowed virtqueue's guest to host notifier.
+     * To borrow it in this event notifier allows to register on the event
+     * loop and access the associated shadow virtqueue easily. If we use the
+     * VirtQueue, we don't have an easy way to retrieve it.
+     *
+     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
+     */
+    EventNotifier host_notifier;
+
+    /* Virtio queue shadowing */
+    VirtQueue *vq;
 } VhostShadowVirtqueue;
 
+/* Forward guest notifications */
+static void vhost_handle_guest_kick(EventNotifier *n)
+{
+    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
+                                             host_notifier);
+
+    if (unlikely(!event_notifier_test_and_clear(n))) {
+        return;
+    }
+
+    event_notifier_set(&svq->kick_notifier);
+}
+
+/*
+ * Restore the vhost guest to host notifier, i.e., disables svq effect.
+ */
+static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
+                                                     unsigned vhost_index,
+                                                     VhostShadowVirtqueue *svq)
+{
+    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
+    struct vhost_vring_file file = {
+        .index = vhost_index,
+        .fd = event_notifier_get_fd(vq_host_notifier),
+    };
+    int r;
+
+    /* Restore vhost kick */
+    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
+    return r ? -errno : 0;
+}
+
+/*
+ * Start shadow virtqueue operation.
+ * @dev vhost device
+ * @hidx vhost virtqueue index
+ * @svq Shadow Virtqueue
+ */
+bool vhost_shadow_vq_start(struct vhost_dev *dev,
+                           unsigned idx,
+                           VhostShadowVirtqueue *svq)
+{
+    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
+    struct vhost_vring_file file = {
+        .index = idx,
+        .fd = event_notifier_get_fd(&svq->kick_notifier),
+    };
+    int r;
+
+    /* Check that notifications are still going directly to vhost dev */
+    assert(virtio_queue_is_host_notifier_enabled(svq->vq));
+
+    /*
+     * event_notifier_set_handler already checks for guest's notifications if
+     * they arrive in the switch, so there is no need to explicitly check for
+     * them.
+     */
+    event_notifier_init_fd(&svq->host_notifier,
+                           event_notifier_get_fd(vq_host_notifier));
+    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);
+
+    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
+    if (unlikely(r != 0)) {
+        error_report("Couldn't set kick fd: %s", strerror(errno));
+        goto err_set_vring_kick;
+    }
+
+    return true;
+
+err_set_vring_kick:
+    event_notifier_set_handler(&svq->host_notifier, NULL);
+
+    return false;
+}
+
+/*
+ * Stop shadow virtqueue operation.
+ * @dev vhost device
+ * @idx vhost queue index
+ * @svq Shadow Virtqueue
+ */
+void vhost_shadow_vq_stop(struct vhost_dev *dev,
+                          unsigned idx,
+                          VhostShadowVirtqueue *svq)
+{
+    int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
+    if (unlikely(r < 0)) {
+        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
+    }
+
+    event_notifier_set_handler(&svq->host_notifier, NULL);
+}
+
 /*
  * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
  * methods and file descriptors.
  */
 VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
 {
+    int vq_idx = dev->vq_index + idx;
     g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
     int r;
 
@@ -43,6 +153,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
         goto err_init_call_notifier;
     }
 
+    svq->vq = virtio_get_queue(dev->vdev, vq_idx);
     return g_steal_pointer(&svq);
 
 err_init_call_notifier:
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 97f1bcfa42..4858a35df6 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -25,6 +25,7 @@
 #include "exec/address-spaces.h"
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
+#include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file-types.h"
 #include "sysemu/dma.h"
@@ -1219,6 +1220,74 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
                        0, virtio_queue_get_desc_size(vdev, idx));
 }
 
+static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
+{
+    int idx;
+
+    dev->shadow_vqs_enabled = false;
+
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
+        vhost_shadow_vq_free(dev->shadow_vqs[idx]);
+    }
+
+    g_free(dev->shadow_vqs);
+    dev->shadow_vqs = NULL;
+    return 0;
+}
+
+static int vhost_sw_live_migration_start(struct vhost_dev *dev)
+{
+    int idx, stop_idx;
+
+    dev->shadow_vqs = g_new0(VhostShadowVirtqueue *, dev->nvqs);
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        dev->shadow_vqs[idx] = vhost_shadow_vq_new(dev, idx);
+        if (unlikely(dev->shadow_vqs[idx] == NULL)) {
+            goto err_new;
+        }
+    }
+
+    dev->shadow_vqs_enabled = true;
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
+        if (unlikely(!ok)) {
+            goto err_start;
+        }
+    }
+
+    return 0;
+
+err_start:
+    dev->shadow_vqs_enabled = false;
+    for (stop_idx = 0; stop_idx < idx; stop_idx++) {
+        vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
+    }
+
+err_new:
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        vhost_shadow_vq_free(dev->shadow_vqs[idx]);
+    }
+    g_free(dev->shadow_vqs);
+
+    return -1;
+}
+
+static int vhost_sw_live_migration_enable(struct vhost_dev *dev,
+                                          bool enable_lm)
+{
+    int r;
+
+    if (enable_lm == dev->shadow_vqs_enabled) {
+        return 0;
+    }
+
+    r = enable_lm ? vhost_sw_live_migration_start(dev)
+                  : vhost_sw_live_migration_stop(dev);
+
+    return r;
+}
+
 static void vhost_eventfd_add(MemoryListener *listener,
                               MemoryRegionSection *section,
                               bool match_data, uint64_t data, EventNotifier *e)
@@ -1381,6 +1450,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     hdev->log = NULL;
     hdev->log_size = 0;
     hdev->log_enabled = false;
+    hdev->shadow_vqs_enabled = false;
     hdev->started = false;
     memory_listener_register(&hdev->memory_listener, &address_space_memory);
     QLIST_INSERT_HEAD(&vhost_devices, hdev, entry);
@@ -1484,6 +1554,10 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
     int i, r;
 
+    if (hdev->shadow_vqs_enabled) {
+        vhost_sw_live_migration_enable(hdev, false);
+    }
+
     for (i = 0; i < hdev->nvqs; ++i) {
         r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), hdev->vq_index + i,
                                          false);
@@ -1798,6 +1872,7 @@ fail_features:
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
 {
     int i;
+    bool is_shadow_vqs_enabled = hdev->shadow_vqs_enabled;
 
     /* should only be called after backend is connected */
     assert(hdev->vhost_ops);
@@ -1805,7 +1880,16 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     if (hdev->vhost_ops->vhost_dev_start) {
         hdev->vhost_ops->vhost_dev_start(hdev, false);
     }
+    if (is_shadow_vqs_enabled) {
+        /* Shadow virtqueue will be stopped */
+        hdev->shadow_vqs_enabled = false;
+    }
     for (i = 0; i < hdev->nvqs; ++i) {
+        if (is_shadow_vqs_enabled) {
+            vhost_shadow_vq_stop(hdev, i, hdev->shadow_vqs[i]);
+            vhost_shadow_vq_free(hdev->shadow_vqs[i]);
+        }
+
         vhost_virtqueue_stop(hdev,
                              vdev,
                              hdev->vqs + i,
@@ -1819,6 +1903,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
         memory_listener_unregister(&hdev->iommu_listener);
     }
     vhost_log_put(hdev, true);
+    g_free(hdev->shadow_vqs);
+    hdev->shadow_vqs_enabled = false;
     hdev->started = false;
     hdev->vdev = NULL;
 }
@@ -1835,5 +1921,60 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
 
 void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
 {
-    error_setg(errp, "Shadow virtqueue still not implemented");
+    struct vhost_dev *hdev, *hdev_err;
+    VirtIODevice *vdev;
+    const char *err_cause = NULL;
+    int r;
+    ErrorClass err_class = ERROR_CLASS_GENERIC_ERROR;
+
+    QLIST_FOREACH(hdev, &vhost_devices, entry) {
+        if (hdev->vdev && 0 == strcmp(hdev->vdev->name, name)) {
+            vdev = hdev->vdev;
+            break;
+        }
+    }
+
+    if (!hdev) {
+        err_class = ERROR_CLASS_DEVICE_NOT_FOUND;
+        err_cause = "Device not found";
+        goto not_found_err;
+    }
+
+    for ( ; hdev; hdev = QLIST_NEXT(hdev, entry)) {
+        if (vdev != hdev->vdev) {
+            continue;
+        }
+
+        if (!hdev->started) {
+            err_cause = "Device is not started";
+            goto err;
+        }
+
+        r = vhost_sw_live_migration_enable(hdev, enable);
+        if (unlikely(r)) {
+            err_cause = "Error enabling (see monitor)";
+            goto err;
+        }
+    }
+
+    return;
+
+err:
+    QLIST_FOREACH(hdev_err, &vhost_devices, entry) {
+        if (hdev_err == hdev) {
+            break;
+        }
+
+        if (vdev != hdev->vdev) {
+            continue;
+        }
+
+        vhost_sw_live_migration_enable(hdev, !enable);
+    }
+
+not_found_err:
+    if (err_cause) {
+        error_set(errp, err_class,
+                  "Can't enable shadow vq on %s: %s", name, err_cause);
+    }
 }
-- 
2.27.0




* [RFC v2 06/13] vhost: Route host->guest notification through shadow virtqueue
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (4 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-16  7:21   ` Jason Wang
  2021-03-15 19:48 ` [RFC v2 07/13] vhost: Avoid re-set masked notifier in shadow vq Eugenio Pérez
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

On one hand, it uses a mutex to synchronize guest masking with SVQ
start and stop, because otherwise a guest mask operation could race
with the SVQ stop code, sending an incorrect call notifier to the
vhost device. This would prevent further communication.

On the other hand, it needs to add an event to synchronize guest
unmasking with call handling. Not doing it that way could cause the
guest to receive notifications after its unmask call. This could be
done through the mutex, but the event solution is cheaper for the
buffer forwarding.
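
Condensed, the unmask handshake implemented below looks like this (the
reader side runs in qemu's event loop, the writer side in the vCPU or
QMP thread):

    /* Call handler (reader) */
    qemu_event_reset(&svq->masked_notifier.is_free);
    masked = qatomic_load_acquire(&svq->masked_notifier.n);
    /* ... signal masked notifier, or notify the guest directly ... */
    qemu_event_set(&svq->masked_notifier.is_free);

    /* Unmask (writer) */
    qatomic_store_release(&svq->masked_notifier.n, NULL);
    qemu_event_wait(&svq->masked_notifier.is_free);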

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |   3 +
 include/hw/virtio/vhost.h          |   1 +
 hw/virtio/vhost-shadow-virtqueue.c | 127 +++++++++++++++++++++++++++++
 hw/virtio/vhost.c                  |  29 ++++++-
 4 files changed, 157 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index c891c6510d..2ca4b92b12 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -17,6 +17,9 @@
 
 typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
+void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked);
+void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq);
+
 bool vhost_shadow_vq_start(struct vhost_dev *dev,
                            unsigned idx,
                            VhostShadowVirtqueue *svq);
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 7ffdf9aea0..2f556bd3d5 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -29,6 +29,7 @@ struct vhost_virtqueue {
     unsigned long long used_phys;
     unsigned used_size;
     bool notifier_is_masked;
+    QemuRecMutex masked_mutex;
     EventNotifier masked_notifier;
     struct vhost_dev *dev;
 };
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 3e43399e9c..8f6ffa729a 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -32,8 +32,22 @@ typedef struct VhostShadowVirtqueue {
      */
     EventNotifier host_notifier;
 
+    /* (Possible) masked notifier */
+    struct {
+        EventNotifier *n;
+
+        /*
+         * Event to confirm unmasking.
+         * set when the masked notifier has no uses
+         */
+        QemuEvent is_free;
+    } masked_notifier;
+
     /* Virtio queue shadowing */
     VirtQueue *vq;
+
+    /* Virtio device */
+    VirtIODevice *vdev;
 } VhostShadowVirtqueue;
 
 /* Forward guest notifications */
@@ -49,6 +63,70 @@ static void vhost_handle_guest_kick(EventNotifier *n)
     event_notifier_set(&svq->kick_notifier);
 }
 
+/* Forward vhost notifications */
+static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
+{
+    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
+                                             call_notifier);
+    EventNotifier *masked_notifier;
+
+    /* Signal start of using masked notifier */
+    qemu_event_reset(&svq->masked_notifier.is_free);
+    masked_notifier = qatomic_load_acquire(&svq->masked_notifier.n);
+    if (!masked_notifier) {
+        qemu_event_set(&svq->masked_notifier.is_free);
+    }
+
+    if (!masked_notifier) {
+        unsigned n = virtio_get_queue_index(svq->vq);
+        virtio_queue_invalidate_signalled_used(svq->vdev, n);
+        virtio_notify_irqfd(svq->vdev, svq->vq);
+    } else {
+        event_notifier_set(svq->masked_notifier.n);
+    }
+
+    if (masked_notifier) {
+        /* Signal not using it anymore */
+        qemu_event_set(&svq->masked_notifier.is_free);
+    }
+}
+
+static void vhost_shadow_vq_handle_call(EventNotifier *n)
+{
+
+    if (likely(event_notifier_test_and_clear(n))) {
+        vhost_shadow_vq_handle_call_no_test(n);
+    }
+}
+
+/*
+ * Mask the shadow virtqueue.
+ *
+ * It can be called from a guest masking vmexit or shadow virtqueue start
+ * through QMP.
+ *
+ * @vq Shadow virtqueue
+ * @masked Masked notifier to signal instead of guest
+ */
+void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked)
+{
+    qatomic_store_release(&svq->masked_notifier.n, masked);
+}
+
+/*
+ * Unmask the shadow virtqueue.
+ *
+ * It can be called from a guest unmasking vmexit or shadow virtqueue start
+ * through QMP.
+ *
+ * @vq Shadow virtqueue
+ */
+void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq)
+{
+    qatomic_store_release(&svq->masked_notifier.n, NULL);
+    qemu_event_wait(&svq->masked_notifier.is_free);
+}
+
 /*
  * Restore the vhost guest to host notifier, i.e., disables svq effect.
  */
@@ -103,8 +181,39 @@ bool vhost_shadow_vq_start(struct vhost_dev *dev,
         goto err_set_vring_kick;
     }
 
+    /* Set vhost call */
+    file.fd = event_notifier_get_fd(&svq->call_notifier),
+    r = dev->vhost_ops->vhost_set_vring_call(dev, &file);
+    if (unlikely(r != 0)) {
+        error_report("Couldn't set call fd: %s", strerror(errno));
+        goto err_set_vring_call;
+    }
+
+
+    /*
+     * Lock to avoid a race condition between guest setting masked status and
+     * us.
+     */
+    QEMU_LOCK_GUARD(&dev->vqs[idx].masked_mutex);
+    /* Set shadow vq -> guest notifier */
+    assert(dev->shadow_vqs_enabled);
+    vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
+                         dev->vqs[idx].notifier_is_masked);
+
+    if (dev->vqs[idx].notifier_is_masked &&
+               event_notifier_test_and_clear(&dev->vqs[idx].masked_notifier)) {
+        /* Check for pending notifications from the device */
+        vhost_shadow_vq_handle_call_no_test(&svq->call_notifier);
+    }
+
     return true;
 
+err_set_vring_call:
+    r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
+    if (unlikely(r < 0)) {
+        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
+    }
+
 err_set_vring_kick:
     event_notifier_set_handler(&svq->host_notifier, NULL);
 
@@ -126,7 +235,19 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
         error_report("Couldn't restore vq kick fd: %s", strerror(-r));
     }
 
+    assert(!dev->shadow_vqs_enabled);
+
     event_notifier_set_handler(&svq->host_notifier, NULL);
+
+    /*
+     * Lock to avoid a race condition between guest setting masked status and
+     * us.
+     */
+    QEMU_LOCK_GUARD(&dev->vqs[idx].masked_mutex);
+
+    /* Restore vhost call */
+    vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
+                         dev->vqs[idx].notifier_is_masked);
 }
 
 /*
@@ -154,6 +275,10 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
     }
 
     svq->vq = virtio_get_queue(dev->vdev, vq_idx);
+    svq->vdev = dev->vdev;
+    event_notifier_set_handler(&svq->call_notifier,
+                               vhost_shadow_vq_handle_call);
+    qemu_event_init(&svq->masked_notifier.is_free, true);
     return g_steal_pointer(&svq);
 
 err_init_call_notifier:
@@ -168,7 +293,9 @@ err_init_kick_notifier:
  */
 void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
 {
+    qemu_event_destroy(&vq->masked_notifier.is_free);
     event_notifier_cleanup(&vq->kick_notifier);
+    event_notifier_set_handler(&vq->call_notifier, NULL);
     event_notifier_cleanup(&vq->call_notifier);
     g_free(vq);
 }
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 4858a35df6..eab3e334f2 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1224,7 +1224,8 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
 {
     int idx;
 
-    dev->shadow_vqs_enabled = false;
+    /* Can be read by vhost_virtqueue_mask, from vm exit */
+    qatomic_store_release(&dev->shadow_vqs_enabled, false);
 
     for (idx = 0; idx < dev->nvqs; ++idx) {
         vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
@@ -1248,7 +1249,8 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
         }
     }
 
-    dev->shadow_vqs_enabled = true;
+    /* Can be read by vhost_virtqueue_mask, from vm exit */
+    qatomic_store_release(&dev->shadow_vqs_enabled, true);
     for (idx = 0; idx < dev->nvqs; ++idx) {
         bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
         if (unlikely(!ok)) {
@@ -1259,7 +1261,7 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
     return 0;
 
 err_start:
-    dev->shadow_vqs_enabled = false;
+    qatomic_store_release(&dev->shadow_vqs_enabled, false);
     for (stop_idx = 0; stop_idx < idx; stop_idx++) {
         vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
     }
@@ -1343,6 +1345,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
         goto fail_call;
     }
 
+    qemu_rec_mutex_init(&vq->masked_mutex);
     vq->dev = dev;
 
     return 0;
@@ -1353,6 +1356,7 @@ fail_call:
 
 static void vhost_virtqueue_cleanup(struct vhost_virtqueue *vq)
 {
+    qemu_rec_mutex_destroy(&vq->masked_mutex);
     event_notifier_cleanup(&vq->masked_notifier);
 }
 
@@ -1591,6 +1595,25 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, VirtIODevice *vdev, int n,
     /* should only be called after backend is connected */
     assert(hdev->vhost_ops);
 
+    /* Avoid race condition with shadow virtqueue stop/start */
+    QEMU_LOCK_GUARD(&hdev->vqs[index].masked_mutex);
+
+    /* Set by QMP thread, so using acquire semantics */
+    if (qatomic_load_acquire(&hdev->shadow_vqs_enabled)) {
+        if (mask) {
+            vhost_shadow_vq_mask(hdev->shadow_vqs[index],
+                                 &hdev->vqs[index].masked_notifier);
+        } else {
+            vhost_shadow_vq_unmask(hdev->shadow_vqs[index]);
+        }
+
+        /*
+         * Vhost call fd must remain the same since shadow vq is not polling
+         * for changes
+         */
+        return;
+    }
+
     if (mask) {
         assert(vdev->use_guest_notifier_mask);
         file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
-- 
2.27.0




* [RFC v2 07/13] vhost: Avoid re-set masked notifier in shadow vq
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (5 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 06/13] vhost: Route host->guest " Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-15 19:48 ` [RFC v2 08/13] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

Since all of the shadow virtqueue device side is done in software, we
can avoid re-signaling an already signaled masked notifier, saving the
write syscall.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 8f6ffa729a..b6bab438d6 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -41,6 +41,9 @@ typedef struct VhostShadowVirtqueue {
          * set when the masked notifier has no uses
          */
         QemuEvent is_free;
+
+        /* Avoid re-sending signals */
+        bool signaled;
     } masked_notifier;
 
     /* Virtio queue shadowing */
@@ -81,7 +84,8 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
         unsigned n = virtio_get_queue_index(svq->vq);
         virtio_queue_invalidate_signalled_used(svq->vdev, n);
         virtio_notify_irqfd(svq->vdev, svq->vq);
-    } else {
+    } else if (!svq->masked_notifier.signaled) {
+        svq->masked_notifier.signaled = true;
         event_notifier_set(svq->masked_notifier.n);
     }
 
@@ -110,6 +114,7 @@ static void vhost_shadow_vq_handle_call(EventNotifier *n)
  */
 void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked)
 {
+    svq->masked_notifier.signaled = false;
     qatomic_store_release(&svq->masked_notifier.n, masked);
 }
 
-- 
2.27.0




* [RFC v2 08/13] virtio: Add vhost_shadow_vq_get_vring_addr
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (6 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 07/13] vhost: Avoid re-set masked notifier in shadow vq Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-16  7:50   ` Jason Wang
  2021-03-15 19:48 ` [RFC v2 09/13] virtio: Add virtio_queue_full Eugenio Pérez
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

It reports the shadow virtqueue addresses in qemu's virtual address
space.
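
A plausible consumer (an assumption about a later patch in this
series, not shown here) would hand these addresses to the vhost
backend:

    struct vhost_vring_addr addr;

    vhost_shadow_vq_get_vring_addr(svq, &addr);
    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);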

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |  2 ++
 hw/virtio/vhost-shadow-virtqueue.c | 24 +++++++++++++++++++++++-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 2ca4b92b12..d82c35bccf 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -19,6 +19,8 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
 void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked);
 void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq);
+void vhost_shadow_vq_get_vring_addr(const VhostShadowVirtqueue *svq,
+                                    struct vhost_vring_addr *addr);
 
 bool vhost_shadow_vq_start(struct vhost_dev *dev,
                            unsigned idx,
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index b6bab438d6..1460d1d5d1 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -17,6 +17,9 @@
 
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
+    /* Shadow vring */
+    struct vring vring;
+
     /* Shadow kick notifier, sent to vhost */
     EventNotifier kick_notifier;
     /* Shadow call notifier, sent to vhost */
@@ -51,6 +54,9 @@ typedef struct VhostShadowVirtqueue {
 
     /* Virtio device */
     VirtIODevice *vdev;
+
+    /* Descriptors copied from guest */
+    vring_desc_t descs[];
 } VhostShadowVirtqueue;
 
 /* Forward guest notifications */
@@ -132,6 +138,19 @@ void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq)
     qemu_event_wait(&svq->masked_notifier.is_free);
 }
 
+/*
+ * Get the shadow vq vring address.
+ * @svq Shadow virtqueue
+ * @addr Destination to store address
+ */
+void vhost_shadow_vq_get_vring_addr(const VhostShadowVirtqueue *svq,
+                                    struct vhost_vring_addr *addr)
+{
+    addr->desc_user_addr = (uint64_t)svq->vring.desc;
+    addr->avail_user_addr = (uint64_t)svq->vring.avail;
+    addr->used_user_addr = (uint64_t)svq->vring.used;
+}
+
 /*
  * Restore the vhost guest to host notifier, i.e., disables svq effect.
  */
@@ -262,7 +281,9 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
 VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
 {
     int vq_idx = dev->vq_index + idx;
-    g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
+    unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
+    size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
+    g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
     int r;
 
     r = event_notifier_init(&svq->kick_notifier, 0);
@@ -279,6 +300,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
         goto err_init_call_notifier;
     }
 
+    vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
     svq->vq = virtio_get_queue(dev->vdev, vq_idx);
     svq->vdev = dev->vdev;
     event_notifier_set_handler(&svq->call_notifier,
-- 
2.27.0




* [RFC v2 09/13] virtio: Add virtio_queue_full
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (7 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 08/13] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-15 19:48 ` [RFC v2 10/13] vhost: add vhost_kernel_set_vring_enable Eugenio Pérez
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

Check if all descriptors of the queue are available. In other words,
this is the complete opposite of virtio_queue_empty: if the queue is
full, the driver cannot transfer more buffers to the device until the
latter marks some as used.

In shadow vq this situation happens even with a correct guest network
driver, since the rx queue is filled for the device to write into.
Since the shadow virtqueue forwards the available buffers blindly, it
would keep popping from the guest's driver ring, reaching the point
where no more descriptors are available.

While a straightforward solution is to keep a count of them in SVQ,
this specific issue is the only need for that counter. Exposing this
check helps to keep the SVQ simpler, storing as little state as
possible.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/virtio.h |  2 ++
 hw/virtio/virtio.c         | 18 ++++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index c2c7cee993..899c5e3506 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -232,6 +232,8 @@ int virtio_queue_ready(VirtQueue *vq);
 
 int virtio_queue_empty(VirtQueue *vq);
 
+bool virtio_queue_full(const VirtQueue *vq);
+
 /* Host binding interface.  */
 
 uint32_t virtio_config_readb(VirtIODevice *vdev, uint32_t addr);
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index a86b3f9c26..e9a4d9ffae 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -670,6 +670,20 @@ int virtio_queue_empty(VirtQueue *vq)
     }
 }
 
+/*
+ * virtio_queue_full:
+ * @vq The #VirtQueue
+ *
+ * Check if all descriptors of the queue are available. In other words, is the
+ * complete opposite of virtio_queue_empty: If the queue is full, the driver
+ * cannot transfer more buffers to the device until the latter make some as
+ * used.
+ */
+bool virtio_queue_full(const VirtQueue *vq)
+{
+    return vq->inuse >= vq->vring.num;
+}
+
 static void virtqueue_unmap_sg(VirtQueue *vq, const VirtQueueElement *elem,
                                unsigned int len)
 {
@@ -1439,7 +1453,7 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
 
     max = vq->vring.num;
 
-    if (vq->inuse >= vq->vring.num) {
+    if (unlikely(virtio_queue_full(vq))) {
         virtio_error(vdev, "Virtqueue size exceeded");
         goto done;
     }
@@ -1574,7 +1588,7 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
 
     max = vq->vring.num;
 
-    if (vq->inuse >= vq->vring.num) {
+    if (unlikely(virtio_queue_full(vq))) {
         virtio_error(vdev, "Virtqueue size exceeded");
         goto done;
     }
-- 
2.27.0




* [RFC v2 10/13] vhost: add vhost_kernel_set_vring_enable
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (8 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 09/13] virtio: Add virtio_queue_full Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-16  7:29   ` Jason Wang
  2021-03-15 19:48 ` [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

This method is already present in vhost-user. This commit adapts it to
vhost-net, so SVQ can use it.

vhost_kernel_set_vring_enable stops the device, so qemu can ask for its
status (the next available idx the device was going to consume). When
SVQ starts it can resume consuming the guest's driver ring, without
notice from the latter. Not stopping the device before the swap could
imply that it processes more buffers than reported, which would
duplicate the device's actions.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-backend.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index 31b33bde37..1ac5c574a9 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -201,6 +201,34 @@ static int vhost_kernel_get_vq_index(struct vhost_dev *dev, int idx)
     return idx - dev->vq_index;
 }
 
+static int vhost_kernel_set_vq_enable(struct vhost_dev *dev, unsigned idx,
+                                      bool enable)
+{
+    struct vhost_vring_file file = {
+        .index = idx,
+    };
+
+    if (!enable) {
+        file.fd = -1; /* Pass -1 to unbind from file. */
+    } else {
+        struct vhost_net *vn_dev = container_of(dev, struct vhost_net, dev);
+        file.fd = vn_dev->backend;
+    }
+
+    return vhost_kernel_net_set_backend(dev, &file);
+}
+
+static int vhost_kernel_set_vring_enable(struct vhost_dev *dev, int enable)
+{
+    int i;
+
+    for (i = 0; i < dev->nvqs; ++i) {
+        vhost_kernel_set_vq_enable(dev, i, enable);
+    }
+
+    return 0;
+}
+
 #ifdef CONFIG_VHOST_VSOCK
 static int vhost_kernel_vsock_set_guest_cid(struct vhost_dev *dev,
                                             uint64_t guest_cid)
@@ -317,6 +345,7 @@ static const VhostOps kernel_ops = {
         .vhost_set_owner = vhost_kernel_set_owner,
         .vhost_reset_device = vhost_kernel_reset_device,
         .vhost_get_vq_index = vhost_kernel_get_vq_index,
+        .vhost_set_vring_enable = vhost_kernel_set_vring_enable,
 #ifdef CONFIG_VHOST_VSOCK
         .vhost_vsock_set_guest_cid = vhost_kernel_vsock_set_guest_cid,
         .vhost_vsock_set_running = vhost_kernel_vsock_set_running,
-- 
2.27.0




* [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (9 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 10/13] vhost: add vhost_kernel_set_vring_enable Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-16  8:15   ` Jason Wang
  2021-03-15 19:48 ` [RFC v2 12/13] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

Initial version of the shadow virtqueue that actually forwards
buffers.

It reuses the VirtQueue code for the device part. The driver part is
based on Linux's virtio_ring driver, but with stripped functionality
and optimizations so it's easier to review.

These will be added in later commits.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 212 +++++++++++++++++++++++++++--
 hw/virtio/vhost.c                  | 113 ++++++++++++++-
 2 files changed, 312 insertions(+), 13 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 1460d1d5d1..68ed0f2740 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -9,6 +9,7 @@
 
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/vhost.h"
+#include "hw/virtio/virtio-access.h"
 
 #include "standard-headers/linux/vhost_types.h"
 
@@ -55,11 +56,96 @@ typedef struct VhostShadowVirtqueue {
     /* Virtio device */
     VirtIODevice *vdev;
 
+    /* Map for returning guest's descriptors */
+    VirtQueueElement **ring_id_maps;
+
+    /* Next head to expose to device */
+    uint16_t avail_idx_shadow;
+
+    /* Next free descriptor */
+    uint16_t free_head;
+
+    /* Last seen used idx */
+    uint16_t shadow_used_idx;
+
+    /* Next head to consume from device */
+    uint16_t used_idx;
+
     /* Descriptors copied from guest */
     vring_desc_t descs[];
 } VhostShadowVirtqueue;
 
-/* Forward guest notifications */
+static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
+                                    const struct iovec *iovec,
+                                    size_t num, bool more_descs, bool write)
+{
+    uint16_t i = svq->free_head, last = svq->free_head;
+    unsigned n;
+    uint16_t flags = write ? virtio_tswap16(svq->vdev, VRING_DESC_F_WRITE) : 0;
+    vring_desc_t *descs = svq->vring.desc;
+
+    if (num == 0) {
+        return;
+    }
+
+    for (n = 0; n < num; n++) {
+        if (more_descs || (n + 1 < num)) {
+            descs[i].flags = flags | virtio_tswap16(svq->vdev,
+                                                    VRING_DESC_F_NEXT);
+        } else {
+            descs[i].flags = flags;
+        }
+        descs[i].addr = virtio_tswap64(svq->vdev, (hwaddr)iovec[n].iov_base);
+        descs[i].len = virtio_tswap32(svq->vdev, iovec[n].iov_len);
+
+        last = i;
+        i = virtio_tswap16(svq->vdev, descs[i].next);
+    }
+
+    svq->free_head = virtio_tswap16(svq->vdev, descs[last].next);
+}
+
+static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
+                                          VirtQueueElement *elem)
+{
+    int head;
+    unsigned avail_idx;
+    vring_avail_t *avail = svq->vring.avail;
+
+    head = svq->free_head;
+
+    /* We need some descriptors here */
+    assert(elem->out_num || elem->in_num);
+
+    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
+                            elem->in_num > 0, false);
+    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
+
+    /*
+     * Put entry in available array (but don't update avail->idx until they
+     * do sync).
+     */
+    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
+    avail->ring[avail_idx] = virtio_tswap16(svq->vdev, head);
+    svq->avail_idx_shadow++;
+
+    /* Expose descriptors to device */
+    smp_wmb();
+    avail->idx = virtio_tswap16(svq->vdev, svq->avail_idx_shadow);
+
+    return head;
+}
+
+static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
+                                VirtQueueElement *elem)
+{
+    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
+
+    svq->ring_id_maps[qemu_head] = elem;
+}
+
+/* Handle guest->device notifications */
 static void vhost_handle_guest_kick(EventNotifier *n)
 {
     VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
@@ -69,7 +155,72 @@ static void vhost_handle_guest_kick(EventNotifier *n)
         return;
     }
 
-    event_notifier_set(&svq->kick_notifier);
+    /* Make available as many buffers as possible */
+    do {
+        if (virtio_queue_get_notification(svq->vq)) {
+            /* No more notifications until we process all available buffers */
+            virtio_queue_set_notification(svq->vq, false);
+        }
+
+        while (true) {
+            VirtQueueElement *elem;
+            if (virtio_queue_full(svq->vq)) {
+                break;
+            }
+
+            elem = virtqueue_pop(svq->vq, sizeof(*elem));
+            if (!elem) {
+                break;
+            }
+
+            vhost_shadow_vq_add(svq, elem);
+            event_notifier_set(&svq->kick_notifier);
+        }
+
+        virtio_queue_set_notification(svq->vq, true);
+    } while (!virtio_queue_empty(svq->vq));
+}
+
+static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
+{
+    if (svq->used_idx != svq->shadow_used_idx) {
+        return true;
+    }
+
+    /* The load of the device's used idx must not be reordered */
+    smp_rmb();
+    svq->shadow_used_idx = virtio_tswap16(svq->vdev, svq->vring.used->idx);
+
+    return svq->used_idx != svq->shadow_used_idx;
+}
+
+static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
+{
+    vring_desc_t *descs = svq->vring.desc;
+    const vring_used_t *used = svq->vring.used;
+    vring_used_elem_t used_elem;
+    uint16_t last_used;
+
+    if (!vhost_shadow_vq_more_used(svq)) {
+        return NULL;
+    }
+
+    last_used = svq->used_idx & (svq->vring.num - 1);
+    used_elem.id = virtio_tswap32(svq->vdev, used->ring[last_used].id);
+    used_elem.len = virtio_tswap32(svq->vdev, used->ring[last_used].len);
+
+    if (unlikely(used_elem.id >= svq->vring.num)) {
+        error_report("Device %s says index %u is available", svq->vdev->name,
+                     used_elem.id);
+        return NULL;
+    }
+
+    descs[used_elem.id].next = svq->free_head;
+    svq->free_head = used_elem.id;
+
+    svq->used_idx++;
+    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
+    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
 }
 
 /* Forward vhost notifications */
@@ -78,6 +229,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
     VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
                                              call_notifier);
     EventNotifier *masked_notifier;
+    VirtQueue *vq = svq->vq;
 
     /* Signal start of using masked notifier */
     qemu_event_reset(&svq->masked_notifier.is_free);
@@ -86,14 +238,29 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
         qemu_event_set(&svq->masked_notifier.is_free);
     }
 
-    if (!masked_notifier) {
-        unsigned n = virtio_get_queue_index(svq->vq);
-        virtio_queue_invalidate_signalled_used(svq->vdev, n);
-        virtio_notify_irqfd(svq->vdev, svq->vq);
-    } else if (!svq->masked_notifier.signaled) {
-        svq->masked_notifier.signaled = true;
-        event_notifier_set(svq->masked_notifier.n);
-    }
+    /* Make as many buffers as possible used. */
+    do {
+        unsigned i = 0;
+
+        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
+        while (true) {
+            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
+            if (!elem) {
+                break;
+            }
+
+            assert(i < svq->vring.num);
+            virtqueue_fill(vq, elem, elem->len, i++);
+        }
+
+        virtqueue_flush(vq, i);
+        if (!masked_notifier) {
+            virtio_notify_irqfd(svq->vdev, svq->vq);
+        } else if (!svq->masked_notifier.signaled) {
+            svq->masked_notifier.signaled = true;
+            event_notifier_set(svq->masked_notifier.n);
+        }
+    } while (vhost_shadow_vq_more_used(svq));
 
     if (masked_notifier) {
         /* Signal not using it anymore */
@@ -103,7 +270,6 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
 
 static void vhost_shadow_vq_handle_call(EventNotifier *n)
 {
-
     if (likely(event_notifier_test_and_clear(n))) {
         vhost_shadow_vq_handle_call_no_test(n);
     }
@@ -254,7 +420,11 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
                           unsigned idx,
                           VhostShadowVirtqueue *svq)
 {
+    int i;
     int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
+
+    assert(!dev->shadow_vqs_enabled);
+
     if (unlikely(r < 0)) {
         error_report("Couldn't restore vq kick fd: %s", strerror(-r));
     }
@@ -272,6 +442,18 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
     /* Restore vhost call */
     vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
                          dev->vqs[idx].notifier_is_masked);
+
+    for (i = 0; i < svq->vring.num; ++i) {
+        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
+        /*
+         * Although the doc says we must unpop in order, it's ok to unpop
+         * everything.
+         */
+        if (elem) {
+            virtqueue_unpop(svq->vq, elem, elem->len);
+        }
+    }
 }
 
 /*
@@ -284,7 +466,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
     unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
     size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
     g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
-    int r;
+    int r, i;
 
     r = event_notifier_init(&svq->kick_notifier, 0);
     if (r != 0) {
@@ -303,6 +485,11 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
     vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
     svq->vq = virtio_get_queue(dev->vdev, vq_idx);
     svq->vdev = dev->vdev;
+    for (i = 0; i < num - 1; i++) {
+        svq->descs[i].next = virtio_tswap16(dev->vdev, i + 1);
+    }
+
+    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
     event_notifier_set_handler(&svq->call_notifier,
                                vhost_shadow_vq_handle_call);
     qemu_event_init(&svq->masked_notifier.is_free, true);
@@ -324,5 +511,6 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
     event_notifier_cleanup(&vq->kick_notifier);
     event_notifier_set_handler(&vq->call_notifier, NULL);
     event_notifier_cleanup(&vq->call_notifier);
+    g_free(vq->ring_id_maps);
     g_free(vq);
 }
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index eab3e334f2..a373999bc4 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1021,6 +1021,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
 
     trace_vhost_iotlb_miss(dev, 1);
 
+    if (qatomic_load_acquire(&dev->shadow_vqs_enabled)) {
+        uaddr = iova;
+        len = 4096;
+        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
+                                                IOMMU_RW);
+        if (ret) {
+            trace_vhost_iotlb_miss(dev, 2);
+            error_report("Fail to update device iotlb");
+        }
+
+        return ret;
+    }
+
     iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
                                           iova, write,
                                           MEMTXATTRS_UNSPECIFIED);
@@ -1227,8 +1240,28 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
     /* Can be read by vhost_virtqueue_mask, from vm exit */
     qatomic_store_release(&dev->shadow_vqs_enabled, false);
 
+    dev->vhost_ops->vhost_set_vring_enable(dev, false);
+    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
+        error_report("Fail to invalidate device iotlb");
+    }
+
     for (idx = 0; idx < dev->nvqs; ++idx) {
+        /*
+         * Update used ring information for IOTLB to work correctly,
+         * the vhost-kernel code requires this.
+         */
+        struct vhost_virtqueue *vq = dev->vqs + idx;
+        vhost_device_iotlb_miss(dev, vq->used_phys, true);
+
         vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
+        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
+                              dev->vq_index + idx);
+    }
+
+    /* Enable guest's vq vring */
+    dev->vhost_ops->vhost_set_vring_enable(dev, true);
+
+    for (idx = 0; idx < dev->nvqs; ++idx) {
         vhost_shadow_vq_free(dev->shadow_vqs[idx]);
     }
 
@@ -1237,6 +1270,59 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
     return 0;
 }
 
+/*
+ * Start shadow virtqueue in a given queue.
+ * On failure, this function leaves the queue working in regular vhost mode.
+ */
+static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
+                                             unsigned idx)
+{
+    struct vhost_vring_addr addr = {
+        .index = idx,
+    };
+    struct vhost_vring_state s = {
+        .index = idx,
+    };
+    int r;
+    bool ok;
+
+    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
+    ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
+    if (unlikely(!ok)) {
+        return false;
+    }
+
+    /* From this point, vhost_virtqueue_start can reset these changes */
+    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
+    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
+    if (unlikely(r != 0)) {
+        VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
+        goto err;
+    }
+
+    r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
+    if (unlikely(r != 0)) {
+        VHOST_OPS_DEBUG("vhost_set_vring_base for shadow vq failed");
+        goto err;
+    }
+
+    /*
+     * Update used ring information for IOTLB to work correctly,
+     * the vhost-kernel code requires this.
+     */
+    r = vhost_device_iotlb_miss(dev, addr.used_user_addr, true);
+    if (unlikely(r != 0)) {
+        /* Debug message already printed */
+        goto err;
+    }
+
+    return true;
+
+err:
+    vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
+    return false;
+}
+
 static int vhost_sw_live_migration_start(struct vhost_dev *dev)
 {
     int idx, stop_idx;
@@ -1249,24 +1335,35 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
         }
     }
 
+    dev->vhost_ops->vhost_set_vring_enable(dev, false);
+    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
+        error_report("Fail to invalidate device iotlb");
+    }
+
     /* Can be read by vhost_virtqueue_mask, from vm exit */
     qatomic_store_release(&dev->shadow_vqs_enabled, true);
     for (idx = 0; idx < dev->nvqs; ++idx) {
-        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
+        bool ok = vhost_sw_live_migration_start_vq(dev, idx);
         if (unlikely(!ok)) {
             goto err_start;
         }
     }
 
+    /* Enable shadow vq vring */
+    dev->vhost_ops->vhost_set_vring_enable(dev, true);
     return 0;
 
 err_start:
     qatomic_store_release(&dev->shadow_vqs_enabled, false);
     for (stop_idx = 0; stop_idx < idx; stop_idx++) {
         vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
+        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[stop_idx],
+                              dev->vq_index + stop_idx);
     }
 
 err_new:
+    /* Enable guest's vring */
+    dev->vhost_ops->vhost_set_vring_enable(dev, true);
     for (idx = 0; idx < dev->nvqs; ++idx) {
         vhost_shadow_vq_free(dev->shadow_vqs[idx]);
     }
@@ -1970,6 +2067,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
 
         if (!hdev->started) {
             err_cause = "Device is not started";
+        } else if (!vhost_dev_has_iommu(hdev)) {
+            err_cause = "Does not support iommu";
+        } else if (hdev->acked_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
+            err_cause = "Is packed";
+        } else if (hdev->acked_features & BIT_ULL(VIRTIO_RING_F_EVENT_IDX)) {
+            err_cause = "Have event idx";
+        } else if (hdev->acked_features &
+                   BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC)) {
+            err_cause = "Supports indirect descriptors";
+        } else if (!hdev->vhost_ops->vhost_set_vring_enable) {
+            err_cause = "Cannot pause device";
+        }
+
+        if (err_cause) {
             goto err;
         }
 
-- 
2.27.0




* [RFC v2 12/13] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (10 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-16  8:07   ` Jason Wang
  2021-03-15 19:48 ` [RFC v2 13/13] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
  2021-03-16  8:28 ` [RFC v2 00/13] vDPA software assisted live migration Jason Wang
  13 siblings, 1 reply; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 68ed0f2740..7df98fc43f 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -145,6 +145,15 @@ static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
     svq->ring_id_maps[qemu_head] = elem;
 }
 
+static void vhost_shadow_vq_kick(VhostShadowVirtqueue *svq)
+{
+    /* Make sure we are reading updated device flag */
+    smp_rmb();
+    if (!(svq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
+        event_notifier_set(&svq->kick_notifier);
+    }
+}
+
 /* Handle guest->device notifications */
 static void vhost_handle_guest_kick(EventNotifier *n)
 {
@@ -174,7 +183,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
             }
 
             vhost_shadow_vq_add(svq, elem);
-            event_notifier_set(&svq->kick_notifier);
+            vhost_shadow_vq_kick(svq);
         }
 
         virtio_queue_set_notification(svq->vq, true);
-- 
2.27.0




* [RFC v2 13/13] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (11 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 12/13] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
@ 2021-03-15 19:48 ` Eugenio Pérez
  2021-03-16  8:08   ` Jason Wang
  2021-03-16  8:28 ` [RFC v2 00/13] vDPA software assisted live migration Jason Wang
  13 siblings, 1 reply; 46+ messages in thread
From: Eugenio Pérez @ 2021-03-15 19:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 7df98fc43f..e3879a4622 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -71,10 +71,35 @@ typedef struct VhostShadowVirtqueue {
     /* Next head to consume from device */
     uint16_t used_idx;
 
+    /* Cache for the exposed notification flag */
+    bool notification;
+
     /* Descriptors copied from guest */
     vring_desc_t descs[];
 } VhostShadowVirtqueue;
 
+static void vhost_shadow_vq_set_notification(VhostShadowVirtqueue *svq,
+                                             bool enable)
+{
+    uint16_t notification_flag;
+
+    if (svq->notification == enable) {
+        return;
+    }
+
+    notification_flag = virtio_tswap16(svq->vdev, VRING_AVAIL_F_NO_INTERRUPT);
+
+    svq->notification = enable;
+    if (enable) {
+        svq->vring.avail->flags &= ~notification_flag;
+    } else {
+        svq->vring.avail->flags |= notification_flag;
+    }
+
+    /* Make sure device reads our flag */
+    smp_mb();
+}
+
 static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
                                     const struct iovec *iovec,
                                     size_t num, bool more_descs, bool write)
@@ -251,7 +276,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
     do {
         unsigned i = 0;
 
-        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
+        vhost_shadow_vq_set_notification(svq, false);
         while (true) {
             g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
             if (!elem) {
@@ -269,6 +294,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
             svq->masked_notifier.signaled = true;
             event_notifier_set(svq->masked_notifier.n);
         }
+        vhost_shadow_vq_set_notification(svq, true);
     } while (vhost_shadow_vq_more_used(svq));
 
     if (masked_notifier) {
-- 
2.27.0




* Re: [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-15 19:48 ` [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue Eugenio Pérez
@ 2021-03-16  7:18   ` Jason Wang
  2021-03-16 10:31     ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-16  7:18 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Eli Cohen, Stefano Garzarella, Michael Lilja,
	Jim Harford, Rob Miller


On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> Shadow virtqueue notifications forwarding is disabled when vhost_dev
> stops, so code flow follows usual cleanup.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |   7 ++
>   include/hw/virtio/vhost.h          |   4 +
>   hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
>   hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
>   4 files changed, 265 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 6cc18d6acb..c891c6510d 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -17,6 +17,13 @@
>   
>   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>   
> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> +                           unsigned idx,
> +                           VhostShadowVirtqueue *svq);
> +void vhost_shadow_vq_stop(struct vhost_dev *dev,
> +                          unsigned idx,
> +                          VhostShadowVirtqueue *svq);
> +
>   VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
>   
>   void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index ac963bf23d..7ffdf9aea0 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -55,6 +55,8 @@ struct vhost_iommu {
>       QLIST_ENTRY(vhost_iommu) iommu_next;
>   };
>   
> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> +
>   typedef struct VhostDevConfigOps {
>       /* Vhost device config space changed callback
>        */
> @@ -83,7 +85,9 @@ struct vhost_dev {
>       uint64_t backend_cap;
>       bool started;
>       bool log_enabled;
> +    bool shadow_vqs_enabled;
>       uint64_t log_size;
> +    VhostShadowVirtqueue **shadow_vqs;


Any reason you don't embed the shadow virtqueue into the 
vhost_virtqueue structure?

(Note that there's a masked_notifier in struct vhost_virtqueue).


>       Error *migration_blocker;
>       const VhostOps *vhost_ops;
>       void *opaque;
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 4512e5b058..3e43399e9c 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -8,9 +8,12 @@
>    */
>   
>   #include "hw/virtio/vhost-shadow-virtqueue.h"
> +#include "hw/virtio/vhost.h"
> +
> +#include "standard-headers/linux/vhost_types.h"
>   
>   #include "qemu/error-report.h"
> -#include "qemu/event_notifier.h"
> +#include "qemu/main-loop.h"
>   
>   /* Shadow virtqueue to relay notifications */
>   typedef struct VhostShadowVirtqueue {
> @@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
>       EventNotifier kick_notifier;
>       /* Shadow call notifier, sent to vhost */
>       EventNotifier call_notifier;
> +
> +    /*
> +     * Borrowed virtqueue's guest to host notifier.
> +     * To borrow it in this event notifier allows to register on the event
> +     * loop and access the associated shadow virtqueue easily. If we use the
> +     * VirtQueue, we don't have an easy way to retrieve it.


So this is something that worries me. It looks like a layering 
violation that makes the code harder to get right.

I wonder if it would be simpler to start from a vDPA-dedicated shadow 
virtqueue implementation:

1) have the above fields embedded in the vhost_vdpa structure
2) work at the level of 
vhost_vdpa_set_vring_kick()/vhost_vdpa_set_vring_call() (see the sketch 
after this list)

Then the layering is still isolated and you have a much simpler context 
to work in, where you don't need to care about a lot of synchronization:

1) vq masking
2) vhost dev start and stop

?
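
As a rough sketch of what interposing at that level could look like
(the vhost_svq_*() helpers, the new struct fields and vhost_vdpa_ioctl()
are made-up names; this is a sketch of the idea, not existing qemu
code):

/* Hypothetical: the shadow vq lives inside the vhost-vdpa backend, so
 * masking and start/stop never have to cross the generic vhost layer. */
struct vhost_vdpa {
    /* ... existing fields ... */
    bool shadow_vqs_enabled;
    VhostShadowVirtqueue **shadow_vqs;
};

static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
                                     struct vhost_vring_file *file)
{
    struct vhost_vdpa *v = dev->opaque;

    if (v->shadow_vqs_enabled) {
        /* Keep the guest's kick fd for the shadow vq to poll, and hand
         * the device the shadow vq's own kick notifier instead. */
        vhost_svq_set_guest_kick_fd(v->shadow_vqs[file->index], file->fd);
        file->fd = vhost_svq_device_kick_fd(v->shadow_vqs[file->index]);
    }

    return vhost_vdpa_ioctl(dev, VHOST_SET_VRING_KICK, file);
}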


> +     *
> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> +     */
> +    EventNotifier host_notifier;
> +
> +    /* Virtio queue shadowing */
> +    VirtQueue *vq;
>   } VhostShadowVirtqueue;
>   
> +/* Forward guest notifications */
> +static void vhost_handle_guest_kick(EventNotifier *n)
> +{
> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> +                                             host_notifier);
> +
> +    if (unlikely(!event_notifier_test_and_clear(n))) {
> +        return;
> +    }
> +
> +    event_notifier_set(&svq->kick_notifier);
> +}
> +
> +/*
> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
> + */
> +static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
> +                                                     unsigned vhost_index,
> +                                                     VhostShadowVirtqueue *svq)
> +{
> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> +    struct vhost_vring_file file = {
> +        .index = vhost_index,
> +        .fd = event_notifier_get_fd(vq_host_notifier),
> +    };
> +    int r;
> +
> +    /* Restore vhost kick */
> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> +    return r ? -errno : 0;
> +}
> +
> +/*
> + * Start shadow virtqueue operation.
> + * @dev vhost device
> + * @hidx vhost virtqueue index
> + * @svq Shadow Virtqueue
> + */
> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> +                           unsigned idx,
> +                           VhostShadowVirtqueue *svq)


It looks to me that this assumes the vhost_dev is started before 
vhost_shadow_vq_start()?


> +{
> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> +    struct vhost_vring_file file = {
> +        .index = idx,
> +        .fd = event_notifier_get_fd(&svq->kick_notifier),
> +    };
> +    int r;
> +
> +    /* Check that notifications are still going directly to vhost dev */
> +    assert(virtio_queue_is_host_notifier_enabled(svq->vq));
> +
> +    /*
> +     * event_notifier_set_handler already checks for guest's notifications if
> +     * they arrive in the switch, so there is no need to explicitely check for
> +     * them.
> +     */
> +    event_notifier_init_fd(&svq->host_notifier,
> +                           event_notifier_get_fd(vq_host_notifier));
> +    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);
> +
> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> +    if (unlikely(r != 0)) {
> +        error_report("Couldn't set kick fd: %s", strerror(errno));
> +        goto err_set_vring_kick;
> +    }
> +
> +    return true;
> +
> +err_set_vring_kick:
> +    event_notifier_set_handler(&svq->host_notifier, NULL);
> +
> +    return false;
> +}
> +
> +/*
> + * Stop shadow virtqueue operation.
> + * @dev vhost device
> + * @idx vhost queue index
> + * @svq Shadow Virtqueue
> + */
> +void vhost_shadow_vq_stop(struct vhost_dev *dev,
> +                          unsigned idx,
> +                          VhostShadowVirtqueue *svq)
> +{
> +    int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
> +    if (unlikely(r < 0)) {
> +        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> +    }
> +
> +    event_notifier_set_handler(&svq->host_notifier, NULL);
> +}
> +
>   /*
>    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
>    * methods and file descriptors.
>    */
>   VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>   {
> +    int vq_idx = dev->vq_index + idx;
>       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
>       int r;
>   
> @@ -43,6 +153,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>           goto err_init_call_notifier;
>       }
>   
> +    svq->vq = virtio_get_queue(dev->vdev, vq_idx);
>       return g_steal_pointer(&svq);
>   
>   err_init_call_notifier:
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 97f1bcfa42..4858a35df6 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -25,6 +25,7 @@
>   #include "exec/address-spaces.h"
>   #include "hw/virtio/virtio-bus.h"
>   #include "hw/virtio/virtio-access.h"
> +#include "hw/virtio/vhost-shadow-virtqueue.h"
>   #include "migration/blocker.h"
>   #include "migration/qemu-file-types.h"
>   #include "sysemu/dma.h"
> @@ -1219,6 +1220,74 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
>                          0, virtio_queue_get_desc_size(vdev, idx));
>   }
>   
> +static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> +{
> +    int idx;
> +
> +    dev->shadow_vqs_enabled = false;
> +
> +    for (idx = 0; idx < dev->nvqs; ++idx) {
> +        vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
> +        vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> +    }
> +
> +    g_free(dev->shadow_vqs);
> +    dev->shadow_vqs = NULL;
> +    return 0;
> +}
> +
> +static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> +{
> +    int idx, stop_idx;
> +
> +    dev->shadow_vqs = g_new0(VhostShadowVirtqueue *, dev->nvqs);
> +    for (idx = 0; idx < dev->nvqs; ++idx) {
> +        dev->shadow_vqs[idx] = vhost_shadow_vq_new(dev, idx);
> +        if (unlikely(dev->shadow_vqs[idx] == NULL)) {
> +            goto err_new;
> +        }
> +    }
> +
> +    dev->shadow_vqs_enabled = true;
> +    for (idx = 0; idx < dev->nvqs; ++idx) {
> +        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> +        if (unlikely(!ok)) {
> +            goto err_start;
> +        }
> +    }
> +
> +    return 0;
> +
> +err_start:
> +    dev->shadow_vqs_enabled = false;
> +    for (stop_idx = 0; stop_idx < idx; stop_idx++) {
> +        vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
> +    }
> +
> +err_new:
> +    for (idx = 0; idx < dev->nvqs; ++idx) {
> +        vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> +    }
> +    g_free(dev->shadow_vqs);
> +
> +    return -1;
> +}
> +
> +static int vhost_sw_live_migration_enable(struct vhost_dev *dev,
> +                                          bool enable_lm)
> +{


So the live migration part should be done in a separate patch.

Thanks


> +    int r;
> +
> +    if (enable_lm == dev->shadow_vqs_enabled) {
> +        return 0;
> +    }
> +
> +    r = enable_lm ? vhost_sw_live_migration_start(dev)
> +                  : vhost_sw_live_migration_stop(dev);
> +
> +    return r;
> +}
> +
>   static void vhost_eventfd_add(MemoryListener *listener,
>                                 MemoryRegionSection *section,
>                                 bool match_data, uint64_t data, EventNotifier *e)
> @@ -1381,6 +1450,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>       hdev->log = NULL;
>       hdev->log_size = 0;
>       hdev->log_enabled = false;
> +    hdev->shadow_vqs_enabled = false;
>       hdev->started = false;
>       memory_listener_register(&hdev->memory_listener, &address_space_memory);
>       QLIST_INSERT_HEAD(&vhost_devices, hdev, entry);
> @@ -1484,6 +1554,10 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
>       BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
>       int i, r;
>   
> +    if (hdev->shadow_vqs_enabled) {
> +        vhost_sw_live_migration_enable(hdev, false);
> +    }
> +
>       for (i = 0; i < hdev->nvqs; ++i) {
>           r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), hdev->vq_index + i,
>                                            false);
> @@ -1798,6 +1872,7 @@ fail_features:
>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
>   {
>       int i;
> +    bool is_shadow_vqs_enabled = hdev->shadow_vqs_enabled;
>   
>       /* should only be called after backend is connected */
>       assert(hdev->vhost_ops);
> @@ -1805,7 +1880,16 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
>       if (hdev->vhost_ops->vhost_dev_start) {
>           hdev->vhost_ops->vhost_dev_start(hdev, false);
>       }
> +    if (is_shadow_vqs_enabled) {
> +        /* Shadow virtqueue will be stopped */
> +        hdev->shadow_vqs_enabled = false;
> +    }
>       for (i = 0; i < hdev->nvqs; ++i) {
> +        if (is_shadow_vqs_enabled) {
> +            vhost_shadow_vq_stop(hdev, i, hdev->shadow_vqs[i]);
> +            vhost_shadow_vq_free(hdev->shadow_vqs[i]);
> +        }
> +
>           vhost_virtqueue_stop(hdev,
>                                vdev,
>                                hdev->vqs + i,
> @@ -1819,6 +1903,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
>           memory_listener_unregister(&hdev->iommu_listener);
>       }
>       vhost_log_put(hdev, true);
> +    g_free(hdev->shadow_vqs);
> +    hdev->shadow_vqs_enabled = false;
>       hdev->started = false;
>       hdev->vdev = NULL;
>   }
> @@ -1835,5 +1921,60 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>   
>   void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>   {
> -    error_setg(errp, "Shadow virtqueue still not implemented");
> +    struct vhost_dev *hdev, *hdev_err;
> +    VirtIODevice *vdev;
> +    const char *err_cause = NULL;
> +    int r;
> +    ErrorClass err_class = ERROR_CLASS_GENERIC_ERROR;
> +
> +    QLIST_FOREACH(hdev, &vhost_devices, entry) {
> +        if (hdev->vdev && 0 == strcmp(hdev->vdev->name, name)) {
> +            vdev = hdev->vdev;
> +            break;
> +        }
> +    }
> +
> +    if (!hdev) {
> +        err_class = ERROR_CLASS_DEVICE_NOT_FOUND;
> +        err_cause = "Device not found";
> +        goto not_found_err;
> +    }
> +
> +    for ( ; hdev; hdev = QLIST_NEXT(hdev, entry)) {
> +        if (vdev != hdev->vdev) {
> +            continue;
> +        }
> +
> +        if (!hdev->started) {
> +            err_cause = "Device is not started";
> +            goto err;
> +        }
> +
> +        r = vhost_sw_live_migration_enable(hdev, enable);
> +        if (unlikely(r)) {
> +            err_cause = "Error enabling (see monitor)";
> +            goto err;
> +        }
> +    }
> +
> +    return;
> +
> +err:
> +    QLIST_FOREACH(hdev_err, &vhost_devices, entry) {
> +        if (hdev_err == hdev) {
> +            break;
> +        }
> +
> +        if (vdev != hdev->vdev) {
> +            continue;
> +        }
> +
> +        vhost_sw_live_migration_enable(hdev, !enable);
> +    }
> +
> +not_found_err:
> +    if (err_cause) {
> +        error_set(errp, err_class,
> +                  "Can't enable shadow vq on %s: %s", name, err_cause);
> +    }
>   }




* Re: [RFC v2 06/13] vhost: Route host->guest notification through shadow virtqueue
  2021-03-15 19:48 ` [RFC v2 06/13] vhost: Route host->guest " Eugenio Pérez
@ 2021-03-16  7:21   ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2021-03-16  7:21 UTC (permalink / raw)
  To: qemu-devel


On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> On one hand it uses a mutex to synchronize guest masking with SVQ start
> and stop, because otherwise guest mask could race with the SVQ
> stop code, sending an incorrect call notifier to vhost device. This
> would prevent further communication.


Is this because of the OOB monitor? If so, can we simply drop the QMP 
command and introduce a CLI parameter for the vhost backend?

Thanks


>
> On the other hand it needs to add an event to synchronize guest
> unmasking with call handling. Not doing that way could cause the guest
> to receive notifications after its unmask call. This could be done
> through the mutex but the event solution is cheaper for the buffer
> forwarding.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |   3 +
>   include/hw/virtio/vhost.h          |   1 +
>   hw/virtio/vhost-shadow-virtqueue.c | 127 +++++++++++++++++++++++++++++
>   hw/virtio/vhost.c                  |  29 ++++++-
>   4 files changed, 157 insertions(+), 3 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index c891c6510d..2ca4b92b12 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -17,6 +17,9 @@
>   
>   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>   
> +void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked);
> +void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq);
> +
>   bool vhost_shadow_vq_start(struct vhost_dev *dev,
>                              unsigned idx,
>                              VhostShadowVirtqueue *svq);
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index 7ffdf9aea0..2f556bd3d5 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -29,6 +29,7 @@ struct vhost_virtqueue {
>       unsigned long long used_phys;
>       unsigned used_size;
>       bool notifier_is_masked;
> +    QemuRecMutex masked_mutex;
>       EventNotifier masked_notifier;
>       struct vhost_dev *dev;
>   };
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 3e43399e9c..8f6ffa729a 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -32,8 +32,22 @@ typedef struct VhostShadowVirtqueue {
>        */
>       EventNotifier host_notifier;
>   
> +    /* (Possible) masked notifier */
> +    struct {
> +        EventNotifier *n;
> +
> +        /*
> +         * Event to confirm unmasking.
> +         * set when the masked notifier has no uses
> +         */
> +        QemuEvent is_free;
> +    } masked_notifier;
> +
>       /* Virtio queue shadowing */
>       VirtQueue *vq;
> +
> +    /* Virtio device */
> +    VirtIODevice *vdev;
>   } VhostShadowVirtqueue;
>   
>   /* Forward guest notifications */
> @@ -49,6 +63,70 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>       event_notifier_set(&svq->kick_notifier);
>   }
>   
> +/* Forward vhost notifications */
> +static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> +{
> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> +                                             call_notifier);
> +    EventNotifier *masked_notifier;
> +
> +    /* Signal start of using masked notifier */
> +    qemu_event_reset(&svq->masked_notifier.is_free);
> +    masked_notifier = qatomic_load_acquire(&svq->masked_notifier.n);
> +    if (!masked_notifier) {
> +        qemu_event_set(&svq->masked_notifier.is_free);
> +    }
> +
> +    if (!masked_notifier) {
> +        unsigned n = virtio_get_queue_index(svq->vq);
> +        virtio_queue_invalidate_signalled_used(svq->vdev, n);
> +        virtio_notify_irqfd(svq->vdev, svq->vq);
> +    } else {
> +        event_notifier_set(svq->masked_notifier.n);
> +    }
> +
> +    if (masked_notifier) {
> +        /* Signal not using it anymore */
> +        qemu_event_set(&svq->masked_notifier.is_free);
> +    }
> +}
> +
> +static void vhost_shadow_vq_handle_call(EventNotifier *n)
> +{
> +
> +    if (likely(event_notifier_test_and_clear(n))) {
> +        vhost_shadow_vq_handle_call_no_test(n);
> +    }
> +}
> +
> +/*
> + * Mask the shadow virtqueue.
> + *
> + * It can be called from a guest masking vmexit or shadow virtqueue start
> + * through QMP.
> + *
> + * @vq Shadow virtqueue
> + * @masked Masked notifier to signal instead of guest
> + */
> +void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked)
> +{
> +    qatomic_store_release(&svq->masked_notifier.n, masked);
> +}
> +
> +/*
> + * Unmask the shadow virtqueue.
> + *
> + * It can be called from a guest unmasking vmexit or shadow virtqueue start
> + * through QMP.
> + *
> + * @vq Shadow virtqueue
> + */
> +void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq)
> +{
> +    qatomic_store_release(&svq->masked_notifier.n, NULL);
> +    qemu_event_wait(&svq->masked_notifier.is_free);
> +}
> +
>   /*
>    * Restore the vhost guest to host notifier, i.e., disables svq effect.
>    */
> @@ -103,8 +181,39 @@ bool vhost_shadow_vq_start(struct vhost_dev *dev,
>           goto err_set_vring_kick;
>       }
>   
> +    /* Set vhost call */
> +    file.fd = event_notifier_get_fd(&svq->call_notifier),
> +    r = dev->vhost_ops->vhost_set_vring_call(dev, &file);
> +    if (unlikely(r != 0)) {
> +        error_report("Couldn't set call fd: %s", strerror(errno));
> +        goto err_set_vring_call;
> +    }
> +
> +
> +    /*
> +     * Lock to avoid a race condition between guest setting masked status and
> +     * us.
> +     */
> +    QEMU_LOCK_GUARD(&dev->vqs[idx].masked_mutex);
> +    /* Set shadow vq -> guest notifier */
> +    assert(dev->shadow_vqs_enabled);
> +    vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
> +                         dev->vqs[idx].notifier_is_masked);
> +
> +    if (dev->vqs[idx].notifier_is_masked &&
> +               event_notifier_test_and_clear(&dev->vqs[idx].masked_notifier)) {
> +        /* Check for pending notifications from the device */
> +        vhost_shadow_vq_handle_call_no_test(&svq->call_notifier);
> +    }
> +
>       return true;
>   
> +err_set_vring_call:
> +    r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
> +    if (unlikely(r < 0)) {
> +        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> +    }
> +
>   err_set_vring_kick:
>       event_notifier_set_handler(&svq->host_notifier, NULL);
>   
> @@ -126,7 +235,19 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
>           error_report("Couldn't restore vq kick fd: %s", strerror(-r));
>       }
>   
> +    assert(!dev->shadow_vqs_enabled);
> +
>       event_notifier_set_handler(&svq->host_notifier, NULL);
> +
> +    /*
> +     * Lock to avoid a race condition between guest setting masked status and
> +     * us.
> +     */
> +    QEMU_LOCK_GUARD(&dev->vqs[idx].masked_mutex);
> +
> +    /* Restore vhost call */
> +    vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
> +                         dev->vqs[idx].notifier_is_masked);
>   }
>   
>   /*
> @@ -154,6 +275,10 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>       }
>   
>       svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> +    svq->vdev = dev->vdev;
> +    event_notifier_set_handler(&svq->call_notifier,
> +                               vhost_shadow_vq_handle_call);
> +    qemu_event_init(&svq->masked_notifier.is_free, true);
>       return g_steal_pointer(&svq);
>   
>   err_init_call_notifier:
> @@ -168,7 +293,9 @@ err_init_kick_notifier:
>    */
>   void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
>   {
> +    qemu_event_destroy(&vq->masked_notifier.is_free);
>       event_notifier_cleanup(&vq->kick_notifier);
> +    event_notifier_set_handler(&vq->call_notifier, NULL);
>       event_notifier_cleanup(&vq->call_notifier);
>       g_free(vq);
>   }
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 4858a35df6..eab3e334f2 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1224,7 +1224,8 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
>   {
>       int idx;
>   
> -    dev->shadow_vqs_enabled = false;
> +    /* Can be read by vhost_virtqueue_mask, from vm exit */
> +    qatomic_store_release(&dev->shadow_vqs_enabled, false);
>   
>       for (idx = 0; idx < dev->nvqs; ++idx) {
>           vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
> @@ -1248,7 +1249,8 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>           }
>       }
>   
> -    dev->shadow_vqs_enabled = true;
> +    /* Can be read by vhost_virtqueue_mask, from vm exit */
> +    qatomic_store_release(&dev->shadow_vqs_enabled, true);
>       for (idx = 0; idx < dev->nvqs; ++idx) {
>           bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
>           if (unlikely(!ok)) {
> @@ -1259,7 +1261,7 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>       return 0;
>   
>   err_start:
> -    dev->shadow_vqs_enabled = false;
> +    qatomic_store_release(&dev->shadow_vqs_enabled, false);
>       for (stop_idx = 0; stop_idx < idx; stop_idx++) {
>           vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
>       }
> @@ -1343,6 +1345,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
>           goto fail_call;
>       }
>   
> +    qemu_rec_mutex_init(&vq->masked_mutex);
>       vq->dev = dev;
>   
>       return 0;
> @@ -1353,6 +1356,7 @@ fail_call:
>   
>   static void vhost_virtqueue_cleanup(struct vhost_virtqueue *vq)
>   {
> +    qemu_rec_mutex_destroy(&vq->masked_mutex);
>       event_notifier_cleanup(&vq->masked_notifier);
>   }
>   
> @@ -1591,6 +1595,25 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, VirtIODevice *vdev, int n,
>       /* should only be called after backend is connected */
>       assert(hdev->vhost_ops);
>   
> +    /* Avoid race condition with shadow virtqueue stop/start */
> +    QEMU_LOCK_GUARD(&hdev->vqs[index].masked_mutex);
> +
> +    /* Set by QMP thread, so using acquire semantics */
> +    if (qatomic_load_acquire(&hdev->shadow_vqs_enabled)) {
> +        if (mask) {
> +            vhost_shadow_vq_mask(hdev->shadow_vqs[index],
> +                                 &hdev->vqs[index].masked_notifier);
> +        } else {
> +            vhost_shadow_vq_unmask(hdev->shadow_vqs[index]);
> +        }
> +
> +        /*
> +         * Vhost call fd must remain the same since shadow vq is not polling
> +         * for changes
> +         */
> +        return;
> +    }
> +
>       if (mask) {
>           assert(vdev->use_guest_notifier_mask);
>           file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);




* Re: [RFC v2 10/13] vhost: add vhost_kernel_set_vring_enable
  2021-03-15 19:48 ` [RFC v2 10/13] vhost: add vhost_kernel_set_vring_enable Eugenio Pérez
@ 2021-03-16  7:29   ` Jason Wang
  2021-03-16 10:43     ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-16  7:29 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Eli Cohen, Stefano Garzarella, Michael Lilja,
	Jim Harford, Rob Miller


On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> This method is already present in vhost-user. This commit adapts it to
> vhost-net, so SVQ can use.
>
> vhost_kernel_set_enable stops the device, so qemu can ask for its status
> (next available idx the device was going to consume). When SVQ starts it
> can resume consuming the guest's driver ring, without notice from the
> latter. Not stopping the device before of the swapping could imply that
> it process more buffers than reported, what would duplicate the device
> action.


Note that this might not be the case for vDPA (virtio), or at least 
virtio needs some extension to achieve something similar. One example 
is virtio-pci, which forbids writing 0 to queue_enable.

This is another reason to start from vhost-vDPA.
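
For comparison, the vhost-vdpa uAPI does expose a per-virtqueue enable
knob. A minimal user-space sketch, assuming a kernel with vhost-vdpa
support; whether the backing device accepts disabling is exactly the
device-dependent point above:

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Toggle one virtqueue through /dev/vhost-vdpa-N.
 * VHOST_VDPA_SET_VRING_ENABLE takes a vhost_vring_state whose .num
 * carries the enable flag; a virtio-pci backed device may reject
 * num == 0 per the queue_enable restriction mentioned above. */
static int vdpa_set_vq_enable(int vdpa_fd, unsigned idx, bool enable)
{
    struct vhost_vring_state state = {
        .index = idx,
        .num = enable,
    };

    return ioctl(vdpa_fd, VHOST_VDPA_SET_VRING_ENABLE, &state);
}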


>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-backend.c | 29 +++++++++++++++++++++++++++++
>   1 file changed, 29 insertions(+)
>
> diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> index 31b33bde37..1ac5c574a9 100644
> --- a/hw/virtio/vhost-backend.c
> +++ b/hw/virtio/vhost-backend.c
> @@ -201,6 +201,34 @@ static int vhost_kernel_get_vq_index(struct vhost_dev *dev, int idx)
>       return idx - dev->vq_index;
>   }
>   
> +static int vhost_kernel_set_vq_enable(struct vhost_dev *dev, unsigned idx,
> +                                      bool enable)
> +{
> +    struct vhost_vring_file file = {
> +        .index = idx,
> +    };
> +
> +    if (!enable) {
> +        file.fd = -1; /* Pass -1 to unbind from file. */
> +    } else {
> +        struct vhost_net *vn_dev = container_of(dev, struct vhost_net, dev);
> +        file.fd = vn_dev->backend;


This can only work with vhost-net devices, not vsock/scsi etc.

Thanks
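
For illustration, each vhost device type exposes a different knob here,
which is why a generic pause can't be built on top of
VHOST_NET_SET_BACKEND alone. The ioctl names below are real uAPI; the
dispatcher and its enum are made up for this sketch:

#include <sys/ioctl.h>
#include <linux/vhost.h>

enum pause_backend { PAUSE_NET, PAUSE_VSOCK };

static int vhost_kernel_pause_sketch(int vhost_fd, enum pause_backend type,
                                     unsigned idx)
{
    switch (type) {
    case PAUSE_NET: {
        /* vhost-net: detach the queue from its backend fd */
        struct vhost_vring_file file = { .index = idx, .fd = -1 };
        return ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &file);
    }
    case PAUSE_VSOCK: {
        /* vhost-vsock: no per-vq knob, stop the whole device instead */
        int start = 0;
        return ioctl(vhost_fd, VHOST_VSOCK_SET_RUNNING, &start);
    }
    default:
        /* vhost-scsi and others need their own endpoint handling */
        return -1;
    }
}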


> +    }
> +
> +    return vhost_kernel_net_set_backend(dev, &file);
> +}
> +
> +static int vhost_kernel_set_vring_enable(struct vhost_dev *dev, int enable)
> +{
> +    int i;
> +
> +    for (i = 0; i < dev->nvqs; ++i) {
> +        vhost_kernel_set_vq_enable(dev, i, enable);
> +    }
> +
> +    return 0;
> +}
> +
>   #ifdef CONFIG_VHOST_VSOCK
>   static int vhost_kernel_vsock_set_guest_cid(struct vhost_dev *dev,
>                                               uint64_t guest_cid)
> @@ -317,6 +345,7 @@ static const VhostOps kernel_ops = {
>           .vhost_set_owner = vhost_kernel_set_owner,
>           .vhost_reset_device = vhost_kernel_reset_device,
>           .vhost_get_vq_index = vhost_kernel_get_vq_index,
> +        .vhost_set_vring_enable = vhost_kernel_set_vring_enable,
>   #ifdef CONFIG_VHOST_VSOCK
>           .vhost_vsock_set_guest_cid = vhost_kernel_vsock_set_guest_cid,
>           .vhost_vsock_set_running = vhost_kernel_vsock_set_running,




* Re: [RFC v2 08/13] virtio: Add vhost_shadow_vq_get_vring_addr
  2021-03-15 19:48 ` [RFC v2 08/13] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
@ 2021-03-16  7:50   ` Jason Wang
  2021-03-16 15:20     ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-16  7:50 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Eli Cohen, Stefano Garzarella, Michael Lilja,
	Jim Harford, Rob Miller


On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> It reports the shadow virtqueue address from qemu virtual address space


Note that for this to be usable by vDPA, we can't use qemu VA directly 
here.
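
For vDPA the ring would have to be mapped through the device's IOVA
space instead. A sketch of that direction: svq_iova_alloc() is
hypothetical (a real version needs an allocator over the IOVA ranges
usable by the device), while vhost_backend_update_device_iotlb() is
the helper this series already calls:

static int vhost_svq_map_ring(struct vhost_dev *dev, void *ring_va,
                              size_t size, uint64_t *iova_out)
{
    uint64_t iova = svq_iova_alloc(dev, size);        /* hypothetical */
    int r = vhost_backend_update_device_iotlb(dev, iova,
                                              (uint64_t)(uintptr_t)ring_va,
                                              size, IOMMU_RW);

    if (r == 0) {
        /* Report this IOVA, not the qemu VA, in vhost_vring_addr */
        *iova_out = iova;
    }
    return r;
}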


>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |  2 ++
>   hw/virtio/vhost-shadow-virtqueue.c | 24 +++++++++++++++++++++++-
>   2 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 2ca4b92b12..d82c35bccf 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -19,6 +19,8 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>   
>   void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked);
>   void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq);
> +void vhost_shadow_vq_get_vring_addr(const VhostShadowVirtqueue *svq,
> +                                    struct vhost_vring_addr *addr);
>   
>   bool vhost_shadow_vq_start(struct vhost_dev *dev,
>                              unsigned idx,
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index b6bab438d6..1460d1d5d1 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -17,6 +17,9 @@
>   
>   /* Shadow virtqueue to relay notifications */
>   typedef struct VhostShadowVirtqueue {
> +    /* Shadow vring */
> +    struct vring vring;
> +
>       /* Shadow kick notifier, sent to vhost */
>       EventNotifier kick_notifier;
>       /* Shadow call notifier, sent to vhost */
> @@ -51,6 +54,9 @@ typedef struct VhostShadowVirtqueue {
>   
>       /* Virtio device */
>       VirtIODevice *vdev;
> +
> +    /* Descriptors copied from guest */
> +    vring_desc_t descs[];
>   } VhostShadowVirtqueue;
>   
>   /* Forward guest notifications */
> @@ -132,6 +138,19 @@ void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq)
>       qemu_event_wait(&svq->masked_notifier.is_free);
>   }
>   
> +/*
> + * Get the shadow vq vring address.
> + * @svq Shadow virtqueue
> + * @addr Destination to store address
> + */
> +void vhost_shadow_vq_get_vring_addr(const VhostShadowVirtqueue *svq,
> +                                    struct vhost_vring_addr *addr)
> +{
> +    addr->desc_user_addr = (uint64_t)svq->vring.desc;
> +    addr->avail_user_addr = (uint64_t)svq->vring.avail;
> +    addr->used_user_addr = (uint64_t)svq->vring.used;
> +}
> +
>   /*
>    * Restore the vhost guest to host notifier, i.e., disables svq effect.
>    */
> @@ -262,7 +281,9 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
>   VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>   {
>       int vq_idx = dev->vq_index + idx;
> -    g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> +    unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> +    size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
> +    g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
>       int r;
>   
>       r = event_notifier_init(&svq->kick_notifier, 0);
> @@ -279,6 +300,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>           goto err_init_call_notifier;
>       }
>   
> +    vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);


We had some discussion about this in the past. Exporting vring_init() 
was wrong, but it's too late to fix (it assumes a legacy split layout). 
Let's not depend on this buggy uAPI.

Thanks
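
For reference, a sketch of computing the split layout by hand instead
of going through the legacy helper. The sizes follow the split-ring
section of the virtio spec; the avail/used sizes include the 2-byte
event-idx slot, matching what vring_size() counts:

#include <stddef.h>

static inline size_t split_desc_bytes(unsigned num)  { return 16 * (size_t)num; }
static inline size_t split_avail_bytes(unsigned num) { return 6 + 2 * (size_t)num; }
static inline size_t split_used_bytes(unsigned num)  { return 6 + 8 * (size_t)num; }

static inline size_t align_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Total bytes for a split ring of 'num' entries, with the used ring
 * aligned to 'align' (e.g. VRING_DESC_ALIGN_SIZE, or 4096 for a page). */
static inline size_t split_vring_bytes(unsigned num, size_t align)
{
    size_t driver_area = split_desc_bytes(num) + split_avail_bytes(num);

    return align_up(driver_area, align) + split_used_bytes(num);
}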


>       svq->vq = virtio_get_queue(dev->vdev, vq_idx);
>       svq->vdev = dev->vdev;
>       event_notifier_set_handler(&svq->call_notifier,




* Re: [RFC v2 12/13] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick
  2021-03-15 19:48 ` [RFC v2 12/13] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
@ 2021-03-16  8:07   ` Jason Wang
  2021-05-17 17:11     ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-16  8:07 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Eli Cohen, Stefano Garzarella, Michael Lilja,
	Jim Harford, Rob Miller


On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 68ed0f2740..7df98fc43f 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -145,6 +145,15 @@ static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
>       svq->ring_id_maps[qemu_head] = elem;
>   }
>   
> +static void vhost_shadow_vq_kick(VhostShadowVirtqueue *svq)
> +{
> +    /* Make sure we are reading updated device flag */
> +    smp_rmb();


smp_mb() actually? Or better, explain which earlier access this 
following read needs to be ordered against.

Thanks
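
Concretely, the kick path is a store (publishing avail->idx) followed
by a load (used->flags); that store->load ordering is the one case
smp_rmb() cannot provide. An illustrative fragment using the patch's
own symbols:

/* store: expose the new buffers to the device */
avail->idx = virtio_tswap16(svq->vdev, svq->avail_idx_shadow);
/* full barrier: order the store above against the load below */
smp_mb();
/* load: did the device suppress notifications? */
if (!(svq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
    event_notifier_set(&svq->kick_notifier);
}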


> +    if (!(svq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
> +        event_notifier_set(&svq->kick_notifier);
> +    }
> +}
> +
>   /* Handle guest->device notifications */
>   static void vhost_handle_guest_kick(EventNotifier *n)
>   {
> @@ -174,7 +183,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>               }
>   
>               vhost_shadow_vq_add(svq, elem);
> -            event_notifier_set(&svq->kick_notifier);
> +            vhost_shadow_vq_kick(svq);
>           }
>   
>           virtio_queue_set_notification(svq->vq, true);




* Re: [RFC v2 13/13] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue
  2021-03-15 19:48 ` [RFC v2 13/13] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
@ 2021-03-16  8:08   ` Jason Wang
  2021-05-17 17:32     ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-16  8:08 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Eli Cohen, Stefano Garzarella, Michael Lilja,
	Jim Harford, Rob Miller


On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.c | 28 +++++++++++++++++++++++++++-
>   1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 7df98fc43f..e3879a4622 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -71,10 +71,35 @@ typedef struct VhostShadowVirtqueue {
>       /* Next head to consume from device */
>       uint16_t used_idx;
>   
> +    /* Cache for the exposed notification flag */
> +    bool notification;
> +
>       /* Descriptors copied from guest */
>       vring_desc_t descs[];
>   } VhostShadowVirtqueue;
>   
> +static void vhost_shadow_vq_set_notification(VhostShadowVirtqueue *svq,
> +                                             bool enable)
> +{
> +    uint16_t notification_flag;
> +
> +    if (svq->notification == enable) {
> +        return;
> +    }
> +
> +    notification_flag = virtio_tswap16(svq->vdev, VRING_AVAIL_F_NO_INTERRUPT);
> +
> +    svq->notification = enable;
> +    if (enable) {
> +        svq->vring.avail->flags &= ~notification_flag;
> +    } else {
> +        svq->vring.avail->flags |= notification_flag;
> +    }
> +
> +    /* Make sure device reads our flag */
> +    smp_mb();


This is a hint, so we don't need a memory barrier here.

Thanks


> +}
> +
>   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>                                       const struct iovec *iovec,
>                                       size_t num, bool more_descs, bool write)
> @@ -251,7 +276,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>       do {
>           unsigned i = 0;
>   
> -        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> +        vhost_shadow_vq_set_notification(svq, false);
>           while (true) {
>               g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
>               if (!elem) {
> @@ -269,6 +294,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>               svq->masked_notifier.signaled = true;
>               event_notifier_set(svq->masked_notifier.n);
>           }
> +        vhost_shadow_vq_set_notification(svq, true);
>       } while (vhost_shadow_vq_more_used(svq));
>   
>       if (masked_notifier) {




* Re: [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding
  2021-03-15 19:48 ` [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
@ 2021-03-16  8:15   ` Jason Wang
  2021-03-16 16:05     ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-16  8:15 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Eli Cohen, Stefano Garzarella, Michael Lilja,
	Jim Harford, Rob Miller


On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> Initial version of shadow virtqueue that actually forwards buffers.
>
> It reuses the VirtQueue code for the device part. The driver part is
> based on Linux's virtio_ring driver, but with stripped functionality
> and optimizations so it's easier to review.
>
> These will be added in later commits.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.c | 212 +++++++++++++++++++++++++++--
>   hw/virtio/vhost.c                  | 113 ++++++++++++++-
>   2 files changed, 312 insertions(+), 13 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 1460d1d5d1..68ed0f2740 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -9,6 +9,7 @@
>   
>   #include "hw/virtio/vhost-shadow-virtqueue.h"
>   #include "hw/virtio/vhost.h"
> +#include "hw/virtio/virtio-access.h"
>   
>   #include "standard-headers/linux/vhost_types.h"
>   
> @@ -55,11 +56,96 @@ typedef struct VhostShadowVirtqueue {
>       /* Virtio device */
>       VirtIODevice *vdev;
>   
> +    /* Map for returning guest's descriptors */
> +    VirtQueueElement **ring_id_maps;
> +
> +    /* Next head to expose to device */
> +    uint16_t avail_idx_shadow;
> +
> +    /* Next free descriptor */
> +    uint16_t free_head;
> +
> +    /* Last seen used idx */
> +    uint16_t shadow_used_idx;
> +
> +    /* Next head to consume from device */
> +    uint16_t used_idx;
> +
>       /* Descriptors copied from guest */
>       vring_desc_t descs[];
>   } VhostShadowVirtqueue;
>   
> -/* Forward guest notifications */
> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> +                                    const struct iovec *iovec,
> +                                    size_t num, bool more_descs, bool write)
> +{
> +    uint16_t i = svq->free_head, last = svq->free_head;
> +    unsigned n;
> +    uint16_t flags = write ? virtio_tswap16(svq->vdev, VRING_DESC_F_WRITE) : 0;
> +    vring_desc_t *descs = svq->vring.desc;
> +
> +    if (num == 0) {
> +        return;
> +    }
> +
> +    for (n = 0; n < num; n++) {
> +        if (more_descs || (n + 1 < num)) {
> +            descs[i].flags = flags | virtio_tswap16(svq->vdev,
> +                                                    VRING_DESC_F_NEXT);
> +        } else {
> +            descs[i].flags = flags;
> +        }
> +        descs[i].addr = virtio_tswap64(svq->vdev, (hwaddr)iovec[n].iov_base);


So using virtio_tswap() is probably not correct, since we're talking 
with vhost backends which have their own endianness.

For vhost-vDPA, we can assume that it's a 1.0 device.
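
As an illustration, for a fixed little-endian (VIRTIO 1.0) backend the
descriptor stores could use the helpers from "qemu/bswap.h" instead of
virtio_tswap*(). A sketch reusing this patch's variables:

    uint16_t f = write ? VRING_DESC_F_WRITE : 0;
    if (more_descs || n + 1 < num) {
        f |= VRING_DESC_F_NEXT;
    }
    descs[i].addr  = cpu_to_le64((uint64_t)(uintptr_t)iovec[n].iov_base);
    descs[i].len   = cpu_to_le32(iovec[n].iov_len);
    descs[i].flags = cpu_to_le16(f);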


> +        descs[i].len = virtio_tswap32(svq->vdev, iovec[n].iov_len);
> +
> +        last = i;
> +        i = virtio_tswap16(svq->vdev, descs[i].next);
> +    }
> +
> +    svq->free_head = virtio_tswap16(svq->vdev, descs[last].next);
> +}
> +
> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
> +                                          VirtQueueElement *elem)
> +{
> +    int head;
> +    unsigned avail_idx;
> +    vring_avail_t *avail = svq->vring.avail;
> +
> +    head = svq->free_head;
> +
> +    /* We need some descriptors here */
> +    assert(elem->out_num || elem->in_num);
> +
> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> +                            elem->in_num > 0, false);
> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> +
> +    /*
> +     * Put entry in available array (but don't update avail->idx until they
> +     * do sync).
> +     */
> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> +    avail->ring[avail_idx] = virtio_tswap16(svq->vdev, head);
> +    svq->avail_idx_shadow++;
> +
> +    /* Expose descriptors to device */
> +    smp_wmb();
> +    avail->idx = virtio_tswap16(svq->vdev, svq->avail_idx_shadow);
> +
> +    return head;
> +
> +}
> +
> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
> +                                VirtQueueElement *elem)
> +{
> +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
> +
> +    svq->ring_id_maps[qemu_head] = elem;
> +}
> +
> +/* Handle guest->device notifications */
>   static void vhost_handle_guest_kick(EventNotifier *n)
>   {
>       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> @@ -69,7 +155,72 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>           return;
>       }
>   
> -    event_notifier_set(&svq->kick_notifier);
> +    /* Make available as many buffers as possible */
> +    do {
> +        if (virtio_queue_get_notification(svq->vq)) {
> +            /* No more notifications until process all available */
> +            virtio_queue_set_notification(svq->vq, false);
> +        }
> +
> +        while (true) {
> +            VirtQueueElement *elem;
> +            if (virtio_queue_full(svq->vq)) {
> +                break;


So we've disabled guest notifications here. If a buffer has been consumed 
(freeing a shadow descriptor), we need to retry handle_guest_kick, but I 
didn't find that code?
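
One possible shape for that retry, as a sketch only: once the call-side
path has flushed used buffers back to the guest, and has therefore freed
shadow descriptors, re-arm the kick handler so elements skipped on a
full queue get processed (svq->host_notifier is the guest kick fd this
series borrows):

    virtqueue_flush(vq, i);
    /* Shadow descriptors were freed: re-run vhost_handle_guest_kick */
    event_notifier_set(&svq->host_notifier);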


> +            }
> +
> +            elem = virtqueue_pop(svq->vq, sizeof(*elem));
> +            if (!elem) {
> +                break;
> +            }
> +
> +            vhost_shadow_vq_add(svq, elem);
> +            event_notifier_set(&svq->kick_notifier);
> +        }
> +
> +        virtio_queue_set_notification(svq->vq, true);
> +    } while (!virtio_queue_empty(svq->vq));
> +}
> +
> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
> +{
> +    if (svq->used_idx != svq->shadow_used_idx) {
> +        return true;
> +    }
> +
> +    /* Get used idx must not be reordered */
> +    smp_rmb();
> +    svq->shadow_used_idx = virtio_tswap16(svq->vdev, svq->vring.used->idx);
> +
> +    return svq->used_idx != svq->shadow_used_idx;
> +}
> +
> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
> +{
> +    vring_desc_t *descs = svq->vring.desc;
> +    const vring_used_t *used = svq->vring.used;
> +    vring_used_elem_t used_elem;
> +    uint16_t last_used;
> +
> +    if (!vhost_shadow_vq_more_used(svq)) {
> +        return NULL;
> +    }
> +
> +    last_used = svq->used_idx & (svq->vring.num - 1);
> +    used_elem.id = virtio_tswap32(svq->vdev, used->ring[last_used].id);
> +    used_elem.len = virtio_tswap32(svq->vdev, used->ring[last_used].len);
> +
> +    if (unlikely(used_elem.id >= svq->vring.num)) {
> +        error_report("Device %s says index %u is available", svq->vdev->name,
> +                     used_elem.id);
> +        return NULL;
> +    }
> +
> +    descs[used_elem.id].next = svq->free_head;
> +    svq->free_head = used_elem.id;
> +
> +    svq->used_idx++;
> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
>   }
>   
>   /* Forward vhost notifications */
> @@ -78,6 +229,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>                                                call_notifier);
>       EventNotifier *masked_notifier;
> +    VirtQueue *vq = svq->vq;
>   
>       /* Signal start of using masked notifier */
>       qemu_event_reset(&svq->masked_notifier.is_free);
> @@ -86,14 +238,29 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>           qemu_event_set(&svq->masked_notifier.is_free);
>       }
>   
> -    if (!masked_notifier) {
> -        unsigned n = virtio_get_queue_index(svq->vq);
> -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
> -        virtio_notify_irqfd(svq->vdev, svq->vq);
> -    } else if (!svq->masked_notifier.signaled) {
> -        svq->masked_notifier.signaled = true;
> -        event_notifier_set(svq->masked_notifier.n);
> -    }
> +    /* Make as many buffers as possible used. */
> +    do {
> +        unsigned i = 0;
> +
> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> +        while (true) {
> +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
> +            if (!elem) {
> +                break;
> +            }
> +
> +            assert(i < svq->vring.num);
> +            virtqueue_fill(vq, elem, elem->len, i++);
> +        }
> +
> +        virtqueue_flush(vq, i);
> +        if (!masked_notifier) {
> +            virtio_notify_irqfd(svq->vdev, svq->vq);
> +        } else if (!svq->masked_notifier.signaled) {
> +            svq->masked_notifier.signaled = true;
> +            event_notifier_set(svq->masked_notifier.n);
> +        }
> +    } while (vhost_shadow_vq_more_used(svq));
>   
>       if (masked_notifier) {
>           /* Signal not using it anymore */
> @@ -103,7 +270,6 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>   
>   static void vhost_shadow_vq_handle_call(EventNotifier *n)
>   {
> -
>       if (likely(event_notifier_test_and_clear(n))) {
>           vhost_shadow_vq_handle_call_no_test(n);
>       }
> @@ -254,7 +420,11 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
>                             unsigned idx,
>                             VhostShadowVirtqueue *svq)
>   {
> +    int i;
>       int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
> +
> +    assert(!dev->shadow_vqs_enabled);
> +
>       if (unlikely(r < 0)) {
>           error_report("Couldn't restore vq kick fd: %s", strerror(-r));
>       }
> @@ -272,6 +442,18 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
>       /* Restore vhost call */
>       vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
>                            dev->vqs[idx].notifier_is_masked);
> +
> +
> +    for (i = 0; i < svq->vring.num; ++i) {
> +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> +        /*
> +         * Although the doc says we must unpop in order, it's ok to unpop
> +         * everything.
> +         */
> +        if (elem) {
> +            virtqueue_unpop(svq->vq, elem, elem->len);


Shouldn't we wait until all pending requests are drained? Otherwise we 
may end up with duplicated requests.
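
A minimal sketch of such a drain barrier, using the indexes this patch
maintains (a production version would block on svq->call_notifier
instead of busy-waiting):

    /* Wait until the device has used every descriptor SVQ exposed */
    while (virtio_tswap16(svq->vdev, svq->vring.used->idx) !=
           svq->avail_idx_shadow) {
        /* descriptors are still owned by the device */
    }
    smp_rmb(); /* read used entries only after seeing the final idx */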

Thanks


> +        }
> +    }
>   }
>   
>   /*
> @@ -284,7 +466,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
>       size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
>       g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
> -    int r;
> +    int r, i;
>   
>       r = event_notifier_init(&svq->kick_notifier, 0);
>       if (r != 0) {
> @@ -303,6 +485,11 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>       vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
>       svq->vq = virtio_get_queue(dev->vdev, vq_idx);
>       svq->vdev = dev->vdev;
> +    for (i = 0; i < num - 1; i++) {
> +        svq->descs[i].next = virtio_tswap16(dev->vdev, i + 1);
> +    }
> +
> +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
>       event_notifier_set_handler(&svq->call_notifier,
>                                  vhost_shadow_vq_handle_call);
>       qemu_event_init(&svq->masked_notifier.is_free, true);
> @@ -324,5 +511,6 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
>       event_notifier_cleanup(&vq->kick_notifier);
>       event_notifier_set_handler(&vq->call_notifier, NULL);
>       event_notifier_cleanup(&vq->call_notifier);
> +    g_free(vq->ring_id_maps);
>       g_free(vq);
>   }
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index eab3e334f2..a373999bc4 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1021,6 +1021,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
>   
>       trace_vhost_iotlb_miss(dev, 1);
>   
> +    if (qatomic_load_acquire(&dev->shadow_vqs_enabled)) {
> +        uaddr = iova;
> +        len = 4096;
> +        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
> +                                                IOMMU_RW);
> +        if (ret) {
> +            trace_vhost_iotlb_miss(dev, 2);
> +            error_report("Fail to update device iotlb");
> +        }
> +
> +        return ret;
> +    }
> +
>       iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
>                                             iova, write,
>                                             MEMTXATTRS_UNSPECIFIED);
> @@ -1227,8 +1240,28 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
>       /* Can be read by vhost_virtqueue_mask, from vm exit */
>       qatomic_store_release(&dev->shadow_vqs_enabled, false);
>   
> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
> +        error_report("Fail to invalidate device iotlb");
> +    }
> +
>       for (idx = 0; idx < dev->nvqs; ++idx) {
> +        /*
> +         * Update used ring information for IOTLB to work correctly,
> +         * vhost-kernel code requires for this.
> +         */
> +        struct vhost_virtqueue *vq = dev->vqs + idx;
> +        vhost_device_iotlb_miss(dev, vq->used_phys, true);
> +
>           vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
> +                              dev->vq_index + idx);
> +    }
> +
> +    /* Enable guest's vq vring */
> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> +
> +    for (idx = 0; idx < dev->nvqs; ++idx) {
>           vhost_shadow_vq_free(dev->shadow_vqs[idx]);
>       }
>   
> @@ -1237,6 +1270,59 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
>       return 0;
>   }
>   
> +/*
> + * Start shadow virtqueue in a given queue.
> + * In failure case, this function leaves queue working as regular vhost mode.
> + */
> +static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
> +                                             unsigned idx)
> +{
> +    struct vhost_vring_addr addr = {
> +        .index = idx,
> +    };
> +    struct vhost_vring_state s = {
> +        .index = idx,
> +    };
> +    int r;
> +    bool ok;
> +
> +    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
> +    ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> +    if (unlikely(!ok)) {
> +        return false;
> +    }
> +
> +    /* From this point, vhost_virtqueue_start can reset these changes */
> +    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
> +    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
> +    if (unlikely(r != 0)) {
> +        VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
> +        goto err;
> +    }
> +
> +    r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
> +    if (unlikely(r != 0)) {
> +        VHOST_OPS_DEBUG("vhost_set_vring_base for shadow vq failed");
> +        goto err;
> +    }
> +
> +    /*
> +     * Update used ring information for IOTLB to work correctly,
> +     * vhost-kernel code requires for this.
> +     */
> +    r = vhost_device_iotlb_miss(dev, addr.used_user_addr, true);
> +    if (unlikely(r != 0)) {
> +        /* Debug message already printed */
> +        goto err;
> +    }
> +
> +    return true;
> +
> +err:
> +    vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
> +    return false;
> +}
> +
>   static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>   {
>       int idx, stop_idx;
> @@ -1249,24 +1335,35 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>           }
>       }
>   
> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
> +        error_report("Fail to invalidate device iotlb");
> +    }
> +
>       /* Can be read by vhost_virtqueue_mask, from vm exit */
>       qatomic_store_release(&dev->shadow_vqs_enabled, true);
>       for (idx = 0; idx < dev->nvqs; ++idx) {
> -        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> +        bool ok = vhost_sw_live_migration_start_vq(dev, idx);
>           if (unlikely(!ok)) {
>               goto err_start;
>           }
>       }
>   
> +    /* Enable shadow vq vring */
> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>       return 0;
>   
>   err_start:
>       qatomic_store_release(&dev->shadow_vqs_enabled, false);
>       for (stop_idx = 0; stop_idx < idx; stop_idx++) {
>           vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
> +                              dev->vq_index + stop_idx);
>       }
>   
>   err_new:
> +    /* Enable guest's vring */
> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>       for (idx = 0; idx < dev->nvqs; ++idx) {
>           vhost_shadow_vq_free(dev->shadow_vqs[idx]);
>       }
> @@ -1970,6 +2067,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>   
>           if (!hdev->started) {
>               err_cause = "Device is not started";
> +        } else if (!vhost_dev_has_iommu(hdev)) {
> +            err_cause = "Does not support iommu";
> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
> +            err_cause = "Is packed";
> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_RING_F_EVENT_IDX)) {
> +            err_cause = "Have event idx";
> +        } else if (hdev->acked_features &
> +                   BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC)) {
> +            err_cause = "Supports indirect descriptors";
> +        } else if (!hdev->vhost_ops->vhost_set_vring_enable) {
> +            err_cause = "Cannot pause device";
> +        }
> +
> +        if (err_cause) {
>               goto err;
>           }
>   




* Re: [RFC v2 00/13] vDPA software assisted live migration
  2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
                   ` (12 preceding siblings ...)
  2021-03-15 19:48 ` [RFC v2 13/13] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
@ 2021-03-16  8:28 ` Jason Wang
  2021-03-16 17:25   ` Eugenio Perez Martin
  13 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-16  8:28 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Eli Cohen, Stefano Garzarella, Michael Lilja,
	Jim Harford, Rob Miller


On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> This series enable shadow virtqueue for vhost-net devices. This is a
> new method of vhost devices migration: Instead of relay on vhost
> device's dirty logging capability, SW assisted LM intercepts dataplane,
> forwarding the descriptors between VM and device. Is intended for vDPA
> devices with no logging, but this address the basic platform to build
> that support on.
>
> In this migration mode, qemu offers a new vring to the device to
> read and write into, and disable vhost notifiers, processing guest and
> vhost notifications in qemu. On used buffer relay, qemu will mark the
> dirty memory as with plain virtio-net devices. This way, devices does
> not need to have dirty page logging capability.
>
> This series is a POC doing SW LM for vhost-net devices, which already
> have dirty page logging capabilities. For qemu to use shadow virtqueues
> these vhost-net devices need to be instantiated:
> * With IOMMU (iommu_platform=on,ats=on)
> * Without event_idx (event_idx=off)
>
> And shadow virtqueue needs to be enabled for them with QMP command
> like:
>
> {
>    "execute": "x-vhost-enable-shadow-vq",
>    "arguments": {
>      "name": "virtio-net",
>      "enable": false
>    }
> }
>
> Just the notification forwarding (with no descriptor relay) can be
> achieved with patches 5 and 6, and starting SVQ. Previous commits
> are cleanup ones and declaration of QMP command.
>
> Commit 11 introduces the buffer forwarding. Previous one are for
> preparations again, and laters are for enabling some obvious
> optimizations.
>
> It is based on the ideas of DPDK SW assisted LM, in the series of
> DPDK's https://patchwork.dpdk.org/cover/48370/ . However, these does
> not map the shadow vq in guest's VA, but in qemu's.
>
> Comments are welcome! Especially on:
> * Different/improved way of synchronization, particularly on the race
>    of masking.
>
> TODO:
> * Event, indirect, packed, and others features of virtio - Waiting for
>    confirmation of the big picture.


So two things are on my mind after reviewing the series:

1) At which layer should we implement the shadow virtqueue? E.g. if you 
want to do it at the virtio level, you need to deal with a lot of 
synchronization. I prefer to do it in vhost-vDPA.
2) Using VA as IOVA, which cannot work for vhost-vDPA.


> * vDPA devices:


So I think we can start from a vhost-vDPA-specific shadow virtqueue 
first, then extend it into a general one, which might be much easier.


> Developing solutions for tracking the available IOVA
>    space for all devices.


For vhost-net, you can assume that [0, ULLONG_MAX] is valid so you can 
simply use VA as IOVA.


> Small POC available, skipping the get/set
>    status (since vDPA does not support it) and just allocating more and
>    more IOVA addresses in a hardcoded range available for the device.


I'm not sure this can work, but you need to make sure that the range can 
fit the size of all the memory regions, and you need to deal with memory 
region add and del.

I guess you probably need a fully functional, tree-based IOVA allocator.
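
The interface of such an allocator could look roughly like this (all
names are hypothetical, just to sketch the shape):

    typedef struct SVQIOVATree SVQIOVATree;

    /* Track a [iova_first, iova_last] range usable by the device */
    SVQIOVATree *svq_iova_tree_new(hwaddr iova_first, hwaddr iova_last);

    /* Reserve a free [*iova, *iova + size) range, mapped to vaddr */
    int svq_iova_tree_alloc_map(SVQIOVATree *t, void *vaddr, size_t size,
                                hwaddr *iova);

    /* Translate on IOTLB miss */
    int svq_iova_tree_find(const SVQIOVATree *t, hwaddr iova,
                           void **vaddr, size_t *size);

    /* Drop a mapping on memory region del */
    void svq_iova_tree_remove(SVQIOVATree *t, hwaddr iova, size_t size);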

Thanks


> * To sepparate buffers forwarding in its own AIO context, so we can
>    throw more threads to that task and we don't need to stop the main
>    event loop.
> * IOMMU optimizations, so bacthing and bigger chunks of IOVA can be
>    sent to device.
> * Automatic kick-in on live-migration.
> * Proper documentation.
>
> Thanks!
>
> Changes from v1 RFC:
>    * Use QMP instead of migration to start SVQ mode.
>    * Only accepting IOMMU devices, closer behavior with target devices
>      (vDPA)
>    * Fix invalid masking/unmasking of vhost call fd.
>    * Use of proper methods for synchronization.
>    * No need to modify VirtIO device code, all of the changes are
>      contained in vhost code.
>    * Delete superfluous code.
>    * An intermediate RFC was sent with only the notifications forwarding
>      changes. It can be seen in
>      https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
>    * v1 at
>      https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
>
> Eugenio Pérez (13):
>    virtio: Add virtio_queue_is_host_notifier_enabled
>    vhost: Save masked_notifier state
>    vhost: Add VhostShadowVirtqueue
>    vhost: Add x-vhost-enable-shadow-vq qmp
>    vhost: Route guest->host notification through shadow virtqueue
>    vhost: Route host->guest notification through shadow virtqueue
>    vhost: Avoid re-set masked notifier in shadow vq
>    virtio: Add vhost_shadow_vq_get_vring_addr
>    virtio: Add virtio_queue_full
>    vhost: add vhost_kernel_set_vring_enable
>    vhost: Shadow virtqueue buffers forwarding
>    vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue
>      kick
>    vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow
>      virtqueue
>
>   qapi/net.json                      |  22 ++
>   hw/virtio/vhost-shadow-virtqueue.h |  36 ++
>   include/hw/virtio/vhost.h          |   6 +
>   include/hw/virtio/virtio.h         |   3 +
>   hw/virtio/vhost-backend.c          |  29 ++
>   hw/virtio/vhost-shadow-virtqueue.c | 551 +++++++++++++++++++++++++++++
>   hw/virtio/vhost.c                  | 283 +++++++++++++++
>   hw/virtio/virtio.c                 |  23 +-
>   hw/virtio/meson.build              |   2 +-
>   9 files changed, 952 insertions(+), 3 deletions(-)
>   create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
>   create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
>




* Re: [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-16  7:18   ` Jason Wang
@ 2021-03-16 10:31     ` Eugenio Perez Martin
  2021-03-17  2:05       ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-16 10:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Tue, Mar 16, 2021 at 8:18 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> > Shadow virtqueue notifications forwarding is disabled when vhost_dev
> > stops, so code flow follows usual cleanup.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.h |   7 ++
> >   include/hw/virtio/vhost.h          |   4 +
> >   hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
> >   hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
> >   4 files changed, 265 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > index 6cc18d6acb..c891c6510d 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > @@ -17,6 +17,13 @@
> >
> >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >
> > +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> > +                           unsigned idx,
> > +                           VhostShadowVirtqueue *svq);
> > +void vhost_shadow_vq_stop(struct vhost_dev *dev,
> > +                          unsigned idx,
> > +                          VhostShadowVirtqueue *svq);
> > +
> >   VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
> >
> >   void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
> > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> > index ac963bf23d..7ffdf9aea0 100644
> > --- a/include/hw/virtio/vhost.h
> > +++ b/include/hw/virtio/vhost.h
> > @@ -55,6 +55,8 @@ struct vhost_iommu {
> >       QLIST_ENTRY(vhost_iommu) iommu_next;
> >   };
> >
> > +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > +
> >   typedef struct VhostDevConfigOps {
> >       /* Vhost device config space changed callback
> >        */
> > @@ -83,7 +85,9 @@ struct vhost_dev {
> >       uint64_t backend_cap;
> >       bool started;
> >       bool log_enabled;
> > +    bool shadow_vqs_enabled;
> >       uint64_t log_size;
> > +    VhostShadowVirtqueue **shadow_vqs;
>
>
> Any reason that you don't embed the shadow virtqueue into
> vhost_virtqueue structure?
>

Not really: it could be relatively big, and I would prefer the SVQ
members/methods to remain hidden from any other part that includes
vhost.h. But it could be changed, for sure.

> (Note that there's a masked_notifier in struct vhost_virtqueue).
>

They are used differently: in SVQ the masked notifier is a pointer, and
if it's NULL the SVQ code knows that the device is not masked. The
vhost_virtqueue is the real owner.

It could be replaced by a boolean in SVQ or something like that. I
experimented with a tri-state too (UNMASKED, MASKED, MASKED_NOTIFIED),
letting the vhost.c code manage all the transitions. But I find the
pointer clearer, since it's more natural for the existing
vhost_virtqueue_mask and vhost_virtqueue_pending functions.

This masking/unmasking is the part I dislike the most from this
series, so I'm very open to alternatives.
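
For reference, the tri-state mentioned above could look roughly like
this (a sketch; it is not part of the series as posted):

    typedef enum SVQMaskState {
        SVQ_UNMASKED,        /* notify the guest directly via irqfd */
        SVQ_MASKED,          /* vq masked, nothing signalled yet */
        SVQ_MASKED_NOTIFIED, /* masked, masked notifier already set */
    } SVQMaskState;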

>
> >       Error *migration_blocker;
> >       const VhostOps *vhost_ops;
> >       void *opaque;
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 4512e5b058..3e43399e9c 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -8,9 +8,12 @@
> >    */
> >
> >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> > +#include "hw/virtio/vhost.h"
> > +
> > +#include "standard-headers/linux/vhost_types.h"
> >
> >   #include "qemu/error-report.h"
> > -#include "qemu/event_notifier.h"
> > +#include "qemu/main-loop.h"
> >
> >   /* Shadow virtqueue to relay notifications */
> >   typedef struct VhostShadowVirtqueue {
> > @@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
> >       EventNotifier kick_notifier;
> >       /* Shadow call notifier, sent to vhost */
> >       EventNotifier call_notifier;
> > +
> > +    /*
> > +     * Borrowed virtqueue's guest to host notifier.
> > +     * To borrow it in this event notifier allows to register on the event
> > +     * loop and access the associated shadow virtqueue easily. If we use the
> > +     * VirtQueue, we don't have an easy way to retrieve it.
>
>
> So this is something that worries me. It looks like a layering
> violation that makes the code harder to get right.
>

I don't follow you here.

The vhost code already depends on the virtqueue in the same sense:
virtio_queue_get_host_notifier is called in vhost_virtqueue_start. So if
this behavior ever changes, it is unlikely that vhost keeps working
without changes. vhost_virtqueue has kick/call ints where I think this
should actually be stored, but they are never used as far as I can see.

The previous RFC relied on vhost_dev_disable_notifiers. From its documentation:
/* Stop processing guest IO notifications in vhost.
 * Start processing them in qemu.
 ...
But it was easier for that mode to miss a notification, since a new
host_notifier is created in virtio_bus_set_host_notifier right away. So
I decided to use the file descriptor already sent to vhost in regular
operation mode, so that fewer guest-related resources change.

Having said that, maybe it's useful to assert that
vhost_dev_{enable,disable}_notifiers are never called in shadow
virtqueue mode. Also, it could be useful to retrieve the notifier from
virtio_bus instead of from the raw shadow virtqueue, so all get/set
operations are performed through it. Would that make more sense?

> I wonder if it would be simpler to start from a vDPA dedicated shadow
> virtqueue implementation:
>
> 1) have the above fields embeded in vhost_vdpa structure
> 2) Work at the level of
> vhost_vdpa_set_vring_kick()/vhost_vdpa_set_vring_call()
>

This notifier is never sent to the device in shadow virtqueue mode. It's
there for SVQ to react to the guest's notifications, registering it on
its main event loop [1]. So if I perform these changes the way I
understand them, SVQ would still rely on this borrowed EventNotifier,
and it would send the newly created kick_notifier of
VhostShadowVirtqueue to the vDPA device.

> Then the layer is still isolated, and you have a much simpler context
> to work in where you don't need to care about a lot of synchronization:
>
> 1) vq masking

This EventNotifier is not used for masking; it does not change from the
start of shadow virtqueue operation through its end. The call fd sent to
the vhost/vdpa device does not change on masking/unmasking in shadow
virtqueue mode either. I will try to document it better.

I think that we will need to handle synchronization with
masking/unmasking from the guest and dynamically enabling SVQ
operation mode, since they can happen at the same time as long as we
let the guest run. There may be better ways of synchronizing them of
course, but I don't see how moving to the vhost-vdpa backend helps
with this. Please expand if I've missed it.

Or do you mean to forbid regular <-> SVQ operation mode transitions and
delay them to future patchsets?

> 2) vhost dev start and stop
>
> ?
>
>
> > +     *
> > +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> > +     */
> > +    EventNotifier host_notifier;
> > +
> > +    /* Virtio queue shadowing */
> > +    VirtQueue *vq;
> >   } VhostShadowVirtqueue;
> >
> > +/* Forward guest notifications */
> > +static void vhost_handle_guest_kick(EventNotifier *n)
> > +{
> > +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > +                                             host_notifier);
> > +
> > +    if (unlikely(!event_notifier_test_and_clear(n))) {
> > +        return;
> > +    }
> > +
> > +    event_notifier_set(&svq->kick_notifier);
> > +}
> > +
> > +/*
> > + * Restore the vhost guest to host notifier, i.e., disables svq effect.
> > + */
> > +static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
> > +                                                     unsigned vhost_index,
> > +                                                     VhostShadowVirtqueue *svq)
> > +{
> > +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> > +    struct vhost_vring_file file = {
> > +        .index = vhost_index,
> > +        .fd = event_notifier_get_fd(vq_host_notifier),
> > +    };
> > +    int r;
> > +
> > +    /* Restore vhost kick */
> > +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> > +    return r ? -errno : 0;
> > +}
> > +
> > +/*
> > + * Start shadow virtqueue operation.
> > + * @dev vhost device
> > + * @hidx vhost virtqueue index
> > + * @svq Shadow Virtqueue
> > + */
> > +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> > +                           unsigned idx,
> > +                           VhostShadowVirtqueue *svq)
>
>
> It looks to me this assumes the vhost_dev is started before
> vhost_shadow_vq_start()?
>

Right.

>
> > +{
> > +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> > +    struct vhost_vring_file file = {
> > +        .index = idx,
> > +        .fd = event_notifier_get_fd(&svq->kick_notifier),
> > +    };
> > +    int r;
> > +
> > +    /* Check that notifications are still going directly to vhost dev */
> > +    assert(virtio_queue_is_host_notifier_enabled(svq->vq));
> > +
> > +    /*
> > +     * event_notifier_set_handler already checks for guest's notifications if
> > +     * they arrive in the switch, so there is no need to explicitly check for
> > +     * them.
> > +     */
> > +    event_notifier_init_fd(&svq->host_notifier,
> > +                           event_notifier_get_fd(vq_host_notifier));
> > +    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);
> > +
> > +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> > +    if (unlikely(r != 0)) {
> > +        error_report("Couldn't set kick fd: %s", strerror(errno));
> > +        goto err_set_vring_kick;
> > +    }
> > +
> > +    return true;
> > +
> > +err_set_vring_kick:
> > +    event_notifier_set_handler(&svq->host_notifier, NULL);
> > +
> > +    return false;
> > +}
> > +
> > +/*
> > + * Stop shadow virtqueue operation.
> > + * @dev vhost device
> > + * @idx vhost queue index
> > + * @svq Shadow Virtqueue
> > + */
> > +void vhost_shadow_vq_stop(struct vhost_dev *dev,
> > +                          unsigned idx,
> > +                          VhostShadowVirtqueue *svq)
> > +{
> > +    int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
> > +    if (unlikely(r < 0)) {
> > +        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> > +    }
> > +
> > +    event_notifier_set_handler(&svq->host_notifier, NULL);
> > +}
> > +
> >   /*
> >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> >    * methods and file descriptors.
> >    */
> >   VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >   {
> > +    int vq_idx = dev->vq_index + idx;
> >       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> >       int r;
> >
> > @@ -43,6 +153,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >           goto err_init_call_notifier;
> >       }
> >
> > +    svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> >       return g_steal_pointer(&svq);
> >
> >   err_init_call_notifier:
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index 97f1bcfa42..4858a35df6 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -25,6 +25,7 @@
> >   #include "exec/address-spaces.h"
> >   #include "hw/virtio/virtio-bus.h"
> >   #include "hw/virtio/virtio-access.h"
> > +#include "hw/virtio/vhost-shadow-virtqueue.h"
> >   #include "migration/blocker.h"
> >   #include "migration/qemu-file-types.h"
> >   #include "sysemu/dma.h"
> > @@ -1219,6 +1220,74 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
> >                          0, virtio_queue_get_desc_size(vdev, idx));
> >   }
> >
> > +static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> > +{
> > +    int idx;
> > +
> > +    dev->shadow_vqs_enabled = false;
> > +
> > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> > +        vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
> > +        vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> > +    }
> > +
> > +    g_free(dev->shadow_vqs);
> > +    dev->shadow_vqs = NULL;
> > +    return 0;
> > +}
> > +
> > +static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> > +{
> > +    int idx, stop_idx;
> > +
> > +    dev->shadow_vqs = g_new0(VhostShadowVirtqueue *, dev->nvqs);
> > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> > +        dev->shadow_vqs[idx] = vhost_shadow_vq_new(dev, idx);
> > +        if (unlikely(dev->shadow_vqs[idx] == NULL)) {
> > +            goto err_new;
> > +        }
> > +    }
> > +
> > +    dev->shadow_vqs_enabled = true;
> > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> > +        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> > +        if (unlikely(!ok)) {
> > +            goto err_start;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +
> > +err_start:
> > +    dev->shadow_vqs_enabled = false;
> > +    for (stop_idx = 0; stop_idx < idx; stop_idx++) {
> > +        vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
> > +    }
> > +
> > +err_new:
> > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> > +        vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> > +    }
> > +    g_free(dev->shadow_vqs);
> > +
> > +    return -1;
> > +}
> > +
> > +static int vhost_sw_live_migration_enable(struct vhost_dev *dev,
> > +                                          bool enable_lm)
> > +{
>
>
> So the live migration part should be done in a separate patch.
>

Right, I missed the renaming of this one.

> Thanks
>

[1] or, hopefully in the future, in an independent aio context.


>
> > +    int r;
> > +
> > +    if (enable_lm == dev->shadow_vqs_enabled) {
> > +        return 0;
> > +    }
> > +
> > +    r = enable_lm ? vhost_sw_live_migration_start(dev)
> > +                  : vhost_sw_live_migration_stop(dev);
> > +
> > +    return r;
> > +}
> > +
> >   static void vhost_eventfd_add(MemoryListener *listener,
> >                                 MemoryRegionSection *section,
> >                                 bool match_data, uint64_t data, EventNotifier *e)
> > @@ -1381,6 +1450,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
> >       hdev->log = NULL;
> >       hdev->log_size = 0;
> >       hdev->log_enabled = false;
> > +    hdev->shadow_vqs_enabled = false;
> >       hdev->started = false;
> >       memory_listener_register(&hdev->memory_listener, &address_space_memory);
> >       QLIST_INSERT_HEAD(&vhost_devices, hdev, entry);
> > @@ -1484,6 +1554,10 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
> >       BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
> >       int i, r;
> >
> > +    if (hdev->shadow_vqs_enabled) {
> > +        vhost_sw_live_migration_enable(hdev, false);
> > +    }
> > +
> >       for (i = 0; i < hdev->nvqs; ++i) {
> >           r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), hdev->vq_index + i,
> >                                            false);
> > @@ -1798,6 +1872,7 @@ fail_features:
> >   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
> >   {
> >       int i;
> > +    bool is_shadow_vqs_enabled = hdev->shadow_vqs_enabled;
> >
> >       /* should only be called after backend is connected */
> >       assert(hdev->vhost_ops);
> > @@ -1805,7 +1880,16 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
> >       if (hdev->vhost_ops->vhost_dev_start) {
> >           hdev->vhost_ops->vhost_dev_start(hdev, false);
> >       }
> > +    if (is_shadow_vqs_enabled) {
> > +        /* Shadow virtqueue will be stopped */
> > +        hdev->shadow_vqs_enabled = false;
> > +    }
> >       for (i = 0; i < hdev->nvqs; ++i) {
> > +        if (is_shadow_vqs_enabled) {
> > +            vhost_shadow_vq_stop(hdev, i, hdev->shadow_vqs[i]);
> > +            vhost_shadow_vq_free(hdev->shadow_vqs[i]);
> > +        }
> > +
> >           vhost_virtqueue_stop(hdev,
> >                                vdev,
> >                                hdev->vqs + i,
> > @@ -1819,6 +1903,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
> >           memory_listener_unregister(&hdev->iommu_listener);
> >       }
> >       vhost_log_put(hdev, true);
> > +    g_free(hdev->shadow_vqs);
> > +    hdev->shadow_vqs_enabled = false;
> >       hdev->started = false;
> >       hdev->vdev = NULL;
> >   }
> > @@ -1835,5 +1921,60 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
> >
> >   void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >   {
> > -    error_setg(errp, "Shadow virtqueue still not implemented");
> > +    struct vhost_dev *hdev, *hdev_err;
> > +    VirtIODevice *vdev;
> > +    const char *err_cause = NULL;
> > +    int r;
> > +    ErrorClass err_class = ERROR_CLASS_GENERIC_ERROR;
> > +
> > +    QLIST_FOREACH(hdev, &vhost_devices, entry) {
> > +        if (hdev->vdev && 0 == strcmp(hdev->vdev->name, name)) {
> > +            vdev = hdev->vdev;
> > +            break;
> > +        }
> > +    }
> > +
> > +    if (!hdev) {
> > +        err_class = ERROR_CLASS_DEVICE_NOT_FOUND;
> > +        err_cause = "Device not found";
> > +        goto not_found_err;
> > +    }
> > +
> > +    for ( ; hdev; hdev = QLIST_NEXT(hdev, entry)) {
> > +        if (vdev != hdev->vdev) {
> > +            continue;
> > +        }
> > +
> > +        if (!hdev->started) {
> > +            err_cause = "Device is not started";
> > +            goto err;
> > +        }
> > +
> > +        r = vhost_sw_live_migration_enable(hdev, enable);
> > +        if (unlikely(r)) {
> > +            err_cause = "Error enabling (see monitor)";
> > +            goto err;
> > +        }
> > +    }
> > +
> > +    return;
> > +
> > +err:
> > +    QLIST_FOREACH(hdev_err, &vhost_devices, entry) {
> > +        if (hdev_err == hdev) {
> > +            break;
> > +        }
> > +
> > +        if (vdev != hdev->vdev) {
> > +            continue;
> > +        }
> > +
> > +        vhost_sw_live_migration_enable(hdev, !enable);
> > +    }
> > +
> > +not_found_err:
> > +    if (err_cause) {
> > +        error_set(errp, err_class,
> > +                  "Can't enable shadow vq on %s: %s", name, err_cause);
> > +    }
> >   }
>




* Re: [RFC v2 10/13] vhost: add vhost_kernel_set_vring_enable
  2021-03-16  7:29   ` Jason Wang
@ 2021-03-16 10:43     ` Eugenio Perez Martin
  2021-03-17  2:25       ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-16 10:43 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Tue, Mar 16, 2021 at 8:30 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> > This method is already present in vhost-user. This commit adapts it to
> > vhost-net, so SVQ can use it.
> >
> > vhost_kernel_set_enable stops the device, so qemu can ask for its status
> > (the next available idx the device was going to consume). When SVQ starts
> > it can resume consuming the guest's driver ring, without notice from the
> > latter. Not stopping the device before the swap could imply that it
> > processes more buffers than reported, which would duplicate the device's
> > actions.
>
>
> Note that this might not be the case for vDPA (virtio), or at least
> virtio needs some extension to achieve something similar. One example is
> virtio-pci, which forbids writing 0 to queue_enable.
>
> This is another reason to start from vhost-vDPA.
>
>
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-backend.c | 29 +++++++++++++++++++++++++++++
> >   1 file changed, 29 insertions(+)
> >
> > diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> > index 31b33bde37..1ac5c574a9 100644
> > --- a/hw/virtio/vhost-backend.c
> > +++ b/hw/virtio/vhost-backend.c
> > @@ -201,6 +201,34 @@ static int vhost_kernel_get_vq_index(struct vhost_dev *dev, int idx)
> >       return idx - dev->vq_index;
> >   }
> >
> > +static int vhost_kernel_set_vq_enable(struct vhost_dev *dev, unsigned idx,
> > +                                      bool enable)
> > +{
> > +    struct vhost_vring_file file = {
> > +        .index = idx,
> > +    };
> > +
> > +    if (!enable) {
> > +        file.fd = -1; /* Pass -1 to unbind from file. */
> > +    } else {
> > +        struct vhost_net *vn_dev = container_of(dev, struct vhost_net, dev);
> > +        file.fd = vn_dev->backend;
>
>
> This can only work with vhost-net devices but not vsock/scsi etc.
>

Right. Shadow virtqueue code should also check the return value of the
vhost_set_vring_enable call.

I'm not sure how to solve it without resorting to some if-else/switch
chain, checking for specific net/vsock/... features, or relying on some
other qemu class facilities. However, since the main use case is vDPA
live migration, this commit could be left out, and SVQ operation would
only be valid for vhost-vdpa and vhost-user.
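
The return-value check mentioned above could be as simple as this
sketch, using the ops pointer the series already relies on:

    if (!dev->vhost_ops->vhost_set_vring_enable) {
        return -ENOTSUP; /* backend cannot pause its rings */
    }
    r = dev->vhost_ops->vhost_set_vring_enable(dev, enable);
    if (unlikely(r)) {
        error_report("Cannot set vring enable: %d", r);
        return r;
    }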

> Thanks
>
>
> > +    }
> > +
> > +    return vhost_kernel_net_set_backend(dev, &file);
> > +}
> > +
> > +static int vhost_kernel_set_vring_enable(struct vhost_dev *dev, int enable)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < dev->nvqs; ++i) {
> > +        vhost_kernel_set_vq_enable(dev, i, enable);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >   #ifdef CONFIG_VHOST_VSOCK
> >   static int vhost_kernel_vsock_set_guest_cid(struct vhost_dev *dev,
> >                                               uint64_t guest_cid)
> > @@ -317,6 +345,7 @@ static const VhostOps kernel_ops = {
> >           .vhost_set_owner = vhost_kernel_set_owner,
> >           .vhost_reset_device = vhost_kernel_reset_device,
> >           .vhost_get_vq_index = vhost_kernel_get_vq_index,
> > +        .vhost_set_vring_enable = vhost_kernel_set_vring_enable,
> >   #ifdef CONFIG_VHOST_VSOCK
> >           .vhost_vsock_set_guest_cid = vhost_kernel_vsock_set_guest_cid,
> >           .vhost_vsock_set_running = vhost_kernel_vsock_set_running,
>




* Re: [RFC v2 04/13] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-03-15 19:48 ` [RFC v2 04/13] vhost: Add x-vhost-enable-shadow-vq qmp Eugenio Pérez
@ 2021-03-16 13:37   ` Eric Blake
  0 siblings, 0 replies; 46+ messages in thread
From: Eric Blake @ 2021-03-16 13:37 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Jason Wang,
	Juan Quintela, Markus Armbruster, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, Stefano Garzarella,
	Michael Lilja, Jim Harford, Rob Miller

On 3/15/21 2:48 PM, Eugenio Pérez wrote:
> Command to enable shadow virtqueue looks like:
> 
> { "execute": "x-vhost-enable-shadow-vq", "arguments": { "name": "dev0", "enable": true } }
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  qapi/net.json     | 22 ++++++++++++++++++++++
>  hw/virtio/vhost.c |  6 ++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/qapi/net.json b/qapi/net.json
> index c31748c87f..4c5f65d021 100644
> --- a/qapi/net.json
> +++ b/qapi/net.json
> @@ -77,6 +77,28 @@
>  ##
>  { 'command': 'netdev_del', 'data': {'id': 'str'} }
>  
> +##
> +# @x-vhost-enable-shadow-vq:
> +#
> +# Use vhost shadow virtqueue.
> +#
> +# @name: the device name of the VirtIO device
> +#
> +# @enable: true to use the alternate shadow VQ notification path
> +#
> +# Returns: Error if failure, or 'no error' for success. Not found if vhost is not enabled.
> +#
> +# Since: 6.0
> +#
> +# Example:
> +#
> +# -> { "execute": "x-vhost-enable-shadow-vq", "arguments": { "name": "virtio-net", "enable": false } }

Long lines; please wrap to keep under 80 columns.
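
For instance, the example line could be wrapped as:

    # -> { "execute": "x-vhost-enable-shadow-vq",
    #      "arguments": { "name": "virtio-net", "enable": false } }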

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [RFC v2 08/13] virtio: Add vhost_shadow_vq_get_vring_addr
  2021-03-16  7:50   ` Jason Wang
@ 2021-03-16 15:20     ` Eugenio Perez Martin
  2021-05-17 17:39       ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-16 15:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Tue, Mar 16, 2021 at 8:50 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> > It reports the shadow virtqueue address from qemu virtual address space
>
>
> Note that for this to be usable by vDPA, we can't use qemu VA directly here.
>

Right, I'm planning to use a different virtual address space if the
device has such limitations.

>
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.h |  2 ++
> >   hw/virtio/vhost-shadow-virtqueue.c | 24 +++++++++++++++++++++++-
> >   2 files changed, 25 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > index 2ca4b92b12..d82c35bccf 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > @@ -19,6 +19,8 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >
> >   void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked);
> >   void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq);
> > +void vhost_shadow_vq_get_vring_addr(const VhostShadowVirtqueue *svq,
> > +                                    struct vhost_vring_addr *addr);
> >
> >   bool vhost_shadow_vq_start(struct vhost_dev *dev,
> >                              unsigned idx,
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index b6bab438d6..1460d1d5d1 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -17,6 +17,9 @@
> >
> >   /* Shadow virtqueue to relay notifications */
> >   typedef struct VhostShadowVirtqueue {
> > +    /* Shadow vring */
> > +    struct vring vring;
> > +
> >       /* Shadow kick notifier, sent to vhost */
> >       EventNotifier kick_notifier;
> >       /* Shadow call notifier, sent to vhost */
> > @@ -51,6 +54,9 @@ typedef struct VhostShadowVirtqueue {
> >
> >       /* Virtio device */
> >       VirtIODevice *vdev;
> > +
> > +    /* Descriptors copied from guest */
> > +    vring_desc_t descs[];
> >   } VhostShadowVirtqueue;
> >
> >   /* Forward guest notifications */
> > @@ -132,6 +138,19 @@ void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq)
> >       qemu_event_wait(&svq->masked_notifier.is_free);
> >   }
> >
> > +/*
> > + * Get the shadow vq vring address.
> > + * @svq Shadow virtqueue
> > + * @addr Destination to store address
> > + */
> > +void vhost_shadow_vq_get_vring_addr(const VhostShadowVirtqueue *svq,
> > +                                    struct vhost_vring_addr *addr)
> > +{
> > +    addr->desc_user_addr = (uint64_t)svq->vring.desc;
> > +    addr->avail_user_addr = (uint64_t)svq->vring.avail;
> > +    addr->used_user_addr = (uint64_t)svq->vring.used;
> > +}
> > +
> >   /*
> >    * Restore the vhost guest to host notifier, i.e., disables svq effect.
> >    */
> > @@ -262,7 +281,9 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >   VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >   {
> >       int vq_idx = dev->vq_index + idx;
> > -    g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> > +    unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > +    size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
> > +    g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
> >       int r;
> >
> >       r = event_notifier_init(&svq->kick_notifier, 0);
> > @@ -279,6 +300,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >           goto err_init_call_notifier;
> >       }
> >
> > +    vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
>
>
> We had some dicussion in the past. Exporting vring_init() is wrong but
> too late to fix (assumes a legacy split layout). Let's not depend on
> this buggy uAPI.
>

Ok, I will change the way to allocate and initialize it.

> Thanks
>
>
> >       svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> >       svq->vdev = dev->vdev;
> >       event_notifier_set_handler(&svq->call_notifier,
>




* Re: [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding
  2021-03-16  8:15   ` Jason Wang
@ 2021-03-16 16:05     ` Eugenio Perez Martin
  2021-03-17  2:50       ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-16 16:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Tue, Mar 16, 2021 at 9:15 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> > Initial version of shadow virtqueue that actually forwards buffers.
> >
> > It reuses the VirtQueue code for the device part. The driver part is
> > based on Linux's virtio_ring driver, but with stripped functionality
> > and optimizations so it's easier to review.
> >
> > These will be added in later commits.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.c | 212 +++++++++++++++++++++++++++--
> >   hw/virtio/vhost.c                  | 113 ++++++++++++++-
> >   2 files changed, 312 insertions(+), 13 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 1460d1d5d1..68ed0f2740 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -9,6 +9,7 @@
> >
> >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> >   #include "hw/virtio/vhost.h"
> > +#include "hw/virtio/virtio-access.h"
> >
> >   #include "standard-headers/linux/vhost_types.h"
> >
> > @@ -55,11 +56,96 @@ typedef struct VhostShadowVirtqueue {
> >       /* Virtio device */
> >       VirtIODevice *vdev;
> >
> > +    /* Map for returning guest's descriptors */
> > +    VirtQueueElement **ring_id_maps;
> > +
> > +    /* Next head to expose to device */
> > +    uint16_t avail_idx_shadow;
> > +
> > +    /* Next free descriptor */
> > +    uint16_t free_head;
> > +
> > +    /* Last seen used idx */
> > +    uint16_t shadow_used_idx;
> > +
> > +    /* Next head to consume from device */
> > +    uint16_t used_idx;
> > +
> >       /* Descriptors copied from guest */
> >       vring_desc_t descs[];
> >   } VhostShadowVirtqueue;
> >
> > -/* Forward guest notifications */
> > +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > +                                    const struct iovec *iovec,
> > +                                    size_t num, bool more_descs, bool write)
> > +{
> > +    uint16_t i = svq->free_head, last = svq->free_head;
> > +    unsigned n;
> > +    uint16_t flags = write ? virtio_tswap16(svq->vdev, VRING_DESC_F_WRITE) : 0;
> > +    vring_desc_t *descs = svq->vring.desc;
> > +
> > +    if (num == 0) {
> > +        return;
> > +    }
> > +
> > +    for (n = 0; n < num; n++) {
> > +        if (more_descs || (n + 1 < num)) {
> > +            descs[i].flags = flags | virtio_tswap16(svq->vdev,
> > +                                                    VRING_DESC_F_NEXT);
> > +        } else {
> > +            descs[i].flags = flags;
> > +        }
> > +        descs[i].addr = virtio_tswap64(svq->vdev, (hwaddr)iovec[n].iov_base);
>
>
> So using virtio_tswap() is probably not correct since we're talking
> with vhost backends which have their own endianness.
>

I was trying to expose the buffer with the same endianness as the
driver originally offered, so if guest->qemu requires a bswap, I think
exposing it to the device will always require a bswap back.

> For vhost-vDPA, we can assume that it's a 1.0 device.

Isn't it needed if the host is big endian?

>
>
> > +        descs[i].len = virtio_tswap32(svq->vdev, iovec[n].iov_len);
> > +
> > +        last = i;
> > +        i = virtio_tswap16(svq->vdev, descs[i].next);
> > +    }
> > +
> > +    svq->free_head = virtio_tswap16(svq->vdev, descs[last].next);
> > +}
> > +
> > +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
> > +                                          VirtQueueElement *elem)
> > +{
> > +    int head;
> > +    unsigned avail_idx;
> > +    vring_avail_t *avail = svq->vring.avail;
> > +
> > +    head = svq->free_head;
> > +
> > +    /* We need some descriptors here */
> > +    assert(elem->out_num || elem->in_num);
> > +
> > +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > +                            elem->in_num > 0, false);
> > +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > +
> > +    /*
> > +     * Put entry in available array (but don't update avail->idx until they
> > +     * do sync).
> > +     */
> > +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> > +    avail->ring[avail_idx] = virtio_tswap16(svq->vdev, head);
> > +    svq->avail_idx_shadow++;
> > +
> > +    /* Expose descriptors to device */
> > +    smp_wmb();
> > +    avail->idx = virtio_tswap16(svq->vdev, svq->avail_idx_shadow);
> > +
> > +    return head;
> > +
> > +}
> > +
> > +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
> > +                                VirtQueueElement *elem)
> > +{
> > +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
> > +
> > +    svq->ring_id_maps[qemu_head] = elem;
> > +}
> > +
> > +/* Handle guest->device notifications */
> >   static void vhost_handle_guest_kick(EventNotifier *n)
> >   {
> >       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > @@ -69,7 +155,72 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >           return;
> >       }
> >
> > -    event_notifier_set(&svq->kick_notifier);
> > +    /* Make available as many buffers as possible */
> > +    do {
> > +        if (virtio_queue_get_notification(svq->vq)) {
> > +            /* No more notifications until process all available */
> > +            virtio_queue_set_notification(svq->vq, false);
> > +        }
> > +
> > +        while (true) {
> > +            VirtQueueElement *elem;
> > +            if (virtio_queue_full(svq->vq)) {
> > +                break;
>
>
> So we've disabled guest notification. If buffer has been consumed, we
> need to retry the handle_guest_kick here. But I didn't find the code?
>

This code follows the pattern of virtio_blk_handle_vq: we jump out of
the inner while, and we re-enable the notifications. After that, we
check for updates on guest avail_idx.
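
For reference, a condensed sketch of that pattern (the element handling
is elided; in SVQ it is the add + kick shown above):

/* Sketch of the virtio_blk_handle_vq pattern: disable guest
 * notifications while draining avail, re-enable them, then re-check
 * emptiness so a buffer made available in the race window between the
 * last pop and re-enabling is not missed. */
static void handle_vq_sketch(VirtQueue *vq)
{
    do {
        virtio_queue_set_notification(vq, false);

        for (;;) {
            VirtQueueElement *elem = virtqueue_pop(vq, sizeof(*elem));
            if (!elem) {
                break;
            }
            /* ...forward elem to the shadow ring and kick vhost... */
            g_free(elem); /* freed here only to keep the sketch leak-free */
        }

        virtio_queue_set_notification(vq, true);
    } while (!virtio_queue_empty(vq));
}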

>
> > +            }
> > +
> > +            elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > +            if (!elem) {
> > +                break;
> > +            }
> > +
> > +            vhost_shadow_vq_add(svq, elem);
> > +            event_notifier_set(&svq->kick_notifier);
> > +        }
> > +
> > +        virtio_queue_set_notification(svq->vq, true);
> > +    } while (!virtio_queue_empty(svq->vq));
> > +}
> > +
> > +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
> > +{
> > +    if (svq->used_idx != svq->shadow_used_idx) {
> > +        return true;
> > +    }
> > +
> > +    /* Get used idx must not be reordered */
> > +    smp_rmb();
> > +    svq->shadow_used_idx = virtio_tswap16(svq->vdev, svq->vring.used->idx);
> > +
> > +    return svq->used_idx != svq->shadow_used_idx;
> > +}
> > +
> > +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
> > +{
> > +    vring_desc_t *descs = svq->vring.desc;
> > +    const vring_used_t *used = svq->vring.used;
> > +    vring_used_elem_t used_elem;
> > +    uint16_t last_used;
> > +
> > +    if (!vhost_shadow_vq_more_used(svq)) {
> > +        return NULL;
> > +    }
> > +
> > +    last_used = svq->used_idx & (svq->vring.num - 1);
> > +    used_elem.id = virtio_tswap32(svq->vdev, used->ring[last_used].id);
> > +    used_elem.len = virtio_tswap32(svq->vdev, used->ring[last_used].len);
> > +
> > +    if (unlikely(used_elem.id >= svq->vring.num)) {
> > +        error_report("Device %s says index %u is available", svq->vdev->name,
> > +                     used_elem.id);
> > +        return NULL;
> > +    }
> > +
> > +    descs[used_elem.id].next = svq->free_head;
> > +    svq->free_head = used_elem.id;
> > +
> > +    svq->used_idx++;
> > +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> >   }
> >
> >   /* Forward vhost notifications */
> > @@ -78,6 +229,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >                                                call_notifier);
> >       EventNotifier *masked_notifier;
> > +    VirtQueue *vq = svq->vq;
> >
> >       /* Signal start of using masked notifier */
> >       qemu_event_reset(&svq->masked_notifier.is_free);
> > @@ -86,14 +238,29 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >           qemu_event_set(&svq->masked_notifier.is_free);
> >       }
> >
> > -    if (!masked_notifier) {
> > -        unsigned n = virtio_get_queue_index(svq->vq);
> > -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
> > -        virtio_notify_irqfd(svq->vdev, svq->vq);
> > -    } else if (!svq->masked_notifier.signaled) {
> > -        svq->masked_notifier.signaled = true;
> > -        event_notifier_set(svq->masked_notifier.n);
> > -    }
> > +    /* Make as many buffers as possible used. */
> > +    do {
> > +        unsigned i = 0;
> > +
> > +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> > +        while (true) {
> > +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
> > +            if (!elem) {
> > +                break;
> > +            }
> > +
> > +            assert(i < svq->vring.num);
> > +            virtqueue_fill(vq, elem, elem->len, i++);
> > +        }
> > +
> > +        virtqueue_flush(vq, i);
> > +        if (!masked_notifier) {
> > +            virtio_notify_irqfd(svq->vdev, svq->vq);
> > +        } else if (!svq->masked_notifier.signaled) {
> > +            svq->masked_notifier.signaled = true;
> > +            event_notifier_set(svq->masked_notifier.n);
> > +        }
> > +    } while (vhost_shadow_vq_more_used(svq));
> >
> >       if (masked_notifier) {
> >           /* Signal not using it anymore */
> > @@ -103,7 +270,6 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >
> >   static void vhost_shadow_vq_handle_call(EventNotifier *n)
> >   {
> > -
> >       if (likely(event_notifier_test_and_clear(n))) {
> >           vhost_shadow_vq_handle_call_no_test(n);
> >       }
> > @@ -254,7 +420,11 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >                             unsigned idx,
> >                             VhostShadowVirtqueue *svq)
> >   {
> > +    int i;
> >       int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
> > +
> > +    assert(!dev->shadow_vqs_enabled);
> > +
> >       if (unlikely(r < 0)) {
> >           error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> >       }
> > @@ -272,6 +442,18 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >       /* Restore vhost call */
> >       vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
> >                            dev->vqs[idx].notifier_is_masked);
> > +
> > +
> > +    for (i = 0; i < svq->vring.num; ++i) {
> > +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> > +        /*
> > +         * Although the doc says we must unpop in order, it's ok to unpop
> > +         * everything.
> > +         */
> > +        if (elem) {
> > +            virtqueue_unpop(svq->vq, elem, elem->len);
>
>
> Shouldn't we need to wait until all pending requests to be drained? Or
> we may end up duplicated requests?
>

Do you mean pending as in-flight/processing in the device? The device
must be paused at this point. Currently there is no assertion for
this; maybe we can track the device status for it.

As for the queue handlers running at this point, the main event
loop should serialize QMP and the handlers as far as I know (and a
device stopping suddenly would leave all that state inconsistent).
They would need explicit synchronization if the handlers ran in their
own AIO context. That would be nice to have, but it's not included here.

> Thanks
>
>
> > +        }
> > +    }
> >   }
> >
> >   /*
> > @@ -284,7 +466,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> >       size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
> >       g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
> > -    int r;
> > +    int r, i;
> >
> >       r = event_notifier_init(&svq->kick_notifier, 0);
> >       if (r != 0) {
> > @@ -303,6 +485,11 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >       vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
> >       svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> >       svq->vdev = dev->vdev;
> > +    for (i = 0; i < num - 1; i++) {
> > +        svq->descs[i].next = virtio_tswap16(dev->vdev, i + 1);
> > +    }
> > +
> > +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> >       event_notifier_set_handler(&svq->call_notifier,
> >                                  vhost_shadow_vq_handle_call);
> >       qemu_event_init(&svq->masked_notifier.is_free, true);
> > @@ -324,5 +511,6 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
> >       event_notifier_cleanup(&vq->kick_notifier);
> >       event_notifier_set_handler(&vq->call_notifier, NULL);
> >       event_notifier_cleanup(&vq->call_notifier);
> > +    g_free(vq->ring_id_maps);
> >       g_free(vq);
> >   }
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index eab3e334f2..a373999bc4 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -1021,6 +1021,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
> >
> >       trace_vhost_iotlb_miss(dev, 1);
> >
> > +    if (qatomic_load_acquire(&dev->shadow_vqs_enabled)) {
> > +        uaddr = iova;
> > +        len = 4096;
> > +        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
> > +                                                IOMMU_RW);
> > +        if (ret) {
> > +            trace_vhost_iotlb_miss(dev, 2);
> > +            error_report("Fail to update device iotlb");
> > +        }
> > +
> > +        return ret;
> > +    }
> > +
> >       iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
> >                                             iova, write,
> >                                             MEMTXATTRS_UNSPECIFIED);
> > @@ -1227,8 +1240,28 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> >       /* Can be read by vhost_virtqueue_mask, from vm exit */
> >       qatomic_store_release(&dev->shadow_vqs_enabled, false);
> >
> > +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
> > +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
> > +        error_report("Fail to invalidate device iotlb");
> > +    }
> > +
> >       for (idx = 0; idx < dev->nvqs; ++idx) {
> > +        /*
> > +         * Update used ring information for IOTLB to work correctly,
> > +         * vhost-kernel code requires for this.
> > +         */
> > +        struct vhost_virtqueue *vq = dev->vqs + idx;
> > +        vhost_device_iotlb_miss(dev, vq->used_phys, true);
> > +
> >           vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
> > +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
> > +                              dev->vq_index + idx);
> > +    }
> > +
> > +    /* Enable guest's vq vring */
> > +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> > +
> > +    for (idx = 0; idx < dev->nvqs; ++idx) {
> >           vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> >       }
> >
> > @@ -1237,6 +1270,59 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> >       return 0;
> >   }
> >
> > +/*
> > + * Start shadow virtqueue in a given queue.
> > + * In failure case, this function leaves queue working as regular vhost mode.
> > + */
> > +static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
> > +                                             unsigned idx)
> > +{
> > +    struct vhost_vring_addr addr = {
> > +        .index = idx,
> > +    };
> > +    struct vhost_vring_state s = {
> > +        .index = idx,
> > +    };
> > +    int r;
> > +    bool ok;
> > +
> > +    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
> > +    ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> > +    if (unlikely(!ok)) {
> > +        return false;
> > +    }
> > +
> > +    /* From this point, vhost_virtqueue_start can reset these changes */
> > +    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
> > +    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
> > +    if (unlikely(r != 0)) {
> > +        VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
> > +        goto err;
> > +    }
> > +
> > +    r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
> > +    if (unlikely(r != 0)) {
> > +        VHOST_OPS_DEBUG("vhost_set_vring_base for shadow vq failed");
> > +        goto err;
> > +    }
> > +
> > +    /*
> > +     * Update used ring information for IOTLB to work correctly,
> > +     * vhost-kernel code requires for this.
> > +     */
> > +    r = vhost_device_iotlb_miss(dev, addr.used_user_addr, true);
> > +    if (unlikely(r != 0)) {
> > +        /* Debug message already printed */
> > +        goto err;
> > +    }
> > +
> > +    return true;
> > +
> > +err:
> > +    vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
> > +    return false;
> > +}
> > +
> >   static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> >   {
> >       int idx, stop_idx;
> > @@ -1249,24 +1335,35 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> >           }
> >       }
> >
> > +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
> > +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
> > +        error_report("Fail to invalidate device iotlb");
> > +    }
> > +
> >       /* Can be read by vhost_virtqueue_mask, from vm exit */
> >       qatomic_store_release(&dev->shadow_vqs_enabled, true);
> >       for (idx = 0; idx < dev->nvqs; ++idx) {
> > -        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> > +        bool ok = vhost_sw_live_migration_start_vq(dev, idx);
> >           if (unlikely(!ok)) {
> >               goto err_start;
> >           }
> >       }
> >
> > +    /* Enable shadow vq vring */
> > +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >       return 0;
> >
> >   err_start:
> >       qatomic_store_release(&dev->shadow_vqs_enabled, false);
> >       for (stop_idx = 0; stop_idx < idx; stop_idx++) {
> >           vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
> > +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
> > +                              dev->vq_index + stop_idx);
> >       }
> >
> >   err_new:
> > +    /* Enable guest's vring */
> > +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >       for (idx = 0; idx < dev->nvqs; ++idx) {
> >           vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> >       }
> > @@ -1970,6 +2067,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >
> >           if (!hdev->started) {
> >               err_cause = "Device is not started";
> > +        } else if (!vhost_dev_has_iommu(hdev)) {
> > +            err_cause = "Does not support iommu";
> > +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
> > +            err_cause = "Is packed";
> > +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_RING_F_EVENT_IDX)) {
> > +            err_cause = "Have event idx";
> > +        } else if (hdev->acked_features &
> > +                   BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC)) {
> > +            err_cause = "Supports indirect descriptors";
> > +        } else if (!hdev->vhost_ops->vhost_set_vring_enable) {
> > +            err_cause = "Cannot pause device";
> > +        }
> > +
> > +        if (err_cause) {
> >               goto err;
> >           }
> >
>




* Re: [RFC v2 00/13] vDPA software assisted live migration
  2021-03-16  8:28 ` [RFC v2 00/13] vDPA software assisted live migration Jason Wang
@ 2021-03-16 17:25   ` Eugenio Perez Martin
  0 siblings, 0 replies; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-16 17:25 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Tue, Mar 16, 2021 at 9:28 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> > This series enable shadow virtqueue for vhost-net devices. This is a
> > new method of vhost devices migration: Instead of relay on vhost
> > device's dirty logging capability, SW assisted LM intercepts dataplane,
> > forwarding the descriptors between VM and device. Is intended for vDPA
> > devices with no logging, but this address the basic platform to build
> > that support on.
> >
> > In this migration mode, qemu offers a new vring to the device to
> > read and write into, and disable vhost notifiers, processing guest and
> > vhost notifications in qemu. On used buffer relay, qemu will mark the
> > dirty memory as with plain virtio-net devices. This way, devices does
> > not need to have dirty page logging capability.
> >
> > This series is a POC doing SW LM for vhost-net devices, which already
> > have dirty page logging capabilities. For qemu to use shadow virtqueues
> > these vhost-net devices need to be instantiated:
> > * With IOMMU (iommu_platform=on,ats=on)
> > * Without event_idx (event_idx=off)
> >
> > And shadow virtqueue needs to be enabled for them with QMP command
> > like:
> >
> > {
> >    "execute": "x-vhost-enable-shadow-vq",
> >    "arguments": {
> >      "name": "virtio-net",
> >      "enable": false
> >    }
> > }
> >
> > Just the notification forwarding (with no descriptor relay) can be
> > achieved with patches 5 and 6, and starting SVQ. Previous commits
> > are cleanup ones and declaration of QMP command.
> >
> > Commit 11 introduces the buffer forwarding. Previous one are for
> > preparations again, and laters are for enabling some obvious
> > optimizations.
> >
> > It is based on the ideas of DPDK SW assisted LM, in the series of
> > DPDK's https://patchwork.dpdk.org/cover/48370/ . However, these does
> > not map the shadow vq in guest's VA, but in qemu's.
> >
> > Comments are welcome! Especially on:
> > * Different/improved way of synchronization, particularly on the race
> >    of masking.
> >
> > TODO:
> > * Event, indirect, packed, and others features of virtio - Waiting for
> >    confirmation of the big picture.
>
>
> So two things in my mind after reviewing the series:
>
> 1) at which layer we should implement the shadow virtqueue. E.g. if you want
> to do that at the virtio level, you need to deal with a lot of
> synchronization. I prefer to do it in vhost-vDPA.

I'm not sure how to do that and avoid the synchronization. Could you
expand on that point?

> 2) Using VA as IOVA which can not work for vhost-vDPA
>
>
> > * vDPA devices:
>
>
> So I think we can start from a vhost-vDPA specific shadow virtqueue
> first, then extending it to be a general one which might be much easier.
>
>
> > Developing solutions for tracking the available IOVA
> >    space for all devices.
>
>
> For vhost-net, you can assume that [0, ULLONG_MAX] is valid so you can
> simply use VA as IOVA.
>

In future revisions it will be that way, unless the vDPA device reports
limits on the range of addresses it can translate.
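
If/when such a report exists, querying it could look like the sketch
below. VHOST_VDPA_GET_IOVA_RANGE is the vhost-vDPA ioctl added in Linux
5.10; device_fd and the fallback policy are assumptions for illustration:

/* Sketch: ask vhost-vDPA which addresses it can translate, and fall
 * back to the whole 64-bit space (VA as IOVA) if the kernel predates
 * the ioctl.  device_fd is the /dev/vhost-vdpa-N file descriptor. */
struct vhost_vdpa_iova_range range;

if (ioctl(device_fd, VHOST_VDPA_GET_IOVA_RANGE, &range) < 0) {
    range.first = 0;
    range.last = ULLONG_MAX;
}
/* ...only hand out shadow-ring IOVA inside [range.first, range.last]... */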

>
> > Small POC available, skipping the get/set
> >    status (since vDPA does not support it) and just allocating more and
> >    more IOVA addresses in a hardcoded range available for the device.
>
>
> I'm not sure this can work, but you need to make sure that range can fit
> the size of all the memory regions, and you need to deal with memory
> region add and del.
>
> I guess you probably need a fully functional tree-based IOVA allocator.
>

The vDPA POC I'm testing with does not free the used memory regions at all.

For future development I'm reusing qemu's iova-tree. I'm not sure if I
will stick with it until the end of development, but I'm open to better
suggestions, of course.
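
As a sketch of that reuse (the allocation policy is left out on purpose;
qemu's util/iova-tree only records and looks up ranges, and svq_map_va is
a made-up helper name):

#include "qemu/iova-tree.h"

/* Sketch: record a qemu VA -> device IOVA translation in an IOVATree.
 * DMAMap's size field is inclusive, hence the "- 1". */
static int svq_map_va(IOVATree *tree, hwaddr iova, void *vaddr, size_t size)
{
    DMAMap map = {
        .iova = iova,
        .translated_addr = (hwaddr)(uintptr_t)vaddr,
        .size = size - 1,
        .perm = IOMMU_RW,
    };

    return iova_tree_insert(tree, &map); /* IOVA_OK on success */
}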


> Thanks
>
>
> > * To sepparate buffers forwarding in its own AIO context, so we can
> >    throw more threads to that task and we don't need to stop the main
> >    event loop.
> > * IOMMU optimizations, so bacthing and bigger chunks of IOVA can be
> >    sent to device.
> > * Automatic kick-in on live-migration.
> > * Proper documentation.
> >
> > Thanks!
> >
> > Changes from v1 RFC:
> >    * Use QMP instead of migration to start SVQ mode.
> >    * Only accepting IOMMU devices, closer behavior with target devices
> >      (vDPA)
> >    * Fix invalid masking/unmasking of vhost call fd.
> >    * Use of proper methods for synchronization.
> >    * No need to modify VirtIO device code, all of the changes are
> >      contained in vhost code.
> >    * Delete superfluous code.
> >    * An intermediate RFC was sent with only the notifications forwarding
> >      changes. It can be seen in
> >      https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
> >    * v1 at
> >      https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
> >
> > Eugenio Pérez (13):
> >    virtio: Add virtio_queue_is_host_notifier_enabled
> >    vhost: Save masked_notifier state
> >    vhost: Add VhostShadowVirtqueue
> >    vhost: Add x-vhost-enable-shadow-vq qmp
> >    vhost: Route guest->host notification through shadow virtqueue
> >    vhost: Route host->guest notification through shadow virtqueue
> >    vhost: Avoid re-set masked notifier in shadow vq
> >    virtio: Add vhost_shadow_vq_get_vring_addr
> >    virtio: Add virtio_queue_full
> >    vhost: add vhost_kernel_set_vring_enable
> >    vhost: Shadow virtqueue buffers forwarding
> >    vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue
> >      kick
> >    vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow
> >      virtqueue
> >
> >   qapi/net.json                      |  22 ++
> >   hw/virtio/vhost-shadow-virtqueue.h |  36 ++
> >   include/hw/virtio/vhost.h          |   6 +
> >   include/hw/virtio/virtio.h         |   3 +
> >   hw/virtio/vhost-backend.c          |  29 ++
> >   hw/virtio/vhost-shadow-virtqueue.c | 551 +++++++++++++++++++++++++++++
> >   hw/virtio/vhost.c                  | 283 +++++++++++++++
> >   hw/virtio/virtio.c                 |  23 +-
> >   hw/virtio/meson.build              |   2 +-
> >   9 files changed, 952 insertions(+), 3 deletions(-)
> >   create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
> >   create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
> >
>




* Re: [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-16 10:31     ` Eugenio Perez Martin
@ 2021-03-17  2:05       ` Jason Wang
  2021-03-17 16:47         ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-17  2:05 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller


On 2021/3/16 6:31 PM, Eugenio Perez Martin wrote:
> On Tue, Mar 16, 2021 at 8:18 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
>>> Shadow virtqueue notifications forwarding is disabled when vhost_dev
>>> stops, so code flow follows usual cleanup.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    hw/virtio/vhost-shadow-virtqueue.h |   7 ++
>>>    include/hw/virtio/vhost.h          |   4 +
>>>    hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
>>>    hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
>>>    4 files changed, 265 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>>> index 6cc18d6acb..c891c6510d 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>>> @@ -17,6 +17,13 @@
>>>
>>>    typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>>>
>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
>>> +                           unsigned idx,
>>> +                           VhostShadowVirtqueue *svq);
>>> +void vhost_shadow_vq_stop(struct vhost_dev *dev,
>>> +                          unsigned idx,
>>> +                          VhostShadowVirtqueue *svq);
>>> +
>>>    VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
>>>
>>>    void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>> index ac963bf23d..7ffdf9aea0 100644
>>> --- a/include/hw/virtio/vhost.h
>>> +++ b/include/hw/virtio/vhost.h
>>> @@ -55,6 +55,8 @@ struct vhost_iommu {
>>>        QLIST_ENTRY(vhost_iommu) iommu_next;
>>>    };
>>>
>>> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>>> +
>>>    typedef struct VhostDevConfigOps {
>>>        /* Vhost device config space changed callback
>>>         */
>>> @@ -83,7 +85,9 @@ struct vhost_dev {
>>>        uint64_t backend_cap;
>>>        bool started;
>>>        bool log_enabled;
>>> +    bool shadow_vqs_enabled;
>>>        uint64_t log_size;
>>> +    VhostShadowVirtqueue **shadow_vqs;
>>
>> Any reason that you don't embed the shadow virtqueue into
>> vhost_virtqueue structure?
>>
> Not really; it could be relatively big, and I would prefer SVQ
> members/methods to remain hidden from any other part that includes
> vhost.h. But it could be changed, for sure.
>
>> (Note that there's a masked_notifier in struct vhost_virtqueue).
>>
> They are used differently: in SVQ the masked notifier is a pointer,
> and if it's NULL the SVQ code knows that the device is not masked. The
> vhost_virtqueue is the real owner.


Yes, but it's an example of embedding auxiliary data structures in the 
vhost_virtqueue.


>
> It could be replaced by a boolean in SVQ or something like that. I
> experimented with a tri-state too (UNMASKED, MASKED, MASKED_NOTIFIED),
> letting the vhost.c code manage all the transitions. But I find the
> pointer use clearer, since it's more natural for the existing
> vhost_virtqueue_mask and vhost_virtqueue_pending functions.
>
> This masking/unmasking is the part I dislike the most from this
> series, so I'm very open to alternatives.


See below. I think we don't even need to care about that.


>
>>>        Error *migration_blocker;
>>>        const VhostOps *vhost_ops;
>>>        void *opaque;
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index 4512e5b058..3e43399e9c 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -8,9 +8,12 @@
>>>     */
>>>
>>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
>>> +#include "hw/virtio/vhost.h"
>>> +
>>> +#include "standard-headers/linux/vhost_types.h"
>>>
>>>    #include "qemu/error-report.h"
>>> -#include "qemu/event_notifier.h"
>>> +#include "qemu/main-loop.h"
>>>
>>>    /* Shadow virtqueue to relay notifications */
>>>    typedef struct VhostShadowVirtqueue {
>>> @@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
>>>        EventNotifier kick_notifier;
>>>        /* Shadow call notifier, sent to vhost */
>>>        EventNotifier call_notifier;
>>> +
>>> +    /*
>>> +     * Borrowed virtqueue's guest to host notifier.
>>> +     * To borrow it in this event notifier allows to register on the event
>>> +     * loop and access the associated shadow virtqueue easily. If we use the
>>> +     * VirtQueue, we don't have an easy way to retrieve it.
>>
>> So this is something that worries me. It looks like a layer violation
>> that makes the code harder to get right.
>>
> I don't follow you here.
>
> The vhost code already depends on the virtqueue in the same sense:
> virtio_queue_get_host_notifier is called in vhost_virtqueue_start. So
> if this behavior ever changes, it is unlikely that vhost keeps working
> without changes. vhost_virtqueue has kick/call int members where I think
> it should actually be stored, but they are never used as far as I can see.
>
> The previous RFC relied on vhost_dev_disable_notifiers. From its documentation:
> /* Stop processing guest IO notifications in vhost.
>   * Start processing them in qemu.
>   ...
> But it was easier for this mode to miss a notification, since a new
> host_notifier is created in virtio_bus_set_host_notifier right away.
> So I decided to use the file descriptor already sent to vhost in
> regular operation mode, so guest-related resources change less.
>
> Having said that, maybe it's useful to assert that
> vhost_dev_{enable,disable}_notifiers are never called in shadow
> virtqueue mode. Also, it could be useful to retrieve the notifier from
> virtio_bus, not from the raw shadow virtqueue, so all get/set are
> performed through it. Would that make more sense?
>
>> I wonder if it would be simpler to start from a vDPA dedicated shadow
>> virtqueue implementation:
>>
>> 1) have the above fields embeded in vhost_vdpa structure
>> 2) Work at the level of
>> vhost_vdpa_set_vring_kick()/vhost_vdpa_set_vring_call()
>>
> This notifier is never sent to the device in shadow virtqueue mode.
> It's for SVQ to react to the guest's notifications, registering it on
> its main event loop [1]. So if I perform these changes the way I
> understand them, SVQ would still rely on this borrowed EventNotifier,
> and it would send the newly created kick_notifier of
> VhostShadowVirtqueue to the vDPA device.


The point is that the vhost code should be loosely coupled with virtio. If 
you try to "borrow" an EventNotifier from virtio, you need to deal with a 
lot of synchronization. An example is the masking stuff.


>
>> Then the layer is still isolated and you have a much simpler context to
>> work that you don't need to care a lot of synchornization:
>>
>> 1) vq masking
> This EventNotifier is not used for masking; it does not change from
> the start of shadow virtqueue operation through its end. The call fd
> sent to the vhost/vDPA device does not change on masking/unmasking in
> shadow virtqueue mode either. I will try to document it
> better.
>
> I think that we will need to handle synchronization with
> masking/unmasking from the guest and dynamically enabling SVQ
> operation mode, since they can happen at the same time as long as we
> let the guest run. There may be better ways of synchronizing them of
> course, but I don't see how moving to the vhost-vdpa backend helps
> with this. Please expand if I've missed it.
>
> Or do you mean to forbid regular <-> SVQ operation mode transitions and
> delay them to future patchsets?


So my idea is to do all the shadow virtqueue work in the vhost-vDPA code and 
hide it from the upper layers like virtio. This means it works at the 
vhost level, which can see vhost_vring_file only. When enabled, all it 
needs is:

1) switch to the svq kickfd and relay the ioeventfd to the svq kickfd
2) switch to the svq callfd and relay the svq callfd to the irqfd

It will still behave like a vhost backend: the switching is done 
internally in vhost-vDPA, which is totally transparent to the virtio 
code of Qemu.

E.g.:

1) in the case of guest notifier masking, we don't need to do anything, 
since the virtio code will install another irqfd for us.
2) vhost dev start and stop are easy to deal with.

The advantages are obvious: it is simple and easy to implement.
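
A minimal sketch of that relay, with both handlers living in the vhost
backend; guest_call_notifier is a hypothetical field standing in for the
guest's irqfd:

/* Sketch: the backend interposes its own notifier pair, so the guest's
 * ioeventfd and irqfd are never touched and masking keeps working as in
 * regular vhost mode. */
static void svq_relay_guest_kick(EventNotifier *n) /* guest's ioeventfd */
{
    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
                                             host_notifier);

    if (event_notifier_test_and_clear(n)) {
        event_notifier_set(&svq->kick_notifier); /* -> device kickfd */
    }
}

static void svq_relay_device_call(EventNotifier *n) /* device's callfd */
{
    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
                                             call_notifier);

    if (event_notifier_test_and_clear(n)) {
        event_notifier_set(svq->guest_call_notifier); /* -> guest irqfd */
    }
}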


>
>> 2) vhost dev start and stop
>>
>> ?
>>
>>
>>> +     *
>>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
>>> +     */
>>> +    EventNotifier host_notifier;
>>> +
>>> +    /* Virtio queue shadowing */
>>> +    VirtQueue *vq;
>>>    } VhostShadowVirtqueue;
>>>
>>> +/* Forward guest notifications */
>>> +static void vhost_handle_guest_kick(EventNotifier *n)
>>> +{
>>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>> +                                             host_notifier);
>>> +
>>> +    if (unlikely(!event_notifier_test_and_clear(n))) {
>>> +        return;
>>> +    }
>>> +
>>> +    event_notifier_set(&svq->kick_notifier);
>>> +}
>>> +
>>> +/*
>>> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
>>> + */
>>> +static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
>>> +                                                     unsigned vhost_index,
>>> +                                                     VhostShadowVirtqueue *svq)
>>> +{
>>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
>>> +    struct vhost_vring_file file = {
>>> +        .index = vhost_index,
>>> +        .fd = event_notifier_get_fd(vq_host_notifier),
>>> +    };
>>> +    int r;
>>> +
>>> +    /* Restore vhost kick */
>>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
>>> +    return r ? -errno : 0;
>>> +}
>>> +
>>> +/*
>>> + * Start shadow virtqueue operation.
>>> + * @dev vhost device
>>> + * @hidx vhost virtqueue index
>>> + * @svq Shadow Virtqueue
>>> + */
>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
>>> +                           unsigned idx,
>>> +                           VhostShadowVirtqueue *svq)
>>
>> It looks to me this assumes the vhost_dev is started before
>> vhost_shadow_vq_start()?
>>
> Right.


This might not be true. The guest may enable and disable virtio drivers 
after the shadow virtqueue is started. You need to deal with that.

Thanks




* Re: [RFC v2 10/13] vhost: add vhost_kernel_set_vring_enable
  2021-03-16 10:43     ` Eugenio Perez Martin
@ 2021-03-17  2:25       ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2021-03-17  2:25 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller


On 2021/3/16 6:43 PM, Eugenio Perez Martin wrote:
> On Tue, Mar 16, 2021 at 8:30 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
>>> This method is already present in vhost-user. This commit adapts it to
>>> vhost-net, so SVQ can use it.
>>>
>>> vhost_kernel_set_enable stops the device, so qemu can ask for its status
>>> (the next available idx the device was going to consume). When SVQ starts,
>>> it can resume consuming the guest's driver ring without notice from the
>>> latter. Not stopping the device before the swap could imply that it
>>> processes more buffers than reported, which would duplicate the device's
>>> actions.
>>
>> Note that it might not be the case for vDPA (virtio), or at least virtio
>> needs some extension to achieve something similar. One example is
>> virtio-pci, which forbids writing 0 to queue_enable.
>>
>> This is another reason to start from vhost-vDPA.
>>
>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    hw/virtio/vhost-backend.c | 29 +++++++++++++++++++++++++++++
>>>    1 file changed, 29 insertions(+)
>>>
>>> diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
>>> index 31b33bde37..1ac5c574a9 100644
>>> --- a/hw/virtio/vhost-backend.c
>>> +++ b/hw/virtio/vhost-backend.c
>>> @@ -201,6 +201,34 @@ static int vhost_kernel_get_vq_index(struct vhost_dev *dev, int idx)
>>>        return idx - dev->vq_index;
>>>    }
>>>
>>> +static int vhost_kernel_set_vq_enable(struct vhost_dev *dev, unsigned idx,
>>> +                                      bool enable)
>>> +{
>>> +    struct vhost_vring_file file = {
>>> +        .index = idx,
>>> +    };
>>> +
>>> +    if (!enable) {
>>> +        file.fd = -1; /* Pass -1 to unbind from file. */
>>> +    } else {
>>> +        struct vhost_net *vn_dev = container_of(dev, struct vhost_net, dev);
>>> +        file.fd = vn_dev->backend;
>>
>> This can only work with vhost-net devices but not vsock/scsi etc.
>>
> Right. Shadow virtqueue code should also check the return value of the
> vhost_set_vring_enable call.
>
> I'm not sure how to solve it without resorting to some ifelse/switch
> chain, checking for specific net/vsock/... features, or relying on
> some other qemu class facilities. However, since the main use case is
> vDPA live migration, this commit could be left out and SVQ operation
> would only be valid for vhost-vdpa and vhost-user.


Yes, that's why I think we can start with vhost-vDPA first.

Thanks


>
>> Thanks
>>
>>
>>> +    }
>>> +
>>> +    return vhost_kernel_net_set_backend(dev, &file);
>>> +}
>>> +
>>> +static int vhost_kernel_set_vring_enable(struct vhost_dev *dev, int enable)
>>> +{
>>> +    int i;
>>> +
>>> +    for (i = 0; i < dev->nvqs; ++i) {
>>> +        vhost_kernel_set_vq_enable(dev, i, enable);
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>    #ifdef CONFIG_VHOST_VSOCK
>>>    static int vhost_kernel_vsock_set_guest_cid(struct vhost_dev *dev,
>>>                                                uint64_t guest_cid)
>>> @@ -317,6 +345,7 @@ static const VhostOps kernel_ops = {
>>>            .vhost_set_owner = vhost_kernel_set_owner,
>>>            .vhost_reset_device = vhost_kernel_reset_device,
>>>            .vhost_get_vq_index = vhost_kernel_get_vq_index,
>>> +        .vhost_set_vring_enable = vhost_kernel_set_vring_enable,
>>>    #ifdef CONFIG_VHOST_VSOCK
>>>            .vhost_vsock_set_guest_cid = vhost_kernel_vsock_set_guest_cid,
>>>            .vhost_vsock_set_running = vhost_kernel_vsock_set_running,




* Re: [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding
  2021-03-16 16:05     ` Eugenio Perez Martin
@ 2021-03-17  2:50       ` Jason Wang
  2021-03-17 14:38         ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-17  2:50 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Rob Miller, Parav Pandit, Juan Quintela, Guru Prasad,
	Michael S. Tsirkin, Markus Armbruster, qemu-level,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Stefano Garzarella


On 2021/3/17 12:05 AM, Eugenio Perez Martin wrote:
> On Tue, Mar 16, 2021 at 9:15 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
>>> Initial version of shadow virtqueue that actually forwards buffers.
>>>
>>> It reuses the VirtQueue code for the device part. The driver part is
>>> based on Linux's virtio_ring driver, but with stripped functionality
>>> and optimizations so it's easier to review.
>>>
>>> These will be added in later commits.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    hw/virtio/vhost-shadow-virtqueue.c | 212 +++++++++++++++++++++++++++--
>>>    hw/virtio/vhost.c                  | 113 ++++++++++++++-
>>>    2 files changed, 312 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index 1460d1d5d1..68ed0f2740 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -9,6 +9,7 @@
>>>
>>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>    #include "hw/virtio/vhost.h"
>>> +#include "hw/virtio/virtio-access.h"
>>>
>>>    #include "standard-headers/linux/vhost_types.h"
>>>
>>> @@ -55,11 +56,96 @@ typedef struct VhostShadowVirtqueue {
>>>        /* Virtio device */
>>>        VirtIODevice *vdev;
>>>
>>> +    /* Map for returning guest's descriptors */
>>> +    VirtQueueElement **ring_id_maps;
>>> +
>>> +    /* Next head to expose to device */
>>> +    uint16_t avail_idx_shadow;
>>> +
>>> +    /* Next free descriptor */
>>> +    uint16_t free_head;
>>> +
>>> +    /* Last seen used idx */
>>> +    uint16_t shadow_used_idx;
>>> +
>>> +    /* Next head to consume from device */
>>> +    uint16_t used_idx;
>>> +
>>>        /* Descriptors copied from guest */
>>>        vring_desc_t descs[];
>>>    } VhostShadowVirtqueue;
>>>
>>> -/* Forward guest notifications */
>>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>> +                                    const struct iovec *iovec,
>>> +                                    size_t num, bool more_descs, bool write)
>>> +{
>>> +    uint16_t i = svq->free_head, last = svq->free_head;
>>> +    unsigned n;
>>> +    uint16_t flags = write ? virtio_tswap16(svq->vdev, VRING_DESC_F_WRITE) : 0;
>>> +    vring_desc_t *descs = svq->vring.desc;
>>> +
>>> +    if (num == 0) {
>>> +        return;
>>> +    }
>>> +
>>> +    for (n = 0; n < num; n++) {
>>> +        if (more_descs || (n + 1 < num)) {
>>> +            descs[i].flags = flags | virtio_tswap16(svq->vdev,
>>> +                                                    VRING_DESC_F_NEXT);
>>> +        } else {
>>> +            descs[i].flags = flags;
>>> +        }
>>> +        descs[i].addr = virtio_tswap64(svq->vdev, (hwaddr)iovec[n].iov_base);
>>
>> So using virtio_tswap() is probably not correct since we're talking
>> with vhost backends which have their own endianness.
>>
> I was trying to expose the buffer with the same endianness as the
> driver originally offered, so if guest->qemu requires a bswap, I think
> exposing it to the device will always require a bswap back.


So assume vhost-vDPA always provides a non-transitional device[1].

Then if Qemu presents a transitional device, we need to do the endian 
conversion when necessary; if Qemu presents a non-transitional device, we 
don't need to, since the guest driver will do that for us.

But it looks to me like virtio_tswap() can't be used for this, since:

static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
{
#if defined(LEGACY_VIRTIO_IS_BIENDIAN)
     return virtio_is_big_endian(vdev);
#elif defined(TARGET_WORDS_BIGENDIAN)
     if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
         /* Devices conforming to VIRTIO 1.0 or later are always LE. */
         return false;
     }
     return true;
#else
     return false;
#endif
}

So if we present a legacy device on top of a non-transitional vDPA 
device, the VIRTIO_F_VERSION_1 check is wrong.


>
>> For vhost-vDPA, we can assume that it's a 1.0 device.
> Isn't it needed if the host is big endian?


[1]

So a non-transitional device always assumes little endian.

For vhost-vDPA, we don't want to present a transitional device, which may 
end up carrying a lot of burdens.

I suspect legacy drivers plus vhost-vDPA are already broken, so I plan to 
mandate VERSION_1 for all vDPA devices.
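
Under that assumption (VERSION_1 mandated, so the device side is always
little endian), the shadow ring stores could bypass virtio_tswap*()
entirely; a sketch of what the descriptor writes would become:

/* Sketch: explicit little-endian stores, independent of the
 * guest-visible device's endianness. */
descs[i].addr = cpu_to_le64((uint64_t)(uintptr_t)iovec[n].iov_base);
descs[i].len = cpu_to_le32(iovec[n].iov_len);
descs[i].flags = cpu_to_le16(flags);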


>
>>
>>> +        descs[i].len = virtio_tswap32(svq->vdev, iovec[n].iov_len);
>>> +
>>> +        last = i;
>>> +        i = virtio_tswap16(svq->vdev, descs[i].next);
>>> +    }
>>> +
>>> +    svq->free_head = virtio_tswap16(svq->vdev, descs[last].next);
>>> +}
>>> +
>>> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
>>> +                                          VirtQueueElement *elem)
>>> +{
>>> +    int head;
>>> +    unsigned avail_idx;
>>> +    vring_avail_t *avail = svq->vring.avail;
>>> +
>>> +    head = svq->free_head;
>>> +
>>> +    /* We need some descriptors here */
>>> +    assert(elem->out_num || elem->in_num);
>>> +
>>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>>> +                            elem->in_num > 0, false);
>>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>>> +
>>> +    /*
>>> +     * Put entry in available array (but don't update avail->idx until they
>>> +     * do sync).
>>> +     */
>>> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
>>> +    avail->ring[avail_idx] = virtio_tswap16(svq->vdev, head);
>>> +    svq->avail_idx_shadow++;
>>> +
>>> +    /* Expose descriptors to device */
>>> +    smp_wmb();
>>> +    avail->idx = virtio_tswap16(svq->vdev, svq->avail_idx_shadow);
>>> +
>>> +    return head;
>>> +
>>> +}
>>> +
>>> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
>>> +                                VirtQueueElement *elem)
>>> +{
>>> +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
>>> +
>>> +    svq->ring_id_maps[qemu_head] = elem;
>>> +}
>>> +
>>> +/* Handle guest->device notifications */
>>>    static void vhost_handle_guest_kick(EventNotifier *n)
>>>    {
>>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>> @@ -69,7 +155,72 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>>>            return;
>>>        }
>>>
>>> -    event_notifier_set(&svq->kick_notifier);
>>> +    /* Make available as many buffers as possible */
>>> +    do {
>>> +        if (virtio_queue_get_notification(svq->vq)) {
>>> +            /* No more notifications until process all available */
>>> +            virtio_queue_set_notification(svq->vq, false);
>>> +        }
>>> +
>>> +        while (true) {
>>> +            VirtQueueElement *elem;
>>> +            if (virtio_queue_full(svq->vq)) {
>>> +                break;
>>
>> So we've disabled guest notification. If buffer has been consumed, we
>> need to retry the handle_guest_kick here. But I didn't find the code?
>>
> This code follows the pattern of virtio_blk_handle_vq: we jump out of
> the inner while, and we re-enable the notifications. After that, we
> check for updates on guest avail_idx.


Ok, but this will end up with a lot of unnecessary kicks without event 
index.
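
For reference, a sketch of the suppression that patch 12 of this series
adds, honoring the device's VRING_USED_F_NO_NOTIFY in the shadow used
ring (split ring, no event idx):

/* Sketch: only kick the device when it has not asked for
 * notification suppression. */
static bool svq_needs_kick(VhostShadowVirtqueue *svq)
{
    /* Order the avail->idx store before reading used->flags */
    smp_mb();
    return !(virtio_tswap16(svq->vdev, svq->vring.used->flags)
             & VRING_USED_F_NO_NOTIFY);
}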


>
>>> +            }
>>> +
>>> +            elem = virtqueue_pop(svq->vq, sizeof(*elem));
>>> +            if (!elem) {
>>> +                break;
>>> +            }
>>> +
>>> +            vhost_shadow_vq_add(svq, elem);
>>> +            event_notifier_set(&svq->kick_notifier);
>>> +        }
>>> +
>>> +        virtio_queue_set_notification(svq->vq, true);
>>> +    } while (!virtio_queue_empty(svq->vq));
>>> +}
>>> +
>>> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
>>> +{
>>> +    if (svq->used_idx != svq->shadow_used_idx) {
>>> +        return true;
>>> +    }
>>> +
>>> +    /* Get used idx must not be reordered */
>>> +    smp_rmb();
>>> +    svq->shadow_used_idx = virtio_tswap16(svq->vdev, svq->vring.used->idx);
>>> +
>>> +    return svq->used_idx != svq->shadow_used_idx;
>>> +}
>>> +
>>> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
>>> +{
>>> +    vring_desc_t *descs = svq->vring.desc;
>>> +    const vring_used_t *used = svq->vring.used;
>>> +    vring_used_elem_t used_elem;
>>> +    uint16_t last_used;
>>> +
>>> +    if (!vhost_shadow_vq_more_used(svq)) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    last_used = svq->used_idx & (svq->vring.num - 1);
>>> +    used_elem.id = virtio_tswap32(svq->vdev, used->ring[last_used].id);
>>> +    used_elem.len = virtio_tswap32(svq->vdev, used->ring[last_used].len);
>>> +
>>> +    if (unlikely(used_elem.id >= svq->vring.num)) {
>>> +        error_report("Device %s says index %u is available", svq->vdev->name,
>>> +                     used_elem.id);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    descs[used_elem.id].next = svq->free_head;
>>> +    svq->free_head = used_elem.id;
>>> +
>>> +    svq->used_idx++;
>>> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
>>> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
>>>    }
>>>
>>>    /* Forward vhost notifications */
>>> @@ -78,6 +229,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>>                                                 call_notifier);
>>>        EventNotifier *masked_notifier;
>>> +    VirtQueue *vq = svq->vq;
>>>
>>>        /* Signal start of using masked notifier */
>>>        qemu_event_reset(&svq->masked_notifier.is_free);
>>> @@ -86,14 +238,29 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>>            qemu_event_set(&svq->masked_notifier.is_free);
>>>        }
>>>
>>> -    if (!masked_notifier) {
>>> -        unsigned n = virtio_get_queue_index(svq->vq);
>>> -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
>>> -        virtio_notify_irqfd(svq->vdev, svq->vq);
>>> -    } else if (!svq->masked_notifier.signaled) {
>>> -        svq->masked_notifier.signaled = true;
>>> -        event_notifier_set(svq->masked_notifier.n);
>>> -    }
>>> +    /* Make as many buffers as possible used. */
>>> +    do {
>>> +        unsigned i = 0;
>>> +
>>> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
>>> +        while (true) {
>>> +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
>>> +            if (!elem) {
>>> +                break;
>>> +            }
>>> +
>>> +            assert(i < svq->vring.num);
>>> +            virtqueue_fill(vq, elem, elem->len, i++);
>>> +        }
>>> +
>>> +        virtqueue_flush(vq, i);
>>> +        if (!masked_notifier) {
>>> +            virtio_notify_irqfd(svq->vdev, svq->vq);
>>> +        } else if (!svq->masked_notifier.signaled) {
>>> +            svq->masked_notifier.signaled = true;
>>> +            event_notifier_set(svq->masked_notifier.n);
>>> +        }
>>> +    } while (vhost_shadow_vq_more_used(svq));
>>>
>>>        if (masked_notifier) {
>>>            /* Signal not using it anymore */
>>> @@ -103,7 +270,6 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>>
>>>    static void vhost_shadow_vq_handle_call(EventNotifier *n)
>>>    {
>>> -
>>>        if (likely(event_notifier_test_and_clear(n))) {
>>>            vhost_shadow_vq_handle_call_no_test(n);
>>>        }
>>> @@ -254,7 +420,11 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
>>>                              unsigned idx,
>>>                              VhostShadowVirtqueue *svq)
>>>    {
>>> +    int i;
>>>        int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
>>> +
>>> +    assert(!dev->shadow_vqs_enabled);
>>> +
>>>        if (unlikely(r < 0)) {
>>>            error_report("Couldn't restore vq kick fd: %s", strerror(-r));
>>>        }
>>> @@ -272,6 +442,18 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
>>>        /* Restore vhost call */
>>>        vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
>>>                             dev->vqs[idx].notifier_is_masked);
>>> +
>>> +
>>> +    for (i = 0; i < svq->vring.num; ++i) {
>>> +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
>>> +        /*
>>> +         * Although the doc says we must unpop in order, it's ok to unpop
>>> +         * everything.
>>> +         */
>>> +        if (elem) {
>>> +            virtqueue_unpop(svq->vq, elem, elem->len);
>>
>> Shouldn't we need to wait until all pending requests to be drained? Or
>> we may end up duplicated requests?
>>
> Do you mean pending as in-flight/processing in the device? The device
> must be paused at this point.


Ok. I see there's a vhost_set_vring_enable(dev, false) in 
vhost_sw_live_migration_start().


> Currently there is no assertion for
> this; maybe we can track the device status for it.
>
> As for the queue handlers running at this point, the main event
> loop should serialize QMP and the handlers as far as I know (and a
> device stopping suddenly would leave all that state inconsistent).
> They would need explicit synchronization if the handlers ran in their
> own AIO context. That would be nice to have, but it's not included here.


That's why I suggest just dropping the QMP stuff and using CLI parameters 
to enable the shadow virtqueue. Things would be greatly simplified, I guess.
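
Something like the following, where x-svq is a hypothetical option name 
taking the role of the QMP command:

-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vdpa0,x-svq=on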

Thanks


>
>> Thanks
>>
>>
>>> +        }
>>> +    }
>>>    }
>>>
>>>    /*
>>> @@ -284,7 +466,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>>>        unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
>>>        size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
>>>        g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
>>> -    int r;
>>> +    int r, i;
>>>
>>>        r = event_notifier_init(&svq->kick_notifier, 0);
>>>        if (r != 0) {
>>> @@ -303,6 +485,11 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>>>        vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
>>>        svq->vq = virtio_get_queue(dev->vdev, vq_idx);
>>>        svq->vdev = dev->vdev;
>>> +    for (i = 0; i < num - 1; i++) {
>>> +        svq->descs[i].next = virtio_tswap16(dev->vdev, i + 1);
>>> +    }
>>> +
>>> +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
>>>        event_notifier_set_handler(&svq->call_notifier,
>>>                                   vhost_shadow_vq_handle_call);
>>>        qemu_event_init(&svq->masked_notifier.is_free, true);
>>> @@ -324,5 +511,6 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
>>>        event_notifier_cleanup(&vq->kick_notifier);
>>>        event_notifier_set_handler(&vq->call_notifier, NULL);
>>>        event_notifier_cleanup(&vq->call_notifier);
>>> +    g_free(vq->ring_id_maps);
>>>        g_free(vq);
>>>    }
>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>> index eab3e334f2..a373999bc4 100644
>>> --- a/hw/virtio/vhost.c
>>> +++ b/hw/virtio/vhost.c
>>> @@ -1021,6 +1021,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
>>>
>>>        trace_vhost_iotlb_miss(dev, 1);
>>>
>>> +    if (qatomic_load_acquire(&dev->shadow_vqs_enabled)) {
>>> +        uaddr = iova;
>>> +        len = 4096;
>>> +        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
>>> +                                                IOMMU_RW);
>>> +        if (ret) {
>>> +            trace_vhost_iotlb_miss(dev, 2);
>>> +            error_report("Fail to update device iotlb");
>>> +        }
>>> +
>>> +        return ret;
>>> +    }
>>> +
>>>        iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
>>>                                              iova, write,
>>>                                              MEMTXATTRS_UNSPECIFIED);
>>> @@ -1227,8 +1240,28 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
>>>        /* Can be read by vhost_virtqueue_mask, from vm exit */
>>>        qatomic_store_release(&dev->shadow_vqs_enabled, false);
>>>
>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
>>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
>>> +        error_report("Fail to invalidate device iotlb");
>>> +    }
>>> +
>>>        for (idx = 0; idx < dev->nvqs; ++idx) {
>>> +        /*
>>> +         * Update used ring information for IOTLB to work correctly,
>>> +         * vhost-kernel code requires for this.
>>> +         */
>>> +        struct vhost_virtqueue *vq = dev->vqs + idx;
>>> +        vhost_device_iotlb_miss(dev, vq->used_phys, true);
>>> +
>>>            vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
>>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
>>> +                              dev->vq_index + idx);
>>> +    }
>>> +
>>> +    /* Enable guest's vq vring */
>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>>> +
>>> +    for (idx = 0; idx < dev->nvqs; ++idx) {
>>>            vhost_shadow_vq_free(dev->shadow_vqs[idx]);
>>>        }
>>>
>>> @@ -1237,6 +1270,59 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
>>>        return 0;
>>>    }
>>>
>>> +/*
>>> + * Start shadow virtqueue in a given queue.
>>> + * In failure case, this function leaves queue working as regular vhost mode.
>>> + */
>>> +static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
>>> +                                             unsigned idx)
>>> +{
>>> +    struct vhost_vring_addr addr = {
>>> +        .index = idx,
>>> +    };
>>> +    struct vhost_vring_state s = {
>>> +        .index = idx,
>>> +    };
>>> +    int r;
>>> +    bool ok;
>>> +
>>> +    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
>>> +    ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
>>> +    if (unlikely(!ok)) {
>>> +        return false;
>>> +    }
>>> +
>>> +    /* From this point, vhost_virtqueue_start can reset these changes */
>>> +    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
>>> +    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
>>> +    if (unlikely(r != 0)) {
>>> +        VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
>>> +        goto err;
>>> +    }
>>> +
>>> +    r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
>>> +    if (unlikely(r != 0)) {
>>> +        VHOST_OPS_DEBUG("vhost_set_vring_base for shadow vq failed");
>>> +        goto err;
>>> +    }
>>> +
>>> +    /*
>>> +     * Update used ring information for IOTLB to work correctly,
>>> +     * vhost-kernel code requires for this.
>>> +     */
>>> +    r = vhost_device_iotlb_miss(dev, addr.used_user_addr, true);
>>> +    if (unlikely(r != 0)) {
>>> +        /* Debug message already printed */
>>> +        goto err;
>>> +    }
>>> +
>>> +    return true;
>>> +
>>> +err:
>>> +    vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
>>> +    return false;
>>> +}
>>> +
>>>    static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>>>    {
>>>        int idx, stop_idx;
>>> @@ -1249,24 +1335,35 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>>>            }
>>>        }
>>>
>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
>>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
>>> +        error_report("Fail to invalidate device iotlb");
>>> +    }
>>> +
>>>        /* Can be read by vhost_virtqueue_mask, from vm exit */
>>>        qatomic_store_release(&dev->shadow_vqs_enabled, true);
>>>        for (idx = 0; idx < dev->nvqs; ++idx) {
>>> -        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
>>> +        bool ok = vhost_sw_live_migration_start_vq(dev, idx);
>>>            if (unlikely(!ok)) {
>>>                goto err_start;
>>>            }
>>>        }
>>>
>>> +    /* Enable shadow vq vring */
>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>>>        return 0;
>>>
>>>    err_start:
>>>        qatomic_store_release(&dev->shadow_vqs_enabled, false);
>>>        for (stop_idx = 0; stop_idx < idx; stop_idx++) {
>>>            vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
>>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
>>> +                              dev->vq_index + stop_idx);
>>>        }
>>>
>>>    err_new:
>>> +    /* Enable guest's vring */
>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>>>        for (idx = 0; idx < dev->nvqs; ++idx) {
>>>            vhost_shadow_vq_free(dev->shadow_vqs[idx]);
>>>        }
>>> @@ -1970,6 +2067,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>>>
>>>            if (!hdev->started) {
>>>                err_cause = "Device is not started";
>>> +        } else if (!vhost_dev_has_iommu(hdev)) {
>>> +            err_cause = "Does not support iommu";
>>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
>>> +            err_cause = "Is packed";
>>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_RING_F_EVENT_IDX)) {
>>> +            err_cause = "Have event idx";
>>> +        } else if (hdev->acked_features &
>>> +                   BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC)) {
>>> +            err_cause = "Supports indirect descriptors";
>>> +        } else if (!hdev->vhost_ops->vhost_set_vring_enable) {
>>> +            err_cause = "Cannot pause device";
>>> +        }
>>> +
>>> +        if (err_cause) {
>>>                goto err;
>>>            }
>>>
>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding
  2021-03-17  2:50       ` Jason Wang
@ 2021-03-17 14:38         ` Eugenio Perez Martin
  2021-03-18  3:14           ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-17 14:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: Rob Miller, Parav Pandit, Juan Quintela, Guru Prasad,
	Michael S. Tsirkin, Markus Armbruster, qemu-level,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Stefano Garzarella

On Wed, Mar 17, 2021 at 3:51 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/17 12:05 AM, Eugenio Perez Martin wrote:
> > On Tue, Mar 16, 2021 at 9:15 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> >>> Initial version of shadow virtqueue that actually forward buffers.
> >>>
> >>> It reuses the VirtQueue code for the device part. The driver part is
> >>> based on Linux's virtio_ring driver, but with stripped functionality
> >>> and optimizations so it's easier to review.
> >>>
> >>> These will be added in later commits.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    hw/virtio/vhost-shadow-virtqueue.c | 212 +++++++++++++++++++++++++++--
> >>>    hw/virtio/vhost.c                  | 113 ++++++++++++++-
> >>>    2 files changed, 312 insertions(+), 13 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index 1460d1d5d1..68ed0f2740 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -9,6 +9,7 @@
> >>>
> >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>    #include "hw/virtio/vhost.h"
> >>> +#include "hw/virtio/virtio-access.h"
> >>>
> >>>    #include "standard-headers/linux/vhost_types.h"
> >>>
> >>> @@ -55,11 +56,96 @@ typedef struct VhostShadowVirtqueue {
> >>>        /* Virtio device */
> >>>        VirtIODevice *vdev;
> >>>
> >>> +    /* Map for returning guest's descriptors */
> >>> +    VirtQueueElement **ring_id_maps;
> >>> +
> >>> +    /* Next head to expose to device */
> >>> +    uint16_t avail_idx_shadow;
> >>> +
> >>> +    /* Next free descriptor */
> >>> +    uint16_t free_head;
> >>> +
> >>> +    /* Last seen used idx */
> >>> +    uint16_t shadow_used_idx;
> >>> +
> >>> +    /* Next head to consume from device */
> >>> +    uint16_t used_idx;
> >>> +
> >>>        /* Descriptors copied from guest */
> >>>        vring_desc_t descs[];
> >>>    } VhostShadowVirtqueue;
> >>>
> >>> -/* Forward guest notifications */
> >>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >>> +                                    const struct iovec *iovec,
> >>> +                                    size_t num, bool more_descs, bool write)
> >>> +{
> >>> +    uint16_t i = svq->free_head, last = svq->free_head;
> >>> +    unsigned n;
> >>> +    uint16_t flags = write ? virtio_tswap16(svq->vdev, VRING_DESC_F_WRITE) : 0;
> >>> +    vring_desc_t *descs = svq->vring.desc;
> >>> +
> >>> +    if (num == 0) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    for (n = 0; n < num; n++) {
> >>> +        if (more_descs || (n + 1 < num)) {
> >>> +            descs[i].flags = flags | virtio_tswap16(svq->vdev,
> >>> +                                                    VRING_DESC_F_NEXT);
> >>> +        } else {
> >>> +            descs[i].flags = flags;
> >>> +        }
> >>> +        descs[i].addr = virtio_tswap64(svq->vdev, (hwaddr)iovec[n].iov_base);
> >>
> >> So using virtio_tswap() is probably not correct, since we're talking
> >> with vhost backends which have their own endianness.
> >>
> > I was trying to expose the buffer with the same endianness as the
> > driver originally offered, so if guest->qemu requires a bswap, I think
> > it will always require a bswap again to expose to the device again.
>
>
> So assume vhost-vDPA always provides a non-transitional device [1].
>
> Then if Qemu presents a transitional device, we need to do the endian
> conversion when necessary; if Qemu presents a non-transitional device, we
> don't need to do that, the guest driver will do it for us.
>
> But it looks to me like virtio_tswap() can't be used for this, since it:
>
> static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
> {
> #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
>      return virtio_is_big_endian(vdev);
> #elif defined(TARGET_WORDS_BIGENDIAN)
>      if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
>          /* Devices conforming to VIRTIO 1.0 or later are always LE. */
>          return false;
>      }
>      return true;
> #else
>      return false;
> #endif
> }
>
> So if we present a legacy device on top of a non-transitional vDPA
> device, the VIRTIO_F_VERSION_1 check is wrong.
>
>
> >
> >> For vhost-vDPA, we can assume that it's a 1.0 device.
> > Isn't it needed if the host is big endian?
>
>
> [1]
>
> So a non-transitional device always assumes little endian.
>
> For vhost-vDPA, we don't want to present a transitional device, which
> may end up with a lot of burdens.
>
> I suspect the legacy driver plus vhost vDPA is already broken, so I plan
> to mandate VERSION_1 for all vDPA devices.
>

Right. I think that's the best option then.

However, we will then need a similar method to always expose
address/length as little endian, won't we?
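
For reference, a helper like this could replace the virtio_tswap*()
calls once VERSION_1 is mandated. Just a sketch (the helper name is made
up), using qemu's cpu_to_le*() helpers from "qemu/bswap.h":

static void vhost_vring_write_desc_le(vring_desc_t *desc, hwaddr addr,
                                      uint32_t len, uint16_t flags,
                                      uint16_t next)
{
    /* VIRTIO 1.0 rings are always little endian, whatever the guest is */
    desc->addr = cpu_to_le64(addr);
    desc->len = cpu_to_le32(len);
    desc->flags = cpu_to_le16(flags);
    desc->next = cpu_to_le16(next);
}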

>
> >
> >>
> >>> +        descs[i].len = virtio_tswap32(svq->vdev, iovec[n].iov_len);
> >>> +
> >>> +        last = i;
> >>> +        i = virtio_tswap16(svq->vdev, descs[i].next);
> >>> +    }
> >>> +
> >>> +    svq->free_head = virtio_tswap16(svq->vdev, descs[last].next);
> >>> +}
> >>> +
> >>> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
> >>> +                                          VirtQueueElement *elem)
> >>> +{
> >>> +    int head;
> >>> +    unsigned avail_idx;
> >>> +    vring_avail_t *avail = svq->vring.avail;
> >>> +
> >>> +    head = svq->free_head;
> >>> +
> >>> +    /* We need some descriptors here */
> >>> +    assert(elem->out_num || elem->in_num);
> >>> +
> >>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> >>> +                            elem->in_num > 0, false);
> >>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> >>> +
> >>> +    /*
> >>> +     * Put entry in available array (but don't update avail->idx until they
> >>> +     * do sync).
> >>> +     */
> >>> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> >>> +    avail->ring[avail_idx] = virtio_tswap16(svq->vdev, head);
> >>> +    svq->avail_idx_shadow++;
> >>> +
> >>> +    /* Expose descriptors to device */
> >>> +    smp_wmb();
> >>> +    avail->idx = virtio_tswap16(svq->vdev, svq->avail_idx_shadow);
> >>> +
> >>> +    return head;
> >>> +
> >>> +}
> >>> +
> >>> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
> >>> +                                VirtQueueElement *elem)
> >>> +{
> >>> +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
> >>> +
> >>> +    svq->ring_id_maps[qemu_head] = elem;
> >>> +}
> >>> +
> >>> +/* Handle guest->device notifications */
> >>>    static void vhost_handle_guest_kick(EventNotifier *n)
> >>>    {
> >>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>> @@ -69,7 +155,72 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >>>            return;
> >>>        }
> >>>
> >>> -    event_notifier_set(&svq->kick_notifier);
> >>> +    /* Make available as many buffers as possible */
> >>> +    do {
> >>> +        if (virtio_queue_get_notification(svq->vq)) {
> >>> +            /* No more notifications until process all available */
> >>> +            virtio_queue_set_notification(svq->vq, false);
> >>> +        }
> >>> +
> >>> +        while (true) {
> >>> +            VirtQueueElement *elem;
> >>> +            if (virtio_queue_full(svq->vq)) {
> >>> +                break;
> >>
> >> So we've disabled guest notification. If buffer has been consumed, we
> >> need to retry the handle_guest_kick here. But I didn't find the code?
> >>
> > This code follows the pattern of virtio_blk_handle_vq: we jump out of
> > the inner while, and we re-enable the notifications. After that, we
> > check for updates on guest avail_idx.
>
>
> Ok, but this will end up with a lot of unnecessary kicks without event
> index.
>

I can move the kick out of the inner loop, but that could add latency.
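
For reference, the batched variant would look something like this. Only
a sketch reusing the helpers from this patch; it trades one kick per
buffer for one kick per drained batch:

    /* Make available as many buffers as possible, kicking once per batch */
    do {
        unsigned added = 0;

        virtio_queue_set_notification(svq->vq, false);
        while (!virtio_queue_full(svq->vq)) {
            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
            if (!elem) {
                break;
            }
            vhost_shadow_vq_add(svq, elem);
            added++;
        }

        if (added) {
            /* One kick for the whole batch: fewer notifications, but
             * more latency for the first buffers of the batch. */
            event_notifier_set(&svq->kick_notifier);
        }
        virtio_queue_set_notification(svq->vq, true);
    } while (!virtio_queue_empty(svq->vq));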

>
> >
> >>> +            }
> >>> +
> >>> +            elem = virtqueue_pop(svq->vq, sizeof(*elem));
> >>> +            if (!elem) {
> >>> +                break;
> >>> +            }
> >>> +
> >>> +            vhost_shadow_vq_add(svq, elem);
> >>> +            event_notifier_set(&svq->kick_notifier);
> >>> +        }
> >>> +
> >>> +        virtio_queue_set_notification(svq->vq, true);
> >>> +    } while (!virtio_queue_empty(svq->vq));
> >>> +}
> >>> +
> >>> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    if (svq->used_idx != svq->shadow_used_idx) {
> >>> +        return true;
> >>> +    }
> >>> +
> >>> +    /* Get used idx must not be reordered */
> >>> +    smp_rmb();
> >>> +    svq->shadow_used_idx = virtio_tswap16(svq->vdev, svq->vring.used->idx);
> >>> +
> >>> +    return svq->used_idx != svq->shadow_used_idx;
> >>> +}
> >>> +
> >>> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    vring_desc_t *descs = svq->vring.desc;
> >>> +    const vring_used_t *used = svq->vring.used;
> >>> +    vring_used_elem_t used_elem;
> >>> +    uint16_t last_used;
> >>> +
> >>> +    if (!vhost_shadow_vq_more_used(svq)) {
> >>> +        return NULL;
> >>> +    }
> >>> +
> >>> +    last_used = svq->used_idx & (svq->vring.num - 1);
> >>> +    used_elem.id = virtio_tswap32(svq->vdev, used->ring[last_used].id);
> >>> +    used_elem.len = virtio_tswap32(svq->vdev, used->ring[last_used].len);
> >>> +
> >>> +    if (unlikely(used_elem.id >= svq->vring.num)) {
> >>> +        error_report("Device %s says index %u is available", svq->vdev->name,
> >>> +                     used_elem.id);
> >>> +        return NULL;
> >>> +    }
> >>> +
> >>> +    descs[used_elem.id].next = svq->free_head;
> >>> +    svq->free_head = used_elem.id;
> >>> +
> >>> +    svq->used_idx++;
> >>> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> >>> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> >>>    }
> >>>
> >>>    /* Forward vhost notifications */
> >>> @@ -78,6 +229,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>>                                                 call_notifier);
> >>>        EventNotifier *masked_notifier;
> >>> +    VirtQueue *vq = svq->vq;
> >>>
> >>>        /* Signal start of using masked notifier */
> >>>        qemu_event_reset(&svq->masked_notifier.is_free);
> >>> @@ -86,14 +238,29 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >>>            qemu_event_set(&svq->masked_notifier.is_free);
> >>>        }
> >>>
> >>> -    if (!masked_notifier) {
> >>> -        unsigned n = virtio_get_queue_index(svq->vq);
> >>> -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
> >>> -        virtio_notify_irqfd(svq->vdev, svq->vq);
> >>> -    } else if (!svq->masked_notifier.signaled) {
> >>> -        svq->masked_notifier.signaled = true;
> >>> -        event_notifier_set(svq->masked_notifier.n);
> >>> -    }
> >>> +    /* Make as many buffers as possible used. */
> >>> +    do {
> >>> +        unsigned i = 0;
> >>> +
> >>> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> >>> +        while (true) {
> >>> +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
> >>> +            if (!elem) {
> >>> +                break;
> >>> +            }
> >>> +
> >>> +            assert(i < svq->vring.num);
> >>> +            virtqueue_fill(vq, elem, elem->len, i++);
> >>> +        }
> >>> +
> >>> +        virtqueue_flush(vq, i);
> >>> +        if (!masked_notifier) {
> >>> +            virtio_notify_irqfd(svq->vdev, svq->vq);
> >>> +        } else if (!svq->masked_notifier.signaled) {
> >>> +            svq->masked_notifier.signaled = true;
> >>> +            event_notifier_set(svq->masked_notifier.n);
> >>> +        }
> >>> +    } while (vhost_shadow_vq_more_used(svq));
> >>>
> >>>        if (masked_notifier) {
> >>>            /* Signal not using it anymore */
> >>> @@ -103,7 +270,6 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >>>
> >>>    static void vhost_shadow_vq_handle_call(EventNotifier *n)
> >>>    {
> >>> -
> >>>        if (likely(event_notifier_test_and_clear(n))) {
> >>>            vhost_shadow_vq_handle_call_no_test(n);
> >>>        }
> >>> @@ -254,7 +420,11 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >>>                              unsigned idx,
> >>>                              VhostShadowVirtqueue *svq)
> >>>    {
> >>> +    int i;
> >>>        int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
> >>> +
> >>> +    assert(!dev->shadow_vqs_enabled);
> >>> +
> >>>        if (unlikely(r < 0)) {
> >>>            error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> >>>        }
> >>> @@ -272,6 +442,18 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >>>        /* Restore vhost call */
> >>>        vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
> >>>                             dev->vqs[idx].notifier_is_masked);
> >>> +
> >>> +
> >>> +    for (i = 0; i < svq->vring.num; ++i) {
> >>> +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> >>> +        /*
> >>> +         * Although the doc says we must unpop in order, it's ok to unpop
> >>> +         * everything.
> >>> +         */
> >>> +        if (elem) {
> >>> +            virtqueue_unpop(svq->vq, elem, elem->len);
> >>
> >> Shouldn't we wait until all pending requests are drained? Or we may
> >> end up with duplicated requests?
> >>
> > Do you mean pending as in-flight/processing in the device? The device
> > must be paused at this point.
>
>
> Ok. I see there's a vhost_set_vring_enable(dev, false) in
> vhost_sw_live_migration_start().
>
>
> > Currently there is no assertion for
> > this, maybe we can track the device status for it.
> >
> > For the queue handlers to be running at this point, the main event
> > loop should serialize QMP and handlers as far as I know (and they
> > would make all state inconsistent if the device stops suddenly). It
> > would need to be synchronized if the handlers run in their own AIO
> > context. That would be nice to have but it's not included here.
>
>
> That's why I suggest just dropping the QMP stuff and using CLI parameters
> to enable the shadow virtqueue. Things would be greatly simplified, I guess.
>

I can send a series without it, but SVQ will need to be able to kick
in dynamically sooner or later if we want to use it for live
migration.

> Thanks
>
>
> >
> >> Thanks
> >>
> >>
> >>> +        }
> >>> +    }
> >>>    }
> >>>
> >>>    /*
> >>> @@ -284,7 +466,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >>>        unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> >>>        size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
> >>>        g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
> >>> -    int r;
> >>> +    int r, i;
> >>>
> >>>        r = event_notifier_init(&svq->kick_notifier, 0);
> >>>        if (r != 0) {
> >>> @@ -303,6 +485,11 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >>>        vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
> >>>        svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> >>>        svq->vdev = dev->vdev;
> >>> +    for (i = 0; i < num - 1; i++) {
> >>> +        svq->descs[i].next = virtio_tswap16(dev->vdev, i + 1);
> >>> +    }
> >>> +
> >>> +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> >>>        event_notifier_set_handler(&svq->call_notifier,
> >>>                                   vhost_shadow_vq_handle_call);
> >>>        qemu_event_init(&svq->masked_notifier.is_free, true);
> >>> @@ -324,5 +511,6 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
> >>>        event_notifier_cleanup(&vq->kick_notifier);
> >>>        event_notifier_set_handler(&vq->call_notifier, NULL);
> >>>        event_notifier_cleanup(&vq->call_notifier);
> >>> +    g_free(vq->ring_id_maps);
> >>>        g_free(vq);
> >>>    }
> >>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >>> index eab3e334f2..a373999bc4 100644
> >>> --- a/hw/virtio/vhost.c
> >>> +++ b/hw/virtio/vhost.c
> >>> @@ -1021,6 +1021,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
> >>>
> >>>        trace_vhost_iotlb_miss(dev, 1);
> >>>
> >>> +    if (qatomic_load_acquire(&dev->shadow_vqs_enabled)) {
> >>> +        uaddr = iova;
> >>> +        len = 4096;
> >>> +        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
> >>> +                                                IOMMU_RW);
> >>> +        if (ret) {
> >>> +            trace_vhost_iotlb_miss(dev, 2);
> >>> +            error_report("Fail to update device iotlb");
> >>> +        }
> >>> +
> >>> +        return ret;
> >>> +    }
> >>> +
> >>>        iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
> >>>                                              iova, write,
> >>>                                              MEMTXATTRS_UNSPECIFIED);
> >>> @@ -1227,8 +1240,28 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> >>>        /* Can be read by vhost_virtqueue_mask, from vm exit */
> >>>        qatomic_store_release(&dev->shadow_vqs_enabled, false);
> >>>
> >>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
> >>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
> >>> +        error_report("Fail to invalidate device iotlb");
> >>> +    }
> >>> +
> >>>        for (idx = 0; idx < dev->nvqs; ++idx) {
> >>> +        /*
> >>> +         * Update used ring information for IOTLB to work correctly,
> >>> +         * vhost-kernel code requires for this.
> >>> +         */
> >>> +        struct vhost_virtqueue *vq = dev->vqs + idx;
> >>> +        vhost_device_iotlb_miss(dev, vq->used_phys, true);
> >>> +
> >>>            vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
> >>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
> >>> +                              dev->vq_index + idx);
> >>> +    }
> >>> +
> >>> +    /* Enable guest's vq vring */
> >>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >>> +
> >>> +    for (idx = 0; idx < dev->nvqs; ++idx) {
> >>>            vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> >>>        }
> >>>
> >>> @@ -1237,6 +1270,59 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> >>>        return 0;
> >>>    }
> >>>
> >>> +/*
> >>> + * Start shadow virtqueue in a given queue.
> >>> + * In failure case, this function leaves queue working as regular vhost mode.
> >>> + */
> >>> +static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
> >>> +                                             unsigned idx)
> >>> +{
> >>> +    struct vhost_vring_addr addr = {
> >>> +        .index = idx,
> >>> +    };
> >>> +    struct vhost_vring_state s = {
> >>> +        .index = idx,
> >>> +    };
> >>> +    int r;
> >>> +    bool ok;
> >>> +
> >>> +    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
> >>> +    ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> >>> +    if (unlikely(!ok)) {
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    /* From this point, vhost_virtqueue_start can reset these changes */
> >>> +    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
> >>> +    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
> >>> +    if (unlikely(r != 0)) {
> >>> +        VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
> >>> +        goto err;
> >>> +    }
> >>> +
> >>> +    r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
> >>> +    if (unlikely(r != 0)) {
> >>> +        VHOST_OPS_DEBUG("vhost_set_vring_base for shadow vq failed");
> >>> +        goto err;
> >>> +    }
> >>> +
> >>> +    /*
> >>> +     * Update used ring information for IOTLB to work correctly,
> >>> +     * vhost-kernel code requires for this.
> >>> +     */
> >>> +    r = vhost_device_iotlb_miss(dev, addr.used_user_addr, true);
> >>> +    if (unlikely(r != 0)) {
> >>> +        /* Debug message already printed */
> >>> +        goto err;
> >>> +    }
> >>> +
> >>> +    return true;
> >>> +
> >>> +err:
> >>> +    vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
> >>> +    return false;
> >>> +}
> >>> +
> >>>    static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> >>>    {
> >>>        int idx, stop_idx;
> >>> @@ -1249,24 +1335,35 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> >>>            }
> >>>        }
> >>>
> >>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
> >>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
> >>> +        error_report("Fail to invalidate device iotlb");
> >>> +    }
> >>> +
> >>>        /* Can be read by vhost_virtqueue_mask, from vm exit */
> >>>        qatomic_store_release(&dev->shadow_vqs_enabled, true);
> >>>        for (idx = 0; idx < dev->nvqs; ++idx) {
> >>> -        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> >>> +        bool ok = vhost_sw_live_migration_start_vq(dev, idx);
> >>>            if (unlikely(!ok)) {
> >>>                goto err_start;
> >>>            }
> >>>        }
> >>>
> >>> +    /* Enable shadow vq vring */
> >>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >>>        return 0;
> >>>
> >>>    err_start:
> >>>        qatomic_store_release(&dev->shadow_vqs_enabled, false);
> >>>        for (stop_idx = 0; stop_idx < idx; stop_idx++) {
> >>>            vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
> >>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
> >>> +                              dev->vq_index + stop_idx);
> >>>        }
> >>>
> >>>    err_new:
> >>> +    /* Enable guest's vring */
> >>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >>>        for (idx = 0; idx < dev->nvqs; ++idx) {
> >>>            vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> >>>        }
> >>> @@ -1970,6 +2067,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >>>
> >>>            if (!hdev->started) {
> >>>                err_cause = "Device is not started";
> >>> +        } else if (!vhost_dev_has_iommu(hdev)) {
> >>> +            err_cause = "Does not support iommu";
> >>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
> >>> +            err_cause = "Is packed";
> >>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_RING_F_EVENT_IDX)) {
> >>> +            err_cause = "Have event idx";
> >>> +        } else if (hdev->acked_features &
> >>> +                   BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC)) {
> >>> +            err_cause = "Supports indirect descriptors";
> >>> +        } else if (!hdev->vhost_ops->vhost_set_vring_enable) {
> >>> +            err_cause = "Cannot pause device";
> >>> +        }
> >>> +
> >>> +        if (err_cause) {
> >>>                goto err;
> >>>            }
> >>>
> >
>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-17  2:05       ` Jason Wang
@ 2021-03-17 16:47         ` Eugenio Perez Martin
  2021-03-18  3:10           ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-17 16:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Wed, Mar 17, 2021 at 3:05 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/16 6:31 PM, Eugenio Perez Martin wrote:
> > On Tue, Mar 16, 2021 at 8:18 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> >>> Shadow virtqueue notifications forwarding is disabled when vhost_dev
> >>> stops, so code flow follows usual cleanup.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    hw/virtio/vhost-shadow-virtqueue.h |   7 ++
> >>>    include/hw/virtio/vhost.h          |   4 +
> >>>    hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
> >>>    hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
> >>>    4 files changed, 265 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> >>> index 6cc18d6acb..c891c6510d 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.h
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> >>> @@ -17,6 +17,13 @@
> >>>
> >>>    typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >>>
> >>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> >>> +                           unsigned idx,
> >>> +                           VhostShadowVirtqueue *svq);
> >>> +void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >>> +                          unsigned idx,
> >>> +                          VhostShadowVirtqueue *svq);
> >>> +
> >>>    VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
> >>>
> >>>    void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
> >>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> >>> index ac963bf23d..7ffdf9aea0 100644
> >>> --- a/include/hw/virtio/vhost.h
> >>> +++ b/include/hw/virtio/vhost.h
> >>> @@ -55,6 +55,8 @@ struct vhost_iommu {
> >>>        QLIST_ENTRY(vhost_iommu) iommu_next;
> >>>    };
> >>>
> >>> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >>> +
> >>>    typedef struct VhostDevConfigOps {
> >>>        /* Vhost device config space changed callback
> >>>         */
> >>> @@ -83,7 +85,9 @@ struct vhost_dev {
> >>>        uint64_t backend_cap;
> >>>        bool started;
> >>>        bool log_enabled;
> >>> +    bool shadow_vqs_enabled;
> >>>        uint64_t log_size;
> >>> +    VhostShadowVirtqueue **shadow_vqs;
> >>
> >> Any reason that you don't embed the shadow virtqueue into
> >> vhost_virtqueue structure?
> >>
> > Not really, it could be relatively big and I would prefer SVQ
> > members/methods to remain hidden from any other part that includes
> > vhost.h. But it could be changed, for sure.
> >
> >> (Note that there's a masked_notifier in struct vhost_virtqueue).
> >>
> > They are used differently: in SVQ the masked notifier is a pointer,
> > and if it's NULL the SVQ code knows that device is not masked. The
> > vhost_virtqueue is the real owner.
>
>
> Yes, but it's an example of embedding auxiliary data structures in the
> vhost_virtqueue.
>
>
> >
> > It could be replaced by a boolean in SVQ or something like that; I
> > experimented with a tri-state too (UNMASKED, MASKED, MASKED_NOTIFIED)
> > and let the vhost.c code manage all the transitions. But I find the
> > pointer use clearer, since it's more natural for the existing
> > vhost_virtqueue_mask and vhost_virtqueue_pending functions.
> >
> > This masking/unmasking is the part I dislike the most from this
> > series, so I'm very open to alternatives.
>
>
> See below. I think we don't even need to care about that.
>
>
> >
> >>>        Error *migration_blocker;
> >>>        const VhostOps *vhost_ops;
> >>>        void *opaque;
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index 4512e5b058..3e43399e9c 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -8,9 +8,12 @@
> >>>     */
> >>>
> >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>> +#include "hw/virtio/vhost.h"
> >>> +
> >>> +#include "standard-headers/linux/vhost_types.h"
> >>>
> >>>    #include "qemu/error-report.h"
> >>> -#include "qemu/event_notifier.h"
> >>> +#include "qemu/main-loop.h"
> >>>
> >>>    /* Shadow virtqueue to relay notifications */
> >>>    typedef struct VhostShadowVirtqueue {
> >>> @@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
> >>>        EventNotifier kick_notifier;
> >>>        /* Shadow call notifier, sent to vhost */
> >>>        EventNotifier call_notifier;
> >>> +
> >>> +    /*
> >>> +     * Borrowed virtqueue's guest to host notifier.
> >>> +     * To borrow it in this event notifier allows to register on the event
> >>> +     * loop and access the associated shadow virtqueue easily. If we use the
> >>> +     * VirtQueue, we don't have an easy way to retrieve it.
> >>
> >> So this is something that worries me. It looks like a layer violation
> >> that makes the codes harder to work correctly.
> >>
> > I don't follow you here.
> >
> > The vhost code already depends on virtqueue in the same sense:
> > virtio_queue_get_host_notifier is called on vhost_virtqueue_start. So
> > if this behavior ever changes it is unlikely for vhost to keep working
> > without changes. vhost_virtqueue has a kick/call int where I think it
> > should be stored actually, but they are never used as far as I see.
> >
> > Previous RFC did rely on vhost_dev_disable_notifiers. From its documentation:
> > /* Stop processing guest IO notifications in vhost.
> >   * Start processing them in qemu.
> >   ...
> > But it was easier for this mode to miss a notification, since they
> > create a new host_notifier in virtio_bus_set_host_notifier right away.
> > So I decided to use the file descriptor already sent to vhost in
> > regular operation mode, so guest-related resources change less.
> >
> > Having said that, maybe it's useful to assert that
> > vhost_dev_{enable,disable}_notifiers are never called on shadow
> > virtqueue mode. Also, it could be useful to retrieve it from
> > virtio_bus, not raw shadow virtqueue, so all get/set are performed
> > from it. Would that make more sense?
> >
> >> I wonder if it would be simpler to start from a vDPA dedicated shadow
> >> virtqueue implementation:
> >>
> >> 1) have the above fields embeded in vhost_vdpa structure
> >> 2) Work at the level of
> >> vhost_vdpa_set_vring_kick()/vhost_vdpa_set_vring_call()
> >>
> > This notifier is never sent to the device in shadow virtqueue mode.
> > It's for SVQ to react to guest's notifications, registering it on its
> > main event loop [1]. So if I perform these changes the way I
> > understand them, SVQ would still rely on this borrowed EventNotifier,
> > and it would send to the vDPA device the newly created kick_notifier
> > of VhostShadowVirtqueue.
>
>
> The point is that vhost code should be coupled loosely with virtio. If
> you try to "borrow" an EventNotifier from virtio, you need to deal with a
> lot of synchronization. An example is the masking stuff.
>

I still don't follow this, sorry.

The svq->host_notifier event notifier is not affected by the masking
issue; it is completely private to SVQ. This commit creates and uses
it, and nothing related to masking is touched until the next commit.
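
To make "private" explicit, this is roughly the idea of the start path
(a condensed sketch, not the exact patch code): qemu keeps polling the
guest's own kick fd, and the device only ever sees the brand-new svq
kick fd.

static bool vhost_shadow_vq_start_sketch(struct vhost_dev *dev,
                                         unsigned idx,
                                         VhostShadowVirtqueue *svq)
{
    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
    struct vhost_vring_file file = {
        .index = idx,
        .fd = event_notifier_get_fd(&svq->kick_notifier),
    };

    /* qemu reacts to the guest's kicks from now on... */
    svq->host_notifier = *vq_host_notifier;
    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);

    /* ...and the device is only ever kicked through the svq notifier */
    return dev->vhost_ops->vhost_set_vring_kick(dev, &file) == 0;
}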

>
> >
> >> Then the layer is still isolated and you have a much simpler context to
> >> work that you don't need to care a lot of synchornization:
> >>
> >> 1) vq masking
> > This EventNotifier is not used for masking, it does not change from
> > the start of the shadow virtqueue operation through its end. Call fd
> > sent to vhost/vdpa device does not change either in shadow virtqueue
> > mode operation with masking/unmasking. I will try to document it
> > better.
> >
> > I think that we will need to handle synchronization with
> > masking/unmasking from the guest and dynamically enabling SVQ
> > operation mode, since they can happen at the same time as long as we
> > let the guest run. There may be better ways of synchronizing them of
> > course, but I don't see how moving to the vhost-vdpa backend helps
> > with this. Please expand if I've missed it.
> >
> > Or do you mean to forbid regular <-> SVQ operation mode transitions and delay it
> > to future patchsets?
>
>
> So my idea is to do all the shadow virtqueue in the vhost-vDPA code and
> hide it from the upper layers like virtio. This means it works at the
> vhost level, which can see vhost_vring_file only. When enabled, what it
> needs is just:
>
> 1) switch to use svq kickfd and relay ioeventfd to svq kickfd
> 2) switch to use svq callfd and relay svq callfd to irqfd
>
> It will still behave like a vhost backend; the switching is done
> internally in vhost-vDPA, totally transparent to the virtio code of
> Qemu.
>
> E.g:
>
> 1) in the case of guest notifier masking, we don't need to do anything,
> since the virtio code will replace the irqfd with another one for us.

Assuming that we don't modify the vhost masking code, but send the shadow
virtqueue call descriptor to the vhost device:

If the guest virtio code masks the virtqueue and replaces the vhost-vdpa
device call fd (VhostShadowVirtqueue.call_notifier in the next commit,
or the descriptor in your previous second point, svq callfd) with the
masked notifier, vhost_shadow_vq_handle_call will not be called
anymore, and no more used descriptors will be forwarded. They will be
stuck in the shadow virtqueue forever. The guest itself cannot recover
from this situation, since masking will set the irqfd, not the SVQ call fd.

> 2) it is easy to deal with vhost dev start and stop
>
> The advantages are obvious, simple and easy to implement.
>

I still don't see how performing this step from backend code avoids
the synchronization problem, since they will be done from different
threads anyway. Not sure what piece I'm missing.

I can see / have tested a few solutions, but I don't like them a lot:

* Forbid hot-swapping from/to shadow virtqueue mode, and set it from the
cmdline (see the property sketch after this list): We will have to deal
with setting the SVQ mode dynamically sooner or later if we want to use
it for live migration.
* Forbid coming back to regular mode after switching to shadow
virtqueue mode: The heavy part of the synchronization comes from svq
stopping code, since we need to serialize the setting of device call
fd. This could be acceptable, but I'm not sure about the implications:
What happens if live migration fails and we need to step back? A mutex
is not needed in this scenario, it's ok with atomics and RCU code.

* Replace KVM_IRQFD instead and let SVQ poll the old one and the masked
notifier: I haven't thought a lot about this one; I think it's better
not to touch guest notifiers.
* Monitor also masked notifier from SVQ: I think this could be
promising, but SVQ needs to be notified about masking/unmasking
anyway, and there is code that depends on checking the masked notifier
for the pending notification.
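
For the first option, the cmdline knob could be as simple as this sketch
(the "x-svq" property name and the VhostVdpaState type/field are invented
here for illustration):

static Property vhost_vdpa_properties[] = {
    /* Hypothetical: enable SVQ at device creation, no mode switching */
    DEFINE_PROP_BOOL("x-svq", VhostVdpaState, shadow_vqs_enabled, false),
    DEFINE_PROP_END_OF_LIST(),
};

It side-steps the stop-path synchronization entirely, at the cost of
fixing the mode at device creation.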

>
> >
> >> 2) vhost dev start and stop
> >>
> >> ?
> >>
> >>
> >>> +     *
> >>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> >>> +     */
> >>> +    EventNotifier host_notifier;
> >>> +
> >>> +    /* Virtio queue shadowing */
> >>> +    VirtQueue *vq;
> >>>    } VhostShadowVirtqueue;
> >>>
> >>> +/* Forward guest notifications */
> >>> +static void vhost_handle_guest_kick(EventNotifier *n)
> >>> +{
> >>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>> +                                             host_notifier);
> >>> +
> >>> +    if (unlikely(!event_notifier_test_and_clear(n))) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    event_notifier_set(&svq->kick_notifier);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
> >>> + */
> >>> +static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
> >>> +                                                     unsigned vhost_index,
> >>> +                                                     VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> >>> +    struct vhost_vring_file file = {
> >>> +        .index = vhost_index,
> >>> +        .fd = event_notifier_get_fd(vq_host_notifier),
> >>> +    };
> >>> +    int r;
> >>> +
> >>> +    /* Restore vhost kick */
> >>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> >>> +    return r ? -errno : 0;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Start shadow virtqueue operation.
> >>> + * @dev vhost device
> >>> + * @hidx vhost virtqueue index
> >>> + * @svq Shadow Virtqueue
> >>> + */
> >>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> >>> +                           unsigned idx,
> >>> +                           VhostShadowVirtqueue *svq)
> >>
> >> It looks to me this assumes the vhost_dev is started before
> >> vhost_shadow_vq_start()?
> >>
> > Right.
>
>
> This might not be true. The guest may enable and disable virtio drivers
> after the shadow virtqueue is started. You need to deal with that.
>

Right, I will test this scenario.

> Thanks
>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-17 16:47         ` Eugenio Perez Martin
@ 2021-03-18  3:10           ` Jason Wang
  2021-03-18  9:18             ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-18  3:10 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller


On 2021/3/18 12:47 AM, Eugenio Perez Martin wrote:
> On Wed, Mar 17, 2021 at 3:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/16 6:31 PM, Eugenio Perez Martin wrote:
>>> On Tue, Mar 16, 2021 at 8:18 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
>>>>> Shadow virtqueue notifications forwarding is disabled when vhost_dev
>>>>> stops, so code flow follows usual cleanup.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>     hw/virtio/vhost-shadow-virtqueue.h |   7 ++
>>>>>     include/hw/virtio/vhost.h          |   4 +
>>>>>     hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
>>>>>     hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
>>>>>     4 files changed, 265 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>>>>> index 6cc18d6acb..c891c6510d 100644
>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>>>>> @@ -17,6 +17,13 @@
>>>>>
>>>>>     typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>>>>>
>>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
>>>>> +                           unsigned idx,
>>>>> +                           VhostShadowVirtqueue *svq);
>>>>> +void vhost_shadow_vq_stop(struct vhost_dev *dev,
>>>>> +                          unsigned idx,
>>>>> +                          VhostShadowVirtqueue *svq);
>>>>> +
>>>>>     VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
>>>>>
>>>>>     void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
>>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>>>> index ac963bf23d..7ffdf9aea0 100644
>>>>> --- a/include/hw/virtio/vhost.h
>>>>> +++ b/include/hw/virtio/vhost.h
>>>>> @@ -55,6 +55,8 @@ struct vhost_iommu {
>>>>>         QLIST_ENTRY(vhost_iommu) iommu_next;
>>>>>     };
>>>>>
>>>>> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>>>>> +
>>>>>     typedef struct VhostDevConfigOps {
>>>>>         /* Vhost device config space changed callback
>>>>>          */
>>>>> @@ -83,7 +85,9 @@ struct vhost_dev {
>>>>>         uint64_t backend_cap;
>>>>>         bool started;
>>>>>         bool log_enabled;
>>>>> +    bool shadow_vqs_enabled;
>>>>>         uint64_t log_size;
>>>>> +    VhostShadowVirtqueue **shadow_vqs;
>>>> Any reason that you don't embed the shadow virtqueue into
>>>> vhost_virtqueue structure?
>>>>
>>> Not really, it could be relatively big and I would prefer SVQ
>>> members/methods to remain hidden from any other part that includes
>>> vhost.h. But it could be changed, for sure.
>>>
>>>> (Note that there's a masked_notifier in struct vhost_virtqueue).
>>>>
>>> They are used differently: in SVQ the masked notifier is a pointer,
>>> and if it's NULL the SVQ code knows that device is not masked. The
>>> vhost_virtqueue is the real owner.
>>
>> Yes, but it's an example of embedding auxiliary data structures in the
>> vhost_virtqueue.
>>
>>
>>> It could be replaced by a boolean in SVQ or something like that; I
>>> experimented with a tri-state too (UNMASKED, MASKED, MASKED_NOTIFIED)
>>> and let the vhost.c code manage all the transitions. But I find the
>>> pointer use clearer, since it's more natural for the existing
>>> vhost_virtqueue_mask and vhost_virtqueue_pending functions.
>>>
>>> This masking/unmasking is the part I dislike the most from this
>>> series, so I'm very open to alternatives.
>>
>> See below. I think we don't even need to care about that.
>>
>>
>>>>>         Error *migration_blocker;
>>>>>         const VhostOps *vhost_ops;
>>>>>         void *opaque;
>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>>>> index 4512e5b058..3e43399e9c 100644
>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>>>> @@ -8,9 +8,12 @@
>>>>>      */
>>>>>
>>>>>     #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>>> +#include "hw/virtio/vhost.h"
>>>>> +
>>>>> +#include "standard-headers/linux/vhost_types.h"
>>>>>
>>>>>     #include "qemu/error-report.h"
>>>>> -#include "qemu/event_notifier.h"
>>>>> +#include "qemu/main-loop.h"
>>>>>
>>>>>     /* Shadow virtqueue to relay notifications */
>>>>>     typedef struct VhostShadowVirtqueue {
>>>>> @@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
>>>>>         EventNotifier kick_notifier;
>>>>>         /* Shadow call notifier, sent to vhost */
>>>>>         EventNotifier call_notifier;
>>>>> +
>>>>> +    /*
>>>>> +     * Borrowed virtqueue's guest to host notifier.
>>>>> +     * To borrow it in this event notifier allows to register on the event
>>>>> +     * loop and access the associated shadow virtqueue easily. If we use the
>>>>> +     * VirtQueue, we don't have an easy way to retrieve it.
>>>> So this is something that worries me. It looks like a layer violation
>>>> that makes the codes harder to work correctly.
>>>>
>>> I don't follow you here.
>>>
>>> The vhost code already depends on virtqueue in the same sense:
>>> virtio_queue_get_host_notifier is called on vhost_virtqueue_start. So
>>> if this behavior ever changes it is unlikely for vhost to keep working
>>> without changes. vhost_virtqueue has a kick/call int where I think it
>>> should be stored actually, but they are never used as far as I see.
>>>
>>> Previous RFC did rely on vhost_dev_disable_notifiers. From its documentation:
>>> /* Stop processing guest IO notifications in vhost.
>>>    * Start processing them in qemu.
>>>    ...
>>> But it was easier for this mode to miss a notification, since they
>>> create a new host_notifier in virtio_bus_set_host_notifier right away.
>>> So I decided to use the file descriptor already sent to vhost in
>>> regular operation mode, so guest-related resources change less.
>>>
>>> Having said that, maybe it's useful to assert that
>>> vhost_dev_{enable,disable}_notifiers are never called on shadow
>>> virtqueue mode. Also, it could be useful to retrieve it from
>>> virtio_bus, not raw shadow virtqueue, so all get/set are performed
>>> from it. Would that make more sense?
>>>
>>>> I wonder if it would be simpler to start from a vDPA dedicated shadow
>>>> virtqueue implementation:
>>>>
>>>> 1) have the above fields embeded in vhost_vdpa structure
>>>> 2) Work at the level of
>>>> vhost_vdpa_set_vring_kick()/vhost_vdpa_set_vring_call()
>>>>
>>> This notifier is never sent to the device in shadow virtqueue mode.
>>> It's for SVQ to react to guest's notifications, registering it on its
>>> main event loop [1]. So if I perform these changes the way I
>>> understand them, SVQ would still rely on this borrowed EventNotifier,
>>> and it would send to the vDPA device the newly created kick_notifier
>>> of VhostShadowVirtqueue.
>>
>> The point is that vhost code should be coupled loosely with virtio. If
>> you try to "borrow" an EventNotifier from virtio, you need to deal with a
>> lot of synchronization. An example is the masking stuff.
>>
> I still don't follow this, sorry.
>
> The svq->host_notifier event notifier is not affected by the masking
> issue; it is completely private to SVQ. This commit creates and uses
> it, and nothing related to masking is touched until the next commit.
>
>>>> Then the layer is still isolated and you have a much simpler context to
>>>> work that you don't need to care a lot of synchornization:
>>>>
>>>> 1) vq masking
>>> This EventNotifier is not used for masking, it does not change from
>>> the start of the shadow virtqueue operation through its end. Call fd
>>> sent to vhost/vdpa device does not change either in shadow virtqueue
>>> mode operation with masking/unmasking. I will try to document it
>>> better.
>>>
>>> I think that we will need to handle synchronization with
>>> masking/unmasking from the guest and dynamically enabling SVQ
>>> operation mode, since they can happen at the same time as long as we
>>> let the guest run. There may be better ways of synchronizing them of
>>> course, but I don't see how moving to the vhost-vdpa backend helps
>>> with this. Please expand if I've missed it.
>>>
>>> Or do you mean to forbid regular <-> SVQ operation mode transitions and delay it
>>> to future patchsets?
>>
>> So my idea is to do all the shadow virtqueue in the vhost-vDPA code and
>> hide it from the upper layers like virtio. This means it works at the
>> vhost level, which can see vhost_vring_file only. When enabled, what it
>> needs is just:
>>
>> 1) switch to use svq kickfd and relay ioeventfd to svq kickfd
>> 2) switch to use svq callfd and relay svq callfd to irqfd
>>
>> It will still behave like a vhost backend; the switching is done
>> internally in vhost-vDPA, totally transparent to the virtio code of
>> Qemu.
>>
>> E.g:
>>
>> 1) in the case of guest notifier masking, we don't need to do anything,
>> since the virtio code will replace the irqfd with another one for us.
> Assuming that we don't modify the vhost masking code, but send the shadow
> virtqueue call descriptor to the vhost device:
>
> If the guest virtio code masks the virtqueue and replaces the vhost-vdpa
> device call fd (VhostShadowVirtqueue.call_notifier in the next commit,
> or the descriptor in your previous second point, svq callfd) with the
> masked notifier, vhost_shadow_vq_handle_call will not be called
> anymore, and no more used descriptors will be forwarded. They will be
> stuck in the shadow virtqueue forever. The guest itself cannot recover
> from this situation, since masking will set the irqfd, not the SVQ call fd.


Just to make sure we're on the same page: during vq masking, the virtio
code actually uses the masked_notifier as callfd in vhost_virtqueue_mask():

     if (mask) {
         assert(vdev->use_guest_notifier_mask);
         file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
     } else {
         file.fd = event_notifier_get_fd(virtio_queue_get_guest_notifier(vvq));
     }

     file.index = hdev->vhost_ops->vhost_get_vq_index(hdev, n);
     r = hdev->vhost_ops->vhost_set_vring_call(hdev, &file);

So consider the shadow virtqueue is done at the vhost-vDPA level. We just
need to make sure:

1) update the callfd which is passed by the virtio layer via set_vring_call()
2) always write to the callfd during vhost_shadow_vq_handle_call()

Then

3) When shadow vq is enabled, we just set the callfd of shadow virtqueue 
to vDPA via VHOST_SET_VRING_CALL, and poll the svq callfd
4) When shadow vq is disabled, we just set the callfd that is passed by
virtio via VHOST_SET_VRING_CALL, and stop polling the svq callfd

So you can see that in steps 2 and 4 we don't need to know whether or not
the vq is masked, since we follow the vhost protocol ("VhostOps") and do
everything transparently in the vhost-(vDPA) layer.
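
In code, that interception could look something like this very rough
sketch in vhost-vdpa.c (vhost_shadow_vq_set_guest_call_fd() and the
shadow_vqs fields on struct vhost_vdpa are invented names;
vhost_vdpa_call() is the existing ioctl helper there):

static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
                                     struct vhost_vring_file *file)
{
    struct vhost_vdpa *v = dev->opaque;

    if (v->shadow_vqs_enabled) {
        VhostShadowVirtqueue *svq = v->shadow_vqs[file->index];

        /* Step 1: remember the callfd virtio gave us, masked or not;
         * step 2 is vhost_shadow_vq_handle_call() signalling it. */
        vhost_shadow_vq_set_guest_call_fd(svq, file->fd);

        /* Step 3: the device keeps signalling the svq call notifier */
        file->fd = event_notifier_get_fd(&svq->call_notifier);
    }

    /* Step 4 is this same call with the virtio-provided fd untouched */
    return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
}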


>
>> 2) it is easy to deal with vhost dev start and stop
>>
>> The advantages are obvious, simple and easy to implement.
>>
> I still don't see how performing this step from backend code avoids
> the synchronization problem, since they will be done from different
> threads anyway. Not sure what piece I'm missing.


See my reply in another thread. If you enable the shadow virtqueue via an
OOB monitor, that's a real issue.

But I don't think we need to do that since

1) SVQ should be transparent to management
2) it avoids an unnecessary synchronization issue

We can enable the shadow virtqueue through the cli, probably with a new
vhost-vdpa parameter. Then we don't need to care about threads. And in
the final version with full live migration support, the shadow virtqueue
should be enabled automatically. E.g. for devices without
VHOST_F_LOG_ALL, or we can have a dedicated vDPA capability via
VHOST_GET_BACKEND_FEATURES.
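
Such a new parameter could look like the following (the x-shadow-vq name
and syntax are hypothetical, just to illustrate the idea, not an existing
option):

    qemu-system-x86_64 ... \
        -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vdpa0,x-shadow-vq=on \
        -device virtio-net-pci,netdev=vdpa0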

Thanks


>
> I can see / have tested a few solutions, but I don't like them a lot:
>
> * Forbid hot-swapping from/to shadow virtqueue mode, and set it from
> cmdline: We will have to deal with setting the SVQ mode dynamically
> sooner or later if we want to use it for live migration.
> * Forbid coming back to regular mode after switching to shadow
> virtqueue mode: The heavy part of the synchronization comes from the svq
> stopping code, since we need to serialize the setting of the device call
> fd. This could be acceptable, but I'm not sure about the implications:
> What happens if live migration fails and we need to step back? A mutex
> is not needed in this scenario; it's fine with atomics and RCU code.
>
> * Replace KVM_IRQFD instead and let SVQ poll the old one and the masked
> notifier: I haven't thought a lot about this one, I think it's better
> not to touch guest notifiers.
> * Also monitor the masked notifier from SVQ: I think this could be
> promising, but SVQ needs to be notified about masking/unmasking
> anyway, and there is code that depends on checking the masked notifier
> for the pending notification.
>
>>>> 2) vhost dev start and stop
>>>>
>>>> ?
>>>>
>>>>
>>>>> +     *
>>>>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
>>>>> +     */
>>>>> +    EventNotifier host_notifier;
>>>>> +
>>>>> +    /* Virtio queue shadowing */
>>>>> +    VirtQueue *vq;
>>>>>     } VhostShadowVirtqueue;
>>>>>
>>>>> +/* Forward guest notifications */
>>>>> +static void vhost_handle_guest_kick(EventNotifier *n)
>>>>> +{
>>>>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>>>> +                                             host_notifier);
>>>>> +
>>>>> +    if (unlikely(!event_notifier_test_and_clear(n))) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    event_notifier_set(&svq->kick_notifier);
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
>>>>> + */
>>>>> +static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
>>>>> +                                                     unsigned vhost_index,
>>>>> +                                                     VhostShadowVirtqueue *svq)
>>>>> +{
>>>>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
>>>>> +    struct vhost_vring_file file = {
>>>>> +        .index = vhost_index,
>>>>> +        .fd = event_notifier_get_fd(vq_host_notifier),
>>>>> +    };
>>>>> +    int r;
>>>>> +
>>>>> +    /* Restore vhost kick */
>>>>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
>>>>> +    return r ? -errno : 0;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Start shadow virtqueue operation.
>>>>> + * @dev vhost device
>>>>> + * @hidx vhost virtqueue index
>>>>> + * @svq Shadow Virtqueue
>>>>> + */
>>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
>>>>> +                           unsigned idx,
>>>>> +                           VhostShadowVirtqueue *svq)
>>>> It looks to me this assumes the vhost_dev is started before
>>>> vhost_shadow_vq_start()?
>>>>
>>> Right.
>>
>> This might not be true. The guest may enable and disable virtio drivers
>> after the shadow virtqueue is started. You need to deal with that.
>>
> Right, I will test this scenario.
>
>> Thanks
>>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding
  2021-03-17 14:38         ` Eugenio Perez Martin
@ 2021-03-18  3:14           ` Jason Wang
  2021-03-18  8:06             ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-18  3:14 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Rob Miller, Parav Pandit, Juan Quintela, Guru Prasad,
	Michael S. Tsirkin, Markus Armbruster, qemu-level,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Stefano Garzarella


On 2021/3/17 10:38 PM, Eugenio Perez Martin wrote:
> On Wed, Mar 17, 2021 at 3:51 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/17 12:05 AM, Eugenio Perez Martin wrote:
>>> On Tue, Mar 16, 2021 at 9:15 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
>>>>> Initial version of shadow virtqueue that actually forwards buffers.
>>>>>
>>>>> It reuses the VirtQueue code for the device part. The driver part is
>>>>> based on Linux's virtio_ring driver, but with stripped functionality
>>>>> and optimizations so it's easier to review.
>>>>>
>>>>> These will be added in later commits.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>     hw/virtio/vhost-shadow-virtqueue.c | 212 +++++++++++++++++++++++++++--
>>>>>     hw/virtio/vhost.c                  | 113 ++++++++++++++-
>>>>>     2 files changed, 312 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>>>> index 1460d1d5d1..68ed0f2740 100644
>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>>>> @@ -9,6 +9,7 @@
>>>>>
>>>>>     #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>>>     #include "hw/virtio/vhost.h"
>>>>> +#include "hw/virtio/virtio-access.h"
>>>>>
>>>>>     #include "standard-headers/linux/vhost_types.h"
>>>>>
>>>>> @@ -55,11 +56,96 @@ typedef struct VhostShadowVirtqueue {
>>>>>         /* Virtio device */
>>>>>         VirtIODevice *vdev;
>>>>>
>>>>> +    /* Map for returning guest's descriptors */
>>>>> +    VirtQueueElement **ring_id_maps;
>>>>> +
>>>>> +    /* Next head to expose to device */
>>>>> +    uint16_t avail_idx_shadow;
>>>>> +
>>>>> +    /* Next free descriptor */
>>>>> +    uint16_t free_head;
>>>>> +
>>>>> +    /* Last seen used idx */
>>>>> +    uint16_t shadow_used_idx;
>>>>> +
>>>>> +    /* Next head to consume from device */
>>>>> +    uint16_t used_idx;
>>>>> +
>>>>>         /* Descriptors copied from guest */
>>>>>         vring_desc_t descs[];
>>>>>     } VhostShadowVirtqueue;
>>>>>
>>>>> -/* Forward guest notifications */
>>>>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>>>> +                                    const struct iovec *iovec,
>>>>> +                                    size_t num, bool more_descs, bool write)
>>>>> +{
>>>>> +    uint16_t i = svq->free_head, last = svq->free_head;
>>>>> +    unsigned n;
>>>>> +    uint16_t flags = write ? virtio_tswap16(svq->vdev, VRING_DESC_F_WRITE) : 0;
>>>>> +    vring_desc_t *descs = svq->vring.desc;
>>>>> +
>>>>> +    if (num == 0) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    for (n = 0; n < num; n++) {
>>>>> +        if (more_descs || (n + 1 < num)) {
>>>>> +            descs[i].flags = flags | virtio_tswap16(svq->vdev,
>>>>> +                                                    VRING_DESC_F_NEXT);
>>>>> +        } else {
>>>>> +            descs[i].flags = flags;
>>>>> +        }
>>>>> +        descs[i].addr = virtio_tswap64(svq->vdev, (hwaddr)iovec[n].iov_base);
>>>> So using virtio_tswap() is probably not correct since we're talking
>>>> with vhost backends, which have their own endianness.
>>>>
>>> I was trying to expose the buffer with the same endianness as the
>>> driver originally offered, so if guest->qemu requires a bswap, I think
>>> it will always require a bswap again to expose it to the device.
>>
>> So assume vhost-vDPA always provides a non-transitional device[1].
>>
>> Then if Qemu presents a transitional device, we need to do the endian
>> conversion when necessary; if Qemu presents a non-transitional device, we
>> don't need to do that, the guest driver will do it for us.
>>
>> But it looks to me that virtio_tswap() can't be used for this, since it does:
>>
>> static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
>> {
>> #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
>>       return virtio_is_big_endian(vdev);
>> #elif defined(TARGET_WORDS_BIGENDIAN)
>>       if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
>>           /* Devices conforming to VIRTIO 1.0 or later are always LE. */
>>           return false;
>>       }
>>       return true;
>> #else
>>       return false;
>> #endif
>> }
>>
>> So if we present a legacy device on top of a non-transitional vDPA
>> device, the VIRTIO_F_VERSION_1 check is wrong.
>>
>>
>>>> For vhost-vDPA, we can assume that it's a 1.0 device.
>>> Isn't it needed if the host is big endian?
>>
>> [1]
>>
>> So a non-transitional device always assumes little endian.
>>
>> For vhost-vDPA, we don't want to present a transitional device, which may
>> end up with a lot of burdens.
>>
>> I suspect legacy drivers plus vhost-vDPA are already broken, so I plan to
>> mandate VERSION_1 for all vDPA devices.
>>
> Right. I think that's the best then.
>
> However, then we will need a similar method to always expose
> address/length as little endian, won't we?


Yes.
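
For reference, a minimal sketch of such a method using the fixed-endian
helpers from "qemu/bswap.h" instead of virtio_tswap*(); the function name
and placement are invented for illustration:

    #include "qemu/bswap.h"

    /*
     * Write one descriptor always in little endian, as a VERSION_1
     * (non-transitional) vhost-vDPA device expects, regardless of the
     * guest's endianness.
     */
    static void vhost_svq_write_desc_le(vring_desc_t *desc, hwaddr addr,
                                        uint32_t len, uint16_t flags,
                                        uint16_t next)
    {
        desc->addr = cpu_to_le64(addr);
        desc->len = cpu_to_le32(len);
        desc->flags = cpu_to_le16(flags);
        desc->next = cpu_to_le16(next);
    }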


>
>>>>> +        descs[i].len = virtio_tswap32(svq->vdev, iovec[n].iov_len);
>>>>> +
>>>>> +        last = i;
>>>>> +        i = virtio_tswap16(svq->vdev, descs[i].next);
>>>>> +    }
>>>>> +
>>>>> +    svq->free_head = virtio_tswap16(svq->vdev, descs[last].next);
>>>>> +}
>>>>> +
>>>>> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
>>>>> +                                          VirtQueueElement *elem)
>>>>> +{
>>>>> +    int head;
>>>>> +    unsigned avail_idx;
>>>>> +    vring_avail_t *avail = svq->vring.avail;
>>>>> +
>>>>> +    head = svq->free_head;
>>>>> +
>>>>> +    /* We need some descriptors here */
>>>>> +    assert(elem->out_num || elem->in_num);
>>>>> +
>>>>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>>>>> +                            elem->in_num > 0, false);
>>>>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>>>>> +
>>>>> +    /*
>>>>> +     * Put entry in available array (but don't update avail->idx until they
>>>>> +     * do sync).
>>>>> +     */
>>>>> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
>>>>> +    avail->ring[avail_idx] = virtio_tswap16(svq->vdev, head);
>>>>> +    svq->avail_idx_shadow++;
>>>>> +
>>>>> +    /* Expose descriptors to device */
>>>>> +    smp_wmb();
>>>>> +    avail->idx = virtio_tswap16(svq->vdev, svq->avail_idx_shadow);
>>>>> +
>>>>> +    return head;
>>>>> +
>>>>> +}
>>>>> +
>>>>> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
>>>>> +                                VirtQueueElement *elem)
>>>>> +{
>>>>> +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
>>>>> +
>>>>> +    svq->ring_id_maps[qemu_head] = elem;
>>>>> +}
>>>>> +
>>>>> +/* Handle guest->device notifications */
>>>>>     static void vhost_handle_guest_kick(EventNotifier *n)
>>>>>     {
>>>>>         VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>>>> @@ -69,7 +155,72 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>>>>>             return;
>>>>>         }
>>>>>
>>>>> -    event_notifier_set(&svq->kick_notifier);
>>>>> +    /* Make available as many buffers as possible */
>>>>> +    do {
>>>>> +        if (virtio_queue_get_notification(svq->vq)) {
>>>>> +            /* No more notifications until process all available */
>>>>> +            virtio_queue_set_notification(svq->vq, false);
>>>>> +        }
>>>>> +
>>>>> +        while (true) {
>>>>> +            VirtQueueElement *elem;
>>>>> +            if (virtio_queue_full(svq->vq)) {
>>>>> +                break;
>>>> So we've disabled guest notification. If a buffer has been consumed, we
>>>> need to retry handle_guest_kick here. But I didn't find the code?
>>>>
>>> This code follows the pattern of virtio_blk_handle_vq: we jump out of
>>> the inner while, and we re-enable the notifications. After that, we
>>> check for updates on guest avail_idx.
>>
>> Ok, but this will end up with a lot of unnecessary kicks without event
>> index.
>>
> I can move the kick out of the inner loop, but that could add latency.


So I think the right way is to disable the notification until some
buffers are consumed by the used ring.
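
A hedged sketch of that flow on top of this patch's struct fields;
svq_ring_is_full() and svq_rearm_guest_kick() are hypothetical helpers,
not code from the series:

    /* True while every descriptor is in flight (available, not used) */
    static bool svq_ring_is_full(const VhostShadowVirtqueue *svq)
    {
        return (uint16_t)(svq->avail_idx_shadow - svq->used_idx) ==
               svq->vring.num;
    }

    /* Called from the call handler after flushing used buffers */
    static void svq_rearm_guest_kick(VhostShadowVirtqueue *svq)
    {
        if (!svq_ring_is_full(svq)) {
            virtio_queue_set_notification(svq->vq, true);
            /* Re-check kicks that arrived while notifications were off */
            event_notifier_set(&svq->host_notifier);
        }
    }

vhost_handle_guest_kick would then keep notifications disabled when it
finds the ring full, instead of re-enabling them unconditionally.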


>
>>>>> +            }
>>>>> +
>>>>> +            elem = virtqueue_pop(svq->vq, sizeof(*elem));
>>>>> +            if (!elem) {
>>>>> +                break;
>>>>> +            }
>>>>> +
>>>>> +            vhost_shadow_vq_add(svq, elem);
>>>>> +            event_notifier_set(&svq->kick_notifier);
>>>>> +        }
>>>>> +
>>>>> +        virtio_queue_set_notification(svq->vq, true);
>>>>> +    } while (!virtio_queue_empty(svq->vq));
>>>>> +}
>>>>> +
>>>>> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
>>>>> +{
>>>>> +    if (svq->used_idx != svq->shadow_used_idx) {
>>>>> +        return true;
>>>>> +    }
>>>>> +
>>>>> +    /* Get used idx must not be reordered */
>>>>> +    smp_rmb();
>>>>> +    svq->shadow_used_idx = virtio_tswap16(svq->vdev, svq->vring.used->idx);
>>>>> +
>>>>> +    return svq->used_idx != svq->shadow_used_idx;
>>>>> +}
>>>>> +
>>>>> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
>>>>> +{
>>>>> +    vring_desc_t *descs = svq->vring.desc;
>>>>> +    const vring_used_t *used = svq->vring.used;
>>>>> +    vring_used_elem_t used_elem;
>>>>> +    uint16_t last_used;
>>>>> +
>>>>> +    if (!vhost_shadow_vq_more_used(svq)) {
>>>>> +        return NULL;
>>>>> +    }
>>>>> +
>>>>> +    last_used = svq->used_idx & (svq->vring.num - 1);
>>>>> +    used_elem.id = virtio_tswap32(svq->vdev, used->ring[last_used].id);
>>>>> +    used_elem.len = virtio_tswap32(svq->vdev, used->ring[last_used].len);
>>>>> +
>>>>> +    if (unlikely(used_elem.id >= svq->vring.num)) {
>>>>> +        error_report("Device %s says index %u is available", svq->vdev->name,
>>>>> +                     used_elem.id);
>>>>> +        return NULL;
>>>>> +    }
>>>>> +
>>>>> +    descs[used_elem.id].next = svq->free_head;
>>>>> +    svq->free_head = used_elem.id;
>>>>> +
>>>>> +    svq->used_idx++;
>>>>> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
>>>>> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
>>>>>     }
>>>>>
>>>>>     /* Forward vhost notifications */
>>>>> @@ -78,6 +229,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>>>>         VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>>>>                                                  call_notifier);
>>>>>         EventNotifier *masked_notifier;
>>>>> +    VirtQueue *vq = svq->vq;
>>>>>
>>>>>         /* Signal start of using masked notifier */
>>>>>         qemu_event_reset(&svq->masked_notifier.is_free);
>>>>> @@ -86,14 +238,29 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>>>>             qemu_event_set(&svq->masked_notifier.is_free);
>>>>>         }
>>>>>
>>>>> -    if (!masked_notifier) {
>>>>> -        unsigned n = virtio_get_queue_index(svq->vq);
>>>>> -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
>>>>> -        virtio_notify_irqfd(svq->vdev, svq->vq);
>>>>> -    } else if (!svq->masked_notifier.signaled) {
>>>>> -        svq->masked_notifier.signaled = true;
>>>>> -        event_notifier_set(svq->masked_notifier.n);
>>>>> -    }
>>>>> +    /* Make as many buffers as possible used. */
>>>>> +    do {
>>>>> +        unsigned i = 0;
>>>>> +
>>>>> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
>>>>> +        while (true) {
>>>>> +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
>>>>> +            if (!elem) {
>>>>> +                break;
>>>>> +            }
>>>>> +
>>>>> +            assert(i < svq->vring.num);
>>>>> +            virtqueue_fill(vq, elem, elem->len, i++);
>>>>> +        }
>>>>> +
>>>>> +        virtqueue_flush(vq, i);
>>>>> +        if (!masked_notifier) {
>>>>> +            virtio_notify_irqfd(svq->vdev, svq->vq);
>>>>> +        } else if (!svq->masked_notifier.signaled) {
>>>>> +            svq->masked_notifier.signaled = true;
>>>>> +            event_notifier_set(svq->masked_notifier.n);
>>>>> +        }
>>>>> +    } while (vhost_shadow_vq_more_used(svq));
>>>>>
>>>>>         if (masked_notifier) {
>>>>>             /* Signal not using it anymore */
>>>>> @@ -103,7 +270,6 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>>>>
>>>>>     static void vhost_shadow_vq_handle_call(EventNotifier *n)
>>>>>     {
>>>>> -
>>>>>         if (likely(event_notifier_test_and_clear(n))) {
>>>>>             vhost_shadow_vq_handle_call_no_test(n);
>>>>>         }
>>>>> @@ -254,7 +420,11 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
>>>>>                               unsigned idx,
>>>>>                               VhostShadowVirtqueue *svq)
>>>>>     {
>>>>> +    int i;
>>>>>         int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
>>>>> +
>>>>> +    assert(!dev->shadow_vqs_enabled);
>>>>> +
>>>>>         if (unlikely(r < 0)) {
>>>>>             error_report("Couldn't restore vq kick fd: %s", strerror(-r));
>>>>>         }
>>>>> @@ -272,6 +442,18 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
>>>>>         /* Restore vhost call */
>>>>>         vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
>>>>>                              dev->vqs[idx].notifier_is_masked);
>>>>> +
>>>>> +
>>>>> +    for (i = 0; i < svq->vring.num; ++i) {
>>>>> +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
>>>>> +        /*
>>>>> +         * Although the doc says we must unpop in order, it's ok to unpop
>>>>> +         * everything.
>>>>> +         */
>>>>> +        if (elem) {
>>>>> +            virtqueue_unpop(svq->vq, elem, elem->len);
>>>> Shouldn't we wait until all pending requests are drained? Or
>>>> we may end up with duplicated requests?
>>>>
>>> Do you mean pending as in-flight/processing in the device? The device
>>> must be paused at this point.
>>
>> Ok. I see there's a vhost_set_vring_enable(dev, false) in
>> vhost_sw_live_migration_start().
>>
>>
>>> Currently there is no assertion for
>>> this, maybe we can track the device status for it.
>>>
>>> For the queue handlers to be running at this point, the main event
>>> loop should serialize QMP and handlers as far as I know (and they
>>> would make all state inconsistent if the device stops suddenly). It
>>> would need to be synchronized if the handlers run in their own AIO
>>> context. That would be nice to have but it's not included here.
>>
>> That's why I suggest just dropping the QMP stuff and using cli parameters
>> to enable the shadow virtqueue. Things would be greatly simplified, I guess.
>>
> I can send a series without it, but SVQ will need to be able to kick
> in dynamically sooner or later if we want to use it for live
> migration.


I'm not sure I get the issue here. My understanding is everything will be
processed in the same AIO context.

Thanks


>
>> Thanks
>>
>>
>>>> Thanks
>>>>
>>>>
>>>>> +        }
>>>>> +    }
>>>>>     }
>>>>>
>>>>>     /*
>>>>> @@ -284,7 +466,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>>>>>         unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
>>>>>         size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
>>>>>         g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
>>>>> -    int r;
>>>>> +    int r, i;
>>>>>
>>>>>         r = event_notifier_init(&svq->kick_notifier, 0);
>>>>>         if (r != 0) {
>>>>> @@ -303,6 +485,11 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>>>>>         vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
>>>>>         svq->vq = virtio_get_queue(dev->vdev, vq_idx);
>>>>>         svq->vdev = dev->vdev;
>>>>> +    for (i = 0; i < num - 1; i++) {
>>>>> +        svq->descs[i].next = virtio_tswap16(dev->vdev, i + 1);
>>>>> +    }
>>>>> +
>>>>> +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
>>>>>         event_notifier_set_handler(&svq->call_notifier,
>>>>>                                    vhost_shadow_vq_handle_call);
>>>>>         qemu_event_init(&svq->masked_notifier.is_free, true);
>>>>> @@ -324,5 +511,6 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
>>>>>         event_notifier_cleanup(&vq->kick_notifier);
>>>>>         event_notifier_set_handler(&vq->call_notifier, NULL);
>>>>>         event_notifier_cleanup(&vq->call_notifier);
>>>>> +    g_free(vq->ring_id_maps);
>>>>>         g_free(vq);
>>>>>     }
>>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>>> index eab3e334f2..a373999bc4 100644
>>>>> --- a/hw/virtio/vhost.c
>>>>> +++ b/hw/virtio/vhost.c
>>>>> @@ -1021,6 +1021,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
>>>>>
>>>>>         trace_vhost_iotlb_miss(dev, 1);
>>>>>
>>>>> +    if (qatomic_load_acquire(&dev->shadow_vqs_enabled)) {
>>>>> +        uaddr = iova;
>>>>> +        len = 4096;
>>>>> +        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
>>>>> +                                                IOMMU_RW);
>>>>> +        if (ret) {
>>>>> +            trace_vhost_iotlb_miss(dev, 2);
>>>>> +            error_report("Fail to update device iotlb");
>>>>> +        }
>>>>> +
>>>>> +        return ret;
>>>>> +    }
>>>>> +
>>>>>         iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
>>>>>                                               iova, write,
>>>>>                                               MEMTXATTRS_UNSPECIFIED);
>>>>> @@ -1227,8 +1240,28 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
>>>>>         /* Can be read by vhost_virtqueue_mask, from vm exit */
>>>>>         qatomic_store_release(&dev->shadow_vqs_enabled, false);
>>>>>
>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
>>>>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
>>>>> +        error_report("Fail to invalidate device iotlb");
>>>>> +    }
>>>>> +
>>>>>         for (idx = 0; idx < dev->nvqs; ++idx) {
>>>>> +        /*
>>>>> +         * Update used ring information for IOTLB to work correctly,
>>>>> +         * vhost-kernel code requires for this.
>>>>> +         */
>>>>> +        struct vhost_virtqueue *vq = dev->vqs + idx;
>>>>> +        vhost_device_iotlb_miss(dev, vq->used_phys, true);
>>>>> +
>>>>>             vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
>>>>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
>>>>> +                              dev->vq_index + idx);
>>>>> +    }
>>>>> +
>>>>> +    /* Enable guest's vq vring */
>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>>>>> +
>>>>> +    for (idx = 0; idx < dev->nvqs; ++idx) {
>>>>>             vhost_shadow_vq_free(dev->shadow_vqs[idx]);
>>>>>         }
>>>>>
>>>>> @@ -1237,6 +1270,59 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
>>>>>         return 0;
>>>>>     }
>>>>>
>>>>> +/*
>>>>> + * Start shadow virtqueue in a given queue.
>>>>> + * In failure case, this function leaves queue working as regular vhost mode.
>>>>> + */
>>>>> +static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
>>>>> +                                             unsigned idx)
>>>>> +{
>>>>> +    struct vhost_vring_addr addr = {
>>>>> +        .index = idx,
>>>>> +    };
>>>>> +    struct vhost_vring_state s = {
>>>>> +        .index = idx,
>>>>> +    };
>>>>> +    int r;
>>>>> +    bool ok;
>>>>> +
>>>>> +    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
>>>>> +    ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
>>>>> +    if (unlikely(!ok)) {
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    /* From this point, vhost_virtqueue_start can reset these changes */
>>>>> +    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
>>>>> +    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
>>>>> +    if (unlikely(r != 0)) {
>>>>> +        VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
>>>>> +        goto err;
>>>>> +    }
>>>>> +
>>>>> +    r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
>>>>> +    if (unlikely(r != 0)) {
>>>>> +        VHOST_OPS_DEBUG("vhost_set_vring_base for shadow vq failed");
>>>>> +        goto err;
>>>>> +    }
>>>>> +
>>>>> +    /*
>>>>> +     * Update used ring information for IOTLB to work correctly,
>>>>> +     * vhost-kernel code requires for this.
>>>>> +     */
>>>>> +    r = vhost_device_iotlb_miss(dev, addr.used_user_addr, true);
>>>>> +    if (unlikely(r != 0)) {
>>>>> +        /* Debug message already printed */
>>>>> +        goto err;
>>>>> +    }
>>>>> +
>>>>> +    return true;
>>>>> +
>>>>> +err:
>>>>> +    vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
>>>>> +    return false;
>>>>> +}
>>>>> +
>>>>>     static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>>>>>     {
>>>>>         int idx, stop_idx;
>>>>> @@ -1249,24 +1335,35 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>>>>>             }
>>>>>         }
>>>>>
>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
>>>>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
>>>>> +        error_report("Fail to invalidate device iotlb");
>>>>> +    }
>>>>> +
>>>>>         /* Can be read by vhost_virtqueue_mask, from vm exit */
>>>>>         qatomic_store_release(&dev->shadow_vqs_enabled, true);
>>>>>         for (idx = 0; idx < dev->nvqs; ++idx) {
>>>>> -        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
>>>>> +        bool ok = vhost_sw_live_migration_start_vq(dev, idx);
>>>>>             if (unlikely(!ok)) {
>>>>>                 goto err_start;
>>>>>             }
>>>>>         }
>>>>>
>>>>> +    /* Enable shadow vq vring */
>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>>>>>         return 0;
>>>>>
>>>>>     err_start:
>>>>>         qatomic_store_release(&dev->shadow_vqs_enabled, false);
>>>>>         for (stop_idx = 0; stop_idx < idx; stop_idx++) {
>>>>>             vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
>>>>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
>>>>> +                              dev->vq_index + stop_idx);
>>>>>         }
>>>>>
>>>>>     err_new:
>>>>> +    /* Enable guest's vring */
>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>>>>>         for (idx = 0; idx < dev->nvqs; ++idx) {
>>>>>             vhost_shadow_vq_free(dev->shadow_vqs[idx]);
>>>>>         }
>>>>> @@ -1970,6 +2067,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>>>>>
>>>>>             if (!hdev->started) {
>>>>>                 err_cause = "Device is not started";
>>>>> +        } else if (!vhost_dev_has_iommu(hdev)) {
>>>>> +            err_cause = "Does not support iommu";
>>>>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
>>>>> +            err_cause = "Is packed";
>>>>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_RING_F_EVENT_IDX)) {
>>>>> +            err_cause = "Have event idx";
>>>>> +        } else if (hdev->acked_features &
>>>>> +                   BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC)) {
>>>>> +            err_cause = "Supports indirect descriptors";
>>>>> +        } else if (!hdev->vhost_ops->vhost_set_vring_enable) {
>>>>> +            err_cause = "Cannot pause device";
>>>>> +        }
>>>>> +
>>>>> +        if (err_cause) {
>>>>>                 goto err;
>>>>>             }
>>>>>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding
  2021-03-18  3:14           ` Jason Wang
@ 2021-03-18  8:06             ` Eugenio Perez Martin
  2021-03-18  9:16               ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-18  8:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: Rob Miller, Parav Pandit, Juan Quintela, Guru Prasad,
	Michael S. Tsirkin, Markus Armbruster, qemu-level,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Stefano Garzarella

On Thu, Mar 18, 2021 at 4:14 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/17 10:38 PM, Eugenio Perez Martin wrote:
> > On Wed, Mar 17, 2021 at 3:51 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/17 12:05 AM, Eugenio Perez Martin wrote:
> >>> On Tue, Mar 16, 2021 at 9:15 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> >>>>> Initial version of shadow virtqueue that actually forwards buffers.
> >>>>>
> >>>>> It reuses the VirtQueue code for the device part. The driver part is
> >>>>> based on Linux's virtio_ring driver, but with stripped functionality
> >>>>> and optimizations so it's easier to review.
> >>>>>
> >>>>> These will be added in later commits.
> >>>>>
> >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>> ---
> >>>>>     hw/virtio/vhost-shadow-virtqueue.c | 212 +++++++++++++++++++++++++++--
> >>>>>     hw/virtio/vhost.c                  | 113 ++++++++++++++-
> >>>>>     2 files changed, 312 insertions(+), 13 deletions(-)
> >>>>>
> >>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>>>> index 1460d1d5d1..68ed0f2740 100644
> >>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>>>> @@ -9,6 +9,7 @@
> >>>>>
> >>>>>     #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>>>     #include "hw/virtio/vhost.h"
> >>>>> +#include "hw/virtio/virtio-access.h"
> >>>>>
> >>>>>     #include "standard-headers/linux/vhost_types.h"
> >>>>>
> >>>>> @@ -55,11 +56,96 @@ typedef struct VhostShadowVirtqueue {
> >>>>>         /* Virtio device */
> >>>>>         VirtIODevice *vdev;
> >>>>>
> >>>>> +    /* Map for returning guest's descriptors */
> >>>>> +    VirtQueueElement **ring_id_maps;
> >>>>> +
> >>>>> +    /* Next head to expose to device */
> >>>>> +    uint16_t avail_idx_shadow;
> >>>>> +
> >>>>> +    /* Next free descriptor */
> >>>>> +    uint16_t free_head;
> >>>>> +
> >>>>> +    /* Last seen used idx */
> >>>>> +    uint16_t shadow_used_idx;
> >>>>> +
> >>>>> +    /* Next head to consume from device */
> >>>>> +    uint16_t used_idx;
> >>>>> +
> >>>>>         /* Descriptors copied from guest */
> >>>>>         vring_desc_t descs[];
> >>>>>     } VhostShadowVirtqueue;
> >>>>>
> >>>>> -/* Forward guest notifications */
> >>>>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >>>>> +                                    const struct iovec *iovec,
> >>>>> +                                    size_t num, bool more_descs, bool write)
> >>>>> +{
> >>>>> +    uint16_t i = svq->free_head, last = svq->free_head;
> >>>>> +    unsigned n;
> >>>>> +    uint16_t flags = write ? virtio_tswap16(svq->vdev, VRING_DESC_F_WRITE) : 0;
> >>>>> +    vring_desc_t *descs = svq->vring.desc;
> >>>>> +
> >>>>> +    if (num == 0) {
> >>>>> +        return;
> >>>>> +    }
> >>>>> +
> >>>>> +    for (n = 0; n < num; n++) {
> >>>>> +        if (more_descs || (n + 1 < num)) {
> >>>>> +            descs[i].flags = flags | virtio_tswap16(svq->vdev,
> >>>>> +                                                    VRING_DESC_F_NEXT);
> >>>>> +        } else {
> >>>>> +            descs[i].flags = flags;
> >>>>> +        }
> >>>>> +        descs[i].addr = virtio_tswap64(svq->vdev, (hwaddr)iovec[n].iov_base);
> >>>> So using virtio_tswap() is probably not correct since we're talking
> >>>> with vhost backends, which have their own endianness.
> >>>>
> >>> I was trying to expose the buffer with the same endianness as the
> >>> driver originally offered, so if guest->qemu requires a bswap, I think
> >>> it will always require a bswap again to expose it to the device.
> >>
> >> So assume vhost-vDPA always provides a non-transitional device[1].
> >>
> >> Then if Qemu presents a transitional device, we need to do the endian
> >> conversion when necessary; if Qemu presents a non-transitional device, we
> >> don't need to do that, the guest driver will do it for us.
> >>
> >> But it looks to me that virtio_tswap() can't be used for this, since it does:
> >>
> >> static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
> >> {
> >> #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
> >>       return virtio_is_big_endian(vdev);
> >> #elif defined(TARGET_WORDS_BIGENDIAN)
> >>       if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> >>           /* Devices conforming to VIRTIO 1.0 or later are always LE. */
> >>           return false;
> >>       }
> >>       return true;
> >> #else
> >>       return false;
> >> #endif
> >> }
> >>
> >> So if we present a legacy device on top of a non-transitional vDPA
> >> device, the VIRTIO_F_VERSION_1 check is wrong.
> >>
> >>
> >>>> For vhost-vDPA, we can assume that it's a 1.0 device.
> >>> Isn't it needed if the host is big endian?
> >>
> >> [1]
> >>
> >> So a non-transitional device always assumes little endian.
> >>
> >> For vhost-vDPA, we don't want to present a transitional device, which may
> >> end up with a lot of burdens.
> >>
> >> I suspect legacy drivers plus vhost-vDPA are already broken, so I plan to
> >> mandate VERSION_1 for all vDPA devices.
> >>
> > Right. I think that's the best then.
> >
> > However, then we will need a similar method to always expose
> > address/length as little endian, won't we?
>
>
> Yes.
>
>
> >
> >>>>> +        descs[i].len = virtio_tswap32(svq->vdev, iovec[n].iov_len);
> >>>>> +
> >>>>> +        last = i;
> >>>>> +        i = virtio_tswap16(svq->vdev, descs[i].next);
> >>>>> +    }
> >>>>> +
> >>>>> +    svq->free_head = virtio_tswap16(svq->vdev, descs[last].next);
> >>>>> +}
> >>>>> +
> >>>>> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
> >>>>> +                                          VirtQueueElement *elem)
> >>>>> +{
> >>>>> +    int head;
> >>>>> +    unsigned avail_idx;
> >>>>> +    vring_avail_t *avail = svq->vring.avail;
> >>>>> +
> >>>>> +    head = svq->free_head;
> >>>>> +
> >>>>> +    /* We need some descriptors here */
> >>>>> +    assert(elem->out_num || elem->in_num);
> >>>>> +
> >>>>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> >>>>> +                            elem->in_num > 0, false);
> >>>>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> >>>>> +
> >>>>> +    /*
> >>>>> +     * Put entry in available array (but don't update avail->idx until they
> >>>>> +     * do sync).
> >>>>> +     */
> >>>>> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> >>>>> +    avail->ring[avail_idx] = virtio_tswap16(svq->vdev, head);
> >>>>> +    svq->avail_idx_shadow++;
> >>>>> +
> >>>>> +    /* Expose descriptors to device */
> >>>>> +    smp_wmb();
> >>>>> +    avail->idx = virtio_tswap16(svq->vdev, svq->avail_idx_shadow);
> >>>>> +
> >>>>> +    return head;
> >>>>> +
> >>>>> +}
> >>>>> +
> >>>>> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
> >>>>> +                                VirtQueueElement *elem)
> >>>>> +{
> >>>>> +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
> >>>>> +
> >>>>> +    svq->ring_id_maps[qemu_head] = elem;
> >>>>> +}
> >>>>> +
> >>>>> +/* Handle guest->device notifications */
> >>>>>     static void vhost_handle_guest_kick(EventNotifier *n)
> >>>>>     {
> >>>>>         VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>>>> @@ -69,7 +155,72 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >>>>>             return;
> >>>>>         }
> >>>>>
> >>>>> -    event_notifier_set(&svq->kick_notifier);
> >>>>> +    /* Make available as many buffers as possible */
> >>>>> +    do {
> >>>>> +        if (virtio_queue_get_notification(svq->vq)) {
> >>>>> +            /* No more notifications until process all available */
> >>>>> +            virtio_queue_set_notification(svq->vq, false);
> >>>>> +        }
> >>>>> +
> >>>>> +        while (true) {
> >>>>> +            VirtQueueElement *elem;
> >>>>> +            if (virtio_queue_full(svq->vq)) {
> >>>>> +                break;
> >>>> So we've disabled guest notification. If a buffer has been consumed, we
> >>>> need to retry handle_guest_kick here. But I didn't find the code?
> >>>>
> >>> This code follows the pattern of virtio_blk_handle_vq: we jump out of
> >>> the inner while, and we re-enable the notifications. After that, we
> >>> check for updates on guest avail_idx.
> >>
> >> Ok, but this will end up with a lot of unnecessary kicks without event
> >> index.
> >>
> > I can move the kick out of the inner loop, but that could add latency.
>
>
> So I think the right way is to disable the notification until some
> buffers are consumed by the used ring.
>

I'm not sure if you mean:

a) To limit the maximum number of buffers that can be available in the
shadow virtqueue at the same time.

As I can see, the easiest way to do this would be to unregister
vhost_handle_guest_kick from the event loop and let
vhost_shadow_vq_handle_call re-register it at some threshold of
available buffers.

I'm not sure what this limit should be, but it seems wasteful to me
not to fill the shadow virtqueue naturally.

b) To limit the number of buffers that vhost_handle_guest_kick
forwards to the shadow virtqueue in one call.

This already has a natural limit of the queue size, since the buffers
will not be consumed (as forwarded-to-guest) by qemu while this
function is running. This limit could be reduced, and
vhost_handle_guest_kick could re-enqueue itself if it's not reached.
Same as the previous one, I'm not sure what the right limit is, but
vhost_handle_guest_kick will not make available more than the queue size.

c) To kick every N buffers made available, instead of N=1.

I think this is not the solution you are proposing, but maybe it is
simpler than the previous ones.
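
To make option a) concrete, a hedged sketch on top of this patch's
handlers; svq_num_free() and the one-quarter threshold are invented for
illustration:

    /* Stop listening to guest kicks while the shadow vring is full */
    static void svq_pause_guest_kicks(VhostShadowVirtqueue *svq)
    {
        event_notifier_set_handler(&svq->host_notifier, NULL);
    }

    /* Re-register once enough buffers have been consumed again */
    static void svq_maybe_resume_guest_kicks(VhostShadowVirtqueue *svq)
    {
        if (svq_num_free(svq) >= svq->vring.num / 4) {
            event_notifier_set_handler(&svq->host_notifier,
                                       vhost_handle_guest_kick);
            /* Catch up with kicks that arrived while unregistered */
            event_notifier_set(&svq->host_notifier);
        }
    }

vhost_handle_guest_kick would call the first helper when
virtio_queue_full() hits, and vhost_shadow_vq_handle_call the second one
after virtqueue_flush().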

>
> >
> >>>>> +            }
> >>>>> +
> >>>>> +            elem = virtqueue_pop(svq->vq, sizeof(*elem));
> >>>>> +            if (!elem) {
> >>>>> +                break;
> >>>>> +            }
> >>>>> +
> >>>>> +            vhost_shadow_vq_add(svq, elem);
> >>>>> +            event_notifier_set(&svq->kick_notifier);
> >>>>> +        }
> >>>>> +
> >>>>> +        virtio_queue_set_notification(svq->vq, true);
> >>>>> +    } while (!virtio_queue_empty(svq->vq));
> >>>>> +}
> >>>>> +
> >>>>> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
> >>>>> +{
> >>>>> +    if (svq->used_idx != svq->shadow_used_idx) {
> >>>>> +        return true;
> >>>>> +    }
> >>>>> +
> >>>>> +    /* Get used idx must not be reordered */
> >>>>> +    smp_rmb();
> >>>>> +    svq->shadow_used_idx = virtio_tswap16(svq->vdev, svq->vring.used->idx);
> >>>>> +
> >>>>> +    return svq->used_idx != svq->shadow_used_idx;
> >>>>> +}
> >>>>> +
> >>>>> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
> >>>>> +{
> >>>>> +    vring_desc_t *descs = svq->vring.desc;
> >>>>> +    const vring_used_t *used = svq->vring.used;
> >>>>> +    vring_used_elem_t used_elem;
> >>>>> +    uint16_t last_used;
> >>>>> +
> >>>>> +    if (!vhost_shadow_vq_more_used(svq)) {
> >>>>> +        return NULL;
> >>>>> +    }
> >>>>> +
> >>>>> +    last_used = svq->used_idx & (svq->vring.num - 1);
> >>>>> +    used_elem.id = virtio_tswap32(svq->vdev, used->ring[last_used].id);
> >>>>> +    used_elem.len = virtio_tswap32(svq->vdev, used->ring[last_used].len);
> >>>>> +
> >>>>> +    if (unlikely(used_elem.id >= svq->vring.num)) {
> >>>>> +        error_report("Device %s says index %u is available", svq->vdev->name,
> >>>>> +                     used_elem.id);
> >>>>> +        return NULL;
> >>>>> +    }
> >>>>> +
> >>>>> +    descs[used_elem.id].next = svq->free_head;
> >>>>> +    svq->free_head = used_elem.id;
> >>>>> +
> >>>>> +    svq->used_idx++;
> >>>>> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> >>>>> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> >>>>>     }
> >>>>>
> >>>>>     /* Forward vhost notifications */
> >>>>> @@ -78,6 +229,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >>>>>         VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>>>>                                                  call_notifier);
> >>>>>         EventNotifier *masked_notifier;
> >>>>> +    VirtQueue *vq = svq->vq;
> >>>>>
> >>>>>         /* Signal start of using masked notifier */
> >>>>>         qemu_event_reset(&svq->masked_notifier.is_free);
> >>>>> @@ -86,14 +238,29 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >>>>>             qemu_event_set(&svq->masked_notifier.is_free);
> >>>>>         }
> >>>>>
> >>>>> -    if (!masked_notifier) {
> >>>>> -        unsigned n = virtio_get_queue_index(svq->vq);
> >>>>> -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
> >>>>> -        virtio_notify_irqfd(svq->vdev, svq->vq);
> >>>>> -    } else if (!svq->masked_notifier.signaled) {
> >>>>> -        svq->masked_notifier.signaled = true;
> >>>>> -        event_notifier_set(svq->masked_notifier.n);
> >>>>> -    }
> >>>>> +    /* Make as many buffers as possible used. */
> >>>>> +    do {
> >>>>> +        unsigned i = 0;
> >>>>> +
> >>>>> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> >>>>> +        while (true) {
> >>>>> +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
> >>>>> +            if (!elem) {
> >>>>> +                break;
> >>>>> +            }
> >>>>> +
> >>>>> +            assert(i < svq->vring.num);
> >>>>> +            virtqueue_fill(vq, elem, elem->len, i++);
> >>>>> +        }
> >>>>> +
> >>>>> +        virtqueue_flush(vq, i);
> >>>>> +        if (!masked_notifier) {
> >>>>> +            virtio_notify_irqfd(svq->vdev, svq->vq);
> >>>>> +        } else if (!svq->masked_notifier.signaled) {
> >>>>> +            svq->masked_notifier.signaled = true;
> >>>>> +            event_notifier_set(svq->masked_notifier.n);
> >>>>> +        }
> >>>>> +    } while (vhost_shadow_vq_more_used(svq));
> >>>>>
> >>>>>         if (masked_notifier) {
> >>>>>             /* Signal not using it anymore */
> >>>>> @@ -103,7 +270,6 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >>>>>
> >>>>>     static void vhost_shadow_vq_handle_call(EventNotifier *n)
> >>>>>     {
> >>>>> -
> >>>>>         if (likely(event_notifier_test_and_clear(n))) {
> >>>>>             vhost_shadow_vq_handle_call_no_test(n);
> >>>>>         }
> >>>>> @@ -254,7 +420,11 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >>>>>                               unsigned idx,
> >>>>>                               VhostShadowVirtqueue *svq)
> >>>>>     {
> >>>>> +    int i;
> >>>>>         int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
> >>>>> +
> >>>>> +    assert(!dev->shadow_vqs_enabled);
> >>>>> +
> >>>>>         if (unlikely(r < 0)) {
> >>>>>             error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> >>>>>         }
> >>>>> @@ -272,6 +442,18 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >>>>>         /* Restore vhost call */
> >>>>>         vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
> >>>>>                              dev->vqs[idx].notifier_is_masked);
> >>>>> +
> >>>>> +
> >>>>> +    for (i = 0; i < svq->vring.num; ++i) {
> >>>>> +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> >>>>> +        /*
> >>>>> +         * Although the doc says we must unpop in order, it's ok to unpop
> >>>>> +         * everything.
> >>>>> +         */
> >>>>> +        if (elem) {
> >>>>> +            virtqueue_unpop(svq->vq, elem, elem->len);
> >>>> Shouldn't we wait until all pending requests are drained? Or
> >>>> we may end up with duplicated requests?
> >>>>
> >>> Do you mean pending as in-flight/processing in the device? The device
> >>> must be paused at this point.
> >>
> >> Ok. I see there's a vhost_set_vring_enable(dev, false) in
> >> vhost_sw_live_migration_start().
> >>
> >>
> >>> Currently there is no assertion for
> >>> this, maybe we can track the device status for it.
> >>>
> >>> For the queue handlers to be running at this point, the main event
> >>> loop should serialize QMP and handlers as far as I know (and they
> >>> would make all state inconsistent if the device stops suddenly). It
> >>> would need to be synchronized if the handlers run in their own AIO
> >>> context. That would be nice to have but it's not included here.
> >>
> >> That's why I suggest just dropping the QMP stuff and using cli parameters
> >> to enable the shadow virtqueue. Things would be greatly simplified, I guess.
> >>
> > I can send a series without it, but SVQ will need to be able to kick
> > in dynamically sooner or later if we want to use it for live
> > migration.
>
>
> I'm not sure I get the issue here. My understanding is everything will be
> processed in the same AIO context.
>

What I meant is that QMP allows us to activate the shadow virtqueue
mode at any moment, similar to how live migration would activate it.
Enabling SVQ from the command line would imply that it runs the same
way for the whole time qemu runs.

If we do it that way, we don't need more synchronization, since we have
deleted the event that could run concurrently with the masking. But
this synchronization will be needed if we want to enable SVQ
dynamically for live migration, so we are "just" delaying work.

However, if we add the vdpa iova range to this patch series, I think it
would be a good idea to delay that synchronization work to future
series, so they are smaller and the first one can be tested better.

> Thanks
>
>
> >
> >> Thanks
> >>
> >>
> >>>> Thanks
> >>>>
> >>>>
> >>>>> +        }
> >>>>> +    }
> >>>>>     }
> >>>>>
> >>>>>     /*
> >>>>> @@ -284,7 +466,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >>>>>         unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> >>>>>         size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
> >>>>>         g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
> >>>>> -    int r;
> >>>>> +    int r, i;
> >>>>>
> >>>>>         r = event_notifier_init(&svq->kick_notifier, 0);
> >>>>>         if (r != 0) {
> >>>>> @@ -303,6 +485,11 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >>>>>         vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
> >>>>>         svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> >>>>>         svq->vdev = dev->vdev;
> >>>>> +    for (i = 0; i < num - 1; i++) {
> >>>>> +        svq->descs[i].next = virtio_tswap16(dev->vdev, i + 1);
> >>>>> +    }
> >>>>> +
> >>>>> +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> >>>>>         event_notifier_set_handler(&svq->call_notifier,
> >>>>>                                    vhost_shadow_vq_handle_call);
> >>>>>         qemu_event_init(&svq->masked_notifier.is_free, true);
> >>>>> @@ -324,5 +511,6 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
> >>>>>         event_notifier_cleanup(&vq->kick_notifier);
> >>>>>         event_notifier_set_handler(&vq->call_notifier, NULL);
> >>>>>         event_notifier_cleanup(&vq->call_notifier);
> >>>>> +    g_free(vq->ring_id_maps);
> >>>>>         g_free(vq);
> >>>>>     }
> >>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >>>>> index eab3e334f2..a373999bc4 100644
> >>>>> --- a/hw/virtio/vhost.c
> >>>>> +++ b/hw/virtio/vhost.c
> >>>>> @@ -1021,6 +1021,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
> >>>>>
> >>>>>         trace_vhost_iotlb_miss(dev, 1);
> >>>>>
> >>>>> +    if (qatomic_load_acquire(&dev->shadow_vqs_enabled)) {
> >>>>> +        uaddr = iova;
> >>>>> +        len = 4096;
> >>>>> +        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
> >>>>> +                                                IOMMU_RW);
> >>>>> +        if (ret) {
> >>>>> +            trace_vhost_iotlb_miss(dev, 2);
> >>>>> +            error_report("Fail to update device iotlb");
> >>>>> +        }
> >>>>> +
> >>>>> +        return ret;
> >>>>> +    }
> >>>>> +
> >>>>>         iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
> >>>>>                                               iova, write,
> >>>>>                                               MEMTXATTRS_UNSPECIFIED);
> >>>>> @@ -1227,8 +1240,28 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> >>>>>         /* Can be read by vhost_virtqueue_mask, from vm exit */
> >>>>>         qatomic_store_release(&dev->shadow_vqs_enabled, false);
> >>>>>
> >>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
> >>>>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
> >>>>> +        error_report("Fail to invalidate device iotlb");
> >>>>> +    }
> >>>>> +
> >>>>>         for (idx = 0; idx < dev->nvqs; ++idx) {
> >>>>> +        /*
> >>>>> +         * Update used ring information for IOTLB to work correctly,
> >>>>> +         * vhost-kernel code requires for this.
> >>>>> +         */
> >>>>> +        struct vhost_virtqueue *vq = dev->vqs + idx;
> >>>>> +        vhost_device_iotlb_miss(dev, vq->used_phys, true);
> >>>>> +
> >>>>>             vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
> >>>>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
> >>>>> +                              dev->vq_index + idx);
> >>>>> +    }
> >>>>> +
> >>>>> +    /* Enable guest's vq vring */
> >>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >>>>> +
> >>>>> +    for (idx = 0; idx < dev->nvqs; ++idx) {
> >>>>>             vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> >>>>>         }
> >>>>>
> >>>>> @@ -1237,6 +1270,59 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> >>>>>         return 0;
> >>>>>     }
> >>>>>
> >>>>> +/*
> >>>>> + * Start shadow virtqueue in a given queue.
> >>>>> + * In failure case, this function leaves queue working as regular vhost mode.
> >>>>> + */
> >>>>> +static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
> >>>>> +                                             unsigned idx)
> >>>>> +{
> >>>>> +    struct vhost_vring_addr addr = {
> >>>>> +        .index = idx,
> >>>>> +    };
> >>>>> +    struct vhost_vring_state s = {
> >>>>> +        .index = idx,
> >>>>> +    };
> >>>>> +    int r;
> >>>>> +    bool ok;
> >>>>> +
> >>>>> +    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
> >>>>> +    ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> >>>>> +    if (unlikely(!ok)) {
> >>>>> +        return false;
> >>>>> +    }
> >>>>> +
> >>>>> +    /* From this point, vhost_virtqueue_start can reset these changes */
> >>>>> +    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
> >>>>> +    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
> >>>>> +    if (unlikely(r != 0)) {
> >>>>> +        VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
> >>>>> +        goto err;
> >>>>> +    }
> >>>>> +
> >>>>> +    r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
> >>>>> +    if (unlikely(r != 0)) {
> >>>>> +        VHOST_OPS_DEBUG("vhost_set_vring_base for shadow vq failed");
> >>>>> +        goto err;
> >>>>> +    }
> >>>>> +
> >>>>> +    /*
> >>>>> +     * Update used ring information for IOTLB to work correctly,
> >>>>> +     * vhost-kernel code requires for this.
> >>>>> +     */
> >>>>> +    r = vhost_device_iotlb_miss(dev, addr.used_user_addr, true);
> >>>>> +    if (unlikely(r != 0)) {
> >>>>> +        /* Debug message already printed */
> >>>>> +        goto err;
> >>>>> +    }
> >>>>> +
> >>>>> +    return true;
> >>>>> +
> >>>>> +err:
> >>>>> +    vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
> >>>>> +    return false;
> >>>>> +}
> >>>>> +
> >>>>>     static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> >>>>>     {
> >>>>>         int idx, stop_idx;
> >>>>> @@ -1249,24 +1335,35 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> >>>>>             }
> >>>>>         }
> >>>>>
> >>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
> >>>>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
> >>>>> +        error_report("Fail to invalidate device iotlb");
> >>>>> +    }
> >>>>> +
> >>>>>         /* Can be read by vhost_virtqueue_mask, from vm exit */
> >>>>>         qatomic_store_release(&dev->shadow_vqs_enabled, true);
> >>>>>         for (idx = 0; idx < dev->nvqs; ++idx) {
> >>>>> -        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> >>>>> +        bool ok = vhost_sw_live_migration_start_vq(dev, idx);
> >>>>>             if (unlikely(!ok)) {
> >>>>>                 goto err_start;
> >>>>>             }
> >>>>>         }
> >>>>>
> >>>>> +    /* Enable shadow vq vring */
> >>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >>>>>         return 0;
> >>>>>
> >>>>>     err_start:
> >>>>>         qatomic_store_release(&dev->shadow_vqs_enabled, false);
> >>>>>         for (stop_idx = 0; stop_idx < idx; stop_idx++) {
> >>>>>             vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
> >>>>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
> >>>>> +                              dev->vq_index + stop_idx);
> >>>>>         }
> >>>>>
> >>>>>     err_new:
> >>>>> +    /* Enable guest's vring */
> >>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >>>>>         for (idx = 0; idx < dev->nvqs; ++idx) {
> >>>>>             vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> >>>>>         }
> >>>>> @@ -1970,6 +2067,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >>>>>
> >>>>>             if (!hdev->started) {
> >>>>>                 err_cause = "Device is not started";
> >>>>> +        } else if (!vhost_dev_has_iommu(hdev)) {
> >>>>> +            err_cause = "Does not support iommu";
> >>>>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
> >>>>> +            err_cause = "Is packed";
> >>>>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_RING_F_EVENT_IDX)) {
> >>>>> +            err_cause = "Have event idx";
> >>>>> +        } else if (hdev->acked_features &
> >>>>> +                   BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC)) {
> >>>>> +            err_cause = "Supports indirect descriptors";
> >>>>> +        } else if (!hdev->vhost_ops->vhost_set_vring_enable) {
> >>>>> +            err_cause = "Cannot pause device";
> >>>>> +        }
> >>>>> +
> >>>>> +        if (err_cause) {
> >>>>>                 goto err;
> >>>>>             }
> >>>>>
>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding
  2021-03-18  8:06             ` Eugenio Perez Martin
@ 2021-03-18  9:16               ` Jason Wang
  2021-03-18  9:54                 ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-18  9:16 UTC (permalink / raw)
  To: qemu-devel


On 2021/3/18 4:06 PM, Eugenio Perez Martin wrote:
> On Thu, Mar 18, 2021 at 4:14 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/17 10:38 PM, Eugenio Perez Martin wrote:
>>> On Wed, Mar 17, 2021 at 3:51 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/17 12:05 AM, Eugenio Perez Martin wrote:
>>>>> On Tue, Mar 16, 2021 at 9:15 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
>>>>>>> Initial version of shadow virtqueue that actually forwards buffers.
>>>>>>>
>>>>>>> It reuses the VirtQueue code for the device part. The driver part is
>>>>>>> based on Linux's virtio_ring driver, but with stripped functionality
>>>>>>> and optimizations so it's easier to review.
>>>>>>>
>>>>>>> These will be added in later commits.
>>>>>>>
>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>> ---
>>>>>>>      hw/virtio/vhost-shadow-virtqueue.c | 212 +++++++++++++++++++++++++++--
>>>>>>>      hw/virtio/vhost.c                  | 113 ++++++++++++++-
>>>>>>>      2 files changed, 312 insertions(+), 13 deletions(-)
>>>>>>>
>>>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>>>>>> index 1460d1d5d1..68ed0f2740 100644
>>>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>>>>>> @@ -9,6 +9,7 @@
>>>>>>>
>>>>>>>      #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>>>>>      #include "hw/virtio/vhost.h"
>>>>>>> +#include "hw/virtio/virtio-access.h"
>>>>>>>
>>>>>>>      #include "standard-headers/linux/vhost_types.h"
>>>>>>>
>>>>>>> @@ -55,11 +56,96 @@ typedef struct VhostShadowVirtqueue {
>>>>>>>          /* Virtio device */
>>>>>>>          VirtIODevice *vdev;
>>>>>>>
>>>>>>> +    /* Map for returning guest's descriptors */
>>>>>>> +    VirtQueueElement **ring_id_maps;
>>>>>>> +
>>>>>>> +    /* Next head to expose to device */
>>>>>>> +    uint16_t avail_idx_shadow;
>>>>>>> +
>>>>>>> +    /* Next free descriptor */
>>>>>>> +    uint16_t free_head;
>>>>>>> +
>>>>>>> +    /* Last seen used idx */
>>>>>>> +    uint16_t shadow_used_idx;
>>>>>>> +
>>>>>>> +    /* Next head to consume from device */
>>>>>>> +    uint16_t used_idx;
>>>>>>> +
>>>>>>>          /* Descriptors copied from guest */
>>>>>>>          vring_desc_t descs[];
>>>>>>>      } VhostShadowVirtqueue;
>>>>>>>
>>>>>>> -/* Forward guest notifications */
>>>>>>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>>>>>> +                                    const struct iovec *iovec,
>>>>>>> +                                    size_t num, bool more_descs, bool write)
>>>>>>> +{
>>>>>>> +    uint16_t i = svq->free_head, last = svq->free_head;
>>>>>>> +    unsigned n;
>>>>>>> +    uint16_t flags = write ? virtio_tswap16(svq->vdev, VRING_DESC_F_WRITE) : 0;
>>>>>>> +    vring_desc_t *descs = svq->vring.desc;
>>>>>>> +
>>>>>>> +    if (num == 0) {
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    for (n = 0; n < num; n++) {
>>>>>>> +        if (more_descs || (n + 1 < num)) {
>>>>>>> +            descs[i].flags = flags | virtio_tswap16(svq->vdev,
>>>>>>> +                                                    VRING_DESC_F_NEXT);
>>>>>>> +        } else {
>>>>>>> +            descs[i].flags = flags;
>>>>>>> +        }
>>>>>>> +        descs[i].addr = virtio_tswap64(svq->vdev, (hwaddr)iovec[n].iov_base);
>>>>>> So using virtio_tswap() is probably not correct, since we're talking
>>>>>> with vhost backends which have their own endianness.
>>>>>>
>>>>> I was trying to expose the buffer with the same endianness as the
>>>>> driver originally offered, so if guest->qemu requires a bswap, I think
>>>>> it will always require a bswap again to expose it to the device.
>>>> So this assumes vhost-vDPA always provides a non-transitional device [1].
>>>>
>>>> Then if Qemu presents a transitional device, we need to do the endian
>>>> conversion when necessary; if Qemu presents a non-transitional device, we
>>>> don't need to do that, the guest driver will do it for us.
>>>>
>>>> But it looks to me like virtio_tswap() can't be used for this, since it:
>>>>
>>>> static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
>>>> {
>>>> #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
>>>>        return virtio_is_big_endian(vdev);
>>>> #elif defined(TARGET_WORDS_BIGENDIAN)
>>>>        if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
>>>>            /* Devices conforming to VIRTIO 1.0 or later are always LE. */
>>>>            return false;
>>>>        }
>>>>        return true;
>>>> #else
>>>>        return false;
>>>> #endif
>>>> }
>>>>
>>>> So if we present a legacy device on top of a non-transitional vDPA
>>>> device, the VIRTIO_F_VERSION_1 check is wrong.
>>>>
>>>>
>>>>>> For vhost-vDPA, we can assume that it's a 1.0 device.
>>>>> Isn't it needed if the host is big endian?
>>>> [1]
>>>>
>>>> So a non-transitional device always assumes little endian.
>>>>
>>>> For vhost-vDPA, we don't want to present a transitional device, which may
>>>> end up with a lot of burdens.
>>>>
>>>> I suspect the legacy driver plus vhost-vDPA is already broken, so I plan
>>>> to mandate VERSION_1 for all vDPA devices.
>>>>
>>> Right. I think it's the best then.
>>>
>>> However, then we will need a similar method to always expose
>>> address/length as little endian, won't we?
>>
>> Yes.
>>
>>
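
To show what I mean: assuming we mandate VERSION_1 for vhost-vDPA, the
shadow ring is always little endian, so the fixed-endian helpers from
"qemu/bswap.h" could replace virtio_tswap*() in the descriptor writes.
Untested sketch:

    /* Under VIRTIO_F_VERSION_1 the shadow ring is little endian,
     * regardless of guest endianness, so no per-vdev swap is needed.
     */
    descs[i].addr = cpu_to_le64((hwaddr)(uintptr_t)iovec[n].iov_base);
    descs[i].len = cpu_to_le32(iovec[n].iov_len);
    descs[i].flags = cpu_to_le16(flags);

    avail->ring[avail_idx] = cpu_to_le16(head);
    smp_wmb();
    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
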
>>>>>>> +        descs[i].len = virtio_tswap32(svq->vdev, iovec[n].iov_len);
>>>>>>> +
>>>>>>> +        last = i;
>>>>>>> +        i = virtio_tswap16(svq->vdev, descs[i].next);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    svq->free_head = virtio_tswap16(svq->vdev, descs[last].next);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
>>>>>>> +                                          VirtQueueElement *elem)
>>>>>>> +{
>>>>>>> +    int head;
>>>>>>> +    unsigned avail_idx;
>>>>>>> +    vring_avail_t *avail = svq->vring.avail;
>>>>>>> +
>>>>>>> +    head = svq->free_head;
>>>>>>> +
>>>>>>> +    /* We need some descriptors here */
>>>>>>> +    assert(elem->out_num || elem->in_num);
>>>>>>> +
>>>>>>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>>>>>>> +                            elem->in_num > 0, false);
>>>>>>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * Put entry in available array (but don't update avail->idx until they
>>>>>>> +     * do sync).
>>>>>>> +     */
>>>>>>> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
>>>>>>> +    avail->ring[avail_idx] = virtio_tswap16(svq->vdev, head);
>>>>>>> +    svq->avail_idx_shadow++;
>>>>>>> +
>>>>>>> +    /* Expose descriptors to device */
>>>>>>> +    smp_wmb();
>>>>>>> +    avail->idx = virtio_tswap16(svq->vdev, svq->avail_idx_shadow);
>>>>>>> +
>>>>>>> +    return head;
>>>>>>> +
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
>>>>>>> +                                VirtQueueElement *elem)
>>>>>>> +{
>>>>>>> +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
>>>>>>> +
>>>>>>> +    svq->ring_id_maps[qemu_head] = elem;
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* Handle guest->device notifications */
>>>>>>>      static void vhost_handle_guest_kick(EventNotifier *n)
>>>>>>>      {
>>>>>>>          VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>>>>>> @@ -69,7 +155,72 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>>>>>>>              return;
>>>>>>>          }
>>>>>>>
>>>>>>> -    event_notifier_set(&svq->kick_notifier);
>>>>>>> +    /* Make available as many buffers as possible */
>>>>>>> +    do {
>>>>>>> +        if (virtio_queue_get_notification(svq->vq)) {
>>>>>>> +            /* No more notifications until process all available */
>>>>>>> +            virtio_queue_set_notification(svq->vq, false);
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        while (true) {
>>>>>>> +            VirtQueueElement *elem;
>>>>>>> +            if (virtio_queue_full(svq->vq)) {
>>>>>>> +                break;
>>>>>> So we've disabled guest notification. If a buffer has been consumed, we
>>>>>> need to retry handle_guest_kick here. But I didn't find the code?
>>>>>>
>>>>> This code follows the pattern of virtio_blk_handle_vq: we jump out of
>>>>> the inner while, and we re-enable the notifications. After that, we
>>>>> check for updates on guest avail_idx.
>>>> Ok, but this will end up with a lot of unnecessary kicks without event
>>>> index.
>>>>
>>> I can move the kick out of the inner loop, but that could add latency.
>>
>> So I think the right way is to disable the notification until some
>> buffers are consumed by the used ring.
>>
> I'm not sure if you mean:
>
> a) To limit to the maximum amount of buffers that can be available in
> Shadow Virtqueue at the same time.
>
> As far as I can see, the easiest way to do this would be to unregister
> vhost_handle_guest_kick from the event loop and let
> vhost_shadow_vq_handle_call re-register it at some threshold of
> available buffers.
>
> I'm not sure what this limit should be, but it seems wasteful to me
> not to fill the shadow virtqueue naturally.


Yes, and I'm not sure how much we could gain from this extra complexity.


>
> b) To limit the amount of buffers that vhost_handle_guest_kick
> forwards to shadow virtqueue in one call.
>
> This already has a natural limit of the queue size, since the buffers
> will not be consumed (as forwarded-to-guest) by qemu while this
> function is running. This limit could be reduced, and
> vhost_handle_guest_kick could re-enqueue itself if it's not reached.
> As before, I'm not sure what the right limit is, but
> vhost_handle_guest_kick will not make available more than the queue size.


Yes, so using the queue size is how the code works currently, and it
should be fine if we know svq and vq are the same size. We can leave the
kick notification for the future (I guess at least for a networking
device, hitting virtio_queue_full() should be rare).

It will be a real issue if svq and vq don't have the same size, but
we can also leave this for the future.
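
If we want to be defensive about that in the meantime, a guard at SVQ
creation time would do. Just a sketch:

    /* Assumed invariant for now: shadow vring and guest vring sizes match */
    assert(virtio_queue_get_num(dev->vdev, vq_idx) == svq->vring.num);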


>
> c) To kick every N buffers made available, instead of N=1.
>
> I think this is not the solution you are proposing, but maybe it is
> simpler than the previous ones.
>
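Just to illustrate what (c) could look like (untested sketch, moving the
kick out of the inner loop so one notification covers a whole batch):

    unsigned n = 0;

    while (true) {
        VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
        if (!elem) {
            break;
        }

        vhost_shadow_vq_add(svq, elem);
        n++;    /* count buffers instead of kicking per buffer */
    }

    if (n > 0) {
        /* One kick for the whole batch */
        event_notifier_set(&svq->kick_notifier);
    }
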
>>>>>>> +            }
>>>>>>> +
>>>>>>> +            elem = virtqueue_pop(svq->vq, sizeof(*elem));
>>>>>>> +            if (!elem) {
>>>>>>> +                break;
>>>>>>> +            }
>>>>>>> +
>>>>>>> +            vhost_shadow_vq_add(svq, elem);
>>>>>>> +            event_notifier_set(&svq->kick_notifier);
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        virtio_queue_set_notification(svq->vq, true);
>>>>>>> +    } while (!virtio_queue_empty(svq->vq));
>>>>>>> +}
>>>>>>> +
>>>>>>> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
>>>>>>> +{
>>>>>>> +    if (svq->used_idx != svq->shadow_used_idx) {
>>>>>>> +        return true;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    /* Get used idx must not be reordered */
>>>>>>> +    smp_rmb();
>>>>>>> +    svq->shadow_used_idx = virtio_tswap16(svq->vdev, svq->vring.used->idx);
>>>>>>> +
>>>>>>> +    return svq->used_idx != svq->shadow_used_idx;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
>>>>>>> +{
>>>>>>> +    vring_desc_t *descs = svq->vring.desc;
>>>>>>> +    const vring_used_t *used = svq->vring.used;
>>>>>>> +    vring_used_elem_t used_elem;
>>>>>>> +    uint16_t last_used;
>>>>>>> +
>>>>>>> +    if (!vhost_shadow_vq_more_used(svq)) {
>>>>>>> +        return NULL;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    last_used = svq->used_idx & (svq->vring.num - 1);
>>>>>>> +    used_elem.id = virtio_tswap32(svq->vdev, used->ring[last_used].id);
>>>>>>> +    used_elem.len = virtio_tswap32(svq->vdev, used->ring[last_used].len);
>>>>>>> +
>>>>>>> +    if (unlikely(used_elem.id >= svq->vring.num)) {
>>>>>>> +        error_report("Device %s says index %u is available", svq->vdev->name,
>>>>>>> +                     used_elem.id);
>>>>>>> +        return NULL;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    descs[used_elem.id].next = svq->free_head;
>>>>>>> +    svq->free_head = used_elem.id;
>>>>>>> +
>>>>>>> +    svq->used_idx++;
>>>>>>> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
>>>>>>> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
>>>>>>>      }
>>>>>>>
>>>>>>>      /* Forward vhost notifications */
>>>>>>> @@ -78,6 +229,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>>>>>>          VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>>>>>>                                                   call_notifier);
>>>>>>>          EventNotifier *masked_notifier;
>>>>>>> +    VirtQueue *vq = svq->vq;
>>>>>>>
>>>>>>>          /* Signal start of using masked notifier */
>>>>>>>          qemu_event_reset(&svq->masked_notifier.is_free);
>>>>>>> @@ -86,14 +238,29 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>>>>>>              qemu_event_set(&svq->masked_notifier.is_free);
>>>>>>>          }
>>>>>>>
>>>>>>> -    if (!masked_notifier) {
>>>>>>> -        unsigned n = virtio_get_queue_index(svq->vq);
>>>>>>> -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
>>>>>>> -        virtio_notify_irqfd(svq->vdev, svq->vq);
>>>>>>> -    } else if (!svq->masked_notifier.signaled) {
>>>>>>> -        svq->masked_notifier.signaled = true;
>>>>>>> -        event_notifier_set(svq->masked_notifier.n);
>>>>>>> -    }
>>>>>>> +    /* Make as many buffers as possible used. */
>>>>>>> +    do {
>>>>>>> +        unsigned i = 0;
>>>>>>> +
>>>>>>> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
>>>>>>> +        while (true) {
>>>>>>> +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
>>>>>>> +            if (!elem) {
>>>>>>> +                break;
>>>>>>> +            }
>>>>>>> +
>>>>>>> +            assert(i < svq->vring.num);
>>>>>>> +            virtqueue_fill(vq, elem, elem->len, i++);
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        virtqueue_flush(vq, i);
>>>>>>> +        if (!masked_notifier) {
>>>>>>> +            virtio_notify_irqfd(svq->vdev, svq->vq);
>>>>>>> +        } else if (!svq->masked_notifier.signaled) {
>>>>>>> +            svq->masked_notifier.signaled = true;
>>>>>>> +            event_notifier_set(svq->masked_notifier.n);
>>>>>>> +        }
>>>>>>> +    } while (vhost_shadow_vq_more_used(svq));
>>>>>>>
>>>>>>>          if (masked_notifier) {
>>>>>>>              /* Signal not using it anymore */
>>>>>>> @@ -103,7 +270,6 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>>>>>>
>>>>>>>      static void vhost_shadow_vq_handle_call(EventNotifier *n)
>>>>>>>      {
>>>>>>> -
>>>>>>>          if (likely(event_notifier_test_and_clear(n))) {
>>>>>>>              vhost_shadow_vq_handle_call_no_test(n);
>>>>>>>          }
>>>>>>> @@ -254,7 +420,11 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
>>>>>>>                                unsigned idx,
>>>>>>>                                VhostShadowVirtqueue *svq)
>>>>>>>      {
>>>>>>> +    int i;
>>>>>>>          int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
>>>>>>> +
>>>>>>> +    assert(!dev->shadow_vqs_enabled);
>>>>>>> +
>>>>>>>          if (unlikely(r < 0)) {
>>>>>>>              error_report("Couldn't restore vq kick fd: %s", strerror(-r));
>>>>>>>          }
>>>>>>> @@ -272,6 +442,18 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
>>>>>>>          /* Restore vhost call */
>>>>>>>          vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
>>>>>>>                               dev->vqs[idx].notifier_is_masked);
>>>>>>> +
>>>>>>> +
>>>>>>> +    for (i = 0; i < svq->vring.num; ++i) {
>>>>>>> +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
>>>>>>> +        /*
>>>>>>> +         * Although the doc says we must unpop in order, it's ok to unpop
>>>>>>> +         * everything.
>>>>>>> +         */
>>>>>>> +        if (elem) {
>>>>>>> +            virtqueue_unpop(svq->vq, elem, elem->len);
>>>>>> Shouldn't we need to wait until all pending requests to be drained? Or
>>>>>> we may end up duplicated requests?
>>>>>>
>>>>> Do you mean pending as in-flight/processing in the device? The device
>>>>> must be paused at this point.
>>>> Ok. I see there's a vhost_set_vring_enable(dev, false) in
>>>> vhost_sw_live_migration_start().
>>>>
>>>>
>>>>> Currently there is no assertion for
>>>>> this, maybe we can track the device status for it.
>>>>>
>>>>> For the queue handlers to be running at this point, the main event
>>>>> loop should serialize QMP and handlers as far as I know (and they
>>>>> would make all state inconsistent if the device stops suddenly). It
>>>>> would need to be synchronized if the handlers run in their own AIO
>>>>> context. That would be nice to have but it's not included here.
>>>> That's why I suggest just dropping the QMP stuff and using CLI parameters
>>>> to enable the shadow virtqueue. Things would be greatly simplified, I guess.
>>>>
>>> I can send a series without it, but SVQ will need to be able to kick
>>> in dynamically sooner or later if we want to use it for live
>>> migration.
>>
>> I'm not sure I get the issue here. My understanding is everything will be
>> processed in the same AIO context.
>>
> What I meant is that QMP allows us to activate the shadow virtqueue
> mode at any moment, similar to how live migration would activate it.


I get you.


> Enabling SVQ from the command line would imply that it runs the same
> way for the whole time qemu runs.


Ok.


>
> If we do it that way, we don't need more synchronization, since we have
> deleted the event that could run concurrently with the masking. But
> this synchronization will be needed if we want to enable SVQ
> dynamically for live migration, so we are "just" delaying work.
>
> However, if we add vdpa iova range to this patch series, I think it
> would be a good idea to delay that synchronization work to future
> series, so they are smaller and the first one can be tested better.


Yes, that's why I think we can start from the simple case, e.g. to let
the shadow virtqueue logic run. Then we can consider adding
synchronization in the future.

I guess things like a mutex or a bh might help; it would be easier to
add those on top.

Thanks


>
>> Thanks
>>
>>
>>>> Thanks
>>>>
>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>> +        }
>>>>>>> +    }
>>>>>>>      }
>>>>>>>
>>>>>>>      /*
>>>>>>> @@ -284,7 +466,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>>>>>>>          unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
>>>>>>>          size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
>>>>>>>          g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
>>>>>>> -    int r;
>>>>>>> +    int r, i;
>>>>>>>
>>>>>>>          r = event_notifier_init(&svq->kick_notifier, 0);
>>>>>>>          if (r != 0) {
>>>>>>> @@ -303,6 +485,11 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
>>>>>>>          vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
>>>>>>>          svq->vq = virtio_get_queue(dev->vdev, vq_idx);
>>>>>>>          svq->vdev = dev->vdev;
>>>>>>> +    for (i = 0; i < num - 1; i++) {
>>>>>>> +        svq->descs[i].next = virtio_tswap16(dev->vdev, i + 1);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
>>>>>>>          event_notifier_set_handler(&svq->call_notifier,
>>>>>>>                                     vhost_shadow_vq_handle_call);
>>>>>>>          qemu_event_init(&svq->masked_notifier.is_free, true);
>>>>>>> @@ -324,5 +511,6 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
>>>>>>>          event_notifier_cleanup(&vq->kick_notifier);
>>>>>>>          event_notifier_set_handler(&vq->call_notifier, NULL);
>>>>>>>          event_notifier_cleanup(&vq->call_notifier);
>>>>>>> +    g_free(vq->ring_id_maps);
>>>>>>>          g_free(vq);
>>>>>>>      }
>>>>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>>>>> index eab3e334f2..a373999bc4 100644
>>>>>>> --- a/hw/virtio/vhost.c
>>>>>>> +++ b/hw/virtio/vhost.c
>>>>>>> @@ -1021,6 +1021,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
>>>>>>>
>>>>>>>          trace_vhost_iotlb_miss(dev, 1);
>>>>>>>
>>>>>>> +    if (qatomic_load_acquire(&dev->shadow_vqs_enabled)) {
>>>>>>> +        uaddr = iova;
>>>>>>> +        len = 4096;
>>>>>>> +        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
>>>>>>> +                                                IOMMU_RW);
>>>>>>> +        if (ret) {
>>>>>>> +            trace_vhost_iotlb_miss(dev, 2);
>>>>>>> +            error_report("Fail to update device iotlb");
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        return ret;
>>>>>>> +    }
>>>>>>> +
>>>>>>>          iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
>>>>>>>                                                iova, write,
>>>>>>>                                                MEMTXATTRS_UNSPECIFIED);
>>>>>>> @@ -1227,8 +1240,28 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
>>>>>>>          /* Can be read by vhost_virtqueue_mask, from vm exit */
>>>>>>>          qatomic_store_release(&dev->shadow_vqs_enabled, false);
>>>>>>>
>>>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
>>>>>>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
>>>>>>> +        error_report("Fail to invalidate device iotlb");
>>>>>>> +    }
>>>>>>> +
>>>>>>>          for (idx = 0; idx < dev->nvqs; ++idx) {
>>>>>>> +        /*
>>>>>>> +         * Update used ring information for IOTLB to work correctly,
>>>>>>> +         * vhost-kernel code requires for this.
>>>>>>> +         */
>>>>>>> +        struct vhost_virtqueue *vq = dev->vqs + idx;
>>>>>>> +        vhost_device_iotlb_miss(dev, vq->used_phys, true);
>>>>>>> +
>>>>>>>              vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
>>>>>>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
>>>>>>> +                              dev->vq_index + idx);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    /* Enable guest's vq vring */
>>>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>>>>>>> +
>>>>>>> +    for (idx = 0; idx < dev->nvqs; ++idx) {
>>>>>>>              vhost_shadow_vq_free(dev->shadow_vqs[idx]);
>>>>>>>          }
>>>>>>>
>>>>>>> @@ -1237,6 +1270,59 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
>>>>>>>          return 0;
>>>>>>>      }
>>>>>>>
>>>>>>> +/*
>>>>>>> + * Start shadow virtqueue in a given queue.
>>>>>>> + * In failure case, this function leaves queue working as regular vhost mode.
>>>>>>> + */
>>>>>>> +static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
>>>>>>> +                                             unsigned idx)
>>>>>>> +{
>>>>>>> +    struct vhost_vring_addr addr = {
>>>>>>> +        .index = idx,
>>>>>>> +    };
>>>>>>> +    struct vhost_vring_state s = {
>>>>>>> +        .index = idx,
>>>>>>> +    };
>>>>>>> +    int r;
>>>>>>> +    bool ok;
>>>>>>> +
>>>>>>> +    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
>>>>>>> +    ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
>>>>>>> +    if (unlikely(!ok)) {
>>>>>>> +        return false;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    /* From this point, vhost_virtqueue_start can reset these changes */
>>>>>>> +    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
>>>>>>> +    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>> +        VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
>>>>>>> +        goto err;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>> +        VHOST_OPS_DEBUG("vhost_set_vring_base for shadow vq failed");
>>>>>>> +        goto err;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * Update used ring information for IOTLB to work correctly,
>>>>>>> +     * vhost-kernel code requires for this.
>>>>>>> +     */
>>>>>>> +    r = vhost_device_iotlb_miss(dev, addr.used_user_addr, true);
>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>> +        /* Debug message already printed */
>>>>>>> +        goto err;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return true;
>>>>>>> +
>>>>>>> +err:
>>>>>>> +    vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
>>>>>>> +    return false;
>>>>>>> +}
>>>>>>> +
>>>>>>>      static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>>>>>>>      {
>>>>>>>          int idx, stop_idx;
>>>>>>> @@ -1249,24 +1335,35 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
>>>>>>>              }
>>>>>>>          }
>>>>>>>
>>>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
>>>>>>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
>>>>>>> +        error_report("Fail to invalidate device iotlb");
>>>>>>> +    }
>>>>>>> +
>>>>>>>          /* Can be read by vhost_virtqueue_mask, from vm exit */
>>>>>>>          qatomic_store_release(&dev->shadow_vqs_enabled, true);
>>>>>>>          for (idx = 0; idx < dev->nvqs; ++idx) {
>>>>>>> -        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
>>>>>>> +        bool ok = vhost_sw_live_migration_start_vq(dev, idx);
>>>>>>>              if (unlikely(!ok)) {
>>>>>>>                  goto err_start;
>>>>>>>              }
>>>>>>>          }
>>>>>>>
>>>>>>> +    /* Enable shadow vq vring */
>>>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>>>>>>>          return 0;
>>>>>>>
>>>>>>>      err_start:
>>>>>>>          qatomic_store_release(&dev->shadow_vqs_enabled, false);
>>>>>>>          for (stop_idx = 0; stop_idx < idx; stop_idx++) {
>>>>>>>              vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
>>>>>>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
>>>>>>> +                              dev->vq_index + stop_idx);
>>>>>>>          }
>>>>>>>
>>>>>>>      err_new:
>>>>>>> +    /* Enable guest's vring */
>>>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
>>>>>>>          for (idx = 0; idx < dev->nvqs; ++idx) {
>>>>>>>              vhost_shadow_vq_free(dev->shadow_vqs[idx]);
>>>>>>>          }
>>>>>>> @@ -1970,6 +2067,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>>>>>>>
>>>>>>>              if (!hdev->started) {
>>>>>>>                  err_cause = "Device is not started";
>>>>>>> +        } else if (!vhost_dev_has_iommu(hdev)) {
>>>>>>> +            err_cause = "Does not support iommu";
>>>>>>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
>>>>>>> +            err_cause = "Is packed";
>>>>>>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_RING_F_EVENT_IDX)) {
>>>>>>> +            err_cause = "Have event idx";
>>>>>>> +        } else if (hdev->acked_features &
>>>>>>> +                   BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC)) {
>>>>>>> +            err_cause = "Supports indirect descriptors";
>>>>>>> +        } else if (!hdev->vhost_ops->vhost_set_vring_enable) {
>>>>>>> +            err_cause = "Cannot pause device";
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        if (err_cause) {
>>>>>>>                  goto err;
>>>>>>>              }
>>>>>>>
>




* Re: [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-18  3:10           ` Jason Wang
@ 2021-03-18  9:18             ` Eugenio Perez Martin
  2021-03-18  9:29               ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-18  9:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Thu, Mar 18, 2021 at 4:11 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/18 12:47 AM, Eugenio Perez Martin wrote:
> > On Wed, Mar 17, 2021 at 3:05 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/16 6:31 PM, Eugenio Perez Martin wrote:
> >>> On Tue, Mar 16, 2021 at 8:18 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> >>>>> Shadow virtqueue notification forwarding is disabled when vhost_dev
> >>>>> stops, so the code flow follows the usual cleanup.
> >>>>>
> >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>> ---
> >>>>>     hw/virtio/vhost-shadow-virtqueue.h |   7 ++
> >>>>>     include/hw/virtio/vhost.h          |   4 +
> >>>>>     hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
> >>>>>     hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
> >>>>>     4 files changed, 265 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> >>>>> index 6cc18d6acb..c891c6510d 100644
> >>>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
> >>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> >>>>> @@ -17,6 +17,13 @@
> >>>>>
> >>>>>     typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >>>>>
> >>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> >>>>> +                           unsigned idx,
> >>>>> +                           VhostShadowVirtqueue *svq);
> >>>>> +void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >>>>> +                          unsigned idx,
> >>>>> +                          VhostShadowVirtqueue *svq);
> >>>>> +
> >>>>>     VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
> >>>>>
> >>>>>     void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
> >>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> >>>>> index ac963bf23d..7ffdf9aea0 100644
> >>>>> --- a/include/hw/virtio/vhost.h
> >>>>> +++ b/include/hw/virtio/vhost.h
> >>>>> @@ -55,6 +55,8 @@ struct vhost_iommu {
> >>>>>         QLIST_ENTRY(vhost_iommu) iommu_next;
> >>>>>     };
> >>>>>
> >>>>> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >>>>> +
> >>>>>     typedef struct VhostDevConfigOps {
> >>>>>         /* Vhost device config space changed callback
> >>>>>          */
> >>>>> @@ -83,7 +85,9 @@ struct vhost_dev {
> >>>>>         uint64_t backend_cap;
> >>>>>         bool started;
> >>>>>         bool log_enabled;
> >>>>> +    bool shadow_vqs_enabled;
> >>>>>         uint64_t log_size;
> >>>>> +    VhostShadowVirtqueue **shadow_vqs;
> >>>> Any reason that you don't embed the shadow virtqueue into
> >>>> vhost_virtqueue structure?
> >>>>
> >>> Not really, it could be relatively big and I would prefer SVQ
> >>> members/methods to remain hidden from any other part that includes
> >>> vhost.h. But it could be changed, for sure.
> >>>
> >>>> (Note that there's a masked_notifier in struct vhost_virtqueue).
> >>>>
> >>> They are used differently: in SVQ the masked notifier is a pointer,
> >>> and if it's NULL the SVQ code knows that the device is not masked. The
> >>> vhost_virtqueue is the real owner.
> >>
> >> Yes, but it's an example of embedding auxiliary data structures in the
> >> vhost_virtqueue.
> >>
> >>
> >>> It could be replaced by a boolean in SVQ or something like that. I
> >>> experimented with a tri-state too (UNMASKED, MASKED, MASKED_NOTIFIED)
> >>> and let the vhost.c code manage all the transitions. But I find the
> >>> pointer use clearer, since it's more natural for the existing
> >>> vhost_virtqueue_mask and vhost_virtqueue_pending functions.
> >>>
> >>> This masking/unmasking is the part I dislike the most from this
> >>> series, so I'm very open to alternatives.
> >>
> >> See below. I think we don't even need to care about that.
> >>
> >>
> >>>>>         Error *migration_blocker;
> >>>>>         const VhostOps *vhost_ops;
> >>>>>         void *opaque;
> >>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>>>> index 4512e5b058..3e43399e9c 100644
> >>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>>>> @@ -8,9 +8,12 @@
> >>>>>      */
> >>>>>
> >>>>>     #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>>> +#include "hw/virtio/vhost.h"
> >>>>> +
> >>>>> +#include "standard-headers/linux/vhost_types.h"
> >>>>>
> >>>>>     #include "qemu/error-report.h"
> >>>>> -#include "qemu/event_notifier.h"
> >>>>> +#include "qemu/main-loop.h"
> >>>>>
> >>>>>     /* Shadow virtqueue to relay notifications */
> >>>>>     typedef struct VhostShadowVirtqueue {
> >>>>> @@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
> >>>>>         EventNotifier kick_notifier;
> >>>>>         /* Shadow call notifier, sent to vhost */
> >>>>>         EventNotifier call_notifier;
> >>>>> +
> >>>>> +    /*
> >>>>> +     * Borrowed virtqueue's guest to host notifier.
> >>>>> +     * To borrow it in this event notifier allows to register on the event
> >>>>> +     * loop and access the associated shadow virtqueue easily. If we use the
> >>>>> +     * VirtQueue, we don't have an easy way to retrieve it.
> >>>> So this is something that worries me. It looks like a layer violation
> >>>> that makes the code harder to get right.
> >>>>
> >>> I don't follow you here.
> >>>
> >>> The vhost code already depends on virtqueue in the same sense:
> >>> virtio_queue_get_host_notifier is called on vhost_virtqueue_start. So
> >>> if this behavior ever changes it is unlikely for vhost to keep working
> >>> without changes. vhost_virtqueue has a kick/call int where I think it
> >>> should be stored actually, but they are never used as far as I see.
> >>>
> >>> Previous RFC did rely on vhost_dev_disable_notifiers. From its documentation:
> >>> /* Stop processing guest IO notifications in vhost.
> >>>    * Start processing them in qemu.
> >>>    ...
> >>> But it was easier for this mode to miss a notification, since they
> >>> create a new host_notifier in virtio_bus_set_host_notifier right away.
> >>> So I decided to use the file descriptor already sent to vhost in
> >>> regular operation mode, so guest-related resources change less.
> >>>
> >>> Having said that, maybe it's useful to assert that
> >>> vhost_dev_{enable,disable}_notifiers are never called on shadow
> >>> virtqueue mode. Also, it could be useful to retrieve it from
> >>> virtio_bus, not raw shadow virtqueue, so all get/set are performed
> >>> from it. Would that make more sense?
> >>>
> >>>> I wonder if it would be simpler to start from a vDPA dedicated shadow
> >>>> virtqueue implementation:
> >>>>
> >>>> 1) have the above fields embeded in vhost_vdpa structure
> >>>> 2) Work at the level of
> >>>> vhost_vdpa_set_vring_kick()/vhost_vdpa_set_vring_call()
> >>>>
> >>> This notifier is never sent to the device in shadow virtqueue mode.
> >>> It's for SVQ to react to guest's notifications, registering it on its
> >>> main event loop [1]. So if I perform these changes the way I
> >>> understand them, SVQ would still rely on this borrowed EventNotifier,
> >>> and it would send to the vDPA device the newly created kick_notifier
> >>> of VhostShadowVirtqueue.
> >>
> >> The point is that vhost code should be coupled loosely with virtio. If
> >> you try to "borrow" EventNotifier from virtio, you need to deal with a
> >> lot of synchronization. An example is the masking stuff.
> >>
> > I still don't follow this, sorry.
> >
> > The svq->host_notifier event notifier is not affected by the masking
> > issue, it is completely private to SVQ. This commit creates and uses
> > it, and nothing related to masking is touched until the next commit.
> >
> >>>> Then the layer is still isolated and you have a much simpler context to
> >>>> work in, where you don't need to care about a lot of synchronization:
> >>>>
> >>>> 1) vq masking
> >>> This EventNotifier is not used for masking, it does not change from
> >>> the start of the shadow virtqueue operation through its end. Call fd
> >>> sent to vhost/vdpa device does not change either in shadow virtqueue
> >>> mode operation with masking/unmasking. I will try to document it
> >>> better.
> >>>
> >>> I think that we will need to handle synchronization with
> >>> masking/unmasking from the guest and dynamically enabling SVQ
> >>> operation mode, since they can happen at the same time as long as we
> >>> let the guest run. There may be better ways of synchronizing them of
> >>> course, but I don't see how moving to the vhost-vdpa backend helps
> >>> with this. Please expand if I've missed it.
> >>>
> >>> Or do you mean to forbid regular <-> SVQ operation mode transitions and delay it
> >>> to future patchsets?
> >>
> >> So my idea is to do all the shadow virtqueue work in the vhost-vDPA code
> >> and hide it from the upper layers like virtio. This means it works at the
> >> vhost level, which can see vhost_vring_file only. When enabled, what it
> >> needs is just:
> >>
> >> 1) switch to use svq kickfd and relay ioeventfd to svq kickfd
> >> 2) switch to use svq callfd and relay svq callfd to irqfd
> >>
> >> It will still behave like a vhost backend; the switching is done
> >> internally in vhost-vDPA, which is totally transparent to the virtio
> >> code of Qemu.
> >>
> >> E.g:
> >>
> >> 1) in the case of guest notifier masking, we don't need to do anything
> >> since the virtio code will replace the irqfd for us.
> > Assuming that we don't modify the vhost masking code, but send the shadow
> > virtqueue call descriptor to the vhost device:
> >
> > If the guest virtio code masks the virtqueue and replaces the vhost-vdpa
> > device call fd (VhostShadowVirtqueue.call_notifier in the next commit,
> > or the descriptor in your previous second point, svq callfd) with the
> > masked notifier, vhost_shadow_vq_handle_call will not be called
> > anymore, and no more used descriptors will be forwarded. They will be
> > stuck in the shadow virtqueue forever. The guest itself cannot recover
> > from this situation, since masking will set the irqfd, not the SVQ call fd.
>
>
> Just to make sure we're on the same page: during vq masking, the virtio
> code actually uses the masked_notifier as callfd in vhost_virtqueue_mask():
>
>      if (mask) {
>          assert(vdev->use_guest_notifier_mask);
>          file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
>      } else {
>          file.fd = event_notifier_get_fd(virtio_queue_get_guest_notifier(vvq));
>      }
>
>      file.index = hdev->vhost_ops->vhost_get_vq_index(hdev, n);
>      r = hdev->vhost_ops->vhost_set_vring_call(hdev, &file);
>
> So consider the shadow virtqueue is done at vhost-vDPA. We just need to
> make sure:
>
> 1) update the callfd which is passed by the virtio layer via set_vring_call()
> 2) always write to the callfd during vhost_shadow_vq_handle_call()
>
> Then
>
> 3) When shadow vq is enabled, we just set the callfd of shadow virtqueue
> to vDPA via VHOST_SET_VRING_CALL, and poll the svq callfd
> 4) When shadow vq is disabled, we just set the callfd that is passed by
> virtio via VHOST_SET_VRING_CALL, and stop polling the svq callfd
>
> So you can see in steps 2 and 4, we don't need to know whether or not the
> vq is masked, since we follow the vhost protocol "VhostOps" and do
> everything transparently in the vhost-(vDPA) layer.
>

All of this assumes that we can enable/disable SVQ dynamically while
the device is running. If that's not the case, there is no need for the
mutex, either in the vhost.c code or in vdpa_backend.

As I see it, the issue is that steps (2) and (4) happen in different
threads: (2) is in the vCPU vmexit path, and (4) is in the main event
loop. Consider unmasking and disabling SVQ at the same time with no mutex:

vCPU vmexit thread                      aio thread
(unmask)                                (stops SVQ)
|                                       |
|                                       // Last callfd set was masked_notifier
|                                       vdpa_backend.callfd = \
|                                           atomic_read(masked_notifier)
|                                       |
vhost_set_vring_call(vq.guest_notifier) |
-> vdpa_backend.callfd = \              |
       vq.guest_notifier                |
|                                       |
|                                       ioctl(vdpa, VHOST_SET_VRING_CALL,
|                                             vdpa_backend.callfd)
|
// guest expects more interrupts, but
// the device was just set to the masked notifier

And vhost_set_vring_call could run entirely while the ioctl is being
executed.

So that is the reason for the mutex: vdpa_backend.call_fd and the
ioctl VHOST_SET_VRING_CALL must be serialized. I'm ok with moving it to
the vdpa backend, but it's the same code, just in vdpa_backend.c instead
of vhost.c, so it becomes less generic in my opinion.
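
To make it concrete, the serialization I have in mind is something like
the following untested sketch; names like callfd_mutex are made up for
illustration, they are not in the series:

    /*
     * Hypothetical: take a mutex around both the cached callfd update
     * and the ioctl, so the vmexit and aio paths cannot interleave.
     */
    static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
                                         struct vhost_vring_file *file)
    {
        struct vhost_vdpa *v = dev->opaque;
        int r;

        qemu_mutex_lock(&v->callfd_mutex);
        v->callfd = file->fd;    /* guest notifier or masked notifier */
        r = ioctl(v->device_fd, VHOST_SET_VRING_CALL, file);
        qemu_mutex_unlock(&v->callfd_mutex);
        return r;
    }

The SVQ stop path would take the same callfd_mutex before reading the
cached callfd and issuing its own VHOST_SET_VRING_CALL.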

>
> >
> >> 2) easily to deal with vhost dev start and stop
> >>
> >> The advantages are obvious, simple and easy to implement.
> >>
> > I still don't see how performing this step from backend code avoids
> > the synchronization problem, since they will be done from different
> > threads anyway. Not sure what piece I'm missing.
>
>
> See my reply in another thread. If you enable the shadow virtqueue via an
> OOB monitor, that's a real issue.
>
> But I don't think we need to do that since
>
> 1) SVQ should be transparent to management
> 2) unnecessary synchronization issue
>
> We can enable the shadow virtqueue through the CLI, probably with a new
> vhost-vdpa parameter. Then we don't need to care about threads. And in
> the final version with full live migration support, the shadow virtqueue
> should be enabled automatically, e.g. for devices without
> VHOST_F_LOG_ALL, or we can have a dedicated vDPA capability via
> VHOST_GET_BACKEND_FEATURES.
>

It should be enabled automatically in those conditions, but it also
needs to be dynamic, and only be active during migration. Otherwise,
the guest should use regular vdpa operation. The problem with masking is
the same whether we enable it with QMP or because of a live migration
event.

So we will have the previous synchronization problem sooner or later.
If we omit the rollback to regular vdpa operation (in other words,
disabling SVQ), code can be simplified, but I'm not sure if that is
desirable.

Thanks!

> Thanks
>
>
> >
> > I can see / tested a few solutions but I don't like them a lot:
> >
> > * Forbid hot-swapping from/to shadow virtqueue mode, and set it from
> > cmdline: We will have to deal with setting the SVQ mode dynamically
> > sooner or later if we want to use it for live migration.
> > * Forbid coming back to regular mode after switching to shadow
> > virtqueue mode: The heavy part of the synchronization comes from svq
> > stopping code, since we need to serialize the setting of device call
> > fd. This could be acceptable, but I'm not sure about the implications:
> > What happens if live migration fails and we need to step back? A mutex
> > is not needed in this scenario, it's ok with atomics and RCU code.
> >
> > * Replace KVM_IRQFD instead and let SVQ poll the old one and the masked
> > notifier: I haven't thought a lot about this one; I think it's better
> > not to touch guest notifiers.
> > * Monitor also masked notifier from SVQ: I think this could be
> > promising, but SVQ needs to be notified about masking/unmasking
> > anyway, and there is code that depends on checking the masked notifier
> > for the pending notification.
> >
> >>>> 2) vhost dev start and stop
> >>>>
> >>>> ?
> >>>>
> >>>>
> >>>>> +     *
> >>>>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> >>>>> +     */
> >>>>> +    EventNotifier host_notifier;
> >>>>> +
> >>>>> +    /* Virtio queue shadowing */
> >>>>> +    VirtQueue *vq;
> >>>>>     } VhostShadowVirtqueue;
> >>>>>
> >>>>> +/* Forward guest notifications */
> >>>>> +static void vhost_handle_guest_kick(EventNotifier *n)
> >>>>> +{
> >>>>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>>>> +                                             host_notifier);
> >>>>> +
> >>>>> +    if (unlikely(!event_notifier_test_and_clear(n))) {
> >>>>> +        return;
> >>>>> +    }
> >>>>> +
> >>>>> +    event_notifier_set(&svq->kick_notifier);
> >>>>> +}
> >>>>> +
> >>>>> +/*
> >>>>> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
> >>>>> + */
> >>>>> +static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
> >>>>> +                                                     unsigned vhost_index,
> >>>>> +                                                     VhostShadowVirtqueue *svq)
> >>>>> +{
> >>>>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> >>>>> +    struct vhost_vring_file file = {
> >>>>> +        .index = vhost_index,
> >>>>> +        .fd = event_notifier_get_fd(vq_host_notifier),
> >>>>> +    };
> >>>>> +    int r;
> >>>>> +
> >>>>> +    /* Restore vhost kick */
> >>>>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> >>>>> +    return r ? -errno : 0;
> >>>>> +}
> >>>>> +
> >>>>> +/*
> >>>>> + * Start shadow virtqueue operation.
> >>>>> + * @dev vhost device
> >>>>> + * @hidx vhost virtqueue index
> >>>>> + * @svq Shadow Virtqueue
> >>>>> + */
> >>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> >>>>> +                           unsigned idx,
> >>>>> +                           VhostShadowVirtqueue *svq)
> >>>> It looks to me this assumes the vhost_dev is started before
> >>>> vhost_shadow_vq_start()?
> >>>>
> >>> Right.
> >>
> >> This might not true. Guest may enable and disable virtio drivers after
> >> the shadow virtqueue is started. You need to deal with that.
> >>
> > Right, I will test this scenario.
> >
> >> Thanks
> >>
>




* Re: [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-18  9:18             ` Eugenio Perez Martin
@ 2021-03-18  9:29               ` Jason Wang
  2021-03-18 10:48                 ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2021-03-18  9:29 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller


On 2021/3/18 5:18 PM, Eugenio Perez Martin wrote:
> On Thu, Mar 18, 2021 at 4:11 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/18 12:47 AM, Eugenio Perez Martin wrote:
>>> On Wed, Mar 17, 2021 at 3:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/16 6:31 PM, Eugenio Perez Martin wrote:
>>>>> On Tue, Mar 16, 2021 at 8:18 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
>>>>>>> Shadow virtqueue notification forwarding is disabled when vhost_dev
>>>>>>> stops, so the code flow follows the usual cleanup.
>>>>>>>
>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>> ---
>>>>>>>      hw/virtio/vhost-shadow-virtqueue.h |   7 ++
>>>>>>>      include/hw/virtio/vhost.h          |   4 +
>>>>>>>      hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
>>>>>>>      hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
>>>>>>>      4 files changed, 265 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>>>>>>> index 6cc18d6acb..c891c6510d 100644
>>>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>>>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>>>>>>> @@ -17,6 +17,13 @@
>>>>>>>
>>>>>>>      typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>>>>>>>
>>>>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
>>>>>>> +                           unsigned idx,
>>>>>>> +                           VhostShadowVirtqueue *svq);
>>>>>>> +void vhost_shadow_vq_stop(struct vhost_dev *dev,
>>>>>>> +                          unsigned idx,
>>>>>>> +                          VhostShadowVirtqueue *svq);
>>>>>>> +
>>>>>>>      VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
>>>>>>>
>>>>>>>      void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
>>>>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>>>>>> index ac963bf23d..7ffdf9aea0 100644
>>>>>>> --- a/include/hw/virtio/vhost.h
>>>>>>> +++ b/include/hw/virtio/vhost.h
>>>>>>> @@ -55,6 +55,8 @@ struct vhost_iommu {
>>>>>>>          QLIST_ENTRY(vhost_iommu) iommu_next;
>>>>>>>      };
>>>>>>>
>>>>>>> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>>>>>>> +
>>>>>>>      typedef struct VhostDevConfigOps {
>>>>>>>          /* Vhost device config space changed callback
>>>>>>>           */
>>>>>>> @@ -83,7 +85,9 @@ struct vhost_dev {
>>>>>>>          uint64_t backend_cap;
>>>>>>>          bool started;
>>>>>>>          bool log_enabled;
>>>>>>> +    bool shadow_vqs_enabled;
>>>>>>>          uint64_t log_size;
>>>>>>> +    VhostShadowVirtqueue **shadow_vqs;
>>>>>> Any reason that you don't embed the shadow virtqueue into
>>>>>> vhost_virtqueue structure?
>>>>>>
>>>>> Not really, it could be relatively big and I would prefer SVQ
>>>>> members/methods to remain hidden from any other part that includes
>>>>> vhost.h. But it could be changed, for sure.
>>>>>
>>>>>> (Note that there's a masked_notifier in struct vhost_virtqueue).
>>>>>>
>>>>> They are used differently: in SVQ the masked notifier is a pointer,
>>>>> and if it's NULL the SVQ code knows that the device is not masked. The
>>>>> vhost_virtqueue is the real owner.
>>>> Yes, but it's an example of embedding auxiliary data structures in the
>>>> vhost_virtqueue.
>>>>
>>>>
>>>>> It could be replaced by a boolean in SVQ or something like that. I
>>>>> experimented with a tri-state too (UNMASKED, MASKED, MASKED_NOTIFIED)
>>>>> and let the vhost.c code manage all the transitions. But I find the
>>>>> pointer use clearer, since it's more natural for the existing
>>>>> vhost_virtqueue_mask and vhost_virtqueue_pending functions.
>>>>>
>>>>> This masking/unmasking is the part I dislike the most from this
>>>>> series, so I'm very open to alternatives.
>>>> See below. I think we don't even need to care about that.
>>>>
>>>>
>>>>>>>          Error *migration_blocker;
>>>>>>>          const VhostOps *vhost_ops;
>>>>>>>          void *opaque;
>>>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>>>>>> index 4512e5b058..3e43399e9c 100644
>>>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>>>>>> @@ -8,9 +8,12 @@
>>>>>>>       */
>>>>>>>
>>>>>>>      #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>>>>> +#include "hw/virtio/vhost.h"
>>>>>>> +
>>>>>>> +#include "standard-headers/linux/vhost_types.h"
>>>>>>>
>>>>>>>      #include "qemu/error-report.h"
>>>>>>> -#include "qemu/event_notifier.h"
>>>>>>> +#include "qemu/main-loop.h"
>>>>>>>
>>>>>>>      /* Shadow virtqueue to relay notifications */
>>>>>>>      typedef struct VhostShadowVirtqueue {
>>>>>>> @@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
>>>>>>>          EventNotifier kick_notifier;
>>>>>>>          /* Shadow call notifier, sent to vhost */
>>>>>>>          EventNotifier call_notifier;
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * Borrowed virtqueue's guest to host notifier.
>>>>>>> +     * To borrow it in this event notifier allows to register on the event
>>>>>>> +     * loop and access the associated shadow virtqueue easily. If we use the
>>>>>>> +     * VirtQueue, we don't have an easy way to retrieve it.
>>>>>> So this is something that worries me. It looks like a layer violation
>>>>>> that makes the code harder to get right.
>>>>>>
>>>>> I don't follow you here.
>>>>>
>>>>> The vhost code already depends on virtqueue in the same sense:
>>>>> virtio_queue_get_host_notifier is called on vhost_virtqueue_start. So
>>>>> if this behavior ever changes it is unlikely for vhost to keep working
>>>>> without changes. vhost_virtqueue has a kick/call int where I think it
>>>>> should be stored actually, but they are never used as far as I see.
>>>>>
>>>>> Previous RFC did rely on vhost_dev_disable_notifiers. From its documentation:
>>>>> /* Stop processing guest IO notifications in vhost.
>>>>>     * Start processing them in qemu.
>>>>>     ...
>>>>> But it was easier for this mode to miss a notification, since they
>>>>> create a new host_notifier in virtio_bus_set_host_notifier right away.
>>>>> So I decided to use the file descriptor already sent to vhost in
>>>>> regular operation mode, so guest-related resources change less.
>>>>>
>>>>> Having said that, maybe it's useful to assert that
>>>>> vhost_dev_{enable,disable}_notifiers are never called on shadow
>>>>> virtqueue mode. Also, it could be useful to retrieve it from
>>>>> virtio_bus, not raw shadow virtqueue, so all get/set are performed
>>>>> from it. Would that make more sense?
>>>>>
>>>>>> I wonder if it would be simpler to start from a vDPA dedicated shadow
>>>>>> virtqueue implementation:
>>>>>>
>>>>>> 1) have the above fields embeded in vhost_vdpa structure
>>>>>> 2) Work at the level of
>>>>>> vhost_vdpa_set_vring_kick()/vhost_vdpa_set_vring_call()
>>>>>>
>>>>> This notifier is never sent to the device in shadow virtqueue mode.
>>>>> It's for SVQ to react to guest's notifications, registering it on its
>>>>> main event loop [1]. So if I perform these changes the way I
>>>>> understand them, SVQ would still rely on this borrowed EventNotifier,
>>>>> and it would send to the vDPA device the newly created kick_notifier
>>>>> of VhostShadowVirtqueue.
>>>> The point is that vhost code should be coupled loosely with virtio. If
>>>> you try to "borrow" EventNotifier from virtio, you need to deal with a
>>>> lot of synchronization. An example is the masking stuff.
>>>>
>>> I still don't follow this, sorry.
>>>
>>> The svq->host_notifier event notifier is not affected by the masking
>>> issue, it is completely private to SVQ. This commit creates and uses
>>> it, and nothing related to masking is touched until the next commit.
>>>
>>>>>> Then the layer is still isolated and you have a much simpler context to
>>>>>> work in, where you don't need to care about a lot of synchronization:
>>>>>>
>>>>>> 1) vq masking
>>>>> This EventNotifier is not used for masking, it does not change from
>>>>> the start of the shadow virtqueue operation through its end. Call fd
>>>>> sent to vhost/vdpa device does not change either in shadow virtqueue
>>>>> mode operation with masking/unmasking. I will try to document it
>>>>> better.
>>>>>
>>>>> I think that we will need to handle synchronization with
>>>>> masking/unmasking from the guest and dynamically enabling SVQ
>>>>> operation mode, since they can happen at the same time as long as we
>>>>> let the guest run. There may be better ways of synchronizing them of
>>>>> course, but I don't see how moving to the vhost-vdpa backend helps
>>>>> with this. Please expand if I've missed it.
>>>>>
>>>>> Or do you mean to forbid regular <-> SVQ operation mode transitions and delay it
>>>>> to future patchsets?
>>>> So my idea is to do all the shadow virtqueue work in the vhost-vDPA code
>>>> and hide it from the upper layers like virtio. This means it works at the
>>>> vhost level, which can see vhost_vring_file only. When enabled, what it
>>>> needs is just:
>>>>
>>>> 1) switch to use svq kickfd and relay ioeventfd to svq kickfd
>>>> 2) switch to use svq callfd and relay svq callfd to irqfd
>>>>
>>>> It will still behave like a vhost backend; the switching is done
>>>> internally in vhost-vDPA, which is totally transparent to the virtio
>>>> code of Qemu.
>>>>
>>>> E.g:
>>>>
>>>> 1) in the case of guest notifier masking, we don't need to do anything
>>>> since the virtio code will replace the irqfd for us.
>>> Assuming that we don't modify the vhost masking code, but send the shadow
>>> virtqueue call descriptor to the vhost device:
>>>
>>> If the guest virtio code masks the virtqueue and replaces the vhost-vdpa
>>> device call fd (VhostShadowVirtqueue.call_notifier in the next commit,
>>> or the descriptor in your previous second point, svq callfd) with the
>>> masked notifier, vhost_shadow_vq_handle_call will not be called
>>> anymore, and no more used descriptors will be forwarded. They will be
>>> stuck in the shadow virtqueue forever. The guest itself cannot recover
>>> from this situation, since masking will set the irqfd, not the SVQ call fd.
>>
>> Just to make sure we're on the same page: during vq masking, the virtio
>> code actually uses the masked_notifier as callfd in vhost_virtqueue_mask():
>>
>>       if (mask) {
>>           assert(vdev->use_guest_notifier_mask);
>>           file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
>>       } else {
>>           file.fd = event_notifier_get_fd(virtio_queue_get_guest_notifier(vvq));
>>       }
>>
>>       file.index = hdev->vhost_ops->vhost_get_vq_index(hdev, n);
>>       r = hdev->vhost_ops->vhost_set_vring_call(hdev, &file);
>>
>> So consider the shadow virtqueue is done at vhost-vDPA. We just need to
>> make sure:
>>
>> 1) update the callfd which is passed by the virtio layer via set_vring_call()
>> 2) always write to the callfd during vhost_shadow_vq_handle_call()
>>
>> Then
>>
>> 3) When shadow vq is enabled, we just set the callfd of shadow virtqueue
>> to vDPA via VHOST_SET_VRING_CALL, and poll the svq callfd
>> 4) When shadow vq is disabled, we just set the callfd that is passed by
>> virtio via VHOST_SET_VRING_CALL, and stop polling the svq callfd
>>
>> So you can see in steps 2 and 4, we don't need to know whether or not the
>> vq is masked, since we follow the vhost protocol "VhostOps" and do
>> everything transparently in the vhost-(vDPA) layer.
>>
> All of this assumes that we can enable/disable SVQ dynamically while
> the device is running. If that's not the case, there is no need for the
> mutex, either in the vhost.c code or in vdpa_backend.
>
> As I see it, the issue is that steps (2) and (4) happen in different
> threads: (2) is in the vCPU vmexit path, and (4) is in the main event
> loop. Consider unmasking and disabling SVQ at the same time with no mutex:
>
> vCPU vmexit thread                      aio thread
> (unmask)                                (stops SVQ)
> |                                       |
> |                                       // Last callfd set was masked_notifier
> |                                       vdpa_backend.callfd = \
> |                                           atomic_read(masked_notifier)
> |                                       |
> vhost_set_vring_call(vq.guest_notifier) |
> -> vdpa_backend.callfd = \              |
>        vq.guest_notifier                |
> |                                       |
> |                                       ioctl(vdpa, VHOST_SET_VRING_CALL,
> |                                             vdpa_backend.callfd)
> |
> // guest expects more interrupts, but
> // the device was just set to the masked notifier
>
> And vhost_set_vring_call could even run entirely while the ioctl is
> being executed.
>
> So that is the reason for the mutex: vdpa_backend.call_fd and the
> ioctl VHOST_SET_VRING_CALL must be serialized. I'm ok with moving it to
> the vdpa backend, but it's the same code, just in vdpa_backend.c instead
> of vhost.c, so it becomes less generic in my opinion.
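>
> Roughly, the serialization I mean could look like this (a simplified
> sketch; the call_fd_mutex and call_fd names are just illustrative):
>
> static void vhost_svq_set_vring_call(struct vhost_dev *dev,
>                                      struct vhost_vring_file *file)
> {
>     /* The fd update and the ioctl must be atomic as a whole */
>     qemu_mutex_lock(&dev->call_fd_mutex);
>     dev->call_fd = file->fd;
>     dev->vhost_ops->vhost_set_vring_call(dev, file);
>     qemu_mutex_unlock(&dev->call_fd_mutex);
> }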


You are right. But let's consider if we can avoid the dedicated mutex.

E.g. can we use the BQL? Basically we need to synchronize with the iothread.

Or is it possible to schedule a bh so things are serialized automatically?
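
Something like this oneshot bh sketch, for example (the
pending_call_file field is made up):

/* Defer the callfd update to the main loop, so it is naturally
 * serialized with SVQ start/stop, which also runs there */
static void vhost_update_call_fd_bh(void *opaque)
{
    struct vhost_dev *dev = opaque;

    dev->vhost_ops->vhost_set_vring_call(dev, &dev->pending_call_file);
}

/* and in the vmexit path, instead of doing the ioctl directly: */
aio_bh_schedule_oneshot(qemu_get_aio_context(),
                        vhost_update_call_fd_bh, dev);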


>
>>>> 2) it is easy to deal with vhost dev start and stop
>>>>
>>>> The advantages are obvious: simple and easy to implement.
>>>>
>>> I still don't see how performing this step from backend code avoids
>>> the synchronization problem, since they will be done from different
>>> threads anyway. Not sure what piece I'm missing.
>>
>> See my reply in another thread. If you enable the shadow virtqueue via an
>> OOB monitor, that's a real issue.
>>
>> But I don't think we need to do that since
>>
>> 1) SVQ should be transparent to management
>> 2) it would create an unnecessary synchronization issue
>>
>> We can enable the shadow virtqueue through the cli, probably a new
>> parameter for vhost-vdpa. Then we don't need to care about threads. And in
>> the final version with full live migration support, the shadow virtqueue
>> should be enabled automatically, e.g. for devices without
>> VHOST_F_LOG_ALL, or we can have a dedicated vDPA capability via
>> VHOST_GET_BACKEND_FEATURES.
>>
> It should be enabled automatically in those conditions, but it also
> needs to be dynamic, and only be active during migration. Otherwise,
> the guest should use regular vdpa operation. The problem with masking is
> the same whether we enable it with QMP or because of a live migration event.
>
> So we will have the previous synchronization problem sooner or later.
> If we omit the rollback to regular vdpa operation (in other words,
> disabling SVQ), code can be simplified, but I'm not sure if that is
> desirable.


Right, so I'm ok to have the synchronization from the start if you wish.

But we need to figure out what to synchronize and how to do it.

Thanks


>
> Thanks!
>
>> Thanks
>>
>>
>>> I can see / tested a few solutions but I don't like them a lot:
>>>
>>> * Forbid hot-swapping from/to shadow virtqueue mode, and set it from
>>> cmdline: We will have to deal with setting the SVQ mode dynamically
>>> sooner or later if we want to use it for live migration.
>>> * Forbid coming back to regular mode after switching to shadow
>>> virtqueue mode: The heavy part of the synchronization comes from svq
>>> stopping code, since we need to serialize the setting of device call
>>> fd. This could be acceptable, but I'm not sure about the implications:
>>> What happens if live migration fails and we need to step back? A mutex
>>> is not needed in this scenario, it's ok with atomics and RCU code.
>>>
>>> * Replace KVM_IRQFD instead and let SVQ poll the old one and masked
>>> notifier: I haven't thought a lot of this one, I think it's better to
>>> not touch guest notifiers.
>>> * Monitor also masked notifier from SVQ: I think this could be
>>> promising, but SVQ needs to be notified about masking/unmasking
>>> anyway, and there is code that depends on checking the masked notifier
>>> for the pending notification.
>>>
>>>>>> 2) vhost dev start and stop
>>>>>>
>>>>>> ?
>>>>>>
>>>>>>
>>>>>>> +     *
>>>>>>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
>>>>>>> +     */
>>>>>>> +    EventNotifier host_notifier;
>>>>>>> +
>>>>>>> +    /* Virtio queue shadowing */
>>>>>>> +    VirtQueue *vq;
>>>>>>>      } VhostShadowVirtqueue;
>>>>>>>
>>>>>>> +/* Forward guest notifications */
>>>>>>> +static void vhost_handle_guest_kick(EventNotifier *n)
>>>>>>> +{
>>>>>>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>>>>>> +                                             host_notifier);
>>>>>>> +
>>>>>>> +    if (unlikely(!event_notifier_test_and_clear(n))) {
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    event_notifier_set(&svq->kick_notifier);
>>>>>>> +}
>>>>>>> +
>>>>>>> +/*
>>>>>>> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
>>>>>>> + */
>>>>>>> +static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
>>>>>>> +                                                     unsigned vhost_index,
>>>>>>> +                                                     VhostShadowVirtqueue *svq)
>>>>>>> +{
>>>>>>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
>>>>>>> +    struct vhost_vring_file file = {
>>>>>>> +        .index = vhost_index,
>>>>>>> +        .fd = event_notifier_get_fd(vq_host_notifier),
>>>>>>> +    };
>>>>>>> +    int r;
>>>>>>> +
>>>>>>> +    /* Restore vhost kick */
>>>>>>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
>>>>>>> +    return r ? -errno : 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +/*
>>>>>>> + * Start shadow virtqueue operation.
>>>>>>> + * @dev vhost device
>>>>>>> + * @hidx vhost virtqueue index
>>>>>>> + * @svq Shadow Virtqueue
>>>>>>> + */
>>>>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
>>>>>>> +                           unsigned idx,
>>>>>>> +                           VhostShadowVirtqueue *svq)
>>>>>> It looks to me this assumes the vhost_dev is started before
>>>>>> vhost_shadow_vq_start()?
>>>>>>
>>>>> Right.
>>>> This might not be true. The guest may enable and disable virtio drivers after
>>>> the shadow virtqueue is started. You need to deal with that.
>>>>
>>> Right, I will test this scenario.
>>>
>>>> Thanks
>>>>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding
  2021-03-18  9:16               ` Jason Wang
@ 2021-03-18  9:54                 ` Eugenio Perez Martin
  0 siblings, 0 replies; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-18  9:54 UTC (permalink / raw)
  To: Jason Wang; +Cc: qemu-level

On Thu, Mar 18, 2021 at 10:22 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> > On 2021/3/18 4:06 PM, Eugenio Perez Martin wrote:
> > On Thu, Mar 18, 2021 at 4:14 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> > >> On 2021/3/17 10:38 PM, Eugenio Perez Martin wrote:
> >>> On Wed, Mar 17, 2021 at 3:51 AM Jason Wang <jasowang@redhat.com> wrote:
> > >>>> On 2021/3/17 12:05 AM, Eugenio Perez Martin wrote:
> >>>>> On Tue, Mar 16, 2021 at 9:15 AM Jason Wang <jasowang@redhat.com> wrote:
> > >>>>>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> >>>>>>> Initial version of shadow virtqueue that actually forwards buffers.
> >>>>>>>
> >>>>>>> It reuses the VirtQueue code for the device part. The driver part is
> >>>>>>> based on Linux's virtio_ring driver, but with stripped functionality
> >>>>>>> and optimizations so it's easier to review.
> >>>>>>>
> >>>>>>> These will be added in later commits.
> >>>>>>>
> >>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>> ---
> >>>>>>>      hw/virtio/vhost-shadow-virtqueue.c | 212 +++++++++++++++++++++++++++--
> >>>>>>>      hw/virtio/vhost.c                  | 113 ++++++++++++++-
> >>>>>>>      2 files changed, 312 insertions(+), 13 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>>>>>> index 1460d1d5d1..68ed0f2740 100644
> >>>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>>>>>> @@ -9,6 +9,7 @@
> >>>>>>>
> >>>>>>>      #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>>>>>      #include "hw/virtio/vhost.h"
> >>>>>>> +#include "hw/virtio/virtio-access.h"
> >>>>>>>
> >>>>>>>      #include "standard-headers/linux/vhost_types.h"
> >>>>>>>
> >>>>>>> @@ -55,11 +56,96 @@ typedef struct VhostShadowVirtqueue {
> >>>>>>>          /* Virtio device */
> >>>>>>>          VirtIODevice *vdev;
> >>>>>>>
> >>>>>>> +    /* Map for returning guest's descriptors */
> >>>>>>> +    VirtQueueElement **ring_id_maps;
> >>>>>>> +
> >>>>>>> +    /* Next head to expose to device */
> >>>>>>> +    uint16_t avail_idx_shadow;
> >>>>>>> +
> >>>>>>> +    /* Next free descriptor */
> >>>>>>> +    uint16_t free_head;
> >>>>>>> +
> >>>>>>> +    /* Last seen used idx */
> >>>>>>> +    uint16_t shadow_used_idx;
> >>>>>>> +
> >>>>>>> +    /* Next head to consume from device */
> >>>>>>> +    uint16_t used_idx;
> >>>>>>> +
> >>>>>>>          /* Descriptors copied from guest */
> >>>>>>>          vring_desc_t descs[];
> >>>>>>>      } VhostShadowVirtqueue;
> >>>>>>>
> >>>>>>> -/* Forward guest notifications */
> >>>>>>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >>>>>>> +                                    const struct iovec *iovec,
> >>>>>>> +                                    size_t num, bool more_descs, bool write)
> >>>>>>> +{
> >>>>>>> +    uint16_t i = svq->free_head, last = svq->free_head;
> >>>>>>> +    unsigned n;
> >>>>>>> +    uint16_t flags = write ? virtio_tswap16(svq->vdev, VRING_DESC_F_WRITE) : 0;
> >>>>>>> +    vring_desc_t *descs = svq->vring.desc;
> >>>>>>> +
> >>>>>>> +    if (num == 0) {
> >>>>>>> +        return;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    for (n = 0; n < num; n++) {
> >>>>>>> +        if (more_descs || (n + 1 < num)) {
> >>>>>>> +            descs[i].flags = flags | virtio_tswap16(svq->vdev,
> >>>>>>> +                                                    VRING_DESC_F_NEXT);
> >>>>>>> +        } else {
> >>>>>>> +            descs[i].flags = flags;
> >>>>>>> +        }
> >>>>>>> +        descs[i].addr = virtio_tswap64(svq->vdev, (hwaddr)iovec[n].iov_base);
> >>>>>> So using virtio_tswap() is probably not correct since we're talking
> >>>>>> with vhost backends which have their own endianness.
> >>>>>>
> >>>>> I was trying to expose the buffer with the same endianness as the
> >>>>> driver originally offered, so if guest->qemu requires a bswap, I think
> >>>>> it will always require a bswap again to expose to the device again.
> >>>> So this assumes vhost-vDPA always provides a non-transitional device [1].
> >>>>
> >>>> Then if QEMU presents a transitional device, we need to do the endian
> >>>> conversion when necessary; if QEMU presents a non-transitional device, we
> >>>> don't need to do that, the guest driver will do it for us.
> >>>>
> >>>> But it looks to me the virtio_tswap() can't be used for this since it:
> >>>>
> >>>> static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
> >>>> {
> >>>> #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
> >>>>        return virtio_is_big_endian(vdev);
> >>>> #elif defined(TARGET_WORDS_BIGENDIAN)
> >>>>        if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> >>>>            /* Devices conforming to VIRTIO 1.0 or later are always LE. */
> >>>>            return false;
> >>>>        }
> >>>>        return true;
> >>>> #else
> >>>>        return false;
> >>>> #endif
> >>>> }
> >>>>
> >>>> So if we present a legacy device on top of a non-transitional vDPA
> >>>> device, the VIRTIO_F_VERSION_1 check is wrong.
> >>>>
> >>>>
> >>>>>> For vhost-vDPA, we can assume that it's a 1.0 device.
> >>>>> Isn't it needed if the host is big endian?
> >>>> [1]
> >>>>
> >>>> So non-transitional device always assume little endian.
> >>>>
> >>>> For vhost-vDPA, we don't want to present transitional device which may
> >>>> end up with a lot of burdens.
> >>>>
> >>>> I suspect legacy drivers plus vhost-vDPA are already broken, so I plan to
> >>>> mandate VERSION_1 for all vDPA devices.
> >>>>
> >>> Right. I think it's the best then.
> >>>
> >>> However, then we will need a similar method to always expose
> >>> address/length as little endian, isn't it?
> >>
> >> Yes.
> >>
> >>
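> >> Since VERSION_1 is mandated, the ring is always little endian, so the
> >> plain helpers from qemu/bswap.h should be enough. A sketch:
> >>
> >> /* Unconditional LE stores for the shadow descriptors */
> >> descs[i].addr  = cpu_to_le64((hwaddr)(uintptr_t)iovec[n].iov_base);
> >> descs[i].len   = cpu_to_le32(iovec[n].iov_len);
> >> descs[i].flags = cpu_to_le16(flags);
> >>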
> >>>>>>> +        descs[i].len = virtio_tswap32(svq->vdev, iovec[n].iov_len);
> >>>>>>> +
> >>>>>>> +        last = i;
> >>>>>>> +        i = virtio_tswap16(svq->vdev, descs[i].next);
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    svq->free_head = virtio_tswap16(svq->vdev, descs[last].next);
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
> >>>>>>> +                                          VirtQueueElement *elem)
> >>>>>>> +{
> >>>>>>> +    int head;
> >>>>>>> +    unsigned avail_idx;
> >>>>>>> +    vring_avail_t *avail = svq->vring.avail;
> >>>>>>> +
> >>>>>>> +    head = svq->free_head;
> >>>>>>> +
> >>>>>>> +    /* We need some descriptors here */
> >>>>>>> +    assert(elem->out_num || elem->in_num);
> >>>>>>> +
> >>>>>>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> >>>>>>> +                            elem->in_num > 0, false);
> >>>>>>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> >>>>>>> +
> >>>>>>> +    /*
> >>>>>>> +     * Put entry in available array (but don't update avail->idx until they
> >>>>>>> +     * do sync).
> >>>>>>> +     */
> >>>>>>> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> >>>>>>> +    avail->ring[avail_idx] = virtio_tswap16(svq->vdev, head);
> >>>>>>> +    svq->avail_idx_shadow++;
> >>>>>>> +
> >>>>>>> +    /* Expose descriptors to device */
> >>>>>>> +    smp_wmb();
> >>>>>>> +    avail->idx = virtio_tswap16(svq->vdev, svq->avail_idx_shadow);
> >>>>>>> +
> >>>>>>> +    return head;
> >>>>>>> +
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
> >>>>>>> +                                VirtQueueElement *elem)
> >>>>>>> +{
> >>>>>>> +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
> >>>>>>> +
> >>>>>>> +    svq->ring_id_maps[qemu_head] = elem;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +/* Handle guest->device notifications */
> >>>>>>>      static void vhost_handle_guest_kick(EventNotifier *n)
> >>>>>>>      {
> >>>>>>>          VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>>>>>> @@ -69,7 +155,72 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >>>>>>>              return;
> >>>>>>>          }
> >>>>>>>
> >>>>>>> -    event_notifier_set(&svq->kick_notifier);
> >>>>>>> +    /* Make available as many buffers as possible */
> >>>>>>> +    do {
> >>>>>>> +        if (virtio_queue_get_notification(svq->vq)) {
> >>>>>>> +            /* No more notifications until process all available */
> >>>>>>> +            virtio_queue_set_notification(svq->vq, false);
> >>>>>>> +        }
> >>>>>>> +
> >>>>>>> +        while (true) {
> >>>>>>> +            VirtQueueElement *elem;
> >>>>>>> +            if (virtio_queue_full(svq->vq)) {
> >>>>>>> +                break;
> >>>>>> So we've disabled guest notification. If a buffer has been consumed, we
> >>>>>> need to retry the handle_guest_kick here. But I didn't find the code?
> >>>>>>
> >>>>> This code follows the pattern of virtio_blk_handle_vq: we jump out of
> >>>>> the inner while, and we re-enable the notifications. After that, we
> >>>>> check for updates on guest avail_idx.
> >>>> Ok, but this will end up with a lot of unnecessary kicks without event
> >>>> index.
> >>>>
> >>> I can move the kick out of the inner loop, but that could add latency.
> >>
> >> So I think the right way is to disable the notification until some
> >> buffers are consumed by used ring.
> >>
> > I'm not sure if you mean:
> >
> > a) To limit the maximum amount of buffers that can be available in
> > Shadow Virtqueue at the same time.
> >
> > As I can see, the easiest way to do this would be to unregister
> > vhost_handle_guest_kick from the event loop and let
> > vhost_shadow_vq_handle_call re-register it at some threshold of
> > available buffers (see the sketch below).
> >
> > I'm not sure what this limit should be, but it seems wasteful to
> > me not to fill the shadow virtqueue naturally.
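> >
> > Just to illustrate (a), a sketch; svq_free_entries() and
> > SVQ_RESUME_THRESHOLD are made up:
> >
> > /* Stop listening to guest kicks while the shadow vq is full */
> > if (virtio_queue_full(svq->vq)) {
> >     event_notifier_set_handler(&svq->host_notifier, NULL);
> > }
> >
> > /* and from vhost_shadow_vq_handle_call, once enough entries are
> >  * freed, listen to the guest again: */
> > if (svq_free_entries(svq) >= SVQ_RESUME_THRESHOLD) {
> >     event_notifier_set_handler(&svq->host_notifier,
> >                                vhost_handle_guest_kick);
> > }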
>
>
> Yes, and I'm not sure how much we could gain from this extra complexity.
>
>
> >
> > b) To limit the amount of buffers that vhost_handle_guest_kick
> > forwards to shadow virtqueue in one call.
> >
> > This already has a natural limit of the queue size, since the buffers
> > will not be consumed (as forwarded-to-guest) by qemu while this
> > function is running. This limit could be reduced and
> > vhost_handle_guest_kick could re-enqueue itself if it's not reached.
> > Same as previous, I'm not sure what the right limit is, but
> > vhost_handle_guest_kick will not make available more than the queue size.
>
>
> Yes, so using the queue size is how the code works currently and it should
> be fine if we know svq and vq are the same size. We can leave the kick
> notification for the future (I guess at least for networking devices,
> hitting virtio_queue_full() should be rare).
>

It happens in the rx queue. I think it could also happen in the tx
queue under some conditions, but I didn't test.

> It will be a real issue if svq and vq don't have the same size, but
> we can also leave this for the future.
>
>
> >
> > c) To kick every N buffers made available, instead of N=1.
> >
> > I think this is not the solution you are proposing, but maybe it is
> > simpler than the previous ones.
> >
> >>>>>>> +            }
> >>>>>>> +
> >>>>>>> +            elem = virtqueue_pop(svq->vq, sizeof(*elem));
> >>>>>>> +            if (!elem) {
> >>>>>>> +                break;
> >>>>>>> +            }
> >>>>>>> +
> >>>>>>> +            vhost_shadow_vq_add(svq, elem);
> >>>>>>> +            event_notifier_set(&svq->kick_notifier);
> >>>>>>> +        }
> >>>>>>> +
> >>>>>>> +        virtio_queue_set_notification(svq->vq, true);
> >>>>>>> +    } while (!virtio_queue_empty(svq->vq));
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
> >>>>>>> +{
> >>>>>>> +    if (svq->used_idx != svq->shadow_used_idx) {
> >>>>>>> +        return true;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    /* Get used idx must not be reordered */
> >>>>>>> +    smp_rmb();
> >>>>>>> +    svq->shadow_used_idx = virtio_tswap16(svq->vdev, svq->vring.used->idx);
> >>>>>>> +
> >>>>>>> +    return svq->used_idx != svq->shadow_used_idx;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
> >>>>>>> +{
> >>>>>>> +    vring_desc_t *descs = svq->vring.desc;
> >>>>>>> +    const vring_used_t *used = svq->vring.used;
> >>>>>>> +    vring_used_elem_t used_elem;
> >>>>>>> +    uint16_t last_used;
> >>>>>>> +
> >>>>>>> +    if (!vhost_shadow_vq_more_used(svq)) {
> >>>>>>> +        return NULL;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    last_used = svq->used_idx & (svq->vring.num - 1);
> >>>>>>> +    used_elem.id = virtio_tswap32(svq->vdev, used->ring[last_used].id);
> >>>>>>> +    used_elem.len = virtio_tswap32(svq->vdev, used->ring[last_used].len);
> >>>>>>> +
> >>>>>>> +    if (unlikely(used_elem.id >= svq->vring.num)) {
> >>>>>>> +        error_report("Device %s says index %u is available", svq->vdev->name,
> >>>>>>> +                     used_elem.id);
> >>>>>>> +        return NULL;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    descs[used_elem.id].next = svq->free_head;
> >>>>>>> +    svq->free_head = used_elem.id;
> >>>>>>> +
> >>>>>>> +    svq->used_idx++;
> >>>>>>> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> >>>>>>> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> >>>>>>>      }
> >>>>>>>
> >>>>>>>      /* Forward vhost notifications */
> >>>>>>> @@ -78,6 +229,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >>>>>>>          VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>>>>>>                                                   call_notifier);
> >>>>>>>          EventNotifier *masked_notifier;
> >>>>>>> +    VirtQueue *vq = svq->vq;
> >>>>>>>
> >>>>>>>          /* Signal start of using masked notifier */
> >>>>>>>          qemu_event_reset(&svq->masked_notifier.is_free);
> >>>>>>> @@ -86,14 +238,29 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >>>>>>>              qemu_event_set(&svq->masked_notifier.is_free);
> >>>>>>>          }
> >>>>>>>
> >>>>>>> -    if (!masked_notifier) {
> >>>>>>> -        unsigned n = virtio_get_queue_index(svq->vq);
> >>>>>>> -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
> >>>>>>> -        virtio_notify_irqfd(svq->vdev, svq->vq);
> >>>>>>> -    } else if (!svq->masked_notifier.signaled) {
> >>>>>>> -        svq->masked_notifier.signaled = true;
> >>>>>>> -        event_notifier_set(svq->masked_notifier.n);
> >>>>>>> -    }
> >>>>>>> +    /* Make as many buffers as possible used. */
> >>>>>>> +    do {
> >>>>>>> +        unsigned i = 0;
> >>>>>>> +
> >>>>>>> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> >>>>>>> +        while (true) {
> >>>>>>> +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
> >>>>>>> +            if (!elem) {
> >>>>>>> +                break;
> >>>>>>> +            }
> >>>>>>> +
> >>>>>>> +            assert(i < svq->vring.num);
> >>>>>>> +            virtqueue_fill(vq, elem, elem->len, i++);
> >>>>>>> +        }
> >>>>>>> +
> >>>>>>> +        virtqueue_flush(vq, i);
> >>>>>>> +        if (!masked_notifier) {
> >>>>>>> +            virtio_notify_irqfd(svq->vdev, svq->vq);
> >>>>>>> +        } else if (!svq->masked_notifier.signaled) {
> >>>>>>> +            svq->masked_notifier.signaled = true;
> >>>>>>> +            event_notifier_set(svq->masked_notifier.n);
> >>>>>>> +        }
> >>>>>>> +    } while (vhost_shadow_vq_more_used(svq));
> >>>>>>>
> >>>>>>>          if (masked_notifier) {
> >>>>>>>              /* Signal not using it anymore */
> >>>>>>> @@ -103,7 +270,6 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >>>>>>>
> >>>>>>>      static void vhost_shadow_vq_handle_call(EventNotifier *n)
> >>>>>>>      {
> >>>>>>> -
> >>>>>>>          if (likely(event_notifier_test_and_clear(n))) {
> >>>>>>>              vhost_shadow_vq_handle_call_no_test(n);
> >>>>>>>          }
> >>>>>>> @@ -254,7 +420,11 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >>>>>>>                                unsigned idx,
> >>>>>>>                                VhostShadowVirtqueue *svq)
> >>>>>>>      {
> >>>>>>> +    int i;
> >>>>>>>          int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
> >>>>>>> +
> >>>>>>> +    assert(!dev->shadow_vqs_enabled);
> >>>>>>> +
> >>>>>>>          if (unlikely(r < 0)) {
> >>>>>>>              error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> >>>>>>>          }
> >>>>>>> @@ -272,6 +442,18 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >>>>>>>          /* Restore vhost call */
> >>>>>>>          vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
> >>>>>>>                               dev->vqs[idx].notifier_is_masked);
> >>>>>>> +
> >>>>>>> +
> >>>>>>> +    for (i = 0; i < svq->vring.num; ++i) {
> >>>>>>> +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> >>>>>>> +        /*
> >>>>>>> +         * Although the doc says we must unpop in order, it's ok to unpop
> >>>>>>> +         * everything.
> >>>>>>> +         */
> >>>>>>> +        if (elem) {
> >>>>>>> +            virtqueue_unpop(svq->vq, elem, elem->len);
> >>>>>> Shouldn't we wait until all pending requests are drained? Or
> >>>>>> we may end up with duplicated requests?
> >>>>>>
> >>>>> Do you mean pending as in-flight/processing in the device? The device
> >>>>> must be paused at this point.
> >>>> Ok. I see there's a vhost_set_vring_enable(dev, false) in
> >>>> vhost_sw_live_migration_start().
> >>>>
> >>>>
> >>>>> Currently there is no assertion for
> >>>>> this, maybe we can track the device status for it.
> >>>>>
> >>>>> For the queue handlers to be running at this point, the main event
> >>>>> loop should serialize QMP and handlers as far as I know (and they
> >>>>> would make all state inconsistent if the device stops suddenly). It
> >>>>> would need to be synchronized if the handlers run in their own AIO
> >>>>> context. That would be nice to have but it's not included here.
> >>>> That's why I suggest just dropping the QMP stuff and using cli parameters
> >>>> to enable the shadow virtqueue. Things would be greatly simplified I guess.
> >>>>
> >>> I can send a series without it, but SVQ will need to be able to kick
> >>> in dynamically sooner or later if we want to use it for live
> >>> migration.
> >>
> >> I'm not sure I get the issue here. My understanding is everything will be
> >> processed in the same aio context.
> >>
> > What I meant is that QMP allows us to activate the shadow virtqueue
> > mode in any moment, similar to how live migration would activate it.
>
>
> I get you.
>
>
> > To enable SVQ with a command line would imply that it runs the same
> > way for all the time qemu runs.
>
>
> Ok.
>
>
> >
> > If we do that way, we don't need more synchronization, since we have
> > deleted the event that could run concurrently with the masking. But
> > this synchronization will be needed if we want to enable SVQ
> > dynamically for live migration, so we are "just" delaying work.
> >
> > However, if we add vdpa iova range to this patch series, I think it
> > would be a good idea to delay that synchronization work to future
> > series, so they are smaller and the first one can be tested better.
>
>
> Yes, that's why I think we can start from the simple case, e.g. to let the
> shadow virtqueue logic run. Then we can consider adding synchronization
> in the future.
>
> I guess things like a mutex or bh might help; it would be easier to
> add those things on top.
>

Continuing this topic in patch 05/13, since both have converged.

> Thanks
>
>
> >
> >> Thanks
> >>
> >>
> >>>> Thanks
> >>>>
> >>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>>
> >>>>>>> +        }
> >>>>>>> +    }
> >>>>>>>      }
> >>>>>>>
> >>>>>>>      /*
> >>>>>>> @@ -284,7 +466,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >>>>>>>          unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> >>>>>>>          size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
> >>>>>>>          g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
> >>>>>>> -    int r;
> >>>>>>> +    int r, i;
> >>>>>>>
> >>>>>>>          r = event_notifier_init(&svq->kick_notifier, 0);
> >>>>>>>          if (r != 0) {
> >>>>>>> @@ -303,6 +485,11 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> >>>>>>>          vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
> >>>>>>>          svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> >>>>>>>          svq->vdev = dev->vdev;
> >>>>>>> +    for (i = 0; i < num - 1; i++) {
> >>>>>>> +        svq->descs[i].next = virtio_tswap16(dev->vdev, i + 1);
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> >>>>>>>          event_notifier_set_handler(&svq->call_notifier,
> >>>>>>>                                     vhost_shadow_vq_handle_call);
> >>>>>>>          qemu_event_init(&svq->masked_notifier.is_free, true);
> >>>>>>> @@ -324,5 +511,6 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
> >>>>>>>          event_notifier_cleanup(&vq->kick_notifier);
> >>>>>>>          event_notifier_set_handler(&vq->call_notifier, NULL);
> >>>>>>>          event_notifier_cleanup(&vq->call_notifier);
> >>>>>>> +    g_free(vq->ring_id_maps);
> >>>>>>>          g_free(vq);
> >>>>>>>      }
> >>>>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >>>>>>> index eab3e334f2..a373999bc4 100644
> >>>>>>> --- a/hw/virtio/vhost.c
> >>>>>>> +++ b/hw/virtio/vhost.c
> >>>>>>> @@ -1021,6 +1021,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
> >>>>>>>
> >>>>>>>          trace_vhost_iotlb_miss(dev, 1);
> >>>>>>>
> >>>>>>> +    if (qatomic_load_acquire(&dev->shadow_vqs_enabled)) {
> >>>>>>> +        uaddr = iova;
> >>>>>>> +        len = 4096;
> >>>>>>> +        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
> >>>>>>> +                                                IOMMU_RW);
> >>>>>>> +        if (ret) {
> >>>>>>> +            trace_vhost_iotlb_miss(dev, 2);
> >>>>>>> +            error_report("Fail to update device iotlb");
> >>>>>>> +        }
> >>>>>>> +
> >>>>>>> +        return ret;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>>          iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
> >>>>>>>                                                iova, write,
> >>>>>>>                                                MEMTXATTRS_UNSPECIFIED);
> >>>>>>> @@ -1227,8 +1240,28 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> >>>>>>>          /* Can be read by vhost_virtqueue_mask, from vm exit */
> >>>>>>>          qatomic_store_release(&dev->shadow_vqs_enabled, false);
> >>>>>>>
> >>>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
> >>>>>>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
> >>>>>>> +        error_report("Fail to invalidate device iotlb");
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>>          for (idx = 0; idx < dev->nvqs; ++idx) {
> >>>>>>> +        /*
> >>>>>>> +         * Update used ring information for IOTLB to work correctly,
> >>>>>>> +         * vhost-kernel code requires for this.
> >>>>>>> +         */
> >>>>>>> +        struct vhost_virtqueue *vq = dev->vqs + idx;
> >>>>>>> +        vhost_device_iotlb_miss(dev, vq->used_phys, true);
> >>>>>>> +
> >>>>>>>              vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
> >>>>>>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
> >>>>>>> +                              dev->vq_index + idx);
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    /* Enable guest's vq vring */
> >>>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >>>>>>> +
> >>>>>>> +    for (idx = 0; idx < dev->nvqs; ++idx) {
> >>>>>>>              vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> >>>>>>>          }
> >>>>>>>
> >>>>>>> @@ -1237,6 +1270,59 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
> >>>>>>>          return 0;
> >>>>>>>      }
> >>>>>>>
> >>>>>>> +/*
> >>>>>>> + * Start shadow virtqueue in a given queue.
> >>>>>>> + * In failure case, this function leaves queue working as regular vhost mode.
> >>>>>>> + */
> >>>>>>> +static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
> >>>>>>> +                                             unsigned idx)
> >>>>>>> +{
> >>>>>>> +    struct vhost_vring_addr addr = {
> >>>>>>> +        .index = idx,
> >>>>>>> +    };
> >>>>>>> +    struct vhost_vring_state s = {
> >>>>>>> +        .index = idx,
> >>>>>>> +    };
> >>>>>>> +    int r;
> >>>>>>> +    bool ok;
> >>>>>>> +
> >>>>>>> +    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
> >>>>>>> +    ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> >>>>>>> +    if (unlikely(!ok)) {
> >>>>>>> +        return false;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    /* From this point, vhost_virtqueue_start can reset these changes */
> >>>>>>> +    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
> >>>>>>> +    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
> >>>>>>> +    if (unlikely(r != 0)) {
> >>>>>>> +        VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
> >>>>>>> +        goto err;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
> >>>>>>> +    if (unlikely(r != 0)) {
> >>>>>>> +        VHOST_OPS_DEBUG("vhost_set_vring_base for shadow vq failed");
> >>>>>>> +        goto err;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    /*
> >>>>>>> +     * Update used ring information for IOTLB to work correctly,
> >>>>>>> +     * vhost-kernel code requires for this.
> >>>>>>> +     */
> >>>>>>> +    r = vhost_device_iotlb_miss(dev, addr.used_user_addr, true);
> >>>>>>> +    if (unlikely(r != 0)) {
> >>>>>>> +        /* Debug message already printed */
> >>>>>>> +        goto err;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    return true;
> >>>>>>> +
> >>>>>>> +err:
> >>>>>>> +    vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
> >>>>>>> +    return false;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>>      static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> >>>>>>>      {
> >>>>>>>          int idx, stop_idx;
> >>>>>>> @@ -1249,24 +1335,35 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
> >>>>>>>              }
> >>>>>>>          }
> >>>>>>>
> >>>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, false);
> >>>>>>> +    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
> >>>>>>> +        error_report("Fail to invalidate device iotlb");
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>>          /* Can be read by vhost_virtqueue_mask, from vm exit */
> >>>>>>>          qatomic_store_release(&dev->shadow_vqs_enabled, true);
> >>>>>>>          for (idx = 0; idx < dev->nvqs; ++idx) {
> >>>>>>> -        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
> >>>>>>> +        bool ok = vhost_sw_live_migration_start_vq(dev, idx);
> >>>>>>>              if (unlikely(!ok)) {
> >>>>>>>                  goto err_start;
> >>>>>>>              }
> >>>>>>>          }
> >>>>>>>
> >>>>>>> +    /* Enable shadow vq vring */
> >>>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >>>>>>>          return 0;
> >>>>>>>
> >>>>>>>      err_start:
> >>>>>>>          qatomic_store_release(&dev->shadow_vqs_enabled, false);
> >>>>>>>          for (stop_idx = 0; stop_idx < idx; stop_idx++) {
> >>>>>>>              vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
> >>>>>>> +        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
> >>>>>>> +                              dev->vq_index + stop_idx);
> >>>>>>>          }
> >>>>>>>
> >>>>>>>      err_new:
> >>>>>>> +    /* Enable guest's vring */
> >>>>>>> +    dev->vhost_ops->vhost_set_vring_enable(dev, true);
> >>>>>>>          for (idx = 0; idx < dev->nvqs; ++idx) {
> >>>>>>>              vhost_shadow_vq_free(dev->shadow_vqs[idx]);
> >>>>>>>          }
> >>>>>>> @@ -1970,6 +2067,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >>>>>>>
> >>>>>>>              if (!hdev->started) {
> >>>>>>>                  err_cause = "Device is not started";
> >>>>>>> +        } else if (!vhost_dev_has_iommu(hdev)) {
> >>>>>>> +            err_cause = "Does not support iommu";
> >>>>>>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
> >>>>>>> +            err_cause = "Is packed";
> >>>>>>> +        } else if (hdev->acked_features & BIT_ULL(VIRTIO_RING_F_EVENT_IDX)) {
> >>>>>>> +            err_cause = "Have event idx";
> >>>>>>> +        } else if (hdev->acked_features &
> >>>>>>> +                   BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC)) {
> >>>>>>> +            err_cause = "Supports indirect descriptors";
> >>>>>>> +        } else if (!hdev->vhost_ops->vhost_set_vring_enable) {
> >>>>>>> +            err_cause = "Cannot pause device";
> >>>>>>> +        }
> >>>>>>> +
> >>>>>>> +        if (err_cause) {
> >>>>>>>                  goto err;
> >>>>>>>              }
> >>>>>>>
> >
>
>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-18  9:29               ` Jason Wang
@ 2021-03-18 10:48                 ` Eugenio Perez Martin
  2021-03-18 12:04                   ` Eugenio Perez Martin
  0 siblings, 1 reply; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-18 10:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Thu, Mar 18, 2021 at 10:29 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/18 5:18 PM, Eugenio Perez Martin wrote:
> > On Thu, Mar 18, 2021 at 4:11 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/18 12:47 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Mar 17, 2021 at 3:05 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/3/16 6:31 PM, Eugenio Perez Martin wrote:
> >>>>> On Tue, Mar 16, 2021 at 8:18 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> >>>>>>> Shadow virtqueue notification forwarding is disabled when vhost_dev
> >>>>>>> stops, so the code flow follows the usual cleanup.
> >>>>>>>
> >>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>> ---
> >>>>>>>      hw/virtio/vhost-shadow-virtqueue.h |   7 ++
> >>>>>>>      include/hw/virtio/vhost.h          |   4 +
> >>>>>>>      hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
> >>>>>>>      hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
> >>>>>>>      4 files changed, 265 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> >>>>>>> index 6cc18d6acb..c891c6510d 100644
> >>>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
> >>>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> >>>>>>> @@ -17,6 +17,13 @@
> >>>>>>>
> >>>>>>>      typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >>>>>>>
> >>>>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> >>>>>>> +                           unsigned idx,
> >>>>>>> +                           VhostShadowVirtqueue *svq);
> >>>>>>> +void vhost_shadow_vq_stop(struct vhost_dev *dev,
> >>>>>>> +                          unsigned idx,
> >>>>>>> +                          VhostShadowVirtqueue *svq);
> >>>>>>> +
> >>>>>>>      VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
> >>>>>>>
> >>>>>>>      void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
> >>>>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> >>>>>>> index ac963bf23d..7ffdf9aea0 100644
> >>>>>>> --- a/include/hw/virtio/vhost.h
> >>>>>>> +++ b/include/hw/virtio/vhost.h
> >>>>>>> @@ -55,6 +55,8 @@ struct vhost_iommu {
> >>>>>>>          QLIST_ENTRY(vhost_iommu) iommu_next;
> >>>>>>>      };
> >>>>>>>
> >>>>>>> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >>>>>>> +
> >>>>>>>      typedef struct VhostDevConfigOps {
> >>>>>>>          /* Vhost device config space changed callback
> >>>>>>>           */
> >>>>>>> @@ -83,7 +85,9 @@ struct vhost_dev {
> >>>>>>>          uint64_t backend_cap;
> >>>>>>>          bool started;
> >>>>>>>          bool log_enabled;
> >>>>>>> +    bool shadow_vqs_enabled;
> >>>>>>>          uint64_t log_size;
> >>>>>>> +    VhostShadowVirtqueue **shadow_vqs;
> >>>>>> Any reason that you don't embed the shadow virtqueue into
> >>>>>> vhost_virtqueue structure?
> >>>>>>
> >>>>> Not really, it could be relatively big and I would prefer SVQ
> >>>>> members/methods to remain hidden from any other part that includes
> >>>>> vhost.h. But it could be changed, for sure.
> >>>>>
> >>>>>> (Note that there's a masked_notifier in struct vhost_virtqueue).
> >>>>>>
> >>>>> They are used differently: in SVQ the masked notifier is a pointer,
> >>>>> and if it's NULL the SVQ code knows that device is not masked. The
> >>>>> vhost_virtqueue is the real owner.
> >>>> Yes, but it's an example of embedding auxiliary data structures in the
> >>>> vhost_virtqueue.
> >>>>
> >>>>
> >>>>> It could be replaced by a boolean in SVQ or something like that, I
> >>>>> experimented with a tri-state too (UNMASKED, MASKED, MASKED_NOTIFIED)
> >>>>> and let the vhost.c code manage all the transitions. But I find the
> >>>>> pointer use clearer, since it's more natural for the existing
> >>>>> vhost_virtqueue_mask and vhost_virtqueue_pending functions.
> >>>>>
> >>>>> This masking/unmasking is the part I dislike the most from this
> >>>>> series, so I'm very open to alternatives.
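> >>>>>
> >>>>> The tri-state was shaped roughly like this, for reference:
> >>>>>
> >>>>> typedef enum SVQMaskState {
> >>>>>     SVQ_UNMASKED,         /* notify the guest via irqfd */
> >>>>>     SVQ_MASKED,           /* masked, nothing signalled yet */
> >>>>>     SVQ_MASKED_NOTIFIED,  /* masked notifier already set once */
> >>>>> } SVQMaskState;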
> >>>> See below. I think we don't even need to care about that.
> >>>>
> >>>>
> >>>>>>>          Error *migration_blocker;
> >>>>>>>          const VhostOps *vhost_ops;
> >>>>>>>          void *opaque;
> >>>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>>>>>> index 4512e5b058..3e43399e9c 100644
> >>>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>>>>>> @@ -8,9 +8,12 @@
> >>>>>>>       */
> >>>>>>>
> >>>>>>>      #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>>>>> +#include "hw/virtio/vhost.h"
> >>>>>>> +
> >>>>>>> +#include "standard-headers/linux/vhost_types.h"
> >>>>>>>
> >>>>>>>      #include "qemu/error-report.h"
> >>>>>>> -#include "qemu/event_notifier.h"
> >>>>>>> +#include "qemu/main-loop.h"
> >>>>>>>
> >>>>>>>      /* Shadow virtqueue to relay notifications */
> >>>>>>>      typedef struct VhostShadowVirtqueue {
> >>>>>>> @@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
> >>>>>>>          EventNotifier kick_notifier;
> >>>>>>>          /* Shadow call notifier, sent to vhost */
> >>>>>>>          EventNotifier call_notifier;
> >>>>>>> +
> >>>>>>> +    /*
> >>>>>>> +     * Borrowed virtqueue's guest to host notifier.
> >>>>>>> +     * To borrow it in this event notifier allows to register on the event
> >>>>>>> +     * loop and access the associated shadow virtqueue easily. If we use the
> >>>>>>> +     * VirtQueue, we don't have an easy way to retrieve it.
> >>>>>> So this is something that worries me. It looks like a layer violation
> >>>>>> that makes the code harder to get right.
> >>>>>>
> >>>>> I don't follow you here.
> >>>>>
> >>>>> The vhost code already depends on virtqueue in the same sense:
> >>>>> virtio_queue_get_host_notifier is called on vhost_virtqueue_start. So
> >>>>> if this behavior ever changes it is unlikely for vhost to keep working
> >>>>> without changes. vhost_virtqueue has a kick/call int where I think it
> >>>>> should be stored actually, but they are never used as far as I see.
> >>>>>
> >>>>> Previous RFC did rely on vhost_dev_disable_notifiers. From its documentation:
> >>>>> /* Stop processing guest IO notifications in vhost.
> >>>>>     * Start processing them in qemu.
> >>>>>     ...
> >>>>> But it was easier for this mode to miss a notification, since they
> >>>>> create a new host_notifier in virtio_bus_set_host_notifier right away.
> >>>>> So I decided to use the file descriptor already sent to vhost in
> >>>>> regular operation mode, so guest-related resources change less.
> >>>>>
> >>>>> Having said that, maybe it's useful to assert that
> >>>>> vhost_dev_{enable,disable}_notifiers are never called on shadow
> >>>>> virtqueue mode. Also, it could be useful to retrieve it from
> >>>>> virtio_bus, not raw shadow virtqueue, so all get/set are performed
> >>>>> from it. Would that make more sense?
> >>>>>
> >>>>>> I wonder if it would be simpler to start from a vDPA dedicated shadow
> >>>>>> virtqueue implementation:
> >>>>>>
> >>>>>> 1) have the above fields embeded in vhost_vdpa structure
> >>>>>> 2) Work at the level of
> >>>>>> vhost_vdpa_set_vring_kick()/vhost_vdpa_set_vring_call()
> >>>>>>
> >>>>> This notifier is never sent to the device in shadow virtqueue mode.
> >>>>> It's for SVQ to react to guest's notifications, registering it on its
> >>>>> main event loop [1]. So if I perform these changes the way I
> >>>>> understand them, SVQ would still rely on this borrowed EventNotifier,
> >>>>> and it would send to the vDPA device the newly created kick_notifier
> >>>>> of VhostShadowVirtqueue.
> >>>> The point is that vhost code should be coupled loosely with virtio. If
> >>>> you try to "borrow" EventNotifier from virtio, you need to deal with a
> >>>> lot of synchronization. An example is the masking stuff.
> >>>>
> >>> I still don't follow this, sorry.
> >>>
> >>> The svq->host_notifier event notifier is not affected by the masking
> >>> issue, it is completely private to SVQ. This commit creates and uses
> >>> it, and nothing related to masking is touched until the next commit.
> >>>
> >>>>>> Then the layer is still isolated and you have a much simpler context to
> >>>>>> work that you don't need to care a lot of synchornization:
> >>>>>>
> >>>>>> 1) vq masking
> >>>>> This EventNotifier is not used for masking, it does not change from
> >>>>> the start of the shadow virtqueue operation through its end. Call fd
> >>>>> sent to vhost/vdpa device does not change either in shadow virtqueue
> >>>>> mode operation with masking/unmasking. I will try to document it
> >>>>> better.
> >>>>>
> >>>>> I think that we will need to handle synchronization with
> >>>>> masking/unmasking from the guest and dynamically enabling SVQ
> >>>>> operation mode, since they can happen at the same time as long as we
> >>>>> let the guest run. There may be better ways of synchronizing them of
> >>>>> course, but I don't see how moving to the vhost-vdpa backend helps
> >>>>> with this. Please expand if I've missed it.
> >>>>>
> >>>>> Or do you mean to forbid regular <-> SVQ operation mode transitions and delay it
> >>>>> to future patchsets?
> >>>> So my idea is to do all the shadow virtqueue in the vhost-vDPA codes and
> >>>> hide them from the upper layers like virtio. This means it works at
> >>>> vhost level which can see vhost_vring_file only. When enabled, what it
> >>>> needs is just:
> >>>>
> >>>> 1) switch to use svq kickfd and relay ioeventfd to svq kickfd
> >>>> 2) switch to use svq callfd and relay svq callfd to irqfd
> >>>>
> >>>> It will still behave like a vhost backend where the switching is done
> >>>> internally in vhost-vDPA, totally transparent to the virtio
> >>>> code of QEMU.
> >>>>
> >>>> E.g:
> >>>>
> >>>> 1) in the case of guest notifier masking, we don't need to do anything
> >>>> since the virtio code will swap in another irqfd for us.
> >>> Assuming that we don't modify vhost masking code, but send shadow
> >>> virtqueue call descriptor to the vhost device:
> >>>
> >>> If the guest virtio code masks the virtqueue and replaces the vhost-vdpa
> >>> device call fd (VhostShadowVirtqueue.call_notifier in the next commit,
> >>> or the descriptor in your previous second point, svq callfd) with the
> >>> masked notifier, vhost_shadow_vq_handle_call will not be called
> >>> anymore, and no more used descriptors will be forwarded. They will be
> >>> stuck in the shadow virtqueue forever. The guest itself cannot recover
> >>> from this situation, since masking will set the irqfd, not the SVQ call fd.
> >>
> >> Just to make sure we're on the same page. During vq masking, the virtio
> >> code actually uses the masked_notifier as callfd in vhost_virtqueue_mask():
> >>
> >>       if (mask) {
> >>           assert(vdev->use_guest_notifier_mask);
> >>           file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
> >>       } else {
> >>           file.fd = event_notifier_get_fd(virtio_queue_get_guest_notifier(vvq));
> >>       }
> >>
> >>       file.index = hdev->vhost_ops->vhost_get_vq_index(hdev, n);
> >>       r = hdev->vhost_ops->vhost_set_vring_call(hdev, &file);
> >>
> >> So consider the shadow virtqueue is done at vhost-vDPA. We just need to
> >> make sure
> >>
> >> 1) update the callfd which is passed by the virtio layer via set_vring_call()
> >> 2) always write to the callfd during vhost_shadow_vq_handle_call()
> >>
> >> Then
> >>
> >> 3) When shadow vq is enabled, we just set the callfd of shadow virtqueue
> >> to vDPA via VHOST_SET_VRING_CALL, and poll the svq callfd
> >> 4) When shadow vq is disabled, we just set the callfd that is passed by
> >> virtio via VHOST_SET_VRING_CALL, and stop polling the svq callfd
> >>
> >> So you can see in steps 2 and 4, we don't need to know whether or not the
> >> vq is masked since we follow the vhost protocol "VhostOps" and do
> >> everything transparently in the vhost-(vDPA) layer.
> >>
> > All of this assumes that we can enable/disable SVQ dynamically while
> > the device is running. If that's not the case, there is no need for the
> > mutex, either in the vhost.c code or in vdpa_backend.
> >
> > As I see it, the issue is that steps (2) and (4) happen in different
> > threads: (2) is in a vCPU vmexit, and (4) is in the main event loop.
> > Consider unmasking and disabling SVQ at the same time with no mutex:
> >
> > vCPU vmexit thread                       aio thread
> > (unmask)                                 (stops SVQ)
> > |                                        |
> > |                                        // Last callfd set was masked_notifier
> > |                                        vdpa_backend.callfd = \
> > |                                            atomic_read(masked_notifier)
> > |                                        |
> > vhost_set_vring_call(vq.guest_notifier)  |
> > -> vdpa_backend.callfd = \               |
> >        vq.guest_notifier                 |
> > |                                        |
> > |                                        ioctl(vdpa, VHOST_SET_VRING_CALL,
> > |                                              vdpa_backend.callfd)
> > |                                        |
> > // guest expects more interrupts, but
> > // device just set masked
> >
> > And vhost_set_vring_call could even run entirely while the ioctl is
> > being executed.
> >
> > So that is the reason for the mutex: vdpa_backend.call_fd and the
> > ioctl VHOST_SET_VRING_CALL must be serialized. I'm ok with moving it to
> > the vdpa backend, but it's the same code, just in vdpa_backend.c instead
> > of vhost.c, so it becomes less generic in my opinion.
>
>
> You are right. But let's consider if we can avoid the dedicated mutex.
>
> E.g. can we use the BQL? Basically we need to synchronize with the iothread.
>
> Or is it possible to schedule a bh so things are serialized automatically?
>

I tried RCU with no success, and I think the same issues apply to bh.
I will try to explain the best I can what I achieved in the past, and
why I discarded it. I will explore BQL approaches, it could be simpler
that way actually.

The hard part to achieve is that no notification can be forwarded to
the guest once the masking vmexit returns (isn't it?). The unmasking scenario
is easy with RCU, since a pending notification could reach the guest
asynchronously if it exists.

On the other hand, whatever the guest sets should take priority over
whatever shadow_vq_stop sets.

With RCU, the problem is that the synchronization point should be the
vmexit thread. The function vhost_virtqueue_mask is already called
within RCU, so it could happen that the RCU lock is held after the function
returns, so the effective masking could happen after the vmexit returns. I
see no way to make something like "call_rcu but only in this thread"
or, in other words, "rcu_synchronize after the rcu_unlock of this
thread and then run this".

I tried to explore synchronizing that situation in the event loop,
but the guest is able to call unmask/mask again, recreating the race
condition. If I mark a range where unmask/mask cannot return, I'm
creating an artificial mutex. If I retry in the main event loop, there
is a window where a notification can reach the guest after masking.

In order to reduce this window, shadow_virtqueue_stop could set the
masked notifier fd unconditionally, and then check if it should unmask
under the mutex. I'm not sure if this is worth it, however, since
enabling/disabling already involves a system call.

I think it would be way more natural to at least protect
vhost_virtqueue.notifier_is_masked with the BQL, however, so I will
check that possibility. It would be great to be able to do it in a bh,
but I think this opens a window for the host to send notifications
when the guest has masked the vq, since masking/unmasking could happen
while the bh is running as far as I see.
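
Something in that direction could look like this sketch (the assert
just documents the assumption that mask/unmask and SVQ stop both run
with the BQL held):

void vhost_virtqueue_mask(struct vhost_dev *hdev, VirtIODevice *vdev,
                          int n, bool mask)
{
    int index = n - hdev->vq_index;

    assert(qemu_mutex_iothread_locked());
    hdev->vqs[index].notifier_is_masked = mask;
    /* ... existing masked_notifier / VHOST_SET_VRING_CALL update ... */
}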

Actually, in my test I override vhost_virtqueue_mask, so it looks more
similar to your proposal with VhostOps.

Thanks!

>
> >
> >>>> 2) it is easy to deal with vhost dev start and stop
> >>>>
> >>>> The advantages are obvious: simple and easy to implement.
> >>>>
> >>> I still don't see how performing this step from backend code avoids
> >>> the synchronization problem, since they will be done from different
> >>> threads anyway. Not sure what piece I'm missing.
> >>
> >> See my reply in another thread. If you enable the shadow virtqueue via an
> >> OOB monitor, that's a real issue.
> >>
> >> But I don't think we need to do that since
> >>
> >> 1) SVQ should be transparent to management
> >> 2) it would create an unnecessary synchronization issue
> >>
> >> We can enable the shadow virtqueue through the cli, probably a new
> >> parameter for vhost-vdpa. Then we don't need to care about threads. And in
> >> the final version with full live migration support, the shadow virtqueue
> >> should be enabled automatically, e.g. for devices without
> >> VHOST_F_LOG_ALL, or we can have a dedicated vDPA capability via
> >> VHOST_GET_BACKEND_FEATURES.
> >>
> > It should be enabled automatically in those conditions, but it also
> > needs to be dynamic, and only be active during migration. Otherwise,
> > the guest should use regular vdpa operation. The problem with masking is
> > the same whether we enable it with QMP or because of a live migration event.
> >
> > So we will have the previous synchronization problem sooner or later.
> > If we omit the rollback to regular vdpa operation (in other words,
> > disabling SVQ), code can be simplified, but I'm not sure if that is
> > desirable.
>
>
> Right, so I'm ok to have the synchronization from the start if you wish.
>
> But we need to figure out what to synchronize and how to do it.
>
> Thanks
>
>
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>
> >>> I can see / tested a few solutions but I don't like them a lot:
> >>>
> >>> * Forbid hot-swapping from/to shadow virtqueue mode, and set it from
> >>> cmdline: We will have to deal with setting the SVQ mode dynamically
> >>> sooner or later if we want to use it for live migration.
> >>> * Forbid coming back to regular mode after switching to shadow
> >>> virtqueue mode: The heavy part of the synchronization comes from svq
> >>> stopping code, since we need to serialize the setting of device call
> >>> fd. This could be acceptable, but I'm not sure about the implications:
> >>> What happens if live migration fails and we need to step back? A mutex
> >>> is not needed in this scenario, it's ok with atomics and RCU code.
> >>>
> >>> * Replace KVM_IRQFD instead and let SVQ poll the old one and masked
> >>> notifier: I haven't thought a lot of this one, I think it's better to
> >>> not touch guest notifiers.
> >>> * Monitor also masked notifier from SVQ: I think this could be
> >>> promising, but SVQ needs to be notified about masking/unmasking
> >>> anyway, and there is code that depends on checking the masked notifier
> >>> for the pending notification.
> >>>
> >>>>>> 2) vhost dev start and stop
> >>>>>>
> >>>>>> ?
> >>>>>>
> >>>>>>
> >>>>>>> +     *
> >>>>>>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> >>>>>>> +     */
> >>>>>>> +    EventNotifier host_notifier;
> >>>>>>> +
> >>>>>>> +    /* Virtio queue shadowing */
> >>>>>>> +    VirtQueue *vq;
> >>>>>>>      } VhostShadowVirtqueue;
> >>>>>>>
> >>>>>>> +/* Forward guest notifications */
> >>>>>>> +static void vhost_handle_guest_kick(EventNotifier *n)
> >>>>>>> +{
> >>>>>>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>>>>>> +                                             host_notifier);
> >>>>>>> +
> >>>>>>> +    if (unlikely(!event_notifier_test_and_clear(n))) {
> >>>>>>> +        return;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    event_notifier_set(&svq->kick_notifier);
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +/*
> >>>>>>> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
> >>>>>>> + */
> >>>>>>> +static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
> >>>>>>> +                                                     unsigned vhost_index,
> >>>>>>> +                                                     VhostShadowVirtqueue *svq)
> >>>>>>> +{
> >>>>>>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> >>>>>>> +    struct vhost_vring_file file = {
> >>>>>>> +        .index = vhost_index,
> >>>>>>> +        .fd = event_notifier_get_fd(vq_host_notifier),
> >>>>>>> +    };
> >>>>>>> +    int r;
> >>>>>>> +
> >>>>>>> +    /* Restore vhost kick */
> >>>>>>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> >>>>>>> +    return r ? -errno : 0;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +/*
> >>>>>>> + * Start shadow virtqueue operation.
> >>>>>>> + * @dev vhost device
> >>>>>>> + * @hidx vhost virtqueue index
> >>>>>>> + * @svq Shadow Virtqueue
> >>>>>>> + */
> >>>>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> >>>>>>> +                           unsigned idx,
> >>>>>>> +                           VhostShadowVirtqueue *svq)
> >>>>>> It looks to me this assumes the vhost_dev is started before
> >>>>>> vhost_shadow_vq_start()?
> >>>>>>
> >>>>> Right.
> >>>> This might not be true. The guest may enable and disable virtio drivers after
> >>>> the shadow virtqueue is started. You need to deal with that.
> >>>>
> >>> Right, I will test this scenario.
> >>>
> >>>> Thanks
> >>>>
>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-18 10:48                 ` Eugenio Perez Martin
@ 2021-03-18 12:04                   ` Eugenio Perez Martin
  2021-03-19  6:55                     ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-03-18 12:04 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Thu, Mar 18, 2021 at 11:48 AM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Thu, Mar 18, 2021 at 10:29 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2021/3/18 5:18 PM, Eugenio Perez Martin wrote:
> > > On Thu, Mar 18, 2021 at 4:11 AM Jason Wang <jasowang@redhat.com> wrote:
> > >>
> > >> On 2021/3/18 12:47 AM, Eugenio Perez Martin wrote:
> > >>> On Wed, Mar 17, 2021 at 3:05 AM Jason Wang <jasowang@redhat.com> wrote:
> > >>>> On 2021/3/16 6:31 PM, Eugenio Perez Martin wrote:
> > >>>>> On Tue, Mar 16, 2021 at 8:18 AM Jason Wang <jasowang@redhat.com> wrote:
> > >>>>>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> > >>>>>>> Shadow virtqueue notifications forwarding is disabled when vhost_dev
> > >>>>>>> stops, so code flow follows usual cleanup.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > >>>>>>> ---
> > >>>>>>>      hw/virtio/vhost-shadow-virtqueue.h |   7 ++
> > >>>>>>>      include/hw/virtio/vhost.h          |   4 +
> > >>>>>>>      hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
> > >>>>>>>      hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
> > >>>>>>>      4 files changed, 265 insertions(+), 2 deletions(-)
> > >>>>>>>
> > >>>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > >>>>>>> index 6cc18d6acb..c891c6510d 100644
> > >>>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
> > >>>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > >>>>>>> @@ -17,6 +17,13 @@
> > >>>>>>>
> > >>>>>>>      typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > >>>>>>>
> > >>>>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> > >>>>>>> +                           unsigned idx,
> > >>>>>>> +                           VhostShadowVirtqueue *svq);
> > >>>>>>> +void vhost_shadow_vq_stop(struct vhost_dev *dev,
> > >>>>>>> +                          unsigned idx,
> > >>>>>>> +                          VhostShadowVirtqueue *svq);
> > >>>>>>> +
> > >>>>>>>      VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
> > >>>>>>>
> > >>>>>>>      void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
> > >>>>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> > >>>>>>> index ac963bf23d..7ffdf9aea0 100644
> > >>>>>>> --- a/include/hw/virtio/vhost.h
> > >>>>>>> +++ b/include/hw/virtio/vhost.h
> > >>>>>>> @@ -55,6 +55,8 @@ struct vhost_iommu {
> > >>>>>>>          QLIST_ENTRY(vhost_iommu) iommu_next;
> > >>>>>>>      };
> > >>>>>>>
> > >>>>>>> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > >>>>>>> +
> > >>>>>>>      typedef struct VhostDevConfigOps {
> > >>>>>>>          /* Vhost device config space changed callback
> > >>>>>>>           */
> > >>>>>>> @@ -83,7 +85,9 @@ struct vhost_dev {
> > >>>>>>>          uint64_t backend_cap;
> > >>>>>>>          bool started;
> > >>>>>>>          bool log_enabled;
> > >>>>>>> +    bool shadow_vqs_enabled;
> > >>>>>>>          uint64_t log_size;
> > >>>>>>> +    VhostShadowVirtqueue **shadow_vqs;
> > >>>>>> Any reason that you don't embed the shadow virtqueue into
> > >>>>>> vhost_virtqueue structure?
> > >>>>>>
> > >>>>> Not really, it could be relatively big and I would prefer SVQ
> > >>>>> members/methods to remain hidden from any other part that includes
> > >>>>> vhost.h. But it could be changed, for sure.
> > >>>>>
> > >>>>>> (Note that there's a masked_notifier in struct vhost_virtqueue).
> > >>>>>>
> > >>>>> They are used differently: in SVQ the masked notifier is a pointer,
> > >>>>> and if it's NULL the SVQ code knows that the device is not masked. The
> > >>>>> vhost_virtqueue is the real owner.
> > >>>> Yes, but it's an example of embedding auxiliary data structures in the
> > >>>> vhost_virtqueue.
> > >>>>
> > >>>>
> > >>>>> It could be replaced by a boolean in SVQ or something like that, I
> > >>>>> experimented with a tri-state too (UNMASKED, MASKED, MASKED_NOTIFIED)
> > >>>>> and let the vhost.c code manage all the transitions. But I find the
> > >>>>> pointer use clearer, since it's more natural for the existing
> > >>>>> vhost_virtqueue_mask and vhost_virtqueue_pending functions.
> > >>>>>
> > >>>>> This masking/unmasking is the part I dislike the most from this
> > >>>>> series, so I'm very open to alternatives.
> > >>>> See below. I think we don't even need to care about that.
> > >>>>
> > >>>>
> > >>>>>>>          Error *migration_blocker;
> > >>>>>>>          const VhostOps *vhost_ops;
> > >>>>>>>          void *opaque;
> > >>>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > >>>>>>> index 4512e5b058..3e43399e9c 100644
> > >>>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> > >>>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > >>>>>>> @@ -8,9 +8,12 @@
> > >>>>>>>       */
> > >>>>>>>
> > >>>>>>>      #include "hw/virtio/vhost-shadow-virtqueue.h"
> > >>>>>>> +#include "hw/virtio/vhost.h"
> > >>>>>>> +
> > >>>>>>> +#include "standard-headers/linux/vhost_types.h"
> > >>>>>>>
> > >>>>>>>      #include "qemu/error-report.h"
> > >>>>>>> -#include "qemu/event_notifier.h"
> > >>>>>>> +#include "qemu/main-loop.h"
> > >>>>>>>
> > >>>>>>>      /* Shadow virtqueue to relay notifications */
> > >>>>>>>      typedef struct VhostShadowVirtqueue {
> > >>>>>>> @@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
> > >>>>>>>          EventNotifier kick_notifier;
> > >>>>>>>          /* Shadow call notifier, sent to vhost */
> > >>>>>>>          EventNotifier call_notifier;
> > >>>>>>> +
> > >>>>>>> +    /*
> > >>>>>>> +     * Borrowed virtqueue's guest to host notifier.
> > >>>>>>> +     * Borrowing it in this event notifier allows us to register on the event
> > >>>>>>> +     * loop and access the associated shadow virtqueue easily. If we use the
> > >>>>>>> +     * VirtQueue, we don't have an easy way to retrieve it.
> > >>>>>> So this is something that worries me. It looks like a layer violation
> > >>>>>> that makes the codes harder to work correctly.
> > >>>>>>
> > >>>>> I don't follow you here.
> > >>>>>
> > >>>>> The vhost code already depends on virtqueue in the same sense:
> > >>>>> virtio_queue_get_host_notifier is called on vhost_virtqueue_start. So
> > >>>>> if this behavior ever changes it is unlikely for vhost to keep working
> > >>>>> without changes. vhost_virtqueue has a kick/call int where I think it
> > >>>>> should be stored actually, but they are never used as far as I see.
> > >>>>>
> > >>>>> Previous RFC did rely on vhost_dev_disable_notifiers. From its documentation:
> > >>>>> /* Stop processing guest IO notifications in vhost.
> > >>>>>     * Start processing them in qemu.
> > >>>>>     ...
> > >>>>> But it was easier for this mode to miss a notification, since it
> > >>>>> creates a new host_notifier in virtio_bus_set_host_notifier right away.
> > >>>>> So I decided to use the file descriptor already sent to vhost in
> > >>>>> regular operation mode, so guest-related resources change less.
> > >>>>>
> > >>>>> Having said that, maybe it's useful to assert that
> > >>>>> vhost_dev_{enable,disable}_notifiers are never called on shadow
> > >>>>> virtqueue mode. Also, it could be useful to retrieve it from
> > >>>>> virtio_bus, not raw shadow virtqueue, so all get/set are performed
> > >>>>> from it. Would that make more sense?
> > >>>>>
> > >>>>>> I wonder if it would be simpler to start from a vDPA dedicated shadow
> > >>>>>> virtqueue implementation:
> > >>>>>>
> > >>>>>> 1) have the above fields embedded in the vhost_vdpa structure
> > >>>>>> 2) Work at the level of
> > >>>>>> vhost_vdpa_set_vring_kick()/vhost_vdpa_set_vring_call()
> > >>>>>>
> > >>>>> This notifier is never sent to the device in shadow virtqueue mode.
> > >>>>> It's for SVQ to react to guest's notifications, registering it on its
> > >>>>> main event loop [1]. So if I perform these changes the way I
> > >>>>> understand them, SVQ would still rely on this borrowed EventNotifier,
> > >>>>> and it would send to the vDPA device the newly created kick_notifier
> > >>>>> of VhostShadowVirtqueue.
> > >>>> The point is that vhost code should be coupled loosely with virtio. If
> > >>>> you try to "borrow" EventNotifier from virtio, you need to deal with a
> > >>>> lot of synchronization. An example is the masking stuff.
> > >>>>
> > >>> I still don't follow this, sorry.
> > >>>
> > >>> The svq->host_notifier event notifier is not affected by the masking
> > >>> issue, it is completely private to SVQ. This commit creates and uses
> > >>> it, and nothing related to masking is touched until the next commit.
> > >>>
> > >>>>>> Then the layer is still isolated and you have a much simpler context to
> > >>>>>> work in, where you don't need to care about a lot of synchronization:
> > >>>>>>
> > >>>>>> 1) vq masking
> > >>>>> This EventNotifier is not used for masking, it does not change from
> > >>>>> the start of the shadow virtqueue operation through its end. Call fd
> > >>>>> sent to vhost/vdpa device does not change either in shadow virtqueue
> > >>>>> mode operation with masking/unmasking. I will try to document it
> > >>>>> better.
> > >>>>>
> > >>>>> I think that we will need to handle synchronization with
> > >>>>> masking/unmasking from the guest and dynamically enabling SVQ
> > >>>>> operation mode, since they can happen at the same time as long as we
> > >>>>> let the guest run. There may be better ways of synchronizing them of
> > >>>>> course, but I don't see how moving to the vhost-vdpa backend helps
> > >>>>> with this. Please expand if I've missed it.
> > >>>>>
> > >>>>> Or do you mean to forbid regular <-> SVQ operation mode transitions and delay it
> > >>>>> to future patchsets?
> > >>>> So my idea is to do all the shadow virtqueue work in the vhost-vDPA code
> > >>>> and hide it from the upper layers like virtio. This means it works at
> > >>>> vhost level, which can see vhost_vring_file only. When enabled, what it
> > >>>> needs is just:
> > >>>>
> > >>>> 1) switch to use svq kickfd and relay ioeventfd to svq kickfd
> > >>>> 2) switch to use svq callfd and relay svq callfd to irqfd
> > >>>>
> > >>>> It will still behave like a vhost backend where the switching is done
> > >>>> internally in vhost-vDPA, which is totally transparent to the virtio
> > >>>> code of QEMU.
> > >>>>
> > >>>> E.g:
> > >>>>
> > >>>> 1) in the case of guest notifier masking, we don't need to do anything
> > >>>> since virtio codes will replace another irqfd for us.
> > >>> Assuming that we don't modify vhost masking code, but send shadow
> > >>> virtqueue call descriptor to the vhost device:
> > >>>
> > >>> If the guest virtio code masks the virtqueue and replaces the vhost-vdpa
> > >>> device call fd (VhostShadowVirtqueue.call_notifier in the next commit,
> > >>> or the descriptor in your previous second point, svq callfd) with the
> > >>> masked notifier, vhost_shadow_vq_handle_call will not be called
> > >>> anymore, and no more used descriptors will be forwarded. They will be
> > >>> stuck in the shadow virtqueue forever. The guest itself cannot recover
> > >>> from this situation, since a masking will set irqfd, not SVQ call fd.
> > >>
> > >> Just to make sure we're on the same page. During vq masking, the virtio
> > >> codes actually use the masked_notifier as callfd in vhost_virtqueue_mask():
> > >>
> > >>       if (mask) {
> > >>           assert(vdev->use_guest_notifier_mask);
> > >>           file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
> > >>       } else {
> > >>       file.fd = event_notifier_get_fd(virtio_queue_get_guest_notifier(vvq));
> > >>       }
> > >>
> > >>       file.index = hdev->vhost_ops->vhost_get_vq_index(hdev, n);
> > >>       r = hdev->vhost_ops->vhost_set_vring_call(hdev, &file);
> > >>
> > >> So consider the shadow virtqueue is done at vhost-vDPA. We just need to
> > >> make sure
> > >>
> > >> 1) update the callfd which is passed by the virtio layer via set_vring_kick()
> > >> 2) always write to the callfd during vhost_shadow_vq_handle_call()
> > >>
> > >> Then
> > >>
> > >> 3) When shadow vq is enabled, we just set the callfd of shadow virtqueue
> > >> to vDPA via VHOST_SET_VRING_CALL, and poll the svq callfd
> > >> 4) When shadow vq is disabled, we just set the callfd that is passed by
> > >> virtio via VHOST_SET_VRING_CALL, and stop polling the svq callfd
> > >>
> > >> So you can see that in steps 2 and 4, we don't need to know whether or not
> > >> the vq is masked, since we follow the vhost protocol "VhostOps" and do
> > >> everything transparently in the vhost-(vDPA) layer.
> > >>
> > > All of this assumes that we can enable/disable SVQ dynamically while
> > > the device is running. If it's not the case, there is no need for the
> > > mutex in either the vhost.c code or vdpa_backend.
> > >
> > > As I see it, the issue is that steps (2) and (4) happen in different
> > > threads: (2) is in the vCPU vmexit, and (4) is in the main event loop.
> > > Consider unmasking and disabling SVQ at the same time with no mutex:
> > >
> > > vCPU vmexit thread                     aio thread
> > > (unmask)                               (stops SVQ)
> > > |                                      |
> > > |                                      // Last callfd set was masked_notifier
> > > |                                      vdpa_backend.callfd = \
> > > |                                              atomic_read(masked_notifier).
> > > |                                      |
> > > vhost_set_vring_call(vq.guest_notifier)|
> > > -> vdpa_backend.callfd = \             |
> > >             vq.guest_notifier           |
> > > |                                      |
> > > |                                      ioctl(vdpa,
> > > VHOST_SET_VRING_CALL, vdpa_backend.callfd)
> > > |
> > > // guest expects more interrupts, but
> > > // device just set masked
> > >
> > > And vhost_set_vring_call could even run entirely while the ioctl is
> > > being executed.
> > >
> > > So that is the reason for the mutex: vdpa_backend.call_fd and the
> > > ioctl VHOST_SET_VRING_CALL must be serialized. I'm ok with moving to
> > > vdpa backend, but it's the same code, just in vdpa_backend.c instead
> > > of vhost.c, so it becomes less generic in my opinion.
> >
> >
> > You are right. But let's consider if we can avoid the dedicated mutex.
> >
> > E.g. can we use the BQL? Basically we need to synchronize with the iothread.
> >
> > Or is it possible to schedule a bh so things are serialized automatically?
> >
>
> I tried RCU with no success, and I think the same issues apply to bh.
> I will try to explain the best I can what I achieved in the past, and
> why I discarded it. I will explore BQL approaches; it could actually be
> simpler that way.
>
> The hard part to achieve is that no notification can be forwarded to
> the guest once the masking vmexit returns (isn't it?). The unmasking
> scenario is easy with RCU, since the pending notification could reach
> the guest asynchronously if it exists.
>
> On the other hand, whatever the guest sets should take priority over
> whatever shadow_vq_stop sets.
>
> With RCU, the problem is that the synchronization point should be the
> vmexit thread. The function vhost_virtqueue_mask is already called
> within RCU, so the RCU lock could still be held after the function
> returns, meaning the effective masking could happen after the vmexit
> returns. I see no way to make something like "call_rcu but only in
> this thread" or, in other words, "rcu_synchronize after the rcu_unlock
> of this thread and then run this".
>
> I tried to explore synchronizing that situation in the event loop,
> but the guest is able to call unmask/mask again, creating the race
> condition. If I mark a range where unmask/mask cannot return, I'm
> creating an artificial mutex. If I retry in the main event loop, there
> is a window where a notification can reach the guest after masking.
>
> In order to reduce this window, shadow_virtqueue_stop could set the
> masked notifier fd unconditionally, and then check if it should unmask
> under the mutex. I'm not sure if this is worth it, however, since
> enabling/disabling already involves a system call.
>
> I think it would be way more natural to at least protect
> vhost_virtqueue.notifier_is_masked with the BQL, however, so I will
> check that possibility. It would be great to be able to do it in a bh,
> but I think this opens a window for the host to send notifications
> when the guest has masked the vq, since masking/unmasking could happen
> while the bh is running as far as I see.
>

So actually vhost_shadow_virtqueue_start/stop (being a QMP command)
and vhost_virtqueue_mask (vmexit) already run under the BQL.

The ideal scenario would be to run the kick/call handlers in their own
AioContext and not take the BQL in them, but I think this is doable with
just atomics and maybe events. For this series I think we can delete
the introduced mutex (and maybe replace it with an assertion?)
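
Something like this minimal sketch is what I have in mind (assuming we
just document and enforce the BQL requirement; the helper name
vhost_shadow_vq_set_call_fd is hypothetical, not part of this series):

    /*
     * All the paths that change the device call fd (the QMP command and
     * the MSI-X mask/unmask vmexit) already hold the BQL, so asserting
     * it replaces the dedicated mutex.
     */
    static int vhost_shadow_vq_set_call_fd(struct vhost_dev *dev,
                                           struct vhost_vring_file *file)
    {
        assert(qemu_mutex_iothread_locked());

        return dev->vhost_ops->vhost_set_vring_call(dev, file);
    }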

Thanks!

> Actually, in my test I override vhost_virtqueue_mask, so it looks more
> similar to your proposal with VhostOps.
>
> Thanks!
>
> >
> > >
> > >>>> 2) easy to deal with vhost dev start and stop
> > >>>>
> > >>>> The advantages are obvious: simple and easy to implement.
> > >>>>
> > >>> I still don't see how performing this step from backend code avoids
> > >>> the synchronization problem, since they will be done from different
> > >>> threads anyway. Not sure what piece I'm missing.
> > >>
> > >> See my reply in another thread. If you enable the shadow virtqueue via an
> > >> OOB monitor command, that's a real issue.
> > >>
> > >> But I don't think we need to do that since
> > >>
> > >> 1) SVQ should be transparent to management
> > >> 2) unnecessary synchronization issue
> > >>
> > >> We can enable the shadow virtqueue through the CLI, probably with a new
> > >> vhost-vdpa parameter. Then we don't need to care about threads. And in
> > >> the final version with full live migration support, the shadow virtqueue
> > >> should be enabled automatically, e.g. for devices without
> > >> VHOST_F_LOG_ALL, or we can have a dedicated vDPA capability via
> > >> VHOST_GET_BACKEND_FEATURES.
> > >>
> > > It should be enabled automatically in those conditions, but it also
> > > needs to be dynamic, and only be active during migration. Otherwise,
> > > the guest should use regular vdpa operation. The problem with masking is
> > > the same whether we enable it via QMP or because of a live migration event.
> > >
> > > So we will have the previous synchronization problem sooner or later.
> > > If we omit the rollback to regular vdpa operation (in other words,
> > > disabling SVQ), code can be simplified, but I'm not sure if that is
> > > desirable.
> >
> >
> > Right, so I'm OK with having the synchronization from the start if you wish.
> >
> > But we need to figure out what to synchronize and how to do the synchronization.
> >
> > Thanks
> >
> >
> > >
> > > Thanks!
> > >
> > >> Thanks
> > >>
> > >>
> > >>> I can see / tested a few solutions but I don't like them a lot:
> > >>>
> > >>> * Forbid hot-swapping from/to shadow virtqueue mode, and set it from
> > >>> cmdline: We will have to deal with setting the SVQ mode dynamically
> > >>> sooner or later if we want to use it for live migration.
> > >>> * Forbid coming back to regular mode after switching to shadow
> > >>> virtqueue mode: The heavy part of the synchronization comes from svq
> > >>> stopping code, since we need to serialize the setting of device call
> > >>> fd. This could be acceptable, but I'm not sure about the implications:
> > >>> What happens if live migration fails and we need to step back? A mutex
> > >>> is not needed in this scenario, it's ok with atomics and RCU code.
> > >>>
> > >>> * Replace KVM_IRQFD instead and let SVQ poll the old one and masked
> > >>> notifier: I haven't thought a lot about this one; I think it's better to
> > >>> not touch guest notifiers.
> > >>> * Monitor also masked notifier from SVQ: I think this could be
> > >>> promising, but SVQ needs to be notified about masking/unmasking
> > >>> anyway, and there is code that depends on checking the masked notifier
> > >>> for the pending notification.
> > >>>
> > >>>>>> 2) vhost dev start and stop
> > >>>>>>
> > >>>>>> ?
> > >>>>>>
> > >>>>>>
> > >>>>>>> +     *
> > >>>>>>> +     * So shadow virtqueue must not clean it, or we would lose the VirtQueue one.
> > >>>>>>> +     */
> > >>>>>>> +    EventNotifier host_notifier;
> > >>>>>>> +
> > >>>>>>> +    /* Virtio queue shadowing */
> > >>>>>>> +    VirtQueue *vq;
> > >>>>>>>      } VhostShadowVirtqueue;
> > >>>>>>>
> > >>>>>>> +/* Forward guest notifications */
> > >>>>>>> +static void vhost_handle_guest_kick(EventNotifier *n)
> > >>>>>>> +{
> > >>>>>>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > >>>>>>> +                                             host_notifier);
> > >>>>>>> +
> > >>>>>>> +    if (unlikely(!event_notifier_test_and_clear(n))) {
> > >>>>>>> +        return;
> > >>>>>>> +    }
> > >>>>>>> +
> > >>>>>>> +    event_notifier_set(&svq->kick_notifier);
> > >>>>>>> +}
> > >>>>>>> +
> > >>>>>>> +/*
> > >>>>>>> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
> > >>>>>>> + */
> > >>>>>>> +static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
> > >>>>>>> +                                                     unsigned vhost_index,
> > >>>>>>> +                                                     VhostShadowVirtqueue *svq)
> > >>>>>>> +{
> > >>>>>>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> > >>>>>>> +    struct vhost_vring_file file = {
> > >>>>>>> +        .index = vhost_index,
> > >>>>>>> +        .fd = event_notifier_get_fd(vq_host_notifier),
> > >>>>>>> +    };
> > >>>>>>> +    int r;
> > >>>>>>> +
> > >>>>>>> +    /* Restore vhost kick */
> > >>>>>>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> > >>>>>>> +    return r ? -errno : 0;
> > >>>>>>> +}
> > >>>>>>> +
> > >>>>>>> +/*
> > >>>>>>> + * Start shadow virtqueue operation.
> > >>>>>>> + * @dev vhost device
> > >>>>>>> + * @idx vhost virtqueue index
> > >>>>>>> + * @svq Shadow Virtqueue
> > >>>>>>> + */
> > >>>>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
> > >>>>>>> +                           unsigned idx,
> > >>>>>>> +                           VhostShadowVirtqueue *svq)
> > >>>>>> It looks to me this assumes the vhost_dev is started before
> > >>>>>> vhost_shadow_vq_start()?
> > >>>>>>
> > >>>>> Right.
> >>>> This might not be true. The guest may enable and disable virtio drivers after
> > >>>> the shadow virtqueue is started. You need to deal with that.
> > >>>>
> > >>> Right, I will test this scenario.
> > >>>
> > >>>> Thanks
> > >>>>
> >



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue
  2021-03-18 12:04                   ` Eugenio Perez Martin
@ 2021-03-19  6:55                     ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2021-03-19  6:55 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller


On 2021/3/18 8:04 PM, Eugenio Perez Martin wrote:
> On Thu, Mar 18, 2021 at 11:48 AM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
>> On Thu, Mar 18, 2021 at 10:29 AM Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On 2021/3/18 5:18 PM, Eugenio Perez Martin wrote:
>>>> On Thu, Mar 18, 2021 at 4:11 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>> On 2021/3/18 12:47 AM, Eugenio Perez Martin wrote:
>>>>>> On Wed, Mar 17, 2021 at 3:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>> On 2021/3/16 6:31 PM, Eugenio Perez Martin wrote:
>>>>>>>> On Tue, Mar 16, 2021 at 8:18 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
>>>>>>>>>> Shadow virtqueue notifications forwarding is disabled when vhost_dev
>>>>>>>>>> stops, so code flow follows usual cleanup.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>       hw/virtio/vhost-shadow-virtqueue.h |   7 ++
>>>>>>>>>>       include/hw/virtio/vhost.h          |   4 +
>>>>>>>>>>       hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
>>>>>>>>>>       hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
>>>>>>>>>>       4 files changed, 265 insertions(+), 2 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>>>>>>>>>> index 6cc18d6acb..c891c6510d 100644
>>>>>>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>>>>>>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>>>>>>>>>> @@ -17,6 +17,13 @@
>>>>>>>>>>
>>>>>>>>>>       typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>>>>>>>>>>
>>>>>>>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
>>>>>>>>>> +                           unsigned idx,
>>>>>>>>>> +                           VhostShadowVirtqueue *svq);
>>>>>>>>>> +void vhost_shadow_vq_stop(struct vhost_dev *dev,
>>>>>>>>>> +                          unsigned idx,
>>>>>>>>>> +                          VhostShadowVirtqueue *svq);
>>>>>>>>>> +
>>>>>>>>>>       VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
>>>>>>>>>>
>>>>>>>>>>       void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
>>>>>>>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>>>>>>>>> index ac963bf23d..7ffdf9aea0 100644
>>>>>>>>>> --- a/include/hw/virtio/vhost.h
>>>>>>>>>> +++ b/include/hw/virtio/vhost.h
>>>>>>>>>> @@ -55,6 +55,8 @@ struct vhost_iommu {
>>>>>>>>>>           QLIST_ENTRY(vhost_iommu) iommu_next;
>>>>>>>>>>       };
>>>>>>>>>>
>>>>>>>>>> +typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>>>>>>>>>> +
>>>>>>>>>>       typedef struct VhostDevConfigOps {
>>>>>>>>>>           /* Vhost device config space changed callback
>>>>>>>>>>            */
>>>>>>>>>> @@ -83,7 +85,9 @@ struct vhost_dev {
>>>>>>>>>>           uint64_t backend_cap;
>>>>>>>>>>           bool started;
>>>>>>>>>>           bool log_enabled;
>>>>>>>>>> +    bool shadow_vqs_enabled;
>>>>>>>>>>           uint64_t log_size;
>>>>>>>>>> +    VhostShadowVirtqueue **shadow_vqs;
>>>>>>>>> Any reason that you don't embed the shadow virtqueue into
>>>>>>>>> vhost_virtqueue structure?
>>>>>>>>>
>>>>>>>> Not really, it could be relatively big and I would prefer SVQ
>>>>>>>> members/methods to remain hidden from any other part that includes
>>>>>>>> vhost.h. But it could be changed, for sure.
>>>>>>>>
>>>>>>>>> (Note that there's a masked_notifier in struct vhost_virtqueue).
>>>>>>>>>
>>>>>>>> They are used differently: in SVQ the masked notifier is a pointer,
>>>>>>>> and if it's NULL the SVQ code knows that the device is not masked. The
>>>>>>>> vhost_virtqueue is the real owner.
>>>>>>> Yes, but it's an example of embedding auxiliary data structures in the
>>>>>>> vhost_virtqueue.
>>>>>>>
>>>>>>>
>>>>>>>> It could be replaced by a boolean in SVQ or something like that, I
>>>>>>>> experimented with a tri-state too (UNMASKED, MASKED, MASKED_NOTIFIED)
>>>>>>>> and let the vhost.c code manage all the transitions. But I find the
>>>>>>>> pointer use clearer, since it's more natural for the existing
>>>>>>>> vhost_virtqueue_mask and vhost_virtqueue_pending functions.
>>>>>>>>
>>>>>>>> This masking/unmasking is the part I dislike the most from this
>>>>>>>> series, so I'm very open to alternatives.
>>>>>>> See below. I think we don't even need to care about that.
>>>>>>>
>>>>>>>
>>>>>>>>>>           Error *migration_blocker;
>>>>>>>>>>           const VhostOps *vhost_ops;
>>>>>>>>>>           void *opaque;
>>>>>>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>>>>>>>>> index 4512e5b058..3e43399e9c 100644
>>>>>>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>>>>>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>>>>>>>>> @@ -8,9 +8,12 @@
>>>>>>>>>>        */
>>>>>>>>>>
>>>>>>>>>>       #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>>>>>>>> +#include "hw/virtio/vhost.h"
>>>>>>>>>> +
>>>>>>>>>> +#include "standard-headers/linux/vhost_types.h"
>>>>>>>>>>
>>>>>>>>>>       #include "qemu/error-report.h"
>>>>>>>>>> -#include "qemu/event_notifier.h"
>>>>>>>>>> +#include "qemu/main-loop.h"
>>>>>>>>>>
>>>>>>>>>>       /* Shadow virtqueue to relay notifications */
>>>>>>>>>>       typedef struct VhostShadowVirtqueue {
>>>>>>>>>> @@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
>>>>>>>>>>           EventNotifier kick_notifier;
>>>>>>>>>>           /* Shadow call notifier, sent to vhost */
>>>>>>>>>>           EventNotifier call_notifier;
>>>>>>>>>> +
>>>>>>>>>> +    /*
>>>>>>>>>> +     * Borrowed virtqueue's guest to host notifier.
>>>>>>>>>> +     * Borrowing it in this event notifier allows us to register on the event
>>>>>>>>>> +     * loop and access the associated shadow virtqueue easily. If we use the
>>>>>>>>>> +     * VirtQueue, we don't have an easy way to retrieve it.
>>>>>>>>> So this is something that worries me. It looks like a layer violation
>>>>>>>>> that makes the codes harder to work correctly.
>>>>>>>>>
>>>>>>>> I don't follow you here.
>>>>>>>>
>>>>>>>> The vhost code already depends on virtqueue in the same sense:
>>>>>>>> virtio_queue_get_host_notifier is called on vhost_virtqueue_start. So
>>>>>>>> if this behavior ever changes it is unlikely for vhost to keep working
>>>>>>>> without changes. vhost_virtqueue has a kick/call int where I think it
>>>>>>>> should be stored actually, but they are never used as far as I see.
>>>>>>>>
>>>>>>>> Previous RFC did rely on vhost_dev_disable_notifiers. From its documentation:
>>>>>>>> /* Stop processing guest IO notifications in vhost.
>>>>>>>>      * Start processing them in qemu.
>>>>>>>>      ...
>>>>>>>> But it was easier for this mode to miss a notification, since it
>>>>>>>> creates a new host_notifier in virtio_bus_set_host_notifier right away.
>>>>>>>> So I decided to use the file descriptor already sent to vhost in
>>>>>>>> regular operation mode, so guest-related resources change less.
>>>>>>>>
>>>>>>>> Having said that, maybe it's useful to assert that
>>>>>>>> vhost_dev_{enable,disable}_notifiers are never called on shadow
>>>>>>>> virtqueue mode. Also, it could be useful to retrieve it from
>>>>>>>> virtio_bus, not raw shadow virtqueue, so all get/set are performed
>>>>>>>> from it. Would that make more sense?
>>>>>>>>
>>>>>>>>> I wonder if it would be simpler to start from a vDPA dedicated shadow
>>>>>>>>> virtqueue implementation:
>>>>>>>>>
>>>>>>>>> 1) have the above fields embedded in the vhost_vdpa structure
>>>>>>>>> 2) Work at the level of
>>>>>>>>> vhost_vdpa_set_vring_kick()/vhost_vdpa_set_vring_call()
>>>>>>>>>
>>>>>>>> This notifier is never sent to the device in shadow virtqueue mode.
>>>>>>>> It's for SVQ to react to guest's notifications, registering it on its
>>>>>>>> main event loop [1]. So if I perform these changes the way I
>>>>>>>> understand them, SVQ would still rely on this borrowed EventNotifier,
>>>>>>>> and it would send to the vDPA device the newly created kick_notifier
>>>>>>>> of VhostShadowVirtqueue.
>>>>>>> The point is that vhost code should be coupled loosely with virtio. If
>>>>>>> you try to "borrow" EventNotifier from virtio, you need to deal with a
>>>>>>> lot of synchronization. An example is the masking stuff.
>>>>>>>
>>>>>> I still don't follow this, sorry.
>>>>>>
>>>>>> The svq->host_notifier event notifier is not affected by the masking
>>>>>> issue, it is completely private to SVQ. This commit creates and uses
>>>>>> it, and nothing related to masking is touched until the next commit.
>>>>>>
>>>>>>>>> Then the layer is still isolated and you have a much simpler context to
>>>>>>>>> work in, where you don't need to care about a lot of synchronization:
>>>>>>>>>
>>>>>>>>> 1) vq masking
>>>>>>>> This EventNotifier is not used for masking, it does not change from
>>>>>>>> the start of the shadow virtqueue operation through its end. Call fd
>>>>>>>> sent to vhost/vdpa device does not change either in shadow virtqueue
>>>>>>>> mode operation with masking/unmasking. I will try to document it
>>>>>>>> better.
>>>>>>>>
>>>>>>>> I think that we will need to handle synchronization with
>>>>>>>> masking/unmasking from the guest and dynamically enabling SVQ
>>>>>>>> operation mode, since they can happen at the same time as long as we
>>>>>>>> let the guest run. There may be better ways of synchronizing them of
>>>>>>>> course, but I don't see how moving to the vhost-vdpa backend helps
>>>>>>>> with this. Please expand if I've missed it.
>>>>>>>>
>>>>>>>> Or do you mean to forbid regular <-> SVQ operation mode transitions and delay it
>>>>>>>> to future patchsets?
>>>>>>> So my idea is to do all the shadow virtqueue work in the vhost-vDPA code
>>>>>>> and hide it from the upper layers like virtio. This means it works at
>>>>>>> vhost level, which can see vhost_vring_file only. When enabled, what it
>>>>>>> needs is just:
>>>>>>>
>>>>>>> 1) switch to use svq kickfd and relay ioeventfd to svq kickfd
>>>>>>> 2) switch to use svq callfd and relay svq callfd to irqfd
>>>>>>>
>>>>>>> It will still behave like a vhost backend where the switching is done
>>>>>>> internally in vhost-vDPA, which is totally transparent to the virtio
>>>>>>> code of QEMU.
>>>>>>>
>>>>>>> E.g:
>>>>>>>
>>>>>>> 1) in the case of guest notifier masking, we don't need to do anything
>>>>>>> since virtio codes will replace another irqfd for us.
>>>>>> Assuming that we don't modify vhost masking code, but send shadow
>>>>>> virtqueue call descriptor to the vhost device:
>>>>>>
>>>>>> If the guest virtio code masks the virtqueue and replaces the vhost-vdpa
>>>>>> device call fd (VhostShadowVirtqueue.call_notifier in the next commit,
>>>>>> or the descriptor in your previous second point, svq callfd) with the
>>>>>> masked notifier, vhost_shadow_vq_handle_call will not be called
>>>>>> anymore, and no more used descriptors will be forwarded. They will be
>>>>>> stuck in the shadow virtqueue forever. The guest itself cannot recover
>>>>>> from this situation, since a masking will set irqfd, not SVQ call fd.
>>>>> Just to make sure we're on the same page. During vq masking, the virtio
>>>>> codes actually use the masked_notifier as callfd in vhost_virtqueue_mask():
>>>>>
>>>>>        if (mask) {
>>>>>            assert(vdev->use_guest_notifier_mask);
>>>>>            file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
>>>>>        } else {
>>>>>        file.fd = event_notifier_get_fd(virtio_queue_get_guest_notifier(vvq));
>>>>>        }
>>>>>
>>>>>        file.index = hdev->vhost_ops->vhost_get_vq_index(hdev, n);
>>>>>        r = hdev->vhost_ops->vhost_set_vring_call(hdev, &file);
>>>>>
>>>>> So consider the shadow virtqueue is done at vhost-vDPA. We just need to
>>>>> make sure
>>>>>
>>>>> 1) update the callfd which is passed by the virtio layer via set_vring_kick()
>>>>> 2) always write to the callfd during vhost_shadow_vq_handle_call()
>>>>>
>>>>> Then
>>>>>
>>>>> 3) When shadow vq is enabled, we just set the callfd of shadow virtqueue
>>>>> to vDPA via VHOST_SET_VRING_CALL, and poll the svq callfd
>>>>> 4) When shadow vq is disabled, we just set the callfd that is passed by
>>>>> virtio via VHOST_SET_VRING_CALL, and stop polling the svq callfd
>>>>>
>>>>> So you can see that in steps 2 and 4, we don't need to know whether or not
>>>>> the vq is masked, since we follow the vhost protocol "VhostOps" and do
>>>>> everything transparently in the vhost-(vDPA) layer.
>>>>>
>>>> All of this assumes that we can enable/disable SVQ dynamically while
>>>> the device is running. If it's not the case, there is no need for the
>>>> mutex in either the vhost.c code or vdpa_backend.
>>>>
>>>> As I see it, the issue is that steps (2) and (4) happen in different
>>>> threads: (2) is in the vCPU vmexit, and (4) is in the main event loop.
>>>> Consider unmasking and disabling SVQ at the same time with no mutex:
>>>>
>>>> vCPU vmexit thread                     aio thread
>>>> (unmask)                               (stops SVQ)
>>>> |                                      |
>>>> |                                      // Last callfd set was masked_notifier
>>>> |                                      vdpa_backend.callfd = \
>>>> |                                              atomic_read(masked_notifier).
>>>> |                                      |
>>>> vhost_set_vring_call(vq.guest_notifier)|
>>>> -> vdpa_backend.callfd = \             |
>>>>              vq.guest_notifier           |
>>>> |                                      |
>>>> |                                      ioctl(vdpa,
>>>> VHOST_SET_VRING_CALL, vdpa_backend.callfd)
>>>> |
>>>> // guest expects more interrupts, but
>>>> // device just set masked
>>>>
>>>> And vhost_set_vring_call could even run entirely while the ioctl is
>>>> being executed.
>>>>
>>>> So that is the reason for the mutex: vdpa_backend.call_fd and the
>>>> ioctl VHOST_SET_VRING_CALL must be serialized. I'm ok with moving to
>>>> vdpa backend, but it's the same code, just in vdpa_backend.c instead
>>>> of vhost.c, so it becomes less generic in my opinion.
>>>
>>> You are right. But let's consider if we can avoid the dedicated mutex.
>>>
>>> E.g. can we use the BQL? Basically we need to synchronize with the iothread.
>>>
>>> Or is it possible to schedule a bh so things are serialized automatically?
>>>
>> I tried RCU with no success, and I think the same issues apply to bh.
>> I will try to explain the best I can what I achieved in the past, and
>> why I discarded it. I will explore BQL approaches; it could actually be
>> simpler that way.
>>
>> The hard part to achieve is that no notification can be forwarded to
>> the guest once the masking vmexit returns (isn't it?). The unmasking
>> scenario is easy with RCU, since the pending notification could reach
>> the guest asynchronously if it exists.
>>
>> On the other hand, whatever the guest sets should take priority over
>> whatever shadow_vq_stop sets.
>>
>> With RCU, the problem is that the synchronization point should be the
>> vmexit thread. The function vhost_virtqueue_mask is already called
>> within RCU, so the RCU lock could still be held after the function
>> returns, meaning the effective masking could happen after the vmexit
>> returns. I see no way to make something like "call_rcu but only in
>> this thread" or, in other words, "rcu_synchronize after the rcu_unlock
>> of this thread and then run this".
>>
>> I tried to explore synchronizing that situation in the event loop,
>> but the guest is able to call unmask/mask again, creating the race
>> condition. If I mark a range where unmask/mask cannot return, I'm
>> creating an artificial mutex. If I retry in the main event loop, there
>> is a window where a notification can reach the guest after masking.
>>
>> In order to reduce this window, shadow_virtqueue_stop could set the
>> masked notifier fd unconditionally, and then check if it should unmask
>> under the mutex. I'm not sure if this is worth it, however, since
>> enabling/disabling already involves a system call.
>>
>> I think it would be way more natural to at least protect
>> vhost_virtqueue.notifier_is_masked with the BQL, however, so I will
>> check that possibility. It would be great to be able to do it in a bh,
>> but I think this opens a window for the host to send notifications
>> when the guest has masked the vq, since masking/unmasking could happen
>> while the bh is running as far as I see.
>>
> So actually vhost_shadow_virtqueue_start/stop (being a QMP command)
> and vhost_virtqueue_mask (vmexit) already run under the BQL.


So actually everything is already synchronized?

1) MSI-X MMIO handlers from vcpu thread with BQL
2) Poll for ioeventfd and vhost callfd from iothread with BQL
3) The monitor command with the BQL held


>
> The ideal scenario would be to run the kick/call handlers in their own
> AioContext and not take the BQL in them,


Well, then you still need to synchronize with QMP, MSI-X MMIO.


> but I think this is doable with
> just atomics and maybe events. For this series I think we can delete
> the introduced mutex (and maybe replace it with an assertion?)


Let's try to avoid assertion here.

Thanks


>
> Thanks!
>
>> Actually, in my test I override vhost_virtqueue_mask, so it looks more
>> similar to your proposal with VhostOps.
>>
>> Thanks!
>>
>>>>>>> 2) easy to deal with vhost dev start and stop
>>>>>>>
>>>>>>> The advantages are obvious: simple and easy to implement.
>>>>>>>
>>>>>> I still don't see how performing this step from backend code avoids
>>>>>> the synchronization problem, since they will be done from different
>>>>>> threads anyway. Not sure what piece I'm missing.
>>>>> See my reply in another thread. If you enable the shadow virtqueue via an
>>>>> OOB monitor command, that's a real issue.
>>>>>
>>>>> But I don't think we need to do that since
>>>>>
>>>>> 1) SVQ should be transparent to management
>>>>> 2) unnecessary synchronization issue
>>>>>
>>>>> We can enable the shadow virtqueue through the CLI, probably with a new
>>>>> vhost-vdpa parameter. Then we don't need to care about threads. And in
>>>>> the final version with full live migration support, the shadow virtqueue
>>>>> should be enabled automatically, e.g. for devices without
>>>>> VHOST_F_LOG_ALL, or we can have a dedicated vDPA capability via
>>>>> VHOST_GET_BACKEND_FEATURES.
>>>>>
>>>> It should be enabled automatically in those conditions, but it also
>>>> needs to be dynamic, and only be active during migration. Otherwise,
>>>> the guest should use regular vdpa operation. The problem with masking is
>>>> the same whether we enable it via QMP or because of a live migration event.
>>>>
>>>> So we will have the previous synchronization problem sooner or later.
>>>> If we omit the rollback to regular vdpa operation (in other words,
>>>> disabling SVQ), code can be simplified, but I'm not sure if that is
>>>> desirable.
>>>
>>> Right, so I'm OK with having the synchronization from the start if you wish.
>>>
>>> But we need to figure out what to synchronize and how to do the synchronization.
>>>
>>> Thanks
>>>
>>>
>>>> Thanks!
>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>> I can see / tested a few solutions but I don't like them a lot:
>>>>>>
>>>>>> * Forbid hot-swapping from/to shadow virtqueue mode, and set it from
>>>>>> cmdline: We will have to deal with setting the SVQ mode dynamically
>>>>>> sooner or later if we want to use it for live migration.
>>>>>> * Forbid coming back to regular mode after switching to shadow
>>>>>> virtqueue mode: The heavy part of the synchronization comes from svq
>>>>>> stopping code, since we need to serialize the setting of device call
>>>>>> fd. This could be acceptable, but I'm not sure about the implications:
>>>>>> What happens if live migration fails and we need to step back? A mutex
>>>>>> is not needed in this scenario, it's ok with atomics and RCU code.
>>>>>>
>>>>>> * Replace KVM_IRQFD instead and let SVQ poll the old one and masked
>>>>>> notifier: I haven't thought a lot about this one; I think it's better to
>>>>>> not touch guest notifiers.
>>>>>> * Monitor also masked notifier from SVQ: I think this could be
>>>>>> promising, but SVQ needs to be notified about masking/unmasking
>>>>>> anyway, and there is code that depends on checking the masked notifier
>>>>>> for the pending notification.
>>>>>>
>>>>>>>>> 2) vhost dev start and stop
>>>>>>>>>
>>>>>>>>> ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +     *
>>>>>>>>>> +     * So shadow virtqueue must not clean it, or we would lose the VirtQueue one.
>>>>>>>>>> +     */
>>>>>>>>>> +    EventNotifier host_notifier;
>>>>>>>>>> +
>>>>>>>>>> +    /* Virtio queue shadowing */
>>>>>>>>>> +    VirtQueue *vq;
>>>>>>>>>>       } VhostShadowVirtqueue;
>>>>>>>>>>
>>>>>>>>>> +/* Forward guest notifications */
>>>>>>>>>> +static void vhost_handle_guest_kick(EventNotifier *n)
>>>>>>>>>> +{
>>>>>>>>>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>>>>>>>>> +                                             host_notifier);
>>>>>>>>>> +
>>>>>>>>>> +    if (unlikely(!event_notifier_test_and_clear(n))) {
>>>>>>>>>> +        return;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    event_notifier_set(&svq->kick_notifier);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
>>>>>>>>>> + */
>>>>>>>>>> +static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
>>>>>>>>>> +                                                     unsigned vhost_index,
>>>>>>>>>> +                                                     VhostShadowVirtqueue *svq)
>>>>>>>>>> +{
>>>>>>>>>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
>>>>>>>>>> +    struct vhost_vring_file file = {
>>>>>>>>>> +        .index = vhost_index,
>>>>>>>>>> +        .fd = event_notifier_get_fd(vq_host_notifier),
>>>>>>>>>> +    };
>>>>>>>>>> +    int r;
>>>>>>>>>> +
>>>>>>>>>> +    /* Restore vhost kick */
>>>>>>>>>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
>>>>>>>>>> +    return r ? -errno : 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * Start shadow virtqueue operation.
>>>>>>>>>> + * @dev vhost device
>>>>>>>>>> + * @idx vhost virtqueue index
>>>>>>>>>> + * @svq Shadow Virtqueue
>>>>>>>>>> + */
>>>>>>>>>> +bool vhost_shadow_vq_start(struct vhost_dev *dev,
>>>>>>>>>> +                           unsigned idx,
>>>>>>>>>> +                           VhostShadowVirtqueue *svq)
>>>>>>>>> It looks to me this assumes the vhost_dev is started before
>>>>>>>>> vhost_shadow_vq_start()?
>>>>>>>>>
>>>>>>>> Right.
>>>>>>> This might not be true. The guest may enable and disable virtio drivers after
>>>>>>> the shadow virtqueue is started. You need to deal with that.
>>>>>>>
>>>>>> Right, I will test this scenario.
>>>>>>
>>>>>>> Thanks
>>>>>>>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 12/13] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick
  2021-03-16  8:07   ` Jason Wang
@ 2021-05-17 17:11     ` Eugenio Perez Martin
  0 siblings, 0 replies; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-05-17 17:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Tue, Mar 16, 2021 at 9:07 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.c | 11 ++++++++++-
> >   1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 68ed0f2740..7df98fc43f 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -145,6 +145,15 @@ static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
> >       svq->ring_id_maps[qemu_head] = elem;
> >   }
> >
> > +static void vhost_shadow_vq_kick(VhostShadowVirtqueue *svq)
> > +{
> > +    /* Make sure we are reading updated device flag */
> > +    smp_rmb();
>
>
> smp_mb() actually? Or it's better to explain that the following read needs
> to be ordered with what was read before.
>
> Thanks
>

Sorry for the late reply; I moved on to the vhost-vdpa usage of SVQ and
missed these comments.

My intention was just to order the reading of the used ring flags. In
other words, to avoid reading a stale value in the next conditional.

The descriptors themselves should already be written, because
vhost_shadow_vq_add_split calls smp_wmb almost at the end of its
execution. avail_ring->idx is not protected by it, though. Is that what
you meant about turning it into a full barrier?

Maybe it's clearer just to call smp_mb() between the calls to
vhost_shadow_vq_add_split and vhost_shadow_vq_kick, merging both
memory barriers?
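
In code, something like this is what I mean (a rough sketch of the
caller, merging the two barriers; the surrounding loop is elided):

    vhost_shadow_vq_add(svq, elem);

    /*
     * Full barrier: the writes of the descriptors and the avail idx must
     * be visible to the device before we read used->flags, and the read
     * of used->flags must not be reordered before those writes. This
     * would subsume the smp_wmb() in add_split and the smp_rmb() in kick.
     */
    smp_mb();

    if (!(svq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
        event_notifier_set(&svq->kick_notifier);
    }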

Thanks!

>
> > +    if (!(svq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
> > +        event_notifier_set(&svq->kick_notifier);
> > +    }
> > +}
> > +
> >   /* Handle guest->device notifications */
> >   static void vhost_handle_guest_kick(EventNotifier *n)
> >   {
> > @@ -174,7 +183,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >               }
> >
> >               vhost_shadow_vq_add(svq, elem);
> > -            event_notifier_set(&svq->kick_notifier);
> > +            vhost_shadow_vq_kick(svq);
> >           }
> >
> >           virtio_queue_set_notification(svq->vq, true);
>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 13/13] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue
  2021-03-16  8:08   ` Jason Wang
@ 2021-05-17 17:32     ` Eugenio Perez Martin
  0 siblings, 0 replies; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-05-17 17:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Tue, Mar 16, 2021 at 9:08 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.c | 28 +++++++++++++++++++++++++++-
> >   1 file changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 7df98fc43f..e3879a4622 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -71,10 +71,35 @@ typedef struct VhostShadowVirtqueue {
> >       /* Next head to consume from device */
> >       uint16_t used_idx;
> >
> > +    /* Cache for the exposed notification flag */
> > +    bool notification;
> > +
> >       /* Descriptors copied from guest */
> >       vring_desc_t descs[];
> >   } VhostShadowVirtqueue;
> >
> > +static void vhost_shadow_vq_set_notification(VhostShadowVirtqueue *svq,
> > +                                             bool enable)
> > +{
> > +    uint16_t notification_flag;
> > +
> > +    if (svq->notification == enable) {
> > +        return;
> > +    }
> > +
> > +    notification_flag = virtio_tswap16(svq->vdev, VRING_AVAIL_F_NO_INTERRUPT);
> > +
> > +    svq->notification = enable;
> > +    if (enable) {
> > +        svq->vring.avail->flags &= ~notification_flag;
> > +    } else {
> > +        svq->vring.avail->flags |= notification_flag;
> > +    }
> > +
> > +    /* Make sure device reads our flag */
> > +    smp_mb();
>
>
> This is a hint, so we don't need memory barrier here.
>
> Thanks
>

I will delete it for the next revision.

Thanks!

>
> > +}
> > +
> >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >                                       const struct iovec *iovec,
> >                                       size_t num, bool more_descs, bool write)
> > @@ -251,7 +276,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >       do {
> >           unsigned i = 0;
> >
> > -        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> > +        vhost_shadow_vq_set_notification(svq, false);
> >           while (true) {
> >               g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
> >               if (!elem) {
> > @@ -269,6 +294,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >               svq->masked_notifier.signaled = true;
> >               event_notifier_set(svq->masked_notifier.n);
> >           }
> > +        vhost_shadow_vq_set_notification(svq, true);
> >       } while (vhost_shadow_vq_more_used(svq));
> >
> >       if (masked_notifier) {
>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v2 08/13] virtio: Add vhost_shadow_vq_get_vring_addr
  2021-03-16 15:20     ` Eugenio Perez Martin
@ 2021-05-17 17:39       ` Eugenio Perez Martin
  0 siblings, 0 replies; 46+ messages in thread
From: Eugenio Perez Martin @ 2021-05-17 17:39 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Guru Prasad, Juan Quintela,
	qemu-level, Markus Armbruster, Stefano Garzarella,
	Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
	Michael Lilja, Jim Harford, Rob Miller

On Tue, Mar 16, 2021 at 4:20 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Tue, Mar 16, 2021 at 8:50 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2021/3/16 3:48 AM, Eugenio Pérez wrote:
> > > It reports the shadow virtqueue address from the qemu virtual address space
> >
> >
> > Note that to be used by vDPA, we can't use qemu VA directly here.
> >
>
> Right, I'm planning to use a different virtual address space if the
> device has such limitations.
>
> >
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >   hw/virtio/vhost-shadow-virtqueue.h |  2 ++
> > >   hw/virtio/vhost-shadow-virtqueue.c | 24 +++++++++++++++++++++++-
> > >   2 files changed, 25 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > > index 2ca4b92b12..d82c35bccf 100644
> > > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > > @@ -19,6 +19,8 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > >
> > >   void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked);
> > >   void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq);
> > > +void vhost_shadow_vq_get_vring_addr(const VhostShadowVirtqueue *svq,
> > > +                                    struct vhost_vring_addr *addr);
> > >
> > >   bool vhost_shadow_vq_start(struct vhost_dev *dev,
> > >                              unsigned idx,
> > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > > index b6bab438d6..1460d1d5d1 100644
> > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > @@ -17,6 +17,9 @@
> > >
> > >   /* Shadow virtqueue to relay notifications */
> > >   typedef struct VhostShadowVirtqueue {
> > > +    /* Shadow vring */
> > > +    struct vring vring;
> > > +
> > >       /* Shadow kick notifier, sent to vhost */
> > >       EventNotifier kick_notifier;
> > >       /* Shadow call notifier, sent to vhost */
> > > @@ -51,6 +54,9 @@ typedef struct VhostShadowVirtqueue {
> > >
> > >       /* Virtio device */
> > >       VirtIODevice *vdev;
> > > +
> > > +    /* Descriptors copied from guest */
> > > +    vring_desc_t descs[];
> > >   } VhostShadowVirtqueue;
> > >
> > >   /* Forward guest notifications */
> > > @@ -132,6 +138,19 @@ void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq)
> > >       qemu_event_wait(&svq->masked_notifier.is_free);
> > >   }
> > >
> > > +/*
> > > + * Get the shadow vq vring address.
> > > + * @svq Shadow virtqueue
> > > + * @addr Destination to store address
> > > + */
> > > +void vhost_shadow_vq_get_vring_addr(const VhostShadowVirtqueue *svq,
> > > +                                    struct vhost_vring_addr *addr)
> > > +{
> > > +    addr->desc_user_addr = (uint64_t)svq->vring.desc;
> > > +    addr->avail_user_addr = (uint64_t)svq->vring.avail;
> > > +    addr->used_user_addr = (uint64_t)svq->vring.used;
> > > +}
> > > +
> > >   /*
> > >    * Restore the vhost guest to host notifier, i.e., disables svq effect.
> > >    */
> > > @@ -262,7 +281,9 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
> > >   VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> > >   {
> > >       int vq_idx = dev->vq_index + idx;
> > > -    g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> > > +    unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > > +    size_t ring_size = vring_size(num, VRING_DESC_ALIGN_SIZE);
> > > +    g_autofree VhostShadowVirtqueue *svq = g_malloc0(sizeof(*svq) + ring_size);
> > >       int r;
> > >
> > >       r = event_notifier_init(&svq->kick_notifier, 0);
> > > @@ -279,6 +300,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
> > >           goto err_init_call_notifier;
> > >       }
> > >
> > > +    vring_init(&svq->vring, num, svq->descs, VRING_DESC_ALIGN_SIZE);
> >
> >
> > We had some discussion in the past. Exporting vring_init() was wrong,
> > but it is too late to fix (it assumes the legacy split layout). Let's
> > not depend on this buggy uAPI.
> >
>
> Ok, I will change the way to allocate and initialize it.
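>
> For example (a rough sketch only, assuming we keep the split layout
> and the vring_*_t typedefs; the final shape may differ), computing
> the layout explicitly instead of calling vring_init():
>
>     /*
>      * Sketch: driver area (desc + avail) and device area (used) as
>      * separate page-aligned allocations, so they can later be mapped
>      * with different IOMMU permissions.
>      */
>     size_t desc_size = sizeof(vring_desc_t) * num;
>     size_t avail_size = offsetof(vring_avail_t, ring[num])
>                         + sizeof(uint16_t);         /* used_event */
>     size_t used_size = offsetof(vring_used_t, ring[num])
>                        + sizeof(uint16_t);          /* avail_event */
>
>     svq->vring.num = num;
>     svq->vring.desc = qemu_memalign(qemu_real_host_page_size,
>                                     desc_size + avail_size);
>     svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
>     svq->vring.used = qemu_memalign(qemu_real_host_page_size, used_size);
>     memset(svq->vring.desc, 0, desc_size + avail_size);
>     memset(svq->vring.used, 0, used_size);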
>

Could we define the VIRTIO_RING_NO_LEGACY macro in qemu/osdep.h or
similar to avoid repeating this mistake in the future?
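
For illustration, something along these lines (an untested sketch; it
relies on the VIRTIO_RING_NO_LEGACY guard that the Linux uapi header
already puts around vring_init() and vring_size()):

    /*
     * Sketch for a central header such as qemu/osdep.h: defining
     * VIRTIO_RING_NO_LEGACY before the virtio_ring header is included
     * hides the legacy helpers, so any new vring_init()/vring_size()
     * caller fails to compile.
     */
    #define VIRTIO_RING_NO_LEGACY
    #include "standard-headers/linux/virtio_ring.h"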

Thanks!

> > Thanks
> >
> >
> > >       svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> > >       svq->vdev = dev->vdev;
> > >       event_notifier_set_handler(&svq->call_notifier,
> >




Thread overview: 46+ messages
2021-03-15 19:48 [RFC v2 00/13] vDPA software assisted live migration Eugenio Pérez
2021-03-15 19:48 ` [RFC v2 01/13] virtio: Add virtio_queue_is_host_notifier_enabled Eugenio Pérez
2021-03-15 19:48 ` [RFC v2 02/13] vhost: Save masked_notifier state Eugenio Pérez
2021-03-15 19:48 ` [RFC v2 03/13] vhost: Add VhostShadowVirtqueue Eugenio Pérez
2021-03-15 19:48 ` [RFC v2 04/13] vhost: Add x-vhost-enable-shadow-vq qmp Eugenio Pérez
2021-03-16 13:37   ` Eric Blake
2021-03-15 19:48 ` [RFC v2 05/13] vhost: Route guest->host notification through shadow virtqueue Eugenio Pérez
2021-03-16  7:18   ` Jason Wang
2021-03-16 10:31     ` Eugenio Perez Martin
2021-03-17  2:05       ` Jason Wang
2021-03-17 16:47         ` Eugenio Perez Martin
2021-03-18  3:10           ` Jason Wang
2021-03-18  9:18             ` Eugenio Perez Martin
2021-03-18  9:29               ` Jason Wang
2021-03-18 10:48                 ` Eugenio Perez Martin
2021-03-18 12:04                   ` Eugenio Perez Martin
2021-03-19  6:55                     ` Jason Wang
2021-03-15 19:48 ` [RFC v2 06/13] vhost: Route host->guest " Eugenio Pérez
2021-03-16  7:21   ` Jason Wang
2021-03-15 19:48 ` [RFC v2 07/13] vhost: Avoid re-set masked notifier in shadow vq Eugenio Pérez
2021-03-15 19:48 ` [RFC v2 08/13] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
2021-03-16  7:50   ` Jason Wang
2021-03-16 15:20     ` Eugenio Perez Martin
2021-05-17 17:39       ` Eugenio Perez Martin
2021-03-15 19:48 ` [RFC v2 09/13] virtio: Add virtio_queue_full Eugenio Pérez
2021-03-15 19:48 ` [RFC v2 10/13] vhost: add vhost_kernel_set_vring_enable Eugenio Pérez
2021-03-16  7:29   ` Jason Wang
2021-03-16 10:43     ` Eugenio Perez Martin
2021-03-17  2:25       ` Jason Wang
2021-03-15 19:48 ` [RFC v2 11/13] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
2021-03-16  8:15   ` Jason Wang
2021-03-16 16:05     ` Eugenio Perez Martin
2021-03-17  2:50       ` Jason Wang
2021-03-17 14:38         ` Eugenio Perez Martin
2021-03-18  3:14           ` Jason Wang
2021-03-18  8:06             ` Eugenio Perez Martin
2021-03-18  9:16               ` Jason Wang
2021-03-18  9:54                 ` Eugenio Perez Martin
2021-03-15 19:48 ` [RFC v2 12/13] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
2021-03-16  8:07   ` Jason Wang
2021-05-17 17:11     ` Eugenio Perez Martin
2021-03-15 19:48 ` [RFC v2 13/13] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
2021-03-16  8:08   ` Jason Wang
2021-05-17 17:32     ` Eugenio Perez Martin
2021-03-16  8:28 ` [RFC v2 00/13] vDPA software assisted live migration Jason Wang
2021-03-16 17:25   ` Eugenio Perez Martin
