* [RFC v3 00/29] vDPA software assisted live migration
@ 2021-05-19 16:28 Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

This series enables shadow virtqueue for vhost-vdpa devices. It is a
new method of vhost device migration: instead of relying on the vhost
device's dirty logging capability, SW assisted LM intercepts the
dataplane, forwarding the descriptors between VM and device. It is
intended for vDPA devices with no dirty memory tracking capabilities.

In this migration mode, qemu offers a new vring to the device to
read from and write to, and disables the vhost notifiers, processing
guest and vhost notifications in qemu. On used buffer relay, qemu marks
the dirty memory the same way plain virtio-net devices do, so the
device does not need to have dirty page logging capability.
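
As a sketch of that relay step (illustrative only; the real forwarding
code arrives in patch 17), returning a used buffer through the regular
VirtQueue helpers makes qemu itself write the used ring, so the memory
API logs those pages as dirty:

    /* Illustrative sketch, not the series' literal code */
    static void svq_relay_used(VirtIODevice *vdev, VirtQueue *vq,
                               VirtQueueElement *elem, unsigned len)
    {
        virtqueue_push(vq, elem, len); /* writes used ring -> dirty log */
        virtio_notify(vdev, vq);       /* raise the guest's interrupt */
        g_free(elem);
    }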

This series is a POC doing SW LM for vhost-net and vhost-vdpa devices.
The former already has dirty page logging capabilities, but it is
easier to test and exercises different code paths in qemu.

For qemu to use shadow virtqueues, the vhost-net devices need to be
instantiated (see the example invocation after this list):
* With IOMMU (iommu_platform=on,ats=on)
* Without event_idx (event_idx=off)
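
A qemu invocation needs something like the following on top of the
usual options (a sketch assuming kvm and the intel vIOMMU; the netdev
details are just an example):

    -M q35,accel=kvm,kernel-irqchip=split \
    -device intel-iommu,intremap=on,device-iotlb=on \
    -netdev tap,id=net0,vhost=on \
    -device virtio-net-pci,netdev=net0,iommu_platform=on,ats=on,event_idx=off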

And shadow virtqueue needs to be enabled for them with a QMP command
like:

{ "execute": "x-vhost-enable-shadow-vq",
      "arguments": { "name": "dev0", "enable": true } }

The series includes some commits to delete in the final version. One
of them adds vhost_kernel_vring_pause to vhost kernel devices. It is
only intended to work with vhost-net devices, as a way to test the
solution, so don't use any other vhost kernel device in the same test.

The vhost-vdpa devices should work the same way. However, vp_vdpa is
not working properly with intel iommu unmapping, so this series adds
two extra commits to allow testing the solution: enabling SVQ mode from
device start and forbidding any other vhost-vdpa memory mapping. The
causes of this are still being debugged.

For testing vhost-vdpa devices, the vp_vdpa device has been used with
nested virtualization, using a qemu virtio-net device in L0. To be able
to stop the device and reset its status, features still in RFC status
have been implemented in commits 5 and 6. After that, the virtio-net
driver in the L0 guest is replaced by the vp_vdpa driver, and a nested
qemu instance is launched using it.

This vp_vdpa driver also needs to be modified to support the RFCs,
mainly allowing it to remove the _S_STOPPED status flag and
implementing actual vp_vdpa_set_vq_state and vp_vdpa_get_vq_state
callbacks.

Just the notification forwarding (with no descriptor relay) can be
achieved with patches 7 and 8, plus starting SVQ. The previous commits
are cleanups and the declaration of the QMP command.

Commit 17 introduces the buffer forwarding. The previous ones are
preparations again, and the later ones enable some obvious
optimizations. However, it needs the vdpa device to be able to map the
whole IOVA space, and some vDPA devices are not able to do so. A check
for this is added in previous commits.

Later commits allow vhost and the shadow virtqueue to track and
translate between qemu virtual addresses and a restricted iommu range.
At the moment it is not able to delete old translations or limit the
maximum range it can translate, nor can vhost add new memory regions
from the moment SVQ is enabled, but it is fairly straightforward to add
these.
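
Conceptually, the translation those patches provide looks like this
(illustrative data structure and names, not the series' actual API,
which lives in vhost-iova-tree.c):

    typedef struct SVQMapSketch {
        void *va;
        size_t size;
        uint64_t iova;
    } SVQMapSketch;

    /* Resolve a qemu VA to the device's restricted IOVA window */
    static bool svq_translate(const SVQMapSketch *maps, size_t n,
                              const void *va, uint64_t *iova)
    {
        for (size_t i = 0; i < n; i++) {
            const char *base = maps[i].va;
            if ((const char *)va >= base &&
                (const char *)va < base + maps[i].size) {
                *iova = maps[i].iova + ((const char *)va - base);
                return true;
            }
        }
        return false; /* unmapped: SVQ cannot expose this buffer */
    }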

This is a big series, so the idea is to send it in logical chunks once
all comments have been collected. An SVQ mode with no possibility of
going back to regular mode would cover a first complete use case, and
this RFC already has all the ingredients except internal memory
tracking.

It is based on the ideas of DPDK SW assisted LM, in DPDK's series
https://patchwork.dpdk.org/cover/48370/ . However, this series does not
map the shadow vq in the guest's VA, but in qemu's.

Comments are welcome!

TODO:
* Event idx, indirect descriptors, packed, and other virtio features -
  waiting for confirmation of the big picture.
* vDPA devices: Grow IOVA tree to track new or deleted memory. Cap
  the IOVA limit in the tree so it cannot grow forever.
* Separate buffer forwarding into its own AIO context, so we can
  throw more threads at that task and we don't need to stop the main
  event loop.
* IOMMU optimizations, so batching and bigger chunks of IOVA can be
  sent to the device.
* Automatic kick-in on live migration.
* Proper documentation.

Thanks!

Changes from v2 RFC:
  * Adding vhost-vdpa devices support
  * Fixed some memory leaks pointed out by different comments

Changes from v1 RFC:
  * Use QMP instead of migration to start SVQ mode.
  * Only accepting IOMMU devices, whose behavior is closer to the
    target devices (vDPA)
  * Fix invalid masking/unmasking of vhost call fd.
  * Use of proper methods for synchronization.
  * No need to modify VirtIO device code, all of the changes are
    contained in vhost code.
  * Delete superfluous code.
  * An intermediate RFC was sent with only the notifications forwarding
    changes. It can be seen in
    https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
  * v1 at
    https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html

Eugenio Pérez (29):
  virtio: Add virtio_queue_is_host_notifier_enabled
  vhost: Save masked_notifier state
  vhost: Add VhostShadowVirtqueue
  vhost: Add x-vhost-enable-shadow-vq qmp
  virtio: Add VIRTIO_F_QUEUE_STATE
  virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
  vhost: Route guest->host notification through shadow virtqueue
  vhost: Route host->guest notification through shadow virtqueue
  vhost: Avoid re-set masked notifier in shadow vq
  virtio: Add vhost_shadow_vq_get_vring_addr
  vhost: Add vhost_vring_pause operation
  vhost: add vhost_kernel_vring_pause
  vhost: Add vhost_get_iova_range operation
  vhost: add vhost_has_limited_iova_range
  vhost: Add enable_custom_iommu to VhostOps
  vhost-vdpa: Add vhost_vdpa_enable_custom_iommu
  vhost: Shadow virtqueue buffers forwarding
  vhost: Use vhost_enable_custom_iommu to unmap everything if available
  vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue
    kick
  vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow
    virtqueue
  vhost: Add VhostIOVATree
  vhost: Add iova_rev_maps_find_iova to IOVAReverseMaps
  vhost: Use a tree to store memory mappings
  vhost: Add iova_rev_maps_alloc
  vhost: Add custom IOTLB translations to SVQ
  vhost: Map in vdpa-dev
  vhost-vdpa: Implement vhost_vdpa_vring_pause operation
  vhost-vdpa: never map with vDPA listener
  vhost: Start vhost-vdpa SVQ directly

 qapi/net.json                                 |  22 +
 hw/virtio/vhost-iova-tree.h                   |  61 ++
 hw/virtio/vhost-shadow-virtqueue.h            |  38 ++
 hw/virtio/virtio-pci.h                        |   1 +
 include/hw/virtio/vhost-backend.h             |  16 +
 include/hw/virtio/vhost-vdpa.h                |   2 +-
 include/hw/virtio/vhost.h                     |  14 +
 include/hw/virtio/virtio.h                    |   5 +-
 .../standard-headers/linux/virtio_config.h    |   5 +
 include/standard-headers/linux/virtio_pci.h   |   2 +
 hw/net/virtio-net.c                           |   4 +-
 hw/virtio/vhost-backend.c                     |  42 ++
 hw/virtio/vhost-iova-tree.c                   | 283 ++++++++
 hw/virtio/vhost-shadow-virtqueue.c            | 643 ++++++++++++++++++
 hw/virtio/vhost-vdpa.c                        |  73 +-
 hw/virtio/vhost.c                             | 459 ++++++++++++-
 hw/virtio/virtio-pci.c                        |   9 +
 hw/virtio/virtio.c                            |   5 +
 hw/virtio/meson.build                         |   2 +-
 hw/virtio/trace-events                        |   1 +
 20 files changed, 1663 insertions(+), 24 deletions(-)
 create mode 100644 hw/virtio/vhost-iova-tree.h
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
 create mode 100644 hw/virtio/vhost-iova-tree.c
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.c

-- 
2.27.0





* [RFC v3 01/29] virtio: Add virtio_queue_is_host_notifier_enabled
@ 2021-05-19 16:28 ` Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

This allows shadow virtqueue code to assert the queue status before
making changes.
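
Patch 7 uses it exactly this way when starting the shadow virtqueue:

    /* Check that notifications are still going directly to vhost dev */
    assert(virtio_queue_is_host_notifier_enabled(svq->vq));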

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/virtio.h | 1 +
 hw/virtio/virtio.c         | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b7ece7a6a8..c2c7cee993 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -316,6 +316,7 @@ void virtio_device_release_ioeventfd(VirtIODevice *vdev);
 bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev);
 EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
 void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled);
+bool virtio_queue_is_host_notifier_enabled(const VirtQueue *vq);
 void virtio_queue_host_notifier_read(EventNotifier *n);
 void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
                                                 VirtIOHandleAIOOutput handle_output);
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 07f4e60b30..a86b3f9c26 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3594,6 +3594,11 @@ EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq)
     return &vq->host_notifier;
 }
 
+bool virtio_queue_is_host_notifier_enabled(const VirtQueue *vq)
+{
+    return vq->host_notifier_enabled;
+}
+
 void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled)
 {
     vq->host_notifier_enabled = enabled;
-- 
2.27.0




* [RFC v3 02/29] vhost: Save masked_notifier state
@ 2021-05-19 16:28 ` Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

It will be used to configure the shadow virtqueue. The shadow
virtqueue will relay the device->guest notifications, so vhost needs
to be able to tell the masking status.
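
Patch 8 consumes the saved state when the shadow virtqueue starts:

    /* Set shadow vq -> guest notifier */
    assert(dev->shadow_vqs_enabled);
    vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
                         dev->vqs[idx].notifier_is_masked);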

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost.h | 1 +
 hw/virtio/vhost.c         | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 4a8bc75415..ac963bf23d 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -28,6 +28,7 @@ struct vhost_virtqueue {
     unsigned avail_size;
     unsigned long long used_phys;
     unsigned used_size;
+    bool notifier_is_masked;
     EventNotifier masked_notifier;
     struct vhost_dev *dev;
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index bbc2f228b5..40f9f64ebd 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1527,6 +1527,8 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, VirtIODevice *vdev, int n,
     r = hdev->vhost_ops->vhost_set_vring_call(hdev, &file);
     if (r < 0) {
         VHOST_OPS_DEBUG("vhost_set_vring_call failed");
+    } else {
+        hdev->vqs[index].notifier_is_masked = mask;
     }
 }
 
-- 
2.27.0




* [RFC v3 03/29] vhost: Add VhostShadowVirtqueue
@ 2021-05-19 16:28 ` Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Vhost shadow virtqueue (SVQ) is an intermediate jump for virtqueue
notifications and buffers, allowing qemu to track them. While qemu is
forwarding the buffers and virtqueue changes, it is able to commit the
memory being dirtied, the same way regular qemu VirtIO devices do.

This commit only exposes basic SVQ allocation and free, so changes
regarding different aspects of SVQ (notifications forwarding, buffer
forwarding, starting/stopping) are more isolated and easier to bisect.
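
A minimal usage sketch of the pair this commit exposes (patch 7 adds
the real caller, vhost_sw_live_migration_start()):

    VhostShadowVirtqueue *svq = vhost_shadow_vq_new(dev, 0);
    if (unlikely(!svq)) {
        return -1; /* notifier creation failed, error already reported */
    }
    /* ... forward notifications through svq's kick/call notifiers ... */
    vhost_shadow_vq_free(svq);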

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h | 24 ++++++++++++
 hw/virtio/vhost-shadow-virtqueue.c | 63 ++++++++++++++++++++++++++++++
 hw/virtio/meson.build              |  2 +-
 3 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.c

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
new file mode 100644
index 0000000000..6cc18d6acb
--- /dev/null
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -0,0 +1,24 @@
+/*
+ * vhost software live migration ring
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef VHOST_SHADOW_VIRTQUEUE_H
+#define VHOST_SHADOW_VIRTQUEUE_H
+
+#include "qemu/osdep.h"
+
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/vhost.h"
+
+typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
+
+VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
+
+void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
+
+#endif
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
new file mode 100644
index 0000000000..4512e5b058
--- /dev/null
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -0,0 +1,63 @@
+/*
+ * vhost software live migration ring
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "hw/virtio/vhost-shadow-virtqueue.h"
+
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+
+/* Shadow virtqueue to relay notifications */
+typedef struct VhostShadowVirtqueue {
+    /* Shadow kick notifier, sent to vhost */
+    EventNotifier kick_notifier;
+    /* Shadow call notifier, sent to vhost */
+    EventNotifier call_notifier;
+} VhostShadowVirtqueue;
+
+/*
+ * Creates the vhost shadow virtqueue, and instructs the vhost device to use
+ * the shadow methods and file descriptors.
+ */
+VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
+{
+    g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
+    int r;
+
+    r = event_notifier_init(&svq->kick_notifier, 0);
+    if (r != 0) {
+        error_report("Couldn't create kick event notifier: %s",
+                     strerror(errno));
+        goto err_init_kick_notifier;
+    }
+
+    r = event_notifier_init(&svq->call_notifier, 0);
+    if (r != 0) {
+        error_report("Couldn't create call event notifier: %s",
+                     strerror(errno));
+        goto err_init_call_notifier;
+    }
+
+    return g_steal_pointer(&svq);
+
+err_init_call_notifier:
+    event_notifier_cleanup(&svq->kick_notifier);
+
+err_init_kick_notifier:
+    return NULL;
+}
+
+/*
+ * Free the resources of the shadow virtqueue.
+ */
+void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
+{
+    event_notifier_cleanup(&vq->kick_notifier);
+    event_notifier_cleanup(&vq->call_notifier);
+    g_free(vq);
+}
diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index fbff9bc9d4..8b5a0225fe 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
 
 virtio_ss = ss.source_set()
 virtio_ss.add(files('virtio.c'))
-virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c'))
+virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
 virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
-- 
2.27.0




* [RFC v3 04/29] vhost: Add x-vhost-enable-shadow-vq qmp
@ 2021-05-19 16:28 ` Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

The command to enable the shadow virtqueue looks like:

{ "execute": "x-vhost-enable-shadow-vq",
  "arguments": { "name": "dev0", "enable": true } }

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 qapi/net.json     | 22 ++++++++++++++++++++++
 hw/virtio/vhost.c |  6 ++++++
 2 files changed, 28 insertions(+)

diff --git a/qapi/net.json b/qapi/net.json
index c31748c87f..660feafdd2 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -77,6 +77,28 @@
 ##
 { 'command': 'netdev_del', 'data': {'id': 'str'} }
 
+##
+# @x-vhost-enable-shadow-vq:
+#
+# Use vhost shadow virtqueue.
+#
+# @name: the device name of the VirtIO device
+#
+# @enable: true to use the alternate shadow VQ notification path
+#
+# Returns: Error on failure, nothing on success. DeviceNotFound if vhost is not enabled.
+#
+# Since: 6.1
+#
+# Example:
+#
+# -> { "execute": "x-vhost-enable-shadow-vq", "arguments": { "name": "virtio-net", "enable": false } }
+#
+##
+{ 'command': 'x-vhost-enable-shadow-vq',
+  'data': {'name': 'str', 'enable': 'bool'},
+  'if': 'defined(CONFIG_VHOST_KERNEL)' }
+
 ##
 # @NetLegacyNicOptions:
 #
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 40f9f64ebd..c4c1f80661 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -15,6 +15,7 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "qapi/qapi-commands-net.h"
 #include "hw/virtio/vhost.h"
 #include "qemu/atomic.h"
 #include "qemu/range.h"
@@ -1831,3 +1832,8 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
 
     return -1;
 }
+
+void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
+{
+    error_setg(errp, "Shadow virtqueue still not implemented");
+}
-- 
2.27.0




* [RFC v3 05/29] virtio: Add VIRTIO_F_QUEUE_STATE
@ 2021-05-19 16:28 ` Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Implementation of the device state capability RFC:
https://lists.oasis-open.org/archives/virtio-comment/202012/msg00005.html

With this capability, the vdpa device can reset its index so it can
start consuming from the shadow virtqueue (SVQ), which starts with
state 0. Another approach would be to make SVQ start forwarding from
the state of the device when the latter is stopped, but this device
capability is needed at the destination of live migration anyway.

The use case is to test SVQ with virtio-pci vdpa (vp_vdpa) with nested
virtualization: spawn an L0 qemu with a virtio-net device, use the
vp_vdpa driver to handle it in the guest, and then spawn an L1 qemu
using that vdpa device. When L1 qemu asks the device to set a new state
through the vdpa ioctl, vp_vdpa should set each queue state through the
virtio VIRTIO_PCI_COMMON_Q_AVAIL_STATE register.

Since this is only for testing vhost-vdpa, it's added here before being
proposed for kernel code. No effort is made to check that the device
can actually change its state or its layout, or whether the device even
supports changing state at all. These checks will be added in the
future.

Also, a modified version of vp_vdpa that allows setting these in the
PCI config is needed.
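
As a sketch of the driver side, such a modified vp_vdpa could poke the
new field with the modern-PCI accessors (hypothetical Linux-side code,
assuming the kernel's struct virtio_pci_common_cfg grows the same
queue_avail_state field; not part of this series):

    static void vp_vdpa_set_vq_state_sketch(struct virtio_pci_modern_device *mdev,
                                            u16 qid, u16 avail_idx)
    {
        vp_iowrite16(qid, &mdev->common->queue_select);
        vp_iowrite16(avail_idx, &mdev->common->queue_avail_state);
    }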

TODO: Check for feature enabled and split in virtio pci config

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/virtio-pci.h                         | 1 +
 include/hw/virtio/virtio.h                     | 4 +++-
 include/standard-headers/linux/virtio_config.h | 3 +++
 include/standard-headers/linux/virtio_pci.h    | 2 ++
 hw/virtio/virtio-pci.c                         | 9 +++++++++
 5 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
index d7d5d403a9..69e34449cd 100644
--- a/hw/virtio/virtio-pci.h
+++ b/hw/virtio/virtio-pci.h
@@ -115,6 +115,7 @@ typedef struct VirtIOPCIQueue {
   uint32_t desc[2];
   uint32_t avail[2];
   uint32_t used[2];
+  uint16_t state;
 } VirtIOPCIQueue;
 
 struct VirtIOPCIProxy {
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index c2c7cee993..dfcc7d8350 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -289,7 +289,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
     DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
                       VIRTIO_F_IOMMU_PLATFORM, false), \
     DEFINE_PROP_BIT64("packed", _state, _field, \
-                      VIRTIO_F_RING_PACKED, false)
+                      VIRTIO_F_RING_PACKED, false), \
+    DEFINE_PROP_BIT64("save_restore_q_state", _state, _field, \
+                      VIRTIO_F_QUEUE_STATE, true)
 
 hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
 bool virtio_queue_enabled_legacy(VirtIODevice *vdev, int n);
diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
index 22e3a85f67..59fad3eb45 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -90,4 +90,7 @@
  * Does the device support Single Root I/O Virtualization?
  */
 #define VIRTIO_F_SR_IOV			37
+
+/* Device support save and restore virtqueue state */
+#define VIRTIO_F_QUEUE_STATE            40
 #endif /* _LINUX_VIRTIO_CONFIG_H */
diff --git a/include/standard-headers/linux/virtio_pci.h b/include/standard-headers/linux/virtio_pci.h
index db7a8e2fcb..c8d9802a87 100644
--- a/include/standard-headers/linux/virtio_pci.h
+++ b/include/standard-headers/linux/virtio_pci.h
@@ -164,6 +164,7 @@ struct virtio_pci_common_cfg {
 	uint32_t queue_avail_hi;		/* read-write */
 	uint32_t queue_used_lo;		/* read-write */
 	uint32_t queue_used_hi;		/* read-write */
+	uint16_t queue_avail_state;     /* read-write */
 };
 
 /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
@@ -202,6 +203,7 @@ struct virtio_pci_cfg_cap {
 #define VIRTIO_PCI_COMMON_Q_AVAILHI	44
 #define VIRTIO_PCI_COMMON_Q_USEDLO	48
 #define VIRTIO_PCI_COMMON_Q_USEDHI	52
+#define VIRTIO_PCI_COMMON_Q_AVAIL_STATE	56
 
 #endif /* VIRTIO_PCI_NO_MODERN */
 
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 883045a223..ddb6fff098 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -1216,6 +1216,9 @@ static uint64_t virtio_pci_common_read(void *opaque, hwaddr addr,
     case VIRTIO_PCI_COMMON_Q_USEDHI:
         val = proxy->vqs[vdev->queue_sel].used[1];
         break;
+    case VIRTIO_PCI_COMMON_Q_AVAIL_STATE:
+        val = virtio_queue_get_last_avail_idx(vdev, vdev->queue_sel);
+        break;
     default:
         val = 0;
     }
@@ -1298,6 +1301,8 @@ static void virtio_pci_common_write(void *opaque, hwaddr addr,
                        proxy->vqs[vdev->queue_sel].avail[0],
                        ((uint64_t)proxy->vqs[vdev->queue_sel].used[1]) << 32 |
                        proxy->vqs[vdev->queue_sel].used[0]);
+            virtio_queue_set_last_avail_idx(vdev, vdev->queue_sel,
+                        proxy->vqs[vdev->queue_sel].state);
             proxy->vqs[vdev->queue_sel].enabled = 1;
         } else {
             virtio_error(vdev, "wrong value for queue_enable %"PRIx64, val);
@@ -1321,6 +1326,9 @@ static void virtio_pci_common_write(void *opaque, hwaddr addr,
     case VIRTIO_PCI_COMMON_Q_USEDHI:
         proxy->vqs[vdev->queue_sel].used[1] = val;
         break;
+    case VIRTIO_PCI_COMMON_Q_AVAIL_STATE:
+        proxy->vqs[vdev->queue_sel].state = val;
+        break;
     default:
         break;
     }
@@ -1900,6 +1908,7 @@ static void virtio_pci_reset(DeviceState *qdev)
         proxy->vqs[i].desc[0] = proxy->vqs[i].desc[1] = 0;
         proxy->vqs[i].avail[0] = proxy->vqs[i].avail[1] = 0;
         proxy->vqs[i].used[0] = proxy->vqs[i].used[1] = 0;
+        proxy->vqs[i].state = 0;
     }
 
     if (pci_is_express(dev)) {
-- 
2.27.0




* [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
@ 2021-05-19 16:28 ` Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

So the guest can stop and start the net device. It implements the RFC
https://lists.oasis-open.org/archives/virtio-comment/202012/msg00027.html

Stopping (as in "pausing") the device is required to migrate status and
vring addresses between the device and SVQ.

This is a WIP commit: as with VIRTIO_F_QUEUE_STATE, it is introduced in
virtio_config.h before even being proposed for the kernel, with no
feature flag and no checking in the device. It also needs a modified
vp_vdpa driver that supports setting and retrieving the status.
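
Conceptually, such a driver would pause the device by adding the new
status bit (hypothetical Linux-side sketch, per the referenced
virtio-comment RFC):

    u8 status = vp_modern_get_status(mdev);
    vp_modern_set_status(mdev, status | VIRTIO_CONFIG_S_DEVICE_STOPPED);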

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/standard-headers/linux/virtio_config.h | 2 ++
 hw/net/virtio-net.c                            | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
index 59fad3eb45..b3f6b1365d 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -40,6 +40,8 @@
 #define VIRTIO_CONFIG_S_DRIVER_OK	4
 /* Driver has finished configuring features */
 #define VIRTIO_CONFIG_S_FEATURES_OK	8
+/* Device is stopped */
+#define VIRTIO_CONFIG_S_DEVICE_STOPPED 32
 /* Device entered invalid state, driver must reset it */
 #define VIRTIO_CONFIG_S_NEEDS_RESET	0x40
 /* We've given up on this device. */
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 96a3cc8357..2d3caea289 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -198,7 +198,9 @@ static bool virtio_net_started(VirtIONet *n, uint8_t status)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
     return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
-        (n->status & VIRTIO_NET_S_LINK_UP) && vdev->vm_running;
+        (!(n->status & VIRTIO_CONFIG_S_DEVICE_STOPPED)) &&
+        (n->status & VIRTIO_NET_S_LINK_UP) &&
+        vdev->vm_running;
 }
 
 static void virtio_net_announce_notify(VirtIONet *net)
-- 
2.27.0




* [RFC v3 07/29] vhost: Route guest->host notification through shadow virtqueue
@ 2021-05-19 16:28 ` Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Shadow virtqueue notification forwarding is disabled when the vhost_dev
stops, so the code flow follows the usual cleanup.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |   7 ++
 include/hw/virtio/vhost.h          |   4 +
 hw/virtio/vhost-shadow-virtqueue.c | 113 ++++++++++++++++++++++-
 hw/virtio/vhost.c                  | 143 ++++++++++++++++++++++++++++-
 4 files changed, 265 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 6cc18d6acb..c891c6510d 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -17,6 +17,13 @@
 
 typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
+bool vhost_shadow_vq_start(struct vhost_dev *dev,
+                           unsigned idx,
+                           VhostShadowVirtqueue *svq);
+void vhost_shadow_vq_stop(struct vhost_dev *dev,
+                          unsigned idx,
+                          VhostShadowVirtqueue *svq);
+
 VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx);
 
 void vhost_shadow_vq_free(VhostShadowVirtqueue *vq);
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index ac963bf23d..7ffdf9aea0 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -55,6 +55,8 @@ struct vhost_iommu {
     QLIST_ENTRY(vhost_iommu) iommu_next;
 };
 
+typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
+
 typedef struct VhostDevConfigOps {
     /* Vhost device config space changed callback
      */
@@ -83,7 +85,9 @@ struct vhost_dev {
     uint64_t backend_cap;
     bool started;
     bool log_enabled;
+    bool shadow_vqs_enabled;
     uint64_t log_size;
+    VhostShadowVirtqueue **shadow_vqs;
     Error *migration_blocker;
     const VhostOps *vhost_ops;
     void *opaque;
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 4512e5b058..3e43399e9c 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -8,9 +8,12 @@
  */
 
 #include "hw/virtio/vhost-shadow-virtqueue.h"
+#include "hw/virtio/vhost.h"
+
+#include "standard-headers/linux/vhost_types.h"
 
 #include "qemu/error-report.h"
-#include "qemu/event_notifier.h"
+#include "qemu/main-loop.h"
 
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
@@ -18,14 +21,121 @@ typedef struct VhostShadowVirtqueue {
     EventNotifier kick_notifier;
     /* Shadow call notifier, sent to vhost */
     EventNotifier call_notifier;
+
+    /*
+     * Borrowed virtqueue's guest to host notifier.
+     * Borrowing it into this event notifier lets us register it on the
+     * event loop and access the associated shadow virtqueue easily. If we
+     * used the VirtQueue, we wouldn't have an easy way to retrieve it.
+     *
+     * So the shadow virtqueue must not clean it up, or VirtQueue loses it.
+     */
+    EventNotifier host_notifier;
+
+    /* Virtio queue shadowing */
+    VirtQueue *vq;
 } VhostShadowVirtqueue;
 
+/* Forward guest notifications */
+static void vhost_handle_guest_kick(EventNotifier *n)
+{
+    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
+                                             host_notifier);
+
+    if (unlikely(!event_notifier_test_and_clear(n))) {
+        return;
+    }
+
+    event_notifier_set(&svq->kick_notifier);
+}
+
+/*
+ * Restore the vhost guest to host notifier, i.e., disables svq effect.
+ */
+static int vhost_shadow_vq_restore_vdev_host_notifier(struct vhost_dev *dev,
+                                                     unsigned vhost_index,
+                                                     VhostShadowVirtqueue *svq)
+{
+    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
+    struct vhost_vring_file file = {
+        .index = vhost_index,
+        .fd = event_notifier_get_fd(vq_host_notifier),
+    };
+    int r;
+
+    /* Restore vhost kick */
+    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
+    return r ? -errno : 0;
+}
+
+/*
+ * Start shadow virtqueue operation.
+ * @dev vhost device
+ * @hidx vhost virtqueue index
+ * @svq Shadow Virtqueue
+ */
+bool vhost_shadow_vq_start(struct vhost_dev *dev,
+                           unsigned idx,
+                           VhostShadowVirtqueue *svq)
+{
+    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
+    struct vhost_vring_file file = {
+        .index = idx,
+        .fd = event_notifier_get_fd(&svq->kick_notifier),
+    };
+    int r;
+
+    /* Check that notifications are still going directly to vhost dev */
+    assert(virtio_queue_is_host_notifier_enabled(svq->vq));
+
+    /*
+     * event_notifier_set_handler already checks for guest's notifications if
+     * they arrive in the switch, so there is no need to explicitly check for
+     * them.
+     */
+    event_notifier_init_fd(&svq->host_notifier,
+                           event_notifier_get_fd(vq_host_notifier));
+    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);
+
+    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
+    if (unlikely(r != 0)) {
+        error_report("Couldn't set kick fd: %s", strerror(errno));
+        goto err_set_vring_kick;
+    }
+
+    return true;
+
+err_set_vring_kick:
+    event_notifier_set_handler(&svq->host_notifier, NULL);
+
+    return false;
+}
+
+/*
+ * Stop shadow virtqueue operation.
+ * @dev vhost device
+ * @idx vhost queue index
+ * @svq Shadow Virtqueue
+ */
+void vhost_shadow_vq_stop(struct vhost_dev *dev,
+                          unsigned idx,
+                          VhostShadowVirtqueue *svq)
+{
+    int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
+    if (unlikely(r < 0)) {
+        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
+    }
+
+    event_notifier_set_handler(&svq->host_notifier, NULL);
+}
+
 /*
 * Creates the vhost shadow virtqueue, and instructs the vhost device to use
 * the shadow methods and file descriptors.
  */
 VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
 {
+    int vq_idx = dev->vq_index + idx;
     g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
     int r;
 
@@ -43,6 +153,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
         goto err_init_call_notifier;
     }
 
+    svq->vq = virtio_get_queue(dev->vdev, vq_idx);
     return g_steal_pointer(&svq);
 
 err_init_call_notifier:
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index c4c1f80661..84091b5251 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -25,6 +25,7 @@
 #include "exec/address-spaces.h"
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
+#include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file-types.h"
 #include "sysemu/dma.h"
@@ -1219,6 +1220,74 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
                        0, virtio_queue_get_desc_size(vdev, idx));
 }
 
+static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
+{
+    int idx;
+
+    dev->shadow_vqs_enabled = false;
+
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
+        vhost_shadow_vq_free(dev->shadow_vqs[idx]);
+    }
+
+    g_free(dev->shadow_vqs);
+    dev->shadow_vqs = NULL;
+    return 0;
+}
+
+static int vhost_sw_live_migration_start(struct vhost_dev *dev)
+{
+    int idx, stop_idx;
+
+    dev->shadow_vqs = g_new0(VhostShadowVirtqueue *, dev->nvqs);
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        dev->shadow_vqs[idx] = vhost_shadow_vq_new(dev, idx);
+        if (unlikely(dev->shadow_vqs[idx] == NULL)) {
+            goto err_new;
+        }
+    }
+
+    dev->shadow_vqs_enabled = true;
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
+        if (unlikely(!ok)) {
+            goto err_start;
+        }
+    }
+
+    return 0;
+
+err_start:
+    dev->shadow_vqs_enabled = false;
+    for (stop_idx = 0; stop_idx < idx; stop_idx++) {
+        vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
+    }
+
+err_new:
+    for (idx = 0; idx < dev->nvqs; ++idx) {
+        vhost_shadow_vq_free(dev->shadow_vqs[idx]);
+    }
+    g_free(dev->shadow_vqs);
+
+    return -1;
+}
+
+static int vhost_sw_live_migration_enable(struct vhost_dev *dev,
+                                          bool enable_lm)
+{
+    int r;
+
+    if (enable_lm == dev->shadow_vqs_enabled) {
+        return 0;
+    }
+
+    r = enable_lm ? vhost_sw_live_migration_start(dev)
+                  : vhost_sw_live_migration_stop(dev);
+
+    return r;
+}
+
 static void vhost_eventfd_add(MemoryListener *listener,
                               MemoryRegionSection *section,
                               bool match_data, uint64_t data, EventNotifier *e)
@@ -1381,6 +1450,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     hdev->log = NULL;
     hdev->log_size = 0;
     hdev->log_enabled = false;
+    hdev->shadow_vqs_enabled = false;
     hdev->started = false;
     memory_listener_register(&hdev->memory_listener, &address_space_memory);
     QLIST_INSERT_HEAD(&vhost_devices, hdev, entry);
@@ -1484,6 +1554,10 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
     int i, r;
 
+    if (hdev->shadow_vqs_enabled) {
+        vhost_sw_live_migration_enable(hdev, false);
+    }
+
     for (i = 0; i < hdev->nvqs; ++i) {
         r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), hdev->vq_index + i,
                                          false);
@@ -1798,6 +1872,7 @@ fail_features:
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
 {
     int i;
+    bool is_shadow_vqs_enabled = hdev->shadow_vqs_enabled;
 
     /* should only be called after backend is connected */
     assert(hdev->vhost_ops);
@@ -1805,7 +1880,16 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     if (hdev->vhost_ops->vhost_dev_start) {
         hdev->vhost_ops->vhost_dev_start(hdev, false);
     }
+    if (is_shadow_vqs_enabled) {
+        /* Shadow virtqueue will be stopped */
+        hdev->shadow_vqs_enabled = false;
+    }
     for (i = 0; i < hdev->nvqs; ++i) {
+        if (is_shadow_vqs_enabled) {
+            vhost_shadow_vq_stop(hdev, i, hdev->shadow_vqs[i]);
+            vhost_shadow_vq_free(hdev->shadow_vqs[i]);
+        }
+
         vhost_virtqueue_stop(hdev,
                              vdev,
                              hdev->vqs + i,
@@ -1819,6 +1903,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
         memory_listener_unregister(&hdev->iommu_listener);
     }
     vhost_log_put(hdev, true);
+    g_free(hdev->shadow_vqs);
+    hdev->shadow_vqs_enabled = false;
     hdev->started = false;
     hdev->vdev = NULL;
 }
@@ -1835,5 +1921,60 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
 
 void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
 {
-    error_setg(errp, "Shadow virtqueue still not implemented");
+    struct vhost_dev *hdev, *hdev_err;
+    VirtIODevice *vdev;
+    const char *err_cause = NULL;
+    int r;
+    ErrorClass err_class = ERROR_CLASS_GENERIC_ERROR;
+
+    QLIST_FOREACH(hdev, &vhost_devices, entry) {
+        if (hdev->vdev && 0 == strcmp(hdev->vdev->name, name)) {
+            vdev = hdev->vdev;
+            break;
+        }
+    }
+
+    if (!hdev) {
+        err_class = ERROR_CLASS_DEVICE_NOT_FOUND;
+        err_cause = "Device not found";
+        goto not_found_err;
+    }
+
+    for ( ; hdev; hdev = QLIST_NEXT(hdev, entry)) {
+        if (vdev != hdev->vdev) {
+            continue;
+        }
+
+        if (!hdev->started) {
+            err_cause = "Device is not started";
+            goto err;
+        }
+
+        r = vhost_sw_live_migration_enable(hdev, enable);
+        if (unlikely(r)) {
+            err_cause = "Error enabling (see monitor)";
+            goto err;
+        }
+    }
+
+    return;
+
+err:
+    QLIST_FOREACH(hdev_err, &vhost_devices, entry) {
+        if (hdev_err == hdev) {
+            break;
+        }
+
+        if (vdev != hdev->vdev) {
+            continue;
+        }
+
+        vhost_sw_live_migration_enable(hdev, !enable);
+    }
+
+not_found_err:
+    if (err_cause) {
+        error_set(errp, err_class,
+                  "Can't enable shadow vq on %s: %s", name, err_cause);
+    }
 }
-- 
2.27.0




* [RFC v3 08/29] vhost: Route host->guest notification through shadow virtqueue
@ 2021-05-19 16:28 ` Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |  3 +
 include/hw/virtio/vhost.h          |  1 +
 hw/virtio/vhost-shadow-virtqueue.c | 95 ++++++++++++++++++++++++++++++
 hw/virtio/vhost.c                  | 15 +++++
 4 files changed, 114 insertions(+)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index c891c6510d..2ca4b92b12 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -17,6 +17,9 @@
 
 typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
+void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked);
+void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq);
+
 bool vhost_shadow_vq_start(struct vhost_dev *dev,
                            unsigned idx,
                            VhostShadowVirtqueue *svq);
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 7ffdf9aea0..67cedf83da 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -28,6 +28,7 @@ struct vhost_virtqueue {
     unsigned avail_size;
     unsigned long long used_phys;
     unsigned used_size;
+    /* Access/writing to notifier_is_masked is protected by BQL */
     bool notifier_is_masked;
     EventNotifier masked_notifier;
     struct vhost_dev *dev;
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 3e43399e9c..7d76e271a5 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -32,8 +32,16 @@ typedef struct VhostShadowVirtqueue {
      */
     EventNotifier host_notifier;
 
+    /* (Possible) masked notifier */
+    struct {
+        EventNotifier *n;
+    } masked_notifier;
+
     /* Virtio queue shadowing */
     VirtQueue *vq;
+
+    /* Virtio device */
+    VirtIODevice *vdev;
 } VhostShadowVirtqueue;
 
 /* Forward guest notifications */
@@ -49,6 +57,58 @@ static void vhost_handle_guest_kick(EventNotifier *n)
     event_notifier_set(&svq->kick_notifier);
 }
 
+/* Forward vhost notifications */
+static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
+{
+    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
+                                             call_notifier);
+    EventNotifier *masked_notifier;
+
+    masked_notifier = svq->masked_notifier.n;
+
+    if (!masked_notifier) {
+        unsigned n = virtio_get_queue_index(svq->vq);
+        virtio_queue_invalidate_signalled_used(svq->vdev, n);
+        virtio_notify_irqfd(svq->vdev, svq->vq);
+    } else {
+        event_notifier_set(svq->masked_notifier.n);
+    }
+}
+
+static void vhost_shadow_vq_handle_call(EventNotifier *n)
+{
+    if (likely(event_notifier_test_and_clear(n))) {
+        vhost_shadow_vq_handle_call_no_test(n);
+    }
+}
+
+/*
+ * Mask the shadow virtqueue.
+ *
+ * It can be called from a guest masking vmexit or shadow virtqueue start
+ * through QMP.
+ *
+ * @vq Shadow virtqueue
+ * @masked Masked notifier to signal instead of guest
+ */
+void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked)
+{
+    svq->masked_notifier.n = masked;
+}
+
+/*
+ * Unmask the shadow virtqueue.
+ *
+ * It can be called from a guest unmasking vmexit or shadow virtqueue start
+ * through QMP.
+ *
+ * @vq Shadow virtqueue
+ */
+void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq)
+{
+    svq->masked_notifier.n = NULL;
+}
+
 /*
  * Restore the vhost guest to host notifier, i.e., disables svq effect.
  */
@@ -103,8 +163,33 @@ bool vhost_shadow_vq_start(struct vhost_dev *dev,
         goto err_set_vring_kick;
     }
 
+    /* Set vhost call */
+    file.fd = event_notifier_get_fd(&svq->call_notifier),
+    r = dev->vhost_ops->vhost_set_vring_call(dev, &file);
+    if (unlikely(r != 0)) {
+        error_report("Couldn't set call fd: %s", strerror(errno));
+        goto err_set_vring_call;
+    }
+
+    /* Set shadow vq -> guest notifier */
+    assert(dev->shadow_vqs_enabled);
+    vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
+                         dev->vqs[idx].notifier_is_masked);
+
+    if (dev->vqs[idx].notifier_is_masked &&
+               event_notifier_test_and_clear(&dev->vqs[idx].masked_notifier)) {
+        /* Check for pending notifications from the device */
+        vhost_shadow_vq_handle_call_no_test(&svq->call_notifier);
+    }
+
     return true;
 
+err_set_vring_call:
+    r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
+    if (unlikely(r < 0)) {
+        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
+    }
+
 err_set_vring_kick:
     event_notifier_set_handler(&svq->host_notifier, NULL);
 
@@ -126,7 +211,13 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
         error_report("Couldn't restore vq kick fd: %s", strerror(-r));
     }
 
+    assert(!dev->shadow_vqs_enabled);
+
     event_notifier_set_handler(&svq->host_notifier, NULL);
+
+    /* Restore vhost call */
+    vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
+                         dev->vqs[idx].notifier_is_masked);
 }
 
 /*
@@ -154,6 +245,9 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
     }
 
     svq->vq = virtio_get_queue(dev->vdev, vq_idx);
+    svq->vdev = dev->vdev;
+    event_notifier_set_handler(&svq->call_notifier,
+                               vhost_shadow_vq_handle_call);
     return g_steal_pointer(&svq);
 
 err_init_call_notifier:
@@ -169,6 +263,7 @@ err_init_kick_notifier:
 void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
 {
     event_notifier_cleanup(&vq->kick_notifier);
+    event_notifier_set_handler(&vq->call_notifier, NULL);
     event_notifier_cleanup(&vq->call_notifier);
     g_free(vq);
 }
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 84091b5251..9c9c63345b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1591,6 +1591,21 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, VirtIODevice *vdev, int n,
     /* should only be called after backend is connected */
     assert(hdev->vhost_ops);
 
+    if (hdev->shadow_vqs_enabled) {
+        if (mask) {
+            vhost_shadow_vq_mask(hdev->shadow_vqs[index],
+                                 &hdev->vqs[index].masked_notifier);
+        } else {
+            vhost_shadow_vq_unmask(hdev->shadow_vqs[index]);
+        }
+
+        /*
+         * Vhost call fd must remain the same since shadow vq is not polling
+         * for changes
+         */
+        return;
+    }
+
     if (mask) {
         assert(vdev->use_guest_notifier_mask);
         file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
-- 
2.27.0




* [RFC v3 09/29] vhost: Avoid re-set masked notifier in shadow vq
@ 2021-05-19 16:28 ` Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Since all of the shadow virtqueue's device side is done in software, we
can avoid the write syscall.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 7d76e271a5..c22acb4605 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -35,6 +35,9 @@ typedef struct VhostShadowVirtqueue {
     /* (Possible) masked notifier */
     struct {
         EventNotifier *n;
+
+        /* Avoid re-sending signals */
+        bool signaled;
     } masked_notifier;
 
     /* Virtio queue shadowing */
@@ -70,7 +73,8 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
         unsigned n = virtio_get_queue_index(svq->vq);
         virtio_queue_invalidate_signalled_used(svq->vdev, n);
         virtio_notify_irqfd(svq->vdev, svq->vq);
-    } else {
+    } else if (!svq->masked_notifier.signaled) {
+        svq->masked_notifier.signaled = true;
         event_notifier_set(svq->masked_notifier.n);
     }
 }
@@ -93,6 +97,7 @@ static void vhost_shadow_vq_handle_call(EventNotifier *n)
  */
 void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked)
 {
+    svq->masked_notifier.signaled = false;
     svq->masked_notifier.n = masked;
 }
 
-- 
2.27.0




* [RFC v3 10/29] virtio: Add vhost_shadow_vq_get_vring_addr
@ 2021-05-19 16:28 ` Eugenio Pérez
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

It reports the shadow virtqueue addresses in the qemu virtual address
space.
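
Later patches hand these qemu-VA addresses to the vhost backend; a
caller sketch:

    struct vhost_vring_addr addr;

    vhost_shadow_vq_get_vring_addr(svq, &addr);
    /* addr.desc_user_addr, addr.avail_user_addr and addr.used_user_addr
     * now hold the shadow rings' qemu virtual addresses */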

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |  4 +++
 hw/virtio/vhost-shadow-virtqueue.c | 46 ++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 2ca4b92b12..725091bc97 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -19,6 +19,10 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
 void vhost_shadow_vq_mask(VhostShadowVirtqueue *svq, EventNotifier *masked);
 void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq);
+void vhost_shadow_vq_get_vring_addr(const VhostShadowVirtqueue *svq,
+                                    struct vhost_vring_addr *addr);
+size_t vhost_shadow_vq_driver_area_size(const VhostShadowVirtqueue *svq);
+size_t vhost_shadow_vq_device_area_size(const VhostShadowVirtqueue *svq);
 
 bool vhost_shadow_vq_start(struct vhost_dev *dev,
                            unsigned idx,
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index c22acb4605..ff50f12410 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -17,6 +17,9 @@
 
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
+    /* Shadow vring */
+    struct vring vring;
+
     /* Shadow kick notifier, sent to vhost */
     EventNotifier kick_notifier;
     /* Shadow call notifier, sent to vhost */
@@ -114,6 +117,35 @@ void vhost_shadow_vq_unmask(VhostShadowVirtqueue *svq)
     svq->masked_notifier.n = NULL;
 }
 
+/*
+ * Get the shadow vq vring address.
+ * @svq Shadow virtqueue
+ * @addr Destination to store address
+ */
+void vhost_shadow_vq_get_vring_addr(const VhostShadowVirtqueue *svq,
+                                    struct vhost_vring_addr *addr)
+{
+    addr->desc_user_addr = (uint64_t)svq->vring.desc;
+    addr->avail_user_addr = (uint64_t)svq->vring.avail;
+    addr->used_user_addr = (uint64_t)svq->vring.used;
+}
+
+size_t vhost_shadow_vq_driver_area_size(const VhostShadowVirtqueue *svq)
+{
+    uint16_t vq_idx = virtio_get_queue_index(svq->vq);
+    size_t desc_size = virtio_queue_get_desc_size(svq->vdev, vq_idx);
+    size_t avail_size = virtio_queue_get_avail_size(svq->vdev, vq_idx);
+
+    return ROUND_UP(desc_size + avail_size, qemu_real_host_page_size);
+}
+
+size_t vhost_shadow_vq_device_area_size(const VhostShadowVirtqueue *svq)
+{
+    uint16_t vq_idx = virtio_get_queue_index(svq->vq);
+    size_t used_size = virtio_queue_get_used_size(svq->vdev, vq_idx);
+    return ROUND_UP(used_size, qemu_real_host_page_size);
+}
+
 /*
  * Restore the vhost guest to host notifier, i.e., disables svq effect.
  */
@@ -232,6 +264,10 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
 VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
 {
     int vq_idx = dev->vq_index + idx;
+    unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
+    size_t desc_size = virtio_queue_get_desc_size(dev->vdev, vq_idx);
+    size_t driver_size;
+    size_t device_size;
     g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
     int r;
 
@@ -251,6 +287,14 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
 
     svq->vq = virtio_get_queue(dev->vdev, vq_idx);
     svq->vdev = dev->vdev;
+    driver_size = vhost_shadow_vq_driver_area_size(svq);
+    device_size = vhost_shadow_vq_device_area_size(svq);
+    svq->vring.num = num;
+    svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
+    svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
+    memset(svq->vring.desc, 0, driver_size);
+    svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
+    memset(svq->vring.used, 0, device_size);
     event_notifier_set_handler(&svq->call_notifier,
                                vhost_shadow_vq_handle_call);
     return g_steal_pointer(&svq);
@@ -270,5 +314,7 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
     event_notifier_cleanup(&vq->kick_notifier);
     event_notifier_set_handler(&vq->call_notifier, NULL);
     event_notifier_cleanup(&vq->call_notifier);
+    qemu_vfree(vq->vring.desc);
+    qemu_vfree(vq->vring.used);
     g_free(vq);
 }
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 11/29] vhost: Add vhost_vring_pause operation
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (9 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 10/29] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-19 16:28 ` [RFC v3 12/29] vhost: add vhost_kernel_vring_pause Eugenio Pérez
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

With this operation a device can be paused by a backend, allowing the latter
to query the device's status without the risk of the device overriding it.
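
As a hypothetical usage sketch (mirroring how later patches in this series
use the new VhostOps member), a caller would pause before reading a stable
vring state:

    /* Pause so the device stops consuming the vring */
    if (dev->vhost_ops->vhost_vring_pause) {
        int r = dev->vhost_ops->vhost_vring_pause(dev);
        assert(r == 0);
    }
    /* From here, vhost_get_vring_base() returns an index that cannot move */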

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-backend.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index 8a6f8e2a7a..94d3323905 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -125,6 +125,8 @@ typedef int (*vhost_get_device_id_op)(struct vhost_dev *dev, uint32_t *dev_id);
 
 typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
 
+typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -169,6 +171,7 @@ typedef struct VhostOps {
     vhost_dev_start_op vhost_dev_start;
     vhost_vq_get_addr_op  vhost_vq_get_addr;
     vhost_get_device_id_op vhost_get_device_id;
+    vhost_vring_pause_op vhost_vring_pause;
     vhost_force_iommu_op vhost_force_iommu;
 } VhostOps;
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 12/29] vhost: add vhost_kernel_vring_pause
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (10 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 11/29] vhost: Add vhost_vring_pause operation Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-19 16:28 ` [RFC v3 13/29] vhost: Add vhost_get_iova_range operation Eugenio Pérez
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

This is just a commit to allow testing with vhost-net; it is not intended
for the final version or for any other device.

vhost_kernel_vring_pause stops the device, so qemu can ask for its status
(the next available idx the device was going to consume) and replace the
vring addresses. When SVQ starts, it can resume consuming the guest's
driver ring without the guest noticing. Not stopping the device before the
swap could mean it processes more buffers than reported, which would
duplicate the device's actions.

Mimicking vhost-vdpa behavior, vhost_kernel_start is intended to resume the
device. In vhost-vdpa it performs a full reset; since this is a temporary
commit to allow testing with vhost-net, here it just sets a new backend,
which is enough for vhost-net to notice the new vring addresses.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-backend.c | 42 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index 31b33bde37..9653b7fddb 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -201,6 +201,46 @@ static int vhost_kernel_get_vq_index(struct vhost_dev *dev, int idx)
     return idx - dev->vq_index;
 }
 
+static int vhost_kernel_set_vq_pause(struct vhost_dev *dev, unsigned idx,
+                                     bool pause)
+{
+    struct vhost_vring_file file = {
+        .index = idx,
+    };
+
+    if (pause) {
+        file.fd = -1; /* Pass -1 to unbind from file. */
+    } else {
+        struct vhost_net *vn_dev = container_of(dev, struct vhost_net, dev);
+        file.fd = vn_dev->backend;
+    }
+
+    return vhost_kernel_net_set_backend(dev, &file);
+}
+
+static int vhost_kernel_vring_pause(struct vhost_dev *dev)
+{
+    int i;
+
+    for (i = 0; i < dev->nvqs; ++i) {
+        vhost_kernel_set_vq_pause(dev, i, true);
+    }
+
+    return 0;
+}
+
+static int vhost_kernel_start(struct vhost_dev *dev, bool start)
+{
+    int i;
+
+    assert(start);
+    for (i = 0; i < dev->nvqs; ++i) {
+        vhost_kernel_set_vq_pause(dev, i, false);
+    }
+
+    return 0;
+}
+
 #ifdef CONFIG_VHOST_VSOCK
 static int vhost_kernel_vsock_set_guest_cid(struct vhost_dev *dev,
                                             uint64_t guest_cid)
@@ -317,6 +357,8 @@ static const VhostOps kernel_ops = {
         .vhost_set_owner = vhost_kernel_set_owner,
         .vhost_reset_device = vhost_kernel_reset_device,
         .vhost_get_vq_index = vhost_kernel_get_vq_index,
+        .vhost_dev_start = vhost_kernel_start,
+        .vhost_vring_pause = vhost_kernel_vring_pause,
 #ifdef CONFIG_VHOST_VSOCK
         .vhost_vsock_set_guest_cid = vhost_kernel_vsock_set_guest_cid,
         .vhost_vsock_set_running = vhost_kernel_vsock_set_running,
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 13/29] vhost: Add vhost_get_iova_range operation
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (11 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 12/29] vhost: add vhost_kernel_vring_pause Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-26  1:14   ` Jason Wang
  2021-05-19 16:28 ` [RFC v3 14/29] vhost: add vhost_has_limited_iova_range Eugenio Pérez
                   ` (17 subsequent siblings)
  30 siblings, 1 reply; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

For simplicity, if a device does not support this operation, it is assumed
to be able to handle the whole (uint64_t)-1 iova address range.
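
For reference, this relies on the kernel's vhost-vdpa UAPI range struct
(from linux/vhost_types.h at the time of writing); shown here only for
context, it is not added by this patch:

    struct vhost_vdpa_iova_range {
        __u64 first; /* First address that can be mapped by vhost-vDPA */
        __u64 last;  /* Last address that can be mapped by vhost-vDPA */
    };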

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-backend.h |  5 +++++
 hw/virtio/vhost-vdpa.c            | 18 ++++++++++++++++++
 hw/virtio/trace-events            |  1 +
 3 files changed, 24 insertions(+)

diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index 94d3323905..bcb112c166 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -36,6 +36,7 @@ struct vhost_vring_addr;
 struct vhost_scsi_target;
 struct vhost_iotlb_msg;
 struct vhost_virtqueue;
+struct vhost_vdpa_iova_range;
 
 typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque);
 typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
@@ -127,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
 
 typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
 
+typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
+                                    hwaddr *first, hwaddr *last);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -173,6 +177,7 @@ typedef struct VhostOps {
     vhost_get_device_id_op vhost_get_device_id;
     vhost_vring_pause_op vhost_vring_pause;
     vhost_force_iommu_op vhost_force_iommu;
+    vhost_get_iova_range vhost_get_iova_range;
 } VhostOps;
 
 extern const VhostOps user_ops;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 01d2101d09..74fe92935e 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -579,6 +579,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
     return true;
 }
 
+static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
+                                     hwaddr *first, hwaddr *last)
+{
+    int ret;
+    struct vhost_vdpa_iova_range range;
+
+    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
+    if (ret != 0) {
+        return ret;
+    }
+
+    *first = range.first;
+    *last = range.last;
+    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
+    return ret;
+}
+
 const VhostOps vdpa_ops = {
         .backend_type = VHOST_BACKEND_TYPE_VDPA,
         .vhost_backend_init = vhost_vdpa_init,
@@ -611,4 +628,5 @@ const VhostOps vdpa_ops = {
         .vhost_get_device_id = vhost_vdpa_get_device_id,
         .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
         .vhost_force_iommu = vhost_vdpa_force_iommu,
+        .vhost_get_iova_range = vhost_vdpa_get_iova_range,
 };
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index c62727f879..5debe3a681 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
 vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
 vhost_vdpa_set_owner(void *dev) "dev: %p"
 vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
+vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
 
 # virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 14/29] vhost: add vhost_has_limited_iova_range
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (12 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 13/29] vhost: Add vhost_get_iova_range operation Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-19 16:28 ` [RFC v3 15/29] vhost: Add enable_custom_iommu to VhostOps Eugenio Pérez
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost.h |  5 +++++
 hw/virtio/vhost.c         | 17 +++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 67cedf83da..c97a4c0017 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -88,6 +88,10 @@ struct vhost_dev {
     bool log_enabled;
     bool shadow_vqs_enabled;
     uint64_t log_size;
+    struct {
+        hwaddr first;
+        hwaddr last;
+    } iova_range;
     VhostShadowVirtqueue **shadow_vqs;
     Error *migration_blocker;
     const VhostOps *vhost_ops;
@@ -129,6 +133,7 @@ uint64_t vhost_get_features(struct vhost_dev *hdev, const int *feature_bits,
 void vhost_ack_features(struct vhost_dev *hdev, const int *feature_bits,
                         uint64_t features);
 bool vhost_has_free_slot(void);
+bool vhost_has_limited_iova_range(const struct vhost_dev *hdev);
 
 int vhost_net_set_backend(struct vhost_dev *hdev,
                           struct vhost_vring_file *file);
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 9c9c63345b..333877ca3b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1386,6 +1386,18 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
         goto fail;
     }
 
+    if (hdev->vhost_ops->vhost_get_iova_range) {
+        r = hdev->vhost_ops->vhost_get_iova_range(hdev,
+                                                 &hdev->iova_range.first,
+                                                 &hdev->iova_range.last);
+        if (unlikely(r != 0)) {
+            error_report("Can't request IOVA range");
+            goto fail;
+        }
+    } else {
+        hdev->iova_range.last = (hwaddr)-1;
+    }
+
     for (i = 0; i < hdev->nvqs; ++i, ++n_initialized_vqs) {
         r = vhost_virtqueue_init(hdev, hdev->vqs + i, hdev->vq_index + i);
         if (r < 0) {
@@ -1622,6 +1634,11 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, VirtIODevice *vdev, int n,
     }
 }
 
+bool vhost_has_limited_iova_range(const struct vhost_dev *hdev)
+{
+    return hdev->iova_range.first || hdev->iova_range.last != HWADDR_MAX;
+}
+
 uint64_t vhost_get_features(struct vhost_dev *hdev, const int *feature_bits,
                             uint64_t features)
 {
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 15/29] vhost: Add enable_custom_iommu to VhostOps
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (13 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 14/29] vhost: add vhost_has_limited_iova_range Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-31  9:01   ` Jason Wang
  2021-05-19 16:28 ` [RFC v3 16/29] vhost-vdpa: Add vhost_vdpa_enable_custom_iommu Eugenio Pérez
                   ` (15 subsequent siblings)
  30 siblings, 1 reply; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

This operation enables the backend-specific IOTLB entries.

If a backend supports this, it starts managing its own entries, and vhost
can disable that through this operation and recover control.

Every enable/disable operation must also clear all IOTLB device entries.

At the moment, the only backend that does so is vhost-vdpa. To fully
support this, vdpa also needs to expose a way for the vhost subsystem to
map and unmap entries. This will be done in future commits.
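
As a hypothetical usage sketch, SVQ code could hand the IOTLB back and
forth like this (mirroring how later commits in this series use the op):

    if (dev->vhost_ops->vhost_enable_custom_iommu) {
        /* Disable: backend entries are cleared, vhost regains control */
        dev->vhost_ops->vhost_enable_custom_iommu(dev, false);
    }
    /* ... run with vhost-managed (shadow virtqueue) mappings ... */
    if (dev->vhost_ops->vhost_enable_custom_iommu) {
        /* Enable: the backend manages its own entries again */
        dev->vhost_ops->vhost_enable_custom_iommu(dev, true);
    }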

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-backend.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index bcb112c166..f8eed2ace5 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -128,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
 
 typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
 
+typedef int (*vhost_enable_custom_iommu_op)(struct vhost_dev *dev,
+                                            bool enable);
+
 typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
                                     hwaddr *first, hwaddr *last);
 
@@ -177,6 +180,7 @@ typedef struct VhostOps {
     vhost_get_device_id_op vhost_get_device_id;
     vhost_vring_pause_op vhost_vring_pause;
     vhost_force_iommu_op vhost_force_iommu;
+    vhost_enable_custom_iommu_op vhost_enable_custom_iommu;
     vhost_get_iova_range vhost_get_iova_range;
 } VhostOps;
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 16/29] vhost-vdpa: Add vhost_vdpa_enable_custom_iommu
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (14 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 15/29] vhost: Add enable_custom_iommu to VhostOps Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-19 16:28 ` [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Implementation of vhost_ops->vhost_enable_custom_iommu.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 74fe92935e..9e7a0ce5e0 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -272,6 +272,29 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
     vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
 }
 
+static int vhost_vdpa_enable_custom_iommu(struct vhost_dev *dev, bool enable)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    hwaddr iova_range_last = dev->iova_range.last;
+    if (iova_range_last != (hwaddr)-1) {
+        iova_range_last++;
+    }
+
+    if (enable) {
+        int r = vhost_vdpa_dma_unmap(v, dev->iova_range.first, iova_range_last);
+        if (r != 0) {
+            error_report("Fail to invalidate device iotlb");
+        }
+
+        memory_listener_register(&v->listener, &address_space_memory);
+    } else {
+        memory_listener_unregister(&v->listener);
+        return vhost_vdpa_dma_unmap(v, dev->iova_range.first, iova_range_last);
+    }
+
+    return 0;
+}
+
 static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque)
 {
     struct vhost_vdpa *v;
@@ -299,7 +322,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
     assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);
     v = dev->opaque;
     trace_vhost_vdpa_cleanup(dev, v);
-    memory_listener_unregister(&v->listener);
+    vhost_vdpa_enable_custom_iommu(dev, false);
 
     dev->opaque = NULL;
     return 0;
@@ -470,11 +493,10 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, uint8_t *config,
 
 static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
 {
-    struct vhost_vdpa *v = dev->opaque;
     trace_vhost_vdpa_dev_start(dev, started);
     if (started) {
         uint8_t status = 0;
-        memory_listener_register(&v->listener, &address_space_memory);
+        vhost_vdpa_enable_custom_iommu(dev, true);
         vhost_vdpa_set_vring_ready(dev);
         vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
         vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
@@ -484,7 +506,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
         vhost_vdpa_reset_device(dev);
         vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
                                    VIRTIO_CONFIG_S_DRIVER);
-        memory_listener_unregister(&v->listener);
+        vhost_vdpa_enable_custom_iommu(dev, false);
 
         return 0;
     }
@@ -628,5 +650,6 @@ const VhostOps vdpa_ops = {
         .vhost_get_device_id = vhost_vdpa_get_device_id,
         .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
         .vhost_force_iommu = vhost_vdpa_force_iommu,
+        .vhost_enable_custom_iommu = vhost_vdpa_enable_custom_iommu,
         .vhost_get_iova_range = vhost_vdpa_get_iova_range,
 };
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (15 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 16/29] vhost-vdpa: Add vhost_vdpa_enable_custom_iommu Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-06-02  9:50   ` Jason Wang
  2021-05-19 16:28 ` [RFC v3 18/29] vhost: Use vhost_enable_custom_iommu to unmap everything if available Eugenio Pérez
                   ` (13 subsequent siblings)
  30 siblings, 1 reply; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Initial version of shadow virtqueue that actually forwards buffers. The
exposed addresses are qemu's virtual addresses, so devices with an IOMMU
that does not allow full mapping of qemu's address space do not work at
the moment.

Also, for simplicity it only supports modern devices, which expect the
vring in little endian, with a split ring and no event idx or indirect
descriptors.

It reuses the VirtQueue code for the device part. The driver part is
based on Linux's virtio_ring driver, but with stripped functionality
and optimizations so it's easier to review.

Later commits will solve some of these concerns.

The code also needs to map the used ring (device part) as RW in, and only
in, vhost-net. Mapping it (or calling vhost_device_iotlb_miss)
unconditionally would print an error for vhost devices that manage their
own mappings (vdpa).

To know if this call is needed, vhost_sw_live_migration_start_vq and
vhost_sw_live_migration_stop copy the test performed in vhost_dev_start.
Testing for the actual backend type could be cleaner, as could checking
for a non-NULL vhost_force_iommu, enable_custom_iommu, or another VhostOp.
We could also extract this test into its own function, so its name could
give a better hint. For now, just copy the vhost_dev_start check.
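
In outline, the forwarding path added below works like this (a simplified
sketch of the code in this patch, with error handling and notification
suppression omitted):

    VirtQueueElement *elem;
    unsigned i = 0;

    /* Guest -> device (vhost_handle_guest_kick) */
    while ((elem = virtqueue_pop(svq->vq, sizeof(*elem)))) {
        vhost_shadow_vq_add(svq, elem);          /* expose sgs in shadow vring */
        event_notifier_set(&svq->kick_notifier); /* kick the real device */
    }

    /* Device -> guest (vhost_shadow_vq_handle_call) */
    while ((elem = vhost_shadow_vq_get_buf(svq))) {
        virtqueue_fill(svq->vq, elem, elem->len, i++);
    }
    virtqueue_flush(svq->vq, i);
    virtio_notify_irqfd(svq->vdev, svq->vq);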

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 205 +++++++++++++++++++++++++++--
 hw/virtio/vhost.c                  | 134 ++++++++++++++++++-
 2 files changed, 325 insertions(+), 14 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index ff50f12410..6d767fe248 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -9,6 +9,7 @@
 
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/vhost.h"
+#include "hw/virtio/virtio-access.h"
 
 #include "standard-headers/linux/vhost_types.h"
 
@@ -48,9 +49,93 @@ typedef struct VhostShadowVirtqueue {
 
     /* Virtio device */
     VirtIODevice *vdev;
+
+    /* Map for returning guest's descriptors */
+    VirtQueueElement **ring_id_maps;
+
+    /* Next head to expose to device */
+    uint16_t avail_idx_shadow;
+
+    /* Next free descriptor */
+    uint16_t free_head;
+
+    /* Last seen used idx */
+    uint16_t shadow_used_idx;
+
+    /* Next head to consume from device */
+    uint16_t used_idx;
 } VhostShadowVirtqueue;
 
-/* Forward guest notifications */
+static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
+                                    const struct iovec *iovec,
+                                    size_t num, bool more_descs, bool write)
+{
+    uint16_t i = svq->free_head, last = svq->free_head;
+    unsigned n;
+    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
+    vring_desc_t *descs = svq->vring.desc;
+
+    if (num == 0) {
+        return;
+    }
+
+    for (n = 0; n < num; n++) {
+        if (more_descs || (n + 1 < num)) {
+            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
+        } else {
+            descs[i].flags = flags;
+        }
+        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
+        descs[i].len = cpu_to_le32(iovec[n].iov_len);
+
+        last = i;
+        i = le16_to_cpu(descs[i].next);
+    }
+
+    svq->free_head = le16_to_cpu(descs[last].next);
+}
+
+static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
+                                          VirtQueueElement *elem)
+{
+    int head;
+    unsigned avail_idx;
+    vring_avail_t *avail = svq->vring.avail;
+
+    head = svq->free_head;
+
+    /* We need some descriptors here */
+    assert(elem->out_num || elem->in_num);
+
+    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
+                            elem->in_num > 0, false);
+    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
+
+    /*
+     * Put entry in available array (but don't update avail->idx until they
+     * do sync).
+     */
+    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
+    avail->ring[avail_idx] = cpu_to_le16(head);
+    svq->avail_idx_shadow++;
+
+    /* Expose descriptors to device */
+    smp_wmb();
+    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
+
+    return head;
+
+}
+
+static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
+                                VirtQueueElement *elem)
+{
+    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
+
+    svq->ring_id_maps[qemu_head] = elem;
+}
+
+/* Handle guest->device notifications */
 static void vhost_handle_guest_kick(EventNotifier *n)
 {
     VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
@@ -60,7 +145,67 @@ static void vhost_handle_guest_kick(EventNotifier *n)
         return;
     }
 
-    event_notifier_set(&svq->kick_notifier);
+    /* Make available as many buffers as possible */
+    do {
+        if (virtio_queue_get_notification(svq->vq)) {
+            /* No more notifications until process all available */
+            virtio_queue_set_notification(svq->vq, false);
+        }
+
+        while (true) {
+            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
+            if (!elem) {
+                break;
+            }
+
+            vhost_shadow_vq_add(svq, elem);
+            event_notifier_set(&svq->kick_notifier);
+        }
+
+        virtio_queue_set_notification(svq->vq, true);
+    } while (!virtio_queue_empty(svq->vq));
+}
+
+static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
+{
+    if (svq->used_idx != svq->shadow_used_idx) {
+        return true;
+    }
+
+    /* Get used idx must not be reordered */
+    smp_rmb();
+    svq->shadow_used_idx = le16_to_cpu(svq->vring.used->idx);
+
+    return svq->used_idx != svq->shadow_used_idx;
+}
+
+static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
+{
+    vring_desc_t *descs = svq->vring.desc;
+    const vring_used_t *used = svq->vring.used;
+    vring_used_elem_t used_elem;
+    uint16_t last_used;
+
+    if (!vhost_shadow_vq_more_used(svq)) {
+        return NULL;
+    }
+
+    last_used = svq->used_idx & (svq->vring.num - 1);
+    used_elem.id = le32_to_cpu(used->ring[last_used].id);
+    used_elem.len = le32_to_cpu(used->ring[last_used].len);
+
+    if (unlikely(used_elem.id >= svq->vring.num)) {
+        error_report("Device %s says index %u is used", svq->vdev->name,
+                     used_elem.id);
+        return NULL;
+    }
+
+    descs[used_elem.id].next = svq->free_head;
+    svq->free_head = used_elem.id;
+
+    svq->used_idx++;
+    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
+    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
 }
 
 /* Forward vhost notifications */
@@ -69,17 +214,33 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
     VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
                                              call_notifier);
     EventNotifier *masked_notifier;
+    VirtQueue *vq = svq->vq;
 
     masked_notifier = svq->masked_notifier.n;
 
-    if (!masked_notifier) {
-        unsigned n = virtio_get_queue_index(svq->vq);
-        virtio_queue_invalidate_signalled_used(svq->vdev, n);
-        virtio_notify_irqfd(svq->vdev, svq->vq);
-    } else if (!svq->masked_notifier.signaled) {
-        svq->masked_notifier.signaled = true;
-        event_notifier_set(svq->masked_notifier.n);
-    }
+    /* Make as many buffers as possible used. */
+    do {
+        unsigned i = 0;
+
+        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
+        while (true) {
+            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
+            if (!elem) {
+                break;
+            }
+
+            assert(i < svq->vring.num);
+            virtqueue_fill(vq, elem, elem->len, i++);
+        }
+
+        virtqueue_flush(vq, i);
+        if (!masked_notifier) {
+            virtio_notify_irqfd(svq->vdev, svq->vq);
+        } else if (!svq->masked_notifier.signaled) {
+            svq->masked_notifier.signaled = true;
+            event_notifier_set(svq->masked_notifier.n);
+        }
+    } while (vhost_shadow_vq_more_used(svq));
 }
 
 static void vhost_shadow_vq_handle_call(EventNotifier *n)
@@ -243,7 +404,11 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
                           unsigned idx,
                           VhostShadowVirtqueue *svq)
 {
+    int i;
     int r = vhost_shadow_vq_restore_vdev_host_notifier(dev, idx, svq);
+
+    assert(!dev->shadow_vqs_enabled);
+
     if (unlikely(r < 0)) {
         error_report("Couldn't restore vq kick fd: %s", strerror(-r));
     }
@@ -255,6 +420,18 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
     /* Restore vhost call */
     vhost_virtqueue_mask(dev, dev->vdev, dev->vq_index + idx,
                          dev->vqs[idx].notifier_is_masked);
+
+
+    for (i = 0; i < svq->vring.num; ++i) {
+        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
+        /*
+         * Although the doc says we must unpop in order, it's ok to unpop
+         * everything.
+         */
+        if (elem) {
+            virtqueue_unpop(svq->vq, elem, elem->len);
+        }
+    }
 }
 
 /*
@@ -269,7 +446,7 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
     size_t driver_size;
     size_t device_size;
     g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
-    int r;
+    int r, i;
 
     r = event_notifier_init(&svq->kick_notifier, 0);
     if (r != 0) {
@@ -295,6 +472,11 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
     memset(svq->vring.desc, 0, driver_size);
     svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
     memset(svq->vring.used, 0, device_size);
+    for (i = 0; i < num - 1; i++) {
+        svq->vring.desc[i].next = cpu_to_le16(i + 1);
+    }
+
+    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
     event_notifier_set_handler(&svq->call_notifier,
                                vhost_shadow_vq_handle_call);
     return g_steal_pointer(&svq);
@@ -314,6 +496,7 @@ void vhost_shadow_vq_free(VhostShadowVirtqueue *vq)
     event_notifier_cleanup(&vq->kick_notifier);
     event_notifier_set_handler(&vq->call_notifier, NULL);
     event_notifier_cleanup(&vq->call_notifier);
+    g_free(vq->ring_id_maps);
     qemu_vfree(vq->vring.desc);
     qemu_vfree(vq->vring.used);
     g_free(vq);
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 333877ca3b..5b5001a08a 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1021,6 +1021,19 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
 
     trace_vhost_iotlb_miss(dev, 1);
 
+    if (dev->shadow_vqs_enabled) {
+        uaddr = iova;
+        len = 4096;
+        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
+                                                IOMMU_RW);
+        if (ret) {
+            trace_vhost_iotlb_miss(dev, 2);
+            error_report("Fail to update device iotlb");
+        }
+
+        return ret;
+    }
+
     iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
                                           iova, write,
                                           MEMTXATTRS_UNSPECIFIED);
@@ -1222,12 +1235,37 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
 
 static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
 {
-    int idx;
+    int idx, r;
 
     dev->shadow_vqs_enabled = false;
 
+    r = dev->vhost_ops->vhost_vring_pause(dev);
+    assert(r == 0);
+    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
+        error_report("Fail to invalidate device iotlb");
+    }
+
     for (idx = 0; idx < dev->nvqs; ++idx) {
+        struct vhost_virtqueue *vq = dev->vqs + idx;
+        if (vhost_dev_has_iommu(dev) &&
+            dev->vhost_ops->vhost_set_iotlb_callback) {
+            /*
+             * Update used ring information for IOTLB to work correctly,
+             * the vhost-kernel code requires this.
+             */
+            vhost_device_iotlb_miss(dev, vq->used_phys, true);
+        }
+
         vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[idx]);
+        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx],
+                              dev->vq_index + idx);
+    }
+
+    /* Enable guest's vq vring */
+    r = dev->vhost_ops->vhost_dev_start(dev, true);
+    assert(r == 0);
+
+    for (idx = 0; idx < dev->nvqs; ++idx) {
         vhost_shadow_vq_free(dev->shadow_vqs[idx]);
     }
 
@@ -1236,9 +1274,64 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
     return 0;
 }
 
+/*
+ * Start shadow virtqueue in a given queue.
+ * In failure case, this function leaves queue working as regular vhost mode.
+ */
+static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
+                                             unsigned idx)
+{
+    struct vhost_vring_addr addr = {
+        .index = idx,
+    };
+    struct vhost_vring_state s = {
+        .index = idx,
+    };
+    int r;
+    bool ok;
+
+    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
+    ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
+    if (unlikely(!ok)) {
+        return false;
+    }
+
+    /* From this point, vhost_virtqueue_start can reset these changes */
+    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
+    r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
+    if (unlikely(r != 0)) {
+        VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
+        goto err;
+    }
+
+    r = dev->vhost_ops->vhost_set_vring_base(dev, &s);
+    if (unlikely(r != 0)) {
+        VHOST_OPS_DEBUG("vhost_set_vring_base for shadow vq failed");
+        goto err;
+    }
+
+    if (vhost_dev_has_iommu(dev) && dev->vhost_ops->vhost_set_iotlb_callback) {
+        /*
+         * Update used ring information for IOTLB to work correctly,
+         * the vhost-kernel code requires this.
+         */
+        r = vhost_device_iotlb_miss(dev, addr.used_user_addr, true);
+        if (unlikely(r != 0)) {
+            /* Debug message already printed */
+            goto err;
+        }
+    }
+
+    return true;
+
+err:
+    vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
+    return false;
+}
+
 static int vhost_sw_live_migration_start(struct vhost_dev *dev)
 {
-    int idx, stop_idx;
+    int r, idx, stop_idx;
 
     dev->shadow_vqs = g_new0(VhostShadowVirtqueue *, dev->nvqs);
     for (idx = 0; idx < dev->nvqs; ++idx) {
@@ -1248,23 +1341,37 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
         }
     }
 
+    r = dev->vhost_ops->vhost_vring_pause(dev);
+    assert(r == 0);
+    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
+        error_report("Fail to invalidate device iotlb");
+    }
+
+    /* Can be read by vhost_virtqueue_mask, from vm exit */
     dev->shadow_vqs_enabled = true;
     for (idx = 0; idx < dev->nvqs; ++idx) {
-        bool ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
+        bool ok = vhost_sw_live_migration_start_vq(dev, idx);
         if (unlikely(!ok)) {
             goto err_start;
         }
     }
 
+    /* Enable shadow vq vring */
+    r = dev->vhost_ops->vhost_dev_start(dev, true);
+    assert(r == 0);
     return 0;
 
 err_start:
     dev->shadow_vqs_enabled = false;
     for (stop_idx = 0; stop_idx < idx; stop_idx++) {
         vhost_shadow_vq_stop(dev, idx, dev->shadow_vqs[stop_idx]);
+        vhost_virtqueue_start(dev, dev->vdev, &dev->vqs[stop_idx],
+                              dev->vq_index + stop_idx);
     }
 
 err_new:
+    /* Enable guest's vring */
+    dev->vhost_ops->vhost_set_vring_enable(dev, true);
     for (idx = 0; idx < dev->nvqs; ++idx) {
         vhost_shadow_vq_free(dev->shadow_vqs[idx]);
     }
@@ -1979,6 +2086,27 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
 
         if (!hdev->started) {
             err_cause = "Device is not started";
+        } else if (!vhost_dev_has_iommu(hdev)) {
+            err_cause = "Does not support iommu";
+        } else if (hdev->acked_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
+            err_cause = "Is packed";
+        } else if (hdev->acked_features & BIT_ULL(VIRTIO_RING_F_EVENT_IDX)) {
+            err_cause = "Have event idx";
+        } else if (hdev->acked_features &
+                   BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC)) {
+            err_cause = "Supports indirect descriptors";
+        } else if (!hdev->vhost_ops->vhost_vring_pause ||
+                   !hdev->vhost_ops->vhost_dev_start) {
+            err_cause = "Cannot pause device";
+        } else if (hdev->vhost_ops->vhost_get_iova_range) {
+            err_cause = "Device may not support all iova range";
+        } else if (hdev->vhost_ops->vhost_enable_custom_iommu) {
+            err_cause = "Device does not use regular IOMMU";
+        } else if (!virtio_vdev_has_feature(hdev->vdev, VIRTIO_F_VERSION_1)) {
+            err_cause = "Legacy VirtIO device";
+        }
+
+        if (err_cause) {
             goto err;
         }
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 18/29] vhost: Use vhost_enable_custom_iommu to unmap everything if available
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (16 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-19 16:28 ` [RFC v3 19/29] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

This call is the right way to unmap every IOTLB entry in devices with a
non-standard IOMMU (vdpa devices), since the regular path would require an
IOTLB message they don't support.

Another possible solution would be to implement the
.vhost_send_device_iotlb_msg vhost operation in vhost-vdpa, but it
could conflict with the expected backend iommu operations.

Currently, this method does not work for vp_vdpa. For some reason, the
intel IOMMU is not able to map anything once vdpa has unmapped everything.
However, that is on the kernel side; the code in this commit should be as
intended in the final version.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 5b5001a08a..c8fa9df9b3 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1241,7 +1241,12 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
 
     r = dev->vhost_ops->vhost_vring_pause(dev);
     assert(r == 0);
-    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
+    if (dev->vhost_ops->vhost_enable_custom_iommu) {
+        r = dev->vhost_ops->vhost_enable_custom_iommu(dev, false);
+    } else {
+        r = vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL);
+    }
+    if (r) {
         error_report("Fail to invalidate device iotlb");
     }
 
@@ -1343,7 +1348,12 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
 
     r = dev->vhost_ops->vhost_vring_pause(dev);
     assert(r == 0);
-    if (vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL)) {
+    if (dev->vhost_ops->vhost_enable_custom_iommu) {
+        r = dev->vhost_ops->vhost_enable_custom_iommu(dev, false);
+    } else {
+        r = vhost_backend_invalidate_device_iotlb(dev, 0, -1ULL);
+    }
+    if (r) {
         error_report("Fail to invalidate device iotlb");
     }
 
@@ -2100,8 +2110,6 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
             err_cause = "Cannot pause device";
         } else if (hdev->vhost_ops->vhost_get_iova_range) {
             err_cause = "Device may not support all iova range";
-        } else if (hdev->vhost_ops->vhost_enable_custom_iommu) {
-            err_cause = "Device does not use regular IOMMU";
         } else if (!virtio_vdev_has_feature(hdev->vdev, VIRTIO_F_VERSION_1)) {
             err_cause = "Legacy VirtIO device";
         }
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 19/29] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (17 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 18/29] vhost: Use vhost_enable_custom_iommu to unmap everything if available Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-19 16:28 ` [RFC v3 20/29] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 6d767fe248..6b42147449 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -135,6 +135,15 @@ static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
     svq->ring_id_maps[qemu_head] = elem;
 }
 
+static void vhost_shadow_vq_kick(VhostShadowVirtqueue *svq)
+{
+    /* Make sure we are reading updated device flag */
+    smp_rmb();
+    if (!(svq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
+        event_notifier_set(&svq->kick_notifier);
+    }
+}
+
 /* Handle guest->device notifications */
 static void vhost_handle_guest_kick(EventNotifier *n)
 {
@@ -159,7 +168,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
             }
 
             vhost_shadow_vq_add(svq, elem);
-            event_notifier_set(&svq->kick_notifier);
+            vhost_shadow_vq_kick(svq);
         }
 
         virtio_queue_set_notification(svq->vq, true);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 20/29] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (18 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 19/29] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-19 16:28 ` [RFC v3 21/29] vhost: Add VhostIOVATree Eugenio Pérez
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 6b42147449..934d3bb27b 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -64,8 +64,30 @@ typedef struct VhostShadowVirtqueue {
 
     /* Next head to consume from device */
     uint16_t used_idx;
+
+    /* Cache for the exposed notification flag */
+    bool notification;
 } VhostShadowVirtqueue;
 
+static void vhost_shadow_vq_set_notification(VhostShadowVirtqueue *svq,
+                                             bool enable)
+{
+    uint16_t notification_flag;
+
+    if (svq->notification == enable) {
+        return;
+    }
+
+    notification_flag = cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
+
+    svq->notification = enable;
+    if (enable) {
+        svq->vring.avail->flags &= ~notification_flag;
+    } else {
+        svq->vring.avail->flags |= notification_flag;
+    }
+}
+
 static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
                                     const struct iovec *iovec,
                                     size_t num, bool more_descs, bool write)
@@ -231,7 +253,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
     do {
         unsigned i = 0;
 
-        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
+        vhost_shadow_vq_set_notification(svq, false);
         while (true) {
             g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
             if (!elem) {
@@ -249,6 +271,7 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
             svq->masked_notifier.signaled = true;
             event_notifier_set(svq->masked_notifier.n);
         }
+        vhost_shadow_vq_set_notification(svq, true);
     } while (vhost_shadow_vq_more_used(svq));
 }
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC v3 21/29] vhost: Add VhostIOVATree
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (19 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 20/29] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-31  9:40   ` Jason Wang
  2021-05-19 16:28 ` [RFC v3 22/29] vhost: Add iova_rev_maps_find_iova to IOVAReverseMaps Eugenio Pérez
                   ` (9 subsequent siblings)
  30 siblings, 1 reply; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

This tree is able to look up a translated address from an IOVA address.

At first glance it is similar to util/iova-tree. However, SVQ working on
devices with limited IOVA space needs more capabilities, like allocating
IOVA chunks or performing reverse translations (qemu addresses to iova).

So this starts a separate implementation. Knowing that insertions/deletions
will not be as frequent as searches, it uses an ordered array as the
underlying implementation. A different name could be used, but "ordered
searchable array" is a little bit long.
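
A hypothetical usage sketch of the API added below; qemu_buf stands in for
any qemu virtual address:

    VhostIOVATree tree;
    VhostDMAMap map = {
        .translated_addr = qemu_buf,
        .iova = 0x1000,
        .size = 0xfff,      /* size is inclusive: 4 KiB */
        .perm = IOMMU_RW,
    };

    vhost_iova_tree_new(&tree);
    if (vhost_iova_tree_insert(&tree, &map) == VHOST_DMA_MAP_OK) {
        const VhostDMAMap *found = vhost_iova_tree_find_taddr(&tree, &map);
        /* found->translated_addr == qemu_buf */
    }
    vhost_iova_tree_destroy(&tree);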

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-iova-tree.h |  50 ++++++++++
 hw/virtio/vhost-iova-tree.c | 188 ++++++++++++++++++++++++++++++++++++
 hw/virtio/meson.build       |   2 +-
 3 files changed, 239 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/vhost-iova-tree.h
 create mode 100644 hw/virtio/vhost-iova-tree.c

diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
new file mode 100644
index 0000000000..2a44af8b3a
--- /dev/null
+++ b/hw/virtio/vhost-iova-tree.h
@@ -0,0 +1,50 @@
+/*
+ * vhost software live migration ring
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
+#define HW_VIRTIO_VHOST_IOVA_TREE_H
+
+#include <gmodule.h>
+
+#include "exec/memory.h"
+
+typedef struct VhostDMAMap {
+    void *translated_addr;
+    hwaddr iova;
+    hwaddr size;                /* Inclusive */
+    IOMMUAccessFlags perm;
+} VhostDMAMap;
+
+typedef enum VhostDMAMapNewRC {
+    VHOST_DMA_MAP_OVERLAP = -2,
+    VHOST_DMA_MAP_INVALID = -1,
+    VHOST_DMA_MAP_OK = 0,
+} VhostDMAMapNewRC;
+
+/**
+ * VhostIOVATree
+ *
+ * Store and search IOVA -> Translated mappings.
+ *
+ * Note that it cannot remove nodes.
+ */
+typedef struct VhostIOVATree {
+    /* Ordered array of reverse translations, IOVA address to qemu memory. */
+    GArray *iova_taddr_map;
+} VhostIOVATree;
+
+void vhost_iova_tree_new(VhostIOVATree *iova_rm);
+void vhost_iova_tree_destroy(VhostIOVATree *iova_rm);
+
+const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *iova_rm,
+                                              const VhostDMAMap *map);
+VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *iova_rm,
+                                        VhostDMAMap *map);
+
+#endif
diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
new file mode 100644
index 0000000000..dfd7e448b5
--- /dev/null
+++ b/hw/virtio/vhost-iova-tree.c
@@ -0,0 +1,188 @@
+/*
+ * vhost software live migration ring
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "vhost-iova-tree.h"
+
+#define G_ARRAY_NOT_ZERO_TERMINATED false
+#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
+
+/**
+ * Inserts an element after an existing one in garray.
+ *
+ * @array      The array
+ * @prev_elem  The previous element of the array, or NULL if prepending
+ * @map        The DMA map
+ *
+ * It provides the additional advantage of being type safe over
+ * g_array_insert_val, which accepts a reference pointer instead of a value
+ * without complaint.
+ */
+static void vhost_iova_tree_insert_after(GArray *array,
+                                         const VhostDMAMap *prev_elem,
+                                         const VhostDMAMap *map)
+{
+    size_t pos;
+
+    if (!prev_elem) {
+        pos = 0;
+    } else {
+        pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
+    }
+
+    g_array_insert_val(array, pos, *map);
+}
+
+static gint vhost_iova_tree_cmp_iova(gconstpointer a, gconstpointer b)
+{
+    const VhostDMAMap *m1 = a, *m2 = b;
+
+    if (m1->iova > m2->iova + m2->size) {
+        return 1;
+    }
+
+    if (m1->iova + m1->size < m2->iova) {
+        return -1;
+    }
+
+    /* Overlapped */
+    return 0;
+}
+
+/**
+ * Find the previous node to a given iova
+ *
+ * @array  The ascending ordered array of VhostDMAMap
+ * @map    The map to insert
+ * @prev   Returned location of the previous map
+ *
+ * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
+ * it already exists. It is ok to use this function to check if a given range
+ * exists, but it will use a linear search.
+ *
+ * TODO: We can use bsearch to locate the entry if we save the state in the
+ * needle, knowing that the needle is always the first argument to
+ * compare_func.
+ */
+static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
+                                                  GCompareFunc compare_func,
+                                                  const VhostDMAMap *map,
+                                                  const VhostDMAMap **prev)
+{
+    size_t i;
+    int r;
+
+    *prev = NULL;
+    for (i = 0; i < array->len; ++i) {
+        r = compare_func(map, &g_array_index(array, typeof(*map), i));
+        if (r == 0) {
+            return VHOST_DMA_MAP_OVERLAP;
+        }
+        if (r < 0) {
+            return VHOST_DMA_MAP_OK;
+        }
+
+        *prev = &g_array_index(array, typeof(**prev), i);
+    }
+
+    return VHOST_DMA_MAP_OK;
+}
+
+/**
+ * Create a new IOVA tree
+ *
+ * @tree  The IOVA tree
+ */
+void vhost_iova_tree_new(VhostIOVATree *tree)
+{
+    assert(tree);
+
+    tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
+                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
+                                       sizeof(VhostDMAMap));
+}
+
+/**
+ * Destroy an IOVA tree
+ *
+ * @tree  The iova tree
+ */
+void vhost_iova_tree_destroy(VhostIOVATree *tree)
+{
+    g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
+}
+
+/**
+ * Perform a search on a GArray.
+ *
+ * @array Glib array
+ * @map Map to look up
+ * @compare_func Compare function to use
+ *
+ * Return The found element or NULL if not found.
+ *
+ * This can be replaced with g_array_binary_search (Since glib 2.62) when that
+ * is common enough.
+ */
+static const VhostDMAMap *vhost_iova_tree_bsearch(const GArray *array,
+                                                  const VhostDMAMap *map,
+                                                  GCompareFunc compare_func)
+{
+    return bsearch(map, array->data, array->len, sizeof(*map), compare_func);
+}
+
+/**
+ * Find the translated address stored from a IOVA address
+ *
+ * @tree  The iova tree
+ * @map   The map with the memory address
+ *
+ * Return the stored mapping, or NULL if not found.
+ */
+const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
+                                              const VhostDMAMap *map)
+{
+    return vhost_iova_tree_bsearch(tree->iova_taddr_map, map,
+                                  vhost_iova_tree_cmp_iova);
+}
+
+/**
+ * Insert a new map
+ *
+ * @tree  The iova tree
+ * @map   The iova map
+ *
+ * Returns:
+ * - VHOST_DMA_MAP_OK if the map fits in the container
+ * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
+ * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
+ * The assigned iova can be queried in map.
+ */
+VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
+                                        VhostDMAMap *map)
+{
+    const VhostDMAMap *prev;
+    int find_prev_rc;
+
+    if (map->translated_addr + map->size < map->translated_addr ||
+        map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
+        return VHOST_DMA_MAP_INVALID;
+    }
+
+    /* Check for duplicates, and save position for insertion */
+    find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
+                                             vhost_iova_tree_cmp_iova, map,
+                                             &prev);
+    if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
+        return VHOST_DMA_MAP_OVERLAP;
+    }
+
+    vhost_iova_tree_insert_after(tree->iova_taddr_map, prev, map);
+    return VHOST_DMA_MAP_OK;
+}
diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index 8b5a0225fe..cb306b83c6 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
 
 virtio_ss = ss.source_set()
 virtio_ss.add(files('virtio.c'))
-virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
+virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
 virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
-- 
2.27.0




* [RFC v3 22/29] vhost: Add iova_rev_maps_find_iova to IOVAReverseMaps
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (20 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 21/29] vhost: Add VhostIOVATree Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-19 16:28 ` [RFC v3 23/29] vhost: Use a tree to store memory mappings Eugenio Pérez
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

This way, the shadow virtqueue can translate addresses from the guest's
address space into its own. It duplicates the array so it can search
efficiently in both directions, and it will signal overlap if either the
iova or the translated address is already present in its respective
array.
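
A minimal usage sketch of the two lookup directions (illustrative only,
not part of the patch; the addresses are made up):

    VhostIOVATree tree;
    VhostDMAMap map = {
        .iova = 0x1000,
        .translated_addr = (void *)0x7f0000001000,
        .size = 0xfff,                    /* inclusive: a 4 KiB region */
        .perm = VHOST_ACCESS_RW,
    };

    vhost_iova_tree_new(&tree);
    vhost_iova_tree_insert(&tree, &map);

    /* The same mapping is now reachable through either ordered array */
    const VhostDMAMap *by_iova  = vhost_iova_tree_find_taddr(&tree, &map);
    const VhostDMAMap *by_taddr = vhost_iova_tree_find_iova(&tree, &map);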

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-iova-tree.h | 10 +++++++-
 hw/virtio/vhost-iova-tree.c | 49 ++++++++++++++++++++++++++++++++++---
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
index 2a44af8b3a..589e86bd91 100644
--- a/hw/virtio/vhost-iova-tree.h
+++ b/hw/virtio/vhost-iova-tree.h
@@ -30,18 +30,26 @@ typedef enum VhostDMAMapNewRC {
 /**
  * VhostIOVATree
  *
- * Store and search IOVA -> Translated mappings.
+ * Store and search IOVA -> Translated mappings and the reverse, from
+ * translated address to IOVA.
  *
  * Note that it cannot remove nodes.
  */
 typedef struct VhostIOVATree {
     /* Ordered array of reverse translations, IOVA address to qemu memory. */
     GArray *iova_taddr_map;
+
+    /*
+     * Ordered array of translations from qemu virtual memory address to iova
+     */
+    GArray *taddr_iova_map;
 } VhostIOVATree;
 
 void vhost_iova_tree_new(VhostIOVATree *iova_rm);
 void vhost_iova_tree_destroy(VhostIOVATree *iova_rm);
 
+const VhostDMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_rm,
+                                             const VhostDMAMap *map);
 const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *iova_rm,
                                               const VhostDMAMap *map);
 VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *iova_rm,
diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
index dfd7e448b5..2900390a1e 100644
--- a/hw/virtio/vhost-iova-tree.c
+++ b/hw/virtio/vhost-iova-tree.c
@@ -39,6 +39,22 @@ static void vhost_iova_tree_insert_after(GArray *array,
     g_array_insert_val(array, pos, *map);
 }
 
+static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b)
+{
+    const VhostDMAMap *m1 = a, *m2 = b;
+
+    if (m1->translated_addr > m2->translated_addr + m2->size) {
+        return 1;
+    }
+
+    if (m1->translated_addr + m1->size < m2->translated_addr) {
+        return -1;
+    }
+
+    /* Overlapped */
+    return 0;
+}
+
 static gint vhost_iova_tree_cmp_iova(gconstpointer a, gconstpointer b)
 {
     const VhostDMAMap *m1 = a, *m2 = b;
@@ -106,6 +122,9 @@ void vhost_iova_tree_new(VhostIOVATree *tree)
     tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
                                        G_ARRAY_NOT_CLEAR_ON_ALLOC,
                                        sizeof(VhostDMAMap));
+    tree->taddr_iova_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
+                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
+                                       sizeof(VhostDMAMap));
 }
 
 /**
@@ -116,6 +135,7 @@ void vhost_iova_tree_new(VhostIOVATree *tree)
 void vhost_iova_tree_destroy(VhostIOVATree *tree)
 {
     g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
+    g_array_unref(g_steal_pointer(&tree->taddr_iova_map));
 }
 
 /**
@@ -137,6 +157,21 @@ static const VhostDMAMap *vhost_iova_tree_bsearch(const GArray *array,
     return bsearch(map, array->data, array->len, sizeof(*map), compare_func);
 }
 
+/**
+ * Find the IOVA address stored from a memory address
+ *
+ * @tree     The iova tree
+ * @map      The map with the memory address
+ *
+ * Return the stored mapping, or NULL if not found.
+ */
+const VhostDMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
+                                             const VhostDMAMap *map)
+{
+    return vhost_iova_tree_bsearch(tree->taddr_iova_map, map,
+                                   vhost_iova_tree_cmp_taddr);
+}
+
 /**
  * Find the translated address stored from an IOVA address
  *
@@ -167,7 +202,7 @@ const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
 VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
                                         VhostDMAMap *map)
 {
-    const VhostDMAMap *prev;
+    const VhostDMAMap *qemu_prev, *iova_prev;
     int find_prev_rc;
 
     if (map->translated_addr + map->size < map->translated_addr ||
@@ -178,11 +213,19 @@ VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
     /* Check for duplicates, and save position for insertion */
     find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
                                              vhost_iova_tree_cmp_iova, map,
-                                             &prev);
+                                             &iova_prev);
+    if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
+        return VHOST_DMA_MAP_OVERLAP;
+    }
+
+    find_prev_rc = vhost_iova_tree_find_prev(tree->taddr_iova_map,
+                                             vhost_iova_tree_cmp_taddr, map,
+                                             &qemu_prev);
     if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
         return VHOST_DMA_MAP_OVERLAP;
     }
 
-    vhost_iova_tree_insert_after(tree->iova_taddr_map, prev, map);
+    vhost_iova_tree_insert_after(tree->iova_taddr_map, iova_prev, map);
+    vhost_iova_tree_insert_after(tree->taddr_iova_map, qemu_prev, map);
     return VHOST_DMA_MAP_OK;
 }
-- 
2.27.0




* [RFC v3 23/29] vhost: Use a tree to store memory mappings
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (21 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 22/29] vhost: Add iova_rev_maps_find_iova to IOVAReverseMaps Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-19 16:28 ` [RFC v3 24/29] vhost: Add iova_rev_maps_alloc Eugenio Pérez
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

At the moment, the tree is only used to store 1:1 maps of the qemu
virtual addresses of the shadow virtqueue vrings and of the guest's
memory. In other words, the tree only serves to check that the address
the guest exposed is valid at the moment qemu receives the miss.

It does not work yet if the device has restrictions in its iova range.

Updates to the tree are protected by the BQL: each one always runs from
the main event loop context. vhost_device_iotlb_miss also reads the
tree from that same context.
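
A minimal sketch of the miss-path lookup this enables, assuming the 1:1
regions were already inserted at SVQ start (simplified from the patch
below):

    /* On an IOTLB miss, validate the iova against the stored 1:1 maps */
    const VhostDMAMap needle = { .iova = iova };
    const VhostDMAMap *result = vhost_iova_tree_find_taddr(&dev->iova_map,
                                                           &needle);
    if (!result) {
        return -EFAULT;        /* the guest exposed an invalid address */
    }

    /* result->size is inclusive; vhost wants one past the last byte */
    uint64_t uaddr = (uint64_t)result->translated_addr;
    uint64_t len = result->size + 1;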

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost.h |   3 +
 hw/virtio/vhost.c         | 121 ++++++++++++++++++++++++++++++--------
 2 files changed, 99 insertions(+), 25 deletions(-)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index c97a4c0017..773f882145 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -2,6 +2,7 @@
 #define VHOST_H
 
 #include "hw/virtio/vhost-backend.h"
+#include "hw/virtio/vhost-iova-tree.h"
 #include "hw/virtio/virtio.h"
 #include "exec/memory.h"
 
@@ -88,6 +89,8 @@ struct vhost_dev {
     bool log_enabled;
     bool shadow_vqs_enabled;
     uint64_t log_size;
+    /* IOVA mapping used by Shadow Virtqueue */
+    VhostIOVATree iova_map;
     struct {
         hwaddr first;
         hwaddr last;
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index c8fa9df9b3..925d2146a4 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1013,31 +1013,45 @@ static int vhost_memory_region_lookup(struct vhost_dev *hdev,
 
 int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
 {
-    IOMMUTLBEntry iotlb;
+    IOMMUAccessFlags perm;
     uint64_t uaddr, len;
     int ret = -EFAULT;
 
-    RCU_READ_LOCK_GUARD();
-
     trace_vhost_iotlb_miss(dev, 1);
 
     if (dev->shadow_vqs_enabled) {
-        uaddr = iova;
-        len = 4096;
-        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len,
-                                                IOMMU_RW);
-        if (ret) {
-            trace_vhost_iotlb_miss(dev, 2);
-            error_report("Fail to update device iotlb");
+        /* Shadow virtqueue translations in its Virtual Address Space */
+        const VhostDMAMap *result;
+        const VhostDMAMap needle = {
+            .iova = iova,
+        };
+
+        result = vhost_iova_tree_find_taddr(&dev->iova_map, &needle);
+
+        if (unlikely(!result)) {
+            goto out;
         }
 
-        return ret;
-    }
+        iova = result->iova;
+        uaddr = (uint64_t)result->translated_addr;
+        /*
+         * In the IOVA tree, result->iova + result->size is the last byte
+         * included in the mapping. For vhost, len is one past that byte.
+         */
+        len = result->size + 1;
+        perm = result->perm;
+    } else {
+        IOMMUTLBEntry iotlb;
+
+        RCU_READ_LOCK_GUARD();
+        iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
+                                              iova, write,
+                                              MEMTXATTRS_UNSPECIFIED);
+
+        if (iotlb.target_as == NULL) {
+            goto out;
+        }
 
-    iotlb = address_space_get_iotlb_entry(dev->vdev->dma_as,
-                                          iova, write,
-                                          MEMTXATTRS_UNSPECIFIED);
-    if (iotlb.target_as != NULL) {
         ret = vhost_memory_region_lookup(dev, iotlb.translated_addr,
                                          &uaddr, &len);
         if (ret) {
@@ -1049,14 +1063,14 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
 
         len = MIN(iotlb.addr_mask + 1, len);
         iova = iova & ~iotlb.addr_mask;
+        perm = iotlb.perm;
+    }
 
-        ret = vhost_backend_update_device_iotlb(dev, iova, uaddr,
-                                                len, iotlb.perm);
-        if (ret) {
-            trace_vhost_iotlb_miss(dev, 4);
-            error_report("Fail to update device iotlb");
-            goto out;
-        }
+    ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, len, perm);
+    if (ret) {
+        trace_vhost_iotlb_miss(dev, 4);
+        error_report("Fail to update device iotlb");
+        goto out;
     }
 
     trace_vhost_iotlb_miss(dev, 2);
@@ -1249,7 +1263,7 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
     if (r) {
         error_report("Fail to invalidate device iotlb");
     }
-
+    vhost_iova_tree_destroy(&dev->iova_map);
     for (idx = 0; idx < dev->nvqs; ++idx) {
         struct vhost_virtqueue *vq = dev->vqs + idx;
         if (vhost_dev_has_iommu(dev) &&
@@ -1279,6 +1293,26 @@ static int vhost_sw_live_migration_stop(struct vhost_dev *dev)
     return 0;
 }
 
+static bool vhost_shadow_vq_start_store_sections(struct vhost_dev *dev)
+{
+    int idx;
+
+    for (idx = 0; idx < dev->n_mem_sections; ++idx) {
+        size_t region_size = dev->mem->regions[idx].memory_size;
+        VhostDMAMap region = {
+            .iova = dev->mem->regions[idx].userspace_addr,
+            .translated_addr = (void *)dev->mem->regions[idx].userspace_addr,
+            .size = region_size - 1,
+            .perm = VHOST_ACCESS_RW,
+        };
+
+        VhostDMAMapNewRC r = vhost_iova_tree_insert(&dev->iova_map, &region);
+        assert(r == VHOST_DMA_MAP_OK);
+    }
+
+    return true;
+}
+
 /*
  * Start shadow virtqueue in a given queue.
  * In failure case, this function leaves queue working as regular vhost mode.
@@ -1292,9 +1326,37 @@ static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
     struct vhost_vring_state s = {
         .index = idx,
     };
+    VhostDMAMap driver_region, device_region;
+
     int r;
     bool ok;
 
+    assert(dev->shadow_vqs[idx] != NULL);
+    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
+    driver_region = (VhostDMAMap) {
+        .iova = addr.desc_user_addr,
+        .translated_addr = (void *)addr.desc_user_addr,
+
+        /*
+         * VhostDMAMap.size is the last byte included in the range, while
+         * the area size is one past it. Subtract one byte to make them match.
+         */
+        .size = vhost_shadow_vq_driver_area_size(dev->shadow_vqs[idx]) - 1,
+        .perm = VHOST_ACCESS_RO,
+    };
+    device_region = (VhostDMAMap) {
+        .iova = addr.used_user_addr,
+        .translated_addr = (void *)addr.used_user_addr,
+        .size = vhost_shadow_vq_device_area_size(dev->shadow_vqs[idx]) - 1,
+        .perm = VHOST_ACCESS_RW,
+    };
+
+    r = vhost_iova_tree_insert(&dev->iova_map, &driver_region);
+    assert(r == VHOST_DMA_MAP_OK);
+
+    r = vhost_iova_tree_insert(&dev->iova_map, &device_region);
+    assert(r == VHOST_DMA_MAP_OK);
+
     vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
     ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
     if (unlikely(!ok)) {
@@ -1302,7 +1364,6 @@ static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
     }
 
     /* From this point, vhost_virtqueue_start can reset these changes */
-    vhost_shadow_vq_get_vring_addr(dev->shadow_vqs[idx], &addr);
     r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
     if (unlikely(r != 0)) {
         VHOST_OPS_DEBUG("vhost_set_vring_addr for shadow vq failed");
@@ -1315,6 +1376,7 @@ static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
         goto err;
     }
 
+
     if (vhost_dev_has_iommu(dev) && dev->vhost_ops->vhost_set_iotlb_callback) {
         /*
          * Update used ring information for IOTLB to work correctly,
@@ -1357,6 +1419,15 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
         error_report("Fail to invalidate device iotlb");
     }
 
+    /*
+     * Create new iova mappings. SVQ always exposes qemu's VA.
+     * TODO: Fine tune the exported mapping. Default vhost does not expose
+     * everything.
+     */
+
+    vhost_iova_tree_new(&dev->iova_map);
+    vhost_shadow_vq_start_store_sections(dev);
+
     /* Can be read by vhost_virtqueue_mask, from vm exit */
     dev->shadow_vqs_enabled = true;
     for (idx = 0; idx < dev->nvqs; ++idx) {
-- 
2.27.0




* [RFC v3 24/29] vhost: Add iova_rev_maps_alloc
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (22 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 23/29] vhost: Use a tree to store memory mappings Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-05-19 16:28 ` [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ Eugenio Pérez
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

This way the tree can store/map a completely new virtual address space
in the range the device admits.

It does not limit the range it will allocate, but the IOVA addresses of
the maps are allocated in growing order. A range limitation will be
needed for the cases where start_addr != 0.

Tools to remove mappings will also be needed.
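
A minimal sketch of the intended use (illustrative only; the backing
buffers are hypothetical): consecutive allocations receive growing iova
values chosen by the tree.

    void *qemu_va_a = g_malloc(0x1000);   /* hypothetical backing buffers */
    void *qemu_va_b = g_malloc(0x1000);

    VhostDMAMap first = {
        .translated_addr = qemu_va_a,
        .size = 0xfff,                    /* inclusive: 4 KiB */
        .perm = VHOST_ACCESS_RW,
    };
    VhostDMAMap second = {
        .translated_addr = qemu_va_b,
        .size = 0xfff,
        .perm = VHOST_ACCESS_RW,
    };

    /* The tree assigns the iova: first.iova == 0, second.iova == 0x1000 */
    vhost_iova_tree_alloc(&tree, &first);
    vhost_iova_tree_alloc(&tree, &second);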

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-iova-tree.h | 11 +++---
 hw/virtio/vhost-iova-tree.c | 72 +++++++++++++++++++++++++++++++------
 2 files changed, 69 insertions(+), 14 deletions(-)

diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
index 589e86bd91..92e2d992b3 100644
--- a/hw/virtio/vhost-iova-tree.h
+++ b/hw/virtio/vhost-iova-tree.h
@@ -22,16 +22,17 @@ typedef struct VhostDMAMap {
 } VhostDMAMap;
 
 typedef enum VhostDMAMapNewRC {
+    VHOST_DMA_MAP_NO_SPACE = -3,
     VHOST_DMA_MAP_OVERLAP = -2,
     VHOST_DMA_MAP_INVALID = -1,
     VHOST_DMA_MAP_OK = 0,
 } VhostDMAMapNewRC;
 
 /**
- * VhostIOVATree
- *
- * Store and search IOVA -> Translated mappings and the reverse, from
- * translated address to IOVA.
+ * VhostIOVATree, able to:
+ * - Translate iova address
+ * - Reverse translate iova address (from translated to iova)
+ * - Allocate IOVA regions for translated range (potentially slow operation)
  *
  * Note that it cannot remove nodes.
  */
@@ -54,5 +55,7 @@ const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *iova_rm,
                                               const VhostDMAMap *map);
 VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *iova_rm,
                                         VhostDMAMap *map);
+VhostDMAMapNewRC vhost_iova_tree_alloc(VhostIOVATree *iova_rm,
+                                       VhostDMAMap *map);
 
 #endif
diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
index 2900390a1e..7699d96bbb 100644
--- a/hw/virtio/vhost-iova-tree.c
+++ b/hw/virtio/vhost-iova-tree.c
@@ -187,8 +187,30 @@ const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
                                   vhost_iova_tree_cmp_iova);
 }
 
+static bool vhost_iova_tree_find_iova_hole(const GArray *iova_map,
+                                           const VhostDMAMap *map,
+                                           const VhostDMAMap **prev_elem)
+{
+    size_t i;
+    hwaddr iova = 0;
+
+    *prev_elem = NULL;
+    for (i = 0; i < iova_map->len; i++) {
+        const VhostDMAMap *next = &g_array_index(iova_map, typeof(*next), i);
+        hwaddr hole_end = next->iova;
+        if (map->size < hole_end - iova) {
+            return true;
+        }
+
+        iova = next->iova + next->size + 1;
+        *prev_elem = next;
+    }
+
+    return ((hwaddr)-1 - iova) > map->size;
+}
+
 /**
- * Insert a new map
+ * Insert a new map - internal
  *
  * @tree  The iova tree
  * @map   The iova map
@@ -197,10 +219,13 @@ const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
  * - VHOST_DMA_MAP_OK if the map fits in the container
  * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
  * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
- * Can query the assignated iova in map.
+ * - VHOST_DMA_MAP_NO_SPACE if iova_rm cannot allocate more space.
+ *
+ * It returns the assigned iova in map->iova if it returns VHOST_DMA_MAP_OK.
  */
-VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
-                                        VhostDMAMap *map)
+static VhostDMAMapNewRC vhost_iova_tree_insert_int(VhostIOVATree *tree,
+                                                   VhostDMAMap *map,
+                                                   bool allocate)
 {
     const VhostDMAMap *qemu_prev, *iova_prev;
     int find_prev_rc;
@@ -210,12 +235,27 @@ VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
         return VHOST_DMA_MAP_INVALID;
     }
 
-    /* Check for duplicates, and save position for insertion */
-    find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
-                                             vhost_iova_tree_cmp_iova, map,
-                                             &iova_prev);
-    if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
-        return VHOST_DMA_MAP_OVERLAP;
+    if (allocate) {
+        /* Search for a hole in iova space big enough */
+        bool fit = vhost_iova_tree_find_iova_hole(tree->iova_taddr_map, map,
+                                                &iova_prev);
+        if (!fit) {
+            return VHOST_DMA_MAP_NO_SPACE;
+        }
+
+        map->iova = iova_prev ? (iova_prev->iova + iova_prev->size) + 1 : 0;
+    } else {
+        if (map->iova + map->size < map->iova) {
+            return VHOST_DMA_MAP_INVALID;
+        }
+
+        /* Check for duplicates, and save position for insertion */
+        find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
+                                                 vhost_iova_tree_cmp_iova, map,
+                                                 &iova_prev);
+        if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
+            return VHOST_DMA_MAP_OVERLAP;
+        }
     }
 
     find_prev_rc = vhost_iova_tree_find_prev(tree->taddr_iova_map,
@@ -229,3 +269,15 @@ VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
     vhost_iova_tree_insert_after(tree->taddr_iova_map, qemu_prev, map);
     return VHOST_DMA_MAP_OK;
 }
+
+VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
+                                        VhostDMAMap *map)
+{
+    return vhost_iova_tree_insert_int(tree, map, false);
+}
+
+VhostDMAMapNewRC vhost_iova_tree_alloc(VhostIOVATree *tree,
+                                       VhostDMAMap *map)
+{
+    return vhost_iova_tree_insert_int(tree, map, true);
+}
-- 
2.27.0




* [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (23 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 24/29] vhost: Add iova_rev_maps_alloc Eugenio Pérez
@ 2021-05-19 16:28 ` Eugenio Pérez
  2021-06-02  9:51   ` Jason Wang
  2021-05-19 16:29 ` [RFC v3 26/29] vhost: Map in vdpa-dev Eugenio Pérez
                   ` (5 subsequent siblings)
  30 siblings, 1 reply; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Use the translations added in IOVAReverseMaps in SVQ if the vhost
device does not support mapping the full qemu virtual address space. In
other cases, the shadow virtqueue still uses the qemu virtual address
of the buffer pointed to by the descriptor, which has already been
translated by qemu's VirtQueue machinery.

Now every element also needs to store the previous addresses, so
VirtQueue can consume the elements properly. This adds a little
overhead per VQ element, since more memory must be allocated to stash
them. As a possible optimization, this allocation could be avoided if
the descriptor is a single buffer rather than a chain, but this is left
undone.

Also check for vhost_set_iotlb_callback before sending the used ring
remapping. This is only needed for vhost-kernel, and it would print an
error for vhost devices with their own mapping (vdpa).

This check could change to another callback, like vhost_force_iommu or
enable_custom_iommu. Another option could be to at least extract the
check of "is map(used, writable) needed?" into another function. But
for the moment, just copy the check used in vhost_dev_start here.
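
A minimal sketch of the per-descriptor translation this enables,
assuming the map returned by vhost_iova_tree_find_iova() covers the
whole buffer (simplified from the patch below; iov and map stand in for
the patch's variables):

    /* Translate one qemu VA in an iovec to the iova the device sees */
    size_t off = (char *)iov->iov_base - (char *)map->translated_addr;
    iov->iov_base = (void *)(map->iova + off);

    /* The original iov_base was stashed first, so it can be restored
     * before virtqueue_fill() when the device marks the buffer used. */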

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 134 ++++++++++++++++++++++++++---
 hw/virtio/vhost.c                  |  29 +++++--
 2 files changed, 145 insertions(+), 18 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 934d3bb27b..a92da979d1 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -10,12 +10,19 @@
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/virtio-access.h"
+#include "hw/virtio/vhost-iova-tree.h"
 
 #include "standard-headers/linux/vhost_types.h"
 
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 
+typedef struct SVQElement {
+    VirtQueueElement elem;
+    void **in_sg_stash;
+    void **out_sg_stash;
+} SVQElement;
+
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
     /* Shadow vring */
@@ -50,8 +57,11 @@ typedef struct VhostShadowVirtqueue {
     /* Virtio device */
     VirtIODevice *vdev;
 
+    /* IOVA mapping if used */
+    VhostIOVATree *iova_map;
+
     /* Map for returning guest's descriptors */
-    VirtQueueElement **ring_id_maps;
+    SVQElement **ring_id_maps;
 
     /* Next head to expose to device */
     uint16_t avail_idx_shadow;
@@ -88,6 +98,66 @@ static void vhost_shadow_vq_set_notification(VhostShadowVirtqueue *svq,
     }
 }
 
+static void vhost_shadow_vq_stash_addr(void ***stash, const struct iovec *iov,
+                                       size_t num)
+{
+    size_t i;
+
+    if (num == 0) {
+        return;
+    }
+
+    *stash = g_new(void *, num);
+    for (i = 0; i < num; ++i) {
+        (*stash)[i] = iov[i].iov_base;
+    }
+}
+
+static void vhost_shadow_vq_unstash_addr(void **stash,
+                                         struct iovec *iov,
+                                         size_t num)
+{
+    size_t i;
+
+    if (num == 0) {
+        return;
+    }
+
+    for (i = 0; i < num; ++i) {
+        iov[i].iov_base = stash[i];
+    }
+    g_free(stash);
+}
+
+static void vhost_shadow_vq_translate_addr(const VhostShadowVirtqueue *svq,
+                                           struct iovec *iovec, size_t num)
+{
+    size_t i;
+
+    for (i = 0; i < num; ++i) {
+        VhostDMAMap needle = {
+            .translated_addr = iovec[i].iov_base,
+            .size = iovec[i].iov_len,
+        };
+        size_t off;
+
+        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
+                                                           &needle);
+        /*
+         * Map cannot be NULL since iova map contains all guest space and
+         * qemu already has a physical address mapped
+         */
+        assert(map);
+
+        /*
+         * Map->iova chunk size is ignored. What to do if descriptor
+         * (addr, size) does not fit is delegated to the device.
+         */
+        off = needle.translated_addr - map->translated_addr;
+        iovec[i].iov_base = (void *)(map->iova + off);
+    }
+}
+
 static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
                                     const struct iovec *iovec,
                                     size_t num, bool more_descs, bool write)
@@ -118,8 +188,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
 }
 
 static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
-                                          VirtQueueElement *elem)
+                                          SVQElement *svq_elem)
 {
+    VirtQueueElement *elem = &svq_elem->elem;
     int head;
     unsigned avail_idx;
     vring_avail_t *avail = svq->vring.avail;
@@ -129,6 +200,16 @@ static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
     /* We need some descriptors here */
     assert(elem->out_num || elem->in_num);
 
+    if (svq->iova_map) {
+        vhost_shadow_vq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg,
+                                   elem->in_num);
+        vhost_shadow_vq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg,
+                                   elem->out_num);
+
+        vhost_shadow_vq_translate_addr(svq, elem->in_sg, elem->in_num);
+        vhost_shadow_vq_translate_addr(svq, elem->out_sg, elem->out_num);
+    }
+
     vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
                             elem->in_num > 0, false);
     vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
@@ -150,7 +231,7 @@ static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
 }
 
 static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
-                                VirtQueueElement *elem)
+                                SVQElement *elem)
 {
     unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
 
@@ -184,7 +265,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
         }
 
         while (true) {
-            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
+            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
             if (!elem) {
                 break;
             }
@@ -210,7 +291,7 @@ static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
     return svq->used_idx != svq->shadow_used_idx;
 }
 
-static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
+static SVQElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
 {
     vring_desc_t *descs = svq->vring.desc;
     const vring_used_t *used = svq->vring.used;
@@ -235,7 +316,7 @@ static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
     svq->free_head = used_elem.id;
 
     svq->used_idx++;
-    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
+    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
     return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
 }
 
@@ -255,12 +336,21 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
 
         vhost_shadow_vq_set_notification(svq, false);
         while (true) {
-            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
-            if (!elem) {
+            g_autofree SVQElement *svq_elem = vhost_shadow_vq_get_buf(svq);
+            VirtQueueElement *elem;
+            if (!svq_elem) {
                 break;
             }
 
             assert(i < svq->vring.num);
+            elem = &svq_elem->elem;
+
+            if (svq->iova_map) {
+                vhost_shadow_vq_unstash_addr(svq_elem->in_sg_stash,
+                                             elem->in_sg, elem->in_num);
+                vhost_shadow_vq_unstash_addr(svq_elem->out_sg_stash,
+                                             elem->out_sg, elem->out_num);
+            }
             virtqueue_fill(vq, elem, elem->len, i++);
         }
 
@@ -455,14 +545,27 @@ void vhost_shadow_vq_stop(struct vhost_dev *dev,
 
 
     for (i = 0; i < svq->vring.num; ++i) {
-        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
+        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
+        VirtQueueElement *elem;
+
+        if (!svq_elem) {
+            continue;
+        }
+
+        elem = &svq_elem->elem;
+
+        if (svq->iova_map) {
+            vhost_shadow_vq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
+                                         elem->in_num);
+            vhost_shadow_vq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
+                                         elem->out_num);
+        }
+
         /*
          * Although the doc says we must unpop in order, it's ok to unpop
          * everything.
          */
-        if (elem) {
-            virtqueue_unpop(svq->vq, elem, elem->len);
-        }
+        virtqueue_unpop(svq->vq, elem, elem->len);
     }
 }
 
@@ -504,11 +607,16 @@ VhostShadowVirtqueue *vhost_shadow_vq_new(struct vhost_dev *dev, int idx)
     memset(svq->vring.desc, 0, driver_size);
     svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
     memset(svq->vring.used, 0, device_size);
+
+    if (vhost_has_limited_iova_range(dev)) {
+        svq->iova_map = &dev->iova_map;
+    }
+
     for (i = 0; i < num - 1; i++) {
         svq->vring.desc[i].next = cpu_to_le16(i + 1);
     }
 
-    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
+    svq->ring_id_maps = g_new0(SVQElement *, num);
     event_notifier_set_handler(&svq->call_notifier,
                                vhost_shadow_vq_handle_call);
     return g_steal_pointer(&svq);
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 925d2146a4..4339b899ea 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1306,7 +1306,13 @@ static bool vhost_shadow_vq_start_store_sections(struct vhost_dev *dev)
             .perm = VHOST_ACCESS_RW,
         };
 
-        VhostDMAMapNewRC r = vhost_iova_tree_insert(&dev->iova_map, &region);
+        VhostDMAMapNewRC r;
+
+        if (vhost_has_limited_iova_range(dev)) {
+            r = vhost_iova_tree_alloc(&dev->iova_map, &region);
+        } else {
+            r = vhost_iova_tree_insert(&dev->iova_map, &region);
+        }
         assert(r == VHOST_DMA_MAP_OK);
     }
 
@@ -1351,11 +1357,24 @@ static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
         .perm = VHOST_ACCESS_RW,
     };
 
-    r = vhost_iova_tree_insert(&dev->iova_map, &driver_region);
-    assert(r == VHOST_DMA_MAP_OK);
+    if (vhost_has_limited_iova_range(dev)) {
+        r = vhost_iova_tree_alloc(&dev->iova_map, &driver_region);
+        assert(r == VHOST_DMA_MAP_OK);
+
+        r = vhost_iova_tree_alloc(&dev->iova_map, &device_region);
+        assert(r == VHOST_DMA_MAP_OK);
 
-    r = vhost_iova_tree_insert(&dev->iova_map, &device_region);
-    assert(r == VHOST_DMA_MAP_OK);
+        addr.avail_user_addr = driver_region.iova + addr.avail_user_addr
+                               - addr.desc_user_addr;
+        addr.desc_user_addr = driver_region.iova;
+        addr.used_user_addr = device_region.iova;
+    } else {
+        r = vhost_iova_tree_insert(&dev->iova_map, &driver_region);
+        assert(r == VHOST_DMA_MAP_OK);
+
+        r = vhost_iova_tree_insert(&dev->iova_map, &device_region);
+        assert(r == VHOST_DMA_MAP_OK);
+    }
 
     vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
     ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
-- 
2.27.0




* [RFC v3 26/29] vhost: Map in vdpa-dev
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (24 preceding siblings ...)
  2021-05-19 16:28 ` [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ Eugenio Pérez
@ 2021-05-19 16:29 ` Eugenio Pérez
  2021-05-19 16:29 ` [RFC v3 27/29] vhost-vdpa: Implement vhost_vdpa_vring_pause operation Eugenio Pérez
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Use and export vhost-vdpa functions directly. In the final version,
these methods need to be exposed through VhostOps, or the vhost-vdpa
backend needs to be adapted to work with vhost_send_device_iotlb_msg
in case its custom iommu is disabled.
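
A hypothetical sketch of the first alternative, with a VhostOps
callback that does not exist in this series (name and signature are
illustrative only):

    /* Hypothetical member added to VhostOps in vhost-backend.h */
    typedef int (*vhost_dma_map_op)(struct vhost_dev *dev, hwaddr iova,
                                    hwaddr size, void *vaddr, bool readonly);

    /* Callers would then map through the backend, not vdpa code directly */
    static int vhost_backend_dma_map(struct vhost_dev *dev, hwaddr iova,
                                     hwaddr size, void *vaddr, bool readonly)
    {
        if (dev->vhost_ops->vhost_dma_map) {
            return dev->vhost_ops->vhost_dma_map(dev, iova, size, vaddr,
                                                 readonly);
        }
        return -ENODEV;
    }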

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-backend.h |  4 ++++
 hw/virtio/vhost-vdpa.c            |  2 +-
 hw/virtio/vhost.c                 | 18 ++++++++++++++++++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index f8eed2ace5..9d88074e4d 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -203,4 +203,8 @@ int vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
 
 int vhost_user_gpu_set_socket(struct vhost_dev *dev, int fd);
 
+struct vhost_vdpa;
+int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
+                              void *vaddr, bool readonly);
+
 #endif /* VHOST_BACKEND_H */
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 9e7a0ce5e0..c742e6944e 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -36,7 +36,7 @@ static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
            section->offset_within_address_space & (1ULL << 63);
 }
 
-static int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
+int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
                               void *vaddr, bool readonly)
 {
     struct vhost_msg_v2 msg = {};
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 4339b899ea..286863ad42 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1314,9 +1314,19 @@ static bool vhost_shadow_vq_start_store_sections(struct vhost_dev *dev)
             r = vhost_iova_tree_insert(&dev->iova_map, &region);
         }
         assert(r == VHOST_DMA_MAP_OK);
+        r = vhost_vdpa_dma_map(dev->opaque, region.iova, region_size,
+                               (void *)dev->mem->regions[idx].userspace_addr,
+                               false);
+        if (r != 0) {
+            goto fail;
+        }
     }
 
     return true;
+
+fail:
+    assert(0);
+    return false;
 }
 
 /*
@@ -1377,6 +1387,14 @@ static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
     }
 
     vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
+    /* TODO: Why can't this be made read only? */
+    r = vhost_vdpa_dma_map(dev->opaque, addr.desc_user_addr, driver_region.size,
+                           (void *)driver_region.translated_addr, false);
+    assert(r == 0);
+    r = vhost_vdpa_dma_map(dev->opaque, addr.used_user_addr, device_region.size,
+                           (void *)device_region.translated_addr, false);
+    assert(r == 0);
+
     ok = vhost_shadow_vq_start(dev, idx, dev->shadow_vqs[idx]);
     if (unlikely(!ok)) {
         return false;
-- 
2.27.0




* [RFC v3 27/29] vhost-vdpa: Implement vhost_vdpa_vring_pause operation
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (25 preceding siblings ...)
  2021-05-19 16:29 ` [RFC v3 26/29] vhost: Map in vdpa-dev Eugenio Pérez
@ 2021-05-19 16:29 ` Eugenio Pérez
  2021-05-19 16:29 ` [RFC v3 28/29] vhost-vdpa: never map with vDPA listener Eugenio Pérez
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

This uses the status bit DEVICE_STOPPED, which is currently under
discussion in VirtIO and is implemented in qemu VirtIO-net devices in
previous commits.

Removal of _S_DEVICE_STOPPED can be done in the future if a use case
arises.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index c742e6944e..dfb465be96 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -618,6 +618,19 @@ static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
     return ret;
 }
 
+static int vhost_vdpa_vring_pause(struct vhost_dev *dev)
+{
+    int r;
+    uint8_t status;
+
+    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DEVICE_STOPPED);
+    do {
+        r = vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
+    } while (r == 0 && !(status & VIRTIO_CONFIG_S_DEVICE_STOPPED));
+
+    return 0;
+}
+
 const VhostOps vdpa_ops = {
         .backend_type = VHOST_BACKEND_TYPE_VDPA,
         .vhost_backend_init = vhost_vdpa_init,
@@ -650,6 +663,7 @@ const VhostOps vdpa_ops = {
         .vhost_get_device_id = vhost_vdpa_get_device_id,
         .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
         .vhost_force_iommu = vhost_vdpa_force_iommu,
+        .vhost_vring_pause = vhost_vdpa_vring_pause,
         .vhost_enable_custom_iommu = vhost_vdpa_enable_custom_iommu,
         .vhost_get_iova_range = vhost_vdpa_get_iova_range,
 };
-- 
2.27.0




* [RFC v3 28/29] vhost-vdpa: never map with vDPA listener
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (26 preceding siblings ...)
  2021-05-19 16:29 ` [RFC v3 27/29] vhost-vdpa: Implement vhost_vdpa_vring_pause operation Eugenio Pérez
@ 2021-05-19 16:29 ` Eugenio Pérez
  2021-05-19 16:29 ` [RFC v3 29/29] vhost: Start vhost-vdpa SVQ directly Eugenio Pérez
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

This commit is a workaround that will not go into the final version.

vp_vdpa is not able to reset all IOTLBs, so we avoid mapping them in
the first place.

checkpatch detects a few errors because of the #if 0 / #endif pairs,
but it's the least intrusive way to comment out all the code we want
to skip. Since this commit is not intended for the final series, I
left it that way.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h | 2 +-
 hw/virtio/vhost-vdpa.c         | 8 +++++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 9b81a409da..06afe42ab6 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -17,7 +17,7 @@
 typedef struct vhost_vdpa {
     int device_fd;
     uint32_t msg_type;
-    MemoryListener listener;
+    /* MemoryListener listener; */
     struct vhost_dev *dev;
 } VhostVDPA;
 
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index dfb465be96..30e4e306fb 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -23,6 +23,7 @@
 #include "trace.h"
 #include "qemu-common.h"
 
+#if 0
 static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -35,6 +36,7 @@ static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
             */
            section->offset_within_address_space & (1ULL << 63);
 }
+#endif
 
 int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
                               void *vaddr, bool readonly)
@@ -62,6 +64,7 @@ int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
     return ret;
 }
 
+#if 0
 static int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, hwaddr iova,
                                 hwaddr size)
 {
@@ -246,6 +249,7 @@ static const MemoryListener vhost_vdpa_memory_listener = {
     .region_add = vhost_vdpa_listener_region_add,
     .region_del = vhost_vdpa_listener_region_del,
 };
+#endif
 
 static int vhost_vdpa_call(struct vhost_dev *dev, unsigned long int request,
                              void *arg)
@@ -274,6 +278,7 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
 
 static int vhost_vdpa_enable_custom_iommu(struct vhost_dev *dev, bool enable)
 {
+#if 0
     struct vhost_vdpa *v = dev->opaque;
     hwaddr iova_range_last = dev->iova_range.last;
     if (iova_range_last != (hwaddr)-1) {
@@ -291,6 +296,7 @@ static int vhost_vdpa_enable_custom_iommu(struct vhost_dev *dev, bool enable)
         memory_listener_unregister(&v->listener);
         return vhost_vdpa_dma_unmap(v, dev->iova_range.first, iova_range_last);
     }
+#endif
 
     return 0;
 }
@@ -307,7 +313,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque)
     dev->opaque =  opaque ;
     vhost_vdpa_call(dev, VHOST_GET_FEATURES, &features);
     dev->backend_features = features;
-    v->listener = vhost_vdpa_memory_listener;
+    /* v->listener = vhost_vdpa_memory_listener; */
     v->msg_type = VHOST_IOTLB_MSG_V2;
 
     vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
-- 
2.27.0




* [RFC v3 29/29] vhost: Start vhost-vdpa SVQ directly
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (27 preceding siblings ...)
  2021-05-19 16:29 ` [RFC v3 28/29] vhost-vdpa: never map with vDPA listener Eugenio Pérez
@ 2021-05-19 16:29 ` Eugenio Pérez
  2021-05-24  9:38 ` [RFC v3 00/29] vDPA software assisted live migration Michael S. Tsirkin
  2021-06-02  9:59 ` Jason Wang
  30 siblings, 0 replies; 67+ messages in thread
From: Eugenio Pérez @ 2021-05-19 16:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella

Since it does not make sense to keep a non-working vdpa device, start
directly in SVQ mode.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 286863ad42..fd812e1a80 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1386,7 +1386,6 @@ static bool vhost_sw_live_migration_start_vq(struct vhost_dev *dev,
         assert(r == VHOST_DMA_MAP_OK);
     }
 
-    vhost_virtqueue_stop(dev, dev->vdev, &dev->vqs[idx], dev->vq_index + idx);
     /* TODO: Why can't this be made read only? */
     r = vhost_vdpa_dma_map(dev->opaque, addr.desc_user_addr, driver_region.size,
                            (void *)driver_region.translated_addr, false);
@@ -1467,6 +1466,11 @@ static int vhost_sw_live_migration_start(struct vhost_dev *dev)
 
     /* Can be read by vhost_virtqueue_mask, from vm exit */
     dev->shadow_vqs_enabled = true;
+
+    /* Reset device, so SVQ can assign its address */
+    r = dev->vhost_ops->vhost_dev_start(dev, false);
+    assert(r == 0);
+
     for (idx = 0; idx < dev->nvqs; ++idx) {
         bool ok = vhost_sw_live_migration_start_vq(dev, idx);
         if (unlikely(!ok)) {
@@ -2107,6 +2111,8 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
             vhost_device_iotlb_miss(hdev, vq->used_phys, true);
         }
     }
+
+    vhost_sw_live_migration_start(hdev);
     return 0;
 fail_log:
     vhost_log_put(hdev, false);
-- 
2.27.0




* Re: [RFC v3 04/29] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-05-19 16:28 ` [RFC v3 04/29] vhost: Add x-vhost-enable-shadow-vq qmp Eugenio Pérez
@ 2021-05-21  7:05   ` Markus Armbruster
  2021-05-24  7:13     ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Markus Armbruster @ 2021-05-21  7:05 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-devel, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Michael Lilja, Stefano Garzarella

Eugenio Pérez <eperezma@redhat.com> writes:

> Command to enable shadow virtqueue looks like:
>
> { "execute": "x-vhost-enable-shadow-vq",
>   "arguments": { "name": "dev0", "enable": true } }
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  qapi/net.json     | 22 ++++++++++++++++++++++
>  hw/virtio/vhost.c |  6 ++++++
>  2 files changed, 28 insertions(+)
>
> diff --git a/qapi/net.json b/qapi/net.json
> index c31748c87f..660feafdd2 100644
> --- a/qapi/net.json
> +++ b/qapi/net.json
> @@ -77,6 +77,28 @@
>  ##
>  { 'command': 'netdev_del', 'data': {'id': 'str'} }
>  
> +##
> +# @x-vhost-enable-shadow-vq:
> +#
> +# Use vhost shadow virtqueue.
> +#
> +# @name: the device name of the VirtIO device
> +#
> +# @enable: true to use he alternate shadow VQ notification path

Typo "he".

What's a "notification path", and why should I care?

Maybe

   # @enable: Enable alternate shadow VQ notification

> +#
> +# Returns: Error if failure, or 'no error' for success. Not found if vhost is not enabled.

This is confusing.  What do you mean by "Not found"?

If you mean DeviceNotFound:

1. Not actually true: qmp_x_vhost_enable_shadow_vq() always fails with
GenericError.  Perhaps later patches will change that.

2. Do you really need to distinguish "vhost is not enabled" from other
errors?

> +#
> +# Since: 6.1
> +#
> +# Example:
> +#
> +# -> { "execute": "x-vhost-enable-shadow-vq", "arguments": { "name": "virtio-net", "enable": false } }

Please break the long line, e.g. like this:

   # -> { "execute": "x-vhost-enable-shadow-vq",
   #      "arguments": { "name": "virtio-net", "enable": false } }

We normally show output in examples, too.

> +#
> +##
> +{ 'command': 'x-vhost-enable-shadow-vq',
> +  'data': {'name': 'str', 'enable': 'bool'},
> +  'if': 'defined(CONFIG_VHOST_KERNEL)' }
> +
>  ##
>  # @NetLegacyNicOptions:
>  #
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 40f9f64ebd..c4c1f80661 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -15,6 +15,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qapi/error.h"
> +#include "qapi/qapi-commands-net.h"
>  #include "hw/virtio/vhost.h"
>  #include "qemu/atomic.h"
>  #include "qemu/range.h"
> @@ -1831,3 +1832,8 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>  
>      return -1;
>  }
> +
> +void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> +{
> +    error_setg(errp, "Shadow virtqueue still not implemented");
> +}




* Re: [RFC v3 04/29] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-05-21  7:05   ` Markus Armbruster
@ 2021-05-24  7:13     ` Eugenio Perez Martin
  2021-06-08 14:23       ` Markus Armbruster
  0 siblings, 1 reply; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-05-24  7:13 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-level, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Michael Lilja, Stefano Garzarella

On Fri, May 21, 2021 at 9:05 AM Markus Armbruster <armbru@redhat.com> wrote:
>
> Eugenio Pérez <eperezma@redhat.com> writes:
>
> > Command to enable shadow virtqueue looks like:
> >
> > { "execute": "x-vhost-enable-shadow-vq",
> >   "arguments": { "name": "dev0", "enable": true } }
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  qapi/net.json     | 22 ++++++++++++++++++++++
> >  hw/virtio/vhost.c |  6 ++++++
> >  2 files changed, 28 insertions(+)
> >
> > diff --git a/qapi/net.json b/qapi/net.json
> > index c31748c87f..660feafdd2 100644
> > --- a/qapi/net.json
> > +++ b/qapi/net.json
> > @@ -77,6 +77,28 @@
> >  ##
> >  { 'command': 'netdev_del', 'data': {'id': 'str'} }
> >
> > +##
> > +# @x-vhost-enable-shadow-vq:
> > +#
> > +# Use vhost shadow virtqueue.
> > +#
> > +# @name: the device name of the VirtIO device
> > +#
> > +# @enable: true to use he alternate shadow VQ notification path
>
> Typo "he".
>

Thanks, I will fix it!

> What's a "notification path", and why should I care?
>
> Maybe
>
>    # @enable: Enable alternate shadow VQ notification
>

Your description is more accurate at some points of the series, so I will fix it.

> > +#
> > +# Returns: Error if failure, or 'no error' for success. Not found if vhost is not enabled.
>
> This is confusing.  What do you mean by "Not found"?
>
> If you mean DeviceNotFound:
>
> 1. Not actually true: qmp_x_vhost_enable_shadow_vq() always fails with
> GenericError.  Perhaps later patches will change that.
>

Right, I left the documentation in an intermediate state. At this
point it will always return failure, and in future patches it depends
on some conditions, as you may have seen.

If I carry the QMP command to future series, I will update the doc
accordingly in every commit.

> 2. Do you really need to distinguish "vhost is not enabled" from other
> errors?
>

SVQ cannot work if the device backend is not vhost, like the qemu
VirtIO device. What I meant is that "qemu will only look for its name
in the set of vhost devices, so you will have a device not found if
the device is not a vhost one", which may not be 100% clear at first
glance. Maybe this wording is better?

> > +#
> > +# Since: 6.1
> > +#
> > +# Example:
> > +#
> > +# -> { "execute": "x-vhost-enable-shadow-vq", "arguments": { "name": "virtio-net", "enable": false } }
>
> Please break the long line, e.g. like this:
>
>    # -> { "execute": "x-vhost-enable-shadow-vq",
>    #      "arguments": { "name": "virtio-net", "enable": false } }
>
> We normally show output in examples, too.
>

Ok, I will fix both issues.

Thanks!

> > +#
> > +##
> > +{ 'command': 'x-vhost-enable-shadow-vq',
> > +  'data': {'name': 'str', 'enable': 'bool'},
> > +  'if': 'defined(CONFIG_VHOST_KERNEL)' }
> > +
> >  ##
> >  # @NetLegacyNicOptions:
> >  #
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index 40f9f64ebd..c4c1f80661 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -15,6 +15,7 @@
> >
> >  #include "qemu/osdep.h"
> >  #include "qapi/error.h"
> > +#include "qapi/qapi-commands-net.h"
> >  #include "hw/virtio/vhost.h"
> >  #include "qemu/atomic.h"
> >  #include "qemu/range.h"
> > @@ -1831,3 +1832,8 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
> >
> >      return -1;
> >  }
> > +
> > +void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> > +{
> > +    error_setg(errp, "Shadow virtqueue still not implemented");
> > +}
>




* Re: [RFC v3 00/29] vDPA software assisted live migration
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (28 preceding siblings ...)
  2021-05-19 16:29 ` [RFC v3 29/29] vhost: Start vhost-vdpa SVQ directly Eugenio Pérez
@ 2021-05-24  9:38 ` Michael S. Tsirkin
  2021-05-24 10:37   ` Eugenio Perez Martin
  2021-06-02  9:59 ` Jason Wang
  30 siblings, 1 reply; 67+ messages in thread
From: Michael S. Tsirkin @ 2021-05-24  9:38 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Markus Armbruster,
	qemu-devel, Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi,
	Eli Cohen, virtualization, Michael Lilja, Stefano Garzarella

On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> Commit 17 introduces the buffer forwarding. Previous one are for
> preparations again, and laters are for enabling some obvious
> optimizations. However, it needs the vdpa device to be able to map
> every IOVA space, and some vDPA devices are not able to do so. Checking
> of this is added in previous commits.

That might become a significant limitation. And it worries me that
this is such a big patchset which might yet take a while to get
finalized.

I have an idea: how about as a first step we implement a transparent
switch from vdpa to a software virtio in QEMU or a software vhost in
kernel?

This will give us live migration quickly with performance comparable
to failover but without dependence on guest cooperation.

Next step could be driving vdpa from userspace while still copying
packets to a pre-registered buffer.

Finally your approach will be a performance optimization for devices
that support arbitrary IOVA.

Thoughts?

-- 
MST




* Re: [RFC v3 00/29] vDPA software assisted live migration
  2021-05-24  9:38 ` [RFC v3 00/29] vDPA software assisted live migration Michael S. Tsirkin
@ 2021-05-24 10:37   ` Eugenio Perez Martin
  2021-05-24 11:29     ` Michael S. Tsirkin
  2021-05-25  0:09     ` Jason Wang
  0 siblings, 2 replies; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-05-24 10:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Markus Armbruster,
	qemu-level, Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi,
	Eli Cohen, virtualization, Michael Lilja, Stefano Garzarella

On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> > Commit 17 introduces the buffer forwarding. Previous one are for
> > preparations again, and laters are for enabling some obvious
> > optimizations. However, it needs the vdpa device to be able to map
> > every IOVA space, and some vDPA devices are not able to do so. Checking
> > of this is added in previous commits.
>
> That might become a significant limitation. And it worries me that
> this is such a big patchset which might yet take a while to get
> finalized.
>

Sorry, maybe I've been unclear here: Later commits in this series
address this limitation. It is still not perfect: for example, it does
not support adding or removing guest memory at the moment, but this
should be easy to implement on top.

The main issue I'm observing comes from the kernel, if I'm not wrong:
if I unmap every address, I cannot re-map them afterwards. But the
code in this patchset is mostly final, except for the comments that
may arise on the mailing list, of course.

> I have an idea: how about as a first step we implement a transparent
> switch from vdpa to a software virtio in QEMU or a software vhost in
> kernel?
>
> This will give us live migration quickly with performance comparable
> to failover but without dependence on guest cooperation.
>

I think it should be doable. I'm not sure about the effort that needs
to be done in qemu to hide these "hypervisor-failover devices" from
the guest's view, but it should be comparable to failover, as you say.

Networking should be ok by its nature, although it could require care
in the host hardware setup. But I'm not sure how other types of
vhost/vdpa devices may work that way. How would a disk/scsi device
switch modes? Can the kernel take control of the vdpa device through
vhost, and just start reporting with a dirty bitmap?

Thanks!

> Next step could be driving vdpa from userspace while still copying
> packets to a pre-registered buffer.
>
> Finally your approach will be a performance optimization for devices
> that support arbitrary IOVA.
>
> Thoughts?
>
> --
> MST
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 00/29] vDPA software assisted live migration
  2021-05-24 10:37   ` Eugenio Perez Martin
@ 2021-05-24 11:29     ` Michael S. Tsirkin
  2021-07-19 14:13       ` Stefan Hajnoczi
  2021-05-25  0:09     ` Jason Wang
  1 sibling, 1 reply; 67+ messages in thread
From: Michael S. Tsirkin @ 2021-05-24 11:29 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Markus Armbruster,
	qemu-level, Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi,
	Eli Cohen, virtualization, Michael Lilja, Stefano Garzarella

On Mon, May 24, 2021 at 12:37:48PM +0200, Eugenio Perez Martin wrote:
> On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> > > Commit 17 introduces the buffer forwarding. Previous one are for
> > > preparations again, and laters are for enabling some obvious
> > > optimizations. However, it needs the vdpa device to be able to map
> > > every IOVA space, and some vDPA devices are not able to do so. Checking
> > > of this is added in previous commits.
> >
> > That might become a significant limitation. And it worries me that
> > this is such a big patchset which might yet take a while to get
> > finalized.
> >
> 
> Sorry, maybe I've been unclear here: Later commits in this series
> address this limitation. It is still not perfect: for example, it does
> not support adding or removing guest memory at the moment, but this
> should be easy to implement on top.
>
> The main issue I'm observing comes from the kernel, if I'm not wrong:
> if I unmap every address, I cannot re-map them afterwards. But the
> code in this patchset is mostly final, except for the comments that
> may arise on the mailing list, of course.
> 
> > I have an idea: how about as a first step we implement a transparent
> > switch from vdpa to a software virtio in QEMU or a software vhost in
> > kernel?
> >
> > This will give us live migration quickly with performance comparable
> > to failover but without dependence on guest cooperation.
> >
> 
> I think it should be doable. I'm not sure about the effort that needs
> to be done in qemu to hide these "hypervisor-failover devices" from
> the guest's view, but it should be comparable to failover, as you say.
> 
> Networking should be ok by its nature, although it could require care
> in the host hardware setup. But I'm not sure how other types of
> vhost/vdpa devices may work that way. How would a disk/scsi device
> switch modes? Can the kernel take control of the vdpa device through
> vhost, and just start reporting with a dirty bitmap?
> 
> Thanks!

It depends of course; e.g. blk is mostly reads/writes, so there is
not a lot of state. Just don't reorder or drop requests.

> > Next step could be driving vdpa from userspace while still copying
> > packets to a pre-registered buffer.
> >
> > Finally your approach will be a performance optimization for devices
> > that support arbitrary IOVA.
> >
> > Thoughts?
> >
> > --
> > MST
> >



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 00/29] vDPA software assisted live migration
  2021-05-24 10:37   ` Eugenio Perez Martin
  2021-05-24 11:29     ` Michael S. Tsirkin
@ 2021-05-25  0:09     ` Jason Wang
  1 sibling, 0 replies; 67+ messages in thread
From: Jason Wang @ 2021-05-25  0:09 UTC (permalink / raw)
  To: Eugenio Perez Martin, Michael S. Tsirkin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster, qemu-level,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	virtualization, Michael Lilja, Stefano Garzarella


On 2021/5/24 6:37 PM, Eugenio Perez Martin wrote:
> On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
>>> Commit 17 introduces the buffer forwarding. Previous one are for
>>> preparations again, and laters are for enabling some obvious
>>> optimizations. However, it needs the vdpa device to be able to map
>>> every IOVA space, and some vDPA devices are not able to do so. Checking
>>> of this is added in previous commits.
>> That might become a significant limitation. And it worries me that
>> this is such a big patchset which might yet take a while to get
>> finalized.
>>
> Sorry, maybe I've been unclear here: Later commits in this series
> address this limitation. It is still not perfect: for example, it does
> not support adding or removing guest memory at the moment, but this
> should be easy to implement on top.
>
> The main issue I'm observing comes from the kernel, if I'm not wrong:
> if I unmap every address, I cannot re-map them afterwards.


This looks like a bug.

Does this happen only on some specific device (e.g. vp_vdpa) or is it a
general issue of vhost-vdpa?


>   But the code in this
> patchset is mostly final, except for the comments that may arise on
> the mailing list, of course.
>
>> I have an idea: how about as a first step we implement a transparent
>> switch from vdpa to a software virtio in QEMU or a software vhost in
>> kernel?
>>
>> This will give us live migration quickly with performance comparable
>> to failover but without dependence on guest cooperation.
>>
> I think it should be doable. I'm not sure about the effort that needs
> to be done in qemu to hide these "hypervisor-failover devices" from
> the guest's view, but it should be comparable to failover, as you say.


Yes, if we want to switch, I'd go with a fallback to the vhost-vdpa
network backend instead.

Thanks


>
> Networking should be ok by its nature, although it could require care
> in the host hardware setup. But I'm not sure how other types of
> vhost/vdpa devices may work that way. How would a disk/scsi device
> switch modes? Can the kernel take control of the vdpa device through
> vhost, and just start reporting with a dirty bitmap?
>
> Thanks!
>
>> Next step could be driving vdpa from userspace while still copying
>> packets to a pre-registered buffer.
>>
>> Finally your approach will be a performance optimization for devices
>> that support arbitrary IOVA.
>>
>> Thoughts?
>>
>> --
>> MST
>>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
  2021-05-19 16:28 ` [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED Eugenio Pérez
@ 2021-05-26  1:06   ` Jason Wang
  2021-05-26  1:10     ` Jason Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-05-26  1:06 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella


On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> So the guest can stop and start the net device. It implements the RFC
> https://lists.oasis-open.org/archives/virtio-comment/202012/msg00027.html
>
> Stopping (as in "pause") the device is required to migrate status and
> vring addresses between device and SVQ.
>
> This is a WIP commit: as with VIRTIO_F_QUEUE_STATE, it is introduced in
> virtio_config.h before even being proposed for the kernel, with no
> feature flag and with no checking in the device. It also needs a
> modified vp_vdpa driver that supports setting and retrieving the status.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   include/standard-headers/linux/virtio_config.h | 2 ++
>   hw/net/virtio-net.c                            | 4 +++-
>   2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
> index 59fad3eb45..b3f6b1365d 100644
> --- a/include/standard-headers/linux/virtio_config.h
> +++ b/include/standard-headers/linux/virtio_config.h
> @@ -40,6 +40,8 @@
>   #define VIRTIO_CONFIG_S_DRIVER_OK	4
>   /* Driver has finished configuring features */
>   #define VIRTIO_CONFIG_S_FEATURES_OK	8
> +/* Device is stopped */
> +#define VIRTIO_CONFIG_S_DEVICE_STOPPED 32
>   /* Device entered invalid state, driver must reset it */
>   #define VIRTIO_CONFIG_S_NEEDS_RESET	0x40
>   /* We've given up on this device. */
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 96a3cc8357..2d3caea289 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -198,7 +198,9 @@ static bool virtio_net_started(VirtIONet *n, uint8_t status)
>   {
>       VirtIODevice *vdev = VIRTIO_DEVICE(n);
>       return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
> -        (n->status & VIRTIO_NET_S_LINK_UP) && vdev->vm_running;
> +        (!(n->status & VIRTIO_CONFIG_S_DEVICE_STOPPED)) &&
> +        (n->status & VIRTIO_NET_S_LINK_UP) &&
> +        vdev->vm_running;
>   }
>   
>   static void virtio_net_announce_notify(VirtIONet *net)


It looks to me like this is only the pause part. We still need the resume?

Thanks




^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
  2021-05-26  1:06   ` Jason Wang
@ 2021-05-26  1:10     ` Jason Wang
  2021-06-01  7:13       ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-05-26  1:10 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella


On 2021/5/26 9:06 AM, Jason Wang wrote:
>
> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>> So the guest can stop and start the net device. It implements the RFC
>> https://lists.oasis-open.org/archives/virtio-comment/202012/msg00027.html
>>
>> Stopping (as in "pause") the device is required to migrate status and
>> vring addresses between device and SVQ.
>>
>> This is a WIP commit: as with VIRTIO_F_QUEUE_STATE, it is introduced in
>> virtio_config.h before even being proposed for the kernel, with no
>> feature flag and with no checking in the device. It also needs a
>> modified vp_vdpa driver that supports setting and retrieving the status.
>>
>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> ---
>>   include/standard-headers/linux/virtio_config.h | 2 ++
>>   hw/net/virtio-net.c                            | 4 +++-
>>   2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/standard-headers/linux/virtio_config.h 
>> b/include/standard-headers/linux/virtio_config.h
>> index 59fad3eb45..b3f6b1365d 100644
>> --- a/include/standard-headers/linux/virtio_config.h
>> +++ b/include/standard-headers/linux/virtio_config.h
>> @@ -40,6 +40,8 @@
>>   #define VIRTIO_CONFIG_S_DRIVER_OK    4
>>   /* Driver has finished configuring features */
>>   #define VIRTIO_CONFIG_S_FEATURES_OK    8
>> +/* Device is stopped */
>> +#define VIRTIO_CONFIG_S_DEVICE_STOPPED 32
>>   /* Device entered invalid state, driver must reset it */
>>   #define VIRTIO_CONFIG_S_NEEDS_RESET    0x40
>>   /* We've given up on this device. */
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 96a3cc8357..2d3caea289 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -198,7 +198,9 @@ static bool virtio_net_started(VirtIONet *n, 
>> uint8_t status)
>>   {
>>       VirtIODevice *vdev = VIRTIO_DEVICE(n);
>>       return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
>> -        (n->status & VIRTIO_NET_S_LINK_UP) && vdev->vm_running;
>> +        (!(n->status & VIRTIO_CONFIG_S_DEVICE_STOPPED)) &&
>> +        (n->status & VIRTIO_NET_S_LINK_UP) &&
>> +        vdev->vm_running;
>>   }
>>     static void virtio_net_announce_notify(VirtIONet *net)
>
>
> It looks to me like this is only the pause part.


And even for pause, I don't see anything that prevents rx/tx from being
executed? (E.g. virtio_net_handle_tx_bh or virtio_net_handle_rx).

Thanks


> We still need the resume?
>
> Thanks
>
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 13/29] vhost: Add vhost_get_iova_range operation
  2021-05-19 16:28 ` [RFC v3 13/29] vhost: Add vhost_get_iova_range operation Eugenio Pérez
@ 2021-05-26  1:14   ` Jason Wang
  2021-05-26 17:49     ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-05-26  1:14 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella


On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> For simplicity, if a device does not support this operation, it means
> that it can handle the full (uint64_t)-1 iova address space.


Note that we probably need a separate patch for this.

And we need to do this during vhost-vdpa initialization. If GPA is out
of the range, we need to fail the start of vhost-vdpa.
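
A minimal sketch of that check (all names are illustrative only,
nothing here is from the series):

    /* Fail the vhost-vdpa start if a guest memory section's GPA range
     * does not fit in the device's usable IOVA window [first, last]. */
    static int vhost_vdpa_check_gpa_range(hwaddr first, hwaddr last,
                                          hwaddr gpa, hwaddr size)
    {
        if (gpa < first || gpa + size - 1 > last) {
            return -EINVAL;
        }
        return 0;
    }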

Thanks


>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   include/hw/virtio/vhost-backend.h |  5 +++++
>   hw/virtio/vhost-vdpa.c            | 18 ++++++++++++++++++
>   hw/virtio/trace-events            |  1 +
>   3 files changed, 24 insertions(+)
>
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index 94d3323905..bcb112c166 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -36,6 +36,7 @@ struct vhost_vring_addr;
>   struct vhost_scsi_target;
>   struct vhost_iotlb_msg;
>   struct vhost_virtqueue;
> +struct vhost_vdpa_iova_range;
>   
>   typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque);
>   typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
> @@ -127,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>   
>   typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
>   
> +typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
> +                                    hwaddr *first, hwaddr *last);
> +
>   typedef struct VhostOps {
>       VhostBackendType backend_type;
>       vhost_backend_init vhost_backend_init;
> @@ -173,6 +177,7 @@ typedef struct VhostOps {
>       vhost_get_device_id_op vhost_get_device_id;
>       vhost_vring_pause_op vhost_vring_pause;
>       vhost_force_iommu_op vhost_force_iommu;
> +    vhost_get_iova_range vhost_get_iova_range;
>   } VhostOps;
>   
>   extern const VhostOps user_ops;
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 01d2101d09..74fe92935e 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -579,6 +579,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
>       return true;
>   }
>   
> +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> +                                     hwaddr *first, hwaddr *last)
> +{
> +    int ret;
> +    struct vhost_vdpa_iova_range range;
> +
> +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> +    if (ret != 0) {
> +        return ret;
> +    }
> +
> +    *first = range.first;
> +    *last = range.last;
> +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> +    return ret;
> +}
> +
>   const VhostOps vdpa_ops = {
>           .backend_type = VHOST_BACKEND_TYPE_VDPA,
>           .vhost_backend_init = vhost_vdpa_init,
> @@ -611,4 +628,5 @@ const VhostOps vdpa_ops = {
>           .vhost_get_device_id = vhost_vdpa_get_device_id,
>           .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
>           .vhost_force_iommu = vhost_vdpa_force_iommu,
> +        .vhost_get_iova_range = vhost_vdpa_get_iova_range,
>   };
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index c62727f879..5debe3a681 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
>   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
>   vhost_vdpa_set_owner(void *dev) "dev: %p"
>   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
>   
>   # virtio.c
>   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 13/29] vhost: Add vhost_get_iova_range operation
  2021-05-26  1:14   ` Jason Wang
@ 2021-05-26 17:49     ` Eugenio Perez Martin
  2021-05-27  4:51       ` Jason Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-05-26 17:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella

On Wed, May 26, 2021 at 3:14 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> > For simplicity, if a device does not support this operation, it means
> > that it can handle the full (uint64_t)-1 iova address space.
>
>
> Note that we probably need a separate patch for this.
>

Actually the comment is not in the right commit, the next one is the
one that uses it. Is that what you mean?

> And we need to do this during vhost-vdpa initialization. If GPA is out
> of the range, we need to fail the start of vhost-vdpa.
>

Right, that is still to-do.

Maybe a series with just these two commits and failing the start if
GPA is not in the range, as you say, would help to split the amount of
changes.

I will send it if no more comments arise about it.

Thanks!

> Thanks
>
>
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   include/hw/virtio/vhost-backend.h |  5 +++++
> >   hw/virtio/vhost-vdpa.c            | 18 ++++++++++++++++++
> >   hw/virtio/trace-events            |  1 +
> >   3 files changed, 24 insertions(+)
> >
> > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > index 94d3323905..bcb112c166 100644
> > --- a/include/hw/virtio/vhost-backend.h
> > +++ b/include/hw/virtio/vhost-backend.h
> > @@ -36,6 +36,7 @@ struct vhost_vring_addr;
> >   struct vhost_scsi_target;
> >   struct vhost_iotlb_msg;
> >   struct vhost_virtqueue;
> > +struct vhost_vdpa_iova_range;
> >
> >   typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque);
> >   typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
> > @@ -127,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
> >
> >   typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
> >
> > +typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
> > +                                    hwaddr *first, hwaddr *last);
> > +
> >   typedef struct VhostOps {
> >       VhostBackendType backend_type;
> >       vhost_backend_init vhost_backend_init;
> > @@ -173,6 +177,7 @@ typedef struct VhostOps {
> >       vhost_get_device_id_op vhost_get_device_id;
> >       vhost_vring_pause_op vhost_vring_pause;
> >       vhost_force_iommu_op vhost_force_iommu;
> > +    vhost_get_iova_range vhost_get_iova_range;
> >   } VhostOps;
> >
> >   extern const VhostOps user_ops;
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 01d2101d09..74fe92935e 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -579,6 +579,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> >       return true;
> >   }
> >
> > +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> > +                                     hwaddr *first, hwaddr *last)
> > +{
> > +    int ret;
> > +    struct vhost_vdpa_iova_range range;
> > +
> > +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> > +    if (ret != 0) {
> > +        return ret;
> > +    }
> > +
> > +    *first = range.first;
> > +    *last = range.last;
> > +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> > +    return ret;
> > +}
> > +
> >   const VhostOps vdpa_ops = {
> >           .backend_type = VHOST_BACKEND_TYPE_VDPA,
> >           .vhost_backend_init = vhost_vdpa_init,
> > @@ -611,4 +628,5 @@ const VhostOps vdpa_ops = {
> >           .vhost_get_device_id = vhost_vdpa_get_device_id,
> >           .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
> >           .vhost_force_iommu = vhost_vdpa_force_iommu,
> > +        .vhost_get_iova_range = vhost_vdpa_get_iova_range,
> >   };
> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index c62727f879..5debe3a681 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> >   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> >   vhost_vdpa_set_owner(void *dev) "dev: %p"
> >   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> > +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
> >
> >   # virtio.c
> >   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 13/29] vhost: Add vhost_get_iova_range operation
  2021-05-26 17:49     ` Eugenio Perez Martin
@ 2021-05-27  4:51       ` Jason Wang
  2021-06-01  7:17         ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-05-27  4:51 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella


On 2021/5/27 1:49 AM, Eugenio Perez Martin wrote:
> On Wed, May 26, 2021 at 3:14 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>>> For simplicity, if a device does not support this operation, it means
>>> that it can handle the full (uint64_t)-1 iova address space.
>>
>> Note that we probably need a separate patch for this.
>>
> Actually the comment is not in the right commit, the next one is the
> one that uses it. Is that what you mean?


No, it's about the following suggestions.


>
>> And we need to do this during vhost-vdpa initialization. If GPA is out
>> of the range, we need to fail the start of vhost-vdpa.


Note that this is for the non-IOMMU case. For the vIOMMU case we probably
need to validate it against the address width or other similar attributes.
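
Something along these lines, as a sketch (aw_bits stands in for
whatever width attribute the vIOMMU exposes, e.g. intel-iommu's
aw-bits property):

    /* The vIOMMU can produce IOVAs up to (1 << aw_bits) - 1; all of
     * them must fit in the device's usable [first, last] range. */
    hwaddr iommu_max = (1ULL << aw_bits) - 1;
    if (iommu_max > last) {
        /* fail: the device cannot handle every IOVA the vIOMMU may use */
    }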

Thanks


>>
> Right, that is still to-do.
>
> Maybe a series with just these two commits and failing the start if
> GPA is not in the range, as you say, would help to split the amount of
> changes.
>
> I will send it if no more comments arise about it.
>
> Thanks!
>
>> Thanks
>>
>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    include/hw/virtio/vhost-backend.h |  5 +++++
>>>    hw/virtio/vhost-vdpa.c            | 18 ++++++++++++++++++
>>>    hw/virtio/trace-events            |  1 +
>>>    3 files changed, 24 insertions(+)
>>>
>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>> index 94d3323905..bcb112c166 100644
>>> --- a/include/hw/virtio/vhost-backend.h
>>> +++ b/include/hw/virtio/vhost-backend.h
>>> @@ -36,6 +36,7 @@ struct vhost_vring_addr;
>>>    struct vhost_scsi_target;
>>>    struct vhost_iotlb_msg;
>>>    struct vhost_virtqueue;
>>> +struct vhost_vdpa_iova_range;
>>>
>>>    typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque);
>>>    typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
>>> @@ -127,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>>>
>>>    typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
>>>
>>> +typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
>>> +                                    hwaddr *first, hwaddr *last);
>>> +
>>>    typedef struct VhostOps {
>>>        VhostBackendType backend_type;
>>>        vhost_backend_init vhost_backend_init;
>>> @@ -173,6 +177,7 @@ typedef struct VhostOps {
>>>        vhost_get_device_id_op vhost_get_device_id;
>>>        vhost_vring_pause_op vhost_vring_pause;
>>>        vhost_force_iommu_op vhost_force_iommu;
>>> +    vhost_get_iova_range vhost_get_iova_range;
>>>    } VhostOps;
>>>
>>>    extern const VhostOps user_ops;
>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>> index 01d2101d09..74fe92935e 100644
>>> --- a/hw/virtio/vhost-vdpa.c
>>> +++ b/hw/virtio/vhost-vdpa.c
>>> @@ -579,6 +579,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
>>>        return true;
>>>    }
>>>
>>> +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
>>> +                                     hwaddr *first, hwaddr *last)
>>> +{
>>> +    int ret;
>>> +    struct vhost_vdpa_iova_range range;
>>> +
>>> +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
>>> +    if (ret != 0) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    *first = range.first;
>>> +    *last = range.last;
>>> +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
>>> +    return ret;
>>> +}
>>> +
>>>    const VhostOps vdpa_ops = {
>>>            .backend_type = VHOST_BACKEND_TYPE_VDPA,
>>>            .vhost_backend_init = vhost_vdpa_init,
>>> @@ -611,4 +628,5 @@ const VhostOps vdpa_ops = {
>>>            .vhost_get_device_id = vhost_vdpa_get_device_id,
>>>            .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
>>>            .vhost_force_iommu = vhost_vdpa_force_iommu,
>>> +        .vhost_get_iova_range = vhost_vdpa_get_iova_range,
>>>    };
>>> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
>>> index c62727f879..5debe3a681 100644
>>> --- a/hw/virtio/trace-events
>>> +++ b/hw/virtio/trace-events
>>> @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
>>>    vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
>>>    vhost_vdpa_set_owner(void *dev) "dev: %p"
>>>    vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
>>> +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
>>>
>>>    # virtio.c
>>>    virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 15/29] vhost: Add enable_custom_iommu to VhostOps
  2021-05-19 16:28 ` [RFC v3 15/29] vhost: Add enable_custom_iommu to VhostOps Eugenio Pérez
@ 2021-05-31  9:01   ` Jason Wang
  2021-06-01  7:49     ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-05-31  9:01 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella


On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> This operation enables the backend-specific IOTLB entries.
>
> If a backend supports this, it starts managing its own entries, and
> vhost can disable it through this operation and recover control.
>
> Every enable/disable operation must also clear all IOTLB device entries.
>
> At the moment, the only backend that does so is vhost-vdpa. To fully
> support these, vdpa also needs to expose a way for the vhost subsystem
> to map and unmap entries. This will be done in future commits.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>


I think there's probably no need to introduce this helper.

Instead, we can introduce ops like shadow_vq_start()/stop(). Then
details like this could be hidden there.

(And hide the backend details, i.e. avoid calling vhost_vdpa_dma_map()
directly from vhost.c.)
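
Mirroring the typedef style the series already uses, the suggested ops
could look something like this (just a sketch; these names do not exist
anywhere yet):

    typedef int (*vhost_shadow_vq_start_op)(struct vhost_dev *dev);
    typedef int (*vhost_shadow_vq_stop_op)(struct vhost_dev *dev);

    typedef struct VhostOps {
        /* ... existing callbacks ... */
        vhost_shadow_vq_start_op vhost_shadow_vq_start;
        vhost_shadow_vq_stop_op vhost_shadow_vq_stop;
    } VhostOps;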

Thanks


> ---
>   include/hw/virtio/vhost-backend.h | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index bcb112c166..f8eed2ace5 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -128,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>   
>   typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
>   
> +typedef int (*vhost_enable_custom_iommu_op)(struct vhost_dev *dev,
> +                                            bool enable);
> +
>   typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
>                                       hwaddr *first, hwaddr *last);
>   
> @@ -177,6 +180,7 @@ typedef struct VhostOps {
>       vhost_get_device_id_op vhost_get_device_id;
>       vhost_vring_pause_op vhost_vring_pause;
>       vhost_force_iommu_op vhost_force_iommu;
> +    vhost_enable_custom_iommu_op vhost_enable_custom_iommu;
>       vhost_get_iova_range vhost_get_iova_range;
>   } VhostOps;
>   



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 21/29] vhost: Add VhostIOVATree
  2021-05-19 16:28 ` [RFC v3 21/29] vhost: Add VhostIOVATree Eugenio Pérez
@ 2021-05-31  9:40   ` Jason Wang
  2021-06-01  8:15     ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-05-31  9:40 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella


On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> This tree is able to look for a translated address from an IOVA address.
>
> At first glance it is similar to util/iova-tree. However, SVQ working on
> devices with limited IOVA space needs more capabilities, like allocating
> IOVA chunks or performing reverse translations (qemu addresses to iova).
>
> Starting a separate implementation. Knowing that insertions/deletions
> will not be as frequent as searches,


This might not be true if vIOMMU is enabled.


> it uses an ordered array in the
> implementation.


I wonder how much overhead g_array could have if it needs to grow.


>   A different name could be used, but ordered
> searchable array is a little bit long though.


Note that we had a very good example for this: the kernel iova
allocator, which is implemented via an rbtree.

Instead of figuring out g_array vs g_tree stuff, I would simply go with
g_tree first (based on util/iova-tree) and borrow the well-designed
kernel iova allocator API to have a generic IOVA one instead of coupling
it with vhost. It could be used by other userspace drivers in the future:

init_iova_domain()/put_iova_domain();

alloc_iova()/free_iova();

find_iova();
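
In QEMU terms that could look roughly like the following sketch (all
names are illustrative; nothing like this exists in the tree yet, and
DMAMap is the struct from util/iova-tree):

    typedef struct IOVADomain IOVADomain;

    /* Create/destroy an allocator covering the usable [first, last]. */
    IOVADomain *iova_domain_new(hwaddr first, hwaddr last);
    void iova_domain_free(IOVADomain *dom);

    /* Reserve a free range of @size bytes, returning its start IOVA. */
    int iova_alloc(IOVADomain *dom, hwaddr size, hwaddr *iova);
    void iova_free(IOVADomain *dom, hwaddr iova, hwaddr size);

    /* Look up the allocated mapping that contains @iova, or NULL. */
    const DMAMap *iova_find(const IOVADomain *dom, hwaddr iova);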

Another reference is the iova allocator that is implemented in VFIO.

Thanks


>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-iova-tree.h |  50 ++++++++++
>   hw/virtio/vhost-iova-tree.c | 188 ++++++++++++++++++++++++++++++++++++
>   hw/virtio/meson.build       |   2 +-
>   3 files changed, 239 insertions(+), 1 deletion(-)
>   create mode 100644 hw/virtio/vhost-iova-tree.h
>   create mode 100644 hw/virtio/vhost-iova-tree.c
>
> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> new file mode 100644
> index 0000000000..2a44af8b3a
> --- /dev/null
> +++ b/hw/virtio/vhost-iova-tree.h
> @@ -0,0 +1,50 @@
> +/*
> + * vhost software live migration ring
> + *
> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> +
> +#include <gmodule.h>
> +
> +#include "exec/memory.h"
> +
> +typedef struct VhostDMAMap {
> +    void *translated_addr;
> +    hwaddr iova;
> +    hwaddr size;                /* Inclusive */
> +    IOMMUAccessFlags perm;
> +} VhostDMAMap;
> +
> +typedef enum VhostDMAMapNewRC {
> +    VHOST_DMA_MAP_OVERLAP = -2,
> +    VHOST_DMA_MAP_INVALID = -1,
> +    VHOST_DMA_MAP_OK = 0,
> +} VhostDMAMapNewRC;
> +
> +/**
> + * VhostIOVATree
> + *
> + * Store and search IOVA -> Translated mappings.
> + *
> + * Note that it cannot remove nodes.
> + */
> +typedef struct VhostIOVATree {
> +    /* Ordered array of reverse translations, IOVA address to qemu memory. */
> +    GArray *iova_taddr_map;
> +} VhostIOVATree;
> +
> +void vhost_iova_tree_new(VhostIOVATree *iova_rm);
> +void vhost_iova_tree_destroy(VhostIOVATree *iova_rm);
> +
> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *iova_rm,
> +                                              const VhostDMAMap *map);
> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *iova_rm,
> +                                        VhostDMAMap *map);
> +
> +#endif
> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> new file mode 100644
> index 0000000000..dfd7e448b5
> --- /dev/null
> +++ b/hw/virtio/vhost-iova-tree.c
> @@ -0,0 +1,188 @@
> +/*
> + * vhost software live migration ring
> + *
> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "vhost-iova-tree.h"
> +
> +#define G_ARRAY_NOT_ZERO_TERMINATED false
> +#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
> +
> +/**
> + * Inserts an element after an existing one in garray.
> + *
> + * @array      The array
> + * @prev_elem  The previous element of array of NULL if prepending
> + * @map        The DMA map
> + *
> + * It provides the additional advantage of being type safe over
> + * g_array_insert_val, which accepts a reference pointer instead of a value
> + * without complaint.
> + */
> +static void vhost_iova_tree_insert_after(GArray *array,
> +                                         const VhostDMAMap *prev_elem,
> +                                         const VhostDMAMap *map)
> +{
> +    size_t pos;
> +
> +    if (!prev_elem) {
> +        pos = 0;
> +    } else {
> +        pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
> +    }
> +
> +    g_array_insert_val(array, pos, *map);
> +}
> +
> +static gint vhost_iova_tree_cmp_iova(gconstpointer a, gconstpointer b)
> +{
> +    const VhostDMAMap *m1 = a, *m2 = b;
> +
> +    if (m1->iova > m2->iova + m2->size) {
> +        return 1;
> +    }
> +
> +    if (m1->iova + m1->size < m2->iova) {
> +        return -1;
> +    }
> +
> +    /* Overlapped */
> +    return 0;
> +}
> +
> +/**
> + * Find the previous node to a given iova
> + *
> + * @array  The ascending ordered array of VhostDMAMap
> + * @map    The map to insert
> + * @prev   Returned location of the previous map
> + *
> + * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
> + * it already exists. It is ok to use this function to check if a given range
> + * exists, but it will use a linear search.
> + *
> + * TODO: We can use bsearch to locate the entry if we save the state in the
> + * needle, knowing that the needle is always the first argument to
> + * compare_func.
> + */
> +static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
> +                                                  GCompareFunc compare_func,
> +                                                  const VhostDMAMap *map,
> +                                                  const VhostDMAMap **prev)
> +{
> +    size_t i;
> +    int r;
> +
> +    *prev = NULL;
> +    for (i = 0; i < array->len; ++i) {
> +        r = compare_func(map, &g_array_index(array, typeof(*map), i));
> +        if (r == 0) {
> +            return VHOST_DMA_MAP_OVERLAP;
> +        }
> +        if (r < 0) {
> +            return VHOST_DMA_MAP_OK;
> +        }
> +
> +        *prev = &g_array_index(array, typeof(**prev), i);
> +    }
> +
> +    return VHOST_DMA_MAP_OK;
> +}
> +
> +/**
> + * Create a new IOVA tree
> + *
> + * @tree  The IOVA tree
> + */
> +void vhost_iova_tree_new(VhostIOVATree *tree)
> +{
> +    assert(tree);
> +
> +    tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
> +                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
> +                                       sizeof(VhostDMAMap));
> +}
> +
> +/**
> + * Destroy an IOVA tree
> + *
> + * @tree  The iova tree
> + */
> +void vhost_iova_tree_destroy(VhostIOVATree *tree)
> +{
> +    g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
> +}
> +
> +/**
> + * Perform a search on a GArray.
> + *
> + * @array Glib array
> + * @map Map to look up
> + * @compare_func Compare function to use
> + *
> + * Return The found element or NULL if not found.
> + *
> + * This can be replaced with g_array_binary_search (Since glib 2.62) when that
> + * is common enough.
> + */
> +static const VhostDMAMap *vhost_iova_tree_bsearch(const GArray *array,
> +                                                  const VhostDMAMap *map,
> +                                                  GCompareFunc compare_func)
> +{
> +    return bsearch(map, array->data, array->len, sizeof(*map), compare_func);
> +}
> +
> +/**
> + * Find the translated address stored from a IOVA address
> + *
> + * @tree  The iova tree
> + * @map   The map with the memory address
> + *
> + * Return the stored mapping, or NULL if not found.
> + */
> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
> +                                              const VhostDMAMap *map)
> +{
> +    return vhost_iova_tree_bsearch(tree->iova_taddr_map, map,
> +                                  vhost_iova_tree_cmp_iova);
> +}
> +
> +/**
> + * Insert a new map
> + *
> + * @tree  The iova tree
> + * @map   The iova map
> + *
> + * Returns:
> + * - VHOST_DMA_MAP_OK if the map fits in the container
> + * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
> + * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
> + * The assigned iova can be queried in map.
> + */
> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
> +                                        VhostDMAMap *map)
> +{
> +    const VhostDMAMap *prev;
> +    int find_prev_rc;
> +
> +    if (map->translated_addr + map->size < map->translated_addr ||
> +        map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
> +        return VHOST_DMA_MAP_INVALID;
> +    }
> +
> +    /* Check for duplicates, and save position for insertion */
> +    find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
> +                                             vhost_iova_tree_cmp_iova, map,
> +                                             &prev);
> +    if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
> +        return VHOST_DMA_MAP_OVERLAP;
> +    }
> +
> +    vhost_iova_tree_insert_after(tree->iova_taddr_map, prev, map);
> +    return VHOST_DMA_MAP_OK;
> +}
> diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> index 8b5a0225fe..cb306b83c6 100644
> --- a/hw/virtio/meson.build
> +++ b/hw/virtio/meson.build
> @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
>   
>   virtio_ss = ss.source_set()
>   virtio_ss.add(files('virtio.c'))
> -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
>   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
>   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
>   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
  2021-05-26  1:10     ` Jason Wang
@ 2021-06-01  7:13       ` Eugenio Perez Martin
  2021-06-03  3:12         ` Jason Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-06-01  7:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella

On Wed, May 26, 2021 at 3:10 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/5/26 9:06 AM, Jason Wang wrote:
> >
> > On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> >> So the guest can stop and start the net device. It implements the RFC
> >> https://lists.oasis-open.org/archives/virtio-comment/202012/msg00027.html
> >>
> >> Stopping (as in "pause") the device is required to migrate status and
> >> vring addresses between device and SVQ.
> >>
> >> This is a WIP commit: as with VIRTIO_F_QUEUE_STATE, it is introduced in
> >> virtio_config.h before even being proposed for the kernel, with no
> >> feature flag and with no checking in the device. It also needs a
> >> modified vp_vdpa driver that supports setting and retrieving the status.
> >>
> >> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >> ---
> >>   include/standard-headers/linux/virtio_config.h | 2 ++
> >>   hw/net/virtio-net.c                            | 4 +++-
> >>   2 files changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/standard-headers/linux/virtio_config.h
> >> b/include/standard-headers/linux/virtio_config.h
> >> index 59fad3eb45..b3f6b1365d 100644
> >> --- a/include/standard-headers/linux/virtio_config.h
> >> +++ b/include/standard-headers/linux/virtio_config.h
> >> @@ -40,6 +40,8 @@
> >>   #define VIRTIO_CONFIG_S_DRIVER_OK    4
> >>   /* Driver has finished configuring features */
> >>   #define VIRTIO_CONFIG_S_FEATURES_OK    8
> >> +/* Device is stopped */
> >> +#define VIRTIO_CONFIG_S_DEVICE_STOPPED 32
> >>   /* Device entered invalid state, driver must reset it */
> >>   #define VIRTIO_CONFIG_S_NEEDS_RESET    0x40
> >>   /* We've given up on this device. */
> >> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> >> index 96a3cc8357..2d3caea289 100644
> >> --- a/hw/net/virtio-net.c
> >> +++ b/hw/net/virtio-net.c
> >> @@ -198,7 +198,9 @@ static bool virtio_net_started(VirtIONet *n,
> >> uint8_t status)
> >>   {
> >>       VirtIODevice *vdev = VIRTIO_DEVICE(n);
> >>       return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
> >> -        (n->status & VIRTIO_NET_S_LINK_UP) && vdev->vm_running;
> >> +        (!(n->status & VIRTIO_CONFIG_S_DEVICE_STOPPED)) &&
> >> +        (n->status & VIRTIO_NET_S_LINK_UP) &&
> >> +        vdev->vm_running;
> >>   }
> >>     static void virtio_net_announce_notify(VirtIONet *net)
> >
> >
> > It looks to me like this is only the pause part.
>

For SVQ we need to switch vring addresses, and a full reset of the
device is required for that. Actually, the pause is just used to
recover that state.

If you prefer, this could be sent as a separate series where the full
pause/resume cycle is implemented, and then SVQ uses the pause part.
However, there is no use for the resume part at the moment.

>
> And even for pause, I don't see anything that prevents rx/tx from being
> executed? (E.g. virtio_net_handle_tx_bh or virtio_net_handle_rx).
>

virtio_net_started is called from virtio_net_set_status. If
_S_DEVICE_STOPPED is set, the former returns false, and the variable
queue_started is false in the latter:
  queue_started =
            virtio_net_started(n, queue_status) && !n->vhost_started;

After that, it should work like a regular device reset or link down if
I'm not wrong, and the last part of virtio_net_set_status should
delete the timer or cancel the bh.
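
For reference, the tail of virtio_net_set_status already does roughly
this per queue when queue_started is false (paraphrased from memory,
so double-check the actual code):

    if (q->tx_timer) {
        timer_del(q->tx_timer);
    } else {
        qemu_bh_cancel(q->tx_bh);
    }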

> Thanks
>
>
> > We still need the resume?
> >
> > Thanks
> >
> >
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 13/29] vhost: Add vhost_get_iova_range operation
  2021-05-27  4:51       ` Jason Wang
@ 2021-06-01  7:17         ` Eugenio Perez Martin
  2021-06-03  3:13           ` Jason Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-06-01  7:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella

On Thu, May 27, 2021 at 6:51 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/5/27 1:49 AM, Eugenio Perez Martin wrote:
> > On Wed, May 26, 2021 at 3:14 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> >>> For simplicity, if a device does not support this operation, it means
> >>> that it can handle the full (uint64_t)-1 iova address space.
> >>
> >> Note that we probably need a separate patch for this.
> >>
> > Actually the comment is not in the right commit, the next one is the
> > one that uses it. Is that what you mean?
>
>
> No, it's about the following suggestions.
>
>
> >
> >> And we need to do this during vhost-vdpa initialization. If GPA is out
> >> of the range, we need to fail the start of vhost-vdpa.
>
>
> Note that this is for the non-IOMMU case. For the vIOMMU case we probably
> need to validate it against the address width or other similar attributes.
>

Right.

What should qemu do if the memory of the guest gets expanded outside
of the range? I think there is no clean way to fail the memory
addition, is there?

> Thanks
>
>
> >>
> > Right, that is still to-do.
> >
> > Maybe a series with just these two commits and failing the start if
> > GPA is not in the range, as you say, would help to split the amount of
> > changes.
> >
> > I will send it if no more comments arise about it.
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    include/hw/virtio/vhost-backend.h |  5 +++++
> >>>    hw/virtio/vhost-vdpa.c            | 18 ++++++++++++++++++
> >>>    hw/virtio/trace-events            |  1 +
> >>>    3 files changed, 24 insertions(+)
> >>>
> >>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>> index 94d3323905..bcb112c166 100644
> >>> --- a/include/hw/virtio/vhost-backend.h
> >>> +++ b/include/hw/virtio/vhost-backend.h
> >>> @@ -36,6 +36,7 @@ struct vhost_vring_addr;
> >>>    struct vhost_scsi_target;
> >>>    struct vhost_iotlb_msg;
> >>>    struct vhost_virtqueue;
> >>> +struct vhost_vdpa_iova_range;
> >>>
> >>>    typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque);
> >>>    typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
> >>> @@ -127,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
> >>>
> >>>    typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
> >>>
> >>> +typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
> >>> +                                    hwaddr *first, hwaddr *last);
> >>> +
> >>>    typedef struct VhostOps {
> >>>        VhostBackendType backend_type;
> >>>        vhost_backend_init vhost_backend_init;
> >>> @@ -173,6 +177,7 @@ typedef struct VhostOps {
> >>>        vhost_get_device_id_op vhost_get_device_id;
> >>>        vhost_vring_pause_op vhost_vring_pause;
> >>>        vhost_force_iommu_op vhost_force_iommu;
> >>> +    vhost_get_iova_range vhost_get_iova_range;
> >>>    } VhostOps;
> >>>
> >>>    extern const VhostOps user_ops;
> >>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>> index 01d2101d09..74fe92935e 100644
> >>> --- a/hw/virtio/vhost-vdpa.c
> >>> +++ b/hw/virtio/vhost-vdpa.c
> >>> @@ -579,6 +579,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> >>>        return true;
> >>>    }
> >>>
> >>> +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> >>> +                                     hwaddr *first, hwaddr *last)
> >>> +{
> >>> +    int ret;
> >>> +    struct vhost_vdpa_iova_range range;
> >>> +
> >>> +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> >>> +    if (ret != 0) {
> >>> +        return ret;
> >>> +    }
> >>> +
> >>> +    *first = range.first;
> >>> +    *last = range.last;
> >>> +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> >>> +    return ret;
> >>> +}
> >>> +
> >>>    const VhostOps vdpa_ops = {
> >>>            .backend_type = VHOST_BACKEND_TYPE_VDPA,
> >>>            .vhost_backend_init = vhost_vdpa_init,
> >>> @@ -611,4 +628,5 @@ const VhostOps vdpa_ops = {
> >>>            .vhost_get_device_id = vhost_vdpa_get_device_id,
> >>>            .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
> >>>            .vhost_force_iommu = vhost_vdpa_force_iommu,
> >>> +        .vhost_get_iova_range = vhost_vdpa_get_iova_range,
> >>>    };
> >>> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> >>> index c62727f879..5debe3a681 100644
> >>> --- a/hw/virtio/trace-events
> >>> +++ b/hw/virtio/trace-events
> >>> @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> >>>    vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> >>>    vhost_vdpa_set_owner(void *dev) "dev: %p"
> >>>    vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> >>> +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
> >>>
> >>>    # virtio.c
> >>>    virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 15/29] vhost: Add enable_custom_iommu to VhostOps
  2021-05-31  9:01   ` Jason Wang
@ 2021-06-01  7:49     ` Eugenio Perez Martin
  0 siblings, 0 replies; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-06-01  7:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella

On Mon, May 31, 2021 at 11:02 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> > This operation enables the backend-specific IOTLB entries.
> >
> > If a backend supports this, it starts managing its own entries, and
> > vhost can disable it through this operation and recover control.
> >
> > Every enable/disable operation must also clear all IOTLB device entries.
> >
> > At the moment, the only backend that does so is vhost-vdpa. To fully
> > support these, vdpa also needs to expose a way for the vhost subsystem
> > to map and unmap entries. This will be done in future commits.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
>
> I think there's probably no need to introduce this helper.
>
> Instead, we can introduce ops like shadow_vq_start()/stop(). Then
> details like this could be hidden there.
>

I'm also fine with your approach, but then the ownership of the shadow
virtqueue would be split between vhost code and the vhost backend
code.

With the current code, vhost is in charge of mapping DMA entries, and
delegates to the backend when the latter has its own means of mapping
[1]. If we just expose shadow_vq_start/stop, the logic of when to map
gets somewhat duplicated in vhost and in the backend, and it is not
obvious that future code changes on one side need to be duplicated on
the other.

I understand that this way needs to expose more vhost operations, but
I think each of these is smaller and fits better as a "vhost backend
implementation of an operation" than just telling the backend that the
shadow vq is started.

> (And hide the backend details, i.e. avoid calling vhost_vdpa_dma_map()
> directly from vhost.c.)
>

Sure, the direct call of vhost_vdpa_dma_map is not intended to stay
that way in the final series; it's just an intermediate step. I could
have been more explicit about that, sorry.

[1] At the moment it just calls vhost_vdpa_dma_map directly, but this
should be changed to a vhost_ops callback, and that op is optional: if
not present, the vIOMMU is used.
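
To make the intent of [1] concrete, a sketch of that optional op
(hypothetical name and signature):

    /* If the backend provides its own mapping callback, use it;
     * otherwise the regular (v)IOMMU path keeps handling the
     * translations. */
    if (dev->vhost_ops->vhost_svq_map) {
        r = dev->vhost_ops->vhost_svq_map(dev, iova, size, vaddr, perm);
    } else {
        /* fall back to vIOMMU-mediated mappings */
    }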

> Thanks
>
>
> > ---
> >   include/hw/virtio/vhost-backend.h | 4 ++++
> >   1 file changed, 4 insertions(+)
> >
> > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > index bcb112c166..f8eed2ace5 100644
> > --- a/include/hw/virtio/vhost-backend.h
> > +++ b/include/hw/virtio/vhost-backend.h
> > @@ -128,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
> >
> >   typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
> >
> > +typedef int (*vhost_enable_custom_iommu_op)(struct vhost_dev *dev,
> > +                                            bool enable);
> > +
> >   typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
> >                                       hwaddr *first, hwaddr *last);
> >
> > @@ -177,6 +180,7 @@ typedef struct VhostOps {
> >       vhost_get_device_id_op vhost_get_device_id;
> >       vhost_vring_pause_op vhost_vring_pause;
> >       vhost_force_iommu_op vhost_force_iommu;
> > +    vhost_enable_custom_iommu_op vhost_enable_custom_iommu;
> >       vhost_get_iova_range vhost_get_iova_range;
> >   } VhostOps;
> >
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 21/29] vhost: Add VhostIOVATree
  2021-05-31  9:40   ` Jason Wang
@ 2021-06-01  8:15     ` Eugenio Perez Martin
  2021-07-14  3:04       ` Jason Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-06-01  8:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella

On Mon, May 31, 2021 at 11:40 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> > This tree is able to look for a translated address from an IOVA address.
> >
> > At first glance it is similar to util/iova-tree. However, SVQ working on
> > devices with limited IOVA space needs more capabilities, like allocating
> > IOVA chunks or performing reverse translations (qemu addresses to iova).
> >
> > Starting a separate implementation. Knowing that insertions/deletions
> > will not be as frequent as searches,
>
>
> This might not be true if vIOMMU is enabled.
>

Right.

>
> > it uses an ordered array as its
> > implementation.
>
>
> I wonder how much overhead g_array could have if it needs to grow.
>

I didn't do any tests, actually. But I see this VhostIOVATree as a
replaceable tool, just to get the buffer translations to work. So I'm
both ok with changing it now and ok with delaying it, since neither
should be hard to do.

>
> >   A different name could be used, but "ordered
> > searchable array" is a little bit long.
>
>
> Note that we have a very good example for this: the kernel iova
> allocator, which is implemented via an rbtree.
>
> Instead of figuring out the g_array vs g_tree stuff, I would simply go with
> g_tree first (based on util/iova-tree) and borrow the well-designed kernel
> iova allocator API to have a generic IOVA one instead of coupling it
> with vhost. It could be used by other userspace drivers in the future:
>
> init_iova_domain()/put_iova_domain();
>
> alloc_iova()/free_iova();
>
> find_iova();
>

We could go that way, but then the iova-tree would need to be extended to
support both translations (iova->translated_addr is implemented in
iova-tree today, the reverse is not). If I understood you correctly,
borrowing the kernel iova allocator would give us both, right?

Note that it is not coupled to vhost at all except in the name: the
implementation only works with hwaddr and void pointers.
Just to illustrate the point, I think it could be a drop-in
replacement for iova-tree at this moment (with all the
drawbacks/advantages of an array vs a tree).
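
For reference, a sketch of what such a generic allocator API could look
like (all names here are assumptions modeled on the kernel API above;
none of these functions exist in qemu today):

    typedef struct IOVADomain IOVADomain;

    /* Lifetime, mirroring init_iova_domain()/put_iova_domain() */
    IOVADomain *iova_domain_new(hwaddr iova_first, hwaddr iova_last);
    void iova_domain_free(IOVADomain *dom);

    /* Allocation, mirroring alloc_iova()/free_iova() */
    int iova_alloc(IOVADomain *dom, hwaddr size, hwaddr *iova);
    void iova_free(IOVADomain *dom, hwaddr iova, hwaddr size);

    /* Lookup in both directions: find_iova() plus the reverse
     * (qemu VA -> iova) translation that SVQ needs */
    void *iova_find_translated(const IOVADomain *dom, hwaddr iova);
    int iova_find_iova(const IOVADomain *dom, const void *translated,
                       hwaddr *iova);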

> Another reference is the iova allocator that is implemented in VFIO.

I will check this too.

Thanks!


>
> Thanks
>
>
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-iova-tree.h |  50 ++++++++++
> >   hw/virtio/vhost-iova-tree.c | 188 ++++++++++++++++++++++++++++++++++++
> >   hw/virtio/meson.build       |   2 +-
> >   3 files changed, 239 insertions(+), 1 deletion(-)
> >   create mode 100644 hw/virtio/vhost-iova-tree.h
> >   create mode 100644 hw/virtio/vhost-iova-tree.c
> >
> > diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> > new file mode 100644
> > index 0000000000..2a44af8b3a
> > --- /dev/null
> > +++ b/hw/virtio/vhost-iova-tree.h
> > @@ -0,0 +1,50 @@
> > +/*
> > + * vhost software live migration ring
> > + *
> > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> > +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> > +
> > +#include <gmodule.h>
> > +
> > +#include "exec/memory.h"
> > +
> > +typedef struct VhostDMAMap {
> > +    void *translated_addr;
> > +    hwaddr iova;
> > +    hwaddr size;                /* Inclusive */
> > +    IOMMUAccessFlags perm;
> > +} VhostDMAMap;
> > +
> > +typedef enum VhostDMAMapNewRC {
> > +    VHOST_DMA_MAP_OVERLAP = -2,
> > +    VHOST_DMA_MAP_INVALID = -1,
> > +    VHOST_DMA_MAP_OK = 0,
> > +} VhostDMAMapNewRC;
> > +
> > +/**
> > + * VhostIOVATree
> > + *
> > + * Store and search IOVA -> Translated mappings.
> > + *
> > + * Note that it cannot remove nodes.
> > + */
> > +typedef struct VhostIOVATree {
> > +    /* Ordered array of reverse translations, IOVA address to qemu memory. */
> > +    GArray *iova_taddr_map;
> > +} VhostIOVATree;
> > +
> > +void vhost_iova_tree_new(VhostIOVATree *iova_rm);
> > +void vhost_iova_tree_destroy(VhostIOVATree *iova_rm);
> > +
> > +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *iova_rm,
> > +                                              const VhostDMAMap *map);
> > +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *iova_rm,
> > +                                        VhostDMAMap *map);
> > +
> > +#endif
> > diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> > new file mode 100644
> > index 0000000000..dfd7e448b5
> > --- /dev/null
> > +++ b/hw/virtio/vhost-iova-tree.c
> > @@ -0,0 +1,188 @@
> > +/*
> > + * vhost software live migration ring
> > + *
> > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "vhost-iova-tree.h"
> > +
> > +#define G_ARRAY_NOT_ZERO_TERMINATED false
> > +#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
> > +
> > +/**
> > + * Inserts an element after an existing one in garray.
> > + *
> > + * @array      The array
> > + * @prev_elem  The previous element of array, or NULL if prepending
> > + * @map        The DMA map
> > + *
> > + * It provides the additional advantage of being type safe over
> > + * g_array_insert_val, which accepts a reference pointer instead of a value
> > + * without complaint.
> > + */
> > +static void vhost_iova_tree_insert_after(GArray *array,
> > +                                         const VhostDMAMap *prev_elem,
> > +                                         const VhostDMAMap *map)
> > +{
> > +    size_t pos;
> > +
> > +    if (!prev_elem) {
> > +        pos = 0;
> > +    } else {
> > +        pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
> > +    }
> > +
> > +    g_array_insert_val(array, pos, *map);
> > +}
> > +
> > +static gint vhost_iova_tree_cmp_iova(gconstpointer a, gconstpointer b)
> > +{
> > +    const VhostDMAMap *m1 = a, *m2 = b;
> > +
> > +    if (m1->iova > m2->iova + m2->size) {
> > +        return 1;
> > +    }
> > +
> > +    if (m1->iova + m1->size < m2->iova) {
> > +        return -1;
> > +    }
> > +
> > +    /* Overlapped */
> > +    return 0;
> > +}
> > +
> > +/**
> > + * Find the previous node to a given iova
> > + *
> > + * @array  The array of VhostDMAMap, in ascending iova order
> > + * @map    The map to insert
> > + * @prev   Returned location of the previous map
> > + *
> > + * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
> > + * it already exists. It is ok to use this function to check if a given range
> > + * exists, but it will use a linear search.
> > + *
> > + * TODO: We can use bsearch to locate the entry if we save the state in the
> > + * needle, knowing that the needle is always the first argument to
> > + * compare_func.
> > + */
> > +static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
> > +                                                  GCompareFunc compare_func,
> > +                                                  const VhostDMAMap *map,
> > +                                                  const VhostDMAMap **prev)
> > +{
> > +    size_t i;
> > +    int r;
> > +
> > +    *prev = NULL;
> > +    for (i = 0; i < array->len; ++i) {
> > +        r = compare_func(map, &g_array_index(array, typeof(*map), i));
> > +        if (r == 0) {
> > +            return VHOST_DMA_MAP_OVERLAP;
> > +        }
> > +        if (r < 0) {
> > +            return VHOST_DMA_MAP_OK;
> > +        }
> > +
> > +        *prev = &g_array_index(array, typeof(**prev), i);
> > +    }
> > +
> > +    return VHOST_DMA_MAP_OK;
> > +}
> > +
> > +/**
> > + * Create a new IOVA tree
> > + *
> > + * @tree  The IOVA tree
> > + */
> > +void vhost_iova_tree_new(VhostIOVATree *tree)
> > +{
> > +    assert(tree);
> > +
> > +    tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
> > +                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
> > +                                       sizeof(VhostDMAMap));
> > +}
> > +
> > +/**
> > + * Destroy an IOVA tree
> > + *
> > + * @tree  The iova tree
> > + */
> > +void vhost_iova_tree_destroy(VhostIOVATree *tree)
> > +{
> > +    g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
> > +}
> > +
> > +/**
> > + * Perform a search on a GArray.
> > + *
> > + * @array Glib array
> > + * @map Map to look up
> > + * @compare_func Compare function to use
> > + *
> > + * Return The found element or NULL if not found.
> > + *
> > + * This can be replaced with g_array_binary_search (Since glib 2.62) when that
> > + * is common enough.
> > + */
> > +static const VhostDMAMap *vhost_iova_tree_bsearch(const GArray *array,
> > +                                                  const VhostDMAMap *map,
> > +                                                  GCompareFunc compare_func)
> > +{
> > +    return bsearch(map, array->data, array->len, sizeof(*map), compare_func);
> > +}
> > +
> > +/**
> > + * Find the translated address stored for an IOVA address
> > + *
> > + * @tree  The iova tree
> > + * @map   The map with the memory address
> > + *
> > + * Return the stored mapping, or NULL if not found.
> > + */
> > +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
> > +                                              const VhostDMAMap *map)
> > +{
> > +    return vhost_iova_tree_bsearch(tree->iova_taddr_map, map,
> > +                                  vhost_iova_tree_cmp_iova);
> > +}
> > +
> > +/**
> > + * Insert a new map
> > + *
> > + * @tree  The iova tree
> > + * @map   The iova map
> > + *
> > + * Returns:
> > + * - VHOST_DMA_MAP_OK if the map fits in the container
> > + * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
> > + * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
> > + * The assigned iova can be queried in map.
> > + */
> > +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
> > +                                        VhostDMAMap *map)
> > +{
> > +    const VhostDMAMap *prev;
> > +    int find_prev_rc;
> > +
> > +    if (map->translated_addr + map->size < map->translated_addr ||
> > +        map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
> > +        return VHOST_DMA_MAP_INVALID;
> > +    }
> > +
> > +    /* Check for duplicates, and save position for insertion */
> > +    find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
> > +                                             vhost_iova_tree_cmp_iova, map,
> > +                                             &prev);
> > +    if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
> > +        return VHOST_DMA_MAP_OVERLAP;
> > +    }
> > +
> > +    vhost_iova_tree_insert_after(tree->iova_taddr_map, prev, map);
> > +    return VHOST_DMA_MAP_OK;
> > +}
> > diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> > index 8b5a0225fe..cb306b83c6 100644
> > --- a/hw/virtio/meson.build
> > +++ b/hw/virtio/meson.build
> > @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
> >
> >   virtio_ss = ss.source_set()
> >   virtio_ss.add(files('virtio.c'))
> > -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> > +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
> >   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
> >   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
> >   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding
  2021-05-19 16:28 ` [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
@ 2021-06-02  9:50   ` Jason Wang
  2021-06-02 17:18     ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-06-02  9:50 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella


在 2021/5/20 上午12:28, Eugenio Pérez 写道:
> Initial version of a shadow virtqueue that actually forwards buffers. The
> exposed addresses are qemu's virtual addresses, so devices with an IOMMU
> that does not allow full mapping of qemu's address space do not work
> at the moment.
>
> Also, for simplicity it only supports modern devices, which expect the
> vring in little endian, with a split ring and no event idx or indirect
> descriptors.
>
> It reuses the VirtQueue code for the device part. The driver part is
> based on Linux's virtio_ring driver, but with stripped functionality
> and optimizations so it's easier to review.
>
> Later commits will solve some of these concerns.


It would be much easier to review if you squash those
"enhancements" into this patch.


>
> The code also needs to map the used ring (device part) as RW in, and
> only in, vhost-net. Mapping (or calling vhost_device_iotlb_miss)
> unconditionally would print an error in the case of vhost devices with
> their own mapping (vdpa).


I think we should not depend on the IOTLB miss. Instead, we should
program the device IOTLB before starting the svq. Or is there anything
that prevents you from doing this?


> To know if this call is needed, vhost_sw_live_migration_start_vq and
> vhost_sw_live_migration_stop copy the test performed in
> vhost_dev_start. Testing for the actual backend type could be cleaner,
> or checking for a non-NULL vhost_force_iommu, enable_custom_iommu, or
> another vhostOp. We could extract this test into its own function too,
> so its name would give a better hint. For the moment, just copy the
> vhost_dev_start check.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.c | 205 +++++++++++++++++++++++++++--
>   hw/virtio/vhost.c                  | 134 ++++++++++++++++++-
>   2 files changed, 325 insertions(+), 14 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index ff50f12410..6d767fe248 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -9,6 +9,7 @@
>   
>   #include "hw/virtio/vhost-shadow-virtqueue.h"
>   #include "hw/virtio/vhost.h"
> +#include "hw/virtio/virtio-access.h"
>   
>   #include "standard-headers/linux/vhost_types.h"
>   
> @@ -48,9 +49,93 @@ typedef struct VhostShadowVirtqueue {
>   
>       /* Virtio device */
>       VirtIODevice *vdev;
> +
> +    /* Map for returning guest's descriptors */
> +    VirtQueueElement **ring_id_maps;
> +
> +    /* Next head to expose to device */
> +    uint16_t avail_idx_shadow;
> +
> +    /* Next free descriptor */
> +    uint16_t free_head;
> +
> +    /* Last seen used idx */
> +    uint16_t shadow_used_idx;
> +
> +    /* Next head to consume from device */
> +    uint16_t used_idx;
>   } VhostShadowVirtqueue;
>   
> -/* Forward guest notifications */
> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> +                                    const struct iovec *iovec,
> +                                    size_t num, bool more_descs, bool write)
> +{
> +    uint16_t i = svq->free_head, last = svq->free_head;
> +    unsigned n;
> +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> +    vring_desc_t *descs = svq->vring.desc;
> +
> +    if (num == 0) {
> +        return;
> +    }
> +
> +    for (n = 0; n < num; n++) {
> +        if (more_descs || (n + 1 < num)) {
> +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> +        } else {
> +            descs[i].flags = flags;
> +        }
> +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
> +
> +        last = i;
> > +        i = le16_to_cpu(descs[i].next);
> +    }
> +
> +    svq->free_head = le16_to_cpu(descs[last].next);
> +}
> +
> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
> +                                          VirtQueueElement *elem)
> +{
> +    int head;
> +    unsigned avail_idx;
> +    vring_avail_t *avail = svq->vring.avail;
> +
> +    head = svq->free_head;
> +
> +    /* We need some descriptors here */
> +    assert(elem->out_num || elem->in_num);
> +
> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> +                            elem->in_num > 0, false);
> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> +
> +    /*
> +     * Put entry in available array (but don't update avail->idx until they
> +     * do sync).
> +     */
> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> +    avail->ring[avail_idx] = cpu_to_le16(head);
> +    svq->avail_idx_shadow++;
> +
> +    /* Expose descriptors to device */


It's better to describe the exact ordering in detail.

E.g. "Update the avail index after the descriptors are written"


> +    smp_wmb();
> +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
> +
> +    return head;
> +
> +}
> +
> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
> +                                VirtQueueElement *elem)
> +{
> +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
> +
> +    svq->ring_id_maps[qemu_head] = elem;
> +}
> +
> +/* Handle guest->device notifications */
>   static void vhost_handle_guest_kick(EventNotifier *n)
>   {
>       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> @@ -60,7 +145,67 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>           return;
>       }
>   
> -    event_notifier_set(&svq->kick_notifier);
> +    /* Make available as many buffers as possible */
> +    do {
> +        if (virtio_queue_get_notification(svq->vq)) {
> +            /* No more notifications until process all available */
> +            virtio_queue_set_notification(svq->vq, false);
> +        }
> +
> +        while (true) {
> +            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> +            if (!elem) {
> +                break;
> +            }
> +
> +            vhost_shadow_vq_add(svq, elem);
> +            event_notifier_set(&svq->kick_notifier);
> +        }
> +
> +        virtio_queue_set_notification(svq->vq, true);
> +    } while (!virtio_queue_empty(svq->vq));
> +}
> +
> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
> +{
> +    if (svq->used_idx != svq->shadow_used_idx) {
> +        return true;
> +    }
> +
> +    /* Get used idx must not be reordered */
> +    smp_rmb();
> > +    svq->shadow_used_idx = le16_to_cpu(svq->vring.used->idx);
> +
> +    return svq->used_idx != svq->shadow_used_idx;
> +}
> +
> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
> +{
> +    vring_desc_t *descs = svq->vring.desc;
> +    const vring_used_t *used = svq->vring.used;
> +    vring_used_elem_t used_elem;
> +    uint16_t last_used;
> +
> +    if (!vhost_shadow_vq_more_used(svq)) {
> +        return NULL;
> +    }
> +
> +    last_used = svq->used_idx & (svq->vring.num - 1);
> +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
> +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
> +
> +    if (unlikely(used_elem.id >= svq->vring.num)) {
> +        error_report("Device %s says index %u is available", svq->vdev->name,
> +                     used_elem.id);
> +        return NULL;
> +    }
> +
> > +    descs[used_elem.id].next = cpu_to_le16(svq->free_head);
> +    svq->free_head = used_elem.id;
> +
> +    svq->used_idx++;
> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
>   }
>   
>   /* Forward vhost notifications */
> @@ -69,17 +214,33 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>                                                call_notifier);
>       EventNotifier *masked_notifier;
> +    VirtQueue *vq = svq->vq;
>   
>       masked_notifier = svq->masked_notifier.n;
>   
> -    if (!masked_notifier) {
> -        unsigned n = virtio_get_queue_index(svq->vq);
> -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
> -        virtio_notify_irqfd(svq->vdev, svq->vq);
> -    } else if (!svq->masked_notifier.signaled) {
> -        svq->masked_notifier.signaled = true;
> -        event_notifier_set(svq->masked_notifier.n);
> -    }
> +    /* Make as many buffers as possible used. */
> +    do {
> +        unsigned i = 0;
> +
> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> +        while (true) {
> +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
> +            if (!elem) {
> +                break;
> +            }
> +
> +            assert(i < svq->vring.num);
> +            virtqueue_fill(vq, elem, elem->len, i++);
> +        }
> +
> +        virtqueue_flush(vq, i);
> +        if (!masked_notifier) {
> +            virtio_notify_irqfd(svq->vdev, svq->vq);


Any reason that you don't use virtio_notify() here?


> +        } else if (!svq->masked_notifier.signaled) {
> +            svq->masked_notifier.signaled = true;
> +            event_notifier_set(svq->masked_notifier.n);
> +        }


This is an example of the extra complexity if you do the shadow virtqueue
at the virtio level.

If you do everything at e.g. the vhost-vdpa level, you don't need to care
about masked_notifier at all.

Thanks



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ
  2021-05-19 16:28 ` [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ Eugenio Pérez
@ 2021-06-02  9:51   ` Jason Wang
  2021-06-02 17:51     ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-06-02  9:51 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella


在 2021/5/20 上午12:28, Eugenio Pérez 写道:
> Use the translations added in IOVAReverseMaps in SVQ if the vhost device
> does not support mapping qemu's full virtual address space.
> Otherwise, the Shadow Virtqueue still uses the qemu virtual address
> of the buffer pointed to by the descriptor, which has already been
> translated by qemu's VirtQueue machinery.


I'd say let's stick to a single kind of translation (iova allocator) that
works for all the cases first and add optimizations on top.


>
> Now every element also needs to store the previous address, so VirtQueue
> can consume the elements properly. This adds a little overhead per VQ
> element, having to allocate more memory to stash them. As a possible
> optimization, this allocation could be avoided if the descriptor is not
> a chain but a single one, but this is left undone.
>
> Also check for vhost_set_iotlb_callback to send the used ring remapping.
> This is only needed for kernel vhost, and would print an error in the
> case of vhost devices with their own mapping (vdpa).
>
> This could change to another callback, like checking for
> vhost_force_iommu, enable_custom_iommu, or another. Another option could
> be to, at least, extract the check of "is map(used, writable) needed?"
> into another function. But at the moment just copy the check used in
> vhost_dev_start here.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.c | 134 ++++++++++++++++++++++++++---
>   hw/virtio/vhost.c                  |  29 +++++--
>   2 files changed, 145 insertions(+), 18 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 934d3bb27b..a92da979d1 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -10,12 +10,19 @@
>   #include "hw/virtio/vhost-shadow-virtqueue.h"
>   #include "hw/virtio/vhost.h"
>   #include "hw/virtio/virtio-access.h"
> +#include "hw/virtio/vhost-iova-tree.h"
>   
>   #include "standard-headers/linux/vhost_types.h"
>   
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
>   
> +typedef struct SVQElement {
> +    VirtQueueElement elem;
> +    void **in_sg_stash;
> +    void **out_sg_stash;


Any reason for the trick like this?

Can we simply use iovec and iov_copy() here?

Thanks




^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 00/29] vDPA software assisted live migration
  2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
                   ` (29 preceding siblings ...)
  2021-05-24  9:38 ` [RFC v3 00/29] vDPA software assisted live migration Michael S. Tsirkin
@ 2021-06-02  9:59 ` Jason Wang
  30 siblings, 0 replies; 67+ messages in thread
From: Jason Wang @ 2021-06-02  9:59 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Michael Lilja,
	Stefano Garzarella


在 2021/5/20 上午12:28, Eugenio Pérez 写道:
> This series enable shadow virtqueue for vhost-vdpa devices. This is a
> new method of vhost devices migration: Instead of relay on vhost
> device's dirty logging capability, SW assisted LM intercepts dataplane,
> forwarding the descriptors between VM and device. Is intended for vDPA
> devices with no dirty memory tracking capabilities.
>
> In this migration mode, qemu offers a new vring to the device to
> read and write into, and disable vhost notifiers, processing guest and
> vhost notifications in qemu. On used buffer relay, qemu will mark the
> dirty memory as with plain virtio-net devices. This way, devices does
> not need to have dirty page logging capability.
>
> This series is a POC doing SW LM for vhost-net and vhost-vdpa devices.
> The former already have dirty page logging capabilities, but it is both
> easier to test and uses different code paths in qemu.
>
> For qemu to use shadow virtqueues the vhost-net devices need to be
> instantiated:
> * With IOMMU (iommu_platform=on,ats=on)
> * Without event_idx (event_idx=off)
>
> And shadow virtqueue needs to be enabled for them with QMP command
> like:
>
> { "execute": "x-vhost-enable-shadow-vq",
>        "arguments": { "name": "dev0", "enable": true } }
>
> The series includes some commits to delete in the final version. One
> of them is the one that adds vhost_kernel_vring_pause to vhost kernel
> devices. This is only intended to work with vhost-net devices, as a way
> to test the solution, so don't use any other vhost kernel device in the
> same test.
>
> The vhost-vdpa devices should work the same way. However, vp_vdpa is
> not working properly with intel iommu unmapping, so this series add two
> extra commits to allow testing the solution enable SVQ mode from the
> device start and forbidding any other vhost-vdpa memory mapping. The
> causes of this are still under debugging.
>
> For testing vhost-vdpa devices vp_vdpa device has been used with nested
> virtualization, using a qemu virtio-net device in L0. To be able to
> stop and reset status, features in RFC status has been implemented in
> commits 5 and 6. After that, virtio-net driver in L0 guest is replaced
> by vp_vdpa driver, and a nested qemu instance is launched using it.
>
> This vp_vdpa driver needs to be also modified to support the RFCs,
> mainly allowing it to removing the _S_STOPPED status flag and
> implementing actual vp_vdpa_set_vq_state and vp_vdpa_get_vq_state
> callbacks.
>
> Just the notification forwarding (with no descriptor relay) can be
> achieved with patches 7 and 8, and starting SVQ. Previous commits
> are cleanup ones and declaration of QMP command.
>
> Commit 17 introduces the buffer forwarding. Previous one are for
> preparations again, and laters are for enabling some obvious
> optimizations. However, it needs the vdpa device to be able to map
> every IOVA space, and some vDPA devices are not able to do so. Checking
> of this is added in previous commits.
>
> Later commits allow vhost and shadow virtqueue to track and translate
> between qemu virtual addresses and a restricted iommu range. At the
> moment it is not able to delete old translations, limit the maximum range
> it can translate, or let vhost add new memory regions once
> SVQ is enabled, but it is fairly straightforward to add these.
>
> This is a big series, so the idea is to send it in logical chunks once
> all comments have been collected. An SVQ mode with no possibility of
> going back to regular mode would cover a first complete use case, and
> this RFC already has all the ingredients but internal memory tracking.
>
> It is based on the ideas of DPDK SW assisted LM, in DPDK's series at
> https://patchwork.dpdk.org/cover/48370/ . However, that series does
> not map the shadow vq in the guest's VA, but in qemu's.
>
> Comments are welcome!


Thanks a lot for working on this.

I feel like we need to start from something simple and get some minimal
functionality working first.

There are two major sources of the current complexity:

1) the current code tries to work at the general virtio level (all kinds
of vhost backends)
2) two kinds of translations are used (qemu HVA and qemu IOVA)

For 1), I'd suggest starting from vhost-vdpa, and it's even better to
hide all the svq stuff from the vhost API first. That is to say, do it
totally inside vhost-vDPA and leave the rest of Qemu untouched.
This makes things easier, and after this part is merged we can start to
think about how to generalize it to other vhost backends (it's still
questionable whether it's worth doing so). I believe most of the code
could be re-used.

For 2), I think we'd better always go with qemu IOVA (IOVA allocator).
This should work for all cases and may simplify the code a lot. In the
future, if we find qemu HVA is useful, we can implement a dedicated
allocator for vhost-net to have a 1:1 mapping.
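
To make the shape of that concrete, a sketch of the "totally inside
vhost-vDPA" idea (v->shadow_vqs_enabled and vhost_vdpa_svq_start are
assumed names, not code from this series):

    /* In hw/virtio/vhost-vdpa.c: generic vhost code never sees SVQ */
    static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
    {
        struct vhost_vdpa *v = dev->opaque;

        if (started && v->shadow_vqs_enabled) {
            /* Map the shadow rings and swap kick/call fds internally */
            int r = vhost_vdpa_svq_start(v, dev);
            if (r != 0) {
                return r;
            }
        }

        if (started) {
            vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
        }

        return 0;
    }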

Thoughts?

Thanks


>
> TODO:
> * Event, indirect, packed, and other features of virtio - Waiting for
>    confirmation of the big picture.
> * vDPA devices: Grow the IOVA tree to track new or deleted memory. Cap
>    the IOVA limit in the tree so it cannot grow forever.
> * Separate buffer forwarding into its own AIO context, so we can
>    throw more threads at that task and we don't need to stop the main
>    event loop.
> * IOMMU optimizations, so batching and bigger chunks of IOVA can be
>    sent to the device.
> * Automatic kick-in on live-migration.
> * Proper documentation.
>
> Thanks!
>
> Changes from v2 RFC:
>    * Adding vhost-vdpa devices support
>    * Fixed some memory leaks pointed by different comments
>
> Changes from v1 RFC:
>    * Use QMP instead of migration to start SVQ mode.
>    * Only accepting IOMMU devices, closer behavior with target devices
>      (vDPA)
>    * Fix invalid masking/unmasking of vhost call fd.
>    * Use of proper methods for synchronization.
>    * No need to modify VirtIO device code, all of the changes are
>      contained in vhost code.
>    * Delete superfluous code.
>    * An intermediate RFC was sent with only the notifications forwarding
>      changes. It can be seen in
>      https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
>    * v1 at
>      https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
>
> Eugenio Pérez (29):
>    virtio: Add virtio_queue_is_host_notifier_enabled
>    vhost: Save masked_notifier state
>    vhost: Add VhostShadowVirtqueue
>    vhost: Add x-vhost-enable-shadow-vq qmp
>    virtio: Add VIRTIO_F_QUEUE_STATE
>    virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
>    vhost: Route guest->host notification through shadow virtqueue
>    vhost: Route host->guest notification through shadow virtqueue
>    vhost: Avoid re-set masked notifier in shadow vq
>    virtio: Add vhost_shadow_vq_get_vring_addr
>    vhost: Add vhost_vring_pause operation
>    vhost: add vhost_kernel_vring_pause
>    vhost: Add vhost_get_iova_range operation
>    vhost: add vhost_has_limited_iova_range
>    vhost: Add enable_custom_iommu to VhostOps
>    vhost-vdpa: Add vhost_vdpa_enable_custom_iommu
>    vhost: Shadow virtqueue buffers forwarding
>    vhost: Use vhost_enable_custom_iommu to unmap everything if available
>    vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue
>      kick
>    vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow
>      virtqueue
>    vhost: Add VhostIOVATree
>    vhost: Add iova_rev_maps_find_iova to IOVAReverseMaps
>    vhost: Use a tree to store memory mappings
>    vhost: Add iova_rev_maps_alloc
>    vhost: Add custom IOTLB translations to SVQ
>    vhost: Map in vdpa-dev
>    vhost-vdpa: Implement vhost_vdpa_vring_pause operation
>    vhost-vdpa: never map with vDPA listener
>    vhost: Start vhost-vdpa SVQ directly
>
>   qapi/net.json                                 |  22 +
>   hw/virtio/vhost-iova-tree.h                   |  61 ++
>   hw/virtio/vhost-shadow-virtqueue.h            |  38 ++
>   hw/virtio/virtio-pci.h                        |   1 +
>   include/hw/virtio/vhost-backend.h             |  16 +
>   include/hw/virtio/vhost-vdpa.h                |   2 +-
>   include/hw/virtio/vhost.h                     |  14 +
>   include/hw/virtio/virtio.h                    |   5 +-
>   .../standard-headers/linux/virtio_config.h    |   5 +
>   include/standard-headers/linux/virtio_pci.h   |   2 +
>   hw/net/virtio-net.c                           |   4 +-
>   hw/virtio/vhost-backend.c                     |  42 ++
>   hw/virtio/vhost-iova-tree.c                   | 283 ++++++++
>   hw/virtio/vhost-shadow-virtqueue.c            | 643 ++++++++++++++++++
>   hw/virtio/vhost-vdpa.c                        |  73 +-
>   hw/virtio/vhost.c                             | 459 ++++++++++++-
>   hw/virtio/virtio-pci.c                        |   9 +
>   hw/virtio/virtio.c                            |   5 +
>   hw/virtio/meson.build                         |   2 +-
>   hw/virtio/trace-events                        |   1 +
>   20 files changed, 1663 insertions(+), 24 deletions(-)
>   create mode 100644 hw/virtio/vhost-iova-tree.h
>   create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
>   create mode 100644 hw/virtio/vhost-iova-tree.c
>   create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding
  2021-06-02  9:50   ` Jason Wang
@ 2021-06-02 17:18     ` Eugenio Perez Martin
  2021-06-03  3:34       ` Jason Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-06-02 17:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella

On Wed, Jun 2, 2021 at 11:51 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/5/20 上午12:28, Eugenio Pérez 写道:
> > Initial version of a shadow virtqueue that actually forwards buffers. The
> > exposed addresses are qemu's virtual addresses, so devices with an IOMMU
> > that does not allow full mapping of qemu's address space do not work
> > at the moment.
> >
> > Also, for simplicity it only supports modern devices, which expect the
> > vring in little endian, with a split ring and no event idx or indirect
> > descriptors.
> >
> > It reuses the VirtQueue code for the device part. The driver part is
> > based on Linux's virtio_ring driver, but with stripped functionality
> > and optimizations so it's easier to review.
> >
> > Later commits will solve some of these concerns.
>
>
> It would be much easier to review if you squash those
> "enhancements" into this patch.
>

Ok, they will be in the same commit for the next version.

>
> >
> > The code also needs to map the used ring (device part) as RW in, and
> > only in, vhost-net. Mapping (or calling vhost_device_iotlb_miss)
> > unconditionally would print an error in the case of vhost devices with
> > their own mapping (vdpa).
>
>
> I think we should not depend on the IOTLB miss. Instead, we should
> program the device IOTLB before starting the svq. Or is there anything
> that prevents you from doing this?
>

Sorry for being unclear; that is what I meant in the message: no
device other than kernel vhost needs the map (as "sent iotlb miss
ahead"), so we must make it conditional. Doing it unconditionally would
do nothing but make an error appear on the screen, and it is incorrect
anyway.

Is it clearer this way?
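
For illustration, a sketch of programming the device IOTLB ahead of SVQ
start, as suggested above (vhost_vdpa_dma_map follows the helper in
hw/virtio/vhost-vdpa.c; the ring-size helpers are assumed names):

    /*
     * Sketch: map the shadow vring before kick/call forwarding starts,
     * instead of waiting for an IOTLB miss.
     */
    static int vhost_svq_map_rings(struct vhost_vdpa *v,
                                   VhostShadowVirtqueue *svq)
    {
        int r;

        /* Driver area (desc + avail): the device only reads it */
        r = vhost_vdpa_dma_map(v, (hwaddr)(uintptr_t)svq->vring.desc,
                               vhost_svq_driver_area_size(svq),
                               svq->vring.desc, true);
        if (r != 0) {
            return r;
        }

        /* Device area (used ring): the device also writes it */
        return vhost_vdpa_dma_map(v, (hwaddr)(uintptr_t)svq->vring.used,
                                  vhost_svq_device_area_size(svq),
                                  svq->vring.used, false);
    }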

>
> > To know if this call is needed, vhost_sw_live_migration_start_vq and
> > vhost_sw_live_migration_stop copy the test performed in
> > vhost_dev_start. Testing for the actual backend type could be cleaner,
> > or checking for a non-NULL vhost_force_iommu, enable_custom_iommu, or
> > another vhostOp. We could extract this test into its own function too,
> > so its name would give a better hint. For the moment, just copy the
> > vhost_dev_start check.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.c | 205 +++++++++++++++++++++++++++--
> >   hw/virtio/vhost.c                  | 134 ++++++++++++++++++-
> >   2 files changed, 325 insertions(+), 14 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index ff50f12410..6d767fe248 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -9,6 +9,7 @@
> >
> >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> >   #include "hw/virtio/vhost.h"
> > +#include "hw/virtio/virtio-access.h"
> >
> >   #include "standard-headers/linux/vhost_types.h"
> >
> > @@ -48,9 +49,93 @@ typedef struct VhostShadowVirtqueue {
> >
> >       /* Virtio device */
> >       VirtIODevice *vdev;
> > +
> > +    /* Map for returning guest's descriptors */
> > +    VirtQueueElement **ring_id_maps;
> > +
> > +    /* Next head to expose to device */
> > +    uint16_t avail_idx_shadow;
> > +
> > +    /* Next free descriptor */
> > +    uint16_t free_head;
> > +
> > +    /* Last seen used idx */
> > +    uint16_t shadow_used_idx;
> > +
> > +    /* Next head to consume from device */
> > +    uint16_t used_idx;
> >   } VhostShadowVirtqueue;
> >
> > -/* Forward guest notifications */
> > +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > +                                    const struct iovec *iovec,
> > +                                    size_t num, bool more_descs, bool write)
> > +{
> > +    uint16_t i = svq->free_head, last = svq->free_head;
> > +    unsigned n;
> > +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> > +    vring_desc_t *descs = svq->vring.desc;
> > +
> > +    if (num == 0) {
> > +        return;
> > +    }
> > +
> > +    for (n = 0; n < num; n++) {
> > +        if (more_descs || (n + 1 < num)) {
> > +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> > +        } else {
> > +            descs[i].flags = flags;
> > +        }
> > +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> > +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
> > +
> > +        last = i;
> > > +        i = le16_to_cpu(descs[i].next);
> > +    }
> > +
> > +    svq->free_head = le16_to_cpu(descs[last].next);
> > +}
> > +
> > +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
> > +                                          VirtQueueElement *elem)
> > +{
> > +    int head;
> > +    unsigned avail_idx;
> > +    vring_avail_t *avail = svq->vring.avail;
> > +
> > +    head = svq->free_head;
> > +
> > +    /* We need some descriptors here */
> > +    assert(elem->out_num || elem->in_num);
> > +
> > +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > +                            elem->in_num > 0, false);
> > +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > +
> > +    /*
> > +     * Put entry in available array (but don't update avail->idx until they
> > +     * do sync).
> > +     */
> > +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> > +    avail->ring[avail_idx] = cpu_to_le16(head);
> > +    svq->avail_idx_shadow++;
> > +
> > +    /* Expose descriptors to device */
>
>
> It's better to describe the exact ordering in detail.
>
> E.g. "Update the avail index after the descriptors are written"
>

Agree, I will replace it with your wording.

>
> > +    smp_wmb();
> > +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
> > +
> > +    return head;
> > +
> > +}
> > +
> > +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
> > +                                VirtQueueElement *elem)
> > +{
> > +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
> > +
> > +    svq->ring_id_maps[qemu_head] = elem;
> > +}
> > +
> > +/* Handle guest->device notifications */
> >   static void vhost_handle_guest_kick(EventNotifier *n)
> >   {
> >       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > @@ -60,7 +145,67 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >           return;
> >       }
> >
> > -    event_notifier_set(&svq->kick_notifier);
> > +    /* Make available as many buffers as possible */
> > +    do {
> > +        if (virtio_queue_get_notification(svq->vq)) {
> > +            /* No more notifications until process all available */
> > +            virtio_queue_set_notification(svq->vq, false);
> > +        }
> > +
> > +        while (true) {
> > +            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > +            if (!elem) {
> > +                break;
> > +            }
> > +
> > +            vhost_shadow_vq_add(svq, elem);
> > +            event_notifier_set(&svq->kick_notifier);
> > +        }
> > +
> > +        virtio_queue_set_notification(svq->vq, true);
> > +    } while (!virtio_queue_empty(svq->vq));
> > +}
> > +
> > +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
> > +{
> > +    if (svq->used_idx != svq->shadow_used_idx) {
> > +        return true;
> > +    }
> > +
> > +    /* Get used idx must not be reordered */
> > +    smp_rmb();
> > > +    svq->shadow_used_idx = le16_to_cpu(svq->vring.used->idx);
> > +
> > +    return svq->used_idx != svq->shadow_used_idx;
> > +}
> > +
> > +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
> > +{
> > +    vring_desc_t *descs = svq->vring.desc;
> > +    const vring_used_t *used = svq->vring.used;
> > +    vring_used_elem_t used_elem;
> > +    uint16_t last_used;
> > +
> > +    if (!vhost_shadow_vq_more_used(svq)) {
> > +        return NULL;
> > +    }
> > +
> > +    last_used = svq->used_idx & (svq->vring.num - 1);
> > +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
> > +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
> > +
> > +    if (unlikely(used_elem.id >= svq->vring.num)) {
> > +        error_report("Device %s says index %u is available", svq->vdev->name,
> > +                     used_elem.id);
> > +        return NULL;
> > +    }
> > +
> > > +    descs[used_elem.id].next = cpu_to_le16(svq->free_head);
> > +    svq->free_head = used_elem.id;
> > +
> > +    svq->used_idx++;
> > +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> >   }
> >
> >   /* Forward vhost notifications */
> > @@ -69,17 +214,33 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >                                                call_notifier);
> >       EventNotifier *masked_notifier;
> > +    VirtQueue *vq = svq->vq;
> >
> >       masked_notifier = svq->masked_notifier.n;
> >
> > -    if (!masked_notifier) {
> > -        unsigned n = virtio_get_queue_index(svq->vq);
> > -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
> > -        virtio_notify_irqfd(svq->vdev, svq->vq);
> > -    } else if (!svq->masked_notifier.signaled) {
> > -        svq->masked_notifier.signaled = true;
> > -        event_notifier_set(svq->masked_notifier.n);
> > -    }
> > +    /* Make as many buffers as possible used. */
> > +    do {
> > +        unsigned i = 0;
> > +
> > +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> > +        while (true) {
> > +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
> > +            if (!elem) {
> > +                break;
> > +            }
> > +
> > +            assert(i < svq->vring.num);
> > +            virtqueue_fill(vq, elem, elem->len, i++);
> > +        }
> > +
> > +        virtqueue_flush(vq, i);
> > +        if (!masked_notifier) {
> > +            virtio_notify_irqfd(svq->vdev, svq->vq);
>
>
> Any reason that you don't use virtio_notify() here?
>

No reason other than making sure the guest_notifier is used. I'm not
sure if this is an implementation detail though.

I can test switching to virtio_notify; what would be the advantage here?

>
> > +        } else if (!svq->masked_notifier.signaled) {
> > +            svq->masked_notifier.signaled = true;
> > +            event_notifier_set(svq->masked_notifier.n);
> > +        }
>
>
> This is an example of the extra complexity if you do the shadow virtqueue
> at the virtio level.
>
> If you do everything at e.g. the vhost-vdpa level, you don't need to care
> about masked_notifier at all.
>

Correct me if I'm wrong: what you mean is to use the backend's
vhost_set_vring_call to set the guest notifier (from SVQ's point of
view), and then set it unconditionally. The function
vhost_virtqueue_mask would set the masked one by itself, so no
modification is needed here.

The backend would still need a conditional check on whether SVQ is
enabled, so it either sends the call_fd to the backend or to SVQ. The
call to virtqueue_fill would still be needed if we don't want to
duplicate all of the device virtio logic in the vhost-vdpa backend.

Another possibility would be to just store the guest_notifier in SVQ,
and replace it with the masked notifier and back. I think this is more
aligned with what you have in mind, but it still needs changes to
vhost_virtqueue_mask. Note that the stored boolean
masked_notifier.signaled is just a (maybe premature) optimization to
skip the unneeded write syscall, but it could be omitted for brevity.
Or maybe a cleaner solution is to use io_uring for this write? :).
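
A sketch of that last alternative (call_notifier_dst and the helper name
are assumptions, not code from this series):

    /*
     * Sketch: SVQ always signals a single destination notifier; masking
     * just swaps which notifier that is, so the call datapath itself
     * needs no mask check.
     */
    static void vhost_svq_set_call_dst(VhostShadowVirtqueue *svq,
                                       EventNotifier *n)
    {
        svq->call_notifier_dst = n;    /* assumed field */
    }

    /*
     * vhost_virtqueue_mask() would then do something like:
     *   mask:   vhost_svq_set_call_dst(svq, masked_notifier);
     *   unmask: vhost_svq_set_call_dst(svq, guest_notifier);
     * and the call handler would just event_notifier_set() that target.
     */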

Thanks!

> Thanks
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ
  2021-06-02  9:51   ` Jason Wang
@ 2021-06-02 17:51     ` Eugenio Perez Martin
  2021-06-03  3:39       ` Jason Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-06-02 17:51 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella

On Wed, Jun 2, 2021 at 11:52 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/5/20 上午12:28, Eugenio Pérez 写道:
> > Use the translations added in IOVAReverseMaps in SVQ if the vhost device
> > does not support mapping qemu's full virtual address space.
> > Otherwise, the Shadow Virtqueue still uses the qemu virtual address
> > of the buffer pointed to by the descriptor, which has already been
> > translated by qemu's VirtQueue machinery.
>
>
> I'd say let's stick to a single kind of translation (iova allocator) that
> works for all the cases first and add optimizations on top.
>

Ok, I will start from here for the next revision.

>
> >
> > Now every element also needs to store the previous address, so VirtQueue
> > can consume the elements properly. This adds a little overhead per VQ
> > element, having to allocate more memory to stash them. As a possible
> > optimization, this allocation could be avoided if the descriptor is not
> > a chain but a single one, but this is left undone.
> >
> > Also check for vhost_set_iotlb_callback to send the used ring remapping.
> > This is only needed for kernel vhost, and would print an error in the
> > case of vhost devices with their own mapping (vdpa).
> >
> > This could change to another callback, like checking for
> > vhost_force_iommu, enable_custom_iommu, or another. Another option could
> > be to, at least, extract the check of "is map(used, writable) needed?"
> > into another function. But at the moment just copy the check used in
> > vhost_dev_start here.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.c | 134 ++++++++++++++++++++++++++---
> >   hw/virtio/vhost.c                  |  29 +++++--
> >   2 files changed, 145 insertions(+), 18 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 934d3bb27b..a92da979d1 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -10,12 +10,19 @@
> >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> >   #include "hw/virtio/vhost.h"
> >   #include "hw/virtio/virtio-access.h"
> > +#include "hw/virtio/vhost-iova-tree.h"
> >
> >   #include "standard-headers/linux/vhost_types.h"
> >
> >   #include "qemu/error-report.h"
> >   #include "qemu/main-loop.h"
> >
> > +typedef struct SVQElement {
> > +    VirtQueueElement elem;
> > +    void **in_sg_stash;
> > +    void **out_sg_stash;
>
>
> Any reason for the trick like this?
>
> Can we simply use iovec and iov_copy() here?
>

At the moment the device writes the buffer directly to the guest's
memory, and SVQ only translates the descriptor. In this scenario,
there would be no need for iov_copy, would there?

The reason for stashing and unstashing them was to allow the 1:1
mapping with qemu memory, the IOMMU, and the iova allocator to work
with fewer changes. In particular, the reason for the unstash is that
virtqueue_fill expects qemu pointers to set the guest memory page as
dirty in virtqueue_unmap_sg->dma_memory_unmap.

Now I think that just storing the iova address from the allocator in a
separate field and using a wrapper to get the IOVA addresses in SVQ
would be a better idea, so I would change to this if everyone agrees.
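
A sketch of that separate-field idea (the iova arrays are assumed names,
replacing the in_sg_stash/out_sg_stash trick quoted above):

    /*
     * Sketch: keep the allocator-assigned IOVAs next to the element
     * instead of overwriting and restoring the iovec base pointers.
     */
    typedef struct SVQElement {
        VirtQueueElement elem;   /* keeps qemu VAs for virtqueue_fill() */
        hwaddr *in_iova;         /* IOVA of each elem.in_sg[i] */
        hwaddr *out_iova;        /* IOVA of each elem.out_sg[i] */
    } SVQElement;

    /*
     * Descriptors would be written from in_iova/out_iova, while dirty
     * tracking keeps using the untouched qemu pointers in elem.
     */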

Thanks!

> Thanks
>
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
  2021-06-01  7:13       ` Eugenio Perez Martin
@ 2021-06-03  3:12         ` Jason Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Jason Wang @ 2021-06-03  3:12 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella


在 2021/6/1 下午3:13, Eugenio Perez Martin 写道:
> On Wed, May 26, 2021 at 3:10 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/5/26 上午9:06, Jason Wang 写道:
>>> 在 2021/5/20 上午12:28, Eugenio Pérez 写道:
>>>> So the guest can stop and start the net device. It implements the RFC
>>>> https://lists.oasis-open.org/archives/virtio-comment/202012/msg00027.html
>>>>
>>>>
>>>> To stop (as in "pause") the device, it is required to migrate status and
>>>> vring addresses between the device and SVQ.
>>>>
>>>> This is a WIP commit: as with VIRTIO_F_QUEUE_STATE, it is introduced in
>>>> virtio_config.h before even being proposed for the kernel, with no feature
>>>> flag and no checking in the device. It also needs a modified
>>>> vp_vdpa driver that supports setting and retrieving the status.
>>>>
>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>> ---
>>>>    include/standard-headers/linux/virtio_config.h | 2 ++
>>>>    hw/net/virtio-net.c                            | 4 +++-
>>>>    2 files changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/standard-headers/linux/virtio_config.h
>>>> b/include/standard-headers/linux/virtio_config.h
>>>> index 59fad3eb45..b3f6b1365d 100644
>>>> --- a/include/standard-headers/linux/virtio_config.h
>>>> +++ b/include/standard-headers/linux/virtio_config.h
>>>> @@ -40,6 +40,8 @@
>>>>    #define VIRTIO_CONFIG_S_DRIVER_OK    4
>>>>    /* Driver has finished configuring features */
>>>>    #define VIRTIO_CONFIG_S_FEATURES_OK    8
>>>> +/* Device is stopped */
>>>> +#define VIRTIO_CONFIG_S_DEVICE_STOPPED 32
>>>>    /* Device entered invalid state, driver must reset it */
>>>>    #define VIRTIO_CONFIG_S_NEEDS_RESET    0x40
>>>>    /* We've given up on this device. */
>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>> index 96a3cc8357..2d3caea289 100644
>>>> --- a/hw/net/virtio-net.c
>>>> +++ b/hw/net/virtio-net.c
>>>> @@ -198,7 +198,9 @@ static bool virtio_net_started(VirtIONet *n,
>>>> uint8_t status)
>>>>    {
>>>>        VirtIODevice *vdev = VIRTIO_DEVICE(n);
>>>>        return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
>>>> -        (n->status & VIRTIO_NET_S_LINK_UP) && vdev->vm_running;
>>>> +        (!(n->status & VIRTIO_CONFIG_S_DEVICE_STOPPED)) &&
>>>> +        (n->status & VIRTIO_NET_S_LINK_UP) &&
>>>> +        vdev->vm_running;
>>>>    }
>>>>      static void virtio_net_announce_notify(VirtIONet *net)
>>>
>>> It looks to me like this is only the pause part.
> For SVQ we need to switch vring addresses, and a full reset of the
> device is required for that. Actually, the pause is just used to
> recover
>
> If you prefer, this could be sent as a separate series where the full
> pause/resume cycle is implemented, and then SVQ uses the pause part.
> However there is no use for the resume part at the moment.


That would be fine if you can send it in another series.


>
>> And even for pause, I don't see anything that prevents rx/tx from being
>> executed? (E.g virtio_net_handle_tx_bh or virtio_net_handle_rx).
>>
> virtio_net_started is called from virtio_net_set_status. If
> _S_DEVICE_STOPPED is set, the former returns false, and the variable
> queue_started is false in the latter:
>    queue_started =
>              virtio_net_started(n, queue_status) && !n->vhost_started;
>
> After that, it should work like a regular device reset or link down if
> I'm not wrong, and the last part of virtio_net_set_status should
> delete the timer or cancel the bh.


You are right.

Thanks


>
>> Thanks
>>
>>
>>> We still need the resume?
>>>
>>> Thanks
>>>
>>>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 13/29] vhost: Add vhost_get_iova_range operation
  2021-06-01  7:17         ` Eugenio Perez Martin
@ 2021-06-03  3:13           ` Jason Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Jason Wang @ 2021-06-03  3:13 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella


在 2021/6/1 下午3:17, Eugenio Perez Martin 写道:
> On Thu, May 27, 2021 at 6:51 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/5/27 上午1:49, Eugenio Perez Martin 写道:
>>> On Wed, May 26, 2021 at 3:14 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> 在 2021/5/20 上午12:28, Eugenio Pérez 写道:
>>>>> For simplicity, if a device does not support this operation it means
>>>>> that it can handle the full (uint64_t)-1 iova address range.
>>>> Note that, we probably need a separate patch for this.
>>>>
>>> Actually the comment is not in the right commit, the next one is the
>>> one that uses it. Is that what you mean?
>>
>> No, it's about the following suggestions.
>>
>>
>>>> And we need to do this during vhost-vdpa initialization. If GPA is out of
>>>> the range, we need to fail the start of vhost-vdpa.
>>
>> Note that this is for the non-IOMMU case. For the vIOMMU case we probably
>> need to validate it against the address width or other similar attributes.
>>
> Right.
>
> What should qemu do if the memory of the guest gets expanded outside
> of the range? I think there is no clear way to fail the memory
> addition, is there?


I'm not sure, but I guess there should be a way to fail the memory hot-add.

(Otherwise we can introduce error reporting for it.)
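
For illustration, a sketch of the start-time check being discussed
(where it hooks in and the helper name are assumptions; the
vhost_get_iova_range op is the one added in this patch):

    /*
     * Sketch: refuse to start vhost-vdpa when a guest memory section's
     * GPA range falls outside the device's usable IOVA window.
     */
    static bool vhost_vdpa_gpa_in_range(struct vhost_dev *dev,
                                        hwaddr gpa_first, hwaddr gpa_last)
    {
        hwaddr first, last;

        if (!dev->vhost_ops->vhost_get_iova_range ||
            dev->vhost_ops->vhost_get_iova_range(dev, &first, &last) != 0) {
            /* No range reported: assume the full address space is usable */
            return true;
        }

        return gpa_first >= first && gpa_last <= last;
    }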

Thanks


>
>> Thanks
>>
>>
>>> Right, that is still to-do.
>>>
>>> Maybe a series with just these two commits and failing the start if
>>> GPA is not in the range, as you say, would help to split the amount of
>>> changes.
>>>
>>> I will send it if no more comments arise about it.
>>>
>>> Thanks!
>>>
>>>> THanks
>>>>
>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>     include/hw/virtio/vhost-backend.h |  5 +++++
>>>>>     hw/virtio/vhost-vdpa.c            | 18 ++++++++++++++++++
>>>>>     hw/virtio/trace-events            |  1 +
>>>>>     3 files changed, 24 insertions(+)
>>>>>
>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>> index 94d3323905..bcb112c166 100644
>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>> @@ -36,6 +36,7 @@ struct vhost_vring_addr;
>>>>>     struct vhost_scsi_target;
>>>>>     struct vhost_iotlb_msg;
>>>>>     struct vhost_virtqueue;
>>>>> +struct vhost_vdpa_iova_range;
>>>>>
>>>>>     typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque);
>>>>>     typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
>>>>> @@ -127,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>>>>>
>>>>>     typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
>>>>>
>>>>> +typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
>>>>> +                                    hwaddr *first, hwaddr *last);
>>>>> +
>>>>>     typedef struct VhostOps {
>>>>>         VhostBackendType backend_type;
>>>>>         vhost_backend_init vhost_backend_init;
>>>>> @@ -173,6 +177,7 @@ typedef struct VhostOps {
>>>>>         vhost_get_device_id_op vhost_get_device_id;
>>>>>         vhost_vring_pause_op vhost_vring_pause;
>>>>>         vhost_force_iommu_op vhost_force_iommu;
>>>>> +    vhost_get_iova_range vhost_get_iova_range;
>>>>>     } VhostOps;
>>>>>
>>>>>     extern const VhostOps user_ops;
>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>> index 01d2101d09..74fe92935e 100644
>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>> @@ -579,6 +579,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
>>>>>         return true;
>>>>>     }
>>>>>
>>>>> +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
>>>>> +                                     hwaddr *first, hwaddr *last)
>>>>> +{
>>>>> +    int ret;
>>>>> +    struct vhost_vdpa_iova_range range;
>>>>> +
>>>>> +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
>>>>> +    if (ret != 0) {
>>>>> +        return ret;
>>>>> +    }
>>>>> +
>>>>> +    *first = range.first;
>>>>> +    *last = range.last;
>>>>> +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>>     const VhostOps vdpa_ops = {
>>>>>             .backend_type = VHOST_BACKEND_TYPE_VDPA,
>>>>>             .vhost_backend_init = vhost_vdpa_init,
>>>>> @@ -611,4 +628,5 @@ const VhostOps vdpa_ops = {
>>>>>             .vhost_get_device_id = vhost_vdpa_get_device_id,
>>>>>             .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
>>>>>             .vhost_force_iommu = vhost_vdpa_force_iommu,
>>>>> +        .vhost_get_iova_range = vhost_vdpa_get_iova_range,
>>>>>     };
>>>>> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
>>>>> index c62727f879..5debe3a681 100644
>>>>> --- a/hw/virtio/trace-events
>>>>> +++ b/hw/virtio/trace-events
>>>>> @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
>>>>>     vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
>>>>>     vhost_vdpa_set_owner(void *dev) "dev: %p"
>>>>>     vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
>>>>> +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
>>>>>
>>>>>     # virtio.c
>>>>>     virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding
  2021-06-02 17:18     ` Eugenio Perez Martin
@ 2021-06-03  3:34       ` Jason Wang
  2021-06-04  8:37         ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-06-03  3:34 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella


在 2021/6/3 上午1:18, Eugenio Perez Martin 写道:
> On Wed, Jun 2, 2021 at 11:51 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/5/20 上午12:28, Eugenio Pérez 写道:
>>> Initial version of the shadow virtqueue that actually forwards buffers.
>>> The exposed addresses are qemu's virtual addresses, so devices with an
>>> IOMMU that does not allow full mapping of qemu's address space do not
>>> work at the moment.
>>>
>>> Also, for simplicity, it only supports modern devices that expect the
>>> vring in little endian, with a split ring and no event idx or indirect
>>> descriptors.
>>>
>>> It reuses the VirtQueue code for the device part. The driver part is
>>> based on Linux's virtio_ring driver, but with stripped functionality
>>> and optimizations so it's easier to review.
>>>
>>> Later commits will solve some of these concerns.
>>
>> It would be much easier to review if you squash those
>> "enhancements" into this patch.
>>
> Ok, they will be in the same commit for the next version.
>
>>> The code also needs to map the used ring (device part) as RW in, and
>>> only in, vhost-net. Mapping (or calling vhost_device_iotlb_miss)
>>> unconditionally would print an error in the case of vhost devices with
>>> their own mapping (vdpa).
>>
>> I think we should not depend on the IOTLB miss. Instead, we should
>> program the device IOTLB before starting the svq. Or is there anything
>> that prevents you from doing this?
>>
> Sorry for being unclear, that is what I meant in the message: no other
> device than kernel vhost needs the map (as in "send the iotlb miss
> ahead of time"), so we must make it conditional. Doing it
> unconditionally would do nothing but make an error appear on the
> screen, and it is incorrect anyway.
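>
> Concretely, the conditional I have in mind mirrors the vhost_dev_start
> test (a sketch; used_ring_iova stands for whatever address the used
> ring ends up mapped at):
>
>     /* Only backends without their own IOTLB (vhost kernel) need this */
>     if (vhost_dev_has_iommu(dev) &&
>         dev->vhost_ops->vhost_set_iotlb_callback) {
>         vhost_device_iotlb_miss(dev, used_ring_iova, true);
>     }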
>
> Is it clearer this way?


So what I'm worrying is the following code:

int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
{
...

     if (dev->shadow_vqs_enabled) {
         /* Shadow virtqueue translations in its Virtual Address Space */
         const VhostDMAMap *result;
         const VhostDMAMap needle = {
             .iova = iova,
         };

         result = vhost_iova_tree_find_taddr(&dev->iova_map, &needle);
         ...

}

I wonder about the reason for doing that (sorry if I've asked this in 
the previous version).

I think the correct way is to map those in the device IOTLB beforehand, 
instead of relying on the miss.

Then we can have a unified code for vhost-vDPA and vhost-kernel.
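
For reference, pre-programming a translation can be as simple as the
following sketch (when and what to map depends on the final IOVA
scheme; the helper name is made up):

    /* Push an SVQ translation ahead of time instead of waiting for a miss */
    static int vhost_svq_premap(struct vhost_dev *dev, void *hva,
                                uint64_t iova, uint64_t len, bool write)
    {
        return vhost_backend_update_device_iotlb(dev, iova,
                                                 (uint64_t)(uintptr_t)hva,
                                                 len,
                                                 write ? IOMMU_RW : IOMMU_RO);
    }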


>
>>> To know if this call is needed, vhost_sw_live_migration_start_vq and
>>> vhost_sw_live_migration_stop copy the test performed in
>>> vhost_dev_start. Testing for the actual backend type could be cleaner,
>>> or checking for non-NULL vhost_force_iommu, enable_custom_iommu, or
>>> another vhostOp. We could extract this test into its own function too,
>>> so its name could give a better hint. For now, just copy the
>>> vhost_dev_start check.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    hw/virtio/vhost-shadow-virtqueue.c | 205 +++++++++++++++++++++++++++--
>>>    hw/virtio/vhost.c                  | 134 ++++++++++++++++++-
>>>    2 files changed, 325 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index ff50f12410..6d767fe248 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -9,6 +9,7 @@
>>>
>>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>    #include "hw/virtio/vhost.h"
>>> +#include "hw/virtio/virtio-access.h"
>>>
>>>    #include "standard-headers/linux/vhost_types.h"
>>>
>>> @@ -48,9 +49,93 @@ typedef struct VhostShadowVirtqueue {
>>>
>>>        /* Virtio device */
>>>        VirtIODevice *vdev;
>>> +
>>> +    /* Map for returning guest's descriptors */
>>> +    VirtQueueElement **ring_id_maps;
>>> +
>>> +    /* Next head to expose to device */
>>> +    uint16_t avail_idx_shadow;
>>> +
>>> +    /* Next free descriptor */
>>> +    uint16_t free_head;
>>> +
>>> +    /* Last seen used idx */
>>> +    uint16_t shadow_used_idx;
>>> +
>>> +    /* Next head to consume from device */
>>> +    uint16_t used_idx;
>>>    } VhostShadowVirtqueue;
>>>
>>> -/* Forward guest notifications */
>>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>> +                                    const struct iovec *iovec,
>>> +                                    size_t num, bool more_descs, bool write)
>>> +{
>>> +    uint16_t i = svq->free_head, last = svq->free_head;
>>> +    unsigned n;
>>> +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
>>> +    vring_desc_t *descs = svq->vring.desc;
>>> +
>>> +    if (num == 0) {
>>> +        return;
>>> +    }
>>> +
>>> +    for (n = 0; n < num; n++) {
>>> +        if (more_descs || (n + 1 < num)) {
>>> +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
>>> +        } else {
>>> +            descs[i].flags = flags;
>>> +        }
>>> +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
>>> +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
>>> +
>>> +        last = i;
>>> +        i = cpu_to_le16(descs[i].next);
>>> +    }
>>> +
>>> +    svq->free_head = le16_to_cpu(descs[last].next);
>>> +}
>>> +
>>> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
>>> +                                          VirtQueueElement *elem)
>>> +{
>>> +    int head;
>>> +    unsigned avail_idx;
>>> +    vring_avail_t *avail = svq->vring.avail;
>>> +
>>> +    head = svq->free_head;
>>> +
>>> +    /* We need some descriptors here */
>>> +    assert(elem->out_num || elem->in_num);
>>> +
>>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>>> +                            elem->in_num > 0, false);
>>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>>> +
>>> +    /*
>>> +     * Put entry in available array (but don't update avail->idx until they
>>> +     * do sync).
>>> +     */
>>> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
>>> +    avail->ring[avail_idx] = cpu_to_le16(head);
>>> +    svq->avail_idx_shadow++;
>>> +
>>> +    /* Expose descriptors to device */
>>
>> It's better to describe the detailed order.
>>
>> E.g. "Update the avail index after the descriptor is written"
>>
> Agree, I will replace it with your wording.
>
>>> +    smp_wmb();
>>> +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
>>> +
>>> +    return head;
>>> +
>>> +}
>>> +
>>> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
>>> +                                VirtQueueElement *elem)
>>> +{
>>> +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
>>> +
>>> +    svq->ring_id_maps[qemu_head] = elem;
>>> +}
>>> +
>>> +/* Handle guest->device notifications */
>>>    static void vhost_handle_guest_kick(EventNotifier *n)
>>>    {
>>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>> @@ -60,7 +145,67 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>>>            return;
>>>        }
>>>
>>> -    event_notifier_set(&svq->kick_notifier);
>>> +    /* Make available as many buffers as possible */
>>> +    do {
>>> +        if (virtio_queue_get_notification(svq->vq)) {
>>> +            /* No more notifications until process all available */
>>> +            virtio_queue_set_notification(svq->vq, false);
>>> +        }
>>> +
>>> +        while (true) {
>>> +            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
>>> +            if (!elem) {
>>> +                break;
>>> +            }
>>> +
>>> +            vhost_shadow_vq_add(svq, elem);
>>> +            event_notifier_set(&svq->kick_notifier);
>>> +        }
>>> +
>>> +        virtio_queue_set_notification(svq->vq, true);
>>> +    } while (!virtio_queue_empty(svq->vq));
>>> +}
>>> +
>>> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
>>> +{
>>> +    if (svq->used_idx != svq->shadow_used_idx) {
>>> +        return true;
>>> +    }
>>> +
>>> +    /* Get used idx must not be reordered */
>>> +    smp_rmb();
>>> +    svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
>>> +
>>> +    return svq->used_idx != svq->shadow_used_idx;
>>> +}
>>> +
>>> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
>>> +{
>>> +    vring_desc_t *descs = svq->vring.desc;
>>> +    const vring_used_t *used = svq->vring.used;
>>> +    vring_used_elem_t used_elem;
>>> +    uint16_t last_used;
>>> +
>>> +    if (!vhost_shadow_vq_more_used(svq)) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    last_used = svq->used_idx & (svq->vring.num - 1);
>>> +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
>>> +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
>>> +
>>> +    if (unlikely(used_elem.id >= svq->vring.num)) {
>>> +        error_report("Device %s says index %u is available", svq->vdev->name,
>>> +                     used_elem.id);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    descs[used_elem.id].next = svq->free_head;
>>> +    svq->free_head = used_elem.id;
>>> +
>>> +    svq->used_idx++;
>>> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
>>> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
>>>    }
>>>
>>>    /* Forward vhost notifications */
>>> @@ -69,17 +214,33 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>>                                                 call_notifier);
>>>        EventNotifier *masked_notifier;
>>> +    VirtQueue *vq = svq->vq;
>>>
>>>        masked_notifier = svq->masked_notifier.n;
>>>
>>> -    if (!masked_notifier) {
>>> -        unsigned n = virtio_get_queue_index(svq->vq);
>>> -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
>>> -        virtio_notify_irqfd(svq->vdev, svq->vq);
>>> -    } else if (!svq->masked_notifier.signaled) {
>>> -        svq->masked_notifier.signaled = true;
>>> -        event_notifier_set(svq->masked_notifier.n);
>>> -    }
>>> +    /* Make as many buffers as possible used. */
>>> +    do {
>>> +        unsigned i = 0;
>>> +
>>> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
>>> +        while (true) {
>>> +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
>>> +            if (!elem) {
>>> +                break;
>>> +            }
>>> +
>>> +            assert(i < svq->vring.num);
>>> +            virtqueue_fill(vq, elem, elem->len, i++);
>>> +        }
>>> +
>>> +        virtqueue_flush(vq, i);
>>> +        if (!masked_notifier) {
>>> +            virtio_notify_irqfd(svq->vdev, svq->vq);
>>
>> Any reason that you don't use virtio_notify() here?
>>
> No reason but to make sure guest_notifier is used. I'm not sure if
> this is an implementation detail though.


The difference is that virtio_notify() will go through the memory API 
which will finally go to irqfd in this case.


>
> I can test to switch to virtio_notify, what would be the advantage here?


Probably no.


>
>>> +        } else if (!svq->masked_notifier.signaled) {
>>> +            svq->masked_notifier.signaled = true;
>>> +            event_notifier_set(svq->masked_notifier.n);
>>> +        }
>>
>> This is an example of the extra complexity if you do shadow virtqueue at
>> virtio level.
>> If you do everything at e.g vhost-vdpa, you don't need to care about
>> masked_notifer at all.
>>
> Correct me if I'm wrong, what you mean is to use the backend
> vhost_set_vring_call to set the guest notifier (from SVQ point of
> view), and then set it unconditionally. The function
> vhost_virtqueue_mask would set the masked one by itself, no
> modification is needed here.


Something like this: from the point of view of vhost, it doesn't even 
need to know whether the notifier is masked or not. All it needs is to 
write to the eventfd set via the vq call.


>
> The backend would still need a conditional check of whether SVQ is
> enabled, so it either sends call_fd to the backend or to SVQ.


Yes.


> The call to
> virtqueue_fill would still be needed if we don't want to duplicate
> all the device virtio's logic in the vhost-vdpa backend.


Yes, you can make the buffer forwarding a common library; then it could 
be used by other vhost backends in the future.

The point is to avoid touching vhost protocols, to reduce the changeset 
and have something minimal for our requirements (vhost-vDPA mainly).


>
> Another possibility would be to just store guest_notifier in SVQ, and
> replace it with masked notifier and back. I think this is more aligned
> with what you have in mind, but it still needs changes to
> vhost_virtqueue_mask. Note that the boolean store
> masked_notifier.signaled is just a (maybe premature) optimization to
> skip the unneeded write syscall, but it could be omitted for brevity.
> Or maybe a cleaner solution is to use io_uring for this write? :).


Looks like not what I meant :)

To clarify, it works like:

1) record the vq call fd1 during vhost_vdpa_set_vring_call
2) when svq is not enabled, set this fd1 to vhost-VDPA via 
VHOST_SET_VRING_CALL
3) when svq is enabled, initialize and set fd2 to vhost-vDPA, poll and 
handle the guest kick via fd1 and relay fd1 to fd2

So we don't need to care much about the masking; in the svq code, we 
just stick to using the fd set via the most recent 
vhost_vdpa_set_vring_call().

That means, if the virtqueue is masked, we're actually using the 
masked_notifier, but it's totally transparent to us.

So the idea is to behave like a normal vhost-vDPA backend, and hide the 
shadowing from the virtio code.
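
A sketch of the relay in 3), assuming fd1/fd2 are eventfds wrapped in
EventNotifiers and that SVQ keeps the guest's call notifier in a field
(guest_call_notifier is an invented name):

    /* Called when the device signals fd2; forward the call to fd1 */
    static void vhost_svq_handle_device_call(EventNotifier *fd2)
    {
        VhostShadowVirtqueue *svq = container_of(fd2, VhostShadowVirtqueue,
                                                 call_notifier);

        if (event_notifier_test_and_clear(fd2)) {
            event_notifier_set(&svq->guest_call_notifier);
        }
    }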

Thanks


>
> Thanks!
>
>> Thanks
>>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ
  2021-06-02 17:51     ` Eugenio Perez Martin
@ 2021-06-03  3:39       ` Jason Wang
  2021-06-04  9:07         ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-06-03  3:39 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella


在 2021/6/3 上午1:51, Eugenio Perez Martin 写道:
> On Wed, Jun 2, 2021 at 11:52 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/5/20 上午12:28, Eugenio Pérez 写道:
>>> Use the translations added in IOVAReverseMaps in SVQ if the vhost
>>> device does not support the mapping of qemu's full virtual address
>>> space. In other cases, the Shadow Virtqueue still uses qemu's virtual
>>> address of the buffer pointed to by the descriptor, which has already
>>> been translated by qemu's VirtQueue machinery.
>>
>> I'd say let's stick to a single kind of translation (iova allocator) that
>> works for all the cases first and add optimizations on top.
>>
> Ok, I will start from here for the next revision.
>
>>> Now every element needs to store the previous address also, so VirtQueue
>>> can consume the elements properly. This adds a little overhead per VQ
>>> element, having to allocate more memory to stash them. As a possible
>>> optimization, this allocation could be avoided if the descriptor is not
>>> a chain but a single one, but this is left undone.
>>>
>>> We also check for vhost_set_iotlb_callback to send the used ring
>>> remapping. This is only needed for vhost-kernel, and would print an
>>> error in the case of vhost devices with their own mapping (vdpa).
>>>
>>> This could change to another callback, like checking for
>>> vhost_force_iommu, enable_custom_iommu, or another. Another option could
>>> be to, at least, extract the check of "is map(used, writable) needed?"
>>> in another function. But at the moment just copy the check used in
>>> vhost_dev_start here.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    hw/virtio/vhost-shadow-virtqueue.c | 134 ++++++++++++++++++++++++++---
>>>    hw/virtio/vhost.c                  |  29 +++++--
>>>    2 files changed, 145 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index 934d3bb27b..a92da979d1 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -10,12 +10,19 @@
>>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>    #include "hw/virtio/vhost.h"
>>>    #include "hw/virtio/virtio-access.h"
>>> +#include "hw/virtio/vhost-iova-tree.h"
>>>
>>>    #include "standard-headers/linux/vhost_types.h"
>>>
>>>    #include "qemu/error-report.h"
>>>    #include "qemu/main-loop.h"
>>>
>>> +typedef struct SVQElement {
>>> +    VirtQueueElement elem;
>>> +    void **in_sg_stash;
>>> +    void **out_sg_stash;
>>
>> Any reason for the trick like this?
>>
>> Can we simply use iovec and iov_copy() here?
>>
> At the moment the device writes the buffer directly to the guest's
> memory, and SVQ only translates the descriptor. In this scenario,
> there would be no need for iov_copy, would there?


It depends on which kinds of translation you used.

If I read the code correctly, stash is used for storing HVAs after the 
HVA->IOVA translation.

This looks exactly like the work of iov (and do we guarantee that there 
will be a 1:1 translation?)

And if the mapping is 1:1 you can simply use iov_copy().
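
For reference, iov_copy() duplicates the iovec array, not the buffers
it points to, so a stash would look roughly like this sketch (elem
being the VirtQueueElement in flight):

    /* Keep the original HVAs aside before translating in place */
    struct iovec stash[VIRTQUEUE_MAX_SIZE];
    size_t n = iov_copy(stash, ARRAY_SIZE(stash),
                        elem->in_sg, elem->in_num, 0, SIZE_MAX);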

But this won't be an option if we always use the iova allocator.


>
> The reason for stashing and unstashing them was to allow the 1:1
> mapping with qemu memory and IOMMU and the iova allocator to work with
> fewer changes. In particular, the reason for unstash is that
> virtqueue_fill expects qemu pointers to set the guest memory page as
> dirty in virtqueue_unmap_sg->dma_memory_unmap.
>
> Now I think that just storing the iova address from the allocator in a
> separate field and using a wrapper to get the IOVA addresses in SVQ
> would be a better idea, so I would change to this if everyone agrees.


I agree.

Thanks


>
> Thanks!
>
>> Thanks
>>
>>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding
  2021-06-03  3:34       ` Jason Wang
@ 2021-06-04  8:37         ` Eugenio Perez Martin
  0 siblings, 0 replies; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-06-04  8:37 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella

On Thu, Jun 3, 2021 at 5:35 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/6/3 上午1:18, Eugenio Perez Martin 写道:
> > On Wed, Jun 2, 2021 at 11:51 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/5/20 上午12:28, Eugenio Pérez 写道:
> >>> Initial version of the shadow virtqueue that actually forwards buffers.
> >>> The exposed addresses are qemu's virtual addresses, so devices with an
> >>> IOMMU that does not allow full mapping of qemu's address space do not
> >>> work at the moment.
> >>>
> >>> Also, for simplicity, it only supports modern devices that expect the
> >>> vring in little endian, with a split ring and no event idx or indirect
> >>> descriptors.
> >>>
> >>> It reuses the VirtQueue code for the device part. The driver part is
> >>> based on Linux's virtio_ring driver, but with stripped functionality
> >>> and optimizations so it's easier to review.
> >>>
> >>> Later commits will solve some of these concerns.
> >>
> >> It would be much easier to review if you squash those
> >> "enhancements" into this patch.
> >>
> > Ok, they will be in the same commit for the next version.
> >
> >>> The code also needs to map the used ring (device part) as RW in, and
> >>> only in, vhost-net. Mapping (or calling vhost_device_iotlb_miss)
> >>> unconditionally would print an error in the case of vhost devices with
> >>> their own mapping (vdpa).
> >>
> >> I think we should not depend on the IOTLB miss. Instead, we should
> >> program the device IOTLB before starting the svq. Or is there anything
> >> that prevents you from doing this?
> >>
> > Sorry for being unclear, that is what I meant in the message: no other
> > device than kernel vhost needs the map (as in "send the iotlb miss
> > ahead of time"), so we must make it conditional. Doing it
> > unconditionally would do nothing but make an error appear on the
> > screen, and it is incorrect anyway.
> >
> > Is it clearer this way?
>
>
> So what I'm worrying is the following code:
>
> int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
> {
> ...
>
>      if (dev->shadow_vqs_enabled) {
>          /* Shadow virtqueue translations in its Virtual Address Space */
>          const VhostDMAMap *result;
>          const VhostDMAMap needle = {
>              .iova = iova,
>          };
>
>          result = vhost_iova_tree_find_taddr(&dev->iova_map, &needle);
>          ...
>
> }
>
> I wonder about the reason for doing that (sorry if I've asked this in
> the previous version).
>
> I think the correct way is to map those in the device IOTLB beforehand,
> instead of relying on the miss.
>
> Then we can have a unified code for vhost-vDPA and vhost-kernel.
>

Sure, we can do it that way; this just followed the usual vhost + IOMMU
way of working.

Since in this case we are using vIOMMU, the code should also take care
of guest's updates. However, I agree it's better to leave this use
case for a future patch, and start just taking into account vhost-vdpa
map/unmap.

>
> >
> >>> To know if this call is needed, vhost_sw_live_migration_start_vq and
> >>> vhost_sw_live_migration_stop copy the test performed in
> >>> vhost_dev_start. Testing for the actual backend type could be cleaner,
> >>> or checking for non-NULL vhost_force_iommu, enable_custom_iommu, or
> >>> another vhostOp. We could extract this test into its own function too,
> >>> so its name could give a better hint. For now, just copy the
> >>> vhost_dev_start check.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    hw/virtio/vhost-shadow-virtqueue.c | 205 +++++++++++++++++++++++++++--
> >>>    hw/virtio/vhost.c                  | 134 ++++++++++++++++++-
> >>>    2 files changed, 325 insertions(+), 14 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index ff50f12410..6d767fe248 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -9,6 +9,7 @@
> >>>
> >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>    #include "hw/virtio/vhost.h"
> >>> +#include "hw/virtio/virtio-access.h"
> >>>
> >>>    #include "standard-headers/linux/vhost_types.h"
> >>>
> >>> @@ -48,9 +49,93 @@ typedef struct VhostShadowVirtqueue {
> >>>
> >>>        /* Virtio device */
> >>>        VirtIODevice *vdev;
> >>> +
> >>> +    /* Map for returning guest's descriptors */
> >>> +    VirtQueueElement **ring_id_maps;
> >>> +
> >>> +    /* Next head to expose to device */
> >>> +    uint16_t avail_idx_shadow;
> >>> +
> >>> +    /* Next free descriptor */
> >>> +    uint16_t free_head;
> >>> +
> >>> +    /* Last seen used idx */
> >>> +    uint16_t shadow_used_idx;
> >>> +
> >>> +    /* Next head to consume from device */
> >>> +    uint16_t used_idx;
> >>>    } VhostShadowVirtqueue;
> >>>
> >>> -/* Forward guest notifications */
> >>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >>> +                                    const struct iovec *iovec,
> >>> +                                    size_t num, bool more_descs, bool write)
> >>> +{
> >>> +    uint16_t i = svq->free_head, last = svq->free_head;
> >>> +    unsigned n;
> >>> +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> >>> +    vring_desc_t *descs = svq->vring.desc;
> >>> +
> >>> +    if (num == 0) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    for (n = 0; n < num; n++) {
> >>> +        if (more_descs || (n + 1 < num)) {
> >>> +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> >>> +        } else {
> >>> +            descs[i].flags = flags;
> >>> +        }
> >>> +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> >>> +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
> >>> +
> >>> +        last = i;
> >>> +        i = cpu_to_le16(descs[i].next);
> >>> +    }
> >>> +
> >>> +    svq->free_head = le16_to_cpu(descs[last].next);
> >>> +}
> >>> +
> >>> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
> >>> +                                          VirtQueueElement *elem)
> >>> +{
> >>> +    int head;
> >>> +    unsigned avail_idx;
> >>> +    vring_avail_t *avail = svq->vring.avail;
> >>> +
> >>> +    head = svq->free_head;
> >>> +
> >>> +    /* We need some descriptors here */
> >>> +    assert(elem->out_num || elem->in_num);
> >>> +
> >>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> >>> +                            elem->in_num > 0, false);
> >>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> >>> +
> >>> +    /*
> >>> +     * Put entry in available array (but don't update avail->idx until they
> >>> +     * do sync).
> >>> +     */
> >>> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> >>> +    avail->ring[avail_idx] = cpu_to_le16(head);
> >>> +    svq->avail_idx_shadow++;
> >>> +
> >>> +    /* Expose descriptors to device */
> >>
> >> It's better to describe the detailed order.
> >>
> >> E.g. "Update the avail index after the descriptor is written"
> >>
> > Agree, I will replace it with your wording.
> >
> >>> +    smp_wmb();
> >>> +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
> >>> +
> >>> +    return head;
> >>> +
> >>> +}
> >>> +
> >>> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
> >>> +                                VirtQueueElement *elem)
> >>> +{
> >>> +    unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
> >>> +
> >>> +    svq->ring_id_maps[qemu_head] = elem;
> >>> +}
> >>> +
> >>> +/* Handle guest->device notifications */
> >>>    static void vhost_handle_guest_kick(EventNotifier *n)
> >>>    {
> >>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>> @@ -60,7 +145,67 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >>>            return;
> >>>        }
> >>>
> >>> -    event_notifier_set(&svq->kick_notifier);
> >>> +    /* Make available as many buffers as possible */
> >>> +    do {
> >>> +        if (virtio_queue_get_notification(svq->vq)) {
> >>> +            /* No more notifications until process all available */
> >>> +            virtio_queue_set_notification(svq->vq, false);
> >>> +        }
> >>> +
> >>> +        while (true) {
> >>> +            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> >>> +            if (!elem) {
> >>> +                break;
> >>> +            }
> >>> +
> >>> +            vhost_shadow_vq_add(svq, elem);
> >>> +            event_notifier_set(&svq->kick_notifier);
> >>> +        }
> >>> +
> >>> +        virtio_queue_set_notification(svq->vq, true);
> >>> +    } while (!virtio_queue_empty(svq->vq));
> >>> +}
> >>> +
> >>> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    if (svq->used_idx != svq->shadow_used_idx) {
> >>> +        return true;
> >>> +    }
> >>> +
> >>> +    /* Get used idx must not be reordered */
> >>> +    smp_rmb();
> >>> +    svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
> >>> +
> >>> +    return svq->used_idx != svq->shadow_used_idx;
> >>> +}
> >>> +
> >>> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    vring_desc_t *descs = svq->vring.desc;
> >>> +    const vring_used_t *used = svq->vring.used;
> >>> +    vring_used_elem_t used_elem;
> >>> +    uint16_t last_used;
> >>> +
> >>> +    if (!vhost_shadow_vq_more_used(svq)) {
> >>> +        return NULL;
> >>> +    }
> >>> +
> >>> +    last_used = svq->used_idx & (svq->vring.num - 1);
> >>> +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
> >>> +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
> >>> +
> >>> +    if (unlikely(used_elem.id >= svq->vring.num)) {
> >>> +        error_report("Device %s says index %u is available", svq->vdev->name,
> >>> +                     used_elem.id);
> >>> +        return NULL;
> >>> +    }
> >>> +
> >>> +    descs[used_elem.id].next = svq->free_head;
> >>> +    svq->free_head = used_elem.id;
> >>> +
> >>> +    svq->used_idx++;
> >>> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> >>> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> >>>    }
> >>>
> >>>    /* Forward vhost notifications */
> >>> @@ -69,17 +214,33 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> >>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>>                                                 call_notifier);
> >>>        EventNotifier *masked_notifier;
> >>> +    VirtQueue *vq = svq->vq;
> >>>
> >>>        masked_notifier = svq->masked_notifier.n;
> >>>
> >>> -    if (!masked_notifier) {
> >>> -        unsigned n = virtio_get_queue_index(svq->vq);
> >>> -        virtio_queue_invalidate_signalled_used(svq->vdev, n);
> >>> -        virtio_notify_irqfd(svq->vdev, svq->vq);
> >>> -    } else if (!svq->masked_notifier.signaled) {
> >>> -        svq->masked_notifier.signaled = true;
> >>> -        event_notifier_set(svq->masked_notifier.n);
> >>> -    }
> >>> +    /* Make as many buffers as possible used. */
> >>> +    do {
> >>> +        unsigned i = 0;
> >>> +
> >>> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> >>> +        while (true) {
> >>> +            g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
> >>> +            if (!elem) {
> >>> +                break;
> >>> +            }
> >>> +
> >>> +            assert(i < svq->vring.num);
> >>> +            virtqueue_fill(vq, elem, elem->len, i++);
> >>> +        }
> >>> +
> >>> +        virtqueue_flush(vq, i);
> >>> +        if (!masked_notifier) {
> >>> +            virtio_notify_irqfd(svq->vdev, svq->vq);
> >>
> >> Any reason that you don't use virtio_notify() here?
> >>
> > No reason but to make sure guest_notifier is used. I'm not sure if
> > this is an implementation detail though.
>
>
> The difference is that virtio_notify() will go through the memory API
> which will finally go to irqfd in this case.
>
>
> >
> > I can test to switch to virtio_notify, what would be the advantage here?
>
>
> Probably no.
>
>
> >
> >>> +        } else if (!svq->masked_notifier.signaled) {
> >>> +            svq->masked_notifier.signaled = true;
> >>> +            event_notifier_set(svq->masked_notifier.n);
> >>> +        }
> >>
> >> This is an example of the extra complexity if you do shadow virtqueue at
> >> virtio level.
> >> If you do everything at e.g vhost-vdpa, you don't need to care about
> >> masked_notifer at all.
> >>
> > Correct me if I'm wrong, what you mean is to use the backend
> > vhost_set_vring_call to set the guest notifier (from SVQ point of
> > view), and then set it unconditionally. The function
> > vhost_virtqueue_mask would set the masked one by itself, no
> > modification is needed here.
>
>
> Something like this: from the point of view of vhost, it doesn't even
> need to know whether the notifier is masked or not. All it needs is to
> write to the eventfd set via the vq call.
>
>
> >
> > The backend would still need a conditional check of whether SVQ is
> > enabled, so it either sends call_fd to the backend or to SVQ.
>
>
> Yes.
>
>
> > The call to
> > virtqueue_fill would still be needed if we don't want to duplicate
> > all the device virtio's logic in the vhost-vdpa backend.
>
>
> Yes, you can make the buffer forwarding a common library; then it could
> be used by other vhost backends in the future.
>
> The point is to avoid touching vhost protocols, to reduce the changeset
> and have something minimal for our requirements (vhost-vDPA mainly).
>
>
> >
> > Another possibility would be to just store guest_notifier in SVQ, and
> > replace it with masked notifier and back. I think this is more aligned
> > with what you have in mind, but it still needs changes to
> > vhost_virtqueue_mask. Note that the boolean store
> > masked_notifier.signaled is just a (maybe premature) optimization to
> > skip the unneeded write syscall, but it could be omitted for brevity.
> > Or maybe a cleaner solution is to use io_uring for this write? :).
>
>
> Looks like not what I meant :)
>
> To clarify, it works like:
>
> 1) record the vq call fd1 during vhost_vdpa_set_vring_call
> 2) when svq is not enabled, set this fd1 to vhost-VDPA via
> VHOST_SET_VRING_CALL
> 3) when svq is enabled, initialize and set fd2 to vhost-vDPA, poll and
> handle the guest kick via fd1 and relay fd1 to fd2
>
> So we don't need to care much about the masking; in the svq code, we
> just stick to using the fd set via the most recent vhost_vdpa_set_vring_call().
>
> That means, if the virtqueue is masked, we're actually using the
> masked_notifier, but it's totally transparent to us.
>
> So the idea is to behave like a normal vhost-vDPA backend, and hide the
> shadowing from the virtio code.
>
> Thanks
>

I'm fine with that approach. It could write many times to
masked_notifier if the guest masks the device and actively polls the
ring, but:
1) I'm not sure any driver actually relies on that, and such drivers
should also indicate that they don't want to be notified through the
avail ring flags.
2) Actual devices cannot do these optimizations; they would write
repeatedly to masked_notifier.

So even if a synthetic test proves that it is beneficial, it probably
is of no use in real use cases.

>
> >
> > Thanks!
> >
> >> Thanks
> >>
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ
  2021-06-03  3:39       ` Jason Wang
@ 2021-06-04  9:07         ` Eugenio Perez Martin
  0 siblings, 0 replies; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-06-04  9:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Michael Lilja,
	Stefano Garzarella

On Thu, Jun 3, 2021 at 5:39 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/6/3 上午1:51, Eugenio Perez Martin 写道:
> > On Wed, Jun 2, 2021 at 11:52 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/5/20 上午12:28, Eugenio Pérez 写道:
> >>> Use the translations added in IOVAReverseMaps in SVQ if the vhost
> >>> device does not support the mapping of qemu's full virtual address
> >>> space. In other cases, the Shadow Virtqueue still uses qemu's virtual
> >>> address of the buffer pointed to by the descriptor, which has already
> >>> been translated by qemu's VirtQueue machinery.
> >>
> >> I'd say let's stick to a single kind of translation (iova allocator) that
> >> works for all the cases first and add optimizations on top.
> >>
> > Ok, I will start from here for the next revision.
> >
> >>> Now every element needs to store the previous address also, so VirtQueue
> >>> can consume the elements properly. This adds a little overhead per VQ
> >>> element, having to allocate more memory to stash them. As a possible
> >>> optimization, this allocation could be avoided if the descriptor is not
> >>> a chain but a single one, but this is left undone.
> >>>
> >>> We also check for vhost_set_iotlb_callback to send the used ring
> >>> remapping. This is only needed for vhost-kernel, and would print an
> >>> error in the case of vhost devices with their own mapping (vdpa).
> >>>
> >>> This could change to another callback, like checking for
> >>> vhost_force_iommu, enable_custom_iommu, or another. Another option could
> >>> be to, at least, extract the check of "is map(used, writable) needed?"
> >>> in another function. But at the moment just copy the check used in
> >>> vhost_dev_start here.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    hw/virtio/vhost-shadow-virtqueue.c | 134 ++++++++++++++++++++++++++---
> >>>    hw/virtio/vhost.c                  |  29 +++++--
> >>>    2 files changed, 145 insertions(+), 18 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index 934d3bb27b..a92da979d1 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -10,12 +10,19 @@
> >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>    #include "hw/virtio/vhost.h"
> >>>    #include "hw/virtio/virtio-access.h"
> >>> +#include "hw/virtio/vhost-iova-tree.h"
> >>>
> >>>    #include "standard-headers/linux/vhost_types.h"
> >>>
> >>>    #include "qemu/error-report.h"
> >>>    #include "qemu/main-loop.h"
> >>>
> >>> +typedef struct SVQElement {
> >>> +    VirtQueueElement elem;
> >>> +    void **in_sg_stash;
> >>> +    void **out_sg_stash;
> >>
> >> Any reason for the trick like this?
> >>
> >> Can we simply use iovec and iov_copy() here?
> >>
> > At the moment the device writes the buffer directly to the guest's
> > memory, and SVQ only translates the descriptor. In this scenario,
> > there would be no need for iov_copy, would there?
>
>
> It depends on which kinds of translation you used.
>
> If I read the code correctly, stash is used for storing HVAs after the
> HVA->IOVA translation.
>
> This looks exactly like the work of iov (and do we guarantee that there
> will be a 1:1 translation?)
>
> And if the mapping is 1:1 you can simply use iov_copy().
>
> But this won't be an option if we always use the iova allocator.
>

Right, the stash is only used in the case of the iova allocator. In the
case of 1:1 translation, svq->iova_map is always NULL and the
_stash/_unstash functions are never called.

And yes, I could have used iov_copy [1], but its check for overlapping
would have been unnecessary. It was like choosing memmove vs memcpy in
my head.
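
To make the wrapper idea concrete, something like this sketch is what I
have in mind (field and helper names are provisional):

    typedef struct SVQElement {
        VirtQueueElement elem;
        /* IOVA of each sg entry, filled from the iova allocator;
         * NULL when the device can use qemu's VAs directly (1:1) */
        hwaddr *in_iova;
        hwaddr *out_iova;
    } SVQElement;

    /* Address to expose to the device for out_sg[n] */
    static hwaddr svq_out_addr(const SVQElement *svq_elem, size_t n)
    {
        return svq_elem->out_iova ?
               svq_elem->out_iova[n] :
               (hwaddr)(uintptr_t)svq_elem->elem.out_sg[n].iov_base;
    }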

Thanks!

[1] I thought you meant iov_to_buf in your last mail, so please omit
the part of the buffer copy in my answer :).

>
> >
> > The reason for stash and unstash them was to allow the 1:1 mapping
> > with qemu memory and IOMMU and iova allocator to work with less
> > changes, In particular, the reason for unstash is that virtqueue_fill,
> > expects qemu pointers to set the guest memory page as dirty in
> > virtqueue_unmap_sg->dma_memory_unmap.
> >
> > Now I think that just storing the iova address from the allocator in a
> > separated field and using a wrapper to get the IOVA addresses in SVQ
> > would be a better idea, so I would change to this if everyone agrees.
>
>
> I agree.
>
> Thanks
>
>
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 04/29] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-05-24  7:13     ` Eugenio Perez Martin
@ 2021-06-08 14:23       ` Markus Armbruster
  2021-06-08 15:26         ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Markus Armbruster @ 2021-06-08 14:23 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	qemu-level, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Michael Lilja, Stefano Garzarella

Eugenio Perez Martin <eperezma@redhat.com> writes:

> On Fri, May 21, 2021 at 9:05 AM Markus Armbruster <armbru@redhat.com> wrote:
>>
>> Eugenio Pérez <eperezma@redhat.com> writes:
>>
>> > Command to enable shadow virtqueue looks like:
>> >
>> > { "execute": "x-vhost-enable-shadow-vq",
>> >   "arguments": { "name": "dev0", "enable": true } }
>> >
>> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> > ---
>> >  qapi/net.json     | 22 ++++++++++++++++++++++
>> >  hw/virtio/vhost.c |  6 ++++++
>> >  2 files changed, 28 insertions(+)
>> >
>> > diff --git a/qapi/net.json b/qapi/net.json
>> > index c31748c87f..660feafdd2 100644
>> > --- a/qapi/net.json
>> > +++ b/qapi/net.json
>> > @@ -77,6 +77,28 @@
>> >  ##
>> >  { 'command': 'netdev_del', 'data': {'id': 'str'} }
>> >
>> > +##
>> > +# @x-vhost-enable-shadow-vq:
>> > +#
>> > +# Use vhost shadow virtqueue.
>> > +#
>> > +# @name: the device name of the VirtIO device
>> > +#
>> > +# @enable: true to use the alternate shadow VQ notification path

[...]

>> > +#
>> > +# Returns: Error if failure, or 'no error' for success. Not found if vhost is not enabled.
>>
>> This is confusing.  What do you mean by "Not found"?
>>
>> If you mean DeviceNotFound:
>>
>> 1. Not actually true: qmp_x_vhost_enable_shadow_vq() always fails with
>> GenericError.  Perhaps later patches will change that.

[...]

>> 2. Do you really need to distinguish "vhost is not enabled" from other
>> errors?
>>
>
> SVQ cannot work if the device backend is not vhost, like qemu VirtIO
> dev. What I meant is that "qemu will only look for its name in the set
> of vhost devices, so you will have a device not found if the device is
> not a vhost one", which may not be 100% clear at first glance. Maybe
> this wording is better?

We might be talking past each other.  Let me try again :)

The following question is *not* about the doc comment, it's about the
*code*: what practical problem is solved by using DeviceNotFound instead
of GenericError for some errors?

[...]



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 04/29] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-06-08 14:23       ` Markus Armbruster
@ 2021-06-08 15:26         ` Eugenio Perez Martin
  2021-06-09 11:46           ` Markus Armbruster
  0 siblings, 1 reply; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-06-08 15:26 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Parav Pandit, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	qemu-level, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Michael Lilja, Stefano Garzarella

On Tue, Jun 8, 2021 at 4:23 PM Markus Armbruster <armbru@redhat.com> wrote:
>
> Eugenio Perez Martin <eperezma@redhat.com> writes:
>
> > On Fri, May 21, 2021 at 9:05 AM Markus Armbruster <armbru@redhat.com> wrote:
> >>
> >> Eugenio Pérez <eperezma@redhat.com> writes:
> >>
> >> > Command to enable shadow virtqueue looks like:
> >> >
> >> > { "execute": "x-vhost-enable-shadow-vq",
> >> >   "arguments": { "name": "dev0", "enable": true } }
> >> >
> >> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >> > ---
> >> >  qapi/net.json     | 22 ++++++++++++++++++++++
> >> >  hw/virtio/vhost.c |  6 ++++++
> >> >  2 files changed, 28 insertions(+)
> >> >
> >> > diff --git a/qapi/net.json b/qapi/net.json
> >> > index c31748c87f..660feafdd2 100644
> >> > --- a/qapi/net.json
> >> > +++ b/qapi/net.json
> >> > @@ -77,6 +77,28 @@
> >> >  ##
> >> >  { 'command': 'netdev_del', 'data': {'id': 'str'} }
> >> >
> >> > +##
> >> > +# @x-vhost-enable-shadow-vq:
> >> > +#
> >> > +# Use vhost shadow virtqueue.
> >> > +#
> >> > +# @name: the device name of the VirtIO device
> >> > +#
> >> > +# @enable: true to use the alternate shadow VQ notification path
>
> [...]
>
> >> > +#
> >> > +# Returns: Error if failure, or 'no error' for success. Not found if vhost is not enabled.
> >>
> >> This is confusing.  What do you mean by "Not found"?
> >>
> >> If you mean DeviceNotFound:
> >>
> >> 1. Not actually true: qmp_x_vhost_enable_shadow_vq() always fails with
> >> GenericError.  Perhaps later patches will change that.
>
> [...]
>
> >> 2. Do you really need to distinguish "vhost is not enabled" from other
> >> errors?
> >>
> >
> > SVQ cannot work if the device backend is not vhost, like qemu VirtIO
> > dev. What I meant is that "qemu will only look for its name in the set
> > of vhost devices, so you will have a device not found if the device is
> > not a vhost one", which may not be 100% clear at first glance. Maybe
> > this wording is better?
>
> We might be talking past each other.  Let me try again :)
>
> The following question is *not* about the doc comment, it's about the
> *code*: what practical problem is solved by using DeviceNotFound instead
> of GenericError for some errors?
>

Sorry, I'm not sure if I follow you :). At risk of being circular on
this topic, the only use case I can think of is to actually tell the
difference between "the device does not exist, or is not a vhost
device" and "the device does not support SVQ because X", where X can
be "it uses a packed ring", "it uses event idx", ...

I can only think of one practical use case: "if you see this error,
you probably forgot to set vhost=on in the command line, or something
is forbidding this device from being a vhost one". Having said that,
I'm totally fine with always using GenericError, but I see it as: the
more fine-grained the error, the better. What would be the advantage
of also using GenericError here?

Just to be sure that we are on the same page, I think this is better
seen from PATCH 07/39: vhost: Route guest->host notification through
shadow virtqueue.

> [...]
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 04/29] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-06-08 15:26         ` Eugenio Perez Martin
@ 2021-06-09 11:46           ` Markus Armbruster
  2021-06-09 14:06             ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Markus Armbruster @ 2021-06-09 11:46 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-level, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Michael Lilja, Stefano Garzarella

Eugenio Perez Martin <eperezma@redhat.com> writes:

> On Tue, Jun 8, 2021 at 4:23 PM Markus Armbruster <armbru@redhat.com> wrote:
>>
>> Eugenio Perez Martin <eperezma@redhat.com> writes:
>>
>> > On Fri, May 21, 2021 at 9:05 AM Markus Armbruster <armbru@redhat.com> wrote:
>> >>
>> >> Eugenio Pérez <eperezma@redhat.com> writes:
>> >>
>> >> > Command to enable shadow virtqueue looks like:
>> >> >
>> >> > { "execute": "x-vhost-enable-shadow-vq",
>> >> >   "arguments": { "name": "dev0", "enable": true } }
>> >> >
>> >> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> >> > ---
>> >> >  qapi/net.json     | 22 ++++++++++++++++++++++
>> >> >  hw/virtio/vhost.c |  6 ++++++
>> >> >  2 files changed, 28 insertions(+)
>> >> >
>> >> > diff --git a/qapi/net.json b/qapi/net.json
>> >> > index c31748c87f..660feafdd2 100644
>> >> > --- a/qapi/net.json
>> >> > +++ b/qapi/net.json
>> >> > @@ -77,6 +77,28 @@
>> >> >  ##
>> >> >  { 'command': 'netdev_del', 'data': {'id': 'str'} }
>> >> >
>> >> > +##
>> >> > +# @x-vhost-enable-shadow-vq:
>> >> > +#
>> >> > +# Use vhost shadow virtqueue.
>> >> > +#
>> >> > +# @name: the device name of the VirtIO device
>> >> > +#
>> >> > +# @enable: true to use the alternate shadow VQ notification path
>>
>> [...]
>>
>> >> > +#
>> >> > +# Returns: Error if failure, or 'no error' for success. Not found if vhost is not enabled.
>> >>
>> >> This is confusing.  What do you mean by "Not found"?
>> >>
>> >> If you mean DeviceNotFound:
>> >>
>> >> 1. Not actually true: qmp_x_vhost_enable_shadow_vq() always fails with
>> >> GenericError.  Perhaps later patches will change that.
>>
>> [...]
>>
>> >> 2. Do you really need to distinguish "vhost is not enabled" from other
>> >> errors?
>> >>
>> >
>> > SVQ cannot work if the device backend is not vhost, like qemu VirtIO
>> > dev. What I meant is that "qemu will only look for its name in the set
>> > of vhost devices, so you will have a device not found if the device is
>> > not a vhost one", which may not be 100% clear at first glance. Maybe
>> > this wording is better?
>>
>> We might be talking past each other.  Let me try again :)
>>
>> The following question is *not* about the doc comment, it's about the
>> *code*: what practical problem is solved by using DeviceNotFound instead
>> of GenericError for some errors?
>>
>
> Sorry, I'm not sure if I follow you :). At risk of being circular on
> this topic, the only use case I can think of is to actually tell the
> difference between "the device does not exist, or is not a vhost
> device" and "the device does not support SVQ because X", where X can
> be "it uses a packed ring", "it uses event idx", ...
>
> I can only think of one practical use case: "if you see this error,
> you probably forgot to set vhost=on in the command line, or something
> is forbidding this device from being a vhost one". Having said that,
> I'm totally fine with always using GenericError, but I see it as: the
> more fine-grained the error, the better. What would be the advantage
> of also using GenericError here?

In the initial design of the Error API, every error had its own distinct
class.  This provided for fine-grained errors.

Adding a new error was bothersome: you had to define a new class, in
qerror.h.  Recompile the world.  Conflict magnet.  Constant temptation
to reuse an existing error even when its error message is suboptimal,
and the reuse of the class for another error conflates errors.

After a bit under three years, we had 70 classes, used in almost 400
places.  Management applications actually cared for just six classes.

Bad error messages and development friction had turned out to be a real
problem.  Conflating errors pretty much not.

We concluded that providing for fine-grained errors when next to nothing
uses them was not worth the pain.  So we ditched them:

    https://lists.nongnu.org/archive/html/qemu-devel/2012-08/msg01958.html
    Commit ac839ccd8c3..adb2072ed0f

Since then, we recommend using GenericError unless there is a
compelling reason not to.  "Something might care someday" doesn't
qualify.

Learning by doing the wrong thing is painful and expensive, but at least
the lessons tend to stick ;)

Today, we have more than 4000 callers of error_setg(), and less than 40
of error_set().
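
For this command, that just means something like the following sketch
(the lookup helper is invented for illustration):

    void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable,
                                      Error **errp)
    {
        struct vhost_dev *dev = vhost_dev_find_by_name(name); /* invented */

        if (!dev) {
            /* Plain GenericError; no fine-grained class needed */
            error_setg(errp, "Device %s is not a vhost device", name);
            return;
        }

        /* ... toggle SVQ on dev ... */
    }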

> Just to be sure that we are on the same page, I think this is better
> seen from PATCH 07/39: vhost: Route guest->host notification through
> shadow virtqueue.
>
>> [...]
>>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 04/29] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-06-09 11:46           ` Markus Armbruster
@ 2021-06-09 14:06             ` Eugenio Perez Martin
  0 siblings, 0 replies; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-06-09 14:06 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-level, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Stefano Garzarella

On Wed, Jun 9, 2021 at 1:46 PM Markus Armbruster <armbru@redhat.com> wrote:
>
> Eugenio Perez Martin <eperezma@redhat.com> writes:
>
> > On Tue, Jun 8, 2021 at 4:23 PM Markus Armbruster <armbru@redhat.com> wrote:
> >>
> >> Eugenio Perez Martin <eperezma@redhat.com> writes:
> >>
> >> > On Fri, May 21, 2021 at 9:05 AM Markus Armbruster <armbru@redhat.com> wrote:
> >> >>
> >> >> Eugenio Pérez <eperezma@redhat.com> writes:
> >> >>
> >> >> > Command to enable shadow virtqueue looks like:
> >> >> >
> >> >> > { "execute": "x-vhost-enable-shadow-vq",
> >> >> >   "arguments": { "name": "dev0", "enable": true } }
> >> >> >
> >> >> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >> >> > ---
> >> >> >  qapi/net.json     | 22 ++++++++++++++++++++++
> >> >> >  hw/virtio/vhost.c |  6 ++++++
> >> >> >  2 files changed, 28 insertions(+)
> >> >> >
> >> >> > diff --git a/qapi/net.json b/qapi/net.json
> >> >> > index c31748c87f..660feafdd2 100644
> >> >> > --- a/qapi/net.json
> >> >> > +++ b/qapi/net.json
> >> >> > @@ -77,6 +77,28 @@
> >> >> >  ##
> >> >> >  { 'command': 'netdev_del', 'data': {'id': 'str'} }
> >> >> >
> >> >> > +##
> >> >> > +# @x-vhost-enable-shadow-vq:
> >> >> > +#
> >> >> > +# Use vhost shadow virtqueue.
> >> >> > +#
> >> >> > +# @name: the device name of the VirtIO device
> >> >> > +#
> >> >> > +# @enable: true to use the alternate shadow VQ notification path
> >>
> >> [...]
> >>
> >> >> > +#
> >> >> > +# Returns: Error if failure, or 'no error' for success. Not found if vhost is not enabled.
> >> >>
> >> >> This is confusing.  What do you mean by "Not found"?
> >> >>
> >> >> If you mean DeviceNotFound:
> >> >>
> >> >> 1. Not actually true: qmp_x_vhost_enable_shadow_vq() always fails with
> >> >> GenericError.  Perhaps later patches will change that.
> >>
> >> [...]
> >>
> >> >> 2. Do you really need to distinguish "vhost is not enabled" from other
> >> >> errors?
> >> >>
> >> >
> >> > SVQ cannot work if the device backend is not vhost, like qemu VirtIO
> >> > dev. What I meant is that "qemu will only look for its name in the set
> >> > of vhost devices, so you will have a device not found if the device is
> >> > not a vhost one", which may not be 100% clear at first glance. Maybe
> >> > this wording is better?
> >>
> >> We might be talking past each other.  Let me try again :)
> >>
> >> The following question is *not* about the doc comment, it's about the
> >> *code*: what practical problem is solved by using DeviceNotFound instead
> >> of GenericError for some errors?
> >>
> >
> > Sorry, I'm not sure if I follow you :). At the risk of being circular on
> > this topic, the only use case I can think of is to actually tell the
> > difference between "the device does not exist, or is not a vhost
> > device" and "the device does not support SVQ because X", where X can
> > be "it uses a packed ring", "it uses event idx", ...
> >
> > I can only think of one practical use case: "if you see this error,
> > you probably forgot to set vhost=on in the command line, or something
> > is preventing this device from being a vhost one". Having said that, I'm
> > totally fine with always using GenericError, but I feel the more
> > fine-grained the error, the better. What would be the advantage of also
> > using GenericError here?
>
> In the initial design of the Error API, every error had its own distinct
> class.  This provided for fine-grained errors.
>
> Adding a new error was bothersome: you had to define a new class, in
> qerror.h.  Recompile the world.  Conflict magnet.  Constant temptation
> to reuse an existing error even when its error message is suboptimal,
> and the reuse of the class for another error conflates errors.
>
> After a bit under three years, we had 70 classes, used in almost 400
> places.  Management applications actually cared for just six classes.
>
> Bad error messages and development friction had turned out to be a real
> problem.  Conflating errors pretty much not.
>
> We concluded that providing for fine-grained errors when next to nothing
> uses them was not worth the pain.  So we ditched them:
>
>     https://lists.nongnu.org/archive/html/qemu-devel/2012-08/msg01958.html
>     Commit ac839ccd8c3..adb2072ed0f
>
> Since then, we have recommended using GenericError unless there is a
> compelling reason not to.  "Something might care someday" doesn't
> qualify.
>
> Learning by doing the wrong thing is painful and expensive, but at least
> the lessons tend to stick ;)
>
> Today, we have more than 4000 callers of error_setg(), and fewer than 40
> of error_set().
>

So let's do it with GenericError then :). Thanks for pointing it out,
it will be fixed in the next revision!
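
For the record, a minimal sketch of how the handler could look with that
recommendation (only qmp_x_vhost_enable_shadow_vq comes from this series;
the lookup helper is a made-up name for illustration):

    void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable,
                                      Error **errp)
    {
        /* hypothetical helper: resolve the name among vhost devices only */
        struct vhost_dev *hdev = vhost_dev_find_by_name(name);

        if (!hdev) {
            /* GenericError; the message still says what went wrong */
            error_setg(errp, "Device '%s' is not a vhost device", name);
            return;
        }

        /* ... flip the shadow virtqueue mode here ... */
    }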

> > Just to be sure that we are on the same page, I think this is better
> > seen from PATCH 07/39: vhost: Route guest->host notification through
> > shadow virtqueue.
> >
> >> [...]
> >>
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 21/29] vhost: Add VhostIOVATree
  2021-06-01  8:15     ` Eugenio Perez Martin
@ 2021-07-14  3:04       ` Jason Wang
  2021-07-14  6:54         ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Wang @ 2021-07-14  3:04 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Eric Blake,
	Michael Lilja, Stefano Garzarella


On 2021/6/1 4:15 PM, Eugenio Perez Martin wrote:
> On Mon, May 31, 2021 at 11:40 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>>> This tree is able to look up a translated address from an IOVA address.
>>>
>>> At first glance it is similar to util/iova-tree. However, SVQ working on
>>> devices with limited IOVA space needs more capabilities, like allocating
>>> IOVA chunks or performing reverse translations (qemu addresses to iova).
>>>
>>> This starts a separate implementation. Knowing that insertions/deletions
>>> will not be as frequent as searches,
>>
>> This might not be true if vIOMMU is enabled.
>>
> Right.
>
>>> it uses an ordered array as the
>>> implementation.
>>
>> I wonder how much overhead g_array could have if it needs to grow.
>>
> I didn't do any tests, actually. But I see this VhostIOVATree as a
> replaceable tool, just to get the buffer translations to work. So I'm
> both ok with changing it now and ok with delaying it, since neither
> should be hard to do.
>
>>>    A different name could be used, but "ordered
>>> searchable array" is a little bit long.
>>
>> Note that we have a very good example for this: the kernel iova
>> allocator, which is implemented via an rbtree.
>>
>> Instead of figuring out g_array vs g_tree trade-offs, I would simply go
>> with g_tree first (based on util/iova-tree) and borrow the well-designed
>> kernel iova allocator API to have a generic IOVA allocator instead of
>> coupling it with vhost. It could be used by other userspace drivers in
>> the future:
>>
>> init_iova_domain()/put_iova_domain();
>>
>> alloc_iova()/free_iova();
>>
>> find_iova();
>>
> We could go that way, but then iova-tree would need to be extended to
> support both translations (iova->translated_addr is implemented in
> iova-tree now; the reverse is not). If I understood you correctly,
> borrowing the kernel iova allocator would give us both, right?


No, the reverse lookup is done via a specific IOMMU driver, if I
understand it correctly.

And if the mapping is 1:1 we can just use two iova-trees, I guess.


>
> Note that it is not coupled to vhost at all except in the name: the
> implementation only works with hwaddr and void pointer memory.
> Just to illustrate the point, I think it could be a drop-in
> replacement for iova-tree at this moment (with all the
> drawbacks/advantages of an array vs tree).


Ok.

Thanks


>
>> Another reference is the iova allocator that is implemented in VFIO.
> I will check this too.
>
> Thanks!
>
>
>> Thanks
>>
>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    hw/virtio/vhost-iova-tree.h |  50 ++++++++++
>>>    hw/virtio/vhost-iova-tree.c | 188 ++++++++++++++++++++++++++++++++++++
>>>    hw/virtio/meson.build       |   2 +-
>>>    3 files changed, 239 insertions(+), 1 deletion(-)
>>>    create mode 100644 hw/virtio/vhost-iova-tree.h
>>>    create mode 100644 hw/virtio/vhost-iova-tree.c
>>>
>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
>>> new file mode 100644
>>> index 0000000000..2a44af8b3a
>>> --- /dev/null
>>> +++ b/hw/virtio/vhost-iova-tree.h
>>> @@ -0,0 +1,50 @@
>>> +/*
>>> + * vhost software live migration ring
>>> + *
>>> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
>>> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
>>> +#define HW_VIRTIO_VHOST_IOVA_TREE_H
>>> +
>>> +#include <gmodule.h>
>>> +
>>> +#include "exec/memory.h"
>>> +
>>> +typedef struct VhostDMAMap {
>>> +    void *translated_addr;
>>> +    hwaddr iova;
>>> +    hwaddr size;                /* Inclusive */
>>> +    IOMMUAccessFlags perm;
>>> +} VhostDMAMap;
>>> +
>>> +typedef enum VhostDMAMapNewRC {
>>> +    VHOST_DMA_MAP_OVERLAP = -2,
>>> +    VHOST_DMA_MAP_INVALID = -1,
>>> +    VHOST_DMA_MAP_OK = 0,
>>> +} VhostDMAMapNewRC;
>>> +
>>> +/**
>>> + * VhostIOVATree
>>> + *
>>> + * Store and search IOVA -> Translated mappings.
>>> + *
>>> + * Note that it cannot remove nodes.
>>> + */
>>> +typedef struct VhostIOVATree {
>>> +    /* Ordered array of reverse translations, IOVA address to qemu memory. */
>>> +    GArray *iova_taddr_map;
>>> +} VhostIOVATree;
>>> +
>>> +void vhost_iova_tree_new(VhostIOVATree *iova_rm);
>>> +void vhost_iova_tree_destroy(VhostIOVATree *iova_rm);
>>> +
>>> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *iova_rm,
>>> +                                              const VhostDMAMap *map);
>>> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *iova_rm,
>>> +                                        VhostDMAMap *map);
>>> +
>>> +#endif
>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
>>> new file mode 100644
>>> index 0000000000..dfd7e448b5
>>> --- /dev/null
>>> +++ b/hw/virtio/vhost-iova-tree.c
>>> @@ -0,0 +1,188 @@
>>> +/*
>>> + * vhost software live migration ring
>>> + *
>>> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
>>> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "vhost-iova-tree.h"
>>> +
>>> +#define G_ARRAY_NOT_ZERO_TERMINATED false
>>> +#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
>>> +
>>> +/**
>>> + * Inserts an element after an existing one in garray.
>>> + *
>>> + * @array      The array
>>> + * @prev_elem  The previous element of the array, or NULL if prepending
>>> + * @map        The DMA map
>>> + *
>>> + * It provides the additional advantage of being type safe over
>>> + * g_array_insert_val, which accepts a reference pointer instead of a value
>>> + * without complaint.
>>> + */
>>> +static void vhost_iova_tree_insert_after(GArray *array,
>>> +                                         const VhostDMAMap *prev_elem,
>>> +                                         const VhostDMAMap *map)
>>> +{
>>> +    size_t pos;
>>> +
>>> +    if (!prev_elem) {
>>> +        pos = 0;
>>> +    } else {
>>> +        pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
>>> +    }
>>> +
>>> +    g_array_insert_val(array, pos, *map);
>>> +}
>>> +
>>> +static gint vhost_iova_tree_cmp_iova(gconstpointer a, gconstpointer b)
>>> +{
>>> +    const VhostDMAMap *m1 = a, *m2 = b;
>>> +
>>> +    if (m1->iova > m2->iova + m2->size) {
>>> +        return 1;
>>> +    }
>>> +
>>> +    if (m1->iova + m1->size < m2->iova) {
>>> +        return -1;
>>> +    }
>>> +
>>> +    /* Overlapped */
>>> +    return 0;
>>> +}
>>> +
>>> +/**
>>> + * Find the previous node to a given iova
>>> + *
>>> + * @array  The ascending ordered-by-iova array of VhostDMAMap
>>> + * @map    The map to insert
>>> + * @prev   Returned location of the previous map
>>> + *
>>> + * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
>>> + * it already exists. It is ok to use this function to check if a given range
>>> + * exists, but it will use a linear search.
>>> + *
>>> + * TODO: We can use bsearch to locate the entry if we save the state in the
>>> + * needle, knowing that the needle is always the first argument to
>>> + * compare_func.
>>> + */
>>> +static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
>>> +                                                  GCompareFunc compare_func,
>>> +                                                  const VhostDMAMap *map,
>>> +                                                  const VhostDMAMap **prev)
>>> +{
>>> +    size_t i;
>>> +    int r;
>>> +
>>> +    *prev = NULL;
>>> +    for (i = 0; i < array->len; ++i) {
>>> +        r = compare_func(map, &g_array_index(array, typeof(*map), i));
>>> +        if (r == 0) {
>>> +            return VHOST_DMA_MAP_OVERLAP;
>>> +        }
>>> +        if (r < 0) {
>>> +            return VHOST_DMA_MAP_OK;
>>> +        }
>>> +
>>> +        *prev = &g_array_index(array, typeof(**prev), i);
>>> +    }
>>> +
>>> +    return VHOST_DMA_MAP_OK;
>>> +}
>>> +
>>> +/**
>>> + * Create a new IOVA tree
>>> + *
>>> + * @tree  The IOVA tree
>>> + */
>>> +void vhost_iova_tree_new(VhostIOVATree *tree)
>>> +{
>>> +    assert(tree);
>>> +
>>> +    tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
>>> +                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
>>> +                                       sizeof(VhostDMAMap));
>>> +}
>>> +
>>> +/**
>>> + * Destroy an IOVA tree
>>> + *
>>> + * @tree  The iova tree
>>> + */
>>> +void vhost_iova_tree_destroy(VhostIOVATree *tree)
>>> +{
>>> +    g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
>>> +}
>>> +
>>> +/**
>>> + * Perform a search on a GArray.
>>> + *
>>> + * @array Glib array
>>> + * @map Map to look up
>>> + * @compare_func Compare function to use
>>> + *
>>> + * Return The found element or NULL if not found.
>>> + *
>>> + * This can be replaced with g_array_binary_search (Since glib 2.62) when that
>>> + * is common enough.
>>> + */
>>> +static const VhostDMAMap *vhost_iova_tree_bsearch(const GArray *array,
>>> +                                                  const VhostDMAMap *map,
>>> +                                                  GCompareFunc compare_func)
>>> +{
>>> +    return bsearch(map, array->data, array->len, sizeof(*map), compare_func);
>>> +}
>>> +
>>> +/**
>>> + * Find the translated address stored from a IOVA address
>>> + *
>>> + * @tree  The iova tree
>>> + * @map   The map with the memory address
>>> + *
>>> + * Return the stored mapping, or NULL if not found.
>>> + */
>>> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
>>> +                                              const VhostDMAMap *map)
>>> +{
>>> +    return vhost_iova_tree_bsearch(tree->iova_taddr_map, map,
>>> +                                  vhost_iova_tree_cmp_iova);
>>> +}
>>> +
>>> +/**
>>> + * Insert a new map
>>> + *
>>> + * @tree  The iova tree
>>> + * @map   The iova map
>>> + *
>>> + * Returns:
>>> + * - VHOST_DMA_MAP_OK if the map fits in the container
>>> + * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
>>> + * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
>>> + * The assigned iova can be queried in map.
>>> + */
>>> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
>>> +                                        VhostDMAMap *map)
>>> +{
>>> +    const VhostDMAMap *prev;
>>> +    int find_prev_rc;
>>> +
>>> +    if (map->translated_addr + map->size < map->translated_addr ||
>>> +        map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
>>> +        return VHOST_DMA_MAP_INVALID;
>>> +    }
>>> +
>>> +    /* Check for duplicates, and save position for insertion */
>>> +    find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
>>> +                                             vhost_iova_tree_cmp_iova, map,
>>> +                                             &prev);
>>> +    if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
>>> +        return VHOST_DMA_MAP_OVERLAP;
>>> +    }
>>> +
>>> +    vhost_iova_tree_insert_after(tree->iova_taddr_map, prev, map);
>>> +    return VHOST_DMA_MAP_OK;
>>> +}
>>> diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
>>> index 8b5a0225fe..cb306b83c6 100644
>>> --- a/hw/virtio/meson.build
>>> +++ b/hw/virtio/meson.build
>>> @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
>>>
>>>    virtio_ss = ss.source_set()
>>>    virtio_ss.add(files('virtio.c'))
>>> -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
>>> +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
>>>    virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
>>>    virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
>>>    virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
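
A side note on the TODO in vhost_iova_tree_find_prev() above: the trick
could look like this (a sketch; the stateful needle type is made up). It
relies on the C guarantee that bsearch() passes the key as the first
argument of the comparison function:

    typedef struct VhostDMAMapNeedle {
        VhostDMAMap map;          /* must be first: we cast from the key */
        const VhostDMAMap *prev;  /* last element seen below the needle */
    } VhostDMAMapNeedle;

    static gint vhost_iova_tree_cmp_iova_stateful(gconstpointer a,
                                                  gconstpointer b)
    {
        VhostDMAMapNeedle *needle = (VhostDMAMapNeedle *)a;
        const VhostDMAMap *elem = b;
        gint r = vhost_iova_tree_cmp_iova(&needle->map, elem);

        if (r > 0) {
            /*
             * bsearch narrows from the left, so this ends up being the
             * rightmost element strictly below the needle.
             */
            needle->prev = elem;
        }
        return r;
    }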



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 21/29] vhost: Add VhostIOVATree
  2021-07-14  3:04       ` Jason Wang
@ 2021-07-14  6:54         ` Eugenio Perez Martin
  2021-07-14  9:14           ` Eugenio Perez Martin
  0 siblings, 1 reply; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-07-14  6:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Eric Blake,
	Michael Lilja, Stefano Garzarella

On Wed, Jul 14, 2021 at 5:04 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/6/1 4:15 PM, Eugenio Perez Martin wrote:
> > On Mon, May 31, 2021 at 11:40 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> >>> This tree is able to look up a translated address from an IOVA address.
> >>>
> >>> At first glance it is similar to util/iova-tree. However, SVQ working on
> >>> devices with limited IOVA space needs more capabilities, like allocating
> >>> IOVA chunks or performing reverse translations (qemu addresses to iova).
> >>>
> >>> This starts a separate implementation. Knowing that insertions/deletions
> >>> will not be as frequent as searches,
> >>
> >> This might not be true if vIOMMU is enabled.
> >>
> > Right.
> >
> >>> it uses an ordered array as the
> >>> implementation.
> >>
> >> I wonder how much overhead g_array could have if it needs to grow.
> >>
> > I didn't do any tests, actually. But I see this VhostIOVATree as a
> > replaceable tool, just to get the buffer translations to work. So I'm
> > both ok with changing it now and ok with delaying it, since neither
> > should be hard to do.
> >
> >>>    A different name could be used, but "ordered
> >>> searchable array" is a little bit long.
> >>
> >> Note that we have a very good example for this: the kernel iova
> >> allocator, which is implemented via an rbtree.
> >>
> >> Instead of figuring out g_array vs g_tree trade-offs, I would simply go
> >> with g_tree first (based on util/iova-tree) and borrow the well-designed
> >> kernel iova allocator API to have a generic IOVA allocator instead of
> >> coupling it with vhost. It could be used by other userspace drivers in
> >> the future:
> >>
> >> init_iova_domain()/put_iova_domain();
> >>
> >> alloc_iova()/free_iova();
> >>
> >> find_iova();
> >>
> > We could go that way, but then iova-tree would need to be extended to
> > support both translations (iova->translated_addr is implemented in
> > iova-tree now; the reverse is not). If I understood you correctly,
> > borrowing the kernel iova allocator would give us both, right?
>
>
> No, the reverse lookup is done via a specific IOMMU driver, if I
> understand it correctly.
>
> And if the mapping is 1:1 we can just use two iova-trees, I guess.
>

I did try with two IOVATrees, but the usage of the reversed one is
confusing at best. To reuse most of the code, .iova needs to act as
.translated_addr, and vice versa. Maybe I can try again using a
wrapper structure that swaps them on each operation (insert,
search, ...).
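
Something like the following is what I have in mind, reusing
util/iova-tree's DMAMap as-is and keeping the swap in one place (a
sketch; BiIOVATree and the helper are made-up names):

    #include "qemu/iova-tree.h"

    /* Two plain IOVATrees; the reverse one keys by translated address */
    typedef struct BiIOVATree {
        IOVATree *iova_taddr;  /* keyed by .iova */
        IOVATree *taddr_iova;  /* .translated_addr stored in .iova */
    } BiIOVATree;

    static int bi_iova_tree_insert(BiIOVATree *tree, const DMAMap *map)
    {
        DMAMap fwd = *map;
        /* Mirror entry: .iova and .translated_addr swapped */
        DMAMap rev = {
            .iova = map->translated_addr,
            .translated_addr = map->iova,
            .size = map->size,
            .perm = map->perm,
        };
        int ret = iova_tree_insert(tree->iova_taddr, &fwd);

        if (ret != IOVA_OK) {
            return ret;
        }
        return iova_tree_insert(tree->taddr_iova, &rev);
    }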

Thanks!

>
> >
> > Note that it is not coupled to vhost at all except in the name: the
> > implementation only works with hwaddr and void pointer memory.
> > Just to illustrate the point, I think it could be a drop-in
> > replacement for iova-tree at this moment (with all the
> > drawbacks/advantages of an array vs tree).
>
>
> Ok.
>
> Thanks
>
>
> >
> >> Another reference is the iova allocator that is implemented in VFIO.
> > I will check this too.
> >
> > Thanks!
> >
> >
> >> Thanks
> >>
> >>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> [...]
>



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 21/29] vhost: Add VhostIOVATree
  2021-07-14  6:54         ` Eugenio Perez Martin
@ 2021-07-14  9:14           ` Eugenio Perez Martin
  2021-07-14  9:33             ` Jason Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Eugenio Perez Martin @ 2021-07-14  9:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Eric Blake,
	Michael Lilja, Stefano Garzarella

On Wed, Jul 14, 2021 at 8:54 AM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Jul 14, 2021 at 5:04 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2021/6/1 4:15 PM, Eugenio Perez Martin wrote:
> > > On Mon, May 31, 2021 at 11:40 AM Jason Wang <jasowang@redhat.com> wrote:
> > >>
> > >> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> > >>> This tree is able to look up a translated address from an IOVA address.
> > >>>
> > >>> At first glance it is similar to util/iova-tree. However, SVQ working on
> > >>> devices with limited IOVA space needs more capabilities, like allocating
> > >>> IOVA chunks or performing reverse translations (qemu addresses to iova).
> > >>>
> > >>> This starts a separate implementation. Knowing that insertions/deletions
> > >>> will not be as frequent as searches,
> > >>
> > >> This might not be true if vIOMMU is enabled.
> > >>
> > > Right.
> > >
> > >>> it uses an ordered array as the
> > >>> implementation.
> > >>
> > >> I wonder how much overhead g_array could have if it needs to grow.
> > >>
> > > I didn't do any tests, actually. But I see this VhostIOVATree as a
> > > replaceable tool, just to get the buffer translations to work. So I'm
> > > both ok with changing it now and ok with delaying it, since neither
> > > should be hard to do.
> > >
> > >>>    A different name could be used, but "ordered
> > >>> searchable array" is a little bit long.
> > >>
> > >> Note that we have a very good example for this: the kernel iova
> > >> allocator, which is implemented via an rbtree.
> > >>
> > >> Instead of figuring out g_array vs g_tree trade-offs, I would simply go
> > >> with g_tree first (based on util/iova-tree) and borrow the well-designed
> > >> kernel iova allocator API to have a generic IOVA allocator instead of
> > >> coupling it with vhost. It could be used by other userspace drivers in
> > >> the future:
> > >>
> > >> init_iova_domain()/put_iova_domain();
> > >>
> > >> alloc_iova()/free_iova();
> > >>
> > >> find_iova();
> > >>
> > > We could go that way, but then iova-tree would need to be extended to
> > > support both translations (iova->translated_addr is implemented in
> > > iova-tree now; the reverse is not). If I understood you correctly,
> > > borrowing the kernel iova allocator would give us both, right?
> >
> >
> > No, the reverse lookup is done via a specific IOMMU driver, if I
> > understand it correctly.
> >
> > And if the mapping is 1:1 we can just use two iova-trees, I guess.
> >
>
> I did try with two IOVATrees, but the usage of the reversed one is
> confusing at best. To reuse most of the code, .iova needs to act as
> .translated_addr, and vice versa. Maybe I can try again using a
> wrapper structure that swaps them on each operation (insert,
> search, ...).
>
> Thanks!
>

Another feature is also needed that IOVATree does not support:
allocation, i.e. searching for a free hole in the iova range [1]. The
closest equivalent of Linux's alloc_iova in this commit is
vhost_iova_tree_insert, but the latter expects iova addresses to be
specified, not "allocated".

My first attempt to solve that was to add a second (third?) tree with
the free space. But this again complicates the code a lot, and misses
a few optimization opportunities (it forces searching in many trees
instead of just one and then reusing the result [2]). Maybe
iova_tree_foreach could be modified to achieve this, but then more
code needs to change.

[1] Ranges themselves are not supported in IOVATree natively, but I
think they are more or less straightforward to implement in
vhost-vdpa.

[2] If I remember correctly.
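
To make [1] concrete, the kind of first-fit hole search I mean would be
something like this over the ordered mappings (a sketch only, reusing
this patch's VhostDMAMap and its inclusive .size convention; the
function name and the iova_last parameter are made up):

    /*
     * Return the first free iova able to hold 'len' bytes, or HWADDR_MAX
     * if no hole in [0, iova_last] is big enough.  'maps' must stay
     * sorted by .iova with no overlaps, as vhost_iova_tree_insert keeps it.
     */
    static hwaddr vhost_iova_alloc_first_fit(const GArray *maps, hwaddr len,
                                             hwaddr iova_last)
    {
        hwaddr hole_start = 0;
        size_t i;

        for (i = 0; i < maps->len; ++i) {
            const VhostDMAMap *m = &g_array_index(maps, VhostDMAMap, i);

            if (m->iova - hole_start >= len) {
                return hole_start;   /* the hole before this mapping fits */
            }
            hole_start = m->iova + m->size + 1;   /* .size is inclusive */
        }

        if (hole_start > iova_last || iova_last - hole_start < len) {
            return HWADDR_MAX;
        }
        return hole_start;
    }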

> >
> > >
> > > Note that it is not coupled to vhost at all except in the name: the
> > > implementation only works with hwaddr and void pointer memory.
> > > Just to illustrate the point, I think it could be a drop-in
> > > replacement for iova-tree at this moment (with all the
> > > drawbacks/advantages of an array vs tree).
> >
> >
> > Ok.
> >
> > Thanks
> >
> >
> > >
> > >> Another reference is the iova allocator that is implemented in VFIO.
> > > I will check this too.
> > >
> > > Thanks!
> > >
> > >
> > >> Thanks
> > >>
> > >>
> > >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > >>> [...]
> >



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC v3 21/29] vhost: Add VhostIOVATree
  2021-07-14  9:14           ` Eugenio Perez Martin
@ 2021-07-14  9:33             ` Jason Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Jason Wang @ 2021-07-14  9:33 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Michael S. Tsirkin, Juan Quintela,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Eric Blake,
	Michael Lilja, Stefano Garzarella


On 2021/7/14 5:14 PM, Eugenio Perez Martin wrote:
> On Wed, Jul 14, 2021 at 8:54 AM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
>> On Wed, Jul 14, 2021 at 5:04 AM Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On 2021/6/1 4:15 PM, Eugenio Perez Martin wrote:
>>>> On Mon, May 31, 2021 at 11:40 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>>>>>> This tree is able to look up a translated address from an IOVA address.
>>>>>>
>>>>>> At first glance it is similar to util/iova-tree. However, SVQ working on
>>>>>> devices with limited IOVA space needs more capabilities, like allocating
>>>>>> IOVA chunks or performing reverse translations (qemu addresses to iova).
>>>>>>
>>>>>> This starts a separate implementation. Knowing that insertions/deletions
>>>>>> will not be as frequent as searches,
>>>>> This might not be true if vIOMMU is enabled.
>>>>>
>>>> Right.
>>>>
>>>>>> it uses an ordered array as the
>>>>>> implementation.
>>>>> I wonder how much overhead g_array could have if it needs to grow.
>>>>>
>>>> I didn't do any tests, actually. But I see this VhostIOVATree as a
>>>> replaceable tool, just to get the buffer translations to work. So I'm
>>>> both ok with changing it now and ok with delaying it, since neither
>>>> should be hard to do.
>>>>
>>>>>>     A different name could be used, but "ordered
>>>>>> searchable array" is a little bit long.
>>>>> Note that we have a very good example for this: the kernel iova
>>>>> allocator, which is implemented via an rbtree.
>>>>>
>>>>> Instead of figuring out g_array vs g_tree trade-offs, I would simply go
>>>>> with g_tree first (based on util/iova-tree) and borrow the well-designed
>>>>> kernel iova allocator API to have a generic IOVA allocator instead of
>>>>> coupling it with vhost. It could be used by other userspace drivers in
>>>>> the future:
>>>>>
>>>>> init_iova_domain()/put_iova_domain();
>>>>>
>>>>> alloc_iova()/free_iova();
>>>>>
>>>>> find_iova();
>>>>>
>>>> We could go that way, but then iova-tree would need to be extended to
>>>> support both translations (iova->translated_addr is implemented in
>>>> iova-tree now; the reverse is not). If I understood you correctly,
>>>> borrowing the kernel iova allocator would give us both, right?
>>>
>>> No, the reverse lookup is done via a specific IOMMU driver, if I
>>> understand it correctly.
>>>
>>> And if the mapping is 1:1 we can just use two iova-trees, I guess.
>>>
>> I did try with two IOVATrees, but the usage of the reversed one is
>> confusing at best. To reuse most of the code, .iova needs to act as
>> .translated_addr, and vice versa. Maybe I can try again using a
>> wrapper structure that swaps them on each operation (insert,
>> search, ...).
>>
>> Thanks!
>>
> Another feature is also needed that IOVATree does not support:
> allocation, i.e. searching for a free hole in the iova range [1]. The
> closest equivalent of Linux's alloc_iova in this commit is
> vhost_iova_tree_insert, but the latter expects iova addresses to be
> specified, not "allocated".


Yes, that's why we need a general iova allocator. (But let's use it for
the shadow virtqueue first, to speed up the merge.)
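
Concretely, a qemu counterpart of that kernel API could be shaped like
this on top of util/iova-tree's DMAMap (a header sketch only; none of
these names exist in qemu today):

    /* sketch: include/qemu/iova-allocator.h (hypothetical) */
    #include "qemu/iova-tree.h"

    typedef struct IOVADomain IOVADomain;

    /* Manage allocations inside [iova_first, iova_last] */
    IOVADomain *iova_domain_new(hwaddr iova_first, hwaddr iova_last);
    void iova_domain_destroy(IOVADomain *dom);

    /* Fill map->iova with a free range of map->size and track it */
    int iova_domain_alloc(IOVADomain *dom, DMAMap *map);
    void iova_domain_free(IOVADomain *dom, const DMAMap *map);

    /* Plain lookup, mirroring the kernel's find_iova() */
    const DMAMap *iova_domain_find(const IOVADomain *dom, hwaddr iova);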


>
> My first attempt to solve that was to add a second (third?) tree with
> the free space. But this again complicates the code a lot, and misses
> a few optimization opportunities (it forces searching in many trees
> instead of just one and then reusing the result [2]). Maybe
> iova_tree_foreach could be modified to achieve this, but then more
> code needs to change.


Ok, I think you can propose what you think is better and let's see the
code then.


>
> [1] Ranges themselves are not supported in IOVATree natively, but I
> think they are more or less straightforward to implement in
> vhost-vdpa.


You mean in the kernel? We only have the abstraction of entries, which
is similar to, but more than, just a simple range.

Thanks


>
>>>> Note that it is not coupled to vhost at all except in the name: the
>>>> implementation only works with hwaddr and void pointer memory.
>>>> Just to illustrate the point, I think it could be a drop-in
>>>> replacement for iova-tree at this moment (with all the
>>>> drawbacks/advantages of an array vs tree).
>>>
>>> Ok.
>>>
>>> Thanks
>>>
>>>
>>>>> Another reference is the iova allocator that is implemented in VFIO.
>>>> I will check this too.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>> ---
>>>>>>     hw/virtio/vhost-iova-tree.h |  50 ++++++++++
>>>>>>     hw/virtio/vhost-iova-tree.c | 188 ++++++++++++++++++++++++++++++++++++
>>>>>>     hw/virtio/meson.build       |   2 +-
>>>>>>     3 files changed, 239 insertions(+), 1 deletion(-)
>>>>>>     create mode 100644 hw/virtio/vhost-iova-tree.h
>>>>>>     create mode 100644 hw/virtio/vhost-iova-tree.c
>>>>>>
>>>>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
>>>>>> new file mode 100644
>>>>>> index 0000000000..2a44af8b3a
>>>>>> --- /dev/null
>>>>>> +++ b/hw/virtio/vhost-iova-tree.h
>>>>>> @@ -0,0 +1,50 @@
>>>>>> +/*
>>>>>> + * vhost software live migration ring
>>>>>> + *
>>>>>> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
>>>>>> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
>>>>>> + *
>>>>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>>>>> + */
>>>>>> +
>>>>>> +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
>>>>>> +#define HW_VIRTIO_VHOST_IOVA_TREE_H
>>>>>> +
>>>>>> +#include <gmodule.h>
>>>>>> +
>>>>>> +#include "exec/memory.h"
>>>>>> +
>>>>>> +typedef struct VhostDMAMap {
>>>>>> +    void *translated_addr;
>>>>>> +    hwaddr iova;
>>>>>> +    hwaddr size;                /* Inclusive */
>>>>>> +    IOMMUAccessFlags perm;
>>>>>> +} VhostDMAMap;
>>>>>> +
>>>>>> +typedef enum VhostDMAMapNewRC {
>>>>>> +    VHOST_DMA_MAP_OVERLAP = -2,
>>>>>> +    VHOST_DMA_MAP_INVALID = -1,
>>>>>> +    VHOST_DMA_MAP_OK = 0,
>>>>>> +} VhostDMAMapNewRC;
>>>>>> +
>>>>>> +/**
>>>>>> + * VhostIOVATree
>>>>>> + *
>>>>>> + * Store and search IOVA -> Translated mappings.
>>>>>> + *
>>>>>> + * Note that it cannot remove nodes.
>>>>>> + */
>>>>>> +typedef struct VhostIOVATree {
>>>>>> +    /* Ordered array of reverse translations, IOVA address to qemu memory. */
>>>>>> +    GArray *iova_taddr_map;
>>>>>> +} VhostIOVATree;
>>>>>> +
>>>>>> +void vhost_iova_tree_new(VhostIOVATree *iova_rm);
>>>>>> +void vhost_iova_tree_destroy(VhostIOVATree *iova_rm);
>>>>>> +
>>>>>> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *iova_rm,
>>>>>> +                                              const VhostDMAMap *map);
>>>>>> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *iova_rm,
>>>>>> +                                        VhostDMAMap *map);
>>>>>> +
>>>>>> +#endif
>>>>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
>>>>>> new file mode 100644
>>>>>> index 0000000000..dfd7e448b5
>>>>>> --- /dev/null
>>>>>> +++ b/hw/virtio/vhost-iova-tree.c
>>>>>> @@ -0,0 +1,188 @@
>>>>>> +/*
>>>>>> + * vhost software live migration ring
>>>>>> + *
>>>>>> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
>>>>>> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
>>>>>> + *
>>>>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>>>>> + */
>>>>>> +
>>>>>> +#include "qemu/osdep.h"
>>>>>> +#include "vhost-iova-tree.h"
>>>>>> +
>>>>>> +#define G_ARRAY_NOT_ZERO_TERMINATED false
>>>>>> +#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
>>>>>> +
>>>>>> +/**
>>>>>> + * Inserts an element after an existing one in garray.
>>>>>> + *
>>>>>> + * @array      The array
>>>>>> + * @prev_elem  The previous element of array of NULL if prepending
> [2] If I remember correctly.
>
>>>>>> + * @map        The DMA map
>>>>>> + *
>>>>>> + * It provides the aditional advantage of being type safe over
>>>>>> + * g_array_insert_val, which accepts a reference pointer instead of a value
>>>>>> + * with no complains.
>>>>>> + */
>>>>>> +static void vhost_iova_tree_insert_after(GArray *array,
>>>>>> +                                         const VhostDMAMap *prev_elem,
>>>>>> +                                         const VhostDMAMap *map)
>>>>>> +{
>>>>>> +    size_t pos;
>>>>>> +
>>>>>> +    if (!prev_elem) {
>>>>>> +        pos = 0;
>>>>>> +    } else {
>>>>>> +        pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
>>>>>> +    }
>>>>>> +
>>>>>> +    g_array_insert_val(array, pos, *map);
>>>>>> +}
>>>>>> +
>>>>>> +static gint vhost_iova_tree_cmp_iova(gconstpointer a, gconstpointer b)
>>>>>> +{
>>>>>> +    const VhostDMAMap *m1 = a, *m2 = b;
>>>>>> +
>>>>>> +    if (m1->iova > m2->iova + m2->size) {
>>>>>> +        return 1;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (m1->iova + m1->size < m2->iova) {
>>>>>> +        return -1;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Overlapped */
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Find the previous node to a given iova
>>>>>> + *
>>>>>> + * @array  The ascending ordered-by-translated-addr array of VhostDMAMap
>>>>>> + * @map    The map to insert
>>>>>> + * @prev   Returned location of the previous map
>>>>>> + *
>>>>>> + * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
>>>>>> + * it already exists. It is ok to use this function to check if a given range
>>>>>> + * exists, but it will use a linear search.
>>>>>> + *
>>>>>> + * TODO: We can use bsearch to locate the entry if we save the state in the
>>>>>> + * needle, knowing that the needle is always the first argument to
>>>>>> + * compare_func.
>>>>>> + */
>>>>>> +static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
>>>>>> +                                                  GCompareFunc compare_func,
>>>>>> +                                                  const VhostDMAMap *map,
>>>>>> +                                                  const VhostDMAMap **prev)
>>>>>> +{
>>>>>> +    size_t i;
>>>>>> +    int r;
>>>>>> +
>>>>>> +    *prev = NULL;
>>>>>> +    for (i = 0; i < array->len; ++i) {
>>>>>> +        r = compare_func(map, &g_array_index(array, typeof(*map), i));
>>>>>> +        if (r == 0) {
>>>>>> +            return VHOST_DMA_MAP_OVERLAP;
>>>>>> +        }
>>>>>> +        if (r < 0) {
>>>>>> +            return VHOST_DMA_MAP_OK;
>>>>>> +        }
>>>>>> +
>>>>>> +        *prev = &g_array_index(array, typeof(**prev), i);
>>>>>> +    }
>>>>>> +
>>>>>> +    return VHOST_DMA_MAP_OK;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Create a new IOVA tree
>>>>>> + *
>>>>>> + * @tree  The IOVA tree
>>>>>> + */
>>>>>> +void vhost_iova_tree_new(VhostIOVATree *tree)
>>>>>> +{
>>>>>> +    assert(tree);
>>>>>> +
>>>>>> +    tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
>>>>>> +                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
>>>>>> +                                       sizeof(VhostDMAMap));
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Destroy an IOVA tree
>>>>>> + *
>>>>>> + * @tree  The iova tree
>>>>>> + */
>>>>>> +void vhost_iova_tree_destroy(VhostIOVATree *tree)
>>>>>> +{
>>>>>> +    g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Perform a search on a GArray.
>>>>>> + *
>>>>>> + * @array Glib array
>>>>>> + * @map Map to look up
>>>>>> + * @compare_func Compare function to use
>>>>>> + *
>>>>>> + * Return The found element or NULL if not found.
>>>>>> + *
>>>>>> + * This can be replaced with g_array_binary_search (Since glib 2.62) when that
>>>>>> + * is common enough.
>>>>>> + */
>>>>>> +static const VhostDMAMap *vhost_iova_tree_bsearch(const GArray *array,
>>>>>> +                                                  const VhostDMAMap *map,
>>>>>> +                                                  GCompareFunc compare_func)
>>>>>> +{
>>>>>> +    return bsearch(map, array->data, array->len, sizeof(*map), compare_func);
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Find the translated address stored from a IOVA address
>>>>>> + *
>>>>>> + * @tree  The iova tree
>>>>>> + * @map   The map with the memory address
>>>>>> + *
>>>>>> + * Return the stored mapping, or NULL if not found.
>>>>>> + */
>>>>>> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
>>>>>> +                                              const VhostDMAMap *map)
>>>>>> +{
>>>>>> +    return vhost_iova_tree_bsearch(tree->iova_taddr_map, map,
>>>>>> +                                  vhost_iova_tree_cmp_iova);
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Insert a new map
>>>>>> + *
>>>>>> + * @tree  The iova tree
>>>>>> + * @map   The iova map
>>>>>> + *
>>>>>> + * Returns:
>>>>>> + * - VHOST_DMA_MAP_OK if the map fits in the container
>>>>>> + * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
>>>>>> + * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
>>>>>> + * The caller can query the assigned iova in map.
>>>>>> + */
>>>>>> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
>>>>>> +                                        VhostDMAMap *map)
>>>>>> +{
>>>>>> +    const VhostDMAMap *prev;
>>>>>> +    int find_prev_rc;
>>>>>> +
>>>>>> +    if (map->translated_addr + map->size < map->translated_addr ||
>>>>>> +        map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
>>>>>> +        return VHOST_DMA_MAP_INVALID;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Check for duplicates, and save position for insertion */
>>>>>> +    find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
>>>>>> +                                             vhost_iova_tree_cmp_iova, map,
>>>>>> +                                             &prev);
>>>>>> +    if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
>>>>>> +        return VHOST_DMA_MAP_OVERLAP;
>>>>>> +    }
>>>>>> +
>>>>>> +    vhost_iova_tree_insert_after(tree->iova_taddr_map, prev, map);
>>>>>> +    return VHOST_DMA_MAP_OK;
>>>>>> +}
>>>>>> diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
>>>>>> index 8b5a0225fe..cb306b83c6 100644
>>>>>> --- a/hw/virtio/meson.build
>>>>>> +++ b/hw/virtio/meson.build
>>>>>> @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
>>>>>>
>>>>>>     virtio_ss = ss.source_set()
>>>>>>     virtio_ss.add(files('virtio.c'))
>>>>>> -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
>>>>>> +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
>>>>>>     virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
>>>>>>     virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
>>>>>>     virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
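
For orientation, here is a minimal usage sketch of the API added above. It is
a sketch only: the header path, the IOMMU_RW flag and the field types are
assumptions based on the code in this patch, not confirmed by it.

    #include <stdint.h>
    #include "hw/virtio/vhost-iova-tree.h" /* assumed header for this file */

    static void iova_tree_example(void *host_buf)
    {
        VhostIOVATree tree;
        VhostDMAMap map = {
            .iova            = 0x100000,
            .translated_addr = (uintptr_t)host_buf, /* qemu VA of the range */
            .size            = 0x10000,
            .perm            = IOMMU_RW, /* IOMMU_NONE is rejected as invalid */
        };

        vhost_iova_tree_new(&tree);

        /* Insert keeps the array ordered; overlaps and overflows are rejected */
        if (vhost_iova_tree_insert(&tree, &map) == VHOST_DMA_MAP_OK) {
            /* Lookup is a binary search keyed on the iova */
            const VhostDMAMap *found = vhost_iova_tree_find_taddr(&tree, &map);
            (void)found;
        }

        vhost_iova_tree_destroy(&tree);
    }

On the TODO in vhost_iova_tree_find_prev: since the needle is always the first
argument to compare_func, the comparator can record the insertion point in the
needle while bsearch(3) runs, so even a failed search leaves the predecessor
behind. A sketch of that idea (VhostDMAMapNeedle and cmp_recording are
hypothetical names, not part of the patch):

    typedef struct {
        VhostDMAMap map;         /* the key being searched for */
        const VhostDMAMap *prev; /* filled in by the comparator */
    } VhostDMAMapNeedle;

    static int cmp_recording(const void *key, const void *elem)
    {
        VhostDMAMapNeedle *needle = (VhostDMAMapNeedle *)key;
        const VhostDMAMap *m = elem;
        int r = vhost_iova_tree_cmp_iova(&needle->map, m);

        if (r > 0) {
            needle->prev = m;    /* last element known to sort before key */
        }
        return r;
    }

    /* If bsearch(&needle, ...) returns NULL (no overlap), needle.prev is
     * the predecessor of the insertion point, or NULL for the front. */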




* Re: [RFC v3 00/29] vDPA software assisted live migration
  2021-05-24 11:29     ` Michael S. Tsirkin
@ 2021-07-19 14:13       ` Stefan Hajnoczi
  0 siblings, 0 replies; 67+ messages in thread
From: Stefan Hajnoczi @ 2021-07-19 14:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Juan Quintela, Jason Wang, qemu-level,
	Markus Armbruster, Eugenio Perez Martin, Harpreet Singh Anand,
	Xiao W Wang, Eli Cohen, virtualization, Eric Blake,
	Michael Lilja, Stefano Garzarella


On Mon, May 24, 2021 at 07:29:06AM -0400, Michael S. Tsirkin wrote:
> On Mon, May 24, 2021 at 12:37:48PM +0200, Eugenio Perez Martin wrote:
> > On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> > > > Commit 17 introduces the buffer forwarding. Previous ones are
> > > > preparations again, and later ones enable some obvious
> > > > optimizations. However, it needs the vdpa device to be able to map
> > > > every IOVA space, and some vDPA devices are not able to do so. Checks
> > > > for this are added in previous commits.
> > >
> > > That might become a significant limitation. And it worries me that
> > > this is such a big patchset which might yet take a while to get
> > > finalized.
> > >
> > 
> > Sorry, maybe I've been unclear here: later commits in this series
> > address this limitation. It is still not perfect: for example, it does
> > not support adding or removing guest memory at the moment, but this
> > should be easy to implement on top.
> > 
> > The main issue I'm observing comes from the kernel, if I'm not wrong: if
> > I unmap every address, I cannot re-map it again. But the code in this
> > patchset is mostly final, except for the comments it may raise on the
> > mailing list, of course.
> > 
> > > I have an idea: how about as a first step we implement a transparent
> > > switch from vdpa to a software virtio in QEMU or a software vhost in
> > > kernel?
> > >
> > > This will give us live migration quickly with performance comparable
> > > to failover but without dependence on guest cooperation.
> > >
> > 
> > I think it should be doable. I'm not sure about the effort that needs
> > to be done in qemu to hide these "hypervisor-failover devices" from
> > the guest's view, but it should be comparable to failover, as you say.
> > 
> > Networking should be ok by its nature, although it could require care
> > in the host hardware setup. But I'm not sure how other types of
> > vhost/vdpa devices may work that way. How would a disk/scsi device
> > switch modes? Can the kernel take control of the vdpa device through
> > vhost, and just start reporting with a dirty bitmap?
> > 
> > Thanks!
> 
> It depends of course, e.g. blk is mostly reads/writes, so there is not
> a lot of state. Just don't reorder or drop requests.

QEMU's virtio-blk does not attempt to change states (e.g. quiesce the
device or switch between vhost kernel/QEMU, etc) while there are
in-flight requests. Instead, all currently active requests must complete
(in some cases they can be cancelled to stop them early). Note that
failed requests can be kept in a list across the switch and then
resubmitted later.

The underlying storage never has requests in flight while the device is
switched. The reason QEMU does this is because there's no way to hand
over an in-flight preadv(2), Linux AIO, or other host kernel block layer
request to another process.
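
A minimal sketch of that hand-over ordering. All names here are illustrative;
this is not QEMU's actual virtio-blk code, just the protocol described above:
quiesce submission, drain in-flight requests, keep failures aside, switch,
then resubmit.

    #include <stddef.h>

    struct request {
        struct request *next;
    };

    struct backend {
        void (*stop_submitting)(struct backend *b);
        /* Blocks until nothing is in flight; returns failed or cancelled
         * requests as a list so the caller can retry them later. */
        struct request *(*drain)(struct backend *b);
        void (*submit)(struct backend *b, struct request *req);
    };

    static void switch_backend(struct backend *from, struct backend *to)
    {
        struct request *failed, *next;

        from->stop_submitting(from); /* no new requests past this point */
        failed = from->drain(from);  /* host kernel now has nothing in flight */

        for (; failed != NULL; failed = next) {
            next = failed->next;
            to->submit(to, failed);  /* resubmit the kept-aside failures */
        }
    }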

Stefan


end of thread (newest: 2021-07-19 14:15 UTC)

Thread overview: 67+ messages
2021-05-19 16:28 [RFC v3 00/29] vDPA software assisted live migration Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 01/29] virtio: Add virtio_queue_is_host_notifier_enabled Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 02/29] vhost: Save masked_notifier state Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 03/29] vhost: Add VhostShadowVirtqueue Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 04/29] vhost: Add x-vhost-enable-shadow-vq qmp Eugenio Pérez
2021-05-21  7:05   ` Markus Armbruster
2021-05-24  7:13     ` Eugenio Perez Martin
2021-06-08 14:23       ` Markus Armbruster
2021-06-08 15:26         ` Eugenio Perez Martin
2021-06-09 11:46           ` Markus Armbruster
2021-06-09 14:06             ` Eugenio Perez Martin
2021-05-19 16:28 ` [RFC v3 05/29] virtio: Add VIRTIO_F_QUEUE_STATE Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED Eugenio Pérez
2021-05-26  1:06   ` Jason Wang
2021-05-26  1:10     ` Jason Wang
2021-06-01  7:13       ` Eugenio Perez Martin
2021-06-03  3:12         ` Jason Wang
2021-05-19 16:28 ` [RFC v3 07/29] vhost: Route guest->host notification through shadow virtqueue Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 08/29] vhost: Route host->guest " Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 09/29] vhost: Avoid re-set masked notifier in shadow vq Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 10/29] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 11/29] vhost: Add vhost_vring_pause operation Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 12/29] vhost: add vhost_kernel_vring_pause Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 13/29] vhost: Add vhost_get_iova_range operation Eugenio Pérez
2021-05-26  1:14   ` Jason Wang
2021-05-26 17:49     ` Eugenio Perez Martin
2021-05-27  4:51       ` Jason Wang
2021-06-01  7:17         ` Eugenio Perez Martin
2021-06-03  3:13           ` Jason Wang
2021-05-19 16:28 ` [RFC v3 14/29] vhost: add vhost_has_limited_iova_range Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 15/29] vhost: Add enable_custom_iommu to VhostOps Eugenio Pérez
2021-05-31  9:01   ` Jason Wang
2021-06-01  7:49     ` Eugenio Perez Martin
2021-05-19 16:28 ` [RFC v3 16/29] vhost-vdpa: Add vhost_vdpa_enable_custom_iommu Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
2021-06-02  9:50   ` Jason Wang
2021-06-02 17:18     ` Eugenio Perez Martin
2021-06-03  3:34       ` Jason Wang
2021-06-04  8:37         ` Eugenio Perez Martin
2021-05-19 16:28 ` [RFC v3 18/29] vhost: Use vhost_enable_custom_iommu to unmap everything if available Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 19/29] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 20/29] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 21/29] vhost: Add VhostIOVATree Eugenio Pérez
2021-05-31  9:40   ` Jason Wang
2021-06-01  8:15     ` Eugenio Perez Martin
2021-07-14  3:04       ` Jason Wang
2021-07-14  6:54         ` Eugenio Perez Martin
2021-07-14  9:14           ` Eugenio Perez Martin
2021-07-14  9:33             ` Jason Wang
2021-05-19 16:28 ` [RFC v3 22/29] vhost: Add iova_rev_maps_find_iova to IOVAReverseMaps Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 23/29] vhost: Use a tree to store memory mappings Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 24/29] vhost: Add iova_rev_maps_alloc Eugenio Pérez
2021-05-19 16:28 ` [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ Eugenio Pérez
2021-06-02  9:51   ` Jason Wang
2021-06-02 17:51     ` Eugenio Perez Martin
2021-06-03  3:39       ` Jason Wang
2021-06-04  9:07         ` Eugenio Perez Martin
2021-05-19 16:29 ` [RFC v3 26/29] vhost: Map in vdpa-dev Eugenio Pérez
2021-05-19 16:29 ` [RFC v3 27/29] vhost-vdpa: Implement vhost_vdpa_vring_pause operation Eugenio Pérez
2021-05-19 16:29 ` [RFC v3 28/29] vhost-vdpa: never map with vDPA listener Eugenio Pérez
2021-05-19 16:29 ` [RFC v3 29/29] vhost: Start vhost-vdpa SVQ directly Eugenio Pérez
2021-05-24  9:38 ` [RFC v3 00/29] vDPA software assisted live migration Michael S. Tsirkin
2021-05-24 10:37   ` Eugenio Perez Martin
2021-05-24 11:29     ` Michael S. Tsirkin
2021-07-19 14:13       ` Stefan Hajnoczi
2021-05-25  0:09     ` Jason Wang
2021-06-02  9:59 ` Jason Wang
