* [PATCH v2 00/14] vDPA shadow virtqueue
@ 2022-02-27 13:40 Eugenio Pérez
  2022-02-27 13:40 ` [PATCH v2 01/14] vhost: Add VhostShadowVirtqueue Eugenio Pérez
                   ` (15 more replies)
  0 siblings, 16 replies; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

This series enables shadow virtqueue (SVQ) for vhost-vdpa devices. It
is intended as a new method of tracking the memory the devices touch
during a migration process: instead of relying on the vhost device's
dirty logging capability, SVQ intercepts the VQ dataplane, forwarding
the descriptors between VM and device. This way qemu is the effective
writer of the guest's memory, just like in qemu's emulated virtio
device operation.

When SVQ is enabled, qemu offers a new virtual address space to the
device to read from and write to, and it maps the new vrings and the
guest memory into it. SVQ also intercepts kicks and calls between the
device and the guest. Relaying the used buffers causes the dirtied
memory to be tracked.
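
As an illustration of how relaying used buffers dirties the memory, a
minimal sketch of the used-buffer relay follows (the helper and field
names come from later patches in this series; take it as a simplified
outline, not the final code):

    /*
     * Sketch: return a used descriptor to the guest. Because
     * virtqueue_fill()/virtqueue_flush() write the guest's used ring
     * through qemu's memory API, the pages are dirtied by qemu itself
     * and migration can track them.
     */
    static void svq_relay_used(VhostShadowVirtqueue *svq,
                               VirtQueueElement *elem, uint32_t len)
    {
        virtqueue_fill(svq->vq, elem, len, 0); /* writes guest memory */
        virtqueue_flush(svq->vq, 1);           /* publishes used->idx */
        event_notifier_set(&svq->svq_call);    /* interrupt the guest */
    }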

This effectively means that vDPA device passthrough is intercepted by
qemu. While SVQ should only be enabled at migration time, switching
from regular mode to SVQ mode is left for a future series.

It is based on the ideas of DPDK SW-assisted LM, from the DPDK series
at https://patchwork.dpdk.org/cover/48370/ . However, that series does
not map the shadow vq in the guest's VA, but in qemu's.

For qemu to use shadow virtqueues, the guest virtio driver must not use
features like event_idx, indirect descriptors, packed or in_order.
These features are easy to implement on top of this base, but they are
left for a future series for simplicity.

SVQ needs to be enabled at qemu start time with the vdpa cmdline parameter:

-netdev type=vhost-vdpa,vhostdev=vhost-vdpa-0,id=vhost-vdpa0,x-svq=on
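
For reference, a complete invocation that pairs the netdev with a
virtio-net device could look like this (the vhostdev path below is only
an example; use the char device your vDPA bus exposes):

    qemu-system-x86_64 ... \
        -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=on \
        -device virtio-net-pci,netdev=vhost-vdpa0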

The first three patches enable notification forwarding with the
assistance of qemu. It's easy to enable only this part if the relevant
cmdline bits of the last patch are applied on top of them.

The next four patches implement the actual buffer forwarding. However,
addresses are not translated from HVA, so they need a host device with
an iommu that allows them to access the whole HVA range.

The last part of the series makes proper use of the host iommu: qemu
creates a new iova address space within the device's supported range
and translates the buffers into it. Finally, it adds the cmdline
parameter.
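
As a rough sketch of that translation step (the interface names follow
the VhostIOVATree added later in the series, so take the exact
signatures as an assumption):

    /*
     * Sketch: translate a qemu VA range into the IOVA the device sees
     * before writing it into a shadow descriptor.
     */
    DMAMap needle = {
        .translated_addr = (hwaddr)(uintptr_t)iov->iov_base,
        .size = iov->iov_len - 1,
    };
    const DMAMap *map = vhost_iova_tree_find_iova(iova_tree, &needle);
    if (unlikely(!map)) {
        return false; /* the range was never mapped for the device */
    }
    descs[i].addr = cpu_to_le64(map->iova + needle.translated_addr -
                                map->translated_addr);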

Some simple performance tests with netperf were done. They used a nested
guest with vp_vdpa, with vhost-kernel at the L0 host. Starting with no SVQ
and a baseline average of ~9980.13Mbps:
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    30.01    9910.61
131072  16384  16384    30.00    10030.94
131072  16384  16384    30.01    9998.84

Enabling the notification interception reduced performance to an
average of ~9577.73Mbit/s:
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    30.00    9563.03
131072  16384  16384    30.01    9626.65
131072  16384  16384    30.01    9543.51

Finally, enabling buffer forwarding reduced the throughput again, to an
average of ~8902.92Mbit/s:
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    30.01    8643.19
131072  16384  16384    30.01    9033.56
131072  16384  16384    30.01    9032.02

However, many performance improvements were left out of this series for
simplicity, so the performance difference should shrink in the future.

Comments are welcome.

TODO in future series:
* Event idx, indirect descriptors, packed, and other virtio features.
* Support a different set of features between the device<->SVQ and the
  SVQ<->guest communication.
* Support of device host notifier memory regions.
* Separate buffer forwarding into its own AIO context, so we can throw
  more threads at that task without needing to stop the main event loop.
* Support multiqueue virtio-net vdpa.
* Proper documentation.

Changes from v1:
* Feature set at device->SVQ is now the same as SVQ->guest.
* Size of SVQ is no longer the maximum available device size, but the
  guest's negotiated size.
* Add VHOST_FILE_UNBIND kick and call fd treatment.
* Make SVQ a public struct
* Come back to previous approach to iova-tree
* Some assertions are now fail paths. Some errors are now log_guest.
* Only mask _F_LOG feature at vdpa_set_features svq enable path.
* Refactor some errors and messages. Add missing error unwindings.
* Add memory barrier at _F_NO_NOTIFY set.
* Stop checking for features flags out of transport range.
v1 link:
https://lore.kernel.org/virtualization/7d86c715-6d71-8a27-91f5-8d47b71e3201@redhat.com/

Changes from v4 RFC:
* Support of allocating / freeing iova ranges in IOVA tree. Extending
  already present iova-tree for that.
* Proper validation of guest features. Now SVQ can negotiate a
  different set of features with the device when enabled.
* Support of host notifiers memory regions
* Handling of SVQ full queue in case the guest's descriptors span
  across different memory regions (qemu's VA chunks).
* Flush pending used buffers at end of SVQ operation.
* QMP command now looks up by NetClientState name. Other devices will
  need to implement their own way to enable vdpa.
* Rename QMP command to set, so it looks more like a way of working
* Better use of qemu error system
* Make a few assertions proper error-handling paths.
* Add more documentation
* Less coupling of virtio / vhost, that could cause friction on changes
* Addressed many other small comments and small fixes.

Changes from v3 RFC:
  * Move everything to vhost-vdpa backend. A big change, this allowed
    some cleanup but more code has been added in other places.
  * More use of glib utilities, especially to manage memory.
v3 link:
https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06032.html

Changes from v2 RFC:
  * Adding vhost-vdpa devices support
  * Fixed some memory leaks pointed out by different comments
v2 link:
https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg05600.html

Changes from v1 RFC:
  * Use QMP instead of migration to start SVQ mode.
  * Only accepting IOMMU devices, closer behavior with target devices
    (vDPA)
  * Fix invalid masking/unmasking of vhost call fd.
  * Use of proper methods for synchronization.
  * No need to modify VirtIO device code, all of the changes are
    contained in vhost code.
  * Delete superfluous code.
  * An intermediate RFC was sent with only the notifications forwarding
    changes. It can be seen in
    https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
v1 link:
https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html

Eugenio Pérez (20):
      virtio: Add VIRTIO_F_QUEUE_STATE
      virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
      virtio: Add virtio_queue_is_host_notifier_enabled
      vhost: Make vhost_virtqueue_{start,stop} public
      vhost: Add x-vhost-enable-shadow-vq qmp
      vhost: Add VhostShadowVirtqueue
      vdpa: Register vdpa devices in a list
      vhost: Route guest->host notification through shadow virtqueue
      Add vhost_svq_get_svq_call_notifier
      Add vhost_svq_set_guest_call_notifier
      vdpa: Save call_fd in vhost-vdpa
      vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
      vhost: Route host->guest notification through shadow virtqueue
      virtio: Add vhost_shadow_vq_get_vring_addr
      vdpa: Save host and guest features
      vhost: Add vhost_svq_valid_device_features to shadow vq
      vhost: Shadow virtqueue buffers forwarding
      vhost: Add VhostIOVATree
      vhost: Use a tree to store memory mappings
      vdpa: Add custom IOTLB translations to SVQ

Eugenio Pérez (14):
  vhost: Add VhostShadowVirtqueue
  vhost: Add Shadow VirtQueue kick forwarding capabilities
  vhost: Add Shadow VirtQueue call forwarding capabilities
  vhost: Add vhost_svq_valid_features to shadow vq
  virtio: Add vhost_shadow_vq_get_vring_addr
  vdpa: adapt vhost_ops callbacks to svq
  vhost: Shadow virtqueue buffers forwarding
  util: Add iova_tree_alloc
  vhost: Add VhostIOVATree
  vdpa: Add custom IOTLB translations to SVQ
  vdpa: Adapt vhost_vdpa_get_vring_base to SVQ
  vdpa: Never set log_base addr if SVQ is enabled
  vdpa: Expose VHOST_F_LOG_ALL on SVQ
  vdpa: Add x-svq to NetdevVhostVDPAOptions

 qapi/net.json                      |   5 +-
 hw/virtio/vhost-iova-tree.h        |  27 ++
 hw/virtio/vhost-shadow-virtqueue.h |  90 ++++
 include/hw/virtio/vhost-vdpa.h     |   8 +
 include/qemu/iova-tree.h           |  18 +
 hw/virtio/vhost-iova-tree.c        | 155 +++++++
 hw/virtio/vhost-shadow-virtqueue.c | 632 +++++++++++++++++++++++++++++
 hw/virtio/vhost-vdpa.c             | 551 ++++++++++++++++++++++++-
 net/vhost-vdpa.c                   |  48 ++-
 util/iova-tree.c                   | 133 ++++++
 hw/virtio/meson.build              |   2 +-
 11 files changed, 1644 insertions(+), 25 deletions(-)
 create mode 100644 hw/virtio/vhost-iova-tree.h
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
 create mode 100644 hw/virtio/vhost-iova-tree.c
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.c

-- 
2.27.0





* [PATCH v2 01/14] vhost: Add VhostShadowVirtqueue
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
@ 2022-02-27 13:40 ` Eugenio Pérez
  2022-02-27 13:40 ` [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities Eugenio Pérez
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

Vhost shadow virtqueue (SVQ) is an intermediate jump for virtqueue
notifications and buffers, allowing qemu to track them. While qemu is
forwarding the buffers and virtqueue changes, it is able to track the
memory that is being dirtied, the same way regular qemu VirtIO devices
do.

This commit only exposes basic SVQ allocation and freeing. The next
patches of the series add functionality like notification and buffer
forwarding.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h | 28 ++++++++++++++
 hw/virtio/vhost-shadow-virtqueue.c | 62 ++++++++++++++++++++++++++++++
 hw/virtio/meson.build              |  2 +-
 3 files changed, 91 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.c

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
new file mode 100644
index 0000000000..f1519e3c7b
--- /dev/null
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -0,0 +1,28 @@
+/*
+ * vhost shadow virtqueue
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef VHOST_SHADOW_VIRTQUEUE_H
+#define VHOST_SHADOW_VIRTQUEUE_H
+
+#include "qemu/event_notifier.h"
+
+/* Shadow virtqueue to relay notifications */
+typedef struct VhostShadowVirtqueue {
+    /* Shadow kick notifier, sent to vhost */
+    EventNotifier hdev_kick;
+    /* Shadow call notifier, sent to vhost */
+    EventNotifier hdev_call;
+} VhostShadowVirtqueue;
+
+VhostShadowVirtqueue *vhost_svq_new(void);
+
+void vhost_svq_free(gpointer vq);
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
+
+#endif
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
new file mode 100644
index 0000000000..019cf1950f
--- /dev/null
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -0,0 +1,62 @@
+/*
+ * vhost shadow virtqueue
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "hw/virtio/vhost-shadow-virtqueue.h"
+
+#include "qemu/error-report.h"
+
+/**
+ * Creates vhost shadow virtqueue, and instructs the vhost device to use the
+ * shadow methods and file descriptors.
+ *
+ * Returns the new virtqueue or NULL.
+ *
+ * In case of error, reason is reported through error_report.
+ */
+VhostShadowVirtqueue *vhost_svq_new(void)
+{
+    g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
+    int r;
+
+    r = event_notifier_init(&svq->hdev_kick, 0);
+    if (r != 0) {
+        error_report("Couldn't create kick event notifier: %s (%d)",
+                     g_strerror(errno), errno);
+        goto err_init_hdev_kick;
+    }
+
+    r = event_notifier_init(&svq->hdev_call, 0);
+    if (r != 0) {
+        error_report("Couldn't create call event notifier: %s (%d)",
+                     g_strerror(errno), errno);
+        goto err_init_hdev_call;
+    }
+
+    return g_steal_pointer(&svq);
+
+err_init_hdev_call:
+    event_notifier_cleanup(&svq->hdev_kick);
+
+err_init_hdev_kick:
+    return NULL;
+}
+
+/**
+ * Free the resources of the shadow virtqueue.
+ *
+ * @pvq  gpointer to SVQ so it can be used by autofree functions.
+ */
+void vhost_svq_free(gpointer pvq)
+{
+    VhostShadowVirtqueue *vq = pvq;
+    event_notifier_cleanup(&vq->hdev_kick);
+    event_notifier_cleanup(&vq->hdev_call);
+    g_free(vq);
+}
diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index 521f7d64a8..2dc87613bc 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
 
 virtio_ss = ss.source_set()
 virtio_ss.add(files('virtio.c'))
-virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c'))
+virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
 virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
-- 
2.27.0




* [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
  2022-02-27 13:40 ` [PATCH v2 01/14] vhost: Add VhostShadowVirtqueue Eugenio Pérez
@ 2022-02-27 13:40 ` Eugenio Pérez
  2022-02-28  2:57     ` Jason Wang
  2022-02-27 13:41 ` [PATCH v2 03/14] vhost: Add Shadow VirtQueue call " Eugenio Pérez
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

In this mode no buffer forwarding is performed yet: qemu just forwards
the guest's kicks to the device.

Host notifier memory regions are left out for simplicity, and they will
not be addressed in this series.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |  14 +++
 include/hw/virtio/vhost-vdpa.h     |   4 +
 hw/virtio/vhost-shadow-virtqueue.c |  52 +++++++++++
 hw/virtio/vhost-vdpa.c             | 145 ++++++++++++++++++++++++++++-
 4 files changed, 213 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index f1519e3c7b..1cbc87d5d8 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -18,8 +18,22 @@ typedef struct VhostShadowVirtqueue {
     EventNotifier hdev_kick;
     /* Shadow call notifier, sent to vhost */
     EventNotifier hdev_call;
+
+    /*
+     * Borrowed virtqueue's guest to host notifier. To borrow it in this event
+     * notifier allows to recover the VhostShadowVirtqueue from the event loop
+     * easily. If we use the VirtQueue's one, we don't have an easy way to
+     * retrieve VhostShadowVirtqueue.
+     *
+     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
+     */
+    EventNotifier svq_kick;
 } VhostShadowVirtqueue;
 
+void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
+
+void vhost_svq_stop(VhostShadowVirtqueue *svq);
+
 VhostShadowVirtqueue *vhost_svq_new(void);
 
 void vhost_svq_free(gpointer vq);
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 3ce79a646d..009a9f3b6b 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -12,6 +12,8 @@
 #ifndef HW_VIRTIO_VHOST_VDPA_H
 #define HW_VIRTIO_VHOST_VDPA_H
 
+#include <gmodule.h>
+
 #include "hw/virtio/virtio.h"
 #include "standard-headers/linux/vhost_types.h"
 
@@ -27,6 +29,8 @@ typedef struct vhost_vdpa {
     bool iotlb_batch_begin_sent;
     MemoryListener listener;
     struct vhost_vdpa_iova_range iova_range;
+    bool shadow_vqs_enabled;
+    GPtrArray *shadow_vqs;
     struct vhost_dev *dev;
     VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
 } VhostVDPA;
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 019cf1950f..a5d0659f86 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -11,6 +11,56 @@
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 
 #include "qemu/error-report.h"
+#include "qemu/main-loop.h"
+#include "linux-headers/linux/vhost.h"
+
+/** Forward guest notifications */
+static void vhost_handle_guest_kick(EventNotifier *n)
+{
+    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
+                                             svq_kick);
+    event_notifier_test_and_clear(n);
+    event_notifier_set(&svq->hdev_kick);
+}
+
+/**
+ * Set a new file descriptor for the guest to kick the SVQ and notify for avail
+ *
+ * @svq          The svq
+ * @svq_kick_fd  The svq kick fd
+ *
+ * Note that the SVQ will never close the old file descriptor.
+ */
+void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
+{
+    EventNotifier *svq_kick = &svq->svq_kick;
+    bool poll_stop = VHOST_FILE_UNBIND != event_notifier_get_fd(svq_kick);
+    bool poll_start = svq_kick_fd != VHOST_FILE_UNBIND;
+
+    if (poll_stop) {
+        event_notifier_set_handler(svq_kick, NULL);
+    }
+
+    /*
+     * event_notifier_set_handler already checks for guest's notifications if
+     * they arrive at the new file descriptor in the switch, so there is no
+     * need to explicitly check for them.
+     */
+    if (poll_start) {
+        event_notifier_init_fd(svq_kick, svq_kick_fd);
+        event_notifier_set(svq_kick);
+        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick);
+    }
+}
+
+/**
+ * Stop the shadow virtqueue operation.
+ * @svq Shadow Virtqueue
+ */
+void vhost_svq_stop(VhostShadowVirtqueue *svq)
+{
+    event_notifier_set_handler(&svq->svq_kick, NULL);
+}
 
 /**
  * Creates vhost shadow virtqueue, and instructs the vhost device to use the
@@ -39,6 +89,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
         goto err_init_hdev_call;
     }
 
+    event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
     return g_steal_pointer(&svq);
 
 err_init_hdev_call:
@@ -56,6 +107,7 @@ err_init_hdev_kick:
 void vhost_svq_free(gpointer pvq)
 {
     VhostShadowVirtqueue *vq = pvq;
+    vhost_svq_stop(vq);
     event_notifier_cleanup(&vq->hdev_kick);
     event_notifier_cleanup(&vq->hdev_call);
     g_free(vq);
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 04ea43704f..454bf50735 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -17,12 +17,14 @@
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/vhost-backend.h"
 #include "hw/virtio/virtio-net.h"
+#include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/vhost-vdpa.h"
 #include "exec/address-spaces.h"
 #include "qemu/main-loop.h"
 #include "cpu.h"
 #include "trace.h"
 #include "qemu-common.h"
+#include "qapi/error.h"
 
 /*
  * Return one past the end of the end of section. Be careful with uint64_t
@@ -342,6 +344,30 @@ static bool vhost_vdpa_one_time_request(struct vhost_dev *dev)
     return v->index != 0;
 }
 
+static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
+                               Error **errp)
+{
+    g_autoptr(GPtrArray) shadow_vqs = NULL;
+
+    if (!v->shadow_vqs_enabled) {
+        return 0;
+    }
+
+    shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
+    for (unsigned n = 0; n < hdev->nvqs; ++n) {
+        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
+
+        if (unlikely(!svq)) {
+            error_setg(errp, "Cannot create svq %u", n);
+            return -1;
+        }
+        g_ptr_array_add(shadow_vqs, g_steal_pointer(&svq));
+    }
+
+    v->shadow_vqs = g_steal_pointer(&shadow_vqs);
+    return 0;
+}
+
 static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
 {
     struct vhost_vdpa *v;
@@ -364,6 +390,10 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
     dev->opaque =  opaque ;
     v->listener = vhost_vdpa_memory_listener;
     v->msg_type = VHOST_IOTLB_MSG_V2;
+    ret = vhost_vdpa_init_svq(dev, v, errp);
+    if (ret) {
+        goto err;
+    }
 
     vhost_vdpa_get_iova_range(v);
 
@@ -375,6 +405,10 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
                                VIRTIO_CONFIG_S_DRIVER);
 
     return 0;
+
+err:
+    ram_block_discard_disable(false);
+    return ret;
 }
 
 static void vhost_vdpa_host_notifier_uninit(struct vhost_dev *dev,
@@ -444,8 +478,14 @@ err:
 
 static void vhost_vdpa_host_notifiers_init(struct vhost_dev *dev)
 {
+    struct vhost_vdpa *v = dev->opaque;
     int i;
 
+    if (v->shadow_vqs_enabled) {
+        /* FIXME SVQ is not compatible with host notifiers mr */
+        return;
+    }
+
     for (i = dev->vq_index; i < dev->vq_index + dev->nvqs; i++) {
         if (vhost_vdpa_host_notifier_init(dev, i)) {
             goto err;
@@ -459,6 +499,21 @@ err:
     return;
 }
 
+static void vhost_vdpa_svq_cleanup(struct vhost_dev *dev)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    size_t idx;
+
+    if (!v->shadow_vqs) {
+        return;
+    }
+
+    for (idx = 0; idx < v->shadow_vqs->len; ++idx) {
+        vhost_svq_stop(g_ptr_array_index(v->shadow_vqs, idx));
+    }
+    g_ptr_array_free(v->shadow_vqs, true);
+}
+
 static int vhost_vdpa_cleanup(struct vhost_dev *dev)
 {
     struct vhost_vdpa *v;
@@ -467,6 +522,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
     trace_vhost_vdpa_cleanup(dev, v);
     vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
     memory_listener_unregister(&v->listener);
+    vhost_vdpa_svq_cleanup(dev);
 
     dev->opaque = NULL;
     ram_block_discard_disable(false);
@@ -558,11 +614,26 @@ static int vhost_vdpa_get_device_id(struct vhost_dev *dev,
     return ret;
 }
 
+static void vhost_vdpa_reset_svq(struct vhost_vdpa *v)
+{
+    if (!v->shadow_vqs_enabled) {
+        return;
+    }
+
+    for (unsigned i = 0; i < v->shadow_vqs->len; ++i) {
+        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
+        vhost_svq_stop(svq);
+    }
+}
+
 static int vhost_vdpa_reset_device(struct vhost_dev *dev)
 {
+    struct vhost_vdpa *v = dev->opaque;
     int ret;
     uint8_t status = 0;
 
+    vhost_vdpa_reset_svq(v);
+
     ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
     trace_vhost_vdpa_reset_device(dev, status);
     return ret;
@@ -646,13 +717,75 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, uint8_t *config,
     return ret;
  }
 
+static int vhost_vdpa_set_vring_dev_kick(struct vhost_dev *dev,
+                                         struct vhost_vring_file *file)
+{
+    trace_vhost_vdpa_set_vring_kick(dev, file->index, file->fd);
+    return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
+}
+
+/**
+ * Set the shadow virtqueue descriptors to the device
+ *
+ * @dev   The vhost device model
+ * @svq   The shadow virtqueue
+ * @idx   The index of the virtqueue in the vhost device
+ * @errp  Error
+ */
+static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
+                                 VhostShadowVirtqueue *svq,
+                                 unsigned idx,
+                                 Error **errp)
+{
+    struct vhost_vring_file file = {
+        .index = dev->vq_index + idx,
+    };
+    const EventNotifier *event_notifier = &svq->hdev_kick;
+    int r;
+
+    file.fd = event_notifier_get_fd(event_notifier);
+    r = vhost_vdpa_set_vring_dev_kick(dev, &file);
+    if (unlikely(r != 0)) {
+        error_setg_errno(errp, -r, "Can't set device kick fd");
+    }
+
+    return r == 0;
+}
+
+static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    Error *err = NULL;
+    unsigned i;
+
+    if (!v->shadow_vqs) {
+        return true;
+    }
+
+    for (i = 0; i < v->shadow_vqs->len; ++i) {
+        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
+        bool ok = vhost_vdpa_svq_setup(dev, svq, i, &err);
+        if (unlikely(!ok)) {
+            error_reportf_err(err, "Cannot setup SVQ %u: ", i);
+            return false;
+        }
+    }
+
+    return true;
+}
+
 static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
 {
     struct vhost_vdpa *v = dev->opaque;
+    bool ok;
     trace_vhost_vdpa_dev_start(dev, started);
 
     if (started) {
         vhost_vdpa_host_notifiers_init(dev);
+        ok = vhost_vdpa_svqs_start(dev);
+        if (unlikely(!ok)) {
+            return -1;
+        }
         vhost_vdpa_set_vring_ready(dev);
     } else {
         vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
@@ -724,8 +857,16 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
 static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
                                        struct vhost_vring_file *file)
 {
-    trace_vhost_vdpa_set_vring_kick(dev, file->index, file->fd);
-    return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
+    struct vhost_vdpa *v = dev->opaque;
+    int vdpa_idx = file->index - dev->vq_index;
+
+    if (v->shadow_vqs_enabled) {
+        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, vdpa_idx);
+        vhost_svq_set_svq_kick_fd(svq, file->fd);
+        return 0;
+    } else {
+        return vhost_vdpa_set_vring_dev_kick(dev, file);
+    }
 }
 
 static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
-- 
2.27.0




* [PATCH v2 03/14] vhost: Add Shadow VirtQueue call forwarding capabilities
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
  2022-02-27 13:40 ` [PATCH v2 01/14] vhost: Add VhostShadowVirtqueue Eugenio Pérez
  2022-02-27 13:40 ` [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-28  3:18     ` Jason Wang
  2022-02-27 13:41 ` [PATCH v2 04/14] vhost: Add vhost_svq_valid_features to shadow vq Eugenio Pérez
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

This makes qemu aware of the buffers the device has used, allowing it
to write their contents to guest memory if needed.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |  4 ++++
 hw/virtio/vhost-shadow-virtqueue.c | 34 ++++++++++++++++++++++++++++++
 hw/virtio/vhost-vdpa.c             | 31 +++++++++++++++++++++++++--
 3 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 1cbc87d5d8..1d4c160d0a 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -28,9 +28,13 @@ typedef struct VhostShadowVirtqueue {
      * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
      */
     EventNotifier svq_kick;
+
+    /* Guest's call notifier, where the SVQ calls guest. */
+    EventNotifier svq_call;
 } VhostShadowVirtqueue;
 
 void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
+void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
 
 void vhost_svq_stop(VhostShadowVirtqueue *svq);
 
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index a5d0659f86..54c701a196 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -23,6 +23,38 @@ static void vhost_handle_guest_kick(EventNotifier *n)
     event_notifier_set(&svq->hdev_kick);
 }
 
+/* Forward vhost notifications */
+static void vhost_svq_handle_call(EventNotifier *n)
+{
+    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
+                                             hdev_call);
+    event_notifier_test_and_clear(n);
+    event_notifier_set(&svq->svq_call);
+}
+
+/**
+ * Set the call notifier for the SVQ to call the guest
+ *
+ * @svq Shadow virtqueue
+ * @call_fd call notifier
+ *
+ * Called on BQL context.
+ */
+void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
+{
+    if (call_fd == VHOST_FILE_UNBIND) {
+        /*
+         * Fail event_notifier_set if called handling device call.
+         *
+         * SVQ still needs device notifications, since it needs to keep
+         * forwarding used buffers even with the unbind.
+         */
+        memset(&svq->svq_call, 0, sizeof(svq->svq_call));
+    } else {
+        event_notifier_init_fd(&svq->svq_call, call_fd);
+    }
+}
+
 /**
  * Set a new file descriptor for the guest to kick the SVQ and notify for avail
  *
@@ -90,6 +122,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
     }
 
     event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
+    event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
     return g_steal_pointer(&svq);
 
 err_init_hdev_call:
@@ -109,6 +142,7 @@ void vhost_svq_free(gpointer pvq)
     VhostShadowVirtqueue *vq = pvq;
     vhost_svq_stop(vq);
     event_notifier_cleanup(&vq->hdev_kick);
+    event_notifier_set_handler(&vq->hdev_call, NULL);
     event_notifier_cleanup(&vq->hdev_call);
     g_free(vq);
 }
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 454bf50735..c73215751d 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -724,6 +724,13 @@ static int vhost_vdpa_set_vring_dev_kick(struct vhost_dev *dev,
     return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
 }
 
+static int vhost_vdpa_set_vring_dev_call(struct vhost_dev *dev,
+                                         struct vhost_vring_file *file)
+{
+    trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
+    return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
+}
+
 /**
  * Set the shadow virtqueue descriptors to the device
  *
@@ -731,6 +738,9 @@ static int vhost_vdpa_set_vring_dev_kick(struct vhost_dev *dev,
  * @svq   The shadow virtqueue
  * @idx   The index of the virtqueue in the vhost device
  * @errp  Error
+ *
+ * Note that this function does not rewind kick file descriptor if cannot set
+ * call one.
  */
 static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
                                  VhostShadowVirtqueue *svq,
@@ -747,6 +757,14 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
     r = vhost_vdpa_set_vring_dev_kick(dev, &file);
     if (unlikely(r != 0)) {
         error_setg_errno(errp, -r, "Can't set device kick fd");
+        return false;
+    }
+
+    event_notifier = &svq->hdev_call;
+    file.fd = event_notifier_get_fd(event_notifier);
+    r = vhost_vdpa_set_vring_dev_call(dev, &file);
+    if (unlikely(r != 0)) {
+        error_setg_errno(errp, -r, "Can't set device call fd");
     }
 
     return r == 0;
@@ -872,8 +890,17 @@ static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
 static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
                                        struct vhost_vring_file *file)
 {
-    trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
-    return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
+    struct vhost_vdpa *v = dev->opaque;
+
+    if (v->shadow_vqs_enabled) {
+        int vdpa_idx = file->index - dev->vq_index;
+        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, vdpa_idx);
+
+        vhost_svq_set_guest_call_notifier(svq, file->fd);
+        return 0;
+    } else {
+        return vhost_vdpa_set_vring_dev_call(dev, file);
+    }
 }
 
 static int vhost_vdpa_get_features(struct vhost_dev *dev,
-- 
2.27.0




* [PATCH v2 04/14] vhost: Add vhost_svq_valid_features to shadow vq
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (2 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 03/14] vhost: Add Shadow VirtQueue call " Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-28  3:25     ` Jason Wang
  2022-02-27 13:41 ` [PATCH v2 05/14] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

This allows SVQ to negotiate features with the guest and the device.
For the device, SVQ is a driver. While this function leaves all
non-transport features untouched, it needs to disable the features that
SVQ does not support when forwarding buffers. This includes the packed
vq layout, indirect descriptors and event idx.

Future changes can add support for offering more features to the guest,
since the use of VirtQueue gives this for free. This is left out for
now for simplicity.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |  2 ++
 hw/virtio/vhost-shadow-virtqueue.c | 39 ++++++++++++++++++++++++++++++
 hw/virtio/vhost-vdpa.c             | 18 ++++++++++++++
 3 files changed, 59 insertions(+)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 1d4c160d0a..84747655ad 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -33,6 +33,8 @@ typedef struct VhostShadowVirtqueue {
     EventNotifier svq_call;
 } VhostShadowVirtqueue;
 
+bool vhost_svq_valid_features(uint64_t *features);
+
 void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
 void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
 
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 54c701a196..34354aea2c 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -14,6 +14,45 @@
 #include "qemu/main-loop.h"
 #include "linux-headers/linux/vhost.h"
 
+/**
+ * Validate the transport device features that both guests can use with the SVQ
+ * and SVQs can use with the device.
+ *
+ * @dev_features  The features. If success, the acknowledged features. If
+ *                failure, the minimal set from it.
+ *
+ * Returns true if SVQ can go with a subset of these, false otherwise.
+ */
+bool vhost_svq_valid_features(uint64_t *features)
+{
+    bool r = true;
+
+    for (uint64_t b = VIRTIO_TRANSPORT_F_START; b <= VIRTIO_TRANSPORT_F_END;
+         ++b) {
+        switch (b) {
+        case VIRTIO_F_ANY_LAYOUT:
+            continue;
+
+        case VIRTIO_F_ACCESS_PLATFORM:
+            /* SVQ trust in the host's IOMMU to translate addresses */
+        case VIRTIO_F_VERSION_1:
+            /* SVQ trust that the guest vring is little endian */
+            if (!(*features & BIT_ULL(b))) {
+                set_bit(b, features);
+                r = false;
+            }
+            continue;
+
+        default:
+            if (*features & BIT_ULL(b)) {
+                clear_bit(b, features);
+            }
+        }
+    }
+
+    return r;
+}
+
 /** Forward guest notifications */
 static void vhost_handle_guest_kick(EventNotifier *n)
 {
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index c73215751d..d614c435f3 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -348,11 +348,29 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
                                Error **errp)
 {
     g_autoptr(GPtrArray) shadow_vqs = NULL;
+    uint64_t dev_features, svq_features;
+    int r;
+    bool ok;
 
     if (!v->shadow_vqs_enabled) {
         return 0;
     }
 
+    r = hdev->vhost_ops->vhost_get_features(hdev, &dev_features);
+    if (r != 0) {
+        error_setg_errno(errp, -r, "Can't get vdpa device features");
+        return r;
+    }
+
+    svq_features = dev_features;
+    ok = vhost_svq_valid_features(&svq_features);
+    if (unlikely(!ok)) {
+        error_setg(errp,
+            "SVQ Invalid device feature flags, offer: 0x%"PRIx64", ok: 0x%"PRIx64,
+            dev_features, svq_features);
+        return -1;
+    }
+
     shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
     for (unsigned n = 0; n < hdev->nvqs; ++n) {
         g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
-- 
2.27.0




* [PATCH v2 05/14] virtio: Add vhost_shadow_vq_get_vring_addr
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (3 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 04/14] vhost: Add vhost_svq_valid_features to shadow vq Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-27 13:41 ` [PATCH v2 06/14] vdpa: adapt vhost_ops callbacks to svq Eugenio Pérez
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

It reports the shadow virtqueue address from qemu's virtual address
space.

Since this will be different from the guest's vaddr, but the device can
access it, SVQ takes special care about its alignment and the lack of
garbage data. It assumes that the IOMMU works in host_page_size ranges
for that.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |  9 +++++++++
 hw/virtio/vhost-shadow-virtqueue.c | 29 +++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 84747655ad..3bbea77082 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -11,9 +11,14 @@
 #define VHOST_SHADOW_VIRTQUEUE_H
 
 #include "qemu/event_notifier.h"
+#include "hw/virtio/virtio.h"
+#include "standard-headers/linux/vhost_types.h"
 
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
+    /* Shadow vring */
+    struct vring vring;
+
     /* Shadow kick notifier, sent to vhost */
     EventNotifier hdev_kick;
     /* Shadow call notifier, sent to vhost */
@@ -37,6 +42,10 @@ bool vhost_svq_valid_features(uint64_t *features);
 
 void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
 void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
+void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
+                              struct vhost_vring_addr *addr);
+size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
+size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
 
 void vhost_svq_stop(VhostShadowVirtqueue *svq);
 
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 34354aea2c..2150e2b071 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -94,6 +94,35 @@ void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
     }
 }
 
+/*
+ * Get the shadow vq vring address.
+ * @svq Shadow virtqueue
+ * @addr Destination to store address
+ */
+void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
+                              struct vhost_vring_addr *addr)
+{
+    addr->desc_user_addr = (uint64_t)svq->vring.desc;
+    addr->avail_user_addr = (uint64_t)svq->vring.avail;
+    addr->used_user_addr = (uint64_t)svq->vring.used;
+}
+
+size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq)
+{
+    size_t desc_size = sizeof(vring_desc_t) * svq->vring.num;
+    size_t avail_size = offsetof(vring_avail_t, ring) +
+                                             sizeof(uint16_t) * svq->vring.num;
+
+    return ROUND_UP(desc_size + avail_size, qemu_real_host_page_size);
+}
+
+size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq)
+{
+    size_t used_size = offsetof(vring_used_t, ring) +
+                                    sizeof(vring_used_elem_t) * svq->vring.num;
+    return ROUND_UP(used_size, qemu_real_host_page_size);
+}
+
 /**
  * Set a new file descriptor for the guest to kick the SVQ and notify for avail
  *
-- 
2.27.0




* [PATCH v2 06/14] vdpa: adapt vhost_ops callbacks to svq
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (4 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 05/14] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-28  3:59     ` Jason Wang
  2022-02-27 13:41 ` [PATCH v2 07/14] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

First half of the buffer forwarding part, preparing the vhost-vdpa
callbacks so SVQ can be offered. QEMU cannot enable it at this moment,
so this is effectively dead code for now, but it helps to reduce the
patch size.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 73 insertions(+), 11 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index d614c435f3..b2c4e92fcf 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -344,6 +344,16 @@ static bool vhost_vdpa_one_time_request(struct vhost_dev *dev)
     return v->index != 0;
 }
 
+static int vhost_vdpa_get_dev_features(struct vhost_dev *dev,
+                                       uint64_t *features)
+{
+    int ret;
+
+    ret = vhost_vdpa_call(dev, VHOST_GET_FEATURES, features);
+    trace_vhost_vdpa_get_features(dev, *features);
+    return ret;
+}
+
 static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
                                Error **errp)
 {
@@ -356,7 +366,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
         return 0;
     }
 
-    r = hdev->vhost_ops->vhost_get_features(hdev, &dev_features);
+    r = vhost_vdpa_get_dev_features(hdev, &dev_features);
     if (r != 0) {
         error_setg_errno(errp, -r, "Can't get vdpa device features");
         return r;
@@ -583,12 +593,26 @@ static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
 static int vhost_vdpa_set_features(struct vhost_dev *dev,
                                    uint64_t features)
 {
+    struct vhost_vdpa *v = dev->opaque;
     int ret;
 
     if (vhost_vdpa_one_time_request(dev)) {
         return 0;
     }
 
+    if (v->shadow_vqs_enabled) {
+        uint64_t features_ok = features;
+        bool ok;
+
+        ok = vhost_svq_valid_features(&features_ok);
+        if (unlikely(!ok)) {
+            error_report(
+                "Invalid guest acked feature flag, acked: 0x%"
+                PRIx64", ok: 0x%"PRIx64, features, features_ok);
+            return -EINVAL;
+        }
+    }
+
     trace_vhost_vdpa_set_features(dev, features);
     ret = vhost_vdpa_call(dev, VHOST_SET_FEATURES, &features);
     if (ret) {
@@ -735,6 +759,13 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, uint8_t *config,
     return ret;
  }
 
+static int vhost_vdpa_set_dev_vring_base(struct vhost_dev *dev,
+                                         struct vhost_vring_state *ring)
+{
+    trace_vhost_vdpa_set_vring_base(dev, ring->index, ring->num);
+    return vhost_vdpa_call(dev, VHOST_SET_VRING_BASE, ring);
+}
+
 static int vhost_vdpa_set_vring_dev_kick(struct vhost_dev *dev,
                                          struct vhost_vring_file *file)
 {
@@ -749,6 +780,18 @@ static int vhost_vdpa_set_vring_dev_call(struct vhost_dev *dev,
     return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
 }
 
+static int vhost_vdpa_set_vring_dev_addr(struct vhost_dev *dev,
+                                         struct vhost_vring_addr *addr)
+{
+    trace_vhost_vdpa_set_vring_addr(dev, addr->index, addr->flags,
+                                addr->desc_user_addr, addr->used_user_addr,
+                                addr->avail_user_addr,
+                                addr->log_guest_addr);
+
+    return vhost_vdpa_call(dev, VHOST_SET_VRING_ADDR, addr);
+
+}
+
 /**
  * Set the shadow virtqueue descriptors to the device
  *
@@ -859,11 +902,17 @@ static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
 static int vhost_vdpa_set_vring_addr(struct vhost_dev *dev,
                                        struct vhost_vring_addr *addr)
 {
-    trace_vhost_vdpa_set_vring_addr(dev, addr->index, addr->flags,
-                                    addr->desc_user_addr, addr->used_user_addr,
-                                    addr->avail_user_addr,
-                                    addr->log_guest_addr);
-    return vhost_vdpa_call(dev, VHOST_SET_VRING_ADDR, addr);
+    struct vhost_vdpa *v = dev->opaque;
+
+    if (v->shadow_vqs_enabled) {
+        /*
+         * Device vring addr was set at device start. SVQ base is handled by
+         * VirtQueue code.
+         */
+        return 0;
+    }
+
+    return vhost_vdpa_set_vring_dev_addr(dev, addr);
 }
 
 static int vhost_vdpa_set_vring_num(struct vhost_dev *dev,
@@ -876,8 +925,17 @@ static int vhost_vdpa_set_vring_num(struct vhost_dev *dev,
 static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
                                        struct vhost_vring_state *ring)
 {
-    trace_vhost_vdpa_set_vring_base(dev, ring->index, ring->num);
-    return vhost_vdpa_call(dev, VHOST_SET_VRING_BASE, ring);
+    struct vhost_vdpa *v = dev->opaque;
+
+    if (v->shadow_vqs_enabled) {
+        /*
+         * Device vring base was set at device start. SVQ base is handled by
+         * VirtQueue code.
+         */
+        return 0;
+    }
+
+    return vhost_vdpa_set_dev_vring_base(dev, ring);
 }
 
 static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
@@ -924,10 +982,14 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
 static int vhost_vdpa_get_features(struct vhost_dev *dev,
                                      uint64_t *features)
 {
-    int ret;
+    struct vhost_vdpa *v = dev->opaque;
+    int ret = vhost_vdpa_get_dev_features(dev, features);
+
+    if (ret == 0 && v->shadow_vqs_enabled) {
+        /* Filter only features that SVQ can offer to guest */
+        vhost_svq_valid_features(features);
+    }
 
-    ret = vhost_vdpa_call(dev, VHOST_GET_FEATURES, features);
-    trace_vhost_vdpa_get_features(dev, *features);
     return ret;
 }
 
-- 
2.27.0




* [PATCH v2 07/14] vhost: Shadow virtqueue buffers forwarding
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (5 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 06/14] vdpa: adapt vhost_ops callbacks to svq Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-28  5:39     ` Jason Wang
  2022-02-27 13:41 ` [PATCH v2 08/14] util: Add iova_tree_alloc Eugenio Pérez
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

Initial version of the shadow virtqueue that actually forwards buffers.
There is no iommu support at the moment, and that will be addressed in
future patches of this series. Since all vhost-vdpa devices use forced
IOMMU, this means that SVQ is not usable at this point of the series on
any device.

For simplicity it only supports modern devices, which expect the vring
in little endian, with a split ring and no event idx or indirect
descriptors. Support for them will not be added in this series.

It reuses the VirtQueue code for the device part. The driver part is
based on Linux's virtio_ring driver, but with stripped-down
functionality and optimizations so it's easier to review.

However, forwarding buffers has some particular pieces: one of the most
unexpected ones is that a guest's buffer can expand across more than
one descriptor in SVQ. While this is handled gracefully by qemu's
emulated virtio devices, it may cause an unexpected SVQ queue full.
This patch also solves it by checking for this condition at both the
guest's kicks and the device's calls. The code may be more elegant in
the future if the SVQ code runs in its own iocontext.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |  29 +++
 hw/virtio/vhost-shadow-virtqueue.c | 360 ++++++++++++++++++++++++++++-
 hw/virtio/vhost-vdpa.c             | 165 ++++++++++++-
 3 files changed, 542 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 3bbea77082..04c67685fd 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -36,6 +36,33 @@ typedef struct VhostShadowVirtqueue {
 
     /* Guest's call notifier, where the SVQ calls guest. */
     EventNotifier svq_call;
+
+    /* Virtio queue shadowing */
+    VirtQueue *vq;
+
+    /* Virtio device */
+    VirtIODevice *vdev;
+
+    /* Map for use the guest's descriptors */
+    VirtQueueElement **ring_id_maps;
+
+    /* Next VirtQueue element that guest made available */
+    VirtQueueElement *next_guest_avail_elem;
+
+    /* Next head to expose to the device */
+    uint16_t avail_idx_shadow;
+
+    /* Next free descriptor */
+    uint16_t free_head;
+
+    /* Last seen used idx */
+    uint16_t shadow_used_idx;
+
+    /* Next head to consume from the device */
+    uint16_t last_used_idx;
+
+    /* Cache for the exposed notification flag */
+    bool notification;
 } VhostShadowVirtqueue;
 
 bool vhost_svq_valid_features(uint64_t *features);
@@ -47,6 +74,8 @@ void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
 size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
 size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
 
+void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
+                     VirtQueue *vq);
 void vhost_svq_stop(VhostShadowVirtqueue *svq);
 
 VhostShadowVirtqueue *vhost_svq_new(void);
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 2150e2b071..a38d313755 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -12,6 +12,7 @@
 
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "qemu/log.h"
 #include "linux-headers/linux/vhost.h"
 
 /**
@@ -53,22 +54,309 @@ bool vhost_svq_valid_features(uint64_t *features)
     return r;
 }
 
-/** Forward guest notifications */
-static void vhost_handle_guest_kick(EventNotifier *n)
+/**
+ * Number of descriptors that the SVQ can make available from the guest.
+ *
+ * @svq   The svq
+ */
+static uint16_t vhost_svq_available_slots(const VhostShadowVirtqueue *svq)
+{
+    return svq->vring.num - (svq->avail_idx_shadow - svq->shadow_used_idx);
+}
+
+static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
+{
+    uint16_t notification_flag;
+
+    if (svq->notification == enable) {
+        return;
+    }
+
+    notification_flag = cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
+
+    svq->notification = enable;
+    if (enable) {
+        svq->vring.avail->flags &= ~notification_flag;
+    } else {
+        svq->vring.avail->flags |= notification_flag;
+        /* Make sure the flag is written before the read of used_idx */
+        smp_mb();
+    }
+}
+
+static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
+                                    const struct iovec *iovec,
+                                    size_t num, bool more_descs, bool write)
+{
+    uint16_t i = svq->free_head, last = svq->free_head;
+    unsigned n;
+    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
+    vring_desc_t *descs = svq->vring.desc;
+
+    if (num == 0) {
+        return;
+    }
+
+    for (n = 0; n < num; n++) {
+        if (more_descs || (n + 1 < num)) {
+            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
+        } else {
+            descs[i].flags = flags;
+        }
+        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
+        descs[i].len = cpu_to_le32(iovec[n].iov_len);
+
+        last = i;
+        i = cpu_to_le16(descs[i].next);
+    }
+
+    svq->free_head = le16_to_cpu(descs[last].next);
+}
+
+static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
+                                VirtQueueElement *elem,
+                                unsigned *head)
+{
+    unsigned avail_idx;
+    vring_avail_t *avail = svq->vring.avail;
+
+    *head = svq->free_head;
+
+    /* We need some descriptors here */
+    if (unlikely(!elem->out_num && !elem->in_num)) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+            "Guest provided element with no descriptors");
+        return false;
+    }
+
+    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
+                            elem->in_num > 0, false);
+    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
+
+    /*
+     * Put the entry in the available array (but don't update avail->idx
+     * until the write barrier below).
+     */
+    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
+    avail->ring[avail_idx] = cpu_to_le16(*head);
+    svq->avail_idx_shadow++;
+
+    /* Update the avail index after writing the descriptors */
+    smp_wmb();
+    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
+
+    return true;
+}
+
+static bool vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
+{
+    unsigned qemu_head;
+    bool ok = vhost_svq_add_split(svq, elem, &qemu_head);
+    if (unlikely(!ok)) {
+        return false;
+    }
+
+    svq->ring_id_maps[qemu_head] = elem;
+    return true;
+}
+
+static void vhost_svq_kick(VhostShadowVirtqueue *svq)
+{
+    /*
+     * We need to expose the available array entries before checking the used
+     * flags
+     */
+    smp_mb();
+    if (svq->vring.used->flags & VRING_USED_F_NO_NOTIFY) {
+        return;
+    }
+
+    event_notifier_set(&svq->hdev_kick);
+}
+
+/**
+ * Forward available buffers.
+ *
+ * @svq Shadow VirtQueue
+ *
+ * Note that this function does not guarantee that all the guest's available
+ * buffers are available to the device in the SVQ avail ring. The guest may
+ * have exposed a GPA / GIOVA contiguous buffer, but it may not be contiguous
+ * in qemu's vaddr.
+ *
+ * If that happens, guest's kick notifications will be disabled until the
+ * device uses some buffers.
+ */
+static void vhost_handle_guest_kick(VhostShadowVirtqueue *svq)
+{
+    /* Clear event notifier */
+    event_notifier_test_and_clear(&svq->svq_kick);
+
+    /* Forward to the device as many available buffers as possible */
+    do {
+        virtio_queue_set_notification(svq->vq, false);
+
+        while (true) {
+            VirtQueueElement *elem;
+            bool ok;
+
+            if (svq->next_guest_avail_elem) {
+                elem = g_steal_pointer(&svq->next_guest_avail_elem);
+            } else {
+                elem = virtqueue_pop(svq->vq, sizeof(*elem));
+            }
+
+            if (!elem) {
+                break;
+            }
+
+            if (elem->out_num + elem->in_num >
+                vhost_svq_available_slots(svq)) {
+                /*
+                 * This condition is possible since a contiguous buffer in GPA
+                 * does not imply a contiguous buffer in qemu's VA
+                 * scatter-gather segments. If that happens, the buffer exposed
+                 * to the device needs to be a chain of descriptors at this
+                 * moment.
+                 *
+                 * SVQ cannot hold more available buffers if we are here:
+                 * queue the current guest descriptor and ignore further kicks
+                 * until some elements are used.
+                 */
+                svq->next_guest_avail_elem = elem;
+                return;
+            }
+
+            ok = vhost_svq_add(svq, elem);
+            if (unlikely(!ok)) {
+                /* VQ is broken, just return and ignore any other kicks */
+                return;
+            }
+            vhost_svq_kick(svq);
+        }
+
+        virtio_queue_set_notification(svq->vq, true);
+    } while (!virtio_queue_empty(svq->vq));
+}
+
+/**
+ * Handle guest's kick.
+ *
+ * @n guest kick event notifier, the one that guest set to notify svq.
+ */
+static void vhost_handle_guest_kick_notifier(EventNotifier *n)
 {
     VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
                                              svq_kick);
     event_notifier_test_and_clear(n);
-    event_notifier_set(&svq->hdev_kick);
+    vhost_handle_guest_kick(svq);
+}
+
+static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
+{
+    if (svq->last_used_idx != svq->shadow_used_idx) {
+        return true;
+    }
+
+    svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
+
+    return svq->last_used_idx != svq->shadow_used_idx;
 }
 
-/* Forward vhost notifications */
+static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
+{
+    vring_desc_t *descs = svq->vring.desc;
+    const vring_used_t *used = svq->vring.used;
+    vring_used_elem_t used_elem;
+    uint16_t last_used;
+
+    if (!vhost_svq_more_used(svq)) {
+        return NULL;
+    }
+
+    /* Only get used array entries after they have been exposed by dev */
+    smp_rmb();
+    last_used = svq->last_used_idx & (svq->vring.num - 1);
+    used_elem.id = le32_to_cpu(used->ring[last_used].id);
+    used_elem.len = le32_to_cpu(used->ring[last_used].len);
+
+    svq->last_used_idx++;
+    if (unlikely(used_elem.id >= svq->vring.num)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "Device %s says index %u is used",
+                      svq->vdev->name, used_elem.id);
+        return NULL;
+    }
+
+    if (unlikely(!svq->ring_id_maps[used_elem.id])) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+            "Device %s says index %u is used, but it was not available",
+            svq->vdev->name, used_elem.id);
+        return NULL;
+    }
+
+    descs[used_elem.id].next = svq->free_head;
+    svq->free_head = used_elem.id;
+
+    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
+    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
+}
+
+static void vhost_svq_flush(VhostShadowVirtqueue *svq,
+                            bool check_for_avail_queue)
+{
+    VirtQueue *vq = svq->vq;
+
+    /* Forward as many used buffers as possible. */
+    do {
+        unsigned i = 0;
+
+        vhost_svq_set_notification(svq, false);
+        while (true) {
+            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
+            if (!elem) {
+                break;
+            }
+
+            if (unlikely(i >= svq->vring.num)) {
+                qemu_log_mask(LOG_GUEST_ERROR,
+                         "More than %u used buffers obtained in a %u size SVQ",
+                         i, svq->vring.num);
+                virtqueue_fill(vq, elem, elem->len, i);
+                virtqueue_flush(vq, i);
+                return;
+            }
+            virtqueue_fill(vq, elem, elem->len, i++);
+        }
+
+        virtqueue_flush(vq, i);
+        event_notifier_set(&svq->svq_call);
+
+        if (check_for_avail_queue && svq->next_guest_avail_elem) {
+            /*
+             * Avail ring was full when vhost_svq_flush was called, so it's a
+             * good moment to make more descriptors available if possible.
+             */
+            vhost_handle_guest_kick(svq);
+        }
+
+        vhost_svq_set_notification(svq, true);
+    } while (vhost_svq_more_used(svq));
+}
+
+/**
+ * Forward used buffers.
+ *
+ * @n hdev call event notifier, the one that device set to notify svq.
+ *
+ * Note that we are not making any buffers available in the loop, so there is
+ * no way it runs more than virtqueue size times.
+ */
 static void vhost_svq_handle_call(EventNotifier *n)
 {
     VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
                                              hdev_call);
     event_notifier_test_and_clear(n);
-    event_notifier_set(&svq->svq_call);
+    vhost_svq_flush(svq, true);
 }
 
 /**
@@ -149,7 +437,41 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
     if (poll_start) {
         event_notifier_init_fd(svq_kick, svq_kick_fd);
         event_notifier_set(svq_kick);
-        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick);
+        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick_notifier);
+    }
+}
+
+/**
+ * Start the shadow virtqueue operation.
+ *
+ * @svq         Shadow Virtqueue
+ * @vdev        VirtIO device
+ * @vq          Virtqueue to shadow
+ */
+void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
+                     VirtQueue *vq)
+{
+    size_t desc_size, driver_size, device_size;
+
+    svq->next_guest_avail_elem = NULL;
+    svq->avail_idx_shadow = 0;
+    svq->shadow_used_idx = 0;
+    svq->last_used_idx = 0;
+    svq->vdev = vdev;
+    svq->vq = vq;
+
+    svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
+    driver_size = vhost_svq_driver_area_size(svq);
+    device_size = vhost_svq_device_area_size(svq);
+    svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
+    desc_size = sizeof(vring_desc_t) * svq->vring.num;
+    svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
+    memset(svq->vring.desc, 0, driver_size);
+    svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
+    memset(svq->vring.used, 0, device_size);
+    svq->ring_id_maps = g_new0(VirtQueueElement *, svq->vring.num);
+    for (unsigned i = 0; i < svq->vring.num - 1; i++) {
+        svq->vring.desc[i].next = cpu_to_le16(i + 1);
     }
 }
 
@@ -160,6 +482,32 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
 void vhost_svq_stop(VhostShadowVirtqueue *svq)
 {
     event_notifier_set_handler(&svq->svq_kick, NULL);
+    g_autofree VirtQueueElement *next_avail_elem = NULL;
+
+    if (!svq->vq) {
+        return;
+    }
+
+    /* Send all pending used descriptors to guest */
+    vhost_svq_flush(svq, false);
+
+    for (unsigned i = 0; i < svq->vring.num; ++i) {
+        g_autofree VirtQueueElement *elem = NULL;
+        elem = g_steal_pointer(&svq->ring_id_maps[i]);
+        if (elem) {
+            virtqueue_detach_element(svq->vq, elem, elem->len);
+        }
+    }
+
+    next_avail_elem = g_steal_pointer(&svq->next_guest_avail_elem);
+    if (next_avail_elem) {
+        virtqueue_detach_element(svq->vq, next_avail_elem,
+                                 next_avail_elem->len);
+    }
+    svq->vq = NULL;
+    g_free(svq->ring_id_maps);
+    qemu_vfree(svq->vring.desc);
+    qemu_vfree(svq->vring.used);
 }
 
 /**
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index b2c4e92fcf..435b9c2e9e 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -803,10 +803,10 @@ static int vhost_vdpa_set_vring_dev_addr(struct vhost_dev *dev,
  * Note that this function does not rewind kick file descriptor if cannot set
  * call one.
  */
-static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
-                                 VhostShadowVirtqueue *svq,
-                                 unsigned idx,
-                                 Error **errp)
+static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
+                                  VhostShadowVirtqueue *svq,
+                                  unsigned idx,
+                                  Error **errp)
 {
     struct vhost_vring_file file = {
         .index = dev->vq_index + idx,
@@ -818,7 +818,7 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
     r = vhost_vdpa_set_vring_dev_kick(dev, &file);
     if (unlikely(r != 0)) {
         error_setg_errno(errp, -r, "Can't set device kick fd");
-        return false;
+        return r;
     }
 
     event_notifier = &svq->hdev_call;
@@ -828,6 +828,119 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
         error_setg_errno(errp, -r, "Can't set device call fd");
     }
 
+    return r;
+}
+
+/**
+ * Unmap a SVQ area in the device
+ */
+static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
+                                      hwaddr size)
+{
+    int r;
+
+    size = ROUND_UP(size, qemu_real_host_page_size);
+    r = vhost_vdpa_dma_unmap(v, iova, size);
+    return r == 0;
+}
+
+static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
+                                       const VhostShadowVirtqueue *svq)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    struct vhost_vring_addr svq_addr;
+    size_t device_size = vhost_svq_device_area_size(svq);
+    size_t driver_size = vhost_svq_driver_area_size(svq);
+    bool ok;
+
+    vhost_svq_get_vring_addr(svq, &svq_addr);
+
+    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
+    if (unlikely(!ok)) {
+        return false;
+    }
+
+    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
+}
+
+/**
+ * Map shadow virtqueue rings in device
+ *
+ * @dev   The vhost device
+ * @svq   The shadow virtqueue
+ * @addr  Assigned IOVA addresses
+ * @errp  Error pointer
+ */
+static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
+                                     const VhostShadowVirtqueue *svq,
+                                     struct vhost_vring_addr *addr,
+                                     Error **errp)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    size_t device_size = vhost_svq_device_area_size(svq);
+    size_t driver_size = vhost_svq_driver_area_size(svq);
+    int r;
+
+    ERRP_GUARD();
+    vhost_svq_get_vring_addr(svq, addr);
+
+    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
+                           (void *)addr->desc_user_addr, true);
+    if (unlikely(r != 0)) {
+        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
+        return false;
+    }
+
+    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
+                           (void *)addr->used_user_addr, false);
+    if (unlikely(r != 0)) {
+        error_setg_errno(errp, -r, "Cannot create vq device region: ");
+    }
+
+    return r == 0;
+}
+
+static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
+                                 VhostShadowVirtqueue *svq,
+                                 unsigned idx,
+                                 Error **errp)
+{
+    uint16_t vq_index = dev->vq_index + idx;
+    struct vhost_vring_state s = {
+        .index = vq_index,
+    };
+    int r;
+
+    r = vhost_vdpa_set_dev_vring_base(dev, &s);
+    if (unlikely(r)) {
+        error_setg_errno(errp, -r, "Cannot set vring base");
+        return false;
+    }
+
+    r = vhost_vdpa_svq_set_fds(dev, svq, idx, errp);
+    return r == 0;
+}
+
+static bool vhost_vdpa_svq_set_addr(struct vhost_dev *dev, unsigned idx,
+                                    VhostShadowVirtqueue *svq,
+                                    Error **errp)
+{
+    struct vhost_vring_addr addr = {
+        .index = idx,
+    };
+    int r;
+
+    bool ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, errp);
+    if (unlikely(!ok)) {
+        return false;
+    }
+
+    /* Override vring GPA set by vhost subsystem */
+    r = vhost_vdpa_set_vring_dev_addr(dev, &addr);
+    if (unlikely(r != 0)) {
+        error_setg_errno(errp, -r, "Cannot set device address");
+    }
+
     return r == 0;
 }
 
@@ -842,10 +955,46 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
     }
 
     for (i = 0; i < v->shadow_vqs->len; ++i) {
+        VirtQueue *vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
         VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
         bool ok = vhost_vdpa_svq_setup(dev, svq, i, &err);
         if (unlikely(!ok)) {
-            error_reportf_err(err, "Cannot setup SVQ %u: ", i);
+            goto err;
+        }
+        vhost_svq_start(svq, dev->vdev, vq);
+        ok = vhost_vdpa_svq_set_addr(dev, i, svq, &err);
+        if (unlikely(!ok)) {
+            vhost_svq_stop(svq);
+            goto err;
+        }
+    }
+
+    return true;
+
+err:
+    error_reportf_err(err, "Cannot setup SVQ %u: ", i);
+    for (unsigned j = 0; j < i; ++j) {
+        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, j);
+        vhost_vdpa_svq_unmap_rings(dev, svq);
+        vhost_svq_stop(svq);
+    }
+
+    return false;
+}
+
+static bool vhost_vdpa_svqs_stop(struct vhost_dev *dev)
+{
+    struct vhost_vdpa *v = dev->opaque;
+
+    if (!v->shadow_vqs) {
+        return true;
+    }
+
+    for (unsigned i = 0; i < v->shadow_vqs->len; ++i) {
+        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs,
+                                                      i);
+        bool ok = vhost_vdpa_svq_unmap_rings(dev, svq);
+        if (unlikely(!ok)) {
             return false;
         }
     }
@@ -867,6 +1016,10 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
         }
         vhost_vdpa_set_vring_ready(dev);
     } else {
+        ok = vhost_vdpa_svqs_stop(dev);
+        if (unlikely(!ok)) {
+            return -1;
+        }
         vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
     }
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 08/14] util: Add iova_tree_alloc
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (6 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 07/14] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-28  6:39     ` Jason Wang
  2022-02-27 13:41 ` [PATCH v2 09/14] vhost: Add VhostIOVATree Eugenio Pérez
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

This iova tree function looks for a hole in the allocated regions and
returns a totally new translation for a given translated address.

Its main usage is to allow devices to access qemu's address space,
remapping the guest's address space into a new iova space to which qemu
can also add chunks of its own addresses.
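
As an illustration only (the function name below is made up; the API is
the one this patch declares), a minimal sketch of how a caller could use
the new allocator, assuming the DMAMap convention used in this series
where .size holds the length minus one:

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "qemu/iova-tree.h"

/* Allocate an IOVA for a qemu buffer inside [iova_first, iova_last] */
static int example_map_buffer(IOVATree *tree, void *buf, size_t len,
                              hwaddr iova_first, hwaddr iova_last)
{
    DMAMap map = {
        .translated_addr = (hwaddr)(uintptr_t)buf,
        .size = len - 1,              /* inclusive size, as iova-tree uses */
        .perm = IOMMU_RW,
    };
    int r = iova_tree_alloc_map(tree, &map, iova_first, iova_last);

    if (r != IOVA_OK) {
        return r;                     /* e.g. IOVA_ERR_NOMEM: no hole fits */
    }

    /* map.iova now holds the IOVA assigned to buf */
    return IOVA_OK;
}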

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/iova-tree.h |  18 ++++++
 util/iova-tree.c         | 133 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 151 insertions(+)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 8249edd764..a623136cd8 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -29,6 +29,7 @@
 #define  IOVA_OK           (0)
 #define  IOVA_ERR_INVALID  (-1) /* Invalid parameters */
 #define  IOVA_ERR_OVERLAP  (-2) /* IOVA range overlapped */
+#define  IOVA_ERR_NOMEM    (-3) /* Cannot allocate */
 
 typedef struct IOVATree IOVATree;
 typedef struct DMAMap {
@@ -119,6 +120,23 @@ const DMAMap *iova_tree_find_address(const IOVATree *tree, hwaddr iova);
  */
 void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator);
 
+/**
+ * iova_tree_alloc_map:
+ *
+ * @tree: the iova tree to allocate from
+ * @map: the new map (as translated addr & size) to allocate in the iova region
+ * @iova_begin: the minimum address of the allocation
+ * @iova_last: the maximum address (inclusive) of the allocation
+ *
+ * Allocates a new region of a given size, between iova_begin and iova_last.
+ *
+ * Return: Same as iova_tree_insert, but it cannot overlap and it can return
+ * an error if the iova tree is out of free contiguous space. The caller gets
+ * the assigned iova in map->iova.
+ */
+int iova_tree_alloc_map(IOVATree *tree, DMAMap *map, hwaddr iova_begin,
+                        hwaddr iova_last);
+
 /**
  * iova_tree_destroy:
  *
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 23ea35b7a4..302b01f1cc 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -16,6 +16,39 @@ struct IOVATree {
     GTree *tree;
 };
 
+/* Args to pass to iova_tree_alloc foreach function. */
+struct IOVATreeAllocArgs {
+    /* Size of the desired allocation */
+    size_t new_size;
+
+    /* The minimum address allowed in the allocation */
+    hwaddr iova_begin;
+
+    /* Map at the left of the hole, can be NULL if "this" is the first one */
+    const DMAMap *prev;
+
+    /* Map at the right of the hole, can be NULL if "prev" is the last one */
+    const DMAMap *this;
+
+    /* If found, we fill in the IOVA here */
+    hwaddr iova_result;
+
+    /* Whether we have found a valid IOVA */
+    bool iova_found;
+};
+
+/**
+ * Iterate args to the next hole
+ *
+ * @args  The alloc arguments
+ * @next  The next mapping in the tree. Can be NULL to signal the last one
+ */
+static void iova_tree_alloc_args_iterate(struct IOVATreeAllocArgs *args,
+                                         const DMAMap *next) {
+    args->prev = args->this;
+    args->this = next;
+}
+
 static int iova_tree_compare(gconstpointer a, gconstpointer b, gpointer data)
 {
     const DMAMap *m1 = a, *m2 = b;
@@ -107,6 +140,106 @@ int iova_tree_remove(IOVATree *tree, const DMAMap *map)
     return IOVA_OK;
 }
 
+/**
+ * Try to find an unallocated IOVA range between prev and this elements.
+ *
+ * @args Arguments to allocation
+ *
+ * Cases:
+ *
+ * (1) !prev, !this: No entries allocated, always succeed
+ *
+ * (2) !prev, this: We're iterating at the 1st element.
+ *
+ * (3) prev, !this: We're iterating at the last element.
+ *
+ * (4) prev, this: this is the most common case, we'll try to find a hole
+ * between "prev" and "this" mapping.
+ *
+ * Note that this function assumes the last valid iova is HWADDR_MAX, but it
+ * searches linearly so it's easy to discard the result if it's not the case.
+ */
+static void iova_tree_alloc_map_in_hole(struct IOVATreeAllocArgs *args)
+{
+    const DMAMap *prev = args->prev, *this = args->this;
+    uint64_t hole_start, hole_last;
+
+    if (this && this->iova + this->size < args->iova_begin) {
+        return;
+    }
+
+    hole_start = MAX(prev ? prev->iova + prev->size + 1 : 0, args->iova_begin);
+    hole_last = this ? this->iova : HWADDR_MAX;
+
+    if (hole_last - hole_start > args->new_size) {
+        args->iova_result = hole_start;
+        args->iova_found = true;
+    }
+}
+
+/**
+ * Foreach dma node in the tree, check if there is a hole between the previous
+ * node (or the minimum iova address allowed) and the node itself.
+ *
+ * @key   Node iterating
+ * @value Node iterating
+ * @pargs Struct to communicate with the outside world
+ *
+ * Return: false to keep iterating, true to stop the traversal.
+ */
+static gboolean iova_tree_alloc_traverse(gpointer key, gpointer value,
+                                         gpointer pargs)
+{
+    struct IOVATreeAllocArgs *args = pargs;
+    DMAMap *node = value;
+
+    assert(key == value);
+
+    iova_tree_alloc_args_iterate(args, node);
+    iova_tree_alloc_map_in_hole(args);
+    return args->iova_found;
+}
+
+int iova_tree_alloc_map(IOVATree *tree, DMAMap *map, hwaddr iova_begin,
+                        hwaddr iova_last)
+{
+    struct IOVATreeAllocArgs args = {
+        .new_size = map->size,
+        .iova_begin = iova_begin,
+    };
+
+    assert(iova_begin < iova_last);
+
+    /*
+     * Find a valid hole for the mapping
+     *
+     * Assuming low iova_begin, so no need to do a binary search to
+     * locate the first node.
+     *
+     * TODO: Replace all this with g_tree_node_first/next/last when available
+     * (from glib since 2.68). To do it with g_tree_foreach complicates the
+     * code a lot.
+     *
+     */
+    g_tree_foreach(tree->tree, iova_tree_alloc_traverse, &args);
+    if (!args.iova_found) {
+        /*
+         * Either the tree is empty or the last hole is still not checked.
+         * g_tree_foreach does not compare the (last, iova_end] range, so we
+         * check it here.
+         */
+        iova_tree_alloc_args_iterate(&args, NULL);
+        iova_tree_alloc_map_in_hole(&args);
+    }
+
+    if (!args.iova_found || args.iova_result + map->size > iova_last) {
+        return IOVA_ERR_NOMEM;
+    }
+
+    map->iova = args.iova_result;
+    return iova_tree_insert(tree, map);
+}
+
 void iova_tree_destroy(IOVATree *tree)
 {
     g_tree_destroy(tree->tree);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 09/14] vhost: Add VhostIOVATree
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (7 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 08/14] util: Add iova_tree_alloc Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-28  7:06     ` Jason Wang
  2022-02-27 13:41 ` [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ Eugenio Pérez
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

This tree is able to look for a translated address from an IOVA address.

At first glance it is similar to util/iova-tree. However, an SVQ working
on devices with limited IOVA space needs more capabilities, like
allocating IOVA chunks or performing reverse translations (qemu
addresses to iova).

The allocation capability, understood as "assign a free IOVA address to
this chunk of memory in qemu's address space", allows the shadow
virtqueue to create a new address space that is not restricted by the
guest's addressable one, so we can allocate the shadow vqs' vrings
outside of it.

It duplicates the tree so it can search efficiently in both directions,
and it will signal an overlap if the iova or the translated address is
already present in either tree.
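
For illustration only (the function name below is made up; the API is
the one this patch adds), a sketch of how a user of the tree combines
the two directions, allocation and reverse translation:

#include "qemu/osdep.h"
#include "hw/virtio/vhost-iova-tree.h"

/*
 * Allocate an IOVA for a qemu VA range and then translate an address in
 * that range back to IOVA, the two operations SVQ needs from this tree.
 */
static int example_alloc_and_translate(VhostIOVATree *tree,
                                       void *buf, size_t len)
{
    DMAMap map = {
        .translated_addr = (hwaddr)(uintptr_t)buf,
        .size = len - 1,
        .perm = IOMMU_RW,
    };
    const DMAMap *found;
    int r = vhost_iova_tree_map_alloc(tree, &map);

    if (r != IOVA_OK) {
        return r;
    }

    /* Reverse translation: qemu VA -> iova */
    found = vhost_iova_tree_find_iova(tree, &map);
    assert(found && found->iova == map.iova);

    vhost_iova_tree_remove(tree, &map);
    return IOVA_OK;
}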

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-iova-tree.h |  27 +++++++
 hw/virtio/vhost-iova-tree.c | 155 ++++++++++++++++++++++++++++++++++++
 hw/virtio/meson.build       |   2 +-
 3 files changed, 183 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/vhost-iova-tree.h
 create mode 100644 hw/virtio/vhost-iova-tree.c

diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
new file mode 100644
index 0000000000..6a4f24e0f9
--- /dev/null
+++ b/hw/virtio/vhost-iova-tree.h
@@ -0,0 +1,27 @@
+/*
+ * vhost software live migration iova tree
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
+#define HW_VIRTIO_VHOST_IOVA_TREE_H
+
+#include "qemu/iova-tree.h"
+#include "exec/memory.h"
+
+typedef struct VhostIOVATree VhostIOVATree;
+
+VhostIOVATree *vhost_iova_tree_new(uint64_t iova_first, uint64_t iova_last);
+void vhost_iova_tree_delete(VhostIOVATree *iova_tree);
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_delete);
+
+const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
+                                        const DMAMap *map);
+int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
+void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map);
+
+#endif
diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
new file mode 100644
index 0000000000..03496ac075
--- /dev/null
+++ b/hw/virtio/vhost-iova-tree.c
@@ -0,0 +1,155 @@
+/*
+ * vhost software live migration iova tree
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/iova-tree.h"
+#include "vhost-iova-tree.h"
+
+#define iova_min_addr qemu_real_host_page_size
+
+/**
+ * VhostIOVATree, able to:
+ * - Translate iova address
+ * - Reverse translate iova address (from translated to iova)
+ * - Allocate IOVA regions for translated range (linear operation)
+ */
+struct VhostIOVATree {
+    /* First addressable iova address in the device */
+    uint64_t iova_first;
+
+    /* Last addressable iova address in the device */
+    uint64_t iova_last;
+
+    /* IOVA address to qemu memory maps. */
+    IOVATree *iova_taddr_map;
+
+    /* QEMU virtual memory address to iova maps */
+    GTree *taddr_iova_map;
+};
+
+static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b,
+                                      gpointer data)
+{
+    const DMAMap *m1 = a, *m2 = b;
+
+    if (m1->translated_addr > m2->translated_addr + m2->size) {
+        return 1;
+    }
+
+    if (m1->translated_addr + m1->size < m2->translated_addr) {
+        return -1;
+    }
+
+    /* Overlapped */
+    return 0;
+}
+
+/**
+ * Create a new IOVA tree
+ *
+ * Returns the new IOVA tree
+ */
+VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
+{
+    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
+
+    /* Some devices do not like 0 addresses */
+    tree->iova_first = MAX(iova_first, iova_min_addr);
+    tree->iova_last = iova_last;
+
+    tree->iova_taddr_map = iova_tree_new();
+    tree->taddr_iova_map = g_tree_new_full(vhost_iova_tree_cmp_taddr, NULL,
+                                           NULL, g_free);
+    return tree;
+}
+
+/**
+ * Delete an iova tree
+ */
+void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
+{
+    iova_tree_destroy(iova_tree->iova_taddr_map);
+    g_tree_unref(iova_tree->taddr_iova_map);
+    g_free(iova_tree);
+}
+
+/**
+ * Find the IOVA mapping stored for a translated (qemu) memory address
+ *
+ * @tree     The iova tree
+ * @map      The map with the memory address
+ *
+ * Return the stored mapping, or NULL if not found.
+ */
+const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
+                                        const DMAMap *map)
+{
+    return g_tree_lookup(tree->taddr_iova_map, map);
+}
+
+/**
+ * Allocate a new mapping
+ *
+ * @tree  The iova tree
+ * @map   The iova map
+ *
+ * Returns:
+ * - IOVA_OK if the map fits in the container
+ * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
+ * - IOVA_ERR_OVERLAP if the tree already contains that map
+ * - IOVA_ERR_NOMEM if tree cannot allocate more space.
+ *
+ * It returns the assigned iova in map->iova if the return value is IOVA_OK.
+ */
+int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
+{
+    /* Some vhost devices do not like addr 0. Skip first page */
+    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size;
+    DMAMap *new;
+    int r;
+
+    if (map->translated_addr + map->size < map->translated_addr ||
+        map->perm == IOMMU_NONE) {
+        return IOVA_ERR_INVALID;
+    }
+
+    /* Check for collisions in translated addresses */
+    if (vhost_iova_tree_find_iova(tree, map)) {
+        return IOVA_ERR_OVERLAP;
+    }
+
+    /* Allocate a node in IOVA address */
+    r = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
+                            tree->iova_last);
+    if (r != IOVA_OK) {
+        return r;
+    }
+
+    /* Allocate node in qemu -> iova translations */
+    new = g_malloc(sizeof(*new));
+    memcpy(new, map, sizeof(*new));
+    g_tree_insert(tree->taddr_iova_map, new, new);
+    return IOVA_OK;
+}
+
+/**
+ * Remove existing mappings from the iova tree
+ *
+ * @iova_tree  The vhost iova tree
+ * @map        The map to remove
+ */
+void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map)
+{
+    const DMAMap *overlap;
+
+    iova_tree_remove(iova_tree->iova_taddr_map, map);
+    while ((overlap = vhost_iova_tree_find_iova(iova_tree, map))) {
+        g_tree_remove(iova_tree->taddr_iova_map, overlap);
+    }
+}
diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index 2dc87613bc..6047670804 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
 
 virtio_ss = ss.source_set()
 virtio_ss.add(files('virtio.c'))
-virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
+virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
 virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (8 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 09/14] vhost: Add VhostIOVATree Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-28  7:36     ` Jason Wang
  2022-02-27 13:41 ` [PATCH v2 11/14] vdpa: Adapt vhost_vdpa_get_vring_base " Eugenio Pérez
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

Use translations added in VhostIOVATree in SVQ.

Only introduce the usage here, not allocation and deallocation. As with
previous patches, we use the dead code paths behind shadow_vqs_enabled
to avoid committing too many changes at once. These paths are impossible
to reach at the moment.
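
As a reference for reviewers (illustrative only, the function name is
made up), the per-descriptor translation that SVQ performs boils down to
a reverse lookup plus an offset into the found map:

#include "qemu/osdep.h"
#include "hw/virtio/vhost-iova-tree.h"

/* Translate one qemu VA range into the IOVA the device must use */
static bool example_translate(const VhostIOVATree *tree,
                              void *vaddr, size_t len, hwaddr *iova)
{
    const DMAMap needle = {
        .translated_addr = (hwaddr)(uintptr_t)vaddr,
        .size = len - 1,
    };
    const DMAMap *map = vhost_iova_tree_find_iova(tree, &needle);

    if (!map) {
        return false;    /* the address was never mapped */
    }

    /* The stored map may start before vaddr: add the offset into it */
    *iova = map->iova + (needle.translated_addr - map->translated_addr);
    return true;
}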

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |   6 +-
 include/hw/virtio/vhost-vdpa.h     |   3 +
 hw/virtio/vhost-shadow-virtqueue.c |  76 ++++++++++++++++-
 hw/virtio/vhost-vdpa.c             | 128 ++++++++++++++++++++++++-----
 4 files changed, 187 insertions(+), 26 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 04c67685fd..b2f722d101 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -13,6 +13,7 @@
 #include "qemu/event_notifier.h"
 #include "hw/virtio/virtio.h"
 #include "standard-headers/linux/vhost_types.h"
+#include "hw/virtio/vhost-iova-tree.h"
 
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
@@ -43,6 +44,9 @@ typedef struct VhostShadowVirtqueue {
     /* Virtio device */
     VirtIODevice *vdev;
 
+    /* IOVA mapping */
+    VhostIOVATree *iova_tree;
+
     /* Map for use the guest's descriptors */
     VirtQueueElement **ring_id_maps;
 
@@ -78,7 +82,7 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
                      VirtQueue *vq);
 void vhost_svq_stop(VhostShadowVirtqueue *svq);
 
-VhostShadowVirtqueue *vhost_svq_new(void);
+VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree);
 
 void vhost_svq_free(gpointer vq);
 G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 009a9f3b6b..ee8e939ad0 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -14,6 +14,7 @@
 
 #include <gmodule.h>
 
+#include "hw/virtio/vhost-iova-tree.h"
 #include "hw/virtio/virtio.h"
 #include "standard-headers/linux/vhost_types.h"
 
@@ -30,6 +31,8 @@ typedef struct vhost_vdpa {
     MemoryListener listener;
     struct vhost_vdpa_iova_range iova_range;
     bool shadow_vqs_enabled;
+    /* IOVA mapping used by the Shadow Virtqueue */
+    VhostIOVATree *iova_tree;
     GPtrArray *shadow_vqs;
     struct vhost_dev *dev;
     VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index a38d313755..7e073773d1 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -11,6 +11,7 @@
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 
 #include "qemu/error-report.h"
+#include "qemu/log.h"
 #include "qemu/main-loop.h"
 #include "qemu/log.h"
 #include "linux-headers/linux/vhost.h"
@@ -84,7 +85,58 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
     }
 }
 
+/**
+ * Translate qemu's virtual addresses into SVQ IOVA addresses
+ *
+ * @svq    Shadow VirtQueue
+ * @addrs  Destination array for the translated SVQ IOVA addresses
+ * @iovec  Source qemu VA addresses
+ * @num    Length of iovec and minimum length of addrs
+ */
+static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
+                                     void **addrs, const struct iovec *iovec,
+                                     size_t num)
+{
+    if (num == 0) {
+        return true;
+    }
+
+    for (size_t i = 0; i < num; ++i) {
+        DMAMap needle = {
+            .translated_addr = (hwaddr)iovec[i].iov_base,
+            .size = iovec[i].iov_len,
+        };
+        size_t off;
+
+        const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree, &needle);
+        /*
+         * Map cannot be NULL since iova map contains all guest space and
+         * qemu already has a physical address mapped
+         */
+        if (unlikely(!map)) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          "Invalid address 0x%"HWADDR_PRIx" given by guest",
+                          needle.translated_addr);
+            return false;
+        }
+
+        off = needle.translated_addr - map->translated_addr;
+        addrs[i] = (void *)(map->iova + off);
+
+        if (unlikely(int128_gt(int128_add(needle.translated_addr,
+                                          iovec[i].iov_len),
+                               map->translated_addr + map->size))) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          "Guest buffer expands over iova range");
+            return false;
+        }
+    }
+
+    return true;
+}
+
 static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
+                                    void * const *vaddr_sg,
                                     const struct iovec *iovec,
                                     size_t num, bool more_descs, bool write)
 {
@@ -103,7 +155,7 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
         } else {
             descs[i].flags = flags;
         }
-        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
+        descs[i].addr = cpu_to_le64((hwaddr)vaddr_sg[n]);
         descs[i].len = cpu_to_le32(iovec[n].iov_len);
 
         last = i;
@@ -119,6 +171,8 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
 {
     unsigned avail_idx;
     vring_avail_t *avail = svq->vring.avail;
+    bool ok;
+    g_autofree void **sgs = g_new(void *, MAX(elem->out_num, elem->in_num));
 
     *head = svq->free_head;
 
@@ -129,9 +183,20 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
         return false;
     }
 
-    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
+    ok = vhost_svq_translate_addr(svq, sgs, elem->out_sg, elem->out_num);
+    if (unlikely(!ok)) {
+        return false;
+    }
+    vhost_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
                             elem->in_num > 0, false);
-    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
+
+
+    ok = vhost_svq_translate_addr(svq, sgs, elem->in_sg, elem->in_num);
+    if (unlikely(!ok)) {
+        return false;
+    }
+
+    vhost_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, false, true);
 
     /*
      * Put the entry in the available array (but don't update avail->idx until
@@ -514,11 +579,13 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
  * Creates vhost shadow virtqueue, and instructs the vhost device to use the
  * shadow methods and file descriptors.
  *
+ * @iova_tree Tree to perform descriptors translations
+ *
  * Returns the new virtqueue or NULL.
  *
  * In case of error, reason is reported through error_report.
  */
-VhostShadowVirtqueue *vhost_svq_new(void)
+VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree)
 {
     g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
     int r;
@@ -539,6 +606,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
 
     event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
     event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
+    svq->iova_tree = iova_tree;
     return g_steal_pointer(&svq);
 
 err_init_hdev_call:
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 435b9c2e9e..56f9f125cd 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -209,6 +209,21 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
                                          vaddr, section->readonly);
 
     llsize = int128_sub(llend, int128_make64(iova));
+    if (v->shadow_vqs_enabled) {
+        DMAMap mem_region = {
+            .translated_addr = (hwaddr)vaddr,
+            .size = int128_get64(llsize) - 1,
+            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
+        };
+
+        int r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
+        if (unlikely(r != IOVA_OK)) {
+            error_report("Can't allocate a mapping (%d)", r);
+            goto fail;
+        }
+
+        iova = mem_region.iova;
+    }
 
     vhost_vdpa_iotlb_batch_begin_once(v);
     ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
@@ -261,6 +276,20 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
 
     llsize = int128_sub(llend, int128_make64(iova));
 
+    if (v->shadow_vqs_enabled) {
+        const DMAMap *result;
+        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+        DMAMap mem_region = {
+            .translated_addr = (hwaddr)vaddr,
+            .size = int128_get64(llsize) - 1,
+        };
+
+        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
+        iova = result->iova;
+        vhost_iova_tree_remove(v->iova_tree, &mem_region);
+    }
     vhost_vdpa_iotlb_batch_begin_once(v);
     ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
     if (ret) {
@@ -383,7 +412,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
 
     shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
     for (unsigned n = 0; n < hdev->nvqs; ++n) {
-        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
+        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree);
 
         if (unlikely(!svq)) {
             error_setg(errp, "Cannot create svq %u", n);
@@ -834,37 +863,78 @@ static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
 /**
  * Unmap a SVQ area in the device
  */
-static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
-                                      hwaddr size)
+static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
+                                      const DMAMap *needle)
 {
+    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, needle);
+    hwaddr size;
     int r;
 
-    size = ROUND_UP(size, qemu_real_host_page_size);
-    r = vhost_vdpa_dma_unmap(v, iova, size);
+    if (unlikely(!result)) {
+        error_report("Unable to find SVQ address to unmap");
+        return false;
+    }
+
+    size = ROUND_UP(result->size, qemu_real_host_page_size);
+    r = vhost_vdpa_dma_unmap(v, result->iova, size);
     return r == 0;
 }
 
 static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
                                        const VhostShadowVirtqueue *svq)
 {
+    DMAMap needle;
     struct vhost_vdpa *v = dev->opaque;
     struct vhost_vring_addr svq_addr;
-    size_t device_size = vhost_svq_device_area_size(svq);
-    size_t driver_size = vhost_svq_driver_area_size(svq);
     bool ok;
 
     vhost_svq_get_vring_addr(svq, &svq_addr);
 
-    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
+    needle = (DMAMap) {
+        .translated_addr = svq_addr.desc_user_addr,
+    };
+    ok = vhost_vdpa_svq_unmap_ring(v, &needle);
     if (unlikely(!ok)) {
         return false;
     }
 
-    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
+    needle = (DMAMap) {
+        .translated_addr = svq_addr.used_user_addr,
+    };
+    return vhost_vdpa_svq_unmap_ring(v, &needle);
+}
+
+/**
+ * Map the SVQ area in the device
+ *
+ * @v          Vhost-vdpa device
+ * @needle     The area to map; the iova to use is returned in needle->iova
+ * @errp       Error pointer
+ */
+static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
+                                    Error **errp)
+{
+    int r;
+
+    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
+    if (unlikely(r != IOVA_OK)) {
+        error_setg(errp, "Cannot allocate iova (%d)", r);
+        return false;
+    }
+
+    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
+                           (void *)needle->translated_addr,
+                           !(needle->perm & IOMMU_ACCESS_FLAG(0, 1)));
+    if (unlikely(r != 0)) {
+        error_setg_errno(errp, -r, "Cannot map region to device");
+        vhost_iova_tree_remove(v->iova_tree, needle);
+    }
+
+    return r == 0;
 }
 
 /**
- * Map shadow virtqueue rings in device
+ * Map the shadow virtqueue rings in the device
  *
  * @dev   The vhost device
  * @svq   The shadow virtqueue
@@ -876,28 +946,44 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
                                      struct vhost_vring_addr *addr,
                                      Error **errp)
 {
+    DMAMap device_region, driver_region;
+    struct vhost_vring_addr svq_addr;
     struct vhost_vdpa *v = dev->opaque;
     size_t device_size = vhost_svq_device_area_size(svq);
     size_t driver_size = vhost_svq_driver_area_size(svq);
-    int r;
+    size_t avail_offset;
+    bool ok;
 
     ERRP_GUARD();
-    vhost_svq_get_vring_addr(svq, addr);
+    vhost_svq_get_vring_addr(svq, &svq_addr);
 
-    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
-                           (void *)addr->desc_user_addr, true);
-    if (unlikely(r != 0)) {
-        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
+    driver_region = (DMAMap) {
+        .translated_addr = svq_addr.desc_user_addr,
+        .size = driver_size - 1,
+        .perm = IOMMU_RO,
+    };
+    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
+    if (unlikely(!ok)) {
+        error_prepend(errp, "Cannot create vq driver region: ");
         return false;
     }
+    addr->desc_user_addr = driver_region.iova;
+    avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
+    addr->avail_user_addr = driver_region.iova + avail_offset;
 
-    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
-                           (void *)addr->used_user_addr, false);
-    if (unlikely(r != 0)) {
-        error_setg_errno(errp, -r, "Cannot create vq device region: ");
+    device_region = (DMAMap) {
+        .translated_addr = svq_addr.used_user_addr,
+        .size = device_size - 1,
+        .perm = IOMMU_RW,
+    };
+    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
+    if (unlikely(!ok)) {
+        error_prepend(errp, "Cannot create vq device region: ");
+        vhost_vdpa_svq_unmap_ring(v, &driver_region);
     }
+    addr->used_user_addr = device_region.iova;
 
-    return r == 0;
+    return ok;
 }
 
 static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 11/14] vdpa: Adapt vhost_vdpa_get_vring_base to SVQ
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (9 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-28  7:38     ` Jason Wang
  2022-02-27 13:41 ` [PATCH v2 12/14] vdpa: Never set log_base addr if SVQ is enabled Eugenio Pérez
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

This is needed to achieve migration, so the destination can restore the
vring index from where the source left off.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 56f9f125cd..accc4024c2 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1180,8 +1180,25 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
 static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
                                        struct vhost_vring_state *ring)
 {
+    struct vhost_vdpa *v = dev->opaque;
     int ret;
 
+    if (v->shadow_vqs_enabled) {
+        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs,
+                                                      ring->index);
+
+        /*
+         * Setting base as the last used idx, so the destination will see as
+         * available all the entries that the device did not use, including
+         * the ones still in flight.
+         *
+         * TODO: This is ok for networking, but other kinds of devices might
+         * have problems with these retransmissions.
+         */
+        ring->num = svq->last_used_idx;
+        return 0;
+    }
+
     ret = vhost_vdpa_call(dev, VHOST_GET_VRING_BASE, ring);
     trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num);
     return ret;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 12/14] vdpa: Never set log_base addr if SVQ is enabled
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (10 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 11/14] vdpa: Adapt vhost_vdpa_get_vring_base " Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-27 13:41 ` [PATCH v2 13/14] vdpa: Expose VHOST_F_LOG_ALL on SVQ Eugenio Pérez
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

Setting the log address would make the device start reporting invalid
dirty memory because the SVQ vrings are located in qemu's memory.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index accc4024c2..f7ac62d0d6 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1129,7 +1129,8 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
 static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
                                      struct vhost_log *log)
 {
-    if (vhost_vdpa_one_time_request(dev)) {
+    struct vhost_vdpa *v = dev->opaque;
+    if (v->shadow_vqs_enabled || vhost_vdpa_one_time_request(dev)) {
         return 0;
     }
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 13/14] vdpa: Expose VHOST_F_LOG_ALL on SVQ
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (11 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 12/14] vdpa: Never set log_base addr if SVQ is enabled Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-27 13:41 ` [PATCH v2 14/14] vdpa: Add x-svq to NetdevVhostVDPAOptions Eugenio Pérez
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

SVQ is able to log the dirty bits by itself, so let's use it to avoid
blocking migration.

Also, ignore the set and clear of VHOST_F_LOG_ALL on set_features if SVQ
is enabled. Even if the device supports it, the reports would be
nonsense because the SVQ memory is in qemu's region.

The log region is still allocated. Future changes might skip that, but
this series is already long enough.
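
For reference (illustrative only, not part of the patch), the check
added in set_features reduces to a one-bit XOR comparison: SVQ only
short-circuits the call when the sole difference between the previously
acked features and the requested ones is the logging bit:

#include "qemu/osdep.h"
#include "qemu/bitops.h"
#include "linux-headers/linux/vhost.h"

/* True when the only bit that differs between both feature sets is _F_LOG_ALL */
static bool only_log_all_toggled(uint64_t acked, uint64_t requested)
{
    return (acked ^ requested) == BIT_ULL(VHOST_F_LOG_ALL);
}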

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  1 +
 hw/virtio/vhost-vdpa.c         | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index ee8e939ad0..a29dbb3f53 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -30,6 +30,7 @@ typedef struct vhost_vdpa {
     bool iotlb_batch_begin_sent;
     MemoryListener listener;
     struct vhost_vdpa_iova_range iova_range;
+    uint64_t acked_features;
     bool shadow_vqs_enabled;
     /* IOVA mapping used by the Shadow Virtqueue */
     VhostIOVATree *iova_tree;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index f7ac62d0d6..58007255fd 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -630,6 +630,7 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
     }
 
     if (v->shadow_vqs_enabled) {
+        /* We must not ack _F_LOG if SVQ is enabled */
         uint64_t features_ok = features;
         bool ok;
 
@@ -640,6 +641,18 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
                 PRIx64", ok: 0x%"PRIx64, features, features_ok);
             return -EINVAL;
         }
+
+        if ((v->acked_features ^ features) == BIT_ULL(VHOST_F_LOG_ALL)) {
+            /*
+             * QEMU is just trying to enable or disable logging. SVQ handles
+             * this separately, so there is no need to forward it.
+             */
+            v->acked_features = features;
+            return 0;
+        }
+
+        v->acked_features = features;
+        features &= ~BIT_ULL(VHOST_F_LOG_ALL);
     }
 
     trace_vhost_vdpa_set_features(dev, features);
@@ -1245,6 +1258,9 @@ static int vhost_vdpa_get_features(struct vhost_dev *dev,
     if (ret == 0 && v->shadow_vqs_enabled) {
         /* Filter only features that SVQ can offer to guest */
         vhost_svq_valid_features(features);
+
+        /* Add SVQ logging capabilities */
+        *features |= BIT_ULL(VHOST_F_LOG_ALL);
     }
 
     return ret;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 14/14] vdpa: Add x-svq to NetdevVhostVDPAOptions
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
                   ` (12 preceding siblings ...)
  2022-02-27 13:41 ` [PATCH v2 13/14] vdpa: Expose VHOST_F_LOG_ALL on SVQ Eugenio Pérez
@ 2022-02-27 13:41 ` Eugenio Pérez
  2022-02-28  2:32   ` Jason Wang
  2022-02-28  7:41   ` Jason Wang
  15 siblings, 0 replies; 69+ messages in thread
From: Eugenio Pérez @ 2022-02-27 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Jason Wang, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

Finally, offer the possibility to enable SVQ from the command line.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 qapi/net.json    |  5 ++++-
 net/vhost-vdpa.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/qapi/net.json b/qapi/net.json
index 7fab2e7cd8..d243701527 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -445,12 +445,15 @@
 # @queues: number of queues to be created for multiqueue vhost-vdpa
 #          (default: 1)
 #
+# @x-svq: Start device with (experimental) shadow virtqueue. (Since 7.0)
+#
 # Since: 5.1
 ##
 { 'struct': 'NetdevVhostVDPAOptions',
   'data': {
     '*vhostdev':     'str',
-    '*queues':       'int' } }
+    '*queues':       'int',
+    '*x-svq':        'bool' } }
 
 ##
 # @NetClientDriver:
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 1e9fe47c03..def738998b 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -127,7 +127,11 @@ err_init:
 static void vhost_vdpa_cleanup(NetClientState *nc)
 {
     VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+    struct vhost_dev *dev = s->vhost_vdpa.dev;
 
+    if (dev && dev->vq_index + dev->nvqs == dev->vq_index_end) {
+        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
+    }
     if (s->vhost_net) {
         vhost_net_cleanup(s->vhost_net);
         g_free(s->vhost_net);
@@ -187,13 +191,23 @@ static NetClientInfo net_vhost_vdpa_info = {
         .check_peer_type = vhost_vdpa_check_peer_type,
 };
 
+static int vhost_vdpa_get_iova_range(int fd,
+                                     struct vhost_vdpa_iova_range *iova_range)
+{
+    int ret = ioctl(fd, VHOST_VDPA_GET_IOVA_RANGE, iova_range);
+
+    return ret < 0 ? -errno : 0;
+}
+
 static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
-                                           const char *device,
-                                           const char *name,
-                                           int vdpa_device_fd,
-                                           int queue_pair_index,
-                                           int nvqs,
-                                           bool is_datapath)
+                                       const char *device,
+                                       const char *name,
+                                       int vdpa_device_fd,
+                                       int queue_pair_index,
+                                       int nvqs,
+                                       bool is_datapath,
+                                       bool svq,
+                                       VhostIOVATree *iova_tree)
 {
     NetClientState *nc = NULL;
     VhostVDPAState *s;
@@ -211,6 +225,8 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
 
     s->vhost_vdpa.device_fd = vdpa_device_fd;
     s->vhost_vdpa.index = queue_pair_index;
+    s->vhost_vdpa.shadow_vqs_enabled = svq;
+    s->vhost_vdpa.iova_tree = iova_tree;
     ret = vhost_vdpa_add(nc, (void *)&s->vhost_vdpa, queue_pair_index, nvqs);
     if (ret) {
         qemu_del_net_client(nc);
@@ -266,6 +282,7 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
     g_autofree NetClientState **ncs = NULL;
     NetClientState *nc;
     int queue_pairs, i, has_cvq = 0;
+    g_autoptr(VhostIOVATree) iova_tree = NULL;
 
     assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA);
     opts = &netdev->u.vhost_vdpa;
@@ -285,29 +302,44 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
         qemu_close(vdpa_device_fd);
         return queue_pairs;
     }
+    if (opts->x_svq) {
+        struct vhost_vdpa_iova_range iova_range;
+
+        if (has_cvq) {
+            error_setg(errp, "vdpa svq does not work with cvq");
+            goto err_svq;
+        }
+        vhost_vdpa_get_iova_range(vdpa_device_fd, &iova_range);
+        iova_tree = vhost_iova_tree_new(iova_range.first, iova_range.last);
+    }
 
     ncs = g_malloc0(sizeof(*ncs) * queue_pairs);
 
     for (i = 0; i < queue_pairs; i++) {
         ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
-                                     vdpa_device_fd, i, 2, true);
+                                     vdpa_device_fd, i, 2, true, opts->x_svq,
+                                     iova_tree);
         if (!ncs[i])
             goto err;
     }
 
     if (has_cvq) {
         nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
-                                 vdpa_device_fd, i, 1, false);
+                                 vdpa_device_fd, i, 1, false, opts->x_svq,
+                                 iova_tree);
         if (!nc)
             goto err;
     }
 
+    iova_tree = NULL;
     return 0;
 
 err:
     if (i) {
         qemu_del_net_client(ncs[0]);
     }
+
+err_svq:
     qemu_close(vdpa_device_fd);
 
     return -1;
-- 
2.27.0
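
For reference, a usage sketch for the vhost_vdpa_get_iova_range() wrapper
introduced above (the function name, the error handling and the message text
are illustrative, not part of the patch): since the wrapper converts an ioctl
failure into -errno, a caller that wants to report failures can check the
return value before building the IOVA tree.

/* Sketch only: svq_iova_tree_from_fd() is not a function in the series. */
static VhostIOVATree *svq_iova_tree_from_fd(int vdpa_device_fd, Error **errp)
{
    struct vhost_vdpa_iova_range iova_range;
    int r = vhost_vdpa_get_iova_range(vdpa_device_fd, &iova_range);

    if (r < 0) {
        error_setg_errno(errp, -r, "Cannot get vdpa iova range");
        return NULL;
    }

    return vhost_iova_tree_new(iova_range.first, iova_range.last);
}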



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/14] vDPA shadow virtqueue
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
@ 2022-02-28  2:32   ` Jason Wang
  2022-02-27 13:40 ` [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities Eugenio Pérez
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  2:32 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Michael S. Tsirkin, qemu-devel, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Sun, Feb 27, 2022 at 9:42 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This series enable shadow virtqueue (SVQ) for vhost-vdpa devices. This
> is intended as a new method of tracking the memory the devices touch
> during a migration process: Instead of relay on vhost device's dirty
> logging capability, SVQ intercepts the VQ dataplane forwarding the
> descriptors between VM and device. This way qemu is the effective
> writer of guests memory, like in qemu's virtio device operation.
>
> When SVQ is enabled qemu offers a new virtual address space to the
> device to read and write into, and it maps new vrings and the guest
> memory in it. SVQ also intercepts kicks and calls between the device
> and the guest. Used buffers relay would cause dirty memory being
> tracked.
>
> This effectively means that vDPA device passthrough is intercepted by
> qemu. While SVQ should only be enabled at migration time, the switching
> from regular mode to SVQ mode is left for a future series.
>
> It is based on the ideas of DPDK SW assisted LM, in the series of
> DPDK's https://patchwork.dpdk.org/cover/48370/ . However, these does
> not map the shadow vq in guest's VA, but in qemu's.
>
> For qemu to use shadow virtqueues the guest virtio driver must not use
> features like event_idx, indirect descriptors, packed and in_order.
> These features are easy to implement on top of this base, but is left
> for a future series for simplicity.
>
> SVQ needs to be enabled at qemu start time with vdpa cmdline parameter:
>
> -netdev type=vhost-vdpa,vhostdev=vhost-vdpa-0,id=vhost-vdpa0,x-svq=off
>
> The first three patches enables notifications forwarding with
> assistance of qemu. It's easy to enable only this if the relevant
> cmdline part of the last patch is applied on top of these.
>
> Next four patches implement the actual buffer forwarding. However,
> address are not translated from HVA so they will need a host device with
> an iommu allowing them to access all of the HVA range.
>
> The last part of the series uses properly the host iommu, so qemu
> creates a new iova address space in the device's range and translates
> the buffers in it. Finally, it adds the cmdline parameter.
>
> Some simple performance tests with netperf were done. They used a nested
> guest with vp_vdpa, vhost-kernel at L0 host. Starting with no svq and a
> baseline average of ~9980.13Mbps:
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
> 131072  16384  16384    30.01    9910.61
> 131072  16384  16384    30.00    10030.94
> 131072  16384  16384    30.01    9998.84
>
> To enable the notifications interception reduced performance to an
> average of ~9577.73Mbit/s:
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
> 131072  16384  16384    30.00    9563.03
> 131072  16384  16384    30.01    9626.65
> 131072  16384  16384    30.01    9543.51
>
> Finally, to enable buffers forwarding reduced the throughput again to
> ~8902.92Mbit/s:
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
> 131072  16384  16384    30.01    8643.19
> 131072  16384  16384    30.01    9033.56
> 131072  16384  16384    30.01    9032.02
>
> However, many performance improvements were left out of this series for
> simplicity, so difference if performance should shrink in the future.

I think the performance should be acceptable as a start.
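
For context, the quoted averages work out to roughly a 4% cost for
notification forwarding ((9980.13 - 9577.73) / 9980.13 ≈ 0.040) and roughly
an 11% cost once buffer forwarding is enabled ((9980.13 - 8902.92) / 9980.13
≈ 0.108), relative to the no-SVQ baseline.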

>
> Comments are welcome.
>
> TODO in future series:
> * Event, indirect, packed, and others features of virtio.
> * To support different set of features between the device<->SVQ and the
>   SVQ<->guest communication.
> * Support of device host notifier memory regions.
> * To sepparate buffers forwarding in its own AIO context, so we can
>   throw more threads to that task and we don't need to stop the main
>   event loop.
> * Support multiqueue virtio-net vdpa.
> * Proper documentation.
>
> Changes from v1:
> * Feature set at device->SVQ is now the same as SVQ->guest.
> * Size of SVQ is not max available device size anymore, but guest's
>   negotiated.
> * Add VHOST_FILE_UNBIND kick and call fd treatment.
> * Make SVQ a public struct
> * Come back to previous approach to iova-tree
> * Some assertions are now fail paths. Some errors are now log_guest.
> * Only mask _F_LOG feature at vdpa_set_features svq enable path.
> * Refactor some errors and messages. Add missing error unwindings.
> * Add memory barrier at _F_NO_NOTIFY set.
> * Stop checking for features flags out of transport range.
> v1 link:
> https://lore.kernel.org/virtualization/7d86c715-6d71-8a27-91f5-8d47b71e3201@redhat.com/
>
> Changes from v4 RFC:
> * Support of allocating / freeing iova ranges in IOVA tree. Extending
>   already present iova-tree for that.
> * Proper validation of guest features. Now SVQ can negotiate a
>   different set of features with the device when enabled.
> * Support of host notifiers memory regions
> * Handling of SVQ full queue in case guest's descriptors span to
>   different memory regions (qemu's VA chunks).
> * Flush pending used buffers at end of SVQ operation.
> * QMP command now looks by NetClientState name. Other devices will need
>   to implement it's way to enable vdpa.
> * Rename QMP command to set, so it looks more like a way of working
> * Better use of qemu error system
> * Make a few assertions proper error-handling paths.
> * Add more documentation
> * Less coupling of virtio / vhost, that could cause friction on changes
> * Addressed many other small comments and small fixes.
>
> Changes from v3 RFC:
>   * Move everything to vhost-vdpa backend. A big change, this allowed
>     some cleanup but more code has been added in other places.
>   * More use of glib utilities, especially to manage memory.
> v3 link:
> https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06032.html
>
> Changes from v2 RFC:
>   * Adding vhost-vdpa devices support
>   * Fixed some memory leaks pointed by different comments
> v2 link:
> https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg05600.html
>
> Changes from v1 RFC:
>   * Use QMP instead of migration to start SVQ mode.
>   * Only accepting IOMMU devices, closer behavior with target devices
>     (vDPA)
>   * Fix invalid masking/unmasking of vhost call fd.
>   * Use of proper methods for synchronization.
>   * No need to modify VirtIO device code, all of the changes are
>     contained in vhost code.
>   * Delete superfluous code.
>   * An intermediate RFC was sent with only the notifications forwarding
>     changes. It can be seen in
>     https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
> v1 link:
> https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
>
> Eugenio Pérez (20):
>       virtio: Add VIRTIO_F_QUEUE_STATE
>       virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
>       virtio: Add virtio_queue_is_host_notifier_enabled
>       vhost: Make vhost_virtqueue_{start,stop} public
>       vhost: Add x-vhost-enable-shadow-vq qmp
>       vhost: Add VhostShadowVirtqueue
>       vdpa: Register vdpa devices in a list
>       vhost: Route guest->host notification through shadow virtqueue
>       Add vhost_svq_get_svq_call_notifier
>       Add vhost_svq_set_guest_call_notifier
>       vdpa: Save call_fd in vhost-vdpa
>       vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
>       vhost: Route host->guest notification through shadow virtqueue
>       virtio: Add vhost_shadow_vq_get_vring_addr
>       vdpa: Save host and guest features
>       vhost: Add vhost_svq_valid_device_features to shadow vq
>       vhost: Shadow virtqueue buffers forwarding
>       vhost: Add VhostIOVATree
>       vhost: Use a tree to store memory mappings
>       vdpa: Add custom IOTLB translations to SVQ

This list seems wrong btw :)

Thanks

>
> Eugenio Pérez (14):
>   vhost: Add VhostShadowVirtqueue
>   vhost: Add Shadow VirtQueue kick forwarding capabilities
>   vhost: Add Shadow VirtQueue call forwarding capabilities
>   vhost: Add vhost_svq_valid_features to shadow vq
>   virtio: Add vhost_shadow_vq_get_vring_addr
>   vdpa: adapt vhost_ops callbacks to svq
>   vhost: Shadow virtqueue buffers forwarding
>   util: Add iova_tree_alloc
>   vhost: Add VhostIOVATree
>   vdpa: Add custom IOTLB translations to SVQ
>   vdpa: Adapt vhost_vdpa_get_vring_base to SVQ
>   vdpa: Never set log_base addr if SVQ is enabled
>   vdpa: Expose VHOST_F_LOG_ALL on SVQ
>   vdpa: Add x-svq to NetdevVhostVDPAOptions
>
>  qapi/net.json                      |   5 +-
>  hw/virtio/vhost-iova-tree.h        |  27 ++
>  hw/virtio/vhost-shadow-virtqueue.h |  90 ++++
>  include/hw/virtio/vhost-vdpa.h     |   8 +
>  include/qemu/iova-tree.h           |  18 +
>  hw/virtio/vhost-iova-tree.c        | 155 +++++++
>  hw/virtio/vhost-shadow-virtqueue.c | 632 +++++++++++++++++++++++++++++
>  hw/virtio/vhost-vdpa.c             | 551 ++++++++++++++++++++++++-
>  net/vhost-vdpa.c                   |  48 ++-
>  util/iova-tree.c                   | 133 ++++++
>  hw/virtio/meson.build              |   2 +-
>  11 files changed, 1644 insertions(+), 25 deletions(-)
>  create mode 100644 hw/virtio/vhost-iova-tree.h
>  create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
>  create mode 100644 hw/virtio/vhost-iova-tree.c
>  create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
>
> --
> 2.27.0
>
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities
  2022-02-27 13:40 ` [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities Eugenio Pérez
@ 2022-02-28  2:57     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  2:57 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, virtualization, Eli Cohen, Eric Blake,
	Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 9:40 PM, Eugenio Pérez wrote:
> At this mode no buffer forwarding will be performed in SVQ mode: Qemu
> will just forward the guest's kicks to the device.
>
> Host memory notifiers regions are left out for simplicity, and they will
> not be addressed in this series.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |  14 +++
>   include/hw/virtio/vhost-vdpa.h     |   4 +
>   hw/virtio/vhost-shadow-virtqueue.c |  52 +++++++++++
>   hw/virtio/vhost-vdpa.c             | 145 ++++++++++++++++++++++++++++-
>   4 files changed, 213 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index f1519e3c7b..1cbc87d5d8 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -18,8 +18,22 @@ typedef struct VhostShadowVirtqueue {
>       EventNotifier hdev_kick;
>       /* Shadow call notifier, sent to vhost */
>       EventNotifier hdev_call;
> +
> +    /*
> +     * Borrowed virtqueue's guest to host notifier. To borrow it in this event
> +     * notifier allows to recover the VhostShadowVirtqueue from the event loop
> +     * easily. If we use the VirtQueue's one, we don't have an easy way to
> +     * retrieve VhostShadowVirtqueue.
> +     *
> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> +     */
> +    EventNotifier svq_kick;
>   } VhostShadowVirtqueue;
>   
> +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
> +
> +void vhost_svq_stop(VhostShadowVirtqueue *svq);
> +
>   VhostShadowVirtqueue *vhost_svq_new(void);
>   
>   void vhost_svq_free(gpointer vq);
> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> index 3ce79a646d..009a9f3b6b 100644
> --- a/include/hw/virtio/vhost-vdpa.h
> +++ b/include/hw/virtio/vhost-vdpa.h
> @@ -12,6 +12,8 @@
>   #ifndef HW_VIRTIO_VHOST_VDPA_H
>   #define HW_VIRTIO_VHOST_VDPA_H
>   
> +#include <gmodule.h>
> +
>   #include "hw/virtio/virtio.h"
>   #include "standard-headers/linux/vhost_types.h"
>   
> @@ -27,6 +29,8 @@ typedef struct vhost_vdpa {
>       bool iotlb_batch_begin_sent;
>       MemoryListener listener;
>       struct vhost_vdpa_iova_range iova_range;
> +    bool shadow_vqs_enabled;
> +    GPtrArray *shadow_vqs;
>       struct vhost_dev *dev;
>       VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
>   } VhostVDPA;
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 019cf1950f..a5d0659f86 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -11,6 +11,56 @@
>   #include "hw/virtio/vhost-shadow-virtqueue.h"
>   
>   #include "qemu/error-report.h"
> +#include "qemu/main-loop.h"
> +#include "linux-headers/linux/vhost.h"
> +
> +/** Forward guest notifications */
> +static void vhost_handle_guest_kick(EventNotifier *n)
> +{
> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> +                                             svq_kick);
> +    event_notifier_test_and_clear(n);
> +    event_notifier_set(&svq->hdev_kick);
> +}
> +
> +/**
> + * Set a new file descriptor for the guest to kick the SVQ and notify for avail
> + *
> + * @svq          The svq
> + * @svq_kick_fd  The svq kick fd
> + *
> + * Note that the SVQ will never close the old file descriptor.
> + */
> +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
> +{
> +    EventNotifier *svq_kick = &svq->svq_kick;
> +    bool poll_stop = VHOST_FILE_UNBIND != event_notifier_get_fd(svq_kick);


I wonder if this is robust. E.g. is there any chance that we may end up with
both poll_stop and poll_start being false?

If not, can we simply detect poll_stop as below and treat !poll_start and
poll_stop together?

The rest looks good.

Thanks
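
For reference, a small sketch of the four (old fd, new fd) combinations the
function can see (the helper name and the grouping of branches are
illustrative, not part of the patch). Both flags are false only when the old
and the new descriptor are both VHOST_FILE_UNBIND, and in that case neither
branch runs.

/*
 * Sketch only. VHOST_FILE_UNBIND comes from linux/vhost.h.
 *
 *   old != UNBIND, new != UNBIND -> poll_stop and poll_start: move to new fd
 *   old != UNBIND, new == UNBIND -> poll_stop only: just remove the handler
 *   old == UNBIND, new != UNBIND -> poll_start only: just install the handler
 *   old == UNBIND, new == UNBIND -> neither flag set: the call is a no-op
 */
static void classify_kick_fd_switch(int old_fd, int new_fd)
{
    bool poll_stop  = old_fd != VHOST_FILE_UNBIND;
    bool poll_start = new_fd != VHOST_FILE_UNBIND;

    if (poll_stop && !poll_start) {
        /* unbind: drop the old handler, nothing to install */
    } else if (poll_start) {
        /* bind or rebind: install the handler on the new fd */
    }
    /* else: both fds are VHOST_FILE_UNBIND and there is nothing to do */
}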


> +    bool poll_start = svq_kick_fd != VHOST_FILE_UNBIND;
> +
> +    if (poll_stop) {
> +        event_notifier_set_handler(svq_kick, NULL);
> +    }
> +
> +    /*
> +     * event_notifier_set_handler already checks for guest's notifications if
> +     * they arrive at the new file descriptor in the switch, so there is no
> +     * need to explicitly check for them.
> +     */
> +    if (poll_start) {
> +        event_notifier_init_fd(svq_kick, svq_kick_fd);
> +        event_notifier_set(svq_kick);
> +        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick);
> +    }
> +}
> +
> +/**
> + * Stop the shadow virtqueue operation.
> + * @svq Shadow Virtqueue
> + */
> +void vhost_svq_stop(VhostShadowVirtqueue *svq)
> +{
> +    event_notifier_set_handler(&svq->svq_kick, NULL);
> +}
>   
>   /**
>    * Creates vhost shadow virtqueue, and instructs the vhost device to use the
> @@ -39,6 +89,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
>           goto err_init_hdev_call;
>       }
>   
> +    event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
>       return g_steal_pointer(&svq);
>   
>   err_init_hdev_call:
> @@ -56,6 +107,7 @@ err_init_hdev_kick:
>   void vhost_svq_free(gpointer pvq)
>   {
>       VhostShadowVirtqueue *vq = pvq;
> +    vhost_svq_stop(vq);
>       event_notifier_cleanup(&vq->hdev_kick);
>       event_notifier_cleanup(&vq->hdev_call);
>       g_free(vq);
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 04ea43704f..454bf50735 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -17,12 +17,14 @@
>   #include "hw/virtio/vhost.h"
>   #include "hw/virtio/vhost-backend.h"
>   #include "hw/virtio/virtio-net.h"
> +#include "hw/virtio/vhost-shadow-virtqueue.h"
>   #include "hw/virtio/vhost-vdpa.h"
>   #include "exec/address-spaces.h"
>   #include "qemu/main-loop.h"
>   #include "cpu.h"
>   #include "trace.h"
>   #include "qemu-common.h"
> +#include "qapi/error.h"
>   
>   /*
>    * Return one past the end of the end of section. Be careful with uint64_t
> @@ -342,6 +344,30 @@ static bool vhost_vdpa_one_time_request(struct vhost_dev *dev)
>       return v->index != 0;
>   }
>   
> +static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
> +                               Error **errp)
> +{
> +    g_autoptr(GPtrArray) shadow_vqs = NULL;
> +
> +    if (!v->shadow_vqs_enabled) {
> +        return 0;
> +    }
> +
> +    shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
> +    for (unsigned n = 0; n < hdev->nvqs; ++n) {
> +        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
> +
> +        if (unlikely(!svq)) {
> +            error_setg(errp, "Cannot create svq %u", n);
> +            return -1;
> +        }
> +        g_ptr_array_add(shadow_vqs, g_steal_pointer(&svq));
> +    }
> +
> +    v->shadow_vqs = g_steal_pointer(&shadow_vqs);
> +    return 0;
> +}
> +
>   static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
>   {
>       struct vhost_vdpa *v;
> @@ -364,6 +390,10 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
>       dev->opaque =  opaque ;
>       v->listener = vhost_vdpa_memory_listener;
>       v->msg_type = VHOST_IOTLB_MSG_V2;
> +    ret = vhost_vdpa_init_svq(dev, v, errp);
> +    if (ret) {
> +        goto err;
> +    }
>   
>       vhost_vdpa_get_iova_range(v);
>   
> @@ -375,6 +405,10 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
>                                  VIRTIO_CONFIG_S_DRIVER);
>   
>       return 0;
> +
> +err:
> +    ram_block_discard_disable(false);
> +    return ret;
>   }
>   
>   static void vhost_vdpa_host_notifier_uninit(struct vhost_dev *dev,
> @@ -444,8 +478,14 @@ err:
>   
>   static void vhost_vdpa_host_notifiers_init(struct vhost_dev *dev)
>   {
> +    struct vhost_vdpa *v = dev->opaque;
>       int i;
>   
> +    if (v->shadow_vqs_enabled) {
> +        /* FIXME SVQ is not compatible with host notifiers mr */
> +        return;
> +    }
> +
>       for (i = dev->vq_index; i < dev->vq_index + dev->nvqs; i++) {
>           if (vhost_vdpa_host_notifier_init(dev, i)) {
>               goto err;
> @@ -459,6 +499,21 @@ err:
>       return;
>   }
>   
> +static void vhost_vdpa_svq_cleanup(struct vhost_dev *dev)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    size_t idx;
> +
> +    if (!v->shadow_vqs) {
> +        return;
> +    }
> +
> +    for (idx = 0; idx < v->shadow_vqs->len; ++idx) {
> +        vhost_svq_stop(g_ptr_array_index(v->shadow_vqs, idx));
> +    }
> +    g_ptr_array_free(v->shadow_vqs, true);
> +}
> +
>   static int vhost_vdpa_cleanup(struct vhost_dev *dev)
>   {
>       struct vhost_vdpa *v;
> @@ -467,6 +522,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
>       trace_vhost_vdpa_cleanup(dev, v);
>       vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
>       memory_listener_unregister(&v->listener);
> +    vhost_vdpa_svq_cleanup(dev);
>   
>       dev->opaque = NULL;
>       ram_block_discard_disable(false);
> @@ -558,11 +614,26 @@ static int vhost_vdpa_get_device_id(struct vhost_dev *dev,
>       return ret;
>   }
>   
> +static void vhost_vdpa_reset_svq(struct vhost_vdpa *v)
> +{
> +    if (!v->shadow_vqs_enabled) {
> +        return;
> +    }
> +
> +    for (unsigned i = 0; i < v->shadow_vqs->len; ++i) {
> +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
> +        vhost_svq_stop(svq);
> +    }
> +}
> +
>   static int vhost_vdpa_reset_device(struct vhost_dev *dev)
>   {
> +    struct vhost_vdpa *v = dev->opaque;
>       int ret;
>       uint8_t status = 0;
>   
> +    vhost_vdpa_reset_svq(v);
> +
>       ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
>       trace_vhost_vdpa_reset_device(dev, status);
>       return ret;
> @@ -646,13 +717,75 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, uint8_t *config,
>       return ret;
>    }
>   
> +static int vhost_vdpa_set_vring_dev_kick(struct vhost_dev *dev,
> +                                         struct vhost_vring_file *file)
> +{
> +    trace_vhost_vdpa_set_vring_kick(dev, file->index, file->fd);
> +    return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
> +}
> +
> +/**
> + * Set the shadow virtqueue descriptors to the device
> + *
> + * @dev   The vhost device model
> + * @svq   The shadow virtqueue
> + * @idx   The index of the virtqueue in the vhost device
> + * @errp  Error
> + */
> +static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> +                                 VhostShadowVirtqueue *svq,
> +                                 unsigned idx,
> +                                 Error **errp)
> +{
> +    struct vhost_vring_file file = {
> +        .index = dev->vq_index + idx,
> +    };
> +    const EventNotifier *event_notifier = &svq->hdev_kick;
> +    int r;
> +
> +    file.fd = event_notifier_get_fd(event_notifier);
> +    r = vhost_vdpa_set_vring_dev_kick(dev, &file);
> +    if (unlikely(r != 0)) {
> +        error_setg_errno(errp, -r, "Can't set device kick fd");
> +    }
> +
> +    return r == 0;
> +}
> +
> +static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    Error *err = NULL;
> +    unsigned i;
> +
> +    if (!v->shadow_vqs) {
> +        return true;
> +    }
> +
> +    for (i = 0; i < v->shadow_vqs->len; ++i) {
> +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
> +        bool ok = vhost_vdpa_svq_setup(dev, svq, i, &err);
> +        if (unlikely(!ok)) {
> +            error_reportf_err(err, "Cannot setup SVQ %u: ", i);
> +            return false;
> +        }
> +    }
> +
> +    return true;
> +}
> +
>   static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>   {
>       struct vhost_vdpa *v = dev->opaque;
> +    bool ok;
>       trace_vhost_vdpa_dev_start(dev, started);
>   
>       if (started) {
>           vhost_vdpa_host_notifiers_init(dev);
> +        ok = vhost_vdpa_svqs_start(dev);
> +        if (unlikely(!ok)) {
> +            return -1;
> +        }
>           vhost_vdpa_set_vring_ready(dev);
>       } else {
>           vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
> @@ -724,8 +857,16 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
>   static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
>                                          struct vhost_vring_file *file)
>   {
> -    trace_vhost_vdpa_set_vring_kick(dev, file->index, file->fd);
> -    return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
> +    struct vhost_vdpa *v = dev->opaque;
> +    int vdpa_idx = file->index - dev->vq_index;
> +
> +    if (v->shadow_vqs_enabled) {
> +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, vdpa_idx);
> +        vhost_svq_set_svq_kick_fd(svq, file->fd);
> +        return 0;
> +    } else {
> +        return vhost_vdpa_set_vring_dev_kick(dev, file);
> +    }
>   }
>   
>   static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 03/14] vhost: Add Shadow VirtQueue call forwarding capabilities
  2022-02-27 13:41 ` [PATCH v2 03/14] vhost: Add Shadow VirtQueue call " Eugenio Pérez
@ 2022-02-28  3:18     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  3:18 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, virtualization, Eli Cohen, Eric Blake,
	Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> This will make qemu aware of the device used buffers, allowing it to
> write the guest memory with its contents if needed.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |  4 ++++
>   hw/virtio/vhost-shadow-virtqueue.c | 34 ++++++++++++++++++++++++++++++
>   hw/virtio/vhost-vdpa.c             | 31 +++++++++++++++++++++++++--
>   3 files changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 1cbc87d5d8..1d4c160d0a 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -28,9 +28,13 @@ typedef struct VhostShadowVirtqueue {
>        * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
>        */
>       EventNotifier svq_kick;
> +
> +    /* Guest's call notifier, where the SVQ calls guest. */
> +    EventNotifier svq_call;
>   } VhostShadowVirtqueue;
>   
>   void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
> +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
>   
>   void vhost_svq_stop(VhostShadowVirtqueue *svq);
>   
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index a5d0659f86..54c701a196 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -23,6 +23,38 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>       event_notifier_set(&svq->hdev_kick);
>   }
>   
> +/* Forward vhost notifications */
> +static void vhost_svq_handle_call(EventNotifier *n)
> +{
> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> +                                             hdev_call);
> +    event_notifier_test_and_clear(n);
> +    event_notifier_set(&svq->svq_call);
> +}
> +
> +/**
> + * Set the call notifier for the SVQ to call the guest
> + *
> + * @svq Shadow virtqueue
> + * @call_fd call notifier
> + *
> + * Called on BQL context.
> + */
> +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)


I think we need to have consistent naming for both kick and call. Note 
that in patch 2 we had

vhost_svq_set_svq_kick_fd

Maybe it's better to use vhost_svq_set_guest_call_fd() here.
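
For reference, the two setters side by side with that naming applied
(prototypes only, a sketch of the suggestion rather than code from the
series):

void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
void vhost_svq_set_guest_call_fd(VhostShadowVirtqueue *svq, int call_fd);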


> +{
> +    if (call_fd == VHOST_FILE_UNBIND) {
> +        /*
> +         * Fail event_notifier_set if called handling device call.
> +         *
> +         * SVQ still needs device notifications, since it needs to keep
> +         * forwarding used buffers even with the unbind.
> +         */
> +        memset(&svq->svq_call, 0, sizeof(svq->svq_call));


I may be missing something, but shouldn't we stop polling svq_call here, like

event_notifier_set_handle(&svq->svq_call, false);

?

Thanks



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 03/14] vhost: Add Shadow VirtQueue call forwarding capabilities
@ 2022-02-28  3:18     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  3:18 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, Peter Xu, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan


在 2022/2/27 下午9:41, Eugenio Pérez 写道:
> This will make qemu aware of the device used buffers, allowing it to
> write the guest memory with its contents if needed.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |  4 ++++
>   hw/virtio/vhost-shadow-virtqueue.c | 34 ++++++++++++++++++++++++++++++
>   hw/virtio/vhost-vdpa.c             | 31 +++++++++++++++++++++++++--
>   3 files changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 1cbc87d5d8..1d4c160d0a 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -28,9 +28,13 @@ typedef struct VhostShadowVirtqueue {
>        * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
>        */
>       EventNotifier svq_kick;
> +
> +    /* Guest's call notifier, where the SVQ calls guest. */
> +    EventNotifier svq_call;
>   } VhostShadowVirtqueue;
>   
>   void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
> +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
>   
>   void vhost_svq_stop(VhostShadowVirtqueue *svq);
>   
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index a5d0659f86..54c701a196 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -23,6 +23,38 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>       event_notifier_set(&svq->hdev_kick);
>   }
>   
> +/* Forward vhost notifications */
> +static void vhost_svq_handle_call(EventNotifier *n)
> +{
> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> +                                             hdev_call);
> +    event_notifier_test_and_clear(n);
> +    event_notifier_set(&svq->svq_call);
> +}
> +
> +/**
> + * Set the call notifier for the SVQ to call the guest
> + *
> + * @svq Shadow virtqueue
> + * @call_fd call notifier
> + *
> + * Called on BQL context.
> + */
> +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)


I think we need to have consistent naming for both kick and call. Note 
that in patch 2 we had

vhost_svq_set_svq_kick_fd

Maybe it's better to use vhost_svq_set_guest_call_fd() here.


> +{
> +    if (call_fd == VHOST_FILE_UNBIND) {
> +        /*
> +         * Fail event_notifier_set if called handling device call.
> +         *
> +         * SVQ still needs device notifications, since it needs to keep
> +         * forwarding used buffers even with the unbind.
> +         */
> +        memset(&svq->svq_call, 0, sizeof(svq->svq_call));


I may miss something but shouldn't we stop polling svq_call here like

event_notifier_set_handler(&svq->svq_call, NULL);

?

Thanks




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/14] vhost: Add vhost_svq_valid_features to shadow vq
  2022-02-27 13:41 ` [PATCH v2 04/14] vhost: Add vhost_svq_valid_features to shadow vq Eugenio Pérez
@ 2022-02-28  3:25     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  3:25 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, virtualization, Eli Cohen, Eric Blake,
	Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> This allows SVQ to negotiate features with the guest and the device. For
> the device, SVQ is a driver. While this function bypasses all
> non-transport features, it needs to disable the features that SVQ does
> not support when forwarding buffers. This includes packed vq layout,
> indirect descriptors or event idx.
>
> Future changes can add support to offer more features to the guest,
> since the use of VirtQueue gives this for free. This is left out at the
> moment for simplicity.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |  2 ++
>   hw/virtio/vhost-shadow-virtqueue.c | 39 ++++++++++++++++++++++++++++++
>   hw/virtio/vhost-vdpa.c             | 18 ++++++++++++++
>   3 files changed, 59 insertions(+)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 1d4c160d0a..84747655ad 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -33,6 +33,8 @@ typedef struct VhostShadowVirtqueue {
>       EventNotifier svq_call;
>   } VhostShadowVirtqueue;
>   
> +bool vhost_svq_valid_features(uint64_t *features);
> +
>   void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
>   void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
>   
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 54c701a196..34354aea2c 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -14,6 +14,45 @@
>   #include "qemu/main-loop.h"
>   #include "linux-headers/linux/vhost.h"
>   
> +/**
> + * Validate the transport device features that both guests can use with the SVQ
> + * and SVQs can use with the device.
> + *
> + * @dev_features  The features. If success, the acknowledged features. If
> + *                failure, the minimal set from it.
> + *
> + * Returns true if SVQ can go with a subset of these, false otherwise.
> + */
> +bool vhost_svq_valid_features(uint64_t *features)
> +{
> +    bool r = true;
> +
> +    for (uint64_t b = VIRTIO_TRANSPORT_F_START; b <= VIRTIO_TRANSPORT_F_END;
> +         ++b) {
> +        switch (b) {
> +        case VIRTIO_F_ANY_LAYOUT:
> +            continue;
> +
> +        case VIRTIO_F_ACCESS_PLATFORM:
> +            /* SVQ trust in the host's IOMMU to translate addresses */
> +        case VIRTIO_F_VERSION_1:
> +            /* SVQ trust that the guest vring is little endian */
> +            if (!(*features & BIT_ULL(b))) {
> +                set_bit(b, features);
> +                r = false;
> +            }
> +            continue;


It looks to me like *features is only used for logging errors, if that is
indeed the case. I suggest doing error_setg in this function instead of
having the caller pass a pointer.
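Roughly something like this untested sketch (the error messages are only for
illustration):

bool vhost_svq_valid_features(uint64_t features, Error **errp)
{
    for (uint64_t b = VIRTIO_TRANSPORT_F_START; b <= VIRTIO_TRANSPORT_F_END;
         ++b) {
        switch (b) {
        case VIRTIO_F_ANY_LAYOUT:
            continue;

        case VIRTIO_F_ACCESS_PLATFORM:
        case VIRTIO_F_VERSION_1:
            if (!(features & BIT_ULL(b))) {
                error_setg(errp, "SVQ needs feature bit %" PRIu64, b);
                return false;
            }
            continue;

        default:
            if (features & BIT_ULL(b)) {
                error_setg(errp, "SVQ cannot offer feature bit %" PRIu64, b);
                return false;
            }
        }
    }

    return true;
}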


> +
> +        default:
> +            if (*features & BIT_ULL(b)) {
> +                clear_bit(b, features);
> +            }


Do we need to check indirect and event idx here?
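If we want them spelled out explicitly (right now they are swallowed by the
default arm), it could be something like this sketch, with the bit names as
in the virtio headers:

        case VIRTIO_RING_F_INDIRECT_DESC:
        case VIRTIO_RING_F_EVENT_IDX:
            /* Not supported by SVQ for now, so never offer them */
            if (*features & BIT_ULL(b)) {
                clear_bit(b, features);
            }
            continue;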

Thanks


> +        }
> +    }
> +
> +    return r;
> +}
> +
>   /** Forward guest notifications */
>   static void vhost_handle_guest_kick(EventNotifier *n)
>   {
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index c73215751d..d614c435f3 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -348,11 +348,29 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
>                                  Error **errp)
>   {
>       g_autoptr(GPtrArray) shadow_vqs = NULL;
> +    uint64_t dev_features, svq_features;
> +    int r;
> +    bool ok;
>   
>       if (!v->shadow_vqs_enabled) {
>           return 0;
>       }
>   
> +    r = hdev->vhost_ops->vhost_get_features(hdev, &dev_features);
> +    if (r != 0) {
> +        error_setg_errno(errp, -r, "Can't get vdpa device features");
> +        return r;
> +    }
> +
> +    svq_features = dev_features;
> +    ok = vhost_svq_valid_features(&svq_features);
> +    if (unlikely(!ok)) {
> +        error_setg(errp,
> +            "SVQ Invalid device feature flags, offer: 0x%"PRIx64", ok: 0x%"PRIx64,
> +            dev_features, svq_features);
> +        return -1;
> +    }
> +
>       shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
>       for (unsigned n = 0; n < hdev->nvqs; ++n) {
>           g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();


^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [PATCH v2 06/14] vdpa: adapt vhost_ops callbacks to svq
  2022-02-27 13:41 ` [PATCH v2 06/14] vdpa: adapt vhost_ops callbacks to svq Eugenio Pérez
@ 2022-02-28  3:59     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  3:59 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, virtualization, Eli Cohen, Eric Blake,
	Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> First half of the buffers forwarding part, preparing vhost-vdpa
> callbacks to SVQ to offer it. QEMU cannot enable it at this moment, so
> this is effectively dead code at the moment, but it helps to reduce
> patch size.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++------
>   1 file changed, 73 insertions(+), 11 deletions(-)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index d614c435f3..b2c4e92fcf 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -344,6 +344,16 @@ static bool vhost_vdpa_one_time_request(struct vhost_dev *dev)
>       return v->index != 0;
>   }
>   
> +static int vhost_vdpa_get_dev_features(struct vhost_dev *dev,
> +                                       uint64_t *features)
> +{
> +    int ret;
> +
> +    ret = vhost_vdpa_call(dev, VHOST_GET_FEATURES, features);
> +    trace_vhost_vdpa_get_features(dev, *features);
> +    return ret;
> +}
> +
>   static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
>                                  Error **errp)
>   {
> @@ -356,7 +366,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
>           return 0;
>       }
>   
> -    r = hdev->vhost_ops->vhost_get_features(hdev, &dev_features);
> +    r = vhost_vdpa_get_dev_features(hdev, &dev_features);
>       if (r != 0) {
>           error_setg_errno(errp, -r, "Can't get vdpa device features");
>           return r;
> @@ -583,12 +593,26 @@ static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
>   static int vhost_vdpa_set_features(struct vhost_dev *dev,
>                                      uint64_t features)
>   {
> +    struct vhost_vdpa *v = dev->opaque;
>       int ret;
>   
>       if (vhost_vdpa_one_time_request(dev)) {
>           return 0;
>       }
>   
> +    if (v->shadow_vqs_enabled) {
> +        uint64_t features_ok = features;
> +        bool ok;
> +
> +        ok = vhost_svq_valid_features(&features_ok);
> +        if (unlikely(!ok)) {
> +            error_report(
> +                "Invalid guest acked feature flag, acked: 0x%"
> +                PRIx64", ok: 0x%"PRIx64, features, features_ok);
> +            return -EINVAL;
> +        }
> +    }
> +
>       trace_vhost_vdpa_set_features(dev, features);
>       ret = vhost_vdpa_call(dev, VHOST_SET_FEATURES, &features);
>       if (ret) {
> @@ -735,6 +759,13 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, uint8_t *config,
>       return ret;
>    }
>   
> +static int vhost_vdpa_set_dev_vring_base(struct vhost_dev *dev,
> +                                         struct vhost_vring_state *ring)
> +{
> +    trace_vhost_vdpa_set_vring_base(dev, ring->index, ring->num);
> +    return vhost_vdpa_call(dev, VHOST_SET_VRING_BASE, ring);
> +}
> +
>   static int vhost_vdpa_set_vring_dev_kick(struct vhost_dev *dev,
>                                            struct vhost_vring_file *file)
>   {
> @@ -749,6 +780,18 @@ static int vhost_vdpa_set_vring_dev_call(struct vhost_dev *dev,
>       return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
>   }
>   
> +static int vhost_vdpa_set_vring_dev_addr(struct vhost_dev *dev,
> +                                         struct vhost_vring_addr *addr)
> +{
> +    trace_vhost_vdpa_set_vring_addr(dev, addr->index, addr->flags,
> +                                addr->desc_user_addr, addr->used_user_addr,
> +                                addr->avail_user_addr,
> +                                addr->log_guest_addr);
> +
> +    return vhost_vdpa_call(dev, VHOST_SET_VRING_ADDR, addr);
> +
> +}
> +
>   /**
>    * Set the shadow virtqueue descriptors to the device
>    *
> @@ -859,11 +902,17 @@ static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
>   static int vhost_vdpa_set_vring_addr(struct vhost_dev *dev,
>                                          struct vhost_vring_addr *addr)
>   {
> -    trace_vhost_vdpa_set_vring_addr(dev, addr->index, addr->flags,
> -                                    addr->desc_user_addr, addr->used_user_addr,
> -                                    addr->avail_user_addr,
> -                                    addr->log_guest_addr);
> -    return vhost_vdpa_call(dev, VHOST_SET_VRING_ADDR, addr);
> +    struct vhost_vdpa *v = dev->opaque;
> +
> +    if (v->shadow_vqs_enabled) {
> +        /*
> +         * Device vring addr was set at device start. SVQ base is handled by
> +         * VirtQueue code.
> +         */
> +        return 0;
> +    }
> +
> +    return vhost_vdpa_set_vring_dev_addr(dev, addr);
>   }
>   
>   static int vhost_vdpa_set_vring_num(struct vhost_dev *dev,
> @@ -876,8 +925,17 @@ static int vhost_vdpa_set_vring_num(struct vhost_dev *dev,
>   static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
>                                          struct vhost_vring_state *ring)
>   {
> -    trace_vhost_vdpa_set_vring_base(dev, ring->index, ring->num);
> -    return vhost_vdpa_call(dev, VHOST_SET_VRING_BASE, ring);
> +    struct vhost_vdpa *v = dev->opaque;
> +
> +    if (v->shadow_vqs_enabled) {
> +        /*
> +         * Device vring base was set at device start. SVQ base is handled by
> +         * VirtQueue code.
> +         */
> +        return 0;
> +    }
> +
> +    return vhost_vdpa_set_dev_vring_base(dev, ring);
>   }
>   
>   static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
> @@ -924,10 +982,14 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
>   static int vhost_vdpa_get_features(struct vhost_dev *dev,
>                                        uint64_t *features)
>   {
> -    int ret;
> +    struct vhost_vdpa *v = dev->opaque;
> +    int ret = vhost_vdpa_get_dev_features(dev, features);
> +
> +    if (ret == 0 && v->shadow_vqs_enabled) {
> +        /* Filter only features that SVQ can offer to guest */
> +        vhost_svq_valid_features(features);


I think it's better not to silently clear features here (e.g. the feature
could be explicitly enabled via the CLI). This may cause more trouble for
future cross-vendor/backend live migration.

We can simply validate (and fail) during vhost_vdpa init instead.
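I.e. something in the direction of this sketch, leaving the features
untouched here and relying on the check already done at init time:

static int vhost_vdpa_get_features(struct vhost_dev *dev, uint64_t *features)
{
    /*
     * No SVQ filtering here: vhost_vdpa_init_svq() already rejects feature
     * sets that SVQ cannot work with, so report the device features as-is.
     */
    return vhost_vdpa_get_dev_features(dev, features);
}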

Thanks


> +    }
>   
> -    ret = vhost_vdpa_call(dev, VHOST_GET_FEATURES, features);
> -    trace_vhost_vdpa_get_features(dev, *features);
>       return ret;
>   }
>   


^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [PATCH v2 07/14] vhost: Shadow virtqueue buffers forwarding
  2022-02-27 13:41 ` [PATCH v2 07/14] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
@ 2022-02-28  5:39     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  5:39 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, virtualization, Eli Cohen, Eric Blake,
	Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> Initial version of shadow virtqueue that actually forward buffers. There
> is no iommu support at the moment, and that will be addressed in future
> patches of this series. Since all vhost-vdpa devices use forced IOMMU,
> this means that SVQ is not usable at this point of the series on any
> device.
>
> For simplicity it only supports modern devices, that expects vring
> in little endian, with split ring and no event idx or indirect
> descriptors. Support for them will not be added in this series.
>
> It reuses the VirtQueue code for the device part. The driver part is
> based on Linux's virtio_ring driver, but with stripped functionality
> and optimizations so it's easier to review.
>
> However, forwarding buffers have some particular pieces: One of the most
> unexpected ones is that a guest's buffer can expand through more than
> one descriptor in SVQ. While this is handled gracefully by qemu's
> emulated virtio devices, it may cause unexpected SVQ queue full. This
> patch also solves it by checking for this condition at both guest's
> kicks and device's calls. The code may be more elegant in the future if
> SVQ code runs in its own iocontext.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |  29 +++
>   hw/virtio/vhost-shadow-virtqueue.c | 360 ++++++++++++++++++++++++++++-
>   hw/virtio/vhost-vdpa.c             | 165 ++++++++++++-
>   3 files changed, 542 insertions(+), 12 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 3bbea77082..04c67685fd 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -36,6 +36,33 @@ typedef struct VhostShadowVirtqueue {
>   
>       /* Guest's call notifier, where the SVQ calls guest. */
>       EventNotifier svq_call;
> +
> +    /* Virtio queue shadowing */
> +    VirtQueue *vq;
> +
> +    /* Virtio device */
> +    VirtIODevice *vdev;
> +
> +    /* Map for use the guest's descriptors */
> +    VirtQueueElement **ring_id_maps;
> +
> +    /* Next VirtQueue element that guest made available */
> +    VirtQueueElement *next_guest_avail_elem;
> +
> +    /* Next head to expose to the device */
> +    uint16_t avail_idx_shadow;
> +
> +    /* Next free descriptor */
> +    uint16_t free_head;
> +
> +    /* Last seen used idx */
> +    uint16_t shadow_used_idx;


I suggest having consistent names for the avail and used indexes instead of
using "shadow" as a prefix for one and a suffix for the other.


> +
> +    /* Next head to consume from the device */
> +    uint16_t last_used_idx;
> +
> +    /* Cache for the exposed notification flag */
> +    bool notification;
>   } VhostShadowVirtqueue;
>   
>   bool vhost_svq_valid_features(uint64_t *features);
> @@ -47,6 +74,8 @@ void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
>   size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
>   size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
>   
> +void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> +                     VirtQueue *vq);
>   void vhost_svq_stop(VhostShadowVirtqueue *svq);
>   
>   VhostShadowVirtqueue *vhost_svq_new(void);
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 2150e2b071..a38d313755 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -12,6 +12,7 @@
>   
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
> +#include "qemu/log.h"
>   #include "linux-headers/linux/vhost.h"
>   
>   /**
> @@ -53,22 +54,309 @@ bool vhost_svq_valid_features(uint64_t *features)
>       return r;
>   }
>   
> -/** Forward guest notifications */
> -static void vhost_handle_guest_kick(EventNotifier *n)
> +/**
> + * Number of descriptors that the SVQ can make available from the guest.
> + *
> + * @svq   The svq


Btw, I notice that most functions here put a colon between the parameter
name and its documentation. Maybe we should follow that.
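I.e. in the style of this sketch:

/**
 * Number of descriptors that the SVQ can make available from the guest.
 *
 * @svq: The svq
 */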


> + */
> +static uint16_t vhost_svq_available_slots(const VhostShadowVirtqueue *svq)
> +{
> +    return svq->vring.num - (svq->avail_idx_shadow - svq->shadow_used_idx);
> +}
> +
> +static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> +{
> +    uint16_t notification_flag;
> +
> +    if (svq->notification == enable) {
> +        return;
> +    }
> +
> +    notification_flag = cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
> +
> +    svq->notification = enable;
> +    if (enable) {
> +        svq->vring.avail->flags &= ~notification_flag;
> +    } else {
> +        svq->vring.avail->flags |= notification_flag;
> +        /* Make sure the flag is written before the read of used_idx */
> +        smp_mb();


So the comment assumes that a read of used_idx will follow. This makes
me wonder if we can simply split this function into:

vhost_svq_disable_notification() and vhost_svq_enable_notification()

and in vhost_svq_enable_notification(), simply return
vhost_svq_more_used() after the smp_mb().

(Not a must, but I feel it might be clearer.)
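Something like this sketch (ignoring the svq->notification cache for
brevity):

static void vhost_svq_disable_notification(VhostShadowVirtqueue *svq)
{
    svq->vring.avail->flags |= cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
}

/* Returns whether there are new used buffers after re-enabling */
static bool vhost_svq_enable_notification(VhostShadowVirtqueue *svq)
{
    svq->vring.avail->flags &= ~cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
    /* Make sure the flag is written before the read of used_idx */
    smp_mb();
    return vhost_svq_more_used(svq);
}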


> +    }
> +}
> +
> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> +                                    const struct iovec *iovec,
> +                                    size_t num, bool more_descs, bool write)
> +{
> +    uint16_t i = svq->free_head, last = svq->free_head;
> +    unsigned n;
> +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> +    vring_desc_t *descs = svq->vring.desc;
> +
> +    if (num == 0) {
> +        return;
> +    }
> +
> +    for (n = 0; n < num; n++) {
> +        if (more_descs || (n + 1 < num)) {
> +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> +        } else {
> +            descs[i].flags = flags;
> +        }
> +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
> +
> +        last = i;
> +        i = cpu_to_le16(descs[i].next);
> +    }
> +
> +    svq->free_head = le16_to_cpu(descs[last].next);
> +}
> +
> +static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> +                                VirtQueueElement *elem,
> +                                unsigned *head)
> +{
> +    unsigned avail_idx;
> +    vring_avail_t *avail = svq->vring.avail;
> +
> +    *head = svq->free_head;
> +
> +    /* We need some descriptors here */
> +    if (unlikely(!elem->out_num && !elem->in_num)) {
> +        qemu_log_mask(LOG_GUEST_ERROR,
> +            "Guest provided element with no descriptors");
> +        return false;
> +    }
> +
> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> +                            elem->in_num > 0, false);
> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);


I wonder, instead of passing in/out separately and using a hint like
more_descs, whether it is better to simply pass the elem to
vhost_vring_write_descs(); then we know which descriptor is the last
without depending on more_descs.
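I.e. roughly this sketch, keeping the descriptor filling itself as it is
today:

static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
                                    const VirtQueueElement *elem)
{
    vring_desc_t *descs = svq->vring.desc;
    uint16_t i = svq->free_head, last = svq->free_head;
    size_t num = elem->out_num + elem->in_num;

    for (size_t n = 0; n < num; n++) {
        bool write = n >= elem->out_num;
        const struct iovec *iov = write ? &elem->in_sg[n - elem->out_num]
                                        : &elem->out_sg[n];
        uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;

        if (n + 1 < num) {
            /* Only the very last descriptor of the whole chain lacks F_NEXT */
            flags |= cpu_to_le16(VRING_DESC_F_NEXT);
        }

        descs[i].flags = flags;
        descs[i].addr = cpu_to_le64((hwaddr)iov->iov_base);
        descs[i].len = cpu_to_le32(iov->iov_len);

        last = i;
        i = le16_to_cpu(descs[i].next);
    }

    svq->free_head = le16_to_cpu(descs[last].next);
}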


> +
> +    /*
> +     * Put the entry in the available array (but don't update avail->idx until
> +     * they do sync).
> +     */
> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> +    avail->ring[avail_idx] = cpu_to_le16(*head);
> +    svq->avail_idx_shadow++;
> +
> +    /* Update the avail index after write the descriptor */
> +    smp_wmb();
> +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
> +
> +    return true;
> +}
> +
> +static bool vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> +{
> +    unsigned qemu_head;
> +    bool ok = vhost_svq_add_split(svq, elem, &qemu_head);
> +    if (unlikely(!ok)) {
> +        return false;
> +    }
> +
> +    svq->ring_id_maps[qemu_head] = elem;
> +    return true;
> +}
> +
> +static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> +{
> +    /*
> +     * We need to expose the available array entries before checking the used
> +     * flags
> +     */
> +    smp_mb();
> +    if (svq->vring.used->flags & VRING_USED_F_NO_NOTIFY) {
> +        return;
> +    }
> +
> +    event_notifier_set(&svq->hdev_kick);
> +}
> +
> +/**
> + * Forward available buffers.
> + *
> + * @svq Shadow VirtQueue
> + *
> + * Note that this function does not guarantee that all guest's available
> + * buffers are available to the device in SVQ avail ring. The guest may have
> + * exposed a GPA / GIOVA contiguous buffer, but it may not be contiguous in
> + * qemu vaddr.
> + *
> + * If that happens, guest's kick notifications will be disabled until the
> + * device uses some buffers.
> + */
> +static void vhost_handle_guest_kick(VhostShadowVirtqueue *svq)
> +{
> +    /* Clear event notifier */
> +    event_notifier_test_and_clear(&svq->svq_kick);
> +
> +    /* Forward to the device as many available buffers as possible */
> +    do {
> +        virtio_queue_set_notification(svq->vq, false);
> +
> +        while (true) {
> +            VirtQueueElement *elem;
> +            bool ok;
> +
> +            if (svq->next_guest_avail_elem) {
> +                elem = g_steal_pointer(&svq->next_guest_avail_elem);
> +            } else {
> +                elem = virtqueue_pop(svq->vq, sizeof(*elem));
> +            }
> +
> +            if (!elem) {
> +                break;
> +            }
> +
> +            if (elem->out_num + elem->in_num >
> +                vhost_svq_available_slots(svq)) {
> +                /*
> +                 * This condition is possible since a contiguous buffer in GPA
> +                 * does not imply a contiguous buffer in qemu's VA
> +                 * scatter-gather segments. If that happens, the buffer exposed
> +                 * to the device needs to be a chain of descriptors at this
> +                 * moment.
> +                 *
> +                 * SVQ cannot hold more available buffers if we are here:
> +                 * queue the current guest descriptor and ignore further kicks
> +                 * until some elements are used.
> +                 */
> +                svq->next_guest_avail_elem = elem;
> +                return;
> +            }
> +
> +            ok = vhost_svq_add(svq, elem);
> +            if (unlikely(!ok)) {
> +                /* VQ is broken, just return and ignore any other kicks */
> +                return;
> +            }
> +            vhost_svq_kick(svq);
> +        }
> +
> +        virtio_queue_set_notification(svq->vq, true);
> +    } while (!virtio_queue_empty(svq->vq));
> +}
> +
> +/**
> + * Handle guest's kick.
> + *
> + * @n guest kick event notifier, the one that guest set to notify svq.
> + */
> +static void vhost_handle_guest_kick_notifier(EventNotifier *n)
>   {
>       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>                                                svq_kick);
>       event_notifier_test_and_clear(n);
> -    event_notifier_set(&svq->hdev_kick);
> +    vhost_handle_guest_kick(svq);
> +}
> +
> +static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> +{
> +    if (svq->last_used_idx != svq->shadow_used_idx) {
> +        return true;
> +    }
> +
> +    svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
> +
> +    return svq->last_used_idx != svq->shadow_used_idx;
>   }
>   
> -/* Forward vhost notifications */
> +static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> +{
> +    vring_desc_t *descs = svq->vring.desc;
> +    const vring_used_t *used = svq->vring.used;
> +    vring_used_elem_t used_elem;
> +    uint16_t last_used;
> +
> +    if (!vhost_svq_more_used(svq)) {
> +        return NULL;
> +    }
> +
> +    /* Only get used array entries after they have been exposed by dev */
> +    smp_rmb();
> +    last_used = svq->last_used_idx & (svq->vring.num - 1);
> +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
> +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
> +
> +    svq->last_used_idx++;
> +    if (unlikely(used_elem.id >= svq->vring.num)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "Device %s says index %u is used",
> +                      svq->vdev->name, used_elem.id);
> +        return NULL;
> +    }
> +
> +    if (unlikely(!svq->ring_id_maps[used_elem.id])) {
> +        qemu_log_mask(LOG_GUEST_ERROR,
> +            "Device %s says index %u is used, but it was not available",
> +            svq->vdev->name, used_elem.id);
> +        return NULL;
> +    }
> +
> +    descs[used_elem.id].next = svq->free_head;
> +    svq->free_head = used_elem.id;
> +
> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;


It looks to me like we'd better not modify the elem here; otherwise we may
leak mappings during virtqueue_unmap_sg()?
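E.g. the used length could be handed back on the side instead of touching
the elem, something like this sketch:

static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq,
                                           uint32_t *len)
{
    /* ... unchanged up to the used_elem.id sanity checks ... */

    descs[used_elem.id].next = svq->free_head;
    svq->free_head = used_elem.id;

    /* Instead of writing svq->ring_id_maps[used_elem.id]->len */
    *len = used_elem.len;
    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
}

and then vhost_svq_flush() would pass that length to virtqueue_fill()
directly.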


> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> +}
> +
> +static void vhost_svq_flush(VhostShadowVirtqueue *svq,
> +                            bool check_for_avail_queue)
> +{
> +    VirtQueue *vq = svq->vq;
> +
> +    /* Forward as many used buffers as possible. */
> +    do {
> +        unsigned i = 0;
> +
> +        vhost_svq_set_notification(svq, false);
> +        while (true) {
> +            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> +            if (!elem) {
> +                break;
> +            }
> +
> +            if (unlikely(i >= svq->vring.num)) {
> +                qemu_log_mask(LOG_GUEST_ERROR,
> +                         "More than %u used buffers obtained in a %u size SVQ",
> +                         i, svq->vring.num);
> +                virtqueue_fill(vq, elem, elem->len, i);
> +                virtqueue_flush(vq, i);
> +                return;
> +            }
> +            virtqueue_fill(vq, elem, elem->len, i++);
> +        }
> +
> +        virtqueue_flush(vq, i);
> +        event_notifier_set(&svq->svq_call);
> +
> +        if (check_for_avail_queue && svq->next_guest_avail_elem) {
> +            /*
> +             * Avail ring was full when vhost_svq_flush was called, so it's a
> +             * good moment to make more descriptors available if possible.
> +             */
> +            vhost_handle_guest_kick(svq);
> +        }
> +
> +        vhost_svq_set_notification(svq, true);
> +    } while (vhost_svq_more_used(svq));
> +}
> +
> +/**
> + * Forward used buffers.
> + *
> + * @n hdev call event notifier, the one that device set to notify svq.
> + *
> + * Note that we are not making any buffers available in the loop, there is no
> + * way that it runs more than virtqueue size times.
> + */
>   static void vhost_svq_handle_call(EventNotifier *n)
>   {
>       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>                                                hdev_call);
>       event_notifier_test_and_clear(n);
> -    event_notifier_set(&svq->svq_call);
> +    vhost_svq_flush(svq, true);
>   }
>   
>   /**
> @@ -149,7 +437,41 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
>       if (poll_start) {
>           event_notifier_init_fd(svq_kick, svq_kick_fd);
>           event_notifier_set(svq_kick);
> -        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick);
> +        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick_notifier);
> +    }
> +}
> +
> +/**
> + * Start the shadow virtqueue operation.
> + *
> + * @svq Shadow Virtqueue
> + * @vdev        VirtIO device
> + * @vq          Virtqueue to shadow
> + */
> +void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> +                     VirtQueue *vq)
> +{
> +    size_t desc_size, driver_size, device_size;
> +
> +    svq->next_guest_avail_elem = NULL;
> +    svq->avail_idx_shadow = 0;
> +    svq->shadow_used_idx = 0;
> +    svq->last_used_idx = 0;
> +    svq->vdev = vdev;
> +    svq->vq = vq;
> +
> +    svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
> +    driver_size = vhost_svq_driver_area_size(svq);
> +    device_size = vhost_svq_device_area_size(svq);
> +    svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
> +    desc_size = sizeof(vring_desc_t) * svq->vring.num;
> +    svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> +    memset(svq->vring.desc, 0, driver_size);
> +    svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> +    memset(svq->vring.used, 0, device_size);
> +    svq->ring_id_maps = g_new0(VirtQueueElement *, svq->vring.num);
> +    for (unsigned i = 0; i < svq->vring.num - 1; i++) {
> +        svq->vring.desc[i].next = cpu_to_le16(i + 1);
>       }
>   }
>   
> @@ -160,6 +482,32 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
>   void vhost_svq_stop(VhostShadowVirtqueue *svq)
>   {
>       event_notifier_set_handler(&svq->svq_kick, NULL);
> +    g_autofree VirtQueueElement *next_avail_elem = NULL;
> +
> +    if (!svq->vq) {
> +        return;
> +    }
> +
> +    /* Send all pending used descriptors to guest */
> +    vhost_svq_flush(svq, false);
> +
> +    for (unsigned i = 0; i < svq->vring.num; ++i) {
> +        g_autofree VirtQueueElement *elem = NULL;
> +        elem = g_steal_pointer(&svq->ring_id_maps[i]);
> +        if (elem) {
> +            virtqueue_detach_element(svq->vq, elem, elem->len);
> +        }
> +    }
> +
> +    next_avail_elem = g_steal_pointer(&svq->next_guest_avail_elem);
> +    if (next_avail_elem) {
> +        virtqueue_detach_element(svq->vq, next_avail_elem,
> +                                 next_avail_elem->len);
> +    }
> +    svq->vq = NULL;
> +    g_free(svq->ring_id_maps);
> +    qemu_vfree(svq->vring.desc);
> +    qemu_vfree(svq->vring.used);
>   }
>   
>   /**
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index b2c4e92fcf..435b9c2e9e 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -803,10 +803,10 @@ static int vhost_vdpa_set_vring_dev_addr(struct vhost_dev *dev,
>    * Note that this function does not rewind kick file descriptor if cannot set
>    * call one.
>    */
> -static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> -                                 VhostShadowVirtqueue *svq,
> -                                 unsigned idx,
> -                                 Error **errp)
> +static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
> +                                  VhostShadowVirtqueue *svq,
> +                                  unsigned idx,
> +                                  Error **errp)
>   {
>       struct vhost_vring_file file = {
>           .index = dev->vq_index + idx,
> @@ -818,7 +818,7 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
>       r = vhost_vdpa_set_vring_dev_kick(dev, &file);
>       if (unlikely(r != 0)) {
>           error_setg_errno(errp, -r, "Can't set device kick fd");
> -        return false;
> +        return r;
>       }
>   
>       event_notifier = &svq->hdev_call;
> @@ -828,6 +828,119 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
>           error_setg_errno(errp, -r, "Can't set device call fd");
>       }
>   
> +    return r;
> +}
> +
> +/**
> + * Unmap a SVQ area in the device
> + */
> +static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
> +                                      hwaddr size)
> +{
> +    int r;
> +
> +    size = ROUND_UP(size, qemu_real_host_page_size);
> +    r = vhost_vdpa_dma_unmap(v, iova, size);
> +    return r == 0;
> +}
> +
> +static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> +                                       const VhostShadowVirtqueue *svq)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    struct vhost_vring_addr svq_addr;
> +    size_t device_size = vhost_svq_device_area_size(svq);
> +    size_t driver_size = vhost_svq_driver_area_size(svq);
> +    bool ok;
> +
> +    vhost_svq_get_vring_addr(svq, &svq_addr);
> +
> +    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
> +    if (unlikely(!ok)) {
> +        return false;
> +    }
> +
> +    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
> +}
> +
> +/**
> + * Map shadow virtqueue rings in device
> + *
> + * @dev   The vhost device
> + * @svq   The shadow virtqueue
> + * @addr  Assigned IOVA addresses
> + * @errp  Error pointer
> + */
> +static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> +                                     const VhostShadowVirtqueue *svq,
> +                                     struct vhost_vring_addr *addr,
> +                                     Error **errp)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    size_t device_size = vhost_svq_device_area_size(svq);
> +    size_t driver_size = vhost_svq_driver_area_size(svq);
> +    int r;
> +
> +    ERRP_GUARD();
> +    vhost_svq_get_vring_addr(svq, addr);
> +
> +    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
> +                           (void *)addr->desc_user_addr, true);
> +    if (unlikely(r != 0)) {
> +        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
> +        return false;
> +    }
> +
> +    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
> +                           (void *)addr->used_user_addr, false);
> +    if (unlikely(r != 0)) {
> +        error_setg_errno(errp, -r, "Cannot create vq device region: ");
> +    }
> +
> +    return r == 0;
> +}
> +
> +static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> +                                 VhostShadowVirtqueue *svq,
> +                                 unsigned idx,
> +                                 Error **errp)
> +{
> +    uint16_t vq_index = dev->vq_index + idx;
> +    struct vhost_vring_state s = {
> +        .index = vq_index,
> +    };
> +    int r;
> +
> +    r = vhost_vdpa_set_dev_vring_base(dev, &s);
> +    if (unlikely(r)) {
> +        error_setg_errno(errp, -r, "Cannot set vring base");
> +        return false;
> +    }
> +
> +    r = vhost_vdpa_svq_set_fds(dev, svq, idx, errp);
> +    return r == 0;
> +}
> +
> +static bool vhost_vdpa_svq_set_addr(struct vhost_dev *dev, unsigned idx,
> +                                    VhostShadowVirtqueue *svq,
> +                                    Error **errp)
> +{
> +    struct vhost_vring_addr addr = {
> +        .index = idx,
> +    };
> +    int r;
> +
> +    bool ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, errp);
> +    if (unlikely(!ok)) {
> +        return false;
> +    }
> +
> +    /* Override vring GPA set by vhost subsystem */
> +    r = vhost_vdpa_set_vring_dev_addr(dev, &addr);
> +    if (unlikely(r != 0)) {
> +        error_setg_errno(errp, -r, "Cannot set device address");
> +    }
> +
>       return r == 0;
>   }
>   
> @@ -842,10 +955,46 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
>       }
>   
>       for (i = 0; i < v->shadow_vqs->len; ++i) {
> +        VirtQueue *vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
>           VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
>           bool ok = vhost_vdpa_svq_setup(dev, svq, i, &err);
>           if (unlikely(!ok)) {
> -            error_reportf_err(err, "Cannot setup SVQ %u: ", i);
> +            goto err;
> +        }
> +        vhost_svq_start(svq, dev->vdev, vq);
> +        ok = vhost_vdpa_svq_set_addr(dev, i, svq, &err);
> +        if (unlikely(!ok)) {
> +            vhost_svq_stop(svq);
> +            goto err;


Nit: let's introduce a new error label for this?
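E.g. (sketch, label name only illustrative):

        ok = vhost_vdpa_svq_set_addr(dev, i, svq, &err);
        if (unlikely(!ok)) {
            goto err_set_addr;
        }
    }

    return true;

err_set_addr:
    vhost_svq_stop(g_ptr_array_index(v->shadow_vqs, i));
err:
    error_reportf_err(err, "Cannot setup SVQ %u: ", i);
    /* ... unwind the already started queues as today ... */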


> +        }
> +    }
> +
> +    return true;
> +
> +err:
> +    error_reportf_err(err, "Cannot setup SVQ %u: ", i);
> +    for (unsigned j = 0; j < i; ++j) {
> +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, j);
> +        vhost_vdpa_svq_unmap_rings(dev, svq);


I wonder if it's better to move vhost_vdpa_svq_map_rings() out of
vhost_vdpa_svq_set_addr() (this function seems to be its only user).
That would make the error handling logic clearer.
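I.e. the start path could read like this sketch (signatures adjusted
accordingly, label names only illustrative):

        vhost_svq_start(svq, dev->vdev, vq);

        ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, &err);
        if (unlikely(!ok)) {
            goto err_map;
        }

        /* Now only overrides the vring address set by the vhost subsystem */
        ok = vhost_vdpa_svq_set_addr(dev, i, svq, &addr, &err);
        if (unlikely(!ok)) {
            goto err_set_addr;
        }

with err_set_addr unmapping the rings and err_map stopping the SVQ before
falling through to the common error path.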

Thanks


> +        vhost_svq_stop(svq);
> +    }
> +
> +    return false;
> +}
> +
> +static bool vhost_vdpa_svqs_stop(struct vhost_dev *dev)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +
> +    if (!v->shadow_vqs) {
> +        return true;
> +    }
> +
> +    for (unsigned i = 0; i < v->shadow_vqs->len; ++i) {
> +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs,
> +                                                      i);
> +        bool ok = vhost_vdpa_svq_unmap_rings(dev, svq);
> +        if (unlikely(!ok)) {
>               return false;
>           }
>       }
> @@ -867,6 +1016,10 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>           }
>           vhost_vdpa_set_vring_ready(dev);
>       } else {
> +        ok = vhost_vdpa_svqs_stop(dev);
> +        if (unlikely(!ok)) {
> +            return -1;
> +        }
>           vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
>       }
>   


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/14] vhost: Shadow virtqueue buffers forwarding
@ 2022-02-28  5:39     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  5:39 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, Peter Xu, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> Initial version of shadow virtqueue that actually forward buffers. There
> is no iommu support at the moment, and that will be addressed in future
> patches of this series. Since all vhost-vdpa devices use forced IOMMU,
> this means that SVQ is not usable at this point of the series on any
> device.
>
> For simplicity it only supports modern devices, that expects vring
> in little endian, with split ring and no event idx or indirect
> descriptors. Support for them will not be added in this series.
>
> It reuses the VirtQueue code for the device part. The driver part is
> based on Linux's virtio_ring driver, but with stripped functionality
> and optimizations so it's easier to review.
>
> However, forwarding buffers have some particular pieces: One of the most
> unexpected ones is that a guest's buffer can expand through more than
> one descriptor in SVQ. While this is handled gracefully by qemu's
> emulated virtio devices, it may cause unexpected SVQ queue full. This
> patch also solves it by checking for this condition at both guest's
> kicks and device's calls. The code may be more elegant in the future if
> SVQ code runs in its own iocontext.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |  29 +++
>   hw/virtio/vhost-shadow-virtqueue.c | 360 ++++++++++++++++++++++++++++-
>   hw/virtio/vhost-vdpa.c             | 165 ++++++++++++-
>   3 files changed, 542 insertions(+), 12 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 3bbea77082..04c67685fd 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -36,6 +36,33 @@ typedef struct VhostShadowVirtqueue {
>   
>       /* Guest's call notifier, where the SVQ calls guest. */
>       EventNotifier svq_call;
> +
> +    /* Virtio queue shadowing */
> +    VirtQueue *vq;
> +
> +    /* Virtio device */
> +    VirtIODevice *vdev;
> +
> +    /* Map for use the guest's descriptors */
> +    VirtQueueElement **ring_id_maps;
> +
> +    /* Next VirtQueue element that guest made available */
> +    VirtQueueElement *next_guest_avail_elem;
> +
> +    /* Next head to expose to the device */
> +    uint16_t avail_idx_shadow;
> +
> +    /* Next free descriptor */
> +    uint16_t free_head;
> +
> +    /* Last seen used idx */
> +    uint16_t shadow_used_idx;


I suggest having consistent names for the avail and used indexes instead of
using "shadow" as a prefix for one and a suffix for the other.


> +
> +    /* Next head to consume from the device */
> +    uint16_t last_used_idx;
> +
> +    /* Cache for the exposed notification flag */
> +    bool notification;
>   } VhostShadowVirtqueue;
>   
>   bool vhost_svq_valid_features(uint64_t *features);
> @@ -47,6 +74,8 @@ void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
>   size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
>   size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
>   
> +void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> +                     VirtQueue *vq);
>   void vhost_svq_stop(VhostShadowVirtqueue *svq);
>   
>   VhostShadowVirtqueue *vhost_svq_new(void);
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 2150e2b071..a38d313755 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -12,6 +12,7 @@
>   
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
> +#include "qemu/log.h"
>   #include "linux-headers/linux/vhost.h"
>   
>   /**
> @@ -53,22 +54,309 @@ bool vhost_svq_valid_features(uint64_t *features)
>       return r;
>   }
>   
> -/** Forward guest notifications */
> -static void vhost_handle_guest_kick(EventNotifier *n)
> +/**
> + * Number of descriptors that the SVQ can make available from the guest.
> + *
> + * @svq   The svq


Btw, I notice that most functions put a colon between the parameter and
its documentation. Maybe we should follow that convention here.
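i.e. something like:

 * @svq: The svq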


> + */
> +static uint16_t vhost_svq_available_slots(const VhostShadowVirtqueue *svq)
> +{
> +    return svq->vring.num - (svq->avail_idx_shadow - svq->shadow_used_idx);
> +}
> +
> +static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> +{
> +    uint16_t notification_flag;
> +
> +    if (svq->notification == enable) {
> +        return;
> +    }
> +
> +    notification_flag = cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
> +
> +    svq->notification = enable;
> +    if (enable) {
> +        svq->vring.avail->flags &= ~notification_flag;
> +    } else {
> +        svq->vring.avail->flags |= notification_flag;
> +        /* Make sure the flag is written before the read of used_idx */
> +        smp_mb();


So the comment assumes that a read of used_idx will follow. This makes
me wonder if we can simply split this function into:

vhost_svq_disable_notification() and vhost_svq_enable_notification()

and in vhost_svq_enable_notification() simply return
vhost_svq_more_used() after the smp_mb().

(Not a must, it just feels like it might be clearer.)
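
Something like this untested sketch is what I have in mind (it assumes
vhost_svq_more_used() is declared before this point):

static void vhost_svq_disable_notification(VhostShadowVirtqueue *svq)
{
    if (svq->notification) {
        svq->notification = false;
        /* Ask the device not to interrupt while SVQ drains the used ring */
        svq->vring.avail->flags |= cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
    }
}

/* Returns whether the device exposed more used buffers while re-enabling */
static bool vhost_svq_enable_notification(VhostShadowVirtqueue *svq)
{
    if (!svq->notification) {
        svq->notification = true;
        svq->vring.avail->flags &= ~cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
    }

    /* Make sure the flag is written before the read of used_idx */
    smp_mb();
    return vhost_svq_more_used(svq);
}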


> +    }
> +}
> +
> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> +                                    const struct iovec *iovec,
> +                                    size_t num, bool more_descs, bool write)
> +{
> +    uint16_t i = svq->free_head, last = svq->free_head;
> +    unsigned n;
> +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> +    vring_desc_t *descs = svq->vring.desc;
> +
> +    if (num == 0) {
> +        return;
> +    }
> +
> +    for (n = 0; n < num; n++) {
> +        if (more_descs || (n + 1 < num)) {
> +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> +        } else {
> +            descs[i].flags = flags;
> +        }
> +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
> +
> +        last = i;
> +        i = cpu_to_le16(descs[i].next);
> +    }
> +
> +    svq->free_head = le16_to_cpu(descs[last].next);
> +}
> +
> +static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> +                                VirtQueueElement *elem,
> +                                unsigned *head)
> +{
> +    unsigned avail_idx;
> +    vring_avail_t *avail = svq->vring.avail;
> +
> +    *head = svq->free_head;
> +
> +    /* We need some descriptors here */
> +    if (unlikely(!elem->out_num && !elem->in_num)) {
> +        qemu_log_mask(LOG_GUEST_ERROR,
> +            "Guest provided element with no descriptors");
> +        return false;
> +    }
> +
> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> +                            elem->in_num > 0, false);
> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);


I wonder, instead of passing the in/out iovecs separately and using a
hint like more_descs, whether it is better to simply pass the elem to
vhost_vring_write_descs(). Then the function knows which descriptor is
the last one without depending on more_descs.
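
Roughly (untested, just to show the idea; vhost_svq_add_split() would
then simply call vhost_vring_write_descs(svq, elem)):

static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
                                    const VirtQueueElement *elem)
{
    uint16_t i = svq->free_head, last = svq->free_head;
    size_t num = elem->out_num + elem->in_num;
    vring_desc_t *descs = svq->vring.desc;
    size_t n;

    for (n = 0; n < num; n++) {
        /* in_sg entries follow out_sg entries and are device-writable */
        bool write = n >= elem->out_num;
        const struct iovec *sg = write ? &elem->in_sg[n - elem->out_num]
                                       : &elem->out_sg[n];
        uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;

        if (n + 1 < num) {
            /* Only the last descriptor of the chain lacks the NEXT flag */
            flags |= cpu_to_le16(VRING_DESC_F_NEXT);
        }
        descs[i].flags = flags;
        descs[i].addr = cpu_to_le64((hwaddr)sg->iov_base);
        descs[i].len = cpu_to_le32(sg->iov_len);

        last = i;
        i = le16_to_cpu(descs[i].next);
    }

    svq->free_head = le16_to_cpu(descs[last].next);
}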


> +
> +    /*
> +     * Put the entry in the available array (but don't update avail->idx until
> +     * they do sync).
> +     */
> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> +    avail->ring[avail_idx] = cpu_to_le16(*head);
> +    svq->avail_idx_shadow++;
> +
> +    /* Update the avail index after write the descriptor */
> +    smp_wmb();
> +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
> +
> +    return true;
> +}
> +
> +static bool vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> +{
> +    unsigned qemu_head;
> +    bool ok = vhost_svq_add_split(svq, elem, &qemu_head);
> +    if (unlikely(!ok)) {
> +        return false;
> +    }
> +
> +    svq->ring_id_maps[qemu_head] = elem;
> +    return true;
> +}
> +
> +static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> +{
> +    /*
> +     * We need to expose the available array entries before checking the used
> +     * flags
> +     */
> +    smp_mb();
> +    if (svq->vring.used->flags & VRING_USED_F_NO_NOTIFY) {
> +        return;
> +    }
> +
> +    event_notifier_set(&svq->hdev_kick);
> +}
> +
> +/**
> + * Forward available buffers.
> + *
> + * @svq Shadow VirtQueue
> + *
> + * Note that this function does not guarantee that all guest's available
> + * buffers are available to the device in SVQ avail ring. The guest may have
> + * exposed a GPA / GIOVA contiguous buffer, but it may not be contiguous in
> + * qemu vaddr.
> + *
> + * If that happens, guest's kick notifications will be disabled until the
> + * device uses some buffers.
> + */
> +static void vhost_handle_guest_kick(VhostShadowVirtqueue *svq)
> +{
> +    /* Clear event notifier */
> +    event_notifier_test_and_clear(&svq->svq_kick);
> +
> +    /* Forward to the device as many available buffers as possible */
> +    do {
> +        virtio_queue_set_notification(svq->vq, false);
> +
> +        while (true) {
> +            VirtQueueElement *elem;
> +            bool ok;
> +
> +            if (svq->next_guest_avail_elem) {
> +                elem = g_steal_pointer(&svq->next_guest_avail_elem);
> +            } else {
> +                elem = virtqueue_pop(svq->vq, sizeof(*elem));
> +            }
> +
> +            if (!elem) {
> +                break;
> +            }
> +
> +            if (elem->out_num + elem->in_num >
> +                vhost_svq_available_slots(svq)) {
> +                /*
> +                 * This condition is possible since a contiguous buffer in GPA
> +                 * does not imply a contiguous buffer in qemu's VA
> +                 * scatter-gather segments. If that happens, the buffer exposed
> +                 * to the device needs to be a chain of descriptors at this
> +                 * moment.
> +                 *
> +                 * SVQ cannot hold more available buffers if we are here:
> +                 * queue the current guest descriptor and ignore further kicks
> +                 * until some elements are used.
> +                 */
> +                svq->next_guest_avail_elem = elem;
> +                return;
> +            }
> +
> +            ok = vhost_svq_add(svq, elem);
> +            if (unlikely(!ok)) {
> +                /* VQ is broken, just return and ignore any other kicks */
> +                return;
> +            }
> +            vhost_svq_kick(svq);
> +        }
> +
> +        virtio_queue_set_notification(svq->vq, true);
> +    } while (!virtio_queue_empty(svq->vq));
> +}
> +
> +/**
> + * Handle guest's kick.
> + *
> + * @n guest kick event notifier, the one that guest set to notify svq.
> + */
> +static void vhost_handle_guest_kick_notifier(EventNotifier *n)
>   {
>       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>                                                svq_kick);
>       event_notifier_test_and_clear(n);
> -    event_notifier_set(&svq->hdev_kick);
> +    vhost_handle_guest_kick(svq);
> +}
> +
> +static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> +{
> +    if (svq->last_used_idx != svq->shadow_used_idx) {
> +        return true;
> +    }
> +
> +    svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
> +
> +    return svq->last_used_idx != svq->shadow_used_idx;
>   }
>   
> -/* Forward vhost notifications */
> +static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> +{
> +    vring_desc_t *descs = svq->vring.desc;
> +    const vring_used_t *used = svq->vring.used;
> +    vring_used_elem_t used_elem;
> +    uint16_t last_used;
> +
> +    if (!vhost_svq_more_used(svq)) {
> +        return NULL;
> +    }
> +
> +    /* Only get used array entries after they have been exposed by dev */
> +    smp_rmb();
> +    last_used = svq->last_used_idx & (svq->vring.num - 1);
> +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
> +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
> +
> +    svq->last_used_idx++;
> +    if (unlikely(used_elem.id >= svq->vring.num)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "Device %s says index %u is used",
> +                      svq->vdev->name, used_elem.id);
> +        return NULL;
> +    }
> +
> +    if (unlikely(!svq->ring_id_maps[used_elem.id])) {
> +        qemu_log_mask(LOG_GUEST_ERROR,
> +            "Device %s says index %u is used, but it was not available",
> +            svq->vdev->name, used_elem.id);
> +        return NULL;
> +    }
> +
> +    descs[used_elem.id].next = svq->free_head;
> +    svq->free_head = used_elem.id;
> +
> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;


It looks to me that we'd better not modify the elem here, otherwise we
may leak mappings during virtqueue_unmap_sg()?
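
E.g. (untested) the written length could be returned out of band so the
elem stays untouched, and vhost_svq_flush() would then pass it to
virtqueue_fill() instead of elem->len:

static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq,
                                           uint32_t *len)
{
    vring_desc_t *descs = svq->vring.desc;
    const vring_used_t *used = svq->vring.used;
    vring_used_elem_t used_elem;
    uint16_t last_used;

    if (!vhost_svq_more_used(svq)) {
        return NULL;
    }

    /* Only get used array entries after they have been exposed by dev */
    smp_rmb();
    last_used = svq->last_used_idx & (svq->vring.num - 1);
    used_elem.id = le32_to_cpu(used->ring[last_used].id);
    used_elem.len = le32_to_cpu(used->ring[last_used].len);

    svq->last_used_idx++;
    if (unlikely(used_elem.id >= svq->vring.num ||
                 !svq->ring_id_maps[used_elem.id])) {
        qemu_log_mask(LOG_GUEST_ERROR,
                      "Device %s reports an invalid used index %u",
                      svq->vdev->name, used_elem.id);
        return NULL;
    }

    /* Return the descriptor chain to the free list */
    descs[used_elem.id].next = svq->free_head;
    svq->free_head = used_elem.id;

    /* Report the written length without touching the elem */
    *len = used_elem.len;
    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
}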


> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> +}
> +
> +static void vhost_svq_flush(VhostShadowVirtqueue *svq,
> +                            bool check_for_avail_queue)
> +{
> +    VirtQueue *vq = svq->vq;
> +
> +    /* Forward as many used buffers as possible. */
> +    do {
> +        unsigned i = 0;
> +
> +        vhost_svq_set_notification(svq, false);
> +        while (true) {
> +            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> +            if (!elem) {
> +                break;
> +            }
> +
> +            if (unlikely(i >= svq->vring.num)) {
> +                qemu_log_mask(LOG_GUEST_ERROR,
> +                         "More than %u used buffers obtained in a %u size SVQ",
> +                         i, svq->vring.num);
> +                virtqueue_fill(vq, elem, elem->len, i);
> +                virtqueue_flush(vq, i);
> +                return;
> +            }
> +            virtqueue_fill(vq, elem, elem->len, i++);
> +        }
> +
> +        virtqueue_flush(vq, i);
> +        event_notifier_set(&svq->svq_call);
> +
> +        if (check_for_avail_queue && svq->next_guest_avail_elem) {
> +            /*
> +             * Avail ring was full when vhost_svq_flush was called, so it's a
> +             * good moment to make more descriptors available if possible.
> +             */
> +            vhost_handle_guest_kick(svq);
> +        }
> +
> +        vhost_svq_set_notification(svq, true);
> +    } while (vhost_svq_more_used(svq));
> +}
> +
> +/**
> + * Forward used buffers.
> + *
> + * @n hdev call event notifier, the one that device set to notify svq.
> + *
> + * Note that we are not making any buffers available in the loop, there is no
> + * way that it runs more than virtqueue size times.
> + */
>   static void vhost_svq_handle_call(EventNotifier *n)
>   {
>       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>                                                hdev_call);
>       event_notifier_test_and_clear(n);
> -    event_notifier_set(&svq->svq_call);
> +    vhost_svq_flush(svq, true);
>   }
>   
>   /**
> @@ -149,7 +437,41 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
>       if (poll_start) {
>           event_notifier_init_fd(svq_kick, svq_kick_fd);
>           event_notifier_set(svq_kick);
> -        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick);
> +        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick_notifier);
> +    }
> +}
> +
> +/**
> + * Start the shadow virtqueue operation.
> + *
> + * @svq Shadow Virtqueue
> + * @vdev        VirtIO device
> + * @vq          Virtqueue to shadow
> + */
> +void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> +                     VirtQueue *vq)
> +{
> +    size_t desc_size, driver_size, device_size;
> +
> +    svq->next_guest_avail_elem = NULL;
> +    svq->avail_idx_shadow = 0;
> +    svq->shadow_used_idx = 0;
> +    svq->last_used_idx = 0;
> +    svq->vdev = vdev;
> +    svq->vq = vq;
> +
> +    svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
> +    driver_size = vhost_svq_driver_area_size(svq);
> +    device_size = vhost_svq_device_area_size(svq);
> +    svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
> +    desc_size = sizeof(vring_desc_t) * svq->vring.num;
> +    svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> +    memset(svq->vring.desc, 0, driver_size);
> +    svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> +    memset(svq->vring.used, 0, device_size);
> +    svq->ring_id_maps = g_new0(VirtQueueElement *, svq->vring.num);
> +    for (unsigned i = 0; i < svq->vring.num - 1; i++) {
> +        svq->vring.desc[i].next = cpu_to_le16(i + 1);
>       }
>   }
>   
> @@ -160,6 +482,32 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
>   void vhost_svq_stop(VhostShadowVirtqueue *svq)
>   {
>       event_notifier_set_handler(&svq->svq_kick, NULL);
> +    g_autofree VirtQueueElement *next_avail_elem = NULL;
> +
> +    if (!svq->vq) {
> +        return;
> +    }
> +
> +    /* Send all pending used descriptors to guest */
> +    vhost_svq_flush(svq, false);
> +
> +    for (unsigned i = 0; i < svq->vring.num; ++i) {
> +        g_autofree VirtQueueElement *elem = NULL;
> +        elem = g_steal_pointer(&svq->ring_id_maps[i]);
> +        if (elem) {
> +            virtqueue_detach_element(svq->vq, elem, elem->len);
> +        }
> +    }
> +
> +    next_avail_elem = g_steal_pointer(&svq->next_guest_avail_elem);
> +    if (next_avail_elem) {
> +        virtqueue_detach_element(svq->vq, next_avail_elem,
> +                                 next_avail_elem->len);
> +    }
> +    svq->vq = NULL;
> +    g_free(svq->ring_id_maps);
> +    qemu_vfree(svq->vring.desc);
> +    qemu_vfree(svq->vring.used);
>   }
>   
>   /**
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index b2c4e92fcf..435b9c2e9e 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -803,10 +803,10 @@ static int vhost_vdpa_set_vring_dev_addr(struct vhost_dev *dev,
>    * Note that this function does not rewind kick file descriptor if cannot set
>    * call one.
>    */
> -static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> -                                 VhostShadowVirtqueue *svq,
> -                                 unsigned idx,
> -                                 Error **errp)
> +static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
> +                                  VhostShadowVirtqueue *svq,
> +                                  unsigned idx,
> +                                  Error **errp)
>   {
>       struct vhost_vring_file file = {
>           .index = dev->vq_index + idx,
> @@ -818,7 +818,7 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
>       r = vhost_vdpa_set_vring_dev_kick(dev, &file);
>       if (unlikely(r != 0)) {
>           error_setg_errno(errp, -r, "Can't set device kick fd");
> -        return false;
> +        return r;
>       }
>   
>       event_notifier = &svq->hdev_call;
> @@ -828,6 +828,119 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
>           error_setg_errno(errp, -r, "Can't set device call fd");
>       }
>   
> +    return r;
> +}
> +
> +/**
> + * Unmap a SVQ area in the device
> + */
> +static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
> +                                      hwaddr size)
> +{
> +    int r;
> +
> +    size = ROUND_UP(size, qemu_real_host_page_size);
> +    r = vhost_vdpa_dma_unmap(v, iova, size);
> +    return r == 0;
> +}
> +
> +static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> +                                       const VhostShadowVirtqueue *svq)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    struct vhost_vring_addr svq_addr;
> +    size_t device_size = vhost_svq_device_area_size(svq);
> +    size_t driver_size = vhost_svq_driver_area_size(svq);
> +    bool ok;
> +
> +    vhost_svq_get_vring_addr(svq, &svq_addr);
> +
> +    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
> +    if (unlikely(!ok)) {
> +        return false;
> +    }
> +
> +    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
> +}
> +
> +/**
> + * Map shadow virtqueue rings in device
> + *
> + * @dev   The vhost device
> + * @svq   The shadow virtqueue
> + * @addr  Assigned IOVA addresses
> + * @errp  Error pointer
> + */
> +static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> +                                     const VhostShadowVirtqueue *svq,
> +                                     struct vhost_vring_addr *addr,
> +                                     Error **errp)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    size_t device_size = vhost_svq_device_area_size(svq);
> +    size_t driver_size = vhost_svq_driver_area_size(svq);
> +    int r;
> +
> +    ERRP_GUARD();
> +    vhost_svq_get_vring_addr(svq, addr);
> +
> +    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
> +                           (void *)addr->desc_user_addr, true);
> +    if (unlikely(r != 0)) {
> +        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
> +        return false;
> +    }
> +
> +    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
> +                           (void *)addr->used_user_addr, false);
> +    if (unlikely(r != 0)) {
> +        error_setg_errno(errp, -r, "Cannot create vq device region: ");
> +    }
> +
> +    return r == 0;
> +}
> +
> +static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> +                                 VhostShadowVirtqueue *svq,
> +                                 unsigned idx,
> +                                 Error **errp)
> +{
> +    uint16_t vq_index = dev->vq_index + idx;
> +    struct vhost_vring_state s = {
> +        .index = vq_index,
> +    };
> +    int r;
> +
> +    r = vhost_vdpa_set_dev_vring_base(dev, &s);
> +    if (unlikely(r)) {
> +        error_setg_errno(errp, -r, "Cannot set vring base");
> +        return false;
> +    }
> +
> +    r = vhost_vdpa_svq_set_fds(dev, svq, idx, errp);
> +    return r == 0;
> +}
> +
> +static bool vhost_vdpa_svq_set_addr(struct vhost_dev *dev, unsigned idx,
> +                                    VhostShadowVirtqueue *svq,
> +                                    Error **errp)
> +{
> +    struct vhost_vring_addr addr = {
> +        .index = idx,
> +    };
> +    int r;
> +
> +    bool ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, errp);
> +    if (unlikely(!ok)) {
> +        return false;
> +    }
> +
> +    /* Override vring GPA set by vhost subsystem */
> +    r = vhost_vdpa_set_vring_dev_addr(dev, &addr);
> +    if (unlikely(r != 0)) {
> +        error_setg_errno(errp, -r, "Cannot set device address");
> +    }
> +
>       return r == 0;
>   }
>   
> @@ -842,10 +955,46 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
>       }
>   
>       for (i = 0; i < v->shadow_vqs->len; ++i) {
> +        VirtQueue *vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
>           VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
>           bool ok = vhost_vdpa_svq_setup(dev, svq, i, &err);
>           if (unlikely(!ok)) {
> -            error_reportf_err(err, "Cannot setup SVQ %u: ", i);
> +            goto err;
> +        }
> +        vhost_svq_start(svq, dev->vdev, vq);
> +        ok = vhost_vdpa_svq_set_addr(dev, i, svq, &err);
> +        if (unlikely(!ok)) {
> +            vhost_svq_stop(svq);
> +            goto err;


Nit: let's introduce a new error label for this?


> +        }
> +    }
> +
> +    return true;
> +
> +err:
> +    error_reportf_err(err, "Cannot setup SVQ %u: ", i);
> +    for (unsigned j = 0; j < i; ++j) {
> +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, j);
> +        vhost_vdpa_svq_unmap_rings(dev, svq);


I wonder if it's better to move vhost_vdpa_svq_map_rings() out of
vhost_vdpa_svq_set_addr() (this function seems to be its only user).
That would make the error handling logic clearer.
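
i.e. (untested) something along these lines in vhost_vdpa_svqs_start(),
which also covers the error label nit above:

    for (i = 0; i < v->shadow_vqs->len; ++i) {
        VirtQueue *vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
        struct vhost_vring_addr addr = {
            .index = i,
        };
        int r;
        bool ok = vhost_vdpa_svq_setup(dev, svq, i, &err);
        if (unlikely(!ok)) {
            goto err;
        }

        vhost_svq_start(svq, dev->vdev, vq);
        ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, &err);
        if (unlikely(!ok)) {
            goto err_map;
        }

        /* Override vring GPA set by vhost subsystem */
        r = vhost_vdpa_set_vring_dev_addr(dev, &addr);
        if (unlikely(r != 0)) {
            error_setg_errno(&err, -r, "Cannot set device address");
            goto err_unmap;
        }
    }

    return true;

err_unmap:
    vhost_vdpa_svq_unmap_rings(dev, g_ptr_array_index(v->shadow_vqs, i));

err_map:
    vhost_svq_stop(g_ptr_array_index(v->shadow_vqs, i));

err:
    error_reportf_err(err, "Cannot setup SVQ %u: ", i);
    for (unsigned j = 0; j < i; ++j) {
        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, j);
        vhost_vdpa_svq_unmap_rings(dev, svq);
        vhost_svq_stop(svq);
    }

    return false;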

Thanks


> +        vhost_svq_stop(svq);
> +    }
> +
> +    return false;
> +}
> +
> +static bool vhost_vdpa_svqs_stop(struct vhost_dev *dev)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +
> +    if (!v->shadow_vqs) {
> +        return true;
> +    }
> +
> +    for (unsigned i = 0; i < v->shadow_vqs->len; ++i) {
> +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs,
> +                                                      i);
> +        bool ok = vhost_vdpa_svq_unmap_rings(dev, svq);
> +        if (unlikely(!ok)) {
>               return false;
>           }
>       }
> @@ -867,6 +1016,10 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>           }
>           vhost_vdpa_set_vring_ready(dev);
>       } else {
> +        ok = vhost_vdpa_svqs_stop(dev);
> +        if (unlikely(!ok)) {
> +            return -1;
> +        }
>           vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
>       }
>   



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 08/14] util: Add iova_tree_alloc
  2022-02-27 13:41 ` [PATCH v2 08/14] util: Add iova_tree_alloc Eugenio Pérez
@ 2022-02-28  6:39     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  6:39 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, virtualization, Eli Cohen, Eric Blake,
	Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 21:41, Eugenio Pérez wrote:
> This iova tree function allows looking for a hole in the allocated
> regions and returning a totally new translation for a given translated
> address.
>
> Its main usage is to allow devices to access qemu's address space,
> remapping the guest's address space into a new iova space where qemu
> can add chunks of addresses.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> ---
>   include/qemu/iova-tree.h |  18 ++++++
>   util/iova-tree.c         | 133 +++++++++++++++++++++++++++++++++++++++
>   2 files changed, 151 insertions(+)
>
> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> index 8249edd764..a623136cd8 100644
> --- a/include/qemu/iova-tree.h
> +++ b/include/qemu/iova-tree.h
> @@ -29,6 +29,7 @@
>   #define  IOVA_OK           (0)
>   #define  IOVA_ERR_INVALID  (-1) /* Invalid parameters */
>   #define  IOVA_ERR_OVERLAP  (-2) /* IOVA range overlapped */
> +#define  IOVA_ERR_NOMEM    (-3) /* Cannot allocate */
>   
>   typedef struct IOVATree IOVATree;
>   typedef struct DMAMap {
> @@ -119,6 +120,23 @@ const DMAMap *iova_tree_find_address(const IOVATree *tree, hwaddr iova);
>    */
>   void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator);
>   
> +/**
> + * iova_tree_alloc:


Should be iova_tree_alloc_map.


> + *
> + * @tree: the iova tree to allocate from
> + * @map: the new map (as translated addr & size) to allocate in the iova region
> + * @iova_begin: the minimum address of the allocation
> + * @iova_end: the maximum addressable direction of the allocation
> + *
> + * Allocates a new region of a given size, between iova_min and iova_max.
> + *
> + * Return: Same as iova_tree_insert, but cannot overlap and can return error if
> + * iova tree is out of free contiguous range. The caller gets the assigned iova
> + * in map->iova.
> + */
> +int iova_tree_alloc_map(IOVATree *tree, DMAMap *map, hwaddr iova_begin,
> +                        hwaddr iova_end);
> +
>   /**
>    * iova_tree_destroy:
>    *
> diff --git a/util/iova-tree.c b/util/iova-tree.c
> index 23ea35b7a4..302b01f1cc 100644
> --- a/util/iova-tree.c
> +++ b/util/iova-tree.c
> @@ -16,6 +16,39 @@ struct IOVATree {
>       GTree *tree;
>   };
>   
> +/* Args to pass to iova_tree_alloc foreach function. */
> +struct IOVATreeAllocArgs {
> +    /* Size of the desired allocation */
> +    size_t new_size;
> +
> +    /* The minimum address allowed in the allocation */
> +    hwaddr iova_begin;
> +
> +    /* Map at the left of the hole, can be NULL if "this" is first one */
> +    const DMAMap *prev;
> +
> +    /* Map at the right of the hole, can be NULL if "prev" is the last one */
> +    const DMAMap *this;
> +
> +    /* If found, we fill in the IOVA here */
> +    hwaddr iova_result;
> +
> +    /* Whether have we found a valid IOVA */
> +    bool iova_found;
> +};
> +
> +/**
> + * Iterate args to the next hole
> + *
> + * @args  The alloc arguments
> + * @next  The next mapping in the tree. Can be NULL to signal the last one
> + */
> +static void iova_tree_alloc_args_iterate(struct IOVATreeAllocArgs *args,
> +                                         const DMAMap *next) {
> +    args->prev = args->this;
> +    args->this = next;
> +}
> +
>   static int iova_tree_compare(gconstpointer a, gconstpointer b, gpointer data)
>   {
>       const DMAMap *m1 = a, *m2 = b;
> @@ -107,6 +140,106 @@ int iova_tree_remove(IOVATree *tree, const DMAMap *map)
>       return IOVA_OK;
>   }
>   
> +/**
> + * Try to find an unallocated IOVA range between prev and this elements.
> + *
> + * @args Arguments to allocation
> + *
> + * Cases:
> + *
> + * (1) !prev, !this: No entries allocated, always succeed
> + *
> + * (2) !prev, this: We're iterating at the 1st element.
> + *
> + * (3) prev, !this: We're iterating at the last element.
> + *
> + * (4) prev, this: this is the most common case, we'll try to find a hole
> + * between "prev" and "this" mapping.
> + *
> + * Note that this function assumes the last valid iova is HWADDR_MAX, but it
> + * searches linearly so it's easy to discard the result if it's not the case.
> + */
> +static void iova_tree_alloc_map_in_hole(struct IOVATreeAllocArgs *args)
> +{
> +    const DMAMap *prev = args->prev, *this = args->this;
> +    uint64_t hole_start, hole_last;
> +
> +    if (this && this->iova + this->size < args->iova_begin) {
> +        return;
> +    }
> +
> +    hole_start = MAX(prev ? prev->iova + prev->size + 1 : 0, args->iova_begin);
> +    hole_last = this ? this->iova : HWADDR_MAX;


Do we need to use iova_last instead of HWADDR_MAX?


> +
> +    if (hole_last - hole_start > args->new_size) {
> +        args->iova_result = hole_start;
> +        args->iova_found = true;
> +    }
> +}
> +
> +/**
> + * Foreach dma node in the tree, compare if there is a hole with its previous
> + * node (or minimum iova address allowed) and the node.
> + *
> + * @key   Node iterating
> + * @value Node iterating
> + * @pargs Struct to communicate with the outside world
> + *
> + * Return: false to keep iterating, true if needs break.
> + */
> +static gboolean iova_tree_alloc_traverse(gpointer key, gpointer value,
> +                                         gpointer pargs)
> +{
> +    struct IOVATreeAllocArgs *args = pargs;
> +    DMAMap *node = value;
> +
> +    assert(key == value);
> +
> +    iova_tree_alloc_args_iterate(args, node);
> +    iova_tree_alloc_map_in_hole(args);
> +    return args->iova_found;
> +}
> +
> +int iova_tree_alloc_map(IOVATree *tree, DMAMap *map, hwaddr iova_begin,
> +                        hwaddr iova_last)
> +{
> +    struct IOVATreeAllocArgs args = {
> +        .new_size = map->size,
> +        .iova_begin = iova_begin,
> +    };
> +
> +    assert(iova_begin < iova_last);


Should we use "<=" here? Otherwise we disallow an allocation of size 1.

And maybe we should return an error instead of asserting.
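
Something like:

    if (unlikely(iova_begin > iova_last)) {
        return IOVA_ERR_INVALID;
    }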


> +
> +    /*
> +     * Find a valid hole for the mapping
> +     *
> +     * Assuming low iova_begin, so no need to do a binary search to
> +     * locate the first node.
> +     *
> +     * TODO: Replace all this with g_tree_node_first/next/last when available
> +     * (from glib since 2.68). To do it with g_tree_foreach complicates the
> +     * code a lot.
> +     *


One more question:

The current code looks like it works, but it is still a little
complicated to review. Looking at the missing helpers above, and given
that add and remove are seldom, I wonder if we can simply do a

g_tree_foreach() during each add/del to build a sorted list, so that we
can emulate g_tree_node_first/next/last easily?
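
Something like this (untested; it assumes a new "GList *sorted" member
in IOVATree, rebuilt after every insert/remove):

static gboolean iova_tree_build_sorted_iter(gpointer key, gpointer value,
                                            gpointer data)
{
    GList **sorted = data;

    /* g_tree_foreach() visits the nodes in key order */
    *sorted = g_list_prepend(*sorted, value);
    return FALSE;
}

static void iova_tree_rebuild_sorted(IOVATree *tree)
{
    g_list_free(tree->sorted);
    tree->sorted = NULL;
    g_tree_foreach(tree->tree, iova_tree_build_sorted_iter, &tree->sorted);
    tree->sorted = g_list_reverse(tree->sorted);
}

Then iova_tree_alloc_map() could walk tree->sorted with g_list_next() as
a stand-in for g_tree_node_first/next until glib 2.68 is available.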


> +     */
> +    g_tree_foreach(tree->tree, iova_tree_alloc_traverse, &args);
> +    if (!args.iova_found) {
> +        /*
> +         * Either tree is empty or the last hole is still not checked.
> +         * g_tree_foreach does not compare (last, iova_end] range, so we check


"(last, iova_last]" ?

Thanks


> +         * it here.
> +         */
> +        iova_tree_alloc_args_iterate(&args, NULL);
> +        iova_tree_alloc_map_in_hole(&args);
> +    }
> +
> +    if (!args.iova_found || args.iova_result + map->size > iova_last) {
> +        return IOVA_ERR_NOMEM;
> +    }
> +
> +    map->iova = args.iova_result;
> +    return iova_tree_insert(tree, map);
> +}
> +
>   void iova_tree_destroy(IOVATree *tree)
>   {
>       g_tree_destroy(tree->tree);


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 09/14] vhost: Add VhostIOVATree
  2022-02-27 13:41 ` [PATCH v2 09/14] vhost: Add VhostIOVATree Eugenio Pérez
@ 2022-02-28  7:06     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  7:06 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, virtualization, Eli Cohen, Eric Blake,
	Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 21:41, Eugenio Pérez wrote:
> This tree is able to look up a translated address from an IOVA address.
>
> At first glance it is similar to util/iova-tree. However, SVQ working on
> devices with limited IOVA space needs more capabilities, like allocating
> IOVA chunks or performing reverse translations (qemu addresses to iova).
>
> The allocation capability, as in "assign a free IOVA address to this
> chunk of memory in qemu's address space", allows the shadow virtqueue to
> create a new address space that is not restricted by the guest's
> addressable one, so we can allocate the shadow vqs vrings outside of it.
>
> It duplicates the tree so it can search efficiently in both directions,
> and it will signal an overlap if the iova or the translated address is
> present in either tree.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-iova-tree.h |  27 +++++++
>   hw/virtio/vhost-iova-tree.c | 155 ++++++++++++++++++++++++++++++++++++
>   hw/virtio/meson.build       |   2 +-
>   3 files changed, 183 insertions(+), 1 deletion(-)
>   create mode 100644 hw/virtio/vhost-iova-tree.h
>   create mode 100644 hw/virtio/vhost-iova-tree.c
>
> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> new file mode 100644
> index 0000000000..6a4f24e0f9
> --- /dev/null
> +++ b/hw/virtio/vhost-iova-tree.h
> @@ -0,0 +1,27 @@
> +/*
> + * vhost software live migration iova tree
> + *
> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> +
> +#include "qemu/iova-tree.h"
> +#include "exec/memory.h"
> +
> +typedef struct VhostIOVATree VhostIOVATree;
> +
> +VhostIOVATree *vhost_iova_tree_new(uint64_t iova_first, uint64_t iova_last);
> +void vhost_iova_tree_delete(VhostIOVATree *iova_tree);
> +G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_delete);
> +
> +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> +                                        const DMAMap *map);
> +int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map);
> +
> +#endif
> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> new file mode 100644
> index 0000000000..03496ac075
> --- /dev/null
> +++ b/hw/virtio/vhost-iova-tree.c
> @@ -0,0 +1,155 @@
> +/*
> + * vhost software live migration iova tree
> + *
> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/iova-tree.h"
> +#include "vhost-iova-tree.h"
> +
> +#define iova_min_addr qemu_real_host_page_size
> +
> +/**
> + * VhostIOVATree, able to:
> + * - Translate iova address
> + * - Reverse translate iova address (from translated to iova)
> + * - Allocate IOVA regions for translated range (linear operation)
> + */
> +struct VhostIOVATree {
> +    /* First addressable iova address in the device */
> +    uint64_t iova_first;
> +
> +    /* Last addressable iova address in the device */
> +    uint64_t iova_last;
> +
> +    /* IOVA address to qemu memory maps. */
> +    IOVATree *iova_taddr_map;
> +
> +    /* QEMU virtual memory address to iova maps */
> +    GTree *taddr_iova_map;
> +};
> +
> +static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b,
> +                                      gpointer data)
> +{
> +    const DMAMap *m1 = a, *m2 = b;
> +
> +    if (m1->translated_addr > m2->translated_addr + m2->size) {
> +        return 1;
> +    }
> +
> +    if (m1->translated_addr + m1->size < m2->translated_addr) {
> +        return -1;
> +    }
> +
> +    /* Overlapped */
> +    return 0;
> +}
> +
> +/**
> + * Create a new IOVA tree
> + *
> + * Returns the new IOVA tree
> + */
> +VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> +{
> +    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
> +
> +    /* Some devices do not like 0 addresses */
> +    tree->iova_first = MAX(iova_first, iova_min_addr);
> +    tree->iova_last = iova_last;
> +
> +    tree->iova_taddr_map = iova_tree_new();
> +    tree->taddr_iova_map = g_tree_new_full(vhost_iova_tree_cmp_taddr, NULL,
> +                                           NULL, g_free);
> +    return tree;
> +}
> +
> +/**
> + * Delete an iova tree
> + */
> +void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> +{
> +    iova_tree_destroy(iova_tree->iova_taddr_map);
> +    g_tree_unref(iova_tree->taddr_iova_map);
> +    g_free(iova_tree);
> +}
> +
> +/**
> + * Find the IOVA address stored from a memory address
> + *
> + * @tree     The iova tree
> + * @map      The map with the memory address
> + *
> + * Return the stored mapping, or NULL if not found.
> + */
> +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
> +                                        const DMAMap *map)
> +{
> +    return g_tree_lookup(tree->taddr_iova_map, map);
> +}
> +
> +/**
> + * Allocate a new mapping
> + *
> + * @tree  The iova tree
> + * @map   The iova map
> + *
> + * Returns:
> + * - IOVA_OK if the map fits in the container
> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> + * - IOVA_ERR_OVERLAP if the tree already contains that map
> + * - IOVA_ERR_NOMEM if tree cannot allocate more space.
> + *
> + * It returns assignated iova in map->iova if return value is VHOST_DMA_MAP_OK.
> + */
> +int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> +{
> +    /* Some vhost devices do not like addr 0. Skip first page */
> +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size;
> +    DMAMap *new;
> +    int r;
> +
> +    if (map->translated_addr + map->size < map->translated_addr ||
> +        map->perm == IOMMU_NONE) {
> +        return IOVA_ERR_INVALID;
> +    }
> +
> +    /* Check for collisions in translated addresses */
> +    if (vhost_iova_tree_find_iova(tree, map)) {
> +        return IOVA_ERR_OVERLAP;
> +    }
> +
> +    /* Allocate a node in IOVA address */
> +    r = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> +                            tree->iova_last);
> +    if (r != IOVA_OK) {
> +        return r;
> +    }
> +
> +    /* Allocate node in qemu -> iova translations */
> +    new = g_malloc(sizeof(*new));
> +    memcpy(new, map, sizeof(*new));
> +    g_tree_insert(tree->taddr_iova_map, new, new);


Can the caller map two IOVA ranges to the same (e.g. GPA) range?

Thanks


> +    return IOVA_OK;
> +}
> +
> +/**
> + * Remove existing mappings from iova tree
> + *
> + * @param  iova_tree  The vhost iova tree
> + * @param  map        The map to remove
> + */
> +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map)
> +{
> +    const DMAMap *overlap;
> +
> +    iova_tree_remove(iova_tree->iova_taddr_map, map);
> +    while ((overlap = vhost_iova_tree_find_iova(iova_tree, map))) {
> +        g_tree_remove(iova_tree->taddr_iova_map, overlap);
> +    }
> +}
> diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> index 2dc87613bc..6047670804 100644
> --- a/hw/virtio/meson.build
> +++ b/hw/virtio/meson.build
> @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
>   
>   virtio_ss = ss.source_set()
>   virtio_ss.add(files('virtio.c'))
> -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
>   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
>   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
>   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ
  2022-02-27 13:41 ` [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ Eugenio Pérez
@ 2022-02-28  7:36     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  7:36 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, virtualization, Eli Cohen, Eric Blake,
	Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 21:41, Eugenio Pérez wrote:
> Use translations added in VhostIOVATree in SVQ.
>
> Only the usage is introduced here, not allocation and deallocation. As
> with previous patches, we use the dead code paths of shadow_vqs_enabled
> to avoid committing too many changes at once. These paths are impossible
> to reach at the moment.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |   6 +-
>   include/hw/virtio/vhost-vdpa.h     |   3 +
>   hw/virtio/vhost-shadow-virtqueue.c |  76 ++++++++++++++++-
>   hw/virtio/vhost-vdpa.c             | 128 ++++++++++++++++++++++++-----
>   4 files changed, 187 insertions(+), 26 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 04c67685fd..b2f722d101 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -13,6 +13,7 @@
>   #include "qemu/event_notifier.h"
>   #include "hw/virtio/virtio.h"
>   #include "standard-headers/linux/vhost_types.h"
> +#include "hw/virtio/vhost-iova-tree.h"
>   
>   /* Shadow virtqueue to relay notifications */
>   typedef struct VhostShadowVirtqueue {
> @@ -43,6 +44,9 @@ typedef struct VhostShadowVirtqueue {
>       /* Virtio device */
>       VirtIODevice *vdev;
>   
> +    /* IOVA mapping */
> +    VhostIOVATree *iova_tree;
> +
>       /* Map for use the guest's descriptors */
>       VirtQueueElement **ring_id_maps;
>   
> @@ -78,7 +82,7 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
>                        VirtQueue *vq);
>   void vhost_svq_stop(VhostShadowVirtqueue *svq);
>   
> -VhostShadowVirtqueue *vhost_svq_new(void);
> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree);
>   
>   void vhost_svq_free(gpointer vq);
>   G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> index 009a9f3b6b..ee8e939ad0 100644
> --- a/include/hw/virtio/vhost-vdpa.h
> +++ b/include/hw/virtio/vhost-vdpa.h
> @@ -14,6 +14,7 @@
>   
>   #include <gmodule.h>
>   
> +#include "hw/virtio/vhost-iova-tree.h"
>   #include "hw/virtio/virtio.h"
>   #include "standard-headers/linux/vhost_types.h"
>   
> @@ -30,6 +31,8 @@ typedef struct vhost_vdpa {
>       MemoryListener listener;
>       struct vhost_vdpa_iova_range iova_range;
>       bool shadow_vqs_enabled;
> +    /* IOVA mapping used by the Shadow Virtqueue */
> +    VhostIOVATree *iova_tree;
>       GPtrArray *shadow_vqs;
>       struct vhost_dev *dev;
>       VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index a38d313755..7e073773d1 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -11,6 +11,7 @@
>   #include "hw/virtio/vhost-shadow-virtqueue.h"
>   
>   #include "qemu/error-report.h"
> +#include "qemu/log.h"
>   #include "qemu/main-loop.h"
>   #include "qemu/log.h"
>   #include "linux-headers/linux/vhost.h"
> @@ -84,7 +85,58 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
>       }
>   }
>   
> +/**
> + * Translate addresses between the qemu's virtual address and the SVQ IOVA
> + *
> + * @svq    Shadow VirtQueue
> + * @vaddr  Translated IOVA addresses
> + * @iovec  Source qemu's VA addresses
> + * @num    Length of iovec and minimum length of vaddr
> + */
> +static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> +                                     void **addrs, const struct iovec *iovec,
> +                                     size_t num)
> +{
> +    if (num == 0) {
> +        return true;
> +    }
> +
> +    for (size_t i = 0; i < num; ++i) {
> +        DMAMap needle = {
> +            .translated_addr = (hwaddr)iovec[i].iov_base,
> +            .size = iovec[i].iov_len,
> +        };
> +        size_t off;
> +
> +        const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree, &needle);
> +        /*
> +         * Map cannot be NULL since iova map contains all guest space and
> +         * qemu already has a physical address mapped
> +         */
> +        if (unlikely(!map)) {
> +            qemu_log_mask(LOG_GUEST_ERROR,
> +                          "Invalid address 0x%"HWADDR_PRIx" given by guest",
> +                          needle.translated_addr);
> +            return false;
> +        }
> +
> +        off = needle.translated_addr - map->translated_addr;
> +        addrs[i] = (void *)(map->iova + off);
> +
> +        if (unlikely(int128_gt(int128_add(needle.translated_addr,
> +                                          iovec[i].iov_len),
> +                               map->translated_addr + map->size))) {
> +            qemu_log_mask(LOG_GUEST_ERROR,
> +                          "Guest buffer expands over iova range");
> +            return false;
> +        }
> +    }
> +
> +    return true;
> +}
> +
>   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> +                                    void * const *vaddr_sg,


Nit: it looks to me that we are not passing vaddr but iova here, so it might 
be better to name it "sg"?
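
For example, just the rename (untested sketch):

    static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
                                        void * const *sg,        /* SVQ IOVAs */
                                        const struct iovec *iovec,
                                        size_t num, bool more_descs,
                                        bool write)

with descs[i].addr = cpu_to_le64((hwaddr)sg[n]); in the body.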


>                                       const struct iovec *iovec,
>                                       size_t num, bool more_descs, bool write)
>   {
> @@ -103,7 +155,7 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>           } else {
>               descs[i].flags = flags;
>           }
> -        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> +        descs[i].addr = cpu_to_le64((hwaddr)vaddr_sg[n]);
>           descs[i].len = cpu_to_le32(iovec[n].iov_len);
>   
>           last = i;
> @@ -119,6 +171,8 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
>   {
>       unsigned avail_idx;
>       vring_avail_t *avail = svq->vring.avail;
> +    bool ok;
> +    g_autofree void **sgs = g_new(void *, MAX(elem->out_num, elem->in_num));
>   
>       *head = svq->free_head;
>   
> @@ -129,9 +183,20 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
>           return false;
>       }
>   
> -    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> +    ok = vhost_svq_translate_addr(svq, sgs, elem->out_sg, elem->out_num);
> +    if (unlikely(!ok)) {
> +        return false;
> +    }
> +    vhost_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
>                               elem->in_num > 0, false);
> -    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> +
> +
> +    ok = vhost_svq_translate_addr(svq, sgs, elem->in_sg, elem->in_num);
> +    if (unlikely(!ok)) {
> +        return false;
> +    }
> +
> +    vhost_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, false, true);
>   
>       /*
>        * Put the entry in the available array (but don't update avail->idx until
> @@ -514,11 +579,13 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
>    * Creates vhost shadow virtqueue, and instructs the vhost device to use the
>    * shadow methods and file descriptors.
>    *
> + * @iova_tree Tree to perform descriptors translations
> + *
>    * Returns the new virtqueue or NULL.
>    *
>    * In case of error, reason is reported through error_report.
>    */
> -VhostShadowVirtqueue *vhost_svq_new(void)
> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree)
>   {
>       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
>       int r;
> @@ -539,6 +606,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
>   
>       event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
>       event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
> +    svq->iova_tree = iova_tree;
>       return g_steal_pointer(&svq);
>   
>   err_init_hdev_call:
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 435b9c2e9e..56f9f125cd 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -209,6 +209,21 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>                                            vaddr, section->readonly);
>   
>       llsize = int128_sub(llend, int128_make64(iova));
> +    if (v->shadow_vqs_enabled) {
> +        DMAMap mem_region = {
> +            .translated_addr = (hwaddr)vaddr,
> +            .size = int128_get64(llsize) - 1,
> +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> +        };
> +
> +        int r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
> +        if (unlikely(r != IOVA_OK)) {
> +            error_report("Can't allocate a mapping (%d)", r);
> +            goto fail;
> +        }
> +
> +        iova = mem_region.iova;
> +    }
>   
>       vhost_vdpa_iotlb_batch_begin_once(v);
>       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> @@ -261,6 +276,20 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>   
>       llsize = int128_sub(llend, int128_make64(iova));
>   
> +    if (v->shadow_vqs_enabled) {
> +        const DMAMap *result;
> +        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
> +            section->offset_within_region +
> +            (iova - section->offset_within_address_space);
> +        DMAMap mem_region = {
> +            .translated_addr = (hwaddr)vaddr,
> +            .size = int128_get64(llsize) - 1,
> +        };
> +
> +        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
> +        iova = result->iova;
> +        vhost_iova_tree_remove(v->iova_tree, &mem_region);
> +    }
>       vhost_vdpa_iotlb_batch_begin_once(v);
>       ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
>       if (ret) {
> @@ -383,7 +412,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
>   
>       shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
>       for (unsigned n = 0; n < hdev->nvqs; ++n) {
> -        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
> +        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree);
>   
>           if (unlikely(!svq)) {
>               error_setg(errp, "Cannot create svq %u", n);
> @@ -834,37 +863,78 @@ static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
>   /**
>    * Unmap a SVQ area in the device
>    */
> -static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
> -                                      hwaddr size)
> +static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
> +                                      const DMAMap *needle)
>   {
> +    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, needle);
> +    hwaddr size;
>       int r;
>   
> -    size = ROUND_UP(size, qemu_real_host_page_size);
> -    r = vhost_vdpa_dma_unmap(v, iova, size);
> +    if (unlikely(!result)) {
> +        error_report("Unable to find SVQ address to unmap");
> +        return false;
> +    }
> +
> +    size = ROUND_UP(result->size, qemu_real_host_page_size);
> +    r = vhost_vdpa_dma_unmap(v, result->iova, size);
>       return r == 0;
>   }
>   
>   static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
>                                          const VhostShadowVirtqueue *svq)
>   {
> +    DMAMap needle;
>       struct vhost_vdpa *v = dev->opaque;
>       struct vhost_vring_addr svq_addr;
> -    size_t device_size = vhost_svq_device_area_size(svq);
> -    size_t driver_size = vhost_svq_driver_area_size(svq);
>       bool ok;
>   
>       vhost_svq_get_vring_addr(svq, &svq_addr);
>   
> -    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
> +    needle = (DMAMap) {
> +        .translated_addr = svq_addr.desc_user_addr,
> +    };


Let's simply initialize the member to zero at the start of this function; 
then we can use needle.translated_addr = XXX here.
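
Something like this untested sketch, only to illustrate the idea (not a final
patch):

    static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
                                           const VhostShadowVirtqueue *svq)
    {
        DMAMap needle = {};
        struct vhost_vdpa *v = dev->opaque;
        struct vhost_vring_addr svq_addr;
        bool ok;

        vhost_svq_get_vring_addr(svq, &svq_addr);

        needle.translated_addr = svq_addr.desc_user_addr;
        ok = vhost_vdpa_svq_unmap_ring(v, &needle);
        if (unlikely(!ok)) {
            return false;
        }

        needle.translated_addr = svq_addr.used_user_addr;
        return vhost_vdpa_svq_unmap_ring(v, &needle);
    }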


> +    ok = vhost_vdpa_svq_unmap_ring(v, &needle);
>       if (unlikely(!ok)) {
>           return false;
>       }
>   
> -    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
> +    needle = (DMAMap) {
> +        .translated_addr = svq_addr.used_user_addr,
> +    };
> +    return vhost_vdpa_svq_unmap_ring(v, &needle);
> +}
> +
> +/**
> + * Map the SVQ area in the device
> + *
> + * @v          Vhost-vdpa device
> + * @needle     The area to search iova
> + * @errorp     Error pointer
> + */
> +static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> +                                    Error **errp)
> +{
> +    int r;
> +
> +    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
> +    if (unlikely(r != IOVA_OK)) {
> +        error_setg(errp, "Cannot allocate iova (%d)", r);
> +        return false;
> +    }
> +
> +    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
> +                           (void *)needle->translated_addr,
> +                           !(needle->perm & IOMMU_ACCESS_FLAG(0, 1)));


Let's simply use needle->perm == IOMMU_RO here?
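
i.e. something like this (assuming perm is only ever IOMMU_RO or IOMMU_RW for
these regions):

    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
                           (void *)needle->translated_addr,
                           needle->perm == IOMMU_RO);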


> +    if (unlikely(r != 0)) {
> +        error_setg_errno(errp, -r, "Cannot map region to device");
> +        vhost_iova_tree_remove(v->iova_tree, needle);
> +    }
> +
> +    return r == 0;
>   }
>   
>   /**
> - * Map shadow virtqueue rings in device
> + * Map the shadow virtqueue rings in the device
>    *
>    * @dev   The vhost device
>    * @svq   The shadow virtqueue
> @@ -876,28 +946,44 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>                                        struct vhost_vring_addr *addr,
>                                        Error **errp)
>   {
> +    DMAMap device_region, driver_region;
> +    struct vhost_vring_addr svq_addr;
>       struct vhost_vdpa *v = dev->opaque;
>       size_t device_size = vhost_svq_device_area_size(svq);
>       size_t driver_size = vhost_svq_driver_area_size(svq);
> -    int r;
> +    size_t avail_offset;
> +    bool ok;
>   
>       ERRP_GUARD();
> -    vhost_svq_get_vring_addr(svq, addr);
> +    vhost_svq_get_vring_addr(svq, &svq_addr);
>   
> -    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
> -                           (void *)addr->desc_user_addr, true);
> -    if (unlikely(r != 0)) {
> -        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
> +    driver_region = (DMAMap) {
> +        .translated_addr = svq_addr.desc_user_addr,
> +        .size = driver_size - 1,


Any reason for the "-1" here? I see several places doing things like that; 
it's probably a hint of a wrong API somewhere.

Thanks
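
(If the convention is that DMAMap.size is an inclusive limit, i.e. the mapped
range is [iova, iova + size] -- only my guess from this pattern -- then for a
hypothetical 4 KiB region it would look like:

    .size = 4096 - 1,   /* covers iova .. iova + 0xfff */

It would be good to state that convention explicitly in the DMAMap
documentation either way.)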


> +        .perm = IOMMU_RO,
> +    };
> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> +    if (unlikely(!ok)) {
> +        error_prepend(errp, "Cannot create vq driver region: ");
>           return false;
>       }
> +    addr->desc_user_addr = driver_region.iova;
> +    avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
> +    addr->avail_user_addr = driver_region.iova + avail_offset;
>   
> -    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
> -                           (void *)addr->used_user_addr, false);
> -    if (unlikely(r != 0)) {
> -        error_setg_errno(errp, -r, "Cannot create vq device region: ");
> +    device_region = (DMAMap) {
> +        .translated_addr = svq_addr.used_user_addr,
> +        .size = device_size - 1,
> +        .perm = IOMMU_RW,
> +    };
> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> +    if (unlikely(!ok)) {
> +        error_prepend(errp, "Cannot create vq device region: ");
> +        vhost_vdpa_svq_unmap_ring(v, &driver_region);
>       }
> +    addr->used_user_addr = device_region.iova;
>   
> -    return r == 0;
> +    return ok;
>   }
>   
>   static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 11/14] vdpa: Adapt vhost_vdpa_get_vring_base to SVQ
  2022-02-27 13:41 ` [PATCH v2 11/14] vdpa: Adapt vhost_vdpa_get_vring_base " Eugenio Pérez
@ 2022-02-28  7:38     ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  7:38 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, virtualization, Eli Cohen, Eric Blake,
	Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> This is needed to achieve migration, so the destination can restore its
> index.


I suggest duplicating the comment below here.

Thanks


> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-vdpa.c | 17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 56f9f125cd..accc4024c2 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -1180,8 +1180,25 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
>   static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
>                                          struct vhost_vring_state *ring)
>   {
> +    struct vhost_vdpa *v = dev->opaque;
>       int ret;
>   
> +    if (v->shadow_vqs_enabled) {
> +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs,
> +                                                      ring->index);
> +
> +        /*
> +         * Setting base as last used idx, so destination will see as available
> +         * all the entries that the device did not use, including the in-flight
> +         * processing ones.
> +         *
> +         * TODO: This is ok for networking, but other kinds of devices might
> +         * have problems with these retransmissions.
> +         */
> +        ring->num = svq->last_used_idx;
> +        return 0;
> +    }
> +
>       ret = vhost_vdpa_call(dev, VHOST_GET_VRING_BASE, ring);
>       trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num);
>       return ret;


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/14] vDPA shadow virtqueue
  2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
@ 2022-02-28  7:41   ` Jason Wang
  2022-02-27 13:40 ` [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities Eugenio Pérez
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-02-28  7:41 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, virtualization, Eli Cohen, Eric Blake,
	Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/2/27 9:40 PM, Eugenio Pérez wrote:
> This series enable shadow virtqueue (SVQ) for vhost-vdpa devices. This
> is intended as a new method of tracking the memory the devices touch
> during a migration process: Instead of relay on vhost device's dirty
> logging capability, SVQ intercepts the VQ dataplane forwarding the
> descriptors between VM and device. This way qemu is the effective
> writer of guests memory, like in qemu's virtio device operation.
>
> When SVQ is enabled qemu offers a new virtual address space to the
> device to read and write into, and it maps new vrings and the guest
> memory in it. SVQ also intercepts kicks and calls between the device
> and the guest. Used buffers relay would cause dirty memory being
> tracked.
>
> This effectively means that vDPA device passthrough is intercepted by
> qemu. While SVQ should only be enabled at migration time, the switching
> from regular mode to SVQ mode is left for a future series.
>
> It is based on the ideas of DPDK SW assisted LM, in the series of
> DPDK's https://patchwork.dpdk.org/cover/48370/ . However, these does
> not map the shadow vq in guest's VA, but in qemu's.
>
> For qemu to use shadow virtqueues the guest virtio driver must not use
> features like event_idx, indirect descriptors, packed and in_order.
> These features are easy to implement on top of this base, but is left
> for a future series for simplicity.
>
> SVQ needs to be enabled at qemu start time with vdpa cmdline parameter:
>
> -netdev type=vhost-vdpa,vhostdev=vhost-vdpa-0,id=vhost-vdpa0,x-svq=off
>
> The first three patches enables notifications forwarding with
> assistance of qemu. It's easy to enable only this if the relevant
> cmdline part of the last patch is applied on top of these.
>
> Next four patches implement the actual buffer forwarding. However,
> address are not translated from HVA so they will need a host device with
> an iommu allowing them to access all of the HVA range.
>
> The last part of the series uses properly the host iommu, so qemu
> creates a new iova address space in the device's range and translates
> the buffers in it. Finally, it adds the cmdline parameter.
>
> Some simple performance tests with netperf were done. They used a nested
> guest with vp_vdpa, vhost-kernel at L0 host. Starting with no svq and a
> baseline average of ~9980.13Mbps:
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
> 131072  16384  16384    30.01    9910.61
> 131072  16384  16384    30.00    10030.94
> 131072  16384  16384    30.01    9998.84
>
> To enable the notifications interception reduced performance to an
> average of ~9577.73Mbit/s:
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
> 131072  16384  16384    30.00    9563.03
> 131072  16384  16384    30.01    9626.65
> 131072  16384  16384    30.01    9543.51
>
> Finally, to enable buffers forwarding reduced the throughput again to
> ~8902.92Mbit/s:
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
> 131072  16384  16384    30.01    8643.19
> 131072  16384  16384    30.01    9033.56
> 131072  16384  16384    30.01    9032.02
>
> However, many performance improvements were left out of this series for
> simplicity, so difference if performance should shrink in the future.
>
> Comments are welcome.


The series looks good overall; a few comments in the individual patches.

I think if there's no objection, we can try to make it into 7.0 (soft freeze 
is 2022-03-08).

Thanks


>
> TODO in future series:
> * Event, indirect, packed, and others features of virtio.
> * To support different set of features between the device<->SVQ and the
>    SVQ<->guest communication.
> * Support of device host notifier memory regions.
> * To sepparate buffers forwarding in its own AIO context, so we can
>    throw more threads to that task and we don't need to stop the main
>    event loop.
> * Support multiqueue virtio-net vdpa.
> * Proper documentation.
>
> Changes from v1:
> * Feature set at device->SVQ is now the same as SVQ->guest.
> * Size of SVQ is not max available device size anymore, but guest's
>    negotiated.
> * Add VHOST_FILE_UNBIND kick and call fd treatment.
> * Make SVQ a public struct
> * Come back to previous approach to iova-tree
> * Some assertions are now fail paths. Some errors are now log_guest.
> * Only mask _F_LOG feature at vdpa_set_features svq enable path.
> * Refactor some errors and messages. Add missing error unwindings.
> * Add memory barrier at _F_NO_NOTIFY set.
> * Stop checking for features flags out of transport range.
> v1 link:
> https://lore.kernel.org/virtualization/7d86c715-6d71-8a27-91f5-8d47b71e3201@redhat.com/
>
> Changes from v4 RFC:
> * Support of allocating / freeing iova ranges in IOVA tree. Extending
>    already present iova-tree for that.
> * Proper validation of guest features. Now SVQ can negotiate a
>    different set of features with the device when enabled.
> * Support of host notifiers memory regions
> * Handling of SVQ full queue in case guest's descriptors span to
>    different memory regions (qemu's VA chunks).
> * Flush pending used buffers at end of SVQ operation.
> * QMP command now looks by NetClientState name. Other devices will need
>    to implement it's way to enable vdpa.
> * Rename QMP command to set, so it looks more like a way of working
> * Better use of qemu error system
> * Make a few assertions proper error-handling paths.
> * Add more documentation
> * Less coupling of virtio / vhost, that could cause friction on changes
> * Addressed many other small comments and small fixes.
>
> Changes from v3 RFC:
>    * Move everything to vhost-vdpa backend. A big change, this allowed
>      some cleanup but more code has been added in other places.
>    * More use of glib utilities, especially to manage memory.
> v3 link:
> https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06032.html
>
> Changes from v2 RFC:
>    * Adding vhost-vdpa devices support
>    * Fixed some memory leaks pointed by different comments
> v2 link:
> https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg05600.html
>
> Changes from v1 RFC:
>    * Use QMP instead of migration to start SVQ mode.
>    * Only accepting IOMMU devices, closer behavior with target devices
>      (vDPA)
>    * Fix invalid masking/unmasking of vhost call fd.
>    * Use of proper methods for synchronization.
>    * No need to modify VirtIO device code, all of the changes are
>      contained in vhost code.
>    * Delete superfluous code.
>    * An intermediate RFC was sent with only the notifications forwarding
>      changes. It can be seen in
>      https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
> v1 link:
> https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
>
> Eugenio Pérez (20):
>        virtio: Add VIRTIO_F_QUEUE_STATE
>        virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
>        virtio: Add virtio_queue_is_host_notifier_enabled
>        vhost: Make vhost_virtqueue_{start,stop} public
>        vhost: Add x-vhost-enable-shadow-vq qmp
>        vhost: Add VhostShadowVirtqueue
>        vdpa: Register vdpa devices in a list
>        vhost: Route guest->host notification through shadow virtqueue
>        Add vhost_svq_get_svq_call_notifier
>        Add vhost_svq_set_guest_call_notifier
>        vdpa: Save call_fd in vhost-vdpa
>        vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
>        vhost: Route host->guest notification through shadow virtqueue
>        virtio: Add vhost_shadow_vq_get_vring_addr
>        vdpa: Save host and guest features
>        vhost: Add vhost_svq_valid_device_features to shadow vq
>        vhost: Shadow virtqueue buffers forwarding
>        vhost: Add VhostIOVATree
>        vhost: Use a tree to store memory mappings
>        vdpa: Add custom IOTLB translations to SVQ
>
> Eugenio Pérez (14):
>    vhost: Add VhostShadowVirtqueue
>    vhost: Add Shadow VirtQueue kick forwarding capabilities
>    vhost: Add Shadow VirtQueue call forwarding capabilities
>    vhost: Add vhost_svq_valid_features to shadow vq
>    virtio: Add vhost_shadow_vq_get_vring_addr
>    vdpa: adapt vhost_ops callbacks to svq
>    vhost: Shadow virtqueue buffers forwarding
>    util: Add iova_tree_alloc
>    vhost: Add VhostIOVATree
>    vdpa: Add custom IOTLB translations to SVQ
>    vdpa: Adapt vhost_vdpa_get_vring_base to SVQ
>    vdpa: Never set log_base addr if SVQ is enabled
>    vdpa: Expose VHOST_F_LOG_ALL on SVQ
>    vdpa: Add x-svq to NetdevVhostVDPAOptions
>
>   qapi/net.json                      |   5 +-
>   hw/virtio/vhost-iova-tree.h        |  27 ++
>   hw/virtio/vhost-shadow-virtqueue.h |  90 ++++
>   include/hw/virtio/vhost-vdpa.h     |   8 +
>   include/qemu/iova-tree.h           |  18 +
>   hw/virtio/vhost-iova-tree.c        | 155 +++++++
>   hw/virtio/vhost-shadow-virtqueue.c | 632 +++++++++++++++++++++++++++++
>   hw/virtio/vhost-vdpa.c             | 551 ++++++++++++++++++++++++-
>   net/vhost-vdpa.c                   |  48 ++-
>   util/iova-tree.c                   | 133 ++++++
>   hw/virtio/meson.build              |   2 +-
>   11 files changed, 1644 insertions(+), 25 deletions(-)
>   create mode 100644 hw/virtio/vhost-iova-tree.h
>   create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
>   create mode 100644 hw/virtio/vhost-iova-tree.c
>   create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 11/14] vdpa: Adapt vhost_vdpa_get_vring_base to SVQ
  2022-02-28  7:38     ` Jason Wang
  (?)
@ 2022-03-01  7:51     ` Eugenio Perez Martin
  -1 siblings, 0 replies; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-01  7:51 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 8:38 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2022/2/27 21:41, Eugenio Pérez wrote:
> > This is needed to achieve migration, so the destination can restore its
> > index.
>
>
> I suggest to duplicate the comment below here.
>

Sure, I'll duplicate it here in the commit message.

Thanks!

> Thanks
>
>
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-vdpa.c | 17 +++++++++++++++++
> >   1 file changed, 17 insertions(+)
> >
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 56f9f125cd..accc4024c2 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -1180,8 +1180,25 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
> >   static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
> >                                          struct vhost_vring_state *ring)
> >   {
> > +    struct vhost_vdpa *v = dev->opaque;
> >       int ret;
> >
> > +    if (v->shadow_vqs_enabled) {
> > +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs,
> > +                                                      ring->index);
> > +
> > +        /*
> > +         * Setting base as last used idx, so destination will see as available
> > +         * all the entries that the device did not use, including the in-flight
> > +         * processing ones.
> > +         *
> > +         * TODO: This is ok for networking, but other kinds of devices might
> > +         * have problems with these retransmissions.
> > +         */
> > +        ring->num = svq->last_used_idx;
> > +        return 0;
> > +    }
> > +
> >       ret = vhost_vdpa_call(dev, VHOST_GET_VRING_BASE, ring);
> >       trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num);
> >       return ret;
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ
  2022-02-28  7:36     ` Jason Wang
  (?)
@ 2022-03-01  8:50     ` Eugenio Perez Martin
  2022-03-03  7:33         ` Jason Wang
  -1 siblings, 1 reply; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-01  8:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 8:37 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2022/2/27 21:41, Eugenio Pérez wrote:
> > Use translations added in VhostIOVATree in SVQ.
> >
> > Only introduce usage here, not allocation and deallocation. As with
> > previous patches, we use the dead code paths of shadow_vqs_enabled to
> > avoid commiting too many changes at once. These are impossible to take
> > at the moment.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.h |   6 +-
> >   include/hw/virtio/vhost-vdpa.h     |   3 +
> >   hw/virtio/vhost-shadow-virtqueue.c |  76 ++++++++++++++++-
> >   hw/virtio/vhost-vdpa.c             | 128 ++++++++++++++++++++++++-----
> >   4 files changed, 187 insertions(+), 26 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > index 04c67685fd..b2f722d101 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > @@ -13,6 +13,7 @@
> >   #include "qemu/event_notifier.h"
> >   #include "hw/virtio/virtio.h"
> >   #include "standard-headers/linux/vhost_types.h"
> > +#include "hw/virtio/vhost-iova-tree.h"
> >
> >   /* Shadow virtqueue to relay notifications */
> >   typedef struct VhostShadowVirtqueue {
> > @@ -43,6 +44,9 @@ typedef struct VhostShadowVirtqueue {
> >       /* Virtio device */
> >       VirtIODevice *vdev;
> >
> > +    /* IOVA mapping */
> > +    VhostIOVATree *iova_tree;
> > +
> >       /* Map for use the guest's descriptors */
> >       VirtQueueElement **ring_id_maps;
> >
> > @@ -78,7 +82,7 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> >                        VirtQueue *vq);
> >   void vhost_svq_stop(VhostShadowVirtqueue *svq);
> >
> > -VhostShadowVirtqueue *vhost_svq_new(void);
> > +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree);
> >
> >   void vhost_svq_free(gpointer vq);
> >   G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
> > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > index 009a9f3b6b..ee8e939ad0 100644
> > --- a/include/hw/virtio/vhost-vdpa.h
> > +++ b/include/hw/virtio/vhost-vdpa.h
> > @@ -14,6 +14,7 @@
> >
> >   #include <gmodule.h>
> >
> > +#include "hw/virtio/vhost-iova-tree.h"
> >   #include "hw/virtio/virtio.h"
> >   #include "standard-headers/linux/vhost_types.h"
> >
> > @@ -30,6 +31,8 @@ typedef struct vhost_vdpa {
> >       MemoryListener listener;
> >       struct vhost_vdpa_iova_range iova_range;
> >       bool shadow_vqs_enabled;
> > +    /* IOVA mapping used by the Shadow Virtqueue */
> > +    VhostIOVATree *iova_tree;
> >       GPtrArray *shadow_vqs;
> >       struct vhost_dev *dev;
> >       VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index a38d313755..7e073773d1 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -11,6 +11,7 @@
> >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> >
> >   #include "qemu/error-report.h"
> > +#include "qemu/log.h"
> >   #include "qemu/main-loop.h"
> >   #include "qemu/log.h"
> >   #include "linux-headers/linux/vhost.h"
> > @@ -84,7 +85,58 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> >       }
> >   }
> >
> > +/**
> > + * Translate addresses between the qemu's virtual address and the SVQ IOVA
> > + *
> > + * @svq    Shadow VirtQueue
> > + * @vaddr  Translated IOVA addresses
> > + * @iovec  Source qemu's VA addresses
> > + * @num    Length of iovec and minimum length of vaddr
> > + */
> > +static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> > +                                     void **addrs, const struct iovec *iovec,
> > +                                     size_t num)
> > +{
> > +    if (num == 0) {
> > +        return true;
> > +    }
> > +
> > +    for (size_t i = 0; i < num; ++i) {
> > +        DMAMap needle = {
> > +            .translated_addr = (hwaddr)iovec[i].iov_base,
> > +            .size = iovec[i].iov_len,
> > +        };
> > +        size_t off;
> > +
> > +        const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree, &needle);
> > +        /*
> > +         * Map cannot be NULL since iova map contains all guest space and
> > +         * qemu already has a physical address mapped
> > +         */
> > +        if (unlikely(!map)) {
> > +            qemu_log_mask(LOG_GUEST_ERROR,
> > +                          "Invalid address 0x%"HWADDR_PRIx" given by guest",
> > +                          needle.translated_addr);
> > +            return false;
> > +        }
> > +
> > +        off = needle.translated_addr - map->translated_addr;
> > +        addrs[i] = (void *)(map->iova + off);
> > +
> > +        if (unlikely(int128_gt(int128_add(needle.translated_addr,
> > +                                          iovec[i].iov_len),
> > +                               map->translated_addr + map->size))) {
> > +            qemu_log_mask(LOG_GUEST_ERROR,
> > +                          "Guest buffer expands over iova range");
> > +            return false;
> > +        }
> > +    }
> > +
> > +    return true;
> > +}
> > +
> >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > +                                    void * const *vaddr_sg,
>
>
> Nit: it looks to me we are not passing vaddr but iova here, so it might
> be better to use "sg"?
>

Sure, this is a leftover from before the IOVA translations. I agree it's
better to write it as you suggest.

>
> >                                       const struct iovec *iovec,
> >                                       size_t num, bool more_descs, bool write)
> >   {
> > @@ -103,7 +155,7 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >           } else {
> >               descs[i].flags = flags;
> >           }
> > -        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> > +        descs[i].addr = cpu_to_le64((hwaddr)vaddr_sg[n]);
> >           descs[i].len = cpu_to_le32(iovec[n].iov_len);
> >
> >           last = i;
> > @@ -119,6 +171,8 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >   {
> >       unsigned avail_idx;
> >       vring_avail_t *avail = svq->vring.avail;
> > +    bool ok;
> > +    g_autofree void **sgs = g_new(void *, MAX(elem->out_num, elem->in_num));
> >
> >       *head = svq->free_head;
> >
> > @@ -129,9 +183,20 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >           return false;
> >       }
> >
> > -    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > +    ok = vhost_svq_translate_addr(svq, sgs, elem->out_sg, elem->out_num);
> > +    if (unlikely(!ok)) {
> > +        return false;
> > +    }
> > +    vhost_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
> >                               elem->in_num > 0, false);
> > -    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > +
> > +
> > +    ok = vhost_svq_translate_addr(svq, sgs, elem->in_sg, elem->in_num);
> > +    if (unlikely(!ok)) {
> > +        return false;
> > +    }
> > +
> > +    vhost_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, false, true);
> >
> >       /*
> >        * Put the entry in the available array (but don't update avail->idx until
> > @@ -514,11 +579,13 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
> >    * Creates vhost shadow virtqueue, and instructs the vhost device to use the
> >    * shadow methods and file descriptors.
> >    *
> > + * @iova_tree Tree to perform descriptors translations
> > + *
> >    * Returns the new virtqueue or NULL.
> >    *
> >    * In case of error, reason is reported through error_report.
> >    */
> > -VhostShadowVirtqueue *vhost_svq_new(void)
> > +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree)
> >   {
> >       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> >       int r;
> > @@ -539,6 +606,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
> >
> >       event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
> >       event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
> > +    svq->iova_tree = iova_tree;
> >       return g_steal_pointer(&svq);
> >
> >   err_init_hdev_call:
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 435b9c2e9e..56f9f125cd 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -209,6 +209,21 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >                                            vaddr, section->readonly);
> >
> >       llsize = int128_sub(llend, int128_make64(iova));
> > +    if (v->shadow_vqs_enabled) {
> > +        DMAMap mem_region = {
> > +            .translated_addr = (hwaddr)vaddr,
> > +            .size = int128_get64(llsize) - 1,
> > +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > +        };
> > +
> > +        int r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
> > +        if (unlikely(r != IOVA_OK)) {
> > +            error_report("Can't allocate a mapping (%d)", r);
> > +            goto fail;
> > +        }
> > +
> > +        iova = mem_region.iova;
> > +    }
> >
> >       vhost_vdpa_iotlb_batch_begin_once(v);
> >       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> > @@ -261,6 +276,20 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
> >
> >       llsize = int128_sub(llend, int128_make64(iova));
> >
> > +    if (v->shadow_vqs_enabled) {
> > +        const DMAMap *result;
> > +        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
> > +            section->offset_within_region +
> > +            (iova - section->offset_within_address_space);
> > +        DMAMap mem_region = {
> > +            .translated_addr = (hwaddr)vaddr,
> > +            .size = int128_get64(llsize) - 1,
> > +        };
> > +
> > +        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
> > +        iova = result->iova;
> > +        vhost_iova_tree_remove(v->iova_tree, &mem_region);
> > +    }
> >       vhost_vdpa_iotlb_batch_begin_once(v);
> >       ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
> >       if (ret) {
> > @@ -383,7 +412,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
> >
> >       shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
> >       for (unsigned n = 0; n < hdev->nvqs; ++n) {
> > -        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
> > +        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree);
> >
> >           if (unlikely(!svq)) {
> >               error_setg(errp, "Cannot create svq %u", n);
> > @@ -834,37 +863,78 @@ static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
> >   /**
> >    * Unmap a SVQ area in the device
> >    */
> > -static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
> > -                                      hwaddr size)
> > +static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
> > +                                      const DMAMap *needle)
> >   {
> > +    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, needle);
> > +    hwaddr size;
> >       int r;
> >
> > -    size = ROUND_UP(size, qemu_real_host_page_size);
> > -    r = vhost_vdpa_dma_unmap(v, iova, size);
> > +    if (unlikely(!result)) {
> > +        error_report("Unable to find SVQ address to unmap");
> > +        return false;
> > +    }
> > +
> > +    size = ROUND_UP(result->size, qemu_real_host_page_size);
> > +    r = vhost_vdpa_dma_unmap(v, result->iova, size);
> >       return r == 0;
> >   }
> >
> >   static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> >                                          const VhostShadowVirtqueue *svq)
> >   {
> > +    DMAMap needle;
> >       struct vhost_vdpa *v = dev->opaque;
> >       struct vhost_vring_addr svq_addr;
> > -    size_t device_size = vhost_svq_device_area_size(svq);
> > -    size_t driver_size = vhost_svq_driver_area_size(svq);
> >       bool ok;
> >
> >       vhost_svq_get_vring_addr(svq, &svq_addr);
> >
> > -    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
> > +    needle = (DMAMap) {
> > +        .translated_addr = svq_addr.desc_user_addr,
> > +    };
>
>
> Let's simply initialize the member to zero during start of this function
> then we can use needle->transalted_addr = XXX here.
>

Sure

>
> > +    ok = vhost_vdpa_svq_unmap_ring(v, &needle);
> >       if (unlikely(!ok)) {
> >           return false;
> >       }
> >
> > -    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
> > +    needle = (DMAMap) {
> > +        .translated_addr = svq_addr.used_user_addr,
> > +    };
> > +    return vhost_vdpa_svq_unmap_ring(v, &needle);
> > +}
> > +
> > +/**
> > + * Map the SVQ area in the device
> > + *
> > + * @v          Vhost-vdpa device
> > + * @needle     The area to search iova
> > + * @errorp     Error pointer
> > + */
> > +static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> > +                                    Error **errp)
> > +{
> > +    int r;
> > +
> > +    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
> > +    if (unlikely(r != IOVA_OK)) {
> > +        error_setg(errp, "Cannot allocate iova (%d)", r);
> > +        return false;
> > +    }
> > +
> > +    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
> > +                           (void *)needle->translated_addr,
> > +                           !(needle->perm & IOMMU_ACCESS_FLAG(0, 1)));
>
>
> Let's simply use needle->perm == IOMMU_RO here?
>

The motivation for this form is to be more resilient to future changes,
for example if a new flag is added.

But I'm totally ok with comparing against IOMMU_RO; I see that scenario
as unlikely at the moment.
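
To illustrate the difference, a minimal sketch (assuming QEMU's
IOMMU_ACCESS_FLAG() macro and the IOMMU_RO/IOMMU_WO values from
memory.h; not code from the patch):

    /* "read-only" if the mapping carries no write permission */
    bool ro_generic = !(needle->perm & IOMMU_ACCESS_FLAG(0, 1)); /* no write bit */
    bool ro_simple  = (needle->perm == IOMMU_RO);                /* exact match  */

Both give the same answer for the permissions SVQ maps today; the
generic form would just keep working if a new flag bit were ever added
to the enum.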

>
> > +    if (unlikely(r != 0)) {
> > +        error_setg_errno(errp, -r, "Cannot map region to device");
> > +        vhost_iova_tree_remove(v->iova_tree, needle);
> > +    }
> > +
> > +    return r == 0;
> >   }
> >
> >   /**
> > - * Map shadow virtqueue rings in device
> > + * Map the shadow virtqueue rings in the device
> >    *
> >    * @dev   The vhost device
> >    * @svq   The shadow virtqueue
> > @@ -876,28 +946,44 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >                                        struct vhost_vring_addr *addr,
> >                                        Error **errp)
> >   {
> > +    DMAMap device_region, driver_region;
> > +    struct vhost_vring_addr svq_addr;
> >       struct vhost_vdpa *v = dev->opaque;
> >       size_t device_size = vhost_svq_device_area_size(svq);
> >       size_t driver_size = vhost_svq_driver_area_size(svq);
> > -    int r;
> > +    size_t avail_offset;
> > +    bool ok;
> >
> >       ERRP_GUARD();
> > -    vhost_svq_get_vring_addr(svq, addr);
> > +    vhost_svq_get_vring_addr(svq, &svq_addr);
> >
> > -    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
> > -                           (void *)addr->desc_user_addr, true);
> > -    if (unlikely(r != 0)) {
> > -        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
> > +    driver_region = (DMAMap) {
> > +        .translated_addr = svq_addr.desc_user_addr,
> > +        .size = driver_size - 1,
>
>
> Any reason for the "-1" here? I see several places do things like that,
> it's probably hint of wrong API somehwere.
>

The "problem" is the API mismatch between _end and _last: whether the
last member is included in the size or not.

The IOVA tree needs the inclusive convention so we can allocate the last
page when the available range ends at (uint64_t)-1 [1]. But if we change
vhost_svq_{device,driver}_area_size to make it inclusive, we need to
use "+1" in calls like qemu_memalign and memset at vhost_svq_start, and
probably in more places too.

QEMU's emulated Intel iommu code solves it by using the address mask as
the size, something that does not fit 100% with vhost devices, since
they can allocate an arbitrary address of arbitrary size when using a
vIOMMU. It's not a problem for vhost-vdpa at this moment, since we make
sure we expose aligned and whole pages with the vrings, but I feel it
would only move the problem somewhere else.
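
To make the tradeoff concrete, a small sketch with made-up values (not
from the series):

    /* Inclusive convention used by DMAMap: last valid address = iova + size */
    DMAMap top = {
        .iova = UINT64_MAX - 0xfff,
        .size = 0xfff,             /* reaches UINT64_MAX, still representable */
    };

    /* Exclusive convention: end = iova + length overflows for the same range */
    uint64_t end = (UINT64_MAX - 0xfff) + 0x1000;     /* wraps around to 0 */

And in the other direction, vhost_svq_driver_area_size() returns a plain
byte length, so callers like qemu_memalign() would need a "+1" if it
were made inclusive; hence the "-1" when initializing the DMAMap
instead.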

Thanks!

[1] There are alternatives: using Int128, etc. But I think it's
better not to change that in this patch series.

> Thanks
>
>
> > +        .perm = IOMMU_RO,
> > +    };
> > +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> > +    if (unlikely(!ok)) {
> > +        error_prepend(errp, "Cannot create vq driver region: ");
> >           return false;
> >       }
> > +    addr->desc_user_addr = driver_region.iova;
> > +    avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
> > +    addr->avail_user_addr = driver_region.iova + avail_offset;
> >
> > -    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
> > -                           (void *)addr->used_user_addr, false);
> > -    if (unlikely(r != 0)) {
> > -        error_setg_errno(errp, -r, "Cannot create vq device region: ");
> > +    device_region = (DMAMap) {
> > +        .translated_addr = svq_addr.used_user_addr,
> > +        .size = device_size - 1,
> > +        .perm = IOMMU_RW,
> > +    };
> > +    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> > +    if (unlikely(!ok)) {
> > +        error_prepend(errp, "Cannot create vq device region: ");
> > +        vhost_vdpa_svq_unmap_ring(v, &driver_region);
> >       }
> > +    addr->used_user_addr = device_region.iova;
> >
> > -    return r == 0;
> > +    return ok;
> >   }
> >
> >   static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 08/14] util: Add iova_tree_alloc
  2022-02-28  6:39     ` Jason Wang
  (?)
@ 2022-03-01 10:06     ` Eugenio Perez Martin
  2022-03-03  7:16         ` Jason Wang
  -1 siblings, 1 reply; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-01 10:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 7:39 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2022/2/27 21:41, Eugenio Pérez wrote:
> > This iova tree function allows it to look for a hole in allocated
> > regions and return a totally new translation for a given translated
> > address.
> >
> > It's usage is mainly to allow devices to access qemu address space,
> > remapping guest's one into a new iova space where qemu can add chunks of
> > addresses.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > ---
> >   include/qemu/iova-tree.h |  18 ++++++
> >   util/iova-tree.c         | 133 +++++++++++++++++++++++++++++++++++++++
> >   2 files changed, 151 insertions(+)
> >
> > diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> > index 8249edd764..a623136cd8 100644
> > --- a/include/qemu/iova-tree.h
> > +++ b/include/qemu/iova-tree.h
> > @@ -29,6 +29,7 @@
> >   #define  IOVA_OK           (0)
> >   #define  IOVA_ERR_INVALID  (-1) /* Invalid parameters */
> >   #define  IOVA_ERR_OVERLAP  (-2) /* IOVA range overlapped */
> > +#define  IOVA_ERR_NOMEM    (-3) /* Cannot allocate */
> >
> >   typedef struct IOVATree IOVATree;
> >   typedef struct DMAMap {
> > @@ -119,6 +120,23 @@ const DMAMap *iova_tree_find_address(const IOVATree *tree, hwaddr iova);
> >    */
> >   void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator);
> >
> > +/**
> > + * iova_tree_alloc:
>
>
> Should be iova_tree_alloc_map.
>

That's right, I'll change it. It's also missing from the patch subject.

>
> > + *
> > + * @tree: the iova tree to allocate from
> > + * @map: the new map (as translated addr & size) to allocate in the iova region
> > + * @iova_begin: the minimum address of the allocation
> > + * @iova_end: the maximum addressable direction of the allocation
> > + *
> > + * Allocates a new region of a given size, between iova_min and iova_max.
> > + *
> > + * Return: Same as iova_tree_insert, but cannot overlap and can return error if
> > + * iova tree is out of free contiguous range. The caller gets the assigned iova
> > + * in map->iova.
> > + */
> > +int iova_tree_alloc_map(IOVATree *tree, DMAMap *map, hwaddr iova_begin,
> > +                        hwaddr iova_end);
> > +
> >   /**
> >    * iova_tree_destroy:
> >    *
> > diff --git a/util/iova-tree.c b/util/iova-tree.c
> > index 23ea35b7a4..302b01f1cc 100644
> > --- a/util/iova-tree.c
> > +++ b/util/iova-tree.c
> > @@ -16,6 +16,39 @@ struct IOVATree {
> >       GTree *tree;
> >   };
> >
> > +/* Args to pass to iova_tree_alloc foreach function. */
> > +struct IOVATreeAllocArgs {
> > +    /* Size of the desired allocation */
> > +    size_t new_size;
> > +
> > +    /* The minimum address allowed in the allocation */
> > +    hwaddr iova_begin;
> > +
> > +    /* Map at the left of the hole, can be NULL if "this" is first one */
> > +    const DMAMap *prev;
> > +
> > +    /* Map at the right of the hole, can be NULL if "prev" is the last one */
> > +    const DMAMap *this;
> > +
> > +    /* If found, we fill in the IOVA here */
> > +    hwaddr iova_result;
> > +
> > +    /* Whether have we found a valid IOVA */
> > +    bool iova_found;
> > +};
> > +
> > +/**
> > + * Iterate args to the next hole
> > + *
> > + * @args  The alloc arguments
> > + * @next  The next mapping in the tree. Can be NULL to signal the last one
> > + */
> > +static void iova_tree_alloc_args_iterate(struct IOVATreeAllocArgs *args,
> > +                                         const DMAMap *next) {
> > +    args->prev = args->this;
> > +    args->this = next;
> > +}
> > +
> >   static int iova_tree_compare(gconstpointer a, gconstpointer b, gpointer data)
> >   {
> >       const DMAMap *m1 = a, *m2 = b;
> > @@ -107,6 +140,106 @@ int iova_tree_remove(IOVATree *tree, const DMAMap *map)
> >       return IOVA_OK;
> >   }
> >
> > +/**
> > + * Try to find an unallocated IOVA range between prev and this elements.
> > + *
> > + * @args Arguments to allocation
> > + *
> > + * Cases:
> > + *
> > + * (1) !prev, !this: No entries allocated, always succeed
> > + *
> > + * (2) !prev, this: We're iterating at the 1st element.
> > + *
> > + * (3) prev, !this: We're iterating at the last element.
> > + *
> > + * (4) prev, this: this is the most common case, we'll try to find a hole
> > + * between "prev" and "this" mapping.
> > + *
> > + * Note that this function assumes the last valid iova is HWADDR_MAX, but it
> > + * searches linearly so it's easy to discard the result if it's not the case.
> > + */
> > +static void iova_tree_alloc_map_in_hole(struct IOVATreeAllocArgs *args)
> > +{
> > +    const DMAMap *prev = args->prev, *this = args->this;
> > +    uint64_t hole_start, hole_last;
> > +
> > +    if (this && this->iova + this->size < args->iova_begin) {
> > +        return;
> > +    }
> > +
> > +    hole_start = MAX(prev ? prev->iova + prev->size + 1 : 0, args->iova_begin);
> > +    hole_last = this ? this->iova : HWADDR_MAX;
>
>
> Do we need to use iova_last instead of HWADDR_MAX?
>

If I re-add iova_last to this function, this first part is the same as
in RFC v5. The only difference would be iova_found.

To simplify this function, I extracted the iova_last check into
iova_tree_alloc_map. I thought this was closer to what you proposed.
As a disadvantage, the search could go beyond iova_last, but that
should not be common.
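
As a made-up example of what that means: with iova_begin = 0x1000,
iova_last = 0xffff and a map size of 0x2000, the traversal could still
report a hole starting at 0xf000 (the last hole reaches HWADDR_MAX),
but the final check in iova_tree_alloc_map() rejects it:

    if (!args.iova_found || args.iova_result + map->size > iova_last) {
        return IOVA_ERR_NOMEM;   /* 0xf000 + 0x2000 > 0xffff, rejected here */
    }

So the only cost is that the foreach may walk a bit further than
strictly needed.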

I'm ok with both versions.

>
> > +
> > +    if (hole_last - hole_start > args->new_size) {
> > +        args->iova_result = hole_start;
> > +        args->iova_found = true;
> > +    }
> > +}
> > +
> > +/**
> > + * Foreach dma node in the tree, compare if there is a hole with its previous
> > + * node (or minimum iova address allowed) and the node.
> > + *
> > + * @key   Node iterating
> > + * @value Node iterating
> > + * @pargs Struct to communicate with the outside world
> > + *
> > + * Return: false to keep iterating, true if needs break.
> > + */
> > +static gboolean iova_tree_alloc_traverse(gpointer key, gpointer value,
> > +                                         gpointer pargs)
> > +{
> > +    struct IOVATreeAllocArgs *args = pargs;
> > +    DMAMap *node = value;
> > +
> > +    assert(key == value);
> > +
> > +    iova_tree_alloc_args_iterate(args, node);
> > +    iova_tree_alloc_map_in_hole(args);
> > +    return args->iova_found;
> > +}
> > +
> > +int iova_tree_alloc_map(IOVATree *tree, DMAMap *map, hwaddr iova_begin,
> > +                        hwaddr iova_last)
> > +{
> > +    struct IOVATreeAllocArgs args = {
> > +        .new_size = map->size,
> > +        .iova_begin = iova_begin,
> > +    };
> > +
> > +    assert(iova_begin < iova_last);
>
>
> Should we use "<=" here, otherwise we disallow allocate the size of 1.
>
> And maybe we should return error instead of assert.
>

Right, I'll replace both.

>
> > +
> > +    /*
> > +     * Find a valid hole for the mapping
> > +     *
> > +     * Assuming low iova_begin, so no need to do a binary search to
> > +     * locate the first node.
> > +     *
> > +     * TODO: Replace all this with g_tree_node_first/next/last when available
> > +     * (from glib since 2.68). To do it with g_tree_foreach complicates the
> > +     * code a lot.
> > +     *
>
>
> One more question
>
> The current code looks work but still a little bit complicated to be
> reviewed. Looking at the missing helpers above, if the add and remove
> are seldom. I wonder if we can simply do
>
> g_tree_foreach() during each add/del to build a sorted list then we can
> emulate g_tree_node_first/next/last easily?
>

This sounds a lot like the method in v1 [1] :).

But it didn't use the O(N) foreach, since we can locate the new node's
previous element by looking for the upper bound of iova-1, keeping the
insertion complexity at O(log(N)). The function g_tree_upper_bound was
added in GLib 2.68, so the proposed version will be deleted sooner or
later.

Deletion also stays O(log(N)), since deleting a node from the QLIST is O(1).
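
Just to illustrate the idea, a sketch assuming GLib >= 2.68 (where
g_tree_upper_bound() and the GTreeNode helpers exist), with new_map
being the mapping about to be inserted; this is not what this series
proposes:

    /* Locate the element preceding new_map in O(log(N)), no full walk */
    GTreeNode *next = g_tree_upper_bound(tree->tree, new_map); /* first node > new_map */
    GTreeNode *prev = next ? g_tree_node_previous(next)
                           : g_tree_node_last(tree->tree);     /* new_map would be last */
    const DMAMap *prev_map = prev ? g_tree_node_value(prev) : NULL;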

>
> > +     */
> > +    g_tree_foreach(tree->tree, iova_tree_alloc_traverse, &args);
> > +    if (!args.iova_found) {
> > +        /*
> > +         * Either tree is empty or the last hole is still not checked.
> > +         * g_tree_foreach does not compare (last, iova_end] range, so we check
>
>
> "(last, iova_last]" ?
>

Right, I'll change it too.

Thanks!

[1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg863699.html
[2] https://docs.gtk.org/glib/method.Tree.upper_bound.html

> Thanks
>
>
> > +         * it here.
> > +         */
> > +        iova_tree_alloc_args_iterate(&args, NULL);
> > +        iova_tree_alloc_map_in_hole(&args);
> > +    }
> > +
> > +    if (!args.iova_found || args.iova_result + map->size > iova_last) {
> > +        return IOVA_ERR_NOMEM;
> > +    }
> > +
> > +    map->iova = args.iova_result;
> > +    return iova_tree_insert(tree, map);
> > +}
> > +
> >   void iova_tree_destroy(IOVATree *tree)
> >   {
> >       g_tree_destroy(tree->tree);
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 03/14] vhost: Add Shadow VirtQueue call forwarding capabilities
  2022-02-28  3:18     ` Jason Wang
  (?)
@ 2022-03-01 11:18     ` Eugenio Perez Martin
  -1 siblings, 0 replies; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-01 11:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 4:18 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2022/2/27 21:41, Eugenio Pérez wrote:
> > This will make qemu aware of the device used buffers, allowing it to
> > write the guest memory with its contents if needed.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.h |  4 ++++
> >   hw/virtio/vhost-shadow-virtqueue.c | 34 ++++++++++++++++++++++++++++++
> >   hw/virtio/vhost-vdpa.c             | 31 +++++++++++++++++++++++++--
> >   3 files changed, 67 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > index 1cbc87d5d8..1d4c160d0a 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > @@ -28,9 +28,13 @@ typedef struct VhostShadowVirtqueue {
> >        * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> >        */
> >       EventNotifier svq_kick;
> > +
> > +    /* Guest's call notifier, where the SVQ calls guest. */
> > +    EventNotifier svq_call;
> >   } VhostShadowVirtqueue;
> >
> >   void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
> > +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
> >
> >   void vhost_svq_stop(VhostShadowVirtqueue *svq);
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index a5d0659f86..54c701a196 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -23,6 +23,38 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >       event_notifier_set(&svq->hdev_kick);
> >   }
> >
> > +/* Forward vhost notifications */
> > +static void vhost_svq_handle_call(EventNotifier *n)
> > +{
> > +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > +                                             hdev_call);
> > +    event_notifier_test_and_clear(n);
> > +    event_notifier_set(&svq->svq_call);
> > +}
> > +
> > +/**
> > + * Set the call notifier for the SVQ to call the guest
> > + *
> > + * @svq Shadow virtqueue
> > + * @call_fd call notifier
> > + *
> > + * Called on BQL context.
> > + */
> > +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
>
>
> I think we need to have consistent naming for both kick and call. Note
> that in patch 2 we had
>
> vhost_svq_set_svq_kick_fd
>
> Maybe it's better to use vhost_svq_set_guest_call_fd() here.
>

I think the same; I will replace it in the next version.

>
> > +{
> > +    if (call_fd == VHOST_FILE_UNBIND) {
> > +        /*
> > +         * Fail event_notifier_set if called handling device call.
> > +         *
> > +         * SVQ still needs device notifications, since it needs to keep
> > +         * forwarding used buffers even with the unbind.
> > +         */
> > +        memset(&svq->svq_call, 0, sizeof(svq->svq_call));
>
>
> I may miss something but shouldn't we stop polling svq_call here like
>
> event_notifier_set_handle(&svq->svq_call, false);
>

SVQ never polls that file descriptor: it only uses it to call (notify)
the guest at vhost_svq_flush, when SVQ returns used descriptors.

svq_kick, svq_call: file descriptors between the guest and SVQ.
hdev_kick, hdev_call: file descriptors between qemu/SVQ and the device.

I admit it is confusing when reading the code, but I cannot come up
with better naming. Maybe it helps to add a diagram at the top of
the file, like:

+-------+-> svq_kick_fd ->+-----+-> hdev_kick ->+-----+
| Guest |                 | SVQ |               | Dev |
+-------+<- svq_call_fd <-+-----+<- hdev_call <-+-----+

Thanks!

> ?
>
> Thanks
>
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/14] vDPA shadow virtqueue
  2022-02-28  2:32   ` Jason Wang
  (?)
@ 2022-03-01 11:36   ` Eugenio Perez Martin
  -1 siblings, 0 replies; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-01 11:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-devel, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 3:32 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Sun, Feb 27, 2022 at 9:42 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > This series enable shadow virtqueue (SVQ) for vhost-vdpa devices. This
> > is intended as a new method of tracking the memory the devices touch
> > during a migration process: Instead of relay on vhost device's dirty
> > logging capability, SVQ intercepts the VQ dataplane forwarding the
> > descriptors between VM and device. This way qemu is the effective
> > writer of guests memory, like in qemu's virtio device operation.
> >
> > When SVQ is enabled qemu offers a new virtual address space to the
> > device to read and write into, and it maps new vrings and the guest
> > memory in it. SVQ also intercepts kicks and calls between the device
> > and the guest. Used buffers relay would cause dirty memory being
> > tracked.
> >
> > This effectively means that vDPA device passthrough is intercepted by
> > qemu. While SVQ should only be enabled at migration time, the switching
> > from regular mode to SVQ mode is left for a future series.
> >
> > It is based on the ideas of DPDK SW assisted LM, in the series of
> > DPDK's https://patchwork.dpdk.org/cover/48370/ . However, these does
> > not map the shadow vq in guest's VA, but in qemu's.
> >
> > For qemu to use shadow virtqueues the guest virtio driver must not use
> > features like event_idx, indirect descriptors, packed and in_order.
> > These features are easy to implement on top of this base, but is left
> > for a future series for simplicity.
> >
> > SVQ needs to be enabled at qemu start time with vdpa cmdline parameter:
> >
> > -netdev type=vhost-vdpa,vhostdev=vhost-vdpa-0,id=vhost-vdpa0,x-svq=off
> >
> > The first three patches enables notifications forwarding with
> > assistance of qemu. It's easy to enable only this if the relevant
> > cmdline part of the last patch is applied on top of these.
> >
> > Next four patches implement the actual buffer forwarding. However,
> > address are not translated from HVA so they will need a host device with
> > an iommu allowing them to access all of the HVA range.
> >
> > The last part of the series uses properly the host iommu, so qemu
> > creates a new iova address space in the device's range and translates
> > the buffers in it. Finally, it adds the cmdline parameter.
> >
> > Some simple performance tests with netperf were done. They used a nested
> > guest with vp_vdpa, vhost-kernel at L0 host. Starting with no svq and a
> > baseline average of ~9980.13Mbps:
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> > 131072  16384  16384    30.01    9910.61
> > 131072  16384  16384    30.00    10030.94
> > 131072  16384  16384    30.01    9998.84
> >
> > To enable the notifications interception reduced performance to an
> > average of ~9577.73Mbit/s:
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> > 131072  16384  16384    30.00    9563.03
> > 131072  16384  16384    30.01    9626.65
> > 131072  16384  16384    30.01    9543.51
> >
> > Finally, to enable buffers forwarding reduced the throughput again to
> > ~8902.92Mbit/s:
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> > 131072  16384  16384    30.01    8643.19
> > 131072  16384  16384    30.01    9033.56
> > 131072  16384  16384    30.01    9032.02
> >
> > However, many performance improvements were left out of this series for
> > simplicity, so difference if performance should shrink in the future.
>
> I think the performance should be acceptable as a start.
>
> >
> > Comments are welcome.
> >
> > TODO in future series:
> > * Event, indirect, packed, and others features of virtio.
> > * To support different set of features between the device<->SVQ and the
> >   SVQ<->guest communication.
> > * Support of device host notifier memory regions.
> > * To sepparate buffers forwarding in its own AIO context, so we can
> >   throw more threads to that task and we don't need to stop the main
> >   event loop.
> > * Support multiqueue virtio-net vdpa.
> > * Proper documentation.
> >
> > Changes from v1:
> > * Feature set at device->SVQ is now the same as SVQ->guest.
> > * Size of SVQ is not max available device size anymore, but guest's
> >   negotiated.
> > * Add VHOST_FILE_UNBIND kick and call fd treatment.
> > * Make SVQ a public struct
> > * Come back to previous approach to iova-tree
> > * Some assertions are now fail paths. Some errors are now log_guest.
> > * Only mask _F_LOG feature at vdpa_set_features svq enable path.
> > * Refactor some errors and messages. Add missing error unwindings.
> > * Add memory barrier at _F_NO_NOTIFY set.
> > * Stop checking for features flags out of transport range.
> > v1 link:
> > https://lore.kernel.org/virtualization/7d86c715-6d71-8a27-91f5-8d47b71e3201@redhat.com/
> >
> > Changes from v4 RFC:
> > * Support of allocating / freeing iova ranges in IOVA tree. Extending
> >   already present iova-tree for that.
> > * Proper validation of guest features. Now SVQ can negotiate a
> >   different set of features with the device when enabled.
> > * Support of host notifiers memory regions
> > * Handling of SVQ full queue in case guest's descriptors span to
> >   different memory regions (qemu's VA chunks).
> > * Flush pending used buffers at end of SVQ operation.
> > * QMP command now looks by NetClientState name. Other devices will need
> >   to implement it's way to enable vdpa.
> > * Rename QMP command to set, so it looks more like a way of working
> > * Better use of qemu error system
> > * Make a few assertions proper error-handling paths.
> > * Add more documentation
> > * Less coupling of virtio / vhost, that could cause friction on changes
> > * Addressed many other small comments and small fixes.
> >
> > Changes from v3 RFC:
> >   * Move everything to vhost-vdpa backend. A big change, this allowed
> >     some cleanup but more code has been added in other places.
> >   * More use of glib utilities, especially to manage memory.
> > v3 link:
> > https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06032.html
> >
> > Changes from v2 RFC:
> >   * Adding vhost-vdpa devices support
> >   * Fixed some memory leaks pointed by different comments
> > v2 link:
> > https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg05600.html
> >
> > Changes from v1 RFC:
> >   * Use QMP instead of migration to start SVQ mode.
> >   * Only accepting IOMMU devices, closer behavior with target devices
> >     (vDPA)
> >   * Fix invalid masking/unmasking of vhost call fd.
> >   * Use of proper methods for synchronization.
> >   * No need to modify VirtIO device code, all of the changes are
> >     contained in vhost code.
> >   * Delete superfluous code.
> >   * An intermediate RFC was sent with only the notifications forwarding
> >     changes. It can be seen in
> >     https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
> > v1 link:
> > https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
> >
> > Eugenio Pérez (20):
> >       virtio: Add VIRTIO_F_QUEUE_STATE
> >       virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
> >       virtio: Add virtio_queue_is_host_notifier_enabled
> >       vhost: Make vhost_virtqueue_{start,stop} public
> >       vhost: Add x-vhost-enable-shadow-vq qmp
> >       vhost: Add VhostShadowVirtqueue
> >       vdpa: Register vdpa devices in a list
> >       vhost: Route guest->host notification through shadow virtqueue
> >       Add vhost_svq_get_svq_call_notifier
> >       Add vhost_svq_set_guest_call_notifier
> >       vdpa: Save call_fd in vhost-vdpa
> >       vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
> >       vhost: Route host->guest notification through shadow virtqueue
> >       virtio: Add vhost_shadow_vq_get_vring_addr
> >       vdpa: Save host and guest features
> >       vhost: Add vhost_svq_valid_device_features to shadow vq
> >       vhost: Shadow virtqueue buffers forwarding
> >       vhost: Add VhostIOVATree
> >       vhost: Use a tree to store memory mappings
> >       vdpa: Add custom IOTLB translations to SVQ
>
> This list seems wrong btw :)
>

Yes, I uncommented that part by mistake in git-publish and I'm still
paying the consequences.

I'm deleting it right now so I don't forget again.

Thanks!

> Thanks
>
> >
> > Eugenio Pérez (14):
> >   vhost: Add VhostShadowVirtqueue
> >   vhost: Add Shadow VirtQueue kick forwarding capabilities
> >   vhost: Add Shadow VirtQueue call forwarding capabilities
> >   vhost: Add vhost_svq_valid_features to shadow vq
> >   virtio: Add vhost_shadow_vq_get_vring_addr
> >   vdpa: adapt vhost_ops callbacks to svq
> >   vhost: Shadow virtqueue buffers forwarding
> >   util: Add iova_tree_alloc
> >   vhost: Add VhostIOVATree
> >   vdpa: Add custom IOTLB translations to SVQ
> >   vdpa: Adapt vhost_vdpa_get_vring_base to SVQ
> >   vdpa: Never set log_base addr if SVQ is enabled
> >   vdpa: Expose VHOST_F_LOG_ALL on SVQ
> >   vdpa: Add x-svq to NetdevVhostVDPAOptions
> >
> >  qapi/net.json                      |   5 +-
> >  hw/virtio/vhost-iova-tree.h        |  27 ++
> >  hw/virtio/vhost-shadow-virtqueue.h |  90 ++++
> >  include/hw/virtio/vhost-vdpa.h     |   8 +
> >  include/qemu/iova-tree.h           |  18 +
> >  hw/virtio/vhost-iova-tree.c        | 155 +++++++
> >  hw/virtio/vhost-shadow-virtqueue.c | 632 +++++++++++++++++++++++++++++
> >  hw/virtio/vhost-vdpa.c             | 551 ++++++++++++++++++++++++-
> >  net/vhost-vdpa.c                   |  48 ++-
> >  util/iova-tree.c                   | 133 ++++++
> >  hw/virtio/meson.build              |   2 +-
> >  11 files changed, 1644 insertions(+), 25 deletions(-)
> >  create mode 100644 hw/virtio/vhost-iova-tree.h
> >  create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
> >  create mode 100644 hw/virtio/vhost-iova-tree.c
> >  create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
> >
> > --
> > 2.27.0
> >
> >
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities
  2022-02-28  2:57     ` Jason Wang
  (?)
@ 2022-03-01 18:49     ` Eugenio Perez Martin
  2022-03-03  7:12         ` Jason Wang
  -1 siblings, 1 reply; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-01 18:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 3:57 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2022/2/27 21:40, Eugenio Pérez wrote:
> > At this mode no buffer forwarding will be performed in SVQ mode: Qemu
> > will just forward the guest's kicks to the device.
> >
> > Host memory notifiers regions are left out for simplicity, and they will
> > not be addressed in this series.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.h |  14 +++
> >   include/hw/virtio/vhost-vdpa.h     |   4 +
> >   hw/virtio/vhost-shadow-virtqueue.c |  52 +++++++++++
> >   hw/virtio/vhost-vdpa.c             | 145 ++++++++++++++++++++++++++++-
> >   4 files changed, 213 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > index f1519e3c7b..1cbc87d5d8 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > @@ -18,8 +18,22 @@ typedef struct VhostShadowVirtqueue {
> >       EventNotifier hdev_kick;
> >       /* Shadow call notifier, sent to vhost */
> >       EventNotifier hdev_call;
> > +
> > +    /*
> > +     * Borrowed virtqueue's guest to host notifier. To borrow it in this event
> > +     * notifier allows to recover the VhostShadowVirtqueue from the event loop
> > +     * easily. If we use the VirtQueue's one, we don't have an easy way to
> > +     * retrieve VhostShadowVirtqueue.
> > +     *
> > +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> > +     */
> > +    EventNotifier svq_kick;
> >   } VhostShadowVirtqueue;
> >
> > +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
> > +
> > +void vhost_svq_stop(VhostShadowVirtqueue *svq);
> > +
> >   VhostShadowVirtqueue *vhost_svq_new(void);
> >
> >   void vhost_svq_free(gpointer vq);
> > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > index 3ce79a646d..009a9f3b6b 100644
> > --- a/include/hw/virtio/vhost-vdpa.h
> > +++ b/include/hw/virtio/vhost-vdpa.h
> > @@ -12,6 +12,8 @@
> >   #ifndef HW_VIRTIO_VHOST_VDPA_H
> >   #define HW_VIRTIO_VHOST_VDPA_H
> >
> > +#include <gmodule.h>
> > +
> >   #include "hw/virtio/virtio.h"
> >   #include "standard-headers/linux/vhost_types.h"
> >
> > @@ -27,6 +29,8 @@ typedef struct vhost_vdpa {
> >       bool iotlb_batch_begin_sent;
> >       MemoryListener listener;
> >       struct vhost_vdpa_iova_range iova_range;
> > +    bool shadow_vqs_enabled;
> > +    GPtrArray *shadow_vqs;
> >       struct vhost_dev *dev;
> >       VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> >   } VhostVDPA;
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 019cf1950f..a5d0659f86 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -11,6 +11,56 @@
> >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> >
> >   #include "qemu/error-report.h"
> > +#include "qemu/main-loop.h"
> > +#include "linux-headers/linux/vhost.h"
> > +
> > +/** Forward guest notifications */
> > +static void vhost_handle_guest_kick(EventNotifier *n)
> > +{
> > +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > +                                             svq_kick);
> > +    event_notifier_test_and_clear(n);
> > +    event_notifier_set(&svq->hdev_kick);
> > +}
> > +
> > +/**
> > + * Set a new file descriptor for the guest to kick the SVQ and notify for avail
> > + *
> > + * @svq          The svq
> > + * @svq_kick_fd  The svq kick fd
> > + *
> > + * Note that the SVQ will never close the old file descriptor.
> > + */
> > +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
> > +{
> > +    EventNotifier *svq_kick = &svq->svq_kick;
> > +    bool poll_stop = VHOST_FILE_UNBIND != event_notifier_get_fd(svq_kick);
>
>
> I wonder if this is robust. E.g is there any chance that may end up with
> both poll_stop and poll_start are false?
>

I cannot make that happen in qemu, but the function handles that case
well: it will do nothing. It's more or less the same code as used in
the vhost kernel, and to me it is the expected behaviour if you send
two VHOST_FILE_UNBIND requests one right after the other.
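
For reference, the four combinations the function can see (just
spelling out the logic that is already in the patch, with UNBIND
standing for VHOST_FILE_UNBIND):

    /*
     * old fd      new fd      poll_stop  poll_start  action
     * != UNBIND   != UNBIND   true       true        stop old, poll new fd
     * != UNBIND   == UNBIND   true       false       stop polling old fd
     * == UNBIND   != UNBIND   false      true        start polling new fd
     * == UNBIND   == UNBIND   false      false       nothing to do
     */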

> If not, can we simple detect poll_stop as below and treat !poll_start
> and poll_stop?
>

I'm not sure what that adds. Is there an unexpected consequence of the
current do-nothing behavior that I've missed?

Thanks!

> Other looks good.
>
> Thanks
>
>
> > +    bool poll_start = svq_kick_fd != VHOST_FILE_UNBIND;
> > +
> > +    if (poll_stop) {
> > +        event_notifier_set_handler(svq_kick, NULL);
> > +    }
> > +
> > +    /*
> > +     * event_notifier_set_handler already checks for guest's notifications if
> > +     * they arrive at the new file descriptor in the switch, so there is no
> > +     * need to explicitly check for them.
> > +     */
> > +    if (poll_start) {
> > +        event_notifier_init_fd(svq_kick, svq_kick_fd);
> > +        event_notifier_set(svq_kick);
> > +        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick);
> > +    }
> > +}
> > +
> > +/**
> > + * Stop the shadow virtqueue operation.
> > + * @svq Shadow Virtqueue
> > + */
> > +void vhost_svq_stop(VhostShadowVirtqueue *svq)
> > +{
> > +    event_notifier_set_handler(&svq->svq_kick, NULL);
> > +}
> >
> >   /**
> >    * Creates vhost shadow virtqueue, and instructs the vhost device to use the
> > @@ -39,6 +89,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
> >           goto err_init_hdev_call;
> >       }
> >
> > +    event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
> >       return g_steal_pointer(&svq);
> >
> >   err_init_hdev_call:
> > @@ -56,6 +107,7 @@ err_init_hdev_kick:
> >   void vhost_svq_free(gpointer pvq)
> >   {
> >       VhostShadowVirtqueue *vq = pvq;
> > +    vhost_svq_stop(vq);
> >       event_notifier_cleanup(&vq->hdev_kick);
> >       event_notifier_cleanup(&vq->hdev_call);
> >       g_free(vq);
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 04ea43704f..454bf50735 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -17,12 +17,14 @@
> >   #include "hw/virtio/vhost.h"
> >   #include "hw/virtio/vhost-backend.h"
> >   #include "hw/virtio/virtio-net.h"
> > +#include "hw/virtio/vhost-shadow-virtqueue.h"
> >   #include "hw/virtio/vhost-vdpa.h"
> >   #include "exec/address-spaces.h"
> >   #include "qemu/main-loop.h"
> >   #include "cpu.h"
> >   #include "trace.h"
> >   #include "qemu-common.h"
> > +#include "qapi/error.h"
> >
> >   /*
> >    * Return one past the end of the end of section. Be careful with uint64_t
> > @@ -342,6 +344,30 @@ static bool vhost_vdpa_one_time_request(struct vhost_dev *dev)
> >       return v->index != 0;
> >   }
> >
> > +static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
> > +                               Error **errp)
> > +{
> > +    g_autoptr(GPtrArray) shadow_vqs = NULL;
> > +
> > +    if (!v->shadow_vqs_enabled) {
> > +        return 0;
> > +    }
> > +
> > +    shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
> > +    for (unsigned n = 0; n < hdev->nvqs; ++n) {
> > +        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
> > +
> > +        if (unlikely(!svq)) {
> > +            error_setg(errp, "Cannot create svq %u", n);
> > +            return -1;
> > +        }
> > +        g_ptr_array_add(shadow_vqs, g_steal_pointer(&svq));
> > +    }
> > +
> > +    v->shadow_vqs = g_steal_pointer(&shadow_vqs);
> > +    return 0;
> > +}
> > +
> >   static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> >   {
> >       struct vhost_vdpa *v;
> > @@ -364,6 +390,10 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> >       dev->opaque =  opaque ;
> >       v->listener = vhost_vdpa_memory_listener;
> >       v->msg_type = VHOST_IOTLB_MSG_V2;
> > +    ret = vhost_vdpa_init_svq(dev, v, errp);
> > +    if (ret) {
> > +        goto err;
> > +    }
> >
> >       vhost_vdpa_get_iova_range(v);
> >
> > @@ -375,6 +405,10 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> >                                  VIRTIO_CONFIG_S_DRIVER);
> >
> >       return 0;
> > +
> > +err:
> > +    ram_block_discard_disable(false);
> > +    return ret;
> >   }
> >
> >   static void vhost_vdpa_host_notifier_uninit(struct vhost_dev *dev,
> > @@ -444,8 +478,14 @@ err:
> >
> >   static void vhost_vdpa_host_notifiers_init(struct vhost_dev *dev)
> >   {
> > +    struct vhost_vdpa *v = dev->opaque;
> >       int i;
> >
> > +    if (v->shadow_vqs_enabled) {
> > +        /* FIXME SVQ is not compatible with host notifiers mr */
> > +        return;
> > +    }
> > +
> >       for (i = dev->vq_index; i < dev->vq_index + dev->nvqs; i++) {
> >           if (vhost_vdpa_host_notifier_init(dev, i)) {
> >               goto err;
> > @@ -459,6 +499,21 @@ err:
> >       return;
> >   }
> >
> > +static void vhost_vdpa_svq_cleanup(struct vhost_dev *dev)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    size_t idx;
> > +
> > +    if (!v->shadow_vqs) {
> > +        return;
> > +    }
> > +
> > +    for (idx = 0; idx < v->shadow_vqs->len; ++idx) {
> > +        vhost_svq_stop(g_ptr_array_index(v->shadow_vqs, idx));
> > +    }
> > +    g_ptr_array_free(v->shadow_vqs, true);
> > +}
> > +
> >   static int vhost_vdpa_cleanup(struct vhost_dev *dev)
> >   {
> >       struct vhost_vdpa *v;
> > @@ -467,6 +522,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
> >       trace_vhost_vdpa_cleanup(dev, v);
> >       vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
> >       memory_listener_unregister(&v->listener);
> > +    vhost_vdpa_svq_cleanup(dev);
> >
> >       dev->opaque = NULL;
> >       ram_block_discard_disable(false);
> > @@ -558,11 +614,26 @@ static int vhost_vdpa_get_device_id(struct vhost_dev *dev,
> >       return ret;
> >   }
> >
> > +static void vhost_vdpa_reset_svq(struct vhost_vdpa *v)
> > +{
> > +    if (!v->shadow_vqs_enabled) {
> > +        return;
> > +    }
> > +
> > +    for (unsigned i = 0; i < v->shadow_vqs->len; ++i) {
> > +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
> > +        vhost_svq_stop(svq);
> > +    }
> > +}
> > +
> >   static int vhost_vdpa_reset_device(struct vhost_dev *dev)
> >   {
> > +    struct vhost_vdpa *v = dev->opaque;
> >       int ret;
> >       uint8_t status = 0;
> >
> > +    vhost_vdpa_reset_svq(v);
> > +
> >       ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
> >       trace_vhost_vdpa_reset_device(dev, status);
> >       return ret;
> > @@ -646,13 +717,75 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, uint8_t *config,
> >       return ret;
> >    }
> >
> > +static int vhost_vdpa_set_vring_dev_kick(struct vhost_dev *dev,
> > +                                         struct vhost_vring_file *file)
> > +{
> > +    trace_vhost_vdpa_set_vring_kick(dev, file->index, file->fd);
> > +    return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
> > +}
> > +
> > +/**
> > + * Set the shadow virtqueue descriptors to the device
> > + *
> > + * @dev   The vhost device model
> > + * @svq   The shadow virtqueue
> > + * @idx   The index of the virtqueue in the vhost device
> > + * @errp  Error
> > + */
> > +static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> > +                                 VhostShadowVirtqueue *svq,
> > +                                 unsigned idx,
> > +                                 Error **errp)
> > +{
> > +    struct vhost_vring_file file = {
> > +        .index = dev->vq_index + idx,
> > +    };
> > +    const EventNotifier *event_notifier = &svq->hdev_kick;
> > +    int r;
> > +
> > +    file.fd = event_notifier_get_fd(event_notifier);
> > +    r = vhost_vdpa_set_vring_dev_kick(dev, &file);
> > +    if (unlikely(r != 0)) {
> > +        error_setg_errno(errp, -r, "Can't set device kick fd");
> > +    }
> > +
> > +    return r == 0;
> > +}
> > +
> > +static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    Error *err = NULL;
> > +    unsigned i;
> > +
> > +    if (!v->shadow_vqs) {
> > +        return true;
> > +    }
> > +
> > +    for (i = 0; i < v->shadow_vqs->len; ++i) {
> > +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
> > +        bool ok = vhost_vdpa_svq_setup(dev, svq, i, &err);
> > +        if (unlikely(!ok)) {
> > +            error_reportf_err(err, "Cannot setup SVQ %u: ", i);
> > +            return false;
> > +        }
> > +    }
> > +
> > +    return true;
> > +}
> > +
> >   static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >   {
> >       struct vhost_vdpa *v = dev->opaque;
> > +    bool ok;
> >       trace_vhost_vdpa_dev_start(dev, started);
> >
> >       if (started) {
> >           vhost_vdpa_host_notifiers_init(dev);
> > +        ok = vhost_vdpa_svqs_start(dev);
> > +        if (unlikely(!ok)) {
> > +            return -1;
> > +        }
> >           vhost_vdpa_set_vring_ready(dev);
> >       } else {
> >           vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
> > @@ -724,8 +857,16 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
> >   static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
> >                                          struct vhost_vring_file *file)
> >   {
> > -    trace_vhost_vdpa_set_vring_kick(dev, file->index, file->fd);
> > -    return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    int vdpa_idx = file->index - dev->vq_index;
> > +
> > +    if (v->shadow_vqs_enabled) {
> > +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, vdpa_idx);
> > +        vhost_svq_set_svq_kick_fd(svq, file->fd);
> > +        return 0;
> > +    } else {
> > +        return vhost_vdpa_set_vring_dev_kick(dev, file);
> > +    }
> >   }
> >
> >   static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
>




* Re: [PATCH v2 04/14] vhost: Add vhost_svq_valid_features to shadow vq
  2022-02-28  3:25     ` Jason Wang
  (?)
@ 2022-03-01 19:18     ` Eugenio Perez Martin
  -1 siblings, 0 replies; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-01 19:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 4:25 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> > On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> > This allows SVQ to negotiate features with the guest and the device. For
> > the device, SVQ is a driver. While this function bypasses all
> > non-transport features, it needs to disable the features that SVQ does
> > not support when forwarding buffers. This includes packed vq layout,
> > indirect descriptors or event idx.
> >
> > Future changes can add support to offer more features to the guest,
> > since the use of VirtQueue gives this for free. This is left out at the
> > moment for simplicity.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.h |  2 ++
> >   hw/virtio/vhost-shadow-virtqueue.c | 39 ++++++++++++++++++++++++++++++
> >   hw/virtio/vhost-vdpa.c             | 18 ++++++++++++++
> >   3 files changed, 59 insertions(+)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > index 1d4c160d0a..84747655ad 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > @@ -33,6 +33,8 @@ typedef struct VhostShadowVirtqueue {
> >       EventNotifier svq_call;
> >   } VhostShadowVirtqueue;
> >
> > +bool vhost_svq_valid_features(uint64_t *features);
> > +
> >   void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
> >   void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 54c701a196..34354aea2c 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -14,6 +14,45 @@
> >   #include "qemu/main-loop.h"
> >   #include "linux-headers/linux/vhost.h"
> >
> > +/**
> > + * Validate the transport device features that both guests can use with the SVQ
> > + * and SVQs can use with the device.
> > + *
> > + * @dev_features  The features. If success, the acknowledged features. If
> > + *                failure, the minimal set from it.
> > + *
> > + * Returns true if SVQ can go with a subset of these, false otherwise.
> > + */
> > +bool vhost_svq_valid_features(uint64_t *features)
> > +{
> > +    bool r = true;
> > +
> > +    for (uint64_t b = VIRTIO_TRANSPORT_F_START; b <= VIRTIO_TRANSPORT_F_END;
> > +         ++b) {
> > +        switch (b) {
> > +        case VIRTIO_F_ANY_LAYOUT:
> > +            continue;
> > +
> > +        case VIRTIO_F_ACCESS_PLATFORM:
> > +            /* SVQ trust in the host's IOMMU to translate addresses */
> > +        case VIRTIO_F_VERSION_1:
> > +            /* SVQ trust that the guest vring is little endian */
> > +            if (!(*features & BIT_ULL(b))) {
> > +                set_bit(b, features);
> > +                r = false;
> > +            }
> > +            continue;
>
>
> It looks to me the *features is only used for logging errors, if this is
> the truth. I suggest to do error_setg in this function instead of
> letting the caller to pass a pointer.
>

I'm fine with passing the Error pointer, but it's not only used for
logging: if SVQ is enabled, the feature sets of the device and of the
guest are also checked against this.

For example, if the vdpa device supports event_idx, this is the
function the caller uses to filter out that feature bit. The same goes
for indirect, packed, and all the unknown transport features. It's
different for device-specific features like CVQ, where this function
does not apply.

Now that this is stated, I think the cover letter should reflect it
better, and this function should include a verb in its name. Maybe
vhost_svq_filter_valid_transport_features is more appropriate. Let me
know if you see it that way too.
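
Something like this for the helper's contract, just as a sketch (the
name is of course not final):

    /*
     * vhost_svq_filter_valid_transport_features:
     * @features: in: the offered feature set; out: the subset of it
     *            that SVQ can forward
     *
     * Returns true if the mandatory bits (VIRTIO_F_VERSION_1,
     * VIRTIO_F_ACCESS_PLATFORM) are present, false otherwise. The
     * transport bits SVQ cannot forward at the moment (event_idx,
     * indirect, packed, and any unknown ones) are cleared from
     * @features.
     */
    bool vhost_svq_filter_valid_transport_features(uint64_t *features);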

>
> > +
> > +        default:
> > +            if (*features & BIT_ULL(b)) {
> > +                clear_bit(b, features);
> > +            }
>
>
> Do we need to check indirect and event idx here?
>

These are simply cleared at this moment.

Thanks!

> Thanks
>
>
> > +        }
> > +    }
> > +
> > +    return r;
> > +}
> > +
> >   /** Forward guest notifications */
> >   static void vhost_handle_guest_kick(EventNotifier *n)
> >   {
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index c73215751d..d614c435f3 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -348,11 +348,29 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
> >                                  Error **errp)
> >   {
> >       g_autoptr(GPtrArray) shadow_vqs = NULL;
> > +    uint64_t dev_features, svq_features;
> > +    int r;
> > +    bool ok;
> >
> >       if (!v->shadow_vqs_enabled) {
> >           return 0;
> >       }
> >
> > +    r = hdev->vhost_ops->vhost_get_features(hdev, &dev_features);
> > +    if (r != 0) {
> > +        error_setg_errno(errp, -r, "Can't get vdpa device features");
> > +        return r;
> > +    }
> > +
> > +    svq_features = dev_features;
> > +    ok = vhost_svq_valid_features(&svq_features);
> > +    if (unlikely(!ok)) {
> > +        error_setg(errp,
> > +            "SVQ Invalid device feature flags, offer: 0x%"PRIx64", ok: 0x%"PRIx64,
> > +            dev_features, svq_features);
> > +        return -1;
> > +    }
> > +
> >       shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
> >       for (unsigned n = 0; n < hdev->nvqs; ++n) {
> >           g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
>




* Re: [PATCH v2 06/14] vdpa: adapt vhost_ops callbacks to svq
  2022-02-28  3:59     ` Jason Wang
  (?)
@ 2022-03-01 19:31     ` Eugenio Perez Martin
  -1 siblings, 0 replies; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-01 19:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 4:59 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> > On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> > First half of the buffers forwarding part, preparing vhost-vdpa
> > callbacks to SVQ to offer it. QEMU cannot enable it at this moment, so
> > this is effectively dead code at the moment, but it helps to reduce
> > patch size.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++------
> >   1 file changed, 73 insertions(+), 11 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index d614c435f3..b2c4e92fcf 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -344,6 +344,16 @@ static bool vhost_vdpa_one_time_request(struct vhost_dev *dev)
> >       return v->index != 0;
> >   }
> >
> > +static int vhost_vdpa_get_dev_features(struct vhost_dev *dev,
> > +                                       uint64_t *features)
> > +{
> > +    int ret;
> > +
> > +    ret = vhost_vdpa_call(dev, VHOST_GET_FEATURES, features);
> > +    trace_vhost_vdpa_get_features(dev, *features);
> > +    return ret;
> > +}
> > +
> >   static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
> >                                  Error **errp)
> >   {
> > @@ -356,7 +366,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
> >           return 0;
> >       }
> >
> > -    r = hdev->vhost_ops->vhost_get_features(hdev, &dev_features);
> > +    r = vhost_vdpa_get_dev_features(hdev, &dev_features);
> >       if (r != 0) {
> >           error_setg_errno(errp, -r, "Can't get vdpa device features");
> >           return r;
> > @@ -583,12 +593,26 @@ static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
> >   static int vhost_vdpa_set_features(struct vhost_dev *dev,
> >                                      uint64_t features)
> >   {
> > +    struct vhost_vdpa *v = dev->opaque;
> >       int ret;
> >
> >       if (vhost_vdpa_one_time_request(dev)) {
> >           return 0;
> >       }
> >
> > +    if (v->shadow_vqs_enabled) {
> > +        uint64_t features_ok = features;
> > +        bool ok;
> > +
> > +        ok = vhost_svq_valid_features(&features_ok);
> > +        if (unlikely(!ok)) {
> > +            error_report(
> > +                "Invalid guest acked feature flag, acked: 0x%"
> > +                PRIx64", ok: 0x%"PRIx64, features, features_ok);
> > +            return -EINVAL;
> > +        }
> > +    }
> > +
> >       trace_vhost_vdpa_set_features(dev, features);
> >       ret = vhost_vdpa_call(dev, VHOST_SET_FEATURES, &features);
> >       if (ret) {
> > @@ -735,6 +759,13 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, uint8_t *config,
> >       return ret;
> >    }
> >
> > +static int vhost_vdpa_set_dev_vring_base(struct vhost_dev *dev,
> > +                                         struct vhost_vring_state *ring)
> > +{
> > +    trace_vhost_vdpa_set_vring_base(dev, ring->index, ring->num);
> > +    return vhost_vdpa_call(dev, VHOST_SET_VRING_BASE, ring);
> > +}
> > +
> >   static int vhost_vdpa_set_vring_dev_kick(struct vhost_dev *dev,
> >                                            struct vhost_vring_file *file)
> >   {
> > @@ -749,6 +780,18 @@ static int vhost_vdpa_set_vring_dev_call(struct vhost_dev *dev,
> >       return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
> >   }
> >
> > +static int vhost_vdpa_set_vring_dev_addr(struct vhost_dev *dev,
> > +                                         struct vhost_vring_addr *addr)
> > +{
> > +    trace_vhost_vdpa_set_vring_addr(dev, addr->index, addr->flags,
> > +                                addr->desc_user_addr, addr->used_user_addr,
> > +                                addr->avail_user_addr,
> > +                                addr->log_guest_addr);
> > +
> > +    return vhost_vdpa_call(dev, VHOST_SET_VRING_ADDR, addr);
> > +
> > +}
> > +
> >   /**
> >    * Set the shadow virtqueue descriptors to the device
> >    *
> > @@ -859,11 +902,17 @@ static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
> >   static int vhost_vdpa_set_vring_addr(struct vhost_dev *dev,
> >                                          struct vhost_vring_addr *addr)
> >   {
> > -    trace_vhost_vdpa_set_vring_addr(dev, addr->index, addr->flags,
> > -                                    addr->desc_user_addr, addr->used_user_addr,
> > -                                    addr->avail_user_addr,
> > -                                    addr->log_guest_addr);
> > -    return vhost_vdpa_call(dev, VHOST_SET_VRING_ADDR, addr);
> > +    struct vhost_vdpa *v = dev->opaque;
> > +
> > +    if (v->shadow_vqs_enabled) {
> > +        /*
> > +         * Device vring addr was set at device start. SVQ base is handled by
> > +         * VirtQueue code.
> > +         */
> > +        return 0;
> > +    }
> > +
> > +    return vhost_vdpa_set_vring_dev_addr(dev, addr);
> >   }
> >
> >   static int vhost_vdpa_set_vring_num(struct vhost_dev *dev,
> > @@ -876,8 +925,17 @@ static int vhost_vdpa_set_vring_num(struct vhost_dev *dev,
> >   static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
> >                                          struct vhost_vring_state *ring)
> >   {
> > -    trace_vhost_vdpa_set_vring_base(dev, ring->index, ring->num);
> > -    return vhost_vdpa_call(dev, VHOST_SET_VRING_BASE, ring);
> > +    struct vhost_vdpa *v = dev->opaque;
> > +
> > +    if (v->shadow_vqs_enabled) {
> > +        /*
> > +         * Device vring base was set at device start. SVQ base is handled by
> > +         * VirtQueue code.
> > +         */
> > +        return 0;
> > +    }
> > +
> > +    return vhost_vdpa_set_dev_vring_base(dev, ring);
> >   }
> >
> >   static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
> > @@ -924,10 +982,14 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
> >   static int vhost_vdpa_get_features(struct vhost_dev *dev,
> >                                        uint64_t *features)
> >   {
> > -    int ret;
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    int ret = vhost_vdpa_get_dev_features(dev, features);
> > +
> > +    if (ret == 0 && v->shadow_vqs_enabled) {
> > +        /* Filter only features that SVQ can offer to guest */
> > +        vhost_svq_valid_features(features);
>
>
> I think it's better not silently clear features here (e.g the feature
> could be explicitly enabled via cli). This may give more troubles in the
> future cross vendor/backend live migration.
>
> We can simple during vhost_vdpa init.
>

Do you mean to fail the initialization if the emulated device feature
set contains any feature other than the mandatory
VIRTIO_F_ACCESS_PLATFORM and VIRTIO_F_VERSION_1?

I think that makes sense at this moment; I'll move to that for the
next version.
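
So vhost_vdpa_init_svq would end up doing something like this (only a
sketch of the idea; the exact condition and error message are still to
be agreed on):

    svq_features = dev_features;
    ok = vhost_svq_valid_features(&svq_features);
    if (unlikely(!ok || svq_features != dev_features)) {
        error_setg(errp, "SVQ: offered features 0x%"PRIx64
                   " are not a supported subset (0x%"PRIx64")",
                   dev_features, svq_features);
        return -1;
    }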

Thanks!

> Thanks
>
>
> > +    }
> >
> > -    ret = vhost_vdpa_call(dev, VHOST_GET_FEATURES, features);
> > -    trace_vhost_vdpa_get_features(dev, *features);
> >       return ret;
> >   }
> >
>




* Re: [PATCH v2 07/14] vhost: Shadow virtqueue buffers forwarding
  2022-02-28  5:39     ` Jason Wang
  (?)
@ 2022-03-02 18:23     ` Eugenio Perez Martin
  2022-03-03  7:35         ` Jason Wang
  -1 siblings, 1 reply; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-02 18:23 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 6:39 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> > On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> > Initial version of shadow virtqueue that actually forward buffers. There
> > is no iommu support at the moment, and that will be addressed in future
> > patches of this series. Since all vhost-vdpa devices use forced IOMMU,
> > this means that SVQ is not usable at this point of the series on any
> > device.
> >
> > For simplicity it only supports modern devices, that expects vring
> > in little endian, with split ring and no event idx or indirect
> > descriptors. Support for them will not be added in this series.
> >
> > It reuses the VirtQueue code for the device part. The driver part is
> > based on Linux's virtio_ring driver, but with stripped functionality
> > and optimizations so it's easier to review.
> >
> > However, forwarding buffers have some particular pieces: One of the most
> > unexpected ones is that a guest's buffer can expand through more than
> > one descriptor in SVQ. While this is handled gracefully by qemu's
> > emulated virtio devices, it may cause unexpected SVQ queue full. This
> > patch also solves it by checking for this condition at both guest's
> > kicks and device's calls. The code may be more elegant in the future if
> > SVQ code runs in its own iocontext.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.h |  29 +++
> >   hw/virtio/vhost-shadow-virtqueue.c | 360 ++++++++++++++++++++++++++++-
> >   hw/virtio/vhost-vdpa.c             | 165 ++++++++++++-
> >   3 files changed, 542 insertions(+), 12 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > index 3bbea77082..04c67685fd 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > @@ -36,6 +36,33 @@ typedef struct VhostShadowVirtqueue {
> >
> >       /* Guest's call notifier, where the SVQ calls guest. */
> >       EventNotifier svq_call;
> > +
> > +    /* Virtio queue shadowing */
> > +    VirtQueue *vq;
> > +
> > +    /* Virtio device */
> > +    VirtIODevice *vdev;
> > +
> > +    /* Map for use the guest's descriptors */
> > +    VirtQueueElement **ring_id_maps;
> > +
> > +    /* Next VirtQueue element that guest made available */
> > +    VirtQueueElement *next_guest_avail_elem;
> > +
> > +    /* Next head to expose to the device */
> > +    uint16_t avail_idx_shadow;
> > +
> > +    /* Next free descriptor */
> > +    uint16_t free_head;
> > +
> > +    /* Last seen used idx */
> > +    uint16_t shadow_used_idx;
>
>
> I suggest to have a consistent name for avail and used instead of using
> one "shadow" as prefix and other as suffix
>

Right, I'll fix that.

>
> > +
> > +    /* Next head to consume from the device */
> > +    uint16_t last_used_idx;
> > +
> > +    /* Cache for the exposed notification flag */
> > +    bool notification;
> >   } VhostShadowVirtqueue;
> >
> >   bool vhost_svq_valid_features(uint64_t *features);
> > @@ -47,6 +74,8 @@ void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
> >   size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
> >   size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
> >
> > +void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> > +                     VirtQueue *vq);
> >   void vhost_svq_stop(VhostShadowVirtqueue *svq);
> >
> >   VhostShadowVirtqueue *vhost_svq_new(void);
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 2150e2b071..a38d313755 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -12,6 +12,7 @@
> >
> >   #include "qemu/error-report.h"
> >   #include "qemu/main-loop.h"
> > +#include "qemu/log.h"
> >   #include "linux-headers/linux/vhost.h"
> >
> >   /**
> > @@ -53,22 +54,309 @@ bool vhost_svq_valid_features(uint64_t *features)
> >       return r;
> >   }
> >
> > -/** Forward guest notifications */
> > -static void vhost_handle_guest_kick(EventNotifier *n)
> > +/**
> > + * Number of descriptors that the SVQ can make available from the guest.
> > + *
> > + * @svq   The svq
>
>
> Btw, I notice most of function there will be a colon in the middle of
> the parameter and it's documentation.  Maybe we should follow that.
>

Sure, it seems I picked the uncommon template without realizing.


>
> > + */
> > +static uint16_t vhost_svq_available_slots(const VhostShadowVirtqueue *svq)
> > +{
> > +    return svq->vring.num - (svq->avail_idx_shadow - svq->shadow_used_idx);
> > +}
> > +
> > +static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> > +{
> > +    uint16_t notification_flag;
> > +
> > +    if (svq->notification == enable) {
> > +        return;
> > +    }
> > +
> > +    notification_flag = cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
> > +
> > +    svq->notification = enable;
> > +    if (enable) {
> > +        svq->vring.avail->flags &= ~notification_flag;
> > +    } else {
> > +        svq->vring.avail->flags |= notification_flag;
> > +        /* Make sure the flag is written before the read of used_idx */
> > +        smp_mb();
>
>
> So the comment assumes that a reading of used_idx will come. This makes
> me wonder if we can simply split this function as:
>
> vhost_svq_disable_notification() and vhost_svq_enable_notification()
>
> and in the vhost_svq_enable_notification, we simply return
> vhost_svq_more_used() after smp_mb().
>
> (Not a must but just feel it might be more clear)
>

OK, like the kernel's virtio_ring.c, I suppose.
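
So something along these lines (a sketch, ignoring the
svq->notification caching for brevity):

    static void vhost_svq_disable_notification(VhostShadowVirtqueue *svq)
    {
        svq->vring.avail->flags |= cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
    }

    /* Returns true if there are pending used buffers after enabling */
    static bool vhost_svq_enable_notification(VhostShadowVirtqueue *svq)
    {
        svq->vring.avail->flags &= ~cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);

        /* Make sure the flag is written before the read of used_idx */
        smp_mb();
        return vhost_svq_more_used(svq);
    }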

>
> > +    }
> > +}
> > +
> > +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > +                                    const struct iovec *iovec,
> > +                                    size_t num, bool more_descs, bool write)
> > +{
> > +    uint16_t i = svq->free_head, last = svq->free_head;
> > +    unsigned n;
> > +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> > +    vring_desc_t *descs = svq->vring.desc;
> > +
> > +    if (num == 0) {
> > +        return;
> > +    }
> > +
> > +    for (n = 0; n < num; n++) {
> > +        if (more_descs || (n + 1 < num)) {
> > +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> > +        } else {
> > +            descs[i].flags = flags;
> > +        }
> > +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> > +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
> > +
> > +        last = i;
> > +        i = cpu_to_le16(descs[i].next);
> > +    }
> > +
> > +    svq->free_head = le16_to_cpu(descs[last].next);
> > +}
> > +
> > +static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > +                                VirtQueueElement *elem,
> > +                                unsigned *head)
> > +{
> > +    unsigned avail_idx;
> > +    vring_avail_t *avail = svq->vring.avail;
> > +
> > +    *head = svq->free_head;
> > +
> > +    /* We need some descriptors here */
> > +    if (unlikely(!elem->out_num && !elem->in_num)) {
> > +        qemu_log_mask(LOG_GUEST_ERROR,
> > +            "Guest provided element with no descriptors");
> > +        return false;
> > +    }
> > +
> > +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > +                            elem->in_num > 0, false);
> > +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>
>
> I wonder instead of passing in/out separately and using the hint like
> more_descs, is it better to simply pass the elem to
> vhost_vrign_write_descs() then we know which one is the last that
> doesn't depend on more_descs.
>

I'm not sure I follow this.

The purpose of vhost_vring_write_descs is to abstract the writing of a
batch of descriptors, their chaining, etc. It accepts the write
parameter just for the write flag. If we made elem a parameter, we
would need to duplicate that for loop for the read and for the write
descriptors, wouldn't we?

Duplicating the for loop is the way it is done in the kernel, but I
actually think the kernel could benefit from abstracting both into the
same function too. Please let me know if you think otherwise or I've
missed your point.
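
Just to illustrate it, the helper is meant to be called exactly twice
per element, as in the patch:

    /* Read-only (out) descriptors first, chained to the in ones if any */
    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
                            elem->in_num > 0, false);
    /* Device-writable (in) descriptors, closing the chain */
    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);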

>
> > +
> > +    /*
> > +     * Put the entry in the available array (but don't update avail->idx until
> > +     * they do sync).
> > +     */
> > +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> > +    avail->ring[avail_idx] = cpu_to_le16(*head);
> > +    svq->avail_idx_shadow++;
> > +
> > +    /* Update the avail index after write the descriptor */
> > +    smp_wmb();
> > +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
> > +
> > +    return true;
> > +}
> > +
> > +static bool vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> > +{
> > +    unsigned qemu_head;
> > +    bool ok = vhost_svq_add_split(svq, elem, &qemu_head);
> > +    if (unlikely(!ok)) {
> > +        return false;
> > +    }
> > +
> > +    svq->ring_id_maps[qemu_head] = elem;
> > +    return true;
> > +}
> > +
> > +static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> > +{
> > +    /*
> > +     * We need to expose the available array entries before checking the used
> > +     * flags
> > +     */
> > +    smp_mb();
> > +    if (svq->vring.used->flags & VRING_USED_F_NO_NOTIFY) {
> > +        return;
> > +    }
> > +
> > +    event_notifier_set(&svq->hdev_kick);
> > +}
> > +
> > +/**
> > + * Forward available buffers.
> > + *
> > + * @svq Shadow VirtQueue
> > + *
> > + * Note that this function does not guarantee that all guest's available
> > + * buffers are available to the device in SVQ avail ring. The guest may have
> > + * exposed a GPA / GIOVA contiguous buffer, but it may not be contiguous in
> > + * qemu vaddr.
> > + *
> > + * If that happens, guest's kick notifications will be disabled until the
> > + * device uses some buffers.
> > + */
> > +static void vhost_handle_guest_kick(VhostShadowVirtqueue *svq)
> > +{
> > +    /* Clear event notifier */
> > +    event_notifier_test_and_clear(&svq->svq_kick);
> > +
> > +    /* Forward to the device as many available buffers as possible */
> > +    do {
> > +        virtio_queue_set_notification(svq->vq, false);
> > +
> > +        while (true) {
> > +            VirtQueueElement *elem;
> > +            bool ok;
> > +
> > +            if (svq->next_guest_avail_elem) {
> > +                elem = g_steal_pointer(&svq->next_guest_avail_elem);
> > +            } else {
> > +                elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > +            }
> > +
> > +            if (!elem) {
> > +                break;
> > +            }
> > +
> > +            if (elem->out_num + elem->in_num >
> > +                vhost_svq_available_slots(svq)) {
> > +                /*
> > +                 * This condition is possible since a contiguous buffer in GPA
> > +                 * does not imply a contiguous buffer in qemu's VA
> > +                 * scatter-gather segments. If that happens, the buffer exposed
> > +                 * to the device needs to be a chain of descriptors at this
> > +                 * moment.
> > +                 *
> > +                 * SVQ cannot hold more available buffers if we are here:
> > +                 * queue the current guest descriptor and ignore further kicks
> > +                 * until some elements are used.
> > +                 */
> > +                svq->next_guest_avail_elem = elem;
> > +                return;
> > +            }
> > +
> > +            ok = vhost_svq_add(svq, elem);
> > +            if (unlikely(!ok)) {
> > +                /* VQ is broken, just return and ignore any other kicks */
> > +                return;
> > +            }
> > +            vhost_svq_kick(svq);
> > +        }
> > +
> > +        virtio_queue_set_notification(svq->vq, true);
> > +    } while (!virtio_queue_empty(svq->vq));
> > +}
> > +
> > +/**
> > + * Handle guest's kick.
> > + *
> > + * @n guest kick event notifier, the one that guest set to notify svq.
> > + */
> > +static void vhost_handle_guest_kick_notifier(EventNotifier *n)
> >   {
> >       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >                                                svq_kick);
> >       event_notifier_test_and_clear(n);
> > -    event_notifier_set(&svq->hdev_kick);
> > +    vhost_handle_guest_kick(svq);
> > +}
> > +
> > +static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > +{
> > +    if (svq->last_used_idx != svq->shadow_used_idx) {
> > +        return true;
> > +    }
> > +
> > +    svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
> > +
> > +    return svq->last_used_idx != svq->shadow_used_idx;
> >   }
> >
> > -/* Forward vhost notifications */
> > +static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > +{
> > +    vring_desc_t *descs = svq->vring.desc;
> > +    const vring_used_t *used = svq->vring.used;
> > +    vring_used_elem_t used_elem;
> > +    uint16_t last_used;
> > +
> > +    if (!vhost_svq_more_used(svq)) {
> > +        return NULL;
> > +    }
> > +
> > +    /* Only get used array entries after they have been exposed by dev */
> > +    smp_rmb();
> > +    last_used = svq->last_used_idx & (svq->vring.num - 1);
> > +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
> > +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
> > +
> > +    svq->last_used_idx++;
> > +    if (unlikely(used_elem.id >= svq->vring.num)) {
> > +        qemu_log_mask(LOG_GUEST_ERROR, "Device %s says index %u is used",
> > +                      svq->vdev->name, used_elem.id);
> > +        return NULL;
> > +    }
> > +
> > +    if (unlikely(!svq->ring_id_maps[used_elem.id])) {
> > +        qemu_log_mask(LOG_GUEST_ERROR,
> > +            "Device %s says index %u is used, but it was not available",
> > +            svq->vdev->name, used_elem.id);
> > +        return NULL;
> > +    }
> > +
> > +    descs[used_elem.id].next = svq->free_head;
> > +    svq->free_head = used_elem.id;
> > +
> > +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
>
>
> It looks to me we'd better not modify the elem here, otherwise we may
> leak mapping during virtqueue_unmap_sg()?
>

Right, it's better to track it separately.
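
One way of doing it (just a sketch, names not final) would be to
return the used length from vhost_svq_get_buf instead of writing it
into the element:

    static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq,
                                               uint32_t *len)
    {
        /* ... same as before ... */
        *len = used_elem.len;
        return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
    }

so that vhost_svq_flush can call virtqueue_fill(vq, elem, len, i++)
without ever touching elem->len.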

>
> > +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> > +}
> > +
> > +static void vhost_svq_flush(VhostShadowVirtqueue *svq,
> > +                            bool check_for_avail_queue)
> > +{
> > +    VirtQueue *vq = svq->vq;
> > +
> > +    /* Forward as many used buffers as possible. */
> > +    do {
> > +        unsigned i = 0;
> > +
> > +        vhost_svq_set_notification(svq, false);
> > +        while (true) {
> > +            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> > +            if (!elem) {
> > +                break;
> > +            }
> > +
> > +            if (unlikely(i >= svq->vring.num)) {
> > +                qemu_log_mask(LOG_GUEST_ERROR,
> > +                         "More than %u used buffers obtained in a %u size SVQ",
> > +                         i, svq->vring.num);
> > +                virtqueue_fill(vq, elem, elem->len, i);
> > +                virtqueue_flush(vq, i);
> > +                return;
> > +            }
> > +            virtqueue_fill(vq, elem, elem->len, i++);
> > +        }
> > +
> > +        virtqueue_flush(vq, i);
> > +        event_notifier_set(&svq->svq_call);
> > +
> > +        if (check_for_avail_queue && svq->next_guest_avail_elem) {
> > +            /*
> > +             * Avail ring was full when vhost_svq_flush was called, so it's a
> > +             * good moment to make more descriptors available if possible.
> > +             */
> > +            vhost_handle_guest_kick(svq);
> > +        }
> > +
> > +        vhost_svq_set_notification(svq, true);
> > +    } while (vhost_svq_more_used(svq));
> > +}
> > +
> > +/**
> > + * Forward used buffers.
> > + *
> > + * @n hdev call event notifier, the one that device set to notify svq.
> > + *
> > + * Note that we are not making any buffers available in the loop, there is no
> > + * way that it runs more than virtqueue size times.
> > + */
> >   static void vhost_svq_handle_call(EventNotifier *n)
> >   {
> >       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >                                                hdev_call);
> >       event_notifier_test_and_clear(n);
> > -    event_notifier_set(&svq->svq_call);
> > +    vhost_svq_flush(svq, true);
> >   }
> >
> >   /**
> > @@ -149,7 +437,41 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
> >       if (poll_start) {
> >           event_notifier_init_fd(svq_kick, svq_kick_fd);
> >           event_notifier_set(svq_kick);
> > -        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick);
> > +        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick_notifier);
> > +    }
> > +}
> > +
> > +/**
> > + * Start the shadow virtqueue operation.
> > + *
> > + * @svq Shadow Virtqueue
> > + * @vdev        VirtIO device
> > + * @vq          Virtqueue to shadow
> > + */
> > +void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> > +                     VirtQueue *vq)
> > +{
> > +    size_t desc_size, driver_size, device_size;
> > +
> > +    svq->next_guest_avail_elem = NULL;
> > +    svq->avail_idx_shadow = 0;
> > +    svq->shadow_used_idx = 0;
> > +    svq->last_used_idx = 0;
> > +    svq->vdev = vdev;
> > +    svq->vq = vq;
> > +
> > +    svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
> > +    driver_size = vhost_svq_driver_area_size(svq);
> > +    device_size = vhost_svq_device_area_size(svq);
> > +    svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
> > +    desc_size = sizeof(vring_desc_t) * svq->vring.num;
> > +    svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> > +    memset(svq->vring.desc, 0, driver_size);
> > +    svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> > +    memset(svq->vring.used, 0, device_size);
> > +    svq->ring_id_maps = g_new0(VirtQueueElement *, svq->vring.num);
> > +    for (unsigned i = 0; i < svq->vring.num - 1; i++) {
> > +        svq->vring.desc[i].next = cpu_to_le16(i + 1);
> >       }
> >   }
> >
> > @@ -160,6 +482,32 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
> >   void vhost_svq_stop(VhostShadowVirtqueue *svq)
> >   {
> >       event_notifier_set_handler(&svq->svq_kick, NULL);
> > +    g_autofree VirtQueueElement *next_avail_elem = NULL;
> > +
> > +    if (!svq->vq) {
> > +        return;
> > +    }
> > +
> > +    /* Send all pending used descriptors to guest */
> > +    vhost_svq_flush(svq, false);
> > +
> > +    for (unsigned i = 0; i < svq->vring.num; ++i) {
> > +        g_autofree VirtQueueElement *elem = NULL;
> > +        elem = g_steal_pointer(&svq->ring_id_maps[i]);
> > +        if (elem) {
> > +            virtqueue_detach_element(svq->vq, elem, elem->len);
> > +        }
> > +    }
> > +
> > +    next_avail_elem = g_steal_pointer(&svq->next_guest_avail_elem);
> > +    if (next_avail_elem) {
> > +        virtqueue_detach_element(svq->vq, next_avail_elem,
> > +                                 next_avail_elem->len);
> > +    }
> > +    svq->vq = NULL;
> > +    g_free(svq->ring_id_maps);
> > +    qemu_vfree(svq->vring.desc);
> > +    qemu_vfree(svq->vring.used);
> >   }
> >
> >   /**
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index b2c4e92fcf..435b9c2e9e 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -803,10 +803,10 @@ static int vhost_vdpa_set_vring_dev_addr(struct vhost_dev *dev,
> >    * Note that this function does not rewind kick file descriptor if cannot set
> >    * call one.
> >    */
> > -static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> > -                                 VhostShadowVirtqueue *svq,
> > -                                 unsigned idx,
> > -                                 Error **errp)
> > +static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
> > +                                  VhostShadowVirtqueue *svq,
> > +                                  unsigned idx,
> > +                                  Error **errp)
> >   {
> >       struct vhost_vring_file file = {
> >           .index = dev->vq_index + idx,
> > @@ -818,7 +818,7 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> >       r = vhost_vdpa_set_vring_dev_kick(dev, &file);
> >       if (unlikely(r != 0)) {
> >           error_setg_errno(errp, -r, "Can't set device kick fd");
> > -        return false;
> > +        return r;
> >       }
> >
> >       event_notifier = &svq->hdev_call;
> > @@ -828,6 +828,119 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> >           error_setg_errno(errp, -r, "Can't set device call fd");
> >       }
> >
> > +    return r;
> > +}
> > +
> > +/**
> > + * Unmap a SVQ area in the device
> > + */
> > +static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
> > +                                      hwaddr size)
> > +{
> > +    int r;
> > +
> > +    size = ROUND_UP(size, qemu_real_host_page_size);
> > +    r = vhost_vdpa_dma_unmap(v, iova, size);
> > +    return r == 0;
> > +}
> > +
> > +static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> > +                                       const VhostShadowVirtqueue *svq)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    struct vhost_vring_addr svq_addr;
> > +    size_t device_size = vhost_svq_device_area_size(svq);
> > +    size_t driver_size = vhost_svq_driver_area_size(svq);
> > +    bool ok;
> > +
> > +    vhost_svq_get_vring_addr(svq, &svq_addr);
> > +
> > +    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
> > +    if (unlikely(!ok)) {
> > +        return false;
> > +    }
> > +
> > +    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
> > +}
> > +
> > +/**
> > + * Map shadow virtqueue rings in device
> > + *
> > + * @dev   The vhost device
> > + * @svq   The shadow virtqueue
> > + * @addr  Assigned IOVA addresses
> > + * @errp  Error pointer
> > + */
> > +static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> > +                                     const VhostShadowVirtqueue *svq,
> > +                                     struct vhost_vring_addr *addr,
> > +                                     Error **errp)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    size_t device_size = vhost_svq_device_area_size(svq);
> > +    size_t driver_size = vhost_svq_driver_area_size(svq);
> > +    int r;
> > +
> > +    ERRP_GUARD();
> > +    vhost_svq_get_vring_addr(svq, addr);
> > +
> > +    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
> > +                           (void *)addr->desc_user_addr, true);
> > +    if (unlikely(r != 0)) {
> > +        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
> > +        return false;
> > +    }
> > +
> > +    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
> > +                           (void *)addr->used_user_addr, false);
> > +    if (unlikely(r != 0)) {
> > +        error_setg_errno(errp, -r, "Cannot create vq device region: ");
> > +    }
> > +
> > +    return r == 0;
> > +}
> > +
> > +static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> > +                                 VhostShadowVirtqueue *svq,
> > +                                 unsigned idx,
> > +                                 Error **errp)
> > +{
> > +    uint16_t vq_index = dev->vq_index + idx;
> > +    struct vhost_vring_state s = {
> > +        .index = vq_index,
> > +    };
> > +    int r;
> > +
> > +    r = vhost_vdpa_set_dev_vring_base(dev, &s);
> > +    if (unlikely(r)) {
> > +        error_setg_errno(errp, -r, "Cannot set vring base");
> > +        return false;
> > +    }
> > +
> > +    r = vhost_vdpa_svq_set_fds(dev, svq, idx, errp);
> > +    return r == 0;
> > +}
> > +
> > +static bool vhost_vdpa_svq_set_addr(struct vhost_dev *dev, unsigned idx,
> > +                                    VhostShadowVirtqueue *svq,
> > +                                    Error **errp)
> > +{
> > +    struct vhost_vring_addr addr = {
> > +        .index = idx,
> > +    };
> > +    int r;
> > +
> > +    bool ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, errp);
> > +    if (unlikely(!ok)) {
> > +        return false;
> > +    }
> > +
> > +    /* Override vring GPA set by vhost subsystem */
> > +    r = vhost_vdpa_set_vring_dev_addr(dev, &addr);
> > +    if (unlikely(r != 0)) {
> > +        error_setg_errno(errp, -r, "Cannot set device address");
> > +    }
> > +
> >       return r == 0;
> >   }
> >
> > @@ -842,10 +955,46 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
> >       }
> >
> >       for (i = 0; i < v->shadow_vqs->len; ++i) {
> > +        VirtQueue *vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
> >           VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
> >           bool ok = vhost_vdpa_svq_setup(dev, svq, i, &err);
> >           if (unlikely(!ok)) {
> > -            error_reportf_err(err, "Cannot setup SVQ %u: ", i);
> > +            goto err;
> > +        }
> > +        vhost_svq_start(svq, dev->vdev, vq);
> > +        ok = vhost_vdpa_svq_set_addr(dev, i, svq, &err);
> > +        if (unlikely(!ok)) {
> > +            vhost_svq_stop(svq);
> > +            goto err;
>
>
> Nit: let's introduce a new error label for this?
>

I'm fine with that, but the new label needs to unwind the current
index i, calling vhost_svq_stop on svqs[i], while the code at the err
label works on j < i.

So there would be something like:

err_mapping:
    /* Unwind the SVQ currently being started */
    vhost_svq_stop(svqs[i]);

err:
    for (unsigned j = 0; j < i; ++j) {
        vhost_svq_stop(svqs[j]);
        vhost_vdpa_svq_unmap_rings(dev, svqs[j]);
    }

That repeats the vhost_svq_stop call on different indexes.

>
> > +        }
> > +    }
> > +
> > +    return true;
> > +
> > +err:
> > +    error_reportf_err(err, "Cannot setup SVQ %u: ", i);
> > +    for (unsigned j = 0; j < i; ++j) {
> > +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, j);
> > +        vhost_vdpa_svq_unmap_rings(dev, svq);
>
>
> I wonder if it's better to move the vhost_vdpa_svq_map_rings() out of
> vhost_vdpa_svq_set_addr(). (This function seems to be the only user for
> that). This will makes the error handling logic more clear.
>

Yes, I'll move it out.
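
Just to confirm the direction, each iteration of the start loop would
then look roughly like this (a sketch; the error unwinding is the one
discussed above, so I only note it in the comments):

    VirtQueue *vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
    VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
    struct vhost_vring_addr addr = {
        .index = i,
    };
    bool ok;
    int r;

    ok = vhost_vdpa_svq_setup(dev, svq, i, &err);
    if (unlikely(!ok)) {
        goto err;
    }

    vhost_svq_start(svq, dev->vdev, vq);
    ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, &err);
    if (unlikely(!ok)) {
        /* unwind: stop this SVQ, then stop + unmap the previous ones */
    }

    /* Override vring GPA set by vhost subsystem */
    r = vhost_vdpa_set_vring_dev_addr(dev, &addr);
    if (unlikely(r != 0)) {
        error_setg_errno(&err, -r, "Cannot set device address");
        /* unwind: unmap + stop this SVQ, then the previous ones */
    }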

Thanks!

> Thanks
>
>
> > +        vhost_svq_stop(svq);
> > +    }
> > +
> > +    return false;
> > +}
> > +
> > +static bool vhost_vdpa_svqs_stop(struct vhost_dev *dev)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +
> > +    if (!v->shadow_vqs) {
> > +        return true;
> > +    }
> > +
> > +    for (unsigned i = 0; i < v->shadow_vqs->len; ++i) {
> > +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs,
> > +                                                      i);
> > +        bool ok = vhost_vdpa_svq_unmap_rings(dev, svq);
> > +        if (unlikely(!ok)) {
> >               return false;
> >           }
> >       }
> > @@ -867,6 +1016,10 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >           }
> >           vhost_vdpa_set_vring_ready(dev);
> >       } else {
> > +        ok = vhost_vdpa_svqs_stop(dev);
> > +        if (unlikely(!ok)) {
> > +            return -1;
> > +        }
> >           vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
> >       }
> >
>




* Re: [PATCH v2 00/14] vDPA shadow virtqueue
  2022-02-28  7:41   ` Jason Wang
  (?)
@ 2022-03-02 20:30   ` Eugenio Perez Martin
  -1 siblings, 0 replies; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-02 20:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 8:41 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> > On 2022/2/27 9:40 PM, Eugenio Pérez wrote:
> > This series enable shadow virtqueue (SVQ) for vhost-vdpa devices. This
> > is intended as a new method of tracking the memory the devices touch
> > during a migration process: Instead of relay on vhost device's dirty
> > logging capability, SVQ intercepts the VQ dataplane forwarding the
> > descriptors between VM and device. This way qemu is the effective
> > writer of guests memory, like in qemu's virtio device operation.
> >
> > When SVQ is enabled qemu offers a new virtual address space to the
> > device to read and write into, and it maps new vrings and the guest
> > memory in it. SVQ also intercepts kicks and calls between the device
> > and the guest. Used buffers relay would cause dirty memory being
> > tracked.
> >
> > This effectively means that vDPA device passthrough is intercepted by
> > qemu. While SVQ should only be enabled at migration time, the switching
> > from regular mode to SVQ mode is left for a future series.
> >
> > It is based on the ideas of DPDK SW assisted LM, in the series of
> > DPDK's https://patchwork.dpdk.org/cover/48370/ . However, these does
> > not map the shadow vq in guest's VA, but in qemu's.
> >
> > For qemu to use shadow virtqueues the guest virtio driver must not use
> > features like event_idx, indirect descriptors, packed and in_order.
> > These features are easy to implement on top of this base, but is left
> > for a future series for simplicity.
> >
> > SVQ needs to be enabled at qemu start time with vdpa cmdline parameter:
> >
> > -netdev type=vhost-vdpa,vhostdev=vhost-vdpa-0,id=vhost-vdpa0,x-svq=off
> >
> > The first three patches enables notifications forwarding with
> > assistance of qemu. It's easy to enable only this if the relevant
> > cmdline part of the last patch is applied on top of these.
> >
> > Next four patches implement the actual buffer forwarding. However,
> > address are not translated from HVA so they will need a host device with
> > an iommu allowing them to access all of the HVA range.
> >
> > The last part of the series uses properly the host iommu, so qemu
> > creates a new iova address space in the device's range and translates
> > the buffers in it. Finally, it adds the cmdline parameter.
> >
> > Some simple performance tests with netperf were done. They used a nested
> > guest with vp_vdpa, vhost-kernel at L0 host. Starting with no svq and a
> > baseline average of ~9980.13Mbps:
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> > 131072  16384  16384    30.01    9910.61
> > 131072  16384  16384    30.00    10030.94
> > 131072  16384  16384    30.01    9998.84
> >
> > To enable the notifications interception reduced performance to an
> > average of ~9577.73Mbit/s:
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> > 131072  16384  16384    30.00    9563.03
> > 131072  16384  16384    30.01    9626.65
> > 131072  16384  16384    30.01    9543.51
> >
> > Finally, to enable buffers forwarding reduced the throughput again to
> > ~8902.92Mbit/s:
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> > 131072  16384  16384    30.01    8643.19
> > 131072  16384  16384    30.01    9033.56
> > 131072  16384  16384    30.01    9032.02
> >
> > However, many performance improvements were left out of this series for
> > simplicity, so difference if performance should shrink in the future.
> >
> > Comments are welcome.
>
>
> The series looks good overall, few comments in the individual patch.
>
> I think if there's no objection, we can try to make it 7.0. (soft-freeze
> is 2022-03-08)
>
> Thanks
>

I'm sending v3 with all the comments we agree on addressed, so we can
comment on the new code too. We can ignore it if sending it while some
comments are still pending is the wrong thing to do.

Thanks!

>
> >
> > TODO in future series:
> > * Event, indirect, packed, and others features of virtio.
> > * To support different set of features between the device<->SVQ and the
> >    SVQ<->guest communication.
> > * Support of device host notifier memory regions.
> > * To sepparate buffers forwarding in its own AIO context, so we can
> >    throw more threads to that task and we don't need to stop the main
> >    event loop.
> > * Support multiqueue virtio-net vdpa.
> > * Proper documentation.
> >
> > Changes from v1:
> > * Feature set at device->SVQ is now the same as SVQ->guest.
> > * Size of SVQ is not max available device size anymore, but guest's
> >    negotiated.
> > * Add VHOST_FILE_UNBIND kick and call fd treatment.
> > * Make SVQ a public struct
> > * Come back to previous approach to iova-tree
> > * Some assertions are now fail paths. Some errors are now log_guest.
> > * Only mask _F_LOG feature at vdpa_set_features svq enable path.
> > * Refactor some errors and messages. Add missing error unwindings.
> > * Add memory barrier at _F_NO_NOTIFY set.
> > * Stop checking for features flags out of transport range.
> > v1 link:
> > https://lore.kernel.org/virtualization/7d86c715-6d71-8a27-91f5-8d47b71e3201@redhat.com/
> >
> > Changes from v4 RFC:
> > * Support of allocating / freeing iova ranges in IOVA tree. Extending
> >    already present iova-tree for that.
> > * Proper validation of guest features. Now SVQ can negotiate a
> >    different set of features with the device when enabled.
> > * Support of host notifiers memory regions
> > * Handling of SVQ full queue in case guest's descriptors span to
> >    different memory regions (qemu's VA chunks).
> > * Flush pending used buffers at end of SVQ operation.
> > * QMP command now looks by NetClientState name. Other devices will need
> >    to implement it's way to enable vdpa.
> > * Rename QMP command to set, so it looks more like a way of working
> > * Better use of qemu error system
> > * Make a few assertions proper error-handling paths.
> > * Add more documentation
> > * Less coupling of virtio / vhost, that could cause friction on changes
> > * Addressed many other small comments and small fixes.
> >
> > Changes from v3 RFC:
> >    * Move everything to vhost-vdpa backend. A big change, this allowed
> >      some cleanup but more code has been added in other places.
> >    * More use of glib utilities, especially to manage memory.
> > v3 link:
> > https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06032.html
> >
> > Changes from v2 RFC:
> >    * Adding vhost-vdpa devices support
> >    * Fixed some memory leaks pointed by different comments
> > v2 link:
> > https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg05600.html
> >
> > Changes from v1 RFC:
> >    * Use QMP instead of migration to start SVQ mode.
> >    * Only accepting IOMMU devices, closer behavior with target devices
> >      (vDPA)
> >    * Fix invalid masking/unmasking of vhost call fd.
> >    * Use of proper methods for synchronization.
> >    * No need to modify VirtIO device code, all of the changes are
> >      contained in vhost code.
> >    * Delete superfluous code.
> >    * An intermediate RFC was sent with only the notifications forwarding
> >      changes. It can be seen in
> >      https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
> > v1 link:
> > https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
> >
> > Eugenio Pérez (20):
> >        virtio: Add VIRTIO_F_QUEUE_STATE
> >        virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
> >        virtio: Add virtio_queue_is_host_notifier_enabled
> >        vhost: Make vhost_virtqueue_{start,stop} public
> >        vhost: Add x-vhost-enable-shadow-vq qmp
> >        vhost: Add VhostShadowVirtqueue
> >        vdpa: Register vdpa devices in a list
> >        vhost: Route guest->host notification through shadow virtqueue
> >        Add vhost_svq_get_svq_call_notifier
> >        Add vhost_svq_set_guest_call_notifier
> >        vdpa: Save call_fd in vhost-vdpa
> >        vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
> >        vhost: Route host->guest notification through shadow virtqueue
> >        virtio: Add vhost_shadow_vq_get_vring_addr
> >        vdpa: Save host and guest features
> >        vhost: Add vhost_svq_valid_device_features to shadow vq
> >        vhost: Shadow virtqueue buffers forwarding
> >        vhost: Add VhostIOVATree
> >        vhost: Use a tree to store memory mappings
> >        vdpa: Add custom IOTLB translations to SVQ
> >
> > Eugenio Pérez (14):
> >    vhost: Add VhostShadowVirtqueue
> >    vhost: Add Shadow VirtQueue kick forwarding capabilities
> >    vhost: Add Shadow VirtQueue call forwarding capabilities
> >    vhost: Add vhost_svq_valid_features to shadow vq
> >    virtio: Add vhost_shadow_vq_get_vring_addr
> >    vdpa: adapt vhost_ops callbacks to svq
> >    vhost: Shadow virtqueue buffers forwarding
> >    util: Add iova_tree_alloc
> >    vhost: Add VhostIOVATree
> >    vdpa: Add custom IOTLB translations to SVQ
> >    vdpa: Adapt vhost_vdpa_get_vring_base to SVQ
> >    vdpa: Never set log_base addr if SVQ is enabled
> >    vdpa: Expose VHOST_F_LOG_ALL on SVQ
> >    vdpa: Add x-svq to NetdevVhostVDPAOptions
> >
> >   qapi/net.json                      |   5 +-
> >   hw/virtio/vhost-iova-tree.h        |  27 ++
> >   hw/virtio/vhost-shadow-virtqueue.h |  90 ++++
> >   include/hw/virtio/vhost-vdpa.h     |   8 +
> >   include/qemu/iova-tree.h           |  18 +
> >   hw/virtio/vhost-iova-tree.c        | 155 +++++++
> >   hw/virtio/vhost-shadow-virtqueue.c | 632 +++++++++++++++++++++++++++++
> >   hw/virtio/vhost-vdpa.c             | 551 ++++++++++++++++++++++++-
> >   net/vhost-vdpa.c                   |  48 ++-
> >   util/iova-tree.c                   | 133 ++++++
> >   hw/virtio/meson.build              |   2 +-
> >   11 files changed, 1644 insertions(+), 25 deletions(-)
> >   create mode 100644 hw/virtio/vhost-iova-tree.h
> >   create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
> >   create mode 100644 hw/virtio/vhost-iova-tree.c
> >   create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
> >
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities
  2022-03-01 18:49     ` Eugenio Perez Martin
@ 2022-03-03  7:12         ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-03  7:12 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/3/2 2:49 AM, Eugenio Perez Martin wrote:
> On Mon, Feb 28, 2022 at 3:57 AM Jason Wang<jasowang@redhat.com>  wrote:
>> On 2022/2/27 9:40 PM, Eugenio Pérez wrote:
>>> At this mode no buffer forwarding will be performed in SVQ mode: Qemu
>>> will just forward the guest's kicks to the device.
>>>
>>> Host memory notifiers regions are left out for simplicity, and they will
>>> not be addressed in this series.
>>>
>>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
>>> ---
>>>    hw/virtio/vhost-shadow-virtqueue.h |  14 +++
>>>    include/hw/virtio/vhost-vdpa.h     |   4 +
>>>    hw/virtio/vhost-shadow-virtqueue.c |  52 +++++++++++
>>>    hw/virtio/vhost-vdpa.c             | 145 ++++++++++++++++++++++++++++-
>>>    4 files changed, 213 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>>> index f1519e3c7b..1cbc87d5d8 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>>> @@ -18,8 +18,22 @@ typedef struct VhostShadowVirtqueue {
>>>        EventNotifier hdev_kick;
>>>        /* Shadow call notifier, sent to vhost */
>>>        EventNotifier hdev_call;
>>> +
>>> +    /*
>>> +     * Borrowed virtqueue's guest to host notifier. To borrow it in this event
>>> +     * notifier allows to recover the VhostShadowVirtqueue from the event loop
>>> +     * easily. If we use the VirtQueue's one, we don't have an easy way to
>>> +     * retrieve VhostShadowVirtqueue.
>>> +     *
>>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
>>> +     */
>>> +    EventNotifier svq_kick;
>>>    } VhostShadowVirtqueue;
>>>
>>> +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
>>> +
>>> +void vhost_svq_stop(VhostShadowVirtqueue *svq);
>>> +
>>>    VhostShadowVirtqueue *vhost_svq_new(void);
>>>
>>>    void vhost_svq_free(gpointer vq);
>>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
>>> index 3ce79a646d..009a9f3b6b 100644
>>> --- a/include/hw/virtio/vhost-vdpa.h
>>> +++ b/include/hw/virtio/vhost-vdpa.h
>>> @@ -12,6 +12,8 @@
>>>    #ifndef HW_VIRTIO_VHOST_VDPA_H
>>>    #define HW_VIRTIO_VHOST_VDPA_H
>>>
>>> +#include <gmodule.h>
>>> +
>>>    #include "hw/virtio/virtio.h"
>>>    #include "standard-headers/linux/vhost_types.h"
>>>
>>> @@ -27,6 +29,8 @@ typedef struct vhost_vdpa {
>>>        bool iotlb_batch_begin_sent;
>>>        MemoryListener listener;
>>>        struct vhost_vdpa_iova_range iova_range;
>>> +    bool shadow_vqs_enabled;
>>> +    GPtrArray *shadow_vqs;
>>>        struct vhost_dev *dev;
>>>        VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
>>>    } VhostVDPA;
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index 019cf1950f..a5d0659f86 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -11,6 +11,56 @@
>>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>
>>>    #include "qemu/error-report.h"
>>> +#include "qemu/main-loop.h"
>>> +#include "linux-headers/linux/vhost.h"
>>> +
>>> +/** Forward guest notifications */
>>> +static void vhost_handle_guest_kick(EventNotifier *n)
>>> +{
>>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>> +                                             svq_kick);
>>> +    event_notifier_test_and_clear(n);
>>> +    event_notifier_set(&svq->hdev_kick);
>>> +}
>>> +
>>> +/**
>>> + * Set a new file descriptor for the guest to kick the SVQ and notify for avail
>>> + *
>>> + * @svq          The svq
>>> + * @svq_kick_fd  The svq kick fd
>>> + *
>>> + * Note that the SVQ will never close the old file descriptor.
>>> + */
>>> +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
>>> +{
>>> +    EventNotifier *svq_kick = &svq->svq_kick;
>>> +    bool poll_stop = VHOST_FILE_UNBIND != event_notifier_get_fd(svq_kick);
>> I wonder if this is robust. E.g is there any chance that may end up with
>> both poll_stop and poll_start are false?
>>
> I cannot make that happen in qemu, but the function supports that case
> well: it will do nothing. It's more or less the same code as used in
> the vhost kernel, and to me it is the expected behaviour if you send
> two VHOST_FILE_UNBIND requests one right after another.


I would think it would just stop twice.


>
>> If not, can we simple detect poll_stop as below and treat !poll_start
>> and poll_stop?
>>
> I'm not sure what it adds. Is there an unexpected consequence of
> the current do-nothing behavior that I've missed?


I'm not sure, but it feels odd if poll_start is not simply the inverse of
poll_stop.

Thanks
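
For reference, here is a minimal sketch of the fd-swap logic under
discussion (it reuses the helper names from the patch but is not the exact
patch code); it shows why both flags can legitimately be false at once:

void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
{
    EventNotifier *svq_kick = &svq->svq_kick;
    bool poll_stop = VHOST_FILE_UNBIND != event_notifier_get_fd(svq_kick);
    bool poll_start = svq_kick_fd != VHOST_FILE_UNBIND;

    if (poll_stop) {
        /* Stop listening on the old fd before replacing it */
        event_notifier_set_handler(svq_kick, NULL);
    }

    /* The SVQ does not own the fd, so the old one is never closed here */
    event_notifier_init_fd(svq_kick, svq_kick_fd);

    if (poll_start) {
        /* Catch any kick that arrived while nobody was polling */
        event_notifier_set(svq_kick);
        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick);
    }
}

If both the old and the new fd are VHOST_FILE_UNBIND, poll_stop and
poll_start are both false and the call is a no-op, which matches the
"do nothing" behaviour described above.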



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 08/14] util: Add iova_tree_alloc
  2022-03-01 10:06     ` Eugenio Perez Martin
@ 2022-03-03  7:16         ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-03  7:16 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/3/1 6:06 PM, Eugenio Perez Martin wrote:
>>> +
>>> +    /*
>>> +     * Find a valid hole for the mapping
>>> +     *
>>> +     * Assuming low iova_begin, so no need to do a binary search to
>>> +     * locate the first node.
>>> +     *
>>> +     * TODO: Replace all this with g_tree_node_first/next/last when available
>>> +     * (from glib since 2.68). To do it with g_tree_foreach complicates the
>>> +     * code a lot.
>>> +     *
>> One more question
>>
>> The current code looks like it works, but it is still a little complicated
>> to review. Looking at the missing helpers above, if adds and removes are
>> seldom, I wonder if we can simply do a
>>
>> g_tree_foreach() during each add/del to build a sorted list, and then we can
>> emulate g_tree_node_first/next/last easily?
>>
> This sounds a lot like the method in v1 [1]:).


Oh, right. I missed that, and it took me some time to recall it.


>
> But it didn't use the O(N) foreach, since we can locate the new node's
> previous element by looking for the upper bound of iova-1, keeping the
> insertion's complexity at O(log(N)). The function g_tree_upper_bound was
> added in GLib 2.68, so the proposed version will be deleted sooner or later.
>
> Also, deletion stays O(log(N)) since deleting a node from a QLIST is O(1).


Yes, so I think we can leave the log as is and do optimization on top.

Thanks
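
As an aside, a simplified, self-contained sketch of the hole search being
discussed (illustrative names, plain exclusive byte lengths instead of
QEMU's inclusive DMAMap sizes, mappings assumed sorted by iova,
non-overlapping and contained in [iova_begin, iova_end)):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t iova;
    uint64_t len;               /* exclusive length in bytes */
} Mapping;

/* Store the start of the first hole of at least `len` bytes in *result.
 * Returns 0 on success, -1 if no hole is large enough. */
static int find_hole(const Mapping *sorted, size_t n,
                     uint64_t iova_begin, uint64_t iova_end,
                     uint64_t len, uint64_t *result)
{
    uint64_t hole_start = iova_begin;

    for (size_t i = 0; i < n; i++) {
        /* Gap between the end of the previous mapping and this one */
        if (sorted[i].iova - hole_start >= len) {
            *result = hole_start;
            return 0;
        }
        hole_start = sorted[i].iova + sorted[i].len;
    }

    /* Trailing gap up to iova_end */
    if (iova_end - hole_start >= len) {
        *result = hole_start;
        return 0;
    }
    return -1;
}

The point in the thread is only about how that walk is driven: a plain list
walk is O(N), while locating the predecessor of a new node with
g_tree_upper_bound() (available from GLib 2.68) keeps insertion at O(log N).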


>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ
  2022-03-01  8:50     ` Eugenio Perez Martin
@ 2022-03-03  7:33         ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-03  7:33 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/3/1 4:50 PM, Eugenio Perez Martin wrote:
> On Mon, Feb 28, 2022 at 8:37 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
>>> Use translations added in VhostIOVATree in SVQ.
>>>
>>> Only introduce usage here, not allocation and deallocation. As with
>>> previous patches, we use the dead code paths of shadow_vqs_enabled to
>>> avoid commiting too many changes at once. These are impossible to take
>>> at the moment.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    hw/virtio/vhost-shadow-virtqueue.h |   6 +-
>>>    include/hw/virtio/vhost-vdpa.h     |   3 +
>>>    hw/virtio/vhost-shadow-virtqueue.c |  76 ++++++++++++++++-
>>>    hw/virtio/vhost-vdpa.c             | 128 ++++++++++++++++++++++++-----
>>>    4 files changed, 187 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>>> index 04c67685fd..b2f722d101 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>>> @@ -13,6 +13,7 @@
>>>    #include "qemu/event_notifier.h"
>>>    #include "hw/virtio/virtio.h"
>>>    #include "standard-headers/linux/vhost_types.h"
>>> +#include "hw/virtio/vhost-iova-tree.h"
>>>
>>>    /* Shadow virtqueue to relay notifications */
>>>    typedef struct VhostShadowVirtqueue {
>>> @@ -43,6 +44,9 @@ typedef struct VhostShadowVirtqueue {
>>>        /* Virtio device */
>>>        VirtIODevice *vdev;
>>>
>>> +    /* IOVA mapping */
>>> +    VhostIOVATree *iova_tree;
>>> +
>>>        /* Map for use the guest's descriptors */
>>>        VirtQueueElement **ring_id_maps;
>>>
>>> @@ -78,7 +82,7 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
>>>                         VirtQueue *vq);
>>>    void vhost_svq_stop(VhostShadowVirtqueue *svq);
>>>
>>> -VhostShadowVirtqueue *vhost_svq_new(void);
>>> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree);
>>>
>>>    void vhost_svq_free(gpointer vq);
>>>    G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
>>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
>>> index 009a9f3b6b..ee8e939ad0 100644
>>> --- a/include/hw/virtio/vhost-vdpa.h
>>> +++ b/include/hw/virtio/vhost-vdpa.h
>>> @@ -14,6 +14,7 @@
>>>
>>>    #include <gmodule.h>
>>>
>>> +#include "hw/virtio/vhost-iova-tree.h"
>>>    #include "hw/virtio/virtio.h"
>>>    #include "standard-headers/linux/vhost_types.h"
>>>
>>> @@ -30,6 +31,8 @@ typedef struct vhost_vdpa {
>>>        MemoryListener listener;
>>>        struct vhost_vdpa_iova_range iova_range;
>>>        bool shadow_vqs_enabled;
>>> +    /* IOVA mapping used by the Shadow Virtqueue */
>>> +    VhostIOVATree *iova_tree;
>>>        GPtrArray *shadow_vqs;
>>>        struct vhost_dev *dev;
>>>        VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index a38d313755..7e073773d1 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -11,6 +11,7 @@
>>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>
>>>    #include "qemu/error-report.h"
>>> +#include "qemu/log.h"
>>>    #include "qemu/main-loop.h"
>>>    #include "qemu/log.h"
>>>    #include "linux-headers/linux/vhost.h"
>>> @@ -84,7 +85,58 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
>>>        }
>>>    }
>>>
>>> +/**
>>> + * Translate addresses between the qemu's virtual address and the SVQ IOVA
>>> + *
>>> + * @svq    Shadow VirtQueue
>>> + * @vaddr  Translated IOVA addresses
>>> + * @iovec  Source qemu's VA addresses
>>> + * @num    Length of iovec and minimum length of vaddr
>>> + */
>>> +static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
>>> +                                     void **addrs, const struct iovec *iovec,
>>> +                                     size_t num)
>>> +{
>>> +    if (num == 0) {
>>> +        return true;
>>> +    }
>>> +
>>> +    for (size_t i = 0; i < num; ++i) {
>>> +        DMAMap needle = {
>>> +            .translated_addr = (hwaddr)iovec[i].iov_base,
>>> +            .size = iovec[i].iov_len,
>>> +        };
>>> +        size_t off;
>>> +
>>> +        const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree, &needle);
>>> +        /*
>>> +         * Map cannot be NULL since iova map contains all guest space and
>>> +         * qemu already has a physical address mapped
>>> +         */
>>> +        if (unlikely(!map)) {
>>> +            qemu_log_mask(LOG_GUEST_ERROR,
>>> +                          "Invalid address 0x%"HWADDR_PRIx" given by guest",
>>> +                          needle.translated_addr);
>>> +            return false;
>>> +        }
>>> +
>>> +        off = needle.translated_addr - map->translated_addr;
>>> +        addrs[i] = (void *)(map->iova + off);
>>> +
>>> +        if (unlikely(int128_gt(int128_add(needle.translated_addr,
>>> +                                          iovec[i].iov_len),
>>> +                               map->translated_addr + map->size))) {
>>> +            qemu_log_mask(LOG_GUEST_ERROR,
>>> +                          "Guest buffer expands over iova range");
>>> +            return false;
>>> +        }
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>> +
>>>    static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>> +                                    void * const *vaddr_sg,
>>
>> Nit: it looks to me we are not passing vaddr but iova here, so it might
>> be better to use "sg"?
>>
> Sure, this is a leftover from before the IOVA translations. I agree it is
> better to write it the way you suggest.
>
>>>                                        const struct iovec *iovec,
>>>                                        size_t num, bool more_descs, bool write)
>>>    {
>>> @@ -103,7 +155,7 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>>            } else {
>>>                descs[i].flags = flags;
>>>            }
>>> -        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
>>> +        descs[i].addr = cpu_to_le64((hwaddr)vaddr_sg[n]);
>>>            descs[i].len = cpu_to_le32(iovec[n].iov_len);
>>>
>>>            last = i;
>>> @@ -119,6 +171,8 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
>>>    {
>>>        unsigned avail_idx;
>>>        vring_avail_t *avail = svq->vring.avail;
>>> +    bool ok;
>>> +    g_autofree void **sgs = g_new(void *, MAX(elem->out_num, elem->in_num));
>>>
>>>        *head = svq->free_head;
>>>
>>> @@ -129,9 +183,20 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
>>>            return false;
>>>        }
>>>
>>> -    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>>> +    ok = vhost_svq_translate_addr(svq, sgs, elem->out_sg, elem->out_num);
>>> +    if (unlikely(!ok)) {
>>> +        return false;
>>> +    }
>>> +    vhost_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
>>>                                elem->in_num > 0, false);
>>> -    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>>> +
>>> +
>>> +    ok = vhost_svq_translate_addr(svq, sgs, elem->in_sg, elem->in_num);
>>> +    if (unlikely(!ok)) {
>>> +        return false;
>>> +    }
>>> +
>>> +    vhost_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, false, true);
>>>
>>>        /*
>>>         * Put the entry in the available array (but don't update avail->idx until
>>> @@ -514,11 +579,13 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
>>>     * Creates vhost shadow virtqueue, and instructs the vhost device to use the
>>>     * shadow methods and file descriptors.
>>>     *
>>> + * @iova_tree Tree to perform descriptors translations
>>> + *
>>>     * Returns the new virtqueue or NULL.
>>>     *
>>>     * In case of error, reason is reported through error_report.
>>>     */
>>> -VhostShadowVirtqueue *vhost_svq_new(void)
>>> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree)
>>>    {
>>>        g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
>>>        int r;
>>> @@ -539,6 +606,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
>>>
>>>        event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
>>>        event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
>>> +    svq->iova_tree = iova_tree;
>>>        return g_steal_pointer(&svq);
>>>
>>>    err_init_hdev_call:
>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>> index 435b9c2e9e..56f9f125cd 100644
>>> --- a/hw/virtio/vhost-vdpa.c
>>> +++ b/hw/virtio/vhost-vdpa.c
>>> @@ -209,6 +209,21 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>                                             vaddr, section->readonly);
>>>
>>>        llsize = int128_sub(llend, int128_make64(iova));
>>> +    if (v->shadow_vqs_enabled) {
>>> +        DMAMap mem_region = {
>>> +            .translated_addr = (hwaddr)vaddr,
>>> +            .size = int128_get64(llsize) - 1,
>>> +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>>> +        };
>>> +
>>> +        int r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
>>> +        if (unlikely(r != IOVA_OK)) {
>>> +            error_report("Can't allocate a mapping (%d)", r);
>>> +            goto fail;
>>> +        }
>>> +
>>> +        iova = mem_region.iova;
>>> +    }
>>>
>>>        vhost_vdpa_iotlb_batch_begin_once(v);
>>>        ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
>>> @@ -261,6 +276,20 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>>>
>>>        llsize = int128_sub(llend, int128_make64(iova));
>>>
>>> +    if (v->shadow_vqs_enabled) {
>>> +        const DMAMap *result;
>>> +        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
>>> +            section->offset_within_region +
>>> +            (iova - section->offset_within_address_space);
>>> +        DMAMap mem_region = {
>>> +            .translated_addr = (hwaddr)vaddr,
>>> +            .size = int128_get64(llsize) - 1,
>>> +        };
>>> +
>>> +        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
>>> +        iova = result->iova;
>>> +        vhost_iova_tree_remove(v->iova_tree, &mem_region);
>>> +    }
>>>        vhost_vdpa_iotlb_batch_begin_once(v);
>>>        ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
>>>        if (ret) {
>>> @@ -383,7 +412,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
>>>
>>>        shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
>>>        for (unsigned n = 0; n < hdev->nvqs; ++n) {
>>> -        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
>>> +        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree);
>>>
>>>            if (unlikely(!svq)) {
>>>                error_setg(errp, "Cannot create svq %u", n);
>>> @@ -834,37 +863,78 @@ static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
>>>    /**
>>>     * Unmap a SVQ area in the device
>>>     */
>>> -static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
>>> -                                      hwaddr size)
>>> +static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
>>> +                                      const DMAMap *needle)
>>>    {
>>> +    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, needle);
>>> +    hwaddr size;
>>>        int r;
>>>
>>> -    size = ROUND_UP(size, qemu_real_host_page_size);
>>> -    r = vhost_vdpa_dma_unmap(v, iova, size);
>>> +    if (unlikely(!result)) {
>>> +        error_report("Unable to find SVQ address to unmap");
>>> +        return false;
>>> +    }
>>> +
>>> +    size = ROUND_UP(result->size, qemu_real_host_page_size);
>>> +    r = vhost_vdpa_dma_unmap(v, result->iova, size);
>>>        return r == 0;
>>>    }
>>>
>>>    static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
>>>                                           const VhostShadowVirtqueue *svq)
>>>    {
>>> +    DMAMap needle;
>>>        struct vhost_vdpa *v = dev->opaque;
>>>        struct vhost_vring_addr svq_addr;
>>> -    size_t device_size = vhost_svq_device_area_size(svq);
>>> -    size_t driver_size = vhost_svq_driver_area_size(svq);
>>>        bool ok;
>>>
>>>        vhost_svq_get_vring_addr(svq, &svq_addr);
>>>
>>> -    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
>>> +    needle = (DMAMap) {
>>> +        .translated_addr = svq_addr.desc_user_addr,
>>> +    };
>>
>> Let's simply initialize the member to zero during start of this function
>> then we can use needle->transalted_addr = XXX here.
>>
> Sure
>
>>> +    ok = vhost_vdpa_svq_unmap_ring(v, &needle);
>>>        if (unlikely(!ok)) {
>>>            return false;
>>>        }
>>>
>>> -    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
>>> +    needle = (DMAMap) {
>>> +        .translated_addr = svq_addr.used_user_addr,
>>> +    };
>>> +    return vhost_vdpa_svq_unmap_ring(v, &needle);
>>> +}
>>> +
>>> +/**
>>> + * Map the SVQ area in the device
>>> + *
>>> + * @v          Vhost-vdpa device
>>> + * @needle     The area to search iova
>>> + * @errorp     Error pointer
>>> + */
>>> +static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
>>> +                                    Error **errp)
>>> +{
>>> +    int r;
>>> +
>>> +    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
>>> +    if (unlikely(r != IOVA_OK)) {
>>> +        error_setg(errp, "Cannot allocate iova (%d)", r);
>>> +        return false;
>>> +    }
>>> +
>>> +    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
>>> +                           (void *)needle->translated_addr,
>>> +                           !(needle->perm & IOMMU_ACCESS_FLAG(0, 1)));
>>
>> Let's simply use needle->perm == IOMMU_RO here?
>>
> The motivation for doing it this way is to be more resilient to future
> changes, for example if a new flag is added.
>
> But I'm totally ok with comparing against IOMMU_RO; I see that scenario as
> unlikely at the moment.
>
>>> +    if (unlikely(r != 0)) {
>>> +        error_setg_errno(errp, -r, "Cannot map region to device");
>>> +        vhost_iova_tree_remove(v->iova_tree, needle);
>>> +    }
>>> +
>>> +    return r == 0;
>>>    }
>>>
>>>    /**
>>> - * Map shadow virtqueue rings in device
>>> + * Map the shadow virtqueue rings in the device
>>>     *
>>>     * @dev   The vhost device
>>>     * @svq   The shadow virtqueue
>>> @@ -876,28 +946,44 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>>>                                         struct vhost_vring_addr *addr,
>>>                                         Error **errp)
>>>    {
>>> +    DMAMap device_region, driver_region;
>>> +    struct vhost_vring_addr svq_addr;
>>>        struct vhost_vdpa *v = dev->opaque;
>>>        size_t device_size = vhost_svq_device_area_size(svq);
>>>        size_t driver_size = vhost_svq_driver_area_size(svq);
>>> -    int r;
>>> +    size_t avail_offset;
>>> +    bool ok;
>>>
>>>        ERRP_GUARD();
>>> -    vhost_svq_get_vring_addr(svq, addr);
>>> +    vhost_svq_get_vring_addr(svq, &svq_addr);
>>>
>>> -    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
>>> -                           (void *)addr->desc_user_addr, true);
>>> -    if (unlikely(r != 0)) {
>>> -        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
>>> +    driver_region = (DMAMap) {
>>> +        .translated_addr = svq_addr.desc_user_addr,
>>> +        .size = driver_size - 1,
>>
>> Any reason for the "-1" here? I see several places do things like that,
>> it's probably hint of wrong API somehwere.
>>
> The "problem" is the api mismatch between _end and _last, to include
> the last member in the size or not.
>
> IOVA tree needs to use _end so we can allocate the last page in case
> of available range ending in (uint64_t)-1 [1]. But If we change
> vhost_svq_{device,driver}_area_size to make it inclusive,


These functions look sane since they don't return a range. It's up to
the caller to decide how to use the size.


>   we need to
> use "+1" in calls like qemu_memalign and memset at vhost_svq_start.
> Probably in more places too


I'm not sure I get it. Maybe you can show which code would suffer if we
don't decrease it by one here.

But the current code may end up passing qemu_real_host_page_size - 1 to
vhost-vDPA, which seems wrong?

E.g. vhost_svq_device_area_size() returns qemu_real_host_page_size, but it
is decreased by 1 here for size, and then we pass that size to vhost_vdpa_dma_map().

Thanks
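
To make the convention mismatch above concrete, here is a hypothetical
helper (illustration only, not code from the series) that keeps the DMAMap
inclusive size and restores the exclusive length when calling into
vhost-vdpa:

static int map_region_len(struct vhost_vdpa *v, hwaddr iova, hwaddr len,
                          void *vaddr, bool readonly)
{
    DMAMap map = {
        .iova = iova,
        .translated_addr = (hwaddr)vaddr,
        .size = len - 1,    /* inclusive: offset of the last byte */
        .perm = readonly ? IOMMU_RO : IOMMU_RW,
    };

    /* ... insert `map` into the IOVA tree here ... */

    /* The device side expects the exclusive byte length, i.e. size + 1 */
    return vhost_vdpa_dma_map(v, map.iova, map.size + 1, vaddr, readonly);
}

With vhost_svq_device_area_size() returning a plain length, passing the
inclusive size without the "+ 1" would map one byte less than intended,
which seems to be the concern raised above.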


>
> QEMU's emulated Intel iommu code solves it using the address mask as
> the size, something that does not fit 100% with vhost devices since
> they can allocate an arbitrary address of arbitrary size when using
> vIOMMU. It's not a problem for vhost-vdpa at this moment since we make
> sure we expose unaligned and whole pages with vrings, but I feel it
> would only be to move the problem somewhere else.
>
> Thanks!
>
> [1] There are alternatives: to use Int128_t, etc. But I think it's
> better not to change that in this patch series.
>
>> Thanks
>>
>>
>>> +        .perm = IOMMU_RO,
>>> +    };
>>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
>>> +    if (unlikely(!ok)) {
>>> +        error_prepend(errp, "Cannot create vq driver region: ");
>>>            return false;
>>>        }
>>> +    addr->desc_user_addr = driver_region.iova;
>>> +    avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
>>> +    addr->avail_user_addr = driver_region.iova + avail_offset;
>>>
>>> -    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
>>> -                           (void *)addr->used_user_addr, false);
>>> -    if (unlikely(r != 0)) {
>>> -        error_setg_errno(errp, -r, "Cannot create vq device region: ");
>>> +    device_region = (DMAMap) {
>>> +        .translated_addr = svq_addr.used_user_addr,
>>> +        .size = device_size - 1,
>>> +        .perm = IOMMU_RW,
>>> +    };
>>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
>>> +    if (unlikely(!ok)) {
>>> +        error_prepend(errp, "Cannot create vq device region: ");
>>> +        vhost_vdpa_svq_unmap_ring(v, &driver_region);
>>>        }
>>> +    addr->used_user_addr = device_region.iova;
>>>
>>> -    return r == 0;
>>> +    return ok;
>>>    }
>>>
>>>    static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/14] vhost: Shadow virtqueue buffers forwarding
  2022-03-02 18:23     ` Eugenio Perez Martin
@ 2022-03-03  7:35         ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-03  7:35 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/3/3 2:23 AM, Eugenio Perez Martin wrote:
>>> +
>>> +static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
>>> +                                VirtQueueElement *elem,
>>> +                                unsigned *head)
>>> +{
>>> +    unsigned avail_idx;
>>> +    vring_avail_t *avail = svq->vring.avail;
>>> +
>>> +    *head = svq->free_head;
>>> +
>>> +    /* We need some descriptors here */
>>> +    if (unlikely(!elem->out_num && !elem->in_num)) {
>>> +        qemu_log_mask(LOG_GUEST_ERROR,
>>> +            "Guest provided element with no descriptors");
>>> +        return false;
>>> +    }
>>> +
>>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>>> +                            elem->in_num > 0, false);
>>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>> I wonder instead of passing in/out separately and using the hint like
>> more_descs, is it better to simply pass the elem to
>> vhost_vrign_write_descs() then we know which one is the last that
>> doesn't depend on more_descs.
>>
> I'm not sure I follow this.
>
> The purpose of vhost_vring_write_descs is to abstract the writing of a
> batch of descriptors, its chaining, etc. It accepts the write
> parameter just for the write flag. If we make elem a parameter, we
> would need to duplicate that for loop for the read and the write
> descriptors, wouldn't we?
>
> Duplicating the for loop is the way it is done in the kernel, but I
> actually think the kernel could benefit from abstracting both in the
> same function too. Please let me know if you think otherwise or I've
> missed your point.


Ok, so it's just a suggestion and we can do optimization on top for sure.

Thanks
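
A rough sketch of the alternative being suggested (passing the whole
element so the function itself knows which descriptor is the last one,
with no more_descs hint); purely illustrative, not what the series
implements:

static void vhost_vring_write_elem_descs(VhostShadowVirtqueue *svq,
                                         const VirtQueueElement *elem)
{
    vring_desc_t *descs = svq->vring.desc;
    uint16_t i = svq->free_head, last = svq->free_head;
    size_t total = elem->out_num + elem->in_num;

    /* The caller has already rejected elements with no descriptors */
    for (size_t n = 0; n < total; n++) {
        bool is_in = n >= elem->out_num;
        const struct iovec *sg = is_in ? &elem->in_sg[n - elem->out_num]
                                       : &elem->out_sg[n];
        uint16_t flags = is_in ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;

        if (n + 1 < total) {
            flags |= cpu_to_le16(VRING_DESC_F_NEXT);
        }
        descs[i].flags = flags;
        descs[i].addr = cpu_to_le64((hwaddr)sg->iov_base);
        descs[i].len = cpu_to_le32(sg->iov_len);

        last = i;
        i = le16_to_cpu(descs[i].next);
    }

    svq->free_head = le16_to_cpu(descs[last].next);
}

Whether one combined loop or the two existing calls reads better is exactly
the trade-off discussed above; the descriptors written to the ring are the
same either way.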


>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities
  2022-03-03  7:12         ` Jason Wang
  (?)
@ 2022-03-03  9:24         ` Eugenio Perez Martin
  2022-03-04  1:39             ` Jason Wang
  -1 siblings, 1 reply; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-03  9:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Thu, Mar 3, 2022 at 8:12 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2022/3/2 02:49, Eugenio Perez Martin wrote:
> > On Mon, Feb 28, 2022 at 3:57 AM Jason Wang<jasowang@redhat.com>  wrote:
> >> On 2022/2/27 21:40, Eugenio Pérez wrote:
> >>> At this mode no buffer forwarding will be performed in SVQ mode: Qemu
> >>> will just forward the guest's kicks to the device.
> >>>
> >>> Host memory notifiers regions are left out for simplicity, and they will
> >>> not be addressed in this series.
> >>>
> >>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
> >>> ---
> >>>    hw/virtio/vhost-shadow-virtqueue.h |  14 +++
> >>>    include/hw/virtio/vhost-vdpa.h     |   4 +
> >>>    hw/virtio/vhost-shadow-virtqueue.c |  52 +++++++++++
> >>>    hw/virtio/vhost-vdpa.c             | 145 ++++++++++++++++++++++++++++-
> >>>    4 files changed, 213 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> >>> index f1519e3c7b..1cbc87d5d8 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.h
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> >>> @@ -18,8 +18,22 @@ typedef struct VhostShadowVirtqueue {
> >>>        EventNotifier hdev_kick;
> >>>        /* Shadow call notifier, sent to vhost */
> >>>        EventNotifier hdev_call;
> >>> +
> >>> +    /*
> >>> +     * Borrowed virtqueue's guest to host notifier. To borrow it in this event
> >>> +     * notifier allows to recover the VhostShadowVirtqueue from the event loop
> >>> +     * easily. If we use the VirtQueue's one, we don't have an easy way to
> >>> +     * retrieve VhostShadowVirtqueue.
> >>> +     *
> >>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> >>> +     */
> >>> +    EventNotifier svq_kick;
> >>>    } VhostShadowVirtqueue;
> >>>
> >>> +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
> >>> +
> >>> +void vhost_svq_stop(VhostShadowVirtqueue *svq);
> >>> +
> >>>    VhostShadowVirtqueue *vhost_svq_new(void);
> >>>
> >>>    void vhost_svq_free(gpointer vq);
> >>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> >>> index 3ce79a646d..009a9f3b6b 100644
> >>> --- a/include/hw/virtio/vhost-vdpa.h
> >>> +++ b/include/hw/virtio/vhost-vdpa.h
> >>> @@ -12,6 +12,8 @@
> >>>    #ifndef HW_VIRTIO_VHOST_VDPA_H
> >>>    #define HW_VIRTIO_VHOST_VDPA_H
> >>>
> >>> +#include <gmodule.h>
> >>> +
> >>>    #include "hw/virtio/virtio.h"
> >>>    #include "standard-headers/linux/vhost_types.h"
> >>>
> >>> @@ -27,6 +29,8 @@ typedef struct vhost_vdpa {
> >>>        bool iotlb_batch_begin_sent;
> >>>        MemoryListener listener;
> >>>        struct vhost_vdpa_iova_range iova_range;
> >>> +    bool shadow_vqs_enabled;
> >>> +    GPtrArray *shadow_vqs;
> >>>        struct vhost_dev *dev;
> >>>        VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> >>>    } VhostVDPA;
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index 019cf1950f..a5d0659f86 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -11,6 +11,56 @@
> >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>
> >>>    #include "qemu/error-report.h"
> >>> +#include "qemu/main-loop.h"
> >>> +#include "linux-headers/linux/vhost.h"
> >>> +
> >>> +/** Forward guest notifications */
> >>> +static void vhost_handle_guest_kick(EventNotifier *n)
> >>> +{
> >>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>> +                                             svq_kick);
> >>> +    event_notifier_test_and_clear(n);
> >>> +    event_notifier_set(&svq->hdev_kick);
> >>> +}
> >>> +
> >>> +/**
> >>> + * Set a new file descriptor for the guest to kick the SVQ and notify for avail
> >>> + *
> >>> + * @svq          The svq
> >>> + * @svq_kick_fd  The svq kick fd
> >>> + *
> >>> + * Note that the SVQ will never close the old file descriptor.
> >>> + */
> >>> +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
> >>> +{
> >>> +    EventNotifier *svq_kick = &svq->svq_kick;
> >>> +    bool poll_stop = VHOST_FILE_UNBIND != event_notifier_get_fd(svq_kick);
> >> I wonder if this is robust. E.g is there any chance that may end up with
> >> both poll_stop and poll_start are false?
> >>
> > I cannot make that happen in qemu, but the function supports that case
> > well: It will do nothing. It's more or less the same code as used in
> > the vhost kernel, and is the expected behaviour if you send two
> > VHOST_FILE_UNBIND one right after another to me.
>
>
> I would think it would just stop twice.
>
>
> >
> >> If not, can we simple detect poll_stop as below and treat !poll_start
> >> and poll_stop?
> >>
> > I'm not sure what it adds. Is there an unexpected consequence with
> > the current do-nothing behavior I've missed?
>
>
> I'm not sure, but it feels odd if poll_start is not the reverse value of
> poll_stop.
>

If we don't want to restrict the inputs, we need to handle four situations:

a) old_fd = -1, new_fd = -1

This is the situation you described, and it's basically a no-op.
poll_stop == poll_start == false.

If we make poll_stop = true and poll_start = false, we call
event_notifier_set_handler(-1, ...). Hopefully it will return just an
error.

If we make poll_stop = false and poll_start = true, we are calling
event_notifier_set(-1) and event_notifier_set_handler(-1,
poll_callback). Same situation, hopefully an error, but unexpected.

b) old_fd = -1, new_fd > -1

We need to start polling the new_fd. No need to stop polling the
old_fd, since we are not actually polling it.

c) old_fd > -1, new_fd > -1

We need to stop polling the old_fd and start polling the new one.

If we make poll_stop = true and poll_start = false, we don't register a
new polling function for the new kick_fd, so we will miss the guest's
kicks.

If we make poll_stop = false and poll_start = true, we keep polling the
old file descriptor too, so whatever it gets assigned to next could call
vhost_handle_guest_kick if it does not override the poll callback.

We *could* detect whether old_fd == new_fd and skip all the work, but I
don't think it is worth complicating the code, since we're only called
with the kick_fd at device start.

d) old_fd > -1, new_fd = -1

We need to stop polling, or we could get invalid kick callbacks if the
old fd gets written to after this. No need to poll anything beyond this
point.
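
To make the four cases concrete, here is a minimal sketch of how the two
flags fall out of the old and new file descriptors. It is a reconstruction
of the flow described above, not necessarily the literal patch code:

void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
{
    EventNotifier *svq_kick = &svq->svq_kick;
    /* poll_stop: we were polling a real fd before (cases c and d) */
    bool poll_stop = VHOST_FILE_UNBIND != event_notifier_get_fd(svq_kick);
    /* poll_start: the new fd is a real one (cases b and c) */
    bool poll_start = svq_kick_fd != VHOST_FILE_UNBIND;

    if (poll_stop) {
        /* Stop polling the old fd before it gets reused */
        event_notifier_set_handler(svq_kick, NULL);
    }

    /* The SVQ only borrows the fd, so the old one is never closed here */
    event_notifier_init_fd(svq_kick, svq_kick_fd);

    if (poll_start) {
        /* Catch a kick that may have arrived while switching */
        event_notifier_set(svq_kick);
        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick);
    }
    /* Case a): both flags false, nothing to do */
}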

> Thanks
>
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ
  2022-03-03  7:33         ` Jason Wang
  (?)
@ 2022-03-03 11:35         ` Eugenio Perez Martin
  2022-03-07  4:24             ` Jason Wang
  -1 siblings, 1 reply; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-03 11:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Thu, Mar 3, 2022 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2022/3/1 16:50, Eugenio Perez Martin wrote:
> > On Mon, Feb 28, 2022 at 8:37 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2022/2/27 21:41, Eugenio Pérez wrote:
> >>> Use translations added in VhostIOVATree in SVQ.
> >>>
> >>> Only introduce usage here, not allocation and deallocation. As with
> >>> previous patches, we use the dead code paths of shadow_vqs_enabled to
> >>> avoid committing too many changes at once. These are impossible to take
> >>> at the moment.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    hw/virtio/vhost-shadow-virtqueue.h |   6 +-
> >>>    include/hw/virtio/vhost-vdpa.h     |   3 +
> >>>    hw/virtio/vhost-shadow-virtqueue.c |  76 ++++++++++++++++-
> >>>    hw/virtio/vhost-vdpa.c             | 128 ++++++++++++++++++++++++-----
> >>>    4 files changed, 187 insertions(+), 26 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> >>> index 04c67685fd..b2f722d101 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.h
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> >>> @@ -13,6 +13,7 @@
> >>>    #include "qemu/event_notifier.h"
> >>>    #include "hw/virtio/virtio.h"
> >>>    #include "standard-headers/linux/vhost_types.h"
> >>> +#include "hw/virtio/vhost-iova-tree.h"
> >>>
> >>>    /* Shadow virtqueue to relay notifications */
> >>>    typedef struct VhostShadowVirtqueue {
> >>> @@ -43,6 +44,9 @@ typedef struct VhostShadowVirtqueue {
> >>>        /* Virtio device */
> >>>        VirtIODevice *vdev;
> >>>
> >>> +    /* IOVA mapping */
> >>> +    VhostIOVATree *iova_tree;
> >>> +
> >>>        /* Map for use the guest's descriptors */
> >>>        VirtQueueElement **ring_id_maps;
> >>>
> >>> @@ -78,7 +82,7 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> >>>                         VirtQueue *vq);
> >>>    void vhost_svq_stop(VhostShadowVirtqueue *svq);
> >>>
> >>> -VhostShadowVirtqueue *vhost_svq_new(void);
> >>> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree);
> >>>
> >>>    void vhost_svq_free(gpointer vq);
> >>>    G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
> >>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> >>> index 009a9f3b6b..ee8e939ad0 100644
> >>> --- a/include/hw/virtio/vhost-vdpa.h
> >>> +++ b/include/hw/virtio/vhost-vdpa.h
> >>> @@ -14,6 +14,7 @@
> >>>
> >>>    #include <gmodule.h>
> >>>
> >>> +#include "hw/virtio/vhost-iova-tree.h"
> >>>    #include "hw/virtio/virtio.h"
> >>>    #include "standard-headers/linux/vhost_types.h"
> >>>
> >>> @@ -30,6 +31,8 @@ typedef struct vhost_vdpa {
> >>>        MemoryListener listener;
> >>>        struct vhost_vdpa_iova_range iova_range;
> >>>        bool shadow_vqs_enabled;
> >>> +    /* IOVA mapping used by the Shadow Virtqueue */
> >>> +    VhostIOVATree *iova_tree;
> >>>        GPtrArray *shadow_vqs;
> >>>        struct vhost_dev *dev;
> >>>        VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index a38d313755..7e073773d1 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -11,6 +11,7 @@
> >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>
> >>>    #include "qemu/error-report.h"
> >>> +#include "qemu/log.h"
> >>>    #include "qemu/main-loop.h"
> >>>    #include "qemu/log.h"
> >>>    #include "linux-headers/linux/vhost.h"
> >>> @@ -84,7 +85,58 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> >>>        }
> >>>    }
> >>>
> >>> +/**
> >>> + * Translate addresses between the qemu's virtual address and the SVQ IOVA
> >>> + *
> >>> + * @svq    Shadow VirtQueue
> >>> + * @vaddr  Translated IOVA addresses
> >>> + * @iovec  Source qemu's VA addresses
> >>> + * @num    Length of iovec and minimum length of vaddr
> >>> + */
> >>> +static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> >>> +                                     void **addrs, const struct iovec *iovec,
> >>> +                                     size_t num)
> >>> +{
> >>> +    if (num == 0) {
> >>> +        return true;
> >>> +    }
> >>> +
> >>> +    for (size_t i = 0; i < num; ++i) {
> >>> +        DMAMap needle = {
> >>> +            .translated_addr = (hwaddr)iovec[i].iov_base,
> >>> +            .size = iovec[i].iov_len,
> >>> +        };
> >>> +        size_t off;
> >>> +
> >>> +        const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree, &needle);
> >>> +        /*
> >>> +         * Map cannot be NULL since iova map contains all guest space and
> >>> +         * qemu already has a physical address mapped
> >>> +         */
> >>> +        if (unlikely(!map)) {
> >>> +            qemu_log_mask(LOG_GUEST_ERROR,
> >>> +                          "Invalid address 0x%"HWADDR_PRIx" given by guest",
> >>> +                          needle.translated_addr);
> >>> +            return false;
> >>> +        }
> >>> +
> >>> +        off = needle.translated_addr - map->translated_addr;
> >>> +        addrs[i] = (void *)(map->iova + off);
> >>> +
> >>> +        if (unlikely(int128_gt(int128_add(needle.translated_addr,
> >>> +                                          iovec[i].iov_len),
> >>> +                               map->translated_addr + map->size))) {
> >>> +            qemu_log_mask(LOG_GUEST_ERROR,
> >>> +                          "Guest buffer expands over iova range");
> >>> +            return false;
> >>> +        }
> >>> +    }
> >>> +
> >>> +    return true;
> >>> +}
> >>> +
> >>>    static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >>> +                                    void * const *vaddr_sg,
> >>
> >> Nit: it looks to me we are not passing vaddr but iova here, so it might
> >> be better to use "sg"?
> >>
> > Sure, this is a leftover of pre-IOVA translations. I see better to
> > write as you say.
> >
> >>>                                        const struct iovec *iovec,
> >>>                                        size_t num, bool more_descs, bool write)
> >>>    {
> >>> @@ -103,7 +155,7 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >>>            } else {
> >>>                descs[i].flags = flags;
> >>>            }
> >>> -        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> >>> +        descs[i].addr = cpu_to_le64((hwaddr)vaddr_sg[n]);
> >>>            descs[i].len = cpu_to_le32(iovec[n].iov_len);
> >>>
> >>>            last = i;
> >>> @@ -119,6 +171,8 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >>>    {
> >>>        unsigned avail_idx;
> >>>        vring_avail_t *avail = svq->vring.avail;
> >>> +    bool ok;
> >>> +    g_autofree void **sgs = g_new(void *, MAX(elem->out_num, elem->in_num));
> >>>
> >>>        *head = svq->free_head;
> >>>
> >>> @@ -129,9 +183,20 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >>>            return false;
> >>>        }
> >>>
> >>> -    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> >>> +    ok = vhost_svq_translate_addr(svq, sgs, elem->out_sg, elem->out_num);
> >>> +    if (unlikely(!ok)) {
> >>> +        return false;
> >>> +    }
> >>> +    vhost_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
> >>>                                elem->in_num > 0, false);
> >>> -    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> >>> +
> >>> +
> >>> +    ok = vhost_svq_translate_addr(svq, sgs, elem->in_sg, elem->in_num);
> >>> +    if (unlikely(!ok)) {
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    vhost_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, false, true);
> >>>
> >>>        /*
> >>>         * Put the entry in the available array (but don't update avail->idx until
> >>> @@ -514,11 +579,13 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
> >>>     * Creates vhost shadow virtqueue, and instructs the vhost device to use the
> >>>     * shadow methods and file descriptors.
> >>>     *
> >>> + * @iova_tree Tree to perform descriptors translations
> >>> + *
> >>>     * Returns the new virtqueue or NULL.
> >>>     *
> >>>     * In case of error, reason is reported through error_report.
> >>>     */
> >>> -VhostShadowVirtqueue *vhost_svq_new(void)
> >>> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree)
> >>>    {
> >>>        g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> >>>        int r;
> >>> @@ -539,6 +606,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
> >>>
> >>>        event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
> >>>        event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
> >>> +    svq->iova_tree = iova_tree;
> >>>        return g_steal_pointer(&svq);
> >>>
> >>>    err_init_hdev_call:
> >>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>> index 435b9c2e9e..56f9f125cd 100644
> >>> --- a/hw/virtio/vhost-vdpa.c
> >>> +++ b/hw/virtio/vhost-vdpa.c
> >>> @@ -209,6 +209,21 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>>                                             vaddr, section->readonly);
> >>>
> >>>        llsize = int128_sub(llend, int128_make64(iova));
> >>> +    if (v->shadow_vqs_enabled) {
> >>> +        DMAMap mem_region = {
> >>> +            .translated_addr = (hwaddr)vaddr,
> >>> +            .size = int128_get64(llsize) - 1,
> >>> +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >>> +        };
> >>> +
> >>> +        int r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
> >>> +        if (unlikely(r != IOVA_OK)) {
> >>> +            error_report("Can't allocate a mapping (%d)", r);
> >>> +            goto fail;
> >>> +        }
> >>> +
> >>> +        iova = mem_region.iova;
> >>> +    }
> >>>
> >>>        vhost_vdpa_iotlb_batch_begin_once(v);
> >>>        ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> >>> @@ -261,6 +276,20 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
> >>>
> >>>        llsize = int128_sub(llend, int128_make64(iova));
> >>>
> >>> +    if (v->shadow_vqs_enabled) {
> >>> +        const DMAMap *result;
> >>> +        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
> >>> +            section->offset_within_region +
> >>> +            (iova - section->offset_within_address_space);
> >>> +        DMAMap mem_region = {
> >>> +            .translated_addr = (hwaddr)vaddr,
> >>> +            .size = int128_get64(llsize) - 1,
> >>> +        };
> >>> +
> >>> +        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
> >>> +        iova = result->iova;
> >>> +        vhost_iova_tree_remove(v->iova_tree, &mem_region);
> >>> +    }
> >>>        vhost_vdpa_iotlb_batch_begin_once(v);
> >>>        ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
> >>>        if (ret) {
> >>> @@ -383,7 +412,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
> >>>
> >>>        shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
> >>>        for (unsigned n = 0; n < hdev->nvqs; ++n) {
> >>> -        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
> >>> +        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree);
> >>>
> >>>            if (unlikely(!svq)) {
> >>>                error_setg(errp, "Cannot create svq %u", n);
> >>> @@ -834,37 +863,78 @@ static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
> >>>    /**
> >>>     * Unmap a SVQ area in the device
> >>>     */
> >>> -static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
> >>> -                                      hwaddr size)
> >>> +static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
> >>> +                                      const DMAMap *needle)
> >>>    {
> >>> +    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, needle);
> >>> +    hwaddr size;
> >>>        int r;
> >>>
> >>> -    size = ROUND_UP(size, qemu_real_host_page_size);
> >>> -    r = vhost_vdpa_dma_unmap(v, iova, size);
> >>> +    if (unlikely(!result)) {
> >>> +        error_report("Unable to find SVQ address to unmap");
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    size = ROUND_UP(result->size, qemu_real_host_page_size);
> >>> +    r = vhost_vdpa_dma_unmap(v, result->iova, size);
> >>>        return r == 0;
> >>>    }
> >>>
> >>>    static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> >>>                                           const VhostShadowVirtqueue *svq)
> >>>    {
> >>> +    DMAMap needle;
> >>>        struct vhost_vdpa *v = dev->opaque;
> >>>        struct vhost_vring_addr svq_addr;
> >>> -    size_t device_size = vhost_svq_device_area_size(svq);
> >>> -    size_t driver_size = vhost_svq_driver_area_size(svq);
> >>>        bool ok;
> >>>
> >>>        vhost_svq_get_vring_addr(svq, &svq_addr);
> >>>
> >>> -    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
> >>> +    needle = (DMAMap) {
> >>> +        .translated_addr = svq_addr.desc_user_addr,
> >>> +    };
> >>
> >> Let's simply initialize the member to zero during start of this function
> >> then we can use needle->transalted_addr = XXX here.
> >>
> > Sure
> >
> >>> +    ok = vhost_vdpa_svq_unmap_ring(v, &needle);
> >>>        if (unlikely(!ok)) {
> >>>            return false;
> >>>        }
> >>>
> >>> -    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
> >>> +    needle = (DMAMap) {
> >>> +        .translated_addr = svq_addr.used_user_addr,
> >>> +    };
> >>> +    return vhost_vdpa_svq_unmap_ring(v, &needle);
> >>> +}
> >>> +
> >>> +/**
> >>> + * Map the SVQ area in the device
> >>> + *
> >>> + * @v          Vhost-vdpa device
> >>> + * @needle     The area to search iova
> >>> + * @errorp     Error pointer
> >>> + */
> >>> +static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> >>> +                                    Error **errp)
> >>> +{
> >>> +    int r;
> >>> +
> >>> +    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
> >>> +    if (unlikely(r != IOVA_OK)) {
> >>> +        error_setg(errp, "Cannot allocate iova (%d)", r);
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
> >>> +                           (void *)needle->translated_addr,
> >>> +                           !(needle->perm & IOMMU_ACCESS_FLAG(0, 1)));
> >>
> >> Let's simply use needle->perm == IOMMU_RO here?
> >>
> > The motivation to use this way is to be more resilient to the future.
> > For example, if a new flag is added.
> >
> > But I'm totally ok with comparing with IOMMU_RO, I see that scenario
> > unlikely at the moment.
> >
> >>> +    if (unlikely(r != 0)) {
> >>> +        error_setg_errno(errp, -r, "Cannot map region to device");
> >>> +        vhost_iova_tree_remove(v->iova_tree, needle);
> >>> +    }
> >>> +
> >>> +    return r == 0;
> >>>    }
> >>>
> >>>    /**
> >>> - * Map shadow virtqueue rings in device
> >>> + * Map the shadow virtqueue rings in the device
> >>>     *
> >>>     * @dev   The vhost device
> >>>     * @svq   The shadow virtqueue
> >>> @@ -876,28 +946,44 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >>>                                         struct vhost_vring_addr *addr,
> >>>                                         Error **errp)
> >>>    {
> >>> +    DMAMap device_region, driver_region;
> >>> +    struct vhost_vring_addr svq_addr;
> >>>        struct vhost_vdpa *v = dev->opaque;
> >>>        size_t device_size = vhost_svq_device_area_size(svq);
> >>>        size_t driver_size = vhost_svq_driver_area_size(svq);
> >>> -    int r;
> >>> +    size_t avail_offset;
> >>> +    bool ok;
> >>>
> >>>        ERRP_GUARD();
> >>> -    vhost_svq_get_vring_addr(svq, addr);
> >>> +    vhost_svq_get_vring_addr(svq, &svq_addr);
> >>>
> >>> -    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
> >>> -                           (void *)addr->desc_user_addr, true);
> >>> -    if (unlikely(r != 0)) {
> >>> -        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
> >>> +    driver_region = (DMAMap) {
> >>> +        .translated_addr = svq_addr.desc_user_addr,
> >>> +        .size = driver_size - 1,
> >>
> >> Any reason for the "-1" here? I see several places do things like that,
> >> it's probably hint of wrong API somehwere.
> >>
> > The "problem" is the api mismatch between _end and _last, to include
> > the last member in the size or not.
> >
> > IOVA tree needs to use _end so we can allocate the last page in case
> > of available range ending in (uint64_t)-1 [1]. But If we change
> > vhost_svq_{device,driver}_area_size to make it inclusive,
>
>
> These functions look sane since they don't return a range. It's up to
> the caller to decide how to use the size.
>

Ok, I think I didn't get your comment the first time, so yes, there is a
bug here. But I'm not sure we are on the same page regarding the iova
tree.

Regarding the alignment, it's up to the caller how to use the size.
But if you introduce a mapping of (iova_1, translated_addr_1, size_1),
the iova address iova_1+size_1 belongs to that mapping. If you want to
introduce a new mapping (iova_2 = iova_1 + size_1, translated_addr_2,
size_2) it will be rejected, since it overlaps with the first one.
That part is not up to the caller.

At this moment, vhost_svq_driver_area_size and
vhost_svq_device_area_size return a size in the same terms as sizeof(x).
In other words, the size is not inclusive, which is what memset() or
VHOST_IOTLB_UPDATE expect, for example. We could move the -1 inside these
functions, but then we would need to adapt the qemu_memalign calls in
vhost_svq_start and the vhost_vdpa dma_map/unmap calls.
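
To make the two conventions concrete, here is a small sketch under the
current code's assumptions. The helper name is made up, and it only
illustrates where the -1/+1 belong; it is not the proposed fix itself:

/*
 * vhost_svq_device_area_size() returns an exclusive, sizeof()-style size.
 * DMAMap.size is inclusive (it addresses the last byte of the range), so
 * it is that size minus one, and it has to be turned back into a length
 * before it reaches VHOST_IOTLB_UPDATE through vhost_vdpa_dma_map().
 */
static bool example_map_device_area(struct vhost_vdpa *v,
                                    const VhostShadowVirtqueue *svq,
                                    hwaddr used_user_addr)
{
    size_t device_size = vhost_svq_device_area_size(svq); /* exclusive */
    DMAMap device_region = {
        .translated_addr = used_user_addr,
        .size = device_size - 1,              /* inclusive, for the tree */
        .perm = IOMMU_RW,
    };
    int r = vhost_iova_tree_map_alloc(v->iova_tree, &device_region);

    if (unlikely(r != IOVA_OK)) {
        return false;
    }

    /* The backend expects a length again, so add the 1 back */
    return vhost_vdpa_dma_map(v, device_region.iova, device_region.size + 1,
                              (void *)device_region.translated_addr,
                              false) == 0;
}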

>
> >   we need to
> > use "+1" in calls like qemu_memalign and memset at vhost_svq_start.
> > Probably in more places too
>
>
> I'm not sure I get it. Maybe you can show which code may suffer if we
> don't decrease it by one here.
>

Fewer than I expected, I have to say:

diff --git a/hw/virtio/vhost-shadow-virtqueue.c
b/hw/virtio/vhost-shadow-virtqueue.c
index 497237dcbb..b42ba5a3c0 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -479,7 +479,7 @@ size_t vhost_svq_device_area_size(const
VhostShadowVirtqueue *svq)
 {
     size_t used_size = offsetof(vring_used_t, ring) +
                                     sizeof(vring_used_elem_t) * svq->vring.num;
-    return ROUND_UP(used_size, qemu_real_host_page_size);
+    return ROUND_UP(used_size, qemu_real_host_page_size) - 1;
 }

 /**
@@ -532,8 +532,8 @@ void vhost_svq_start(VhostShadowVirtqueue *svq,
VirtIODevice *vdev,
     svq->vq = vq;

     svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
-    driver_size = vhost_svq_driver_area_size(svq);
-    device_size = vhost_svq_device_area_size(svq);
+    driver_size = vhost_svq_driver_area_size(svq) + 1;
+    device_size = vhost_svq_device_area_size(svq) + 1;
     svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
     desc_size = sizeof(vring_desc_t) * svq->vring.num;
     svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 5eefc5911a..2bf648de4a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -918,7 +918,7 @@ static bool vhost_vdpa_svq_map_ring(struct
vhost_vdpa *v, DMAMap *needle,
         return false;
     }

-    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
+    r = vhost_vdpa_dma_map(v, needle->iova, needle->size + 1,
                            (void *)needle->translated_addr,
                            needle->perm == IOMMU_RO);
     if (unlikely(r != 0)) {
@@ -955,7 +955,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,

     driver_region = (DMAMap) {
         .translated_addr = svq_addr.desc_user_addr,
-        .size = driver_size - 1,
+        .size = driver_size,
         .perm = IOMMU_RO,
     };
     ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
@@ -969,7 +969,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,

     device_region = (DMAMap) {
         .translated_addr = svq_addr.used_user_addr,
-        .size = device_size - 1,
+        .size = device_size,
         .perm = IOMMU_RW,
     };
     ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
---

> But the current code may end up passing qemu_real_host_page_size - 1 to
> vhost-vDPA, which seems wrong?
>
> E.g. vhost_svq_device_area_size() returns qemu_real_host_page_size, but it
> was decreased by 1 here for size, and then we pass that size to vhost_vdpa_dma_map().
>

That part needs fixing, but the right solution is not to skip the -1;
it is to add 1 back when passing the size to vhost_vdpa_dma_map. Doing
otherwise would bring problems with how the iova-tree works. It will be
included in the next series.

Thanks!

> Thanks
>
>
> >
> > QEMU's emulated Intel iommu code solves it using the address mask as
> > the size, something that does not fit 100% with vhost devices since
> > they can allocate an arbitrary address of arbitrary size when using
> > vIOMMU. It's not a problem for vhost-vdpa at this moment since we make
> > sure we expose unaligned and whole pages with vrings, but I feel it
> > would only be to move the problem somewhere else.
> >
> > Thanks!
> >
> > [1] There are alternatives: to use Int128_t, etc. But I think it's
> > better not to change that in this patch series.
> >
> >> Thanks
> >>
> >>
> >>> +        .perm = IOMMU_RO,
> >>> +    };
> >>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> >>> +    if (unlikely(!ok)) {
> >>> +        error_prepend(errp, "Cannot create vq driver region: ");
> >>>            return false;
> >>>        }
> >>> +    addr->desc_user_addr = driver_region.iova;
> >>> +    avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
> >>> +    addr->avail_user_addr = driver_region.iova + avail_offset;
> >>>
> >>> -    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
> >>> -                           (void *)addr->used_user_addr, false);
> >>> -    if (unlikely(r != 0)) {
> >>> -        error_setg_errno(errp, -r, "Cannot create vq device region: ");
> >>> +    device_region = (DMAMap) {
> >>> +        .translated_addr = svq_addr.used_user_addr,
> >>> +        .size = device_size - 1,
> >>> +        .perm = IOMMU_RW,
> >>> +    };
> >>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> >>> +    if (unlikely(!ok)) {
> >>> +        error_prepend(errp, "Cannot create vq device region: ");
> >>> +        vhost_vdpa_svq_unmap_ring(v, &driver_region);
> >>>        }
> >>> +    addr->used_user_addr = device_region.iova;
> >>>
> >>> -    return r == 0;
> >>> +    return ok;
> >>>    }
> >>>
> >>>    static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
>



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 09/14] vhost: Add VhostIOVATree
  2022-02-28  7:06     ` Jason Wang
  (?)
@ 2022-03-03 16:32     ` Eugenio Perez Martin
  2022-03-04  2:04         ` Jason Wang
  -1 siblings, 1 reply; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-03 16:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Feb 28, 2022 at 8:06 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2022/2/27 21:41, Eugenio Pérez wrote:
> > This tree is able to look for a translated address from an IOVA address.
> >
> > At first glance it is similar to util/iova-tree. However, SVQ working on
> > devices with limited IOVA space needs more capabilities, like allocating
> > IOVA chunks or performing reverse translations (qemu addresses to iova).
> >
> > The allocation capability, as "assign a free IOVA address to this chunk
> > of memory in qemu's address space" allows shadow virtqueue to create a
> > new address space that is not restricted by guest's addressable one, so
> > we can allocate shadow vqs vrings outside of it.
> >
> > It duplicates the tree so it can search efficiently in both directions,
> > and it will signal overlap if iova or the translated address is present
> > in any tree.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-iova-tree.h |  27 +++++++
> >   hw/virtio/vhost-iova-tree.c | 155 ++++++++++++++++++++++++++++++++++++
> >   hw/virtio/meson.build       |   2 +-
> >   3 files changed, 183 insertions(+), 1 deletion(-)
> >   create mode 100644 hw/virtio/vhost-iova-tree.h
> >   create mode 100644 hw/virtio/vhost-iova-tree.c
> >
> > diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> > new file mode 100644
> > index 0000000000..6a4f24e0f9
> > --- /dev/null
> > +++ b/hw/virtio/vhost-iova-tree.h
> > @@ -0,0 +1,27 @@
> > +/*
> > + * vhost software live migration iova tree
> > + *
> > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> > +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> > +
> > +#include "qemu/iova-tree.h"
> > +#include "exec/memory.h"
> > +
> > +typedef struct VhostIOVATree VhostIOVATree;
> > +
> > +VhostIOVATree *vhost_iova_tree_new(uint64_t iova_first, uint64_t iova_last);
> > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree);
> > +G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_delete);
> > +
> > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> > +                                        const DMAMap *map);
> > +int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map);
> > +
> > +#endif
> > diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> > new file mode 100644
> > index 0000000000..03496ac075
> > --- /dev/null
> > +++ b/hw/virtio/vhost-iova-tree.c
> > @@ -0,0 +1,155 @@
> > +/*
> > + * vhost software live migration iova tree
> > + *
> > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qemu/iova-tree.h"
> > +#include "vhost-iova-tree.h"
> > +
> > +#define iova_min_addr qemu_real_host_page_size
> > +
> > +/**
> > + * VhostIOVATree, able to:
> > + * - Translate iova address
> > + * - Reverse translate iova address (from translated to iova)
> > + * - Allocate IOVA regions for translated range (linear operation)
> > + */
> > +struct VhostIOVATree {
> > +    /* First addressable iova address in the device */
> > +    uint64_t iova_first;
> > +
> > +    /* Last addressable iova address in the device */
> > +    uint64_t iova_last;
> > +
> > +    /* IOVA address to qemu memory maps. */
> > +    IOVATree *iova_taddr_map;
> > +
> > +    /* QEMU virtual memory address to iova maps */
> > +    GTree *taddr_iova_map;
> > +};
> > +
> > +static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b,
> > +                                      gpointer data)
> > +{
> > +    const DMAMap *m1 = a, *m2 = b;
> > +
> > +    if (m1->translated_addr > m2->translated_addr + m2->size) {
> > +        return 1;
> > +    }
> > +
> > +    if (m1->translated_addr + m1->size < m2->translated_addr) {
> > +        return -1;
> > +    }
> > +
> > +    /* Overlapped */
> > +    return 0;
> > +}
> > +
> > +/**
> > + * Create a new IOVA tree
> > + *
> > + * Returns the new IOVA tree
> > + */
> > +VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> > +{
> > +    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
> > +
> > +    /* Some devices do not like 0 addresses */
> > +    tree->iova_first = MAX(iova_first, iova_min_addr);
> > +    tree->iova_last = iova_last;
> > +
> > +    tree->iova_taddr_map = iova_tree_new();
> > +    tree->taddr_iova_map = g_tree_new_full(vhost_iova_tree_cmp_taddr, NULL,
> > +                                           NULL, g_free);
> > +    return tree;
> > +}
> > +
> > +/**
> > + * Delete an iova tree
> > + */
> > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> > +{
> > +    iova_tree_destroy(iova_tree->iova_taddr_map);
> > +    g_tree_unref(iova_tree->taddr_iova_map);
> > +    g_free(iova_tree);
> > +}
> > +
> > +/**
> > + * Find the IOVA address stored from a memory address
> > + *
> > + * @tree     The iova tree
> > + * @map      The map with the memory address
> > + *
> > + * Return the stored mapping, or NULL if not found.
> > + */
> > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
> > +                                        const DMAMap *map)
> > +{
> > +    return g_tree_lookup(tree->taddr_iova_map, map);
> > +}
> > +
> > +/**
> > + * Allocate a new mapping
> > + *
> > + * @tree  The iova tree
> > + * @map   The iova map
> > + *
> > + * Returns:
> > + * - IOVA_OK if the map fits in the container
> > + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> > + * - IOVA_ERR_OVERLAP if the tree already contains that map
> > + * - IOVA_ERR_NOMEM if tree cannot allocate more space.
> > + *
> > + * It returns assignated iova in map->iova if return value is VHOST_DMA_MAP_OK.
> > + */
> > +int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> > +{
> > +    /* Some vhost devices do not like addr 0. Skip first page */
> > +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size;
> > +    DMAMap *new;
> > +    int r;
> > +
> > +    if (map->translated_addr + map->size < map->translated_addr ||
> > +        map->perm == IOMMU_NONE) {
> > +        return IOVA_ERR_INVALID;
> > +    }
> > +
> > +    /* Check for collisions in translated addresses */
> > +    if (vhost_iova_tree_find_iova(tree, map)) {
> > +        return IOVA_ERR_OVERLAP;
> > +    }
> > +
> > +    /* Allocate a node in IOVA address */
> > +    r = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> > +                            tree->iova_last);
> > +    if (r != IOVA_OK) {
> > +        return r;
> > +    }
> > +
> > +    /* Allocate node in qemu -> iova translations */
> > +    new = g_malloc(sizeof(*new));
> > +    memcpy(new, map, sizeof(*new));
> > +    g_tree_insert(tree->taddr_iova_map, new, new);
>
>
> Can the caller map two IOVA ranges to the same e.g GPA range?
>

It shouldn't matter, because we are totally ignoring the GPA here. The HVA
could be more problematic.

We call it from two places: for the shadow vring addresses and through the
memory listener. The SVQ vring addresses should already be at translated
addresses separate from each other and from the guest's HVA, because of
malloc semantics.

Regarding the listener, it should already report flattened memory with
no overlap between the HVA chunks.
vhost_vdpa_listener_skipped_section should skip all the problematic
sections, if I'm not wrong.

But I may have missed some scenarios: vdpa devices only care about the
IOVA -> HVA translation, so in theory two IOVAs could translate to the
same HVA and we would not notice until we try it with SVQ. Developing an
algorithm to handle this seems complicated at this moment: should we
keep the bigger one? The last one mapped? What happens if the listener
unmaps one of them; do we suddenly have to start translating through the
one that was not unmapped? It seems some kind of stacking would be needed.
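
For what it's worth, here is a tiny sketch of the collision described
above; the helper name and host_buf are made up for the example:

/*
 * With the current trees, two IOVA ranges cannot be backed by the same
 * HVA chunk: the reverse (HVA -> IOVA) comparator treats any overlap in
 * translated_addr as a match, so the second allocation is rejected.
 */
static void example_same_hva_twice(VhostIOVATree *tree, void *host_buf)
{
    DMAMap first = {
        .translated_addr = (hwaddr)host_buf,
        .size = 0xfff,                 /* inclusive size: one 4K page */
        .perm = IOMMU_RW,
    };
    DMAMap second = first;             /* the very same HVA chunk again */

    int r1 = vhost_iova_tree_map_alloc(tree, &first);  /* IOVA_OK */
    int r2 = vhost_iova_tree_map_alloc(tree, &second); /* IOVA_ERR_OVERLAP */

    /*
     * Even if both were accepted, a reverse lookup could only ever return
     * one of them, so there would be no way to pick the right IOVA.
     */
    (void)r1;
    (void)r2;
}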

Thanks!

> Thanks
>
>
> > +    return IOVA_OK;
> > +}
> > +
> > +/**
> > + * Remove existing mappings from iova tree
> > + *
> > + * @param  iova_tree  The vhost iova tree
> > + * @param  map        The map to remove
> > + */
> > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map)
> > +{
> > +    const DMAMap *overlap;
> > +
> > +    iova_tree_remove(iova_tree->iova_taddr_map, map);
> > +    while ((overlap = vhost_iova_tree_find_iova(iova_tree, map))) {
> > +        g_tree_remove(iova_tree->taddr_iova_map, overlap);
> > +    }
> > +}
> > diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> > index 2dc87613bc..6047670804 100644
> > --- a/hw/virtio/meson.build
> > +++ b/hw/virtio/meson.build
> > @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
> >
> >   virtio_ss = ss.source_set()
> >   virtio_ss.add(files('virtio.c'))
> > -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> > +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
> >   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
> >   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
> >   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities
  2022-03-03  9:24         ` Eugenio Perez Martin
@ 2022-03-04  1:39             ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-04  1:39 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Thu, Mar 3, 2022 at 5:25 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Thu, Mar 3, 2022 at 8:12 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2022/3/2 02:49, Eugenio Perez Martin wrote:
> > > On Mon, Feb 28, 2022 at 3:57 AM Jason Wang<jasowang@redhat.com>  wrote:
> > >> On 2022/2/27 21:40, Eugenio Pérez wrote:
> > >>> At this mode no buffer forwarding will be performed in SVQ mode: Qemu
> > >>> will just forward the guest's kicks to the device.
> > >>>
> > >>> Host memory notifiers regions are left out for simplicity, and they will
> > >>> not be addressed in this series.
> > >>>
> > >>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
> > >>> ---
> > >>>    hw/virtio/vhost-shadow-virtqueue.h |  14 +++
> > >>>    include/hw/virtio/vhost-vdpa.h     |   4 +
> > >>>    hw/virtio/vhost-shadow-virtqueue.c |  52 +++++++++++
> > >>>    hw/virtio/vhost-vdpa.c             | 145 ++++++++++++++++++++++++++++-
> > >>>    4 files changed, 213 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > >>> index f1519e3c7b..1cbc87d5d8 100644
> > >>> --- a/hw/virtio/vhost-shadow-virtqueue.h
> > >>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > >>> @@ -18,8 +18,22 @@ typedef struct VhostShadowVirtqueue {
> > >>>        EventNotifier hdev_kick;
> > >>>        /* Shadow call notifier, sent to vhost */
> > >>>        EventNotifier hdev_call;
> > >>> +
> > >>> +    /*
> > >>> +     * Borrowed virtqueue's guest to host notifier. To borrow it in this event
> > >>> +     * notifier allows to recover the VhostShadowVirtqueue from the event loop
> > >>> +     * easily. If we use the VirtQueue's one, we don't have an easy way to
> > >>> +     * retrieve VhostShadowVirtqueue.
> > >>> +     *
> > >>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> > >>> +     */
> > >>> +    EventNotifier svq_kick;
> > >>>    } VhostShadowVirtqueue;
> > >>>
> > >>> +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
> > >>> +
> > >>> +void vhost_svq_stop(VhostShadowVirtqueue *svq);
> > >>> +
> > >>>    VhostShadowVirtqueue *vhost_svq_new(void);
> > >>>
> > >>>    void vhost_svq_free(gpointer vq);
> > >>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > >>> index 3ce79a646d..009a9f3b6b 100644
> > >>> --- a/include/hw/virtio/vhost-vdpa.h
> > >>> +++ b/include/hw/virtio/vhost-vdpa.h
> > >>> @@ -12,6 +12,8 @@
> > >>>    #ifndef HW_VIRTIO_VHOST_VDPA_H
> > >>>    #define HW_VIRTIO_VHOST_VDPA_H
> > >>>
> > >>> +#include <gmodule.h>
> > >>> +
> > >>>    #include "hw/virtio/virtio.h"
> > >>>    #include "standard-headers/linux/vhost_types.h"
> > >>>
> > >>> @@ -27,6 +29,8 @@ typedef struct vhost_vdpa {
> > >>>        bool iotlb_batch_begin_sent;
> > >>>        MemoryListener listener;
> > >>>        struct vhost_vdpa_iova_range iova_range;
> > >>> +    bool shadow_vqs_enabled;
> > >>> +    GPtrArray *shadow_vqs;
> > >>>        struct vhost_dev *dev;
> > >>>        VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> > >>>    } VhostVDPA;
> > >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > >>> index 019cf1950f..a5d0659f86 100644
> > >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> > >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > >>> @@ -11,6 +11,56 @@
> > >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> > >>>
> > >>>    #include "qemu/error-report.h"
> > >>> +#include "qemu/main-loop.h"
> > >>> +#include "linux-headers/linux/vhost.h"
> > >>> +
> > >>> +/** Forward guest notifications */
> > >>> +static void vhost_handle_guest_kick(EventNotifier *n)
> > >>> +{
> > >>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > >>> +                                             svq_kick);
> > >>> +    event_notifier_test_and_clear(n);
> > >>> +    event_notifier_set(&svq->hdev_kick);
> > >>> +}
> > >>> +
> > >>> +/**
> > >>> + * Set a new file descriptor for the guest to kick the SVQ and notify for avail
> > >>> + *
> > >>> + * @svq          The svq
> > >>> + * @svq_kick_fd  The svq kick fd
> > >>> + *
> > >>> + * Note that the SVQ will never close the old file descriptor.
> > >>> + */
> > >>> +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
> > >>> +{
> > >>> +    EventNotifier *svq_kick = &svq->svq_kick;
> > >>> +    bool poll_stop = VHOST_FILE_UNBIND != event_notifier_get_fd(svq_kick);
> > >> I wonder if this is robust. E.g is there any chance that may end up with
> > >> both poll_stop and poll_start are false?
> > >>
> > > I cannot make that happen in qemu, but the function supports that case
> > > well: It will do nothing. It's more or less the same code as used in
> > > the vhost kernel, and is the expected behaviour if you send two
> > > VHOST_FILE_UNBIND one right after another to me.
> >
> >
> > I would think it would just stop twice.
> >
> >
> > >
> > >> If not, can we simple detect poll_stop as below and treat !poll_start
> > >> and poll_stop?
> > >>
> > > I'm not sure what it adds. Is there an unexpected consequence with
> > > the current do-nothing behavior I've missed?
> >
> >
> > I'm not sure, but it feels odd if poll_start is not the reverse value of
> > poll_stop.
> >
>
> If we don't want to restrict the inputs, we need to handle four situations:
>
> a) old_fd = -1, new_fd = -1
>
> This is the situation you described, and it's basically a no-op.
> poll_stop == poll_start == false.
>
> If we make poll_stop = true and poll_start = false, we call
> event_notifier_set_handler(-1, ...). Hopefully it will return just an
> error.
>
> If we make poll_stop = false and poll_start = true, we are calling
> event_notifier_set(-1) and event_notifier_set_handler(-1,
> poll_callback). Same situation, hopefully an error, but unexpected.
>
> b) old_fd = -1, new_fd > -1
>
> We need to start polling the new_fd. No need to stop polling the
> old_fd, since we are not actually polling it.
>
> c) old_fd > -1, new_fd > -1
>
> We need to stop polling the old_fd and start polling the new one.
>
> If we make poll_stop = true and poll_start = false, we don't register a
> new polling function for the new kick_fd, so we will miss the guest's
> kicks.
>
> If we make poll_stop = false and poll_start = true, we keep polling the
> old file descriptor too, so whatever it gets assigned to next could call
> vhost_handle_guest_kick if it does not override the poll callback.
>
> We *could* detect whether old_fd == new_fd and skip all the work, but I
> don't think it is worth complicating the code, since we're only called
> with the kick_fd at device start.
>
> d) old_fd > -1, new_fd = -1
>
> We need to stop polling, or we could get invalid kick callbacks if the
> old fd gets written to after this. No need to poll anything beyond this
> point.

I see, thanks for the clarification.

>
> > Thanks
> >
> >
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities
@ 2022-03-04  1:39             ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-04  1:39 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Thu, Mar 3, 2022 at 5:25 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Thu, Mar 3, 2022 at 8:12 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > 在 2022/3/2 上午2:49, Eugenio Perez Martin 写道:
> > > On Mon, Feb 28, 2022 at 3:57 AM Jason Wang<jasowang@redhat.com>  wrote:
> > >> 在 2022/2/27 下午9:40, Eugenio Pérez 写道:
> > >>> At this mode no buffer forwarding will be performed in SVQ mode: Qemu
> > >>> will just forward the guest's kicks to the device.
> > >>>
> > >>> Host memory notifiers regions are left out for simplicity, and they will
> > >>> not be addressed in this series.
> > >>>
> > >>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
> > >>> ---
> > >>>    hw/virtio/vhost-shadow-virtqueue.h |  14 +++
> > >>>    include/hw/virtio/vhost-vdpa.h     |   4 +
> > >>>    hw/virtio/vhost-shadow-virtqueue.c |  52 +++++++++++
> > >>>    hw/virtio/vhost-vdpa.c             | 145 ++++++++++++++++++++++++++++-
> > >>>    4 files changed, 213 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > >>> index f1519e3c7b..1cbc87d5d8 100644
> > >>> --- a/hw/virtio/vhost-shadow-virtqueue.h
> > >>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > >>> @@ -18,8 +18,22 @@ typedef struct VhostShadowVirtqueue {
> > >>>        EventNotifier hdev_kick;
> > >>>        /* Shadow call notifier, sent to vhost */
> > >>>        EventNotifier hdev_call;
> > >>> +
> > >>> +    /*
> > >>> +     * Borrowed virtqueue's guest to host notifier. To borrow it in this event
> > >>> +     * notifier allows to recover the VhostShadowVirtqueue from the event loop
> > >>> +     * easily. If we use the VirtQueue's one, we don't have an easy way to
> > >>> +     * retrieve VhostShadowVirtqueue.
> > >>> +     *
> > >>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> > >>> +     */
> > >>> +    EventNotifier svq_kick;
> > >>>    } VhostShadowVirtqueue;
> > >>>
> > >>> +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
> > >>> +
> > >>> +void vhost_svq_stop(VhostShadowVirtqueue *svq);
> > >>> +
> > >>>    VhostShadowVirtqueue *vhost_svq_new(void);
> > >>>
> > >>>    void vhost_svq_free(gpointer vq);
> > >>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > >>> index 3ce79a646d..009a9f3b6b 100644
> > >>> --- a/include/hw/virtio/vhost-vdpa.h
> > >>> +++ b/include/hw/virtio/vhost-vdpa.h
> > >>> @@ -12,6 +12,8 @@
> > >>>    #ifndef HW_VIRTIO_VHOST_VDPA_H
> > >>>    #define HW_VIRTIO_VHOST_VDPA_H
> > >>>
> > >>> +#include <gmodule.h>
> > >>> +
> > >>>    #include "hw/virtio/virtio.h"
> > >>>    #include "standard-headers/linux/vhost_types.h"
> > >>>
> > >>> @@ -27,6 +29,8 @@ typedef struct vhost_vdpa {
> > >>>        bool iotlb_batch_begin_sent;
> > >>>        MemoryListener listener;
> > >>>        struct vhost_vdpa_iova_range iova_range;
> > >>> +    bool shadow_vqs_enabled;
> > >>> +    GPtrArray *shadow_vqs;
> > >>>        struct vhost_dev *dev;
> > >>>        VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> > >>>    } VhostVDPA;
> > >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > >>> index 019cf1950f..a5d0659f86 100644
> > >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> > >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > >>> @@ -11,6 +11,56 @@
> > >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> > >>>
> > >>>    #include "qemu/error-report.h"
> > >>> +#include "qemu/main-loop.h"
> > >>> +#include "linux-headers/linux/vhost.h"
> > >>> +
> > >>> +/** Forward guest notifications */
> > >>> +static void vhost_handle_guest_kick(EventNotifier *n)
> > >>> +{
> > >>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > >>> +                                             svq_kick);
> > >>> +    event_notifier_test_and_clear(n);
> > >>> +    event_notifier_set(&svq->hdev_kick);
> > >>> +}
> > >>> +
> > >>> +/**
> > >>> + * Set a new file descriptor for the guest to kick the SVQ and notify for avail
> > >>> + *
> > >>> + * @svq          The svq
> > >>> + * @svq_kick_fd  The svq kick fd
> > >>> + *
> > >>> + * Note that the SVQ will never close the old file descriptor.
> > >>> + */
> > >>> +void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
> > >>> +{
> > >>> +    EventNotifier *svq_kick = &svq->svq_kick;
> > >>> +    bool poll_stop = VHOST_FILE_UNBIND != event_notifier_get_fd(svq_kick);
> > >> I wonder if this is robust. E.g. is there any chance we may end up with
> > >> both poll_stop and poll_start being false?
> > >>
> > > I cannot make that happen in qemu, but the function handles that case
> > > well: it will do nothing. It's more or less the same code as used in
> > > the vhost kernel, and to me it is the expected behaviour if you send
> > > two VHOST_FILE_UNBIND requests one right after another.
> >
> >
> > I would think it would just stop twice.
> >
> >
> > >
> > >> If not, can we simply detect poll_stop as below and treat !poll_start
> > >> and poll_stop?
> > >>
> > > I'm not sure what it adds. Is there an unexpected consequence of the
> > > current do-nothing behavior that I've missed?
> >
> >
> > I'm not sure, but it feels odd if poll_start is not the inverse of
> > poll_stop.
> >
>
> If we do not want to restrict the inputs, we need to handle four situations:
>
> a) old_fd = -1, new_fd = -1,
>
> This is the situation you described, and it's basically a no-op.
> poll_stop == poll_start == false.
>
> If we make poll_stop = true and poll_start = false, we call
> event_notifier_set_handler(-1, ...). Hopefully it will just return an
> error.
>
> If we make poll_stop = false and poll_start = true, we are calling
> event_notifier_set(-1) and event_notifier_set_handler(-1,
> poll_callback). Same situation, hopefully an error, but unexpected.
>
> b) old_fd = -1, new_fd > -1,
>
> We need to start polling the new_fd. There is no need to stop polling
> the old_fd, since we are not actually polling it.
>
> c) old_fd > -1, new_fd > -1,
>
> We need to stop polling the old_fd and start polling the new one.
>
> If we make poll_stop = true and poll_start = false, we don't register a
> new polling function for the new kick_fd, so we will miss the guest's
> kicks.
>
> If we make poll_stop = false and poll_start = true, we keep polling the
> old file descriptor too, so whatever that fd gets reassigned to could
> call vhost_handle_guest_kick if it does not override the poll callback.
>
> We *could* detect old_fd == new_fd and skip all the work, but I don't
> think it is worth complicating the code, since we're only called with
> the kick_fd at device start.
>
> d) old_fd > -1, new_fd = -1,
>
> We need to stop polling, or we could get spurious kick callbacks if the
> old fd gets written after this. There is no need to poll anything
> beyond this point.

I see, thanks for the clarification.
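
To make the four cases above concrete, here is a minimal sketch of how they
collapse into the poll_stop/poll_start pair. The function is only partially
quoted above, so this is a reconstruction from the discussion rather than
necessarily the patch's exact code:

void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
{
    EventNotifier *svq_kick = &svq->svq_kick;
    bool poll_stop = VHOST_FILE_UNBIND != event_notifier_get_fd(svq_kick);
    bool poll_start = svq_kick_fd != VHOST_FILE_UNBIND;

    if (poll_stop) {
        /* Cases c) and d): stop watching the old file descriptor */
        event_notifier_set_handler(svq_kick, NULL);
    }

    /* Remember the new fd inside the borrowed notifier */
    event_notifier_init_fd(svq_kick, svq_kick_fd);

    if (poll_start) {
        /*
         * Cases b) and c): poll the new fd. Setting it first avoids losing
         * a guest kick that raced with the switch.
         */
        event_notifier_set(svq_kick);
        event_notifier_set_handler(svq_kick, vhost_handle_guest_kick);
    }
    /* Case a): both flags are false, nothing to do */
}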

>
> > Thanks
> >
> >
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 09/14] vhost: Add VhostIOVATree
  2022-03-03 16:32     ` Eugenio Perez Martin
@ 2022-03-04  2:04         ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-04  2:04 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Fri, Mar 4, 2022 at 12:33 AM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Feb 28, 2022 at 8:06 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > 在 2022/2/27 下午9:41, Eugenio Pérez 写道:
> > > This tree is able to look for a translated address from an IOVA address.
> > >
> > > At first glance it is similar to util/iova-tree. However, SVQ working on
> > > devices with limited IOVA space need more capabilities, like allocating
> > > IOVA chunks or performing reverse translations (qemu addresses to iova).
> > >
> > > The allocation capability, as "assign a free IOVA address to this chunk
> > > of memory in qemu's address space" allows shadow virtqueue to create a
> > > new address space that is not restricted by guest's addressable one, so
> > > we can allocate shadow vqs vrings outside of it.
> > >
> > > It duplicates the tree so it can search efficiently in both directions,
> > > and it will signal overlap if iova or the translated address is present
> > > in any tree.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >   hw/virtio/vhost-iova-tree.h |  27 +++++++
> > >   hw/virtio/vhost-iova-tree.c | 155 ++++++++++++++++++++++++++++++++++++
> > >   hw/virtio/meson.build       |   2 +-
> > >   3 files changed, 183 insertions(+), 1 deletion(-)
> > >   create mode 100644 hw/virtio/vhost-iova-tree.h
> > >   create mode 100644 hw/virtio/vhost-iova-tree.c
> > >
> > > diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> > > new file mode 100644
> > > index 0000000000..6a4f24e0f9
> > > --- /dev/null
> > > +++ b/hw/virtio/vhost-iova-tree.h
> > > @@ -0,0 +1,27 @@
> > > +/*
> > > + * vhost software live migration iova tree
> > > + *
> > > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > > + *
> > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > + */
> > > +
> > > +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> > > +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> > > +
> > > +#include "qemu/iova-tree.h"
> > > +#include "exec/memory.h"
> > > +
> > > +typedef struct VhostIOVATree VhostIOVATree;
> > > +
> > > +VhostIOVATree *vhost_iova_tree_new(uint64_t iova_first, uint64_t iova_last);
> > > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree);
> > > +G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_delete);
> > > +
> > > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> > > +                                        const DMAMap *map);
> > > +int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> > > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map);
> > > +
> > > +#endif
> > > diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> > > new file mode 100644
> > > index 0000000000..03496ac075
> > > --- /dev/null
> > > +++ b/hw/virtio/vhost-iova-tree.c
> > > @@ -0,0 +1,155 @@
> > > +/*
> > > + * vhost software live migration iova tree
> > > + *
> > > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > > + *
> > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > + */
> > > +
> > > +#include "qemu/osdep.h"
> > > +#include "qemu/iova-tree.h"
> > > +#include "vhost-iova-tree.h"
> > > +
> > > +#define iova_min_addr qemu_real_host_page_size
> > > +
> > > +/**
> > > + * VhostIOVATree, able to:
> > > + * - Translate iova address
> > > + * - Reverse translate iova address (from translated to iova)
> > > + * - Allocate IOVA regions for translated range (linear operation)
> > > + */
> > > +struct VhostIOVATree {
> > > +    /* First addressable iova address in the device */
> > > +    uint64_t iova_first;
> > > +
> > > +    /* Last addressable iova address in the device */
> > > +    uint64_t iova_last;
> > > +
> > > +    /* IOVA address to qemu memory maps. */
> > > +    IOVATree *iova_taddr_map;
> > > +
> > > +    /* QEMU virtual memory address to iova maps */
> > > +    GTree *taddr_iova_map;
> > > +};
> > > +
> > > +static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b,
> > > +                                      gpointer data)
> > > +{
> > > +    const DMAMap *m1 = a, *m2 = b;
> > > +
> > > +    if (m1->translated_addr > m2->translated_addr + m2->size) {
> > > +        return 1;
> > > +    }
> > > +
> > > +    if (m1->translated_addr + m1->size < m2->translated_addr) {
> > > +        return -1;
> > > +    }
> > > +
> > > +    /* Overlapped */
> > > +    return 0;
> > > +}
> > > +
> > > +/**
> > > + * Create a new IOVA tree
> > > + *
> > > + * Returns the new IOVA tree
> > > + */
> > > +VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> > > +{
> > > +    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
> > > +
> > > +    /* Some devices do not like 0 addresses */
> > > +    tree->iova_first = MAX(iova_first, iova_min_addr);
> > > +    tree->iova_last = iova_last;
> > > +
> > > +    tree->iova_taddr_map = iova_tree_new();
> > > +    tree->taddr_iova_map = g_tree_new_full(vhost_iova_tree_cmp_taddr, NULL,
> > > +                                           NULL, g_free);
> > > +    return tree;
> > > +}
> > > +
> > > +/**
> > > + * Delete an iova tree
> > > + */
> > > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> > > +{
> > > +    iova_tree_destroy(iova_tree->iova_taddr_map);
> > > +    g_tree_unref(iova_tree->taddr_iova_map);
> > > +    g_free(iova_tree);
> > > +}
> > > +
> > > +/**
> > > + * Find the IOVA address stored from a memory address
> > > + *
> > > + * @tree     The iova tree
> > > + * @map      The map with the memory address
> > > + *
> > > + * Return the stored mapping, or NULL if not found.
> > > + */
> > > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
> > > +                                        const DMAMap *map)
> > > +{
> > > +    return g_tree_lookup(tree->taddr_iova_map, map);
> > > +}
> > > +
> > > +/**
> > > + * Allocate a new mapping
> > > + *
> > > + * @tree  The iova tree
> > > + * @map   The iova map
> > > + *
> > > + * Returns:
> > > + * - IOVA_OK if the map fits in the container
> > > + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> > > + * - IOVA_ERR_OVERLAP if the tree already contains that map
> > > + * - IOVA_ERR_NOMEM if tree cannot allocate more space.
> > > + *
> > > + * It returns assignated iova in map->iova if return value is VHOST_DMA_MAP_OK.
> > > + */
> > > +int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> > > +{
> > > +    /* Some vhost devices do not like addr 0. Skip first page */
> > > +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size;
> > > +    DMAMap *new;
> > > +    int r;
> > > +
> > > +    if (map->translated_addr + map->size < map->translated_addr ||
> > > +        map->perm == IOMMU_NONE) {
> > > +        return IOVA_ERR_INVALID;
> > > +    }
> > > +
> > > +    /* Check for collisions in translated addresses */
> > > +    if (vhost_iova_tree_find_iova(tree, map)) {
> > > +        return IOVA_ERR_OVERLAP;
> > > +    }
> > > +
> > > +    /* Allocate a node in IOVA address */
> > > +    r = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> > > +                            tree->iova_last);
> > > +    if (r != IOVA_OK) {
> > > +        return r;
> > > +    }
> > > +
> > > +    /* Allocate node in qemu -> iova translations */
> > > +    new = g_malloc(sizeof(*new));
> > > +    memcpy(new, map, sizeof(*new));
> > > +    g_tree_insert(tree->taddr_iova_map, new, new);
> >
> >
> > Can the caller map two IOVA ranges to the same e.g GPA range?
> >
>
> It shouldn't matter, because we are totally ignoring GPA here. HVA
> could be more problematic.
>
> We call it from two places: the shadow vring addresses and the memory
> listener. The SVQ vring addresses should already be at translated
> addresses separate from each other and from the guest's HVA because of
> malloc semantics.

Right, so SVQ addresses should be fine, the problem is the guest mappings.

>
> Regarding the listener, it should already report flattened memory with
> no overlap between the HVA chunks.
> vhost_vdpa_listener_skipped_section should skip all problematic
> sections if I'm not wrong.
>
> But I may have missed some scenarios: vdpa devices only care about the
> IOVA -> HVA translation, so in theory two IOVAs could translate to the
> same HVA and we would not notice until we try it with SVQ. Developing
> an algorithm to handle this seems complicated at this moment: Should we
> keep the bigger one? The last one mapped? What happens if the listener
> unmaps one of them and we suddenly must start translating through the
> one that was not unmapped? It seems some kind of stacking would be
> needed.
>
> Thanks!

It looks to me that we should always try to allocate a new iova each
time, even if the HVA is the same. This means we need to remove the
reverse mapping tree.

Currently we have:

    /* Check for collisions in translated addresses */
    if (vhost_iova_tree_find_iova(tree, map)) {
        return IOVA_ERR_OVERLAP;
    }

We probably need to remove that. And during translation we instead need
to iterate the whole iova tree to get the reverse mapping, returning the
largest possible mapping there.

This may degrade performance, but considering there should not be many
memslots most of the time, it should be fine.
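
Concretely, the allocation path would then shrink to something like the
sketch below (same identifiers as the patch, but this is only a rough sketch
of the idea, not the patch's code): no taddr_iova_map insertion and no
collision check on translated addresses, so two different IOVA ranges may end
up mapping the same HVA.

int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
{
    /* Some vhost devices do not like addr 0. Skip first page */
    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size;

    if (map->translated_addr + map->size < map->translated_addr ||
        map->perm == IOMMU_NONE) {
        return IOVA_ERR_INVALID;
    }

    /* Just pick a free IOVA range; duplicated HVAs are allowed now */
    return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
                               tree->iova_last);
}

The reverse HVA -> IOVA direction then has to be answered by walking the
IOVA-keyed tree instead.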

Thanks


>
> > Thanks
> >
> >
> > > +    return IOVA_OK;
> > > +}
> > > +
> > > +/**
> > > + * Remove existing mappings from iova tree
> > > + *
> > > + * @param  iova_tree  The vhost iova tree
> > > + * @param  map        The map to remove
> > > + */
> > > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map)
> > > +{
> > > +    const DMAMap *overlap;
> > > +
> > > +    iova_tree_remove(iova_tree->iova_taddr_map, map);
> > > +    while ((overlap = vhost_iova_tree_find_iova(iova_tree, map))) {
> > > +        g_tree_remove(iova_tree->taddr_iova_map, overlap);
> > > +    }
> > > +}
> > > diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> > > index 2dc87613bc..6047670804 100644
> > > --- a/hw/virtio/meson.build
> > > +++ b/hw/virtio/meson.build
> > > @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
> > >
> > >   virtio_ss = ss.source_set()
> > >   virtio_ss.add(files('virtio.c'))
> > > -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> > > +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
> > >   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
> > >   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
> > >   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
> >
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 09/14] vhost: Add VhostIOVATree
  2022-03-04  2:04         ` Jason Wang
  (?)
@ 2022-03-04  8:01         ` Eugenio Perez Martin
  2022-03-07  3:41             ` Jason Wang
  -1 siblings, 1 reply; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-04  8:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Fri, Mar 4, 2022 at 3:04 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Mar 4, 2022 at 12:33 AM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Mon, Feb 28, 2022 at 8:06 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > >
> > > 在 2022/2/27 下午9:41, Eugenio Pérez 写道:
> > > > This tree is able to look for a translated address from an IOVA address.
> > > >
> > > > At first glance it is similar to util/iova-tree. However, SVQ working on
> > > > devices with limited IOVA space need more capabilities, like allocating
> > > > IOVA chunks or performing reverse translations (qemu addresses to iova).
> > > >
> > > > The allocation capability, as "assign a free IOVA address to this chunk
> > > > of memory in qemu's address space" allows shadow virtqueue to create a
> > > > new address space that is not restricted by guest's addressable one, so
> > > > we can allocate shadow vqs vrings outside of it.
> > > >
> > > > It duplicates the tree so it can search efficiently in both directions,
> > > > and it will signal overlap if iova or the translated address is present
> > > > in any tree.
> > > >
> > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > ---
> > > >   hw/virtio/vhost-iova-tree.h |  27 +++++++
> > > >   hw/virtio/vhost-iova-tree.c | 155 ++++++++++++++++++++++++++++++++++++
> > > >   hw/virtio/meson.build       |   2 +-
> > > >   3 files changed, 183 insertions(+), 1 deletion(-)
> > > >   create mode 100644 hw/virtio/vhost-iova-tree.h
> > > >   create mode 100644 hw/virtio/vhost-iova-tree.c
> > > >
> > > > diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> > > > new file mode 100644
> > > > index 0000000000..6a4f24e0f9
> > > > --- /dev/null
> > > > +++ b/hw/virtio/vhost-iova-tree.h
> > > > @@ -0,0 +1,27 @@
> > > > +/*
> > > > + * vhost software live migration iova tree
> > > > + *
> > > > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > > > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > > > + *
> > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > + */
> > > > +
> > > > +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> > > > +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> > > > +
> > > > +#include "qemu/iova-tree.h"
> > > > +#include "exec/memory.h"
> > > > +
> > > > +typedef struct VhostIOVATree VhostIOVATree;
> > > > +
> > > > +VhostIOVATree *vhost_iova_tree_new(uint64_t iova_first, uint64_t iova_last);
> > > > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree);
> > > > +G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_delete);
> > > > +
> > > > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> > > > +                                        const DMAMap *map);
> > > > +int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> > > > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map);
> > > > +
> > > > +#endif
> > > > diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> > > > new file mode 100644
> > > > index 0000000000..03496ac075
> > > > --- /dev/null
> > > > +++ b/hw/virtio/vhost-iova-tree.c
> > > > @@ -0,0 +1,155 @@
> > > > +/*
> > > > + * vhost software live migration iova tree
> > > > + *
> > > > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > > > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > > > + *
> > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > + */
> > > > +
> > > > +#include "qemu/osdep.h"
> > > > +#include "qemu/iova-tree.h"
> > > > +#include "vhost-iova-tree.h"
> > > > +
> > > > +#define iova_min_addr qemu_real_host_page_size
> > > > +
> > > > +/**
> > > > + * VhostIOVATree, able to:
> > > > + * - Translate iova address
> > > > + * - Reverse translate iova address (from translated to iova)
> > > > + * - Allocate IOVA regions for translated range (linear operation)
> > > > + */
> > > > +struct VhostIOVATree {
> > > > +    /* First addressable iova address in the device */
> > > > +    uint64_t iova_first;
> > > > +
> > > > +    /* Last addressable iova address in the device */
> > > > +    uint64_t iova_last;
> > > > +
> > > > +    /* IOVA address to qemu memory maps. */
> > > > +    IOVATree *iova_taddr_map;
> > > > +
> > > > +    /* QEMU virtual memory address to iova maps */
> > > > +    GTree *taddr_iova_map;
> > > > +};
> > > > +
> > > > +static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b,
> > > > +                                      gpointer data)
> > > > +{
> > > > +    const DMAMap *m1 = a, *m2 = b;
> > > > +
> > > > +    if (m1->translated_addr > m2->translated_addr + m2->size) {
> > > > +        return 1;
> > > > +    }
> > > > +
> > > > +    if (m1->translated_addr + m1->size < m2->translated_addr) {
> > > > +        return -1;
> > > > +    }
> > > > +
> > > > +    /* Overlapped */
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Create a new IOVA tree
> > > > + *
> > > > + * Returns the new IOVA tree
> > > > + */
> > > > +VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> > > > +{
> > > > +    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
> > > > +
> > > > +    /* Some devices do not like 0 addresses */
> > > > +    tree->iova_first = MAX(iova_first, iova_min_addr);
> > > > +    tree->iova_last = iova_last;
> > > > +
> > > > +    tree->iova_taddr_map = iova_tree_new();
> > > > +    tree->taddr_iova_map = g_tree_new_full(vhost_iova_tree_cmp_taddr, NULL,
> > > > +                                           NULL, g_free);
> > > > +    return tree;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Delete an iova tree
> > > > + */
> > > > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> > > > +{
> > > > +    iova_tree_destroy(iova_tree->iova_taddr_map);
> > > > +    g_tree_unref(iova_tree->taddr_iova_map);
> > > > +    g_free(iova_tree);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Find the IOVA address stored from a memory address
> > > > + *
> > > > + * @tree     The iova tree
> > > > + * @map      The map with the memory address
> > > > + *
> > > > + * Return the stored mapping, or NULL if not found.
> > > > + */
> > > > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
> > > > +                                        const DMAMap *map)
> > > > +{
> > > > +    return g_tree_lookup(tree->taddr_iova_map, map);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Allocate a new mapping
> > > > + *
> > > > + * @tree  The iova tree
> > > > + * @map   The iova map
> > > > + *
> > > > + * Returns:
> > > > + * - IOVA_OK if the map fits in the container
> > > > + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> > > > + * - IOVA_ERR_OVERLAP if the tree already contains that map
> > > > + * - IOVA_ERR_NOMEM if tree cannot allocate more space.
> > > > + *
> > > > + * It returns assignated iova in map->iova if return value is VHOST_DMA_MAP_OK.
> > > > + */
> > > > +int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> > > > +{
> > > > +    /* Some vhost devices do not like addr 0. Skip first page */
> > > > +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size;
> > > > +    DMAMap *new;
> > > > +    int r;
> > > > +
> > > > +    if (map->translated_addr + map->size < map->translated_addr ||
> > > > +        map->perm == IOMMU_NONE) {
> > > > +        return IOVA_ERR_INVALID;
> > > > +    }
> > > > +
> > > > +    /* Check for collisions in translated addresses */
> > > > +    if (vhost_iova_tree_find_iova(tree, map)) {
> > > > +        return IOVA_ERR_OVERLAP;
> > > > +    }
> > > > +
> > > > +    /* Allocate a node in IOVA address */
> > > > +    r = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> > > > +                            tree->iova_last);
> > > > +    if (r != IOVA_OK) {
> > > > +        return r;
> > > > +    }
> > > > +
> > > > +    /* Allocate node in qemu -> iova translations */
> > > > +    new = g_malloc(sizeof(*new));
> > > > +    memcpy(new, map, sizeof(*new));
> > > > +    g_tree_insert(tree->taddr_iova_map, new, new);
> > >
> > >
> > > Can the caller map two IOVA ranges to the same e.g GPA range?
> > >
> >
> > It shouldn't matter, because we are totally ignoring GPA here. HVA
> > could be more problematic.
> >
> > We call it from two places: The shadow vring addresses and through the
> > memory listener. The SVQ vring addresses should already be on a
> > separated translated address from each one and guest's HVA because of
> > malloc semantics.
>
> Right, so SVQ addresses should be fine, the problem is the guest mappings.
>
> >
> > Regarding the listener, it should already report flattened memory with
> > no overlapping between the HVA chunks.
> > vhost_vdpa_listener_skipped_section should skip all problematic
> > sections if I'm not wrong.
> >
> > But I may have missed some scenarios: vdpa devices only care about
> > IOVA -> HVA translation, so two IOVA could translate to the same HVA
> > in theory and we would not notice until we try with SVQ. To develop an
> > algorithm to handle this seems complicated at this moment: Should we
> > keep the bigger one? The last mapped? What happens if the listener
> > unmaps one of them, we suddenly must start translating from the not
> > unmapping? Seems that some kind of stacking would be needed.
> >
> > Thanks!
>
> It looks to me that we should always try to allocate new iova each
> time, even if the HVA is the same. This means we need to remove the
> reverse mapping tree.
>
> Currently we had:
>
>     /* Check for collisions in translated addresses */
>     if (vhost_iova_tree_find_iova(tree, map)) {
>         return IOVA_ERR_OVERLAP;
>     }
>
> We probably need to remove that. And during the translation we need to
> iterate the whole iova tree to get the reverse mapping instead by
> returning the largest possible mapping there.
>

I'm not sure that is possible. g_tree_insert() calls the comparison
function to know where to place the new element, so it has to do
something if the node already exists. Looking at the sources, it
actually silently destroys the new node. If we call g_tree_replace() we
get the opposite and destroy the old node. Either way, the tree is
expected to have non-overlapping keys.

Apart from that, we're not using this structure as a tree anymore, so in
that case it would be better to use a list directly.

But even with a list there are still questions about how to handle
overlaps. How should this deletion be handled:

* Allocate translated_addr 0, size 0x1000.
* Allocate translated_addr 0, size 0x2000.
* Delete translated_addr 0, size 0x1000.

Should it delete only the first node? Both of them?

iova-tree has similar questions with iova too. Inserting (iova=0,
size=0x1000) and then deleting (iova=0, size=0x800) will delete the
whole node, so we can no longer look up the translation of iova=0x900.
Is this expected?
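
To make the question concrete, here is a tiny sketch against util/iova-tree
as I read it today (removal drops every overlapping node whole; there is no
node splitting). Sizes follow DMAMap's inclusive-size convention:

    IOVATree *t = iova_tree_new();
    DMAMap map = { .iova = 0, .size = 0xfff, .translated_addr = 0x100000,
                   .perm = IOMMU_RW };
    DMAMap partial = { .iova = 0, .size = 0x7ff };

    iova_tree_insert(t, &map);
    iova_tree_remove(t, &partial);   /* the whole [0, 0x1000) node goes away */
    g_assert(iova_tree_find_address(t, 0x900) == NULL); /* 0x900 is gone */
    iova_tree_destroy(t);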

> But this may degrade the performance, but consider the memslots should
> not be much at most of the time, it should be fine.
>
> Thanks
>
>
> >
> > > Thanks
> > >
> > >
> > > > +    return IOVA_OK;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Remove existing mappings from iova tree
> > > > + *
> > > > + * @param  iova_tree  The vhost iova tree
> > > > + * @param  map        The map to remove
> > > > + */
> > > > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map)
> > > > +{
> > > > +    const DMAMap *overlap;
> > > > +
> > > > +    iova_tree_remove(iova_tree->iova_taddr_map, map);
> > > > +    while ((overlap = vhost_iova_tree_find_iova(iova_tree, map))) {
> > > > +        g_tree_remove(iova_tree->taddr_iova_map, overlap);
> > > > +    }
> > > > +}
> > > > diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> > > > index 2dc87613bc..6047670804 100644
> > > > --- a/hw/virtio/meson.build
> > > > +++ b/hw/virtio/meson.build
> > > > @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
> > > >
> > > >   virtio_ss = ss.source_set()
> > > >   virtio_ss.add(files('virtio.c'))
> > > > -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> > > > +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
> > > >   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
> > > >   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
> > > >   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
> > >
> >
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 09/14] vhost: Add VhostIOVATree
  2022-03-04  8:01         ` Eugenio Perez Martin
@ 2022-03-07  3:41             ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-07  3:41 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Fri, Mar 4, 2022 at 4:02 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Fri, Mar 4, 2022 at 3:04 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Mar 4, 2022 at 12:33 AM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Feb 28, 2022 at 8:06 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > >
> > > > 在 2022/2/27 下午9:41, Eugenio Pérez 写道:
> > > > > This tree is able to look for a translated address from an IOVA address.
> > > > >
> > > > > At first glance it is similar to util/iova-tree. However, SVQ working on
> > > > > devices with limited IOVA space need more capabilities, like allocating
> > > > > IOVA chunks or performing reverse translations (qemu addresses to iova).
> > > > >
> > > > > The allocation capability, as "assign a free IOVA address to this chunk
> > > > > of memory in qemu's address space" allows shadow virtqueue to create a
> > > > > new address space that is not restricted by guest's addressable one, so
> > > > > we can allocate shadow vqs vrings outside of it.
> > > > >
> > > > > It duplicates the tree so it can search efficiently in both directions,
> > > > > and it will signal overlap if iova or the translated address is present
> > > > > in any tree.
> > > > >
> > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > ---
> > > > >   hw/virtio/vhost-iova-tree.h |  27 +++++++
> > > > >   hw/virtio/vhost-iova-tree.c | 155 ++++++++++++++++++++++++++++++++++++
> > > > >   hw/virtio/meson.build       |   2 +-
> > > > >   3 files changed, 183 insertions(+), 1 deletion(-)
> > > > >   create mode 100644 hw/virtio/vhost-iova-tree.h
> > > > >   create mode 100644 hw/virtio/vhost-iova-tree.c
> > > > >
> > > > > diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> > > > > new file mode 100644
> > > > > index 0000000000..6a4f24e0f9
> > > > > --- /dev/null
> > > > > +++ b/hw/virtio/vhost-iova-tree.h
> > > > > @@ -0,0 +1,27 @@
> > > > > +/*
> > > > > + * vhost software live migration iova tree
> > > > > + *
> > > > > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > > > > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > > > > + *
> > > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > > + */
> > > > > +
> > > > > +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> > > > > +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> > > > > +
> > > > > +#include "qemu/iova-tree.h"
> > > > > +#include "exec/memory.h"
> > > > > +
> > > > > +typedef struct VhostIOVATree VhostIOVATree;
> > > > > +
> > > > > +VhostIOVATree *vhost_iova_tree_new(uint64_t iova_first, uint64_t iova_last);
> > > > > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree);
> > > > > +G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_delete);
> > > > > +
> > > > > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> > > > > +                                        const DMAMap *map);
> > > > > +int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> > > > > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map);
> > > > > +
> > > > > +#endif
> > > > > diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> > > > > new file mode 100644
> > > > > index 0000000000..03496ac075
> > > > > --- /dev/null
> > > > > +++ b/hw/virtio/vhost-iova-tree.c
> > > > > @@ -0,0 +1,155 @@
> > > > > +/*
> > > > > + * vhost software live migration iova tree
> > > > > + *
> > > > > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > > > > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > > > > + *
> > > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > > + */
> > > > > +
> > > > > +#include "qemu/osdep.h"
> > > > > +#include "qemu/iova-tree.h"
> > > > > +#include "vhost-iova-tree.h"
> > > > > +
> > > > > +#define iova_min_addr qemu_real_host_page_size
> > > > > +
> > > > > +/**
> > > > > + * VhostIOVATree, able to:
> > > > > + * - Translate iova address
> > > > > + * - Reverse translate iova address (from translated to iova)
> > > > > + * - Allocate IOVA regions for translated range (linear operation)
> > > > > + */
> > > > > +struct VhostIOVATree {
> > > > > +    /* First addressable iova address in the device */
> > > > > +    uint64_t iova_first;
> > > > > +
> > > > > +    /* Last addressable iova address in the device */
> > > > > +    uint64_t iova_last;
> > > > > +
> > > > > +    /* IOVA address to qemu memory maps. */
> > > > > +    IOVATree *iova_taddr_map;
> > > > > +
> > > > > +    /* QEMU virtual memory address to iova maps */
> > > > > +    GTree *taddr_iova_map;
> > > > > +};
> > > > > +
> > > > > +static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b,
> > > > > +                                      gpointer data)
> > > > > +{
> > > > > +    const DMAMap *m1 = a, *m2 = b;
> > > > > +
> > > > > +    if (m1->translated_addr > m2->translated_addr + m2->size) {
> > > > > +        return 1;
> > > > > +    }
> > > > > +
> > > > > +    if (m1->translated_addr + m1->size < m2->translated_addr) {
> > > > > +        return -1;
> > > > > +    }
> > > > > +
> > > > > +    /* Overlapped */
> > > > > +    return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Create a new IOVA tree
> > > > > + *
> > > > > + * Returns the new IOVA tree
> > > > > + */
> > > > > +VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> > > > > +{
> > > > > +    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
> > > > > +
> > > > > +    /* Some devices do not like 0 addresses */
> > > > > +    tree->iova_first = MAX(iova_first, iova_min_addr);
> > > > > +    tree->iova_last = iova_last;
> > > > > +
> > > > > +    tree->iova_taddr_map = iova_tree_new();
> > > > > +    tree->taddr_iova_map = g_tree_new_full(vhost_iova_tree_cmp_taddr, NULL,
> > > > > +                                           NULL, g_free);
> > > > > +    return tree;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Delete an iova tree
> > > > > + */
> > > > > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> > > > > +{
> > > > > +    iova_tree_destroy(iova_tree->iova_taddr_map);
> > > > > +    g_tree_unref(iova_tree->taddr_iova_map);
> > > > > +    g_free(iova_tree);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Find the IOVA address stored from a memory address
> > > > > + *
> > > > > + * @tree     The iova tree
> > > > > + * @map      The map with the memory address
> > > > > + *
> > > > > + * Return the stored mapping, or NULL if not found.
> > > > > + */
> > > > > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
> > > > > +                                        const DMAMap *map)
> > > > > +{
> > > > > +    return g_tree_lookup(tree->taddr_iova_map, map);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Allocate a new mapping
> > > > > + *
> > > > > + * @tree  The iova tree
> > > > > + * @map   The iova map
> > > > > + *
> > > > > + * Returns:
> > > > > + * - IOVA_OK if the map fits in the container
> > > > > + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> > > > > + * - IOVA_ERR_OVERLAP if the tree already contains that map
> > > > > + * - IOVA_ERR_NOMEM if tree cannot allocate more space.
> > > > > + *
> > > > > + * It returns assignated iova in map->iova if return value is VHOST_DMA_MAP_OK.
> > > > > + */
> > > > > +int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> > > > > +{
> > > > > +    /* Some vhost devices do not like addr 0. Skip first page */
> > > > > +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size;
> > > > > +    DMAMap *new;
> > > > > +    int r;
> > > > > +
> > > > > +    if (map->translated_addr + map->size < map->translated_addr ||
> > > > > +        map->perm == IOMMU_NONE) {
> > > > > +        return IOVA_ERR_INVALID;
> > > > > +    }
> > > > > +
> > > > > +    /* Check for collisions in translated addresses */
> > > > > +    if (vhost_iova_tree_find_iova(tree, map)) {
> > > > > +        return IOVA_ERR_OVERLAP;
> > > > > +    }
> > > > > +
> > > > > +    /* Allocate a node in IOVA address */
> > > > > +    r = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> > > > > +                            tree->iova_last);
> > > > > +    if (r != IOVA_OK) {
> > > > > +        return r;
> > > > > +    }
> > > > > +
> > > > > +    /* Allocate node in qemu -> iova translations */
> > > > > +    new = g_malloc(sizeof(*new));
> > > > > +    memcpy(new, map, sizeof(*new));
> > > > > +    g_tree_insert(tree->taddr_iova_map, new, new);
> > > >
> > > >
> > > > Can the caller map two IOVA ranges to the same e.g GPA range?
> > > >
> > >
> > > It shouldn't matter, because we are totally ignoring GPA here. HVA
> > > could be more problematic.
> > >
> > > We call it from two places: The shadow vring addresses and through the
> > > memory listener. The SVQ vring addresses should already be on a
> > > separated translated address from each one and guest's HVA because of
> > > malloc semantics.
> >
> > Right, so SVQ addresses should be fine, the problem is the guest mappings.
> >
> > >
> > > Regarding the listener, it should already report flattened memory with
> > > no overlapping between the HVA chunks.
> > > vhost_vdpa_listener_skipped_section should skip all problematic
> > > sections if I'm not wrong.
> > >
> > > But I may have missed some scenarios: vdpa devices only care about
> > > IOVA -> HVA translation, so two IOVA could translate to the same HVA
> > > in theory and we would not notice until we try with SVQ. To develop an
> > > algorithm to handle this seems complicated at this moment: Should we
> > > keep the bigger one? The last mapped? What happens if the listener
> > > unmaps one of them, we suddenly must start translating from the not
> > > unmapping? Seems that some kind of stacking would be needed.
> > >
> > > Thanks!
> >
> > It looks to me that we should always try to allocate new iova each
> > time, even if the HVA is the same. This means we need to remove the
> > reverse mapping tree.
> >
> > Currently we had:
> >
> >     /* Check for collisions in translated addresses */
> >     if (vhost_iova_tree_find_iova(tree, map)) {
> >         return IOVA_ERR_OVERLAP;
> >     }
> >
> > We probably need to remove that. And during the translation we need to
> > iterate the whole iova tree to get the reverse mapping instead by
> > returning the largest possible mapping there.
> >
>
> I'm not sure if that is possible. g_tree_insert() calls the comparison
> methods so it knows where to place the new element, so it's expected
> to do something if the node already exists. Looking at the sources it
> actually silently destroys the new node. If we call g_tree_replace, we
> achieve the opposite and destroy the old node. But the tree is
> expected to have non-overlapping keys.

So the problem is that the current IOVA tree design is not a good fit
for our requirements:

static inline void iova_tree_insert_internal(GTree *gtree, DMAMap *range)
{
    /* Key and value are sharing the same range data */
    g_tree_insert(gtree, range, range);
}

It looks to me like we need to extend the current IOVA tree and use the
split IOVA range as the key; this allows us to build an IOVA allocator
on top. If we use the IOVA as the key, we can do:

IOVA1->HVA
IOVA2->HVA

And then we can remove the current taddr_iova_map, which assumes a 1:1
mapping here. When doing HVA to IOVA translation, we need to iterate the
tree, return the first match, and continue the search until we have
covered the requested size.
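
One possible shape for that lookup, as a sketch only: it would sit in
util/iova-tree.c, where the underlying GTree is reachable, and the helper
name and args struct below are illustrative, not existing API.

typedef struct IOVATreeFindIOVAArgs {
    const DMAMap *needle;   /* HVA range we want to translate */
    const DMAMap *result;   /* first IOVA mapping overlapping it */
} IOVATreeFindIOVAArgs;

static gboolean iova_tree_find_iova_iterator(gpointer key, gpointer value,
                                             gpointer data)
{
    const DMAMap *map = key;
    IOVATreeFindIOVAArgs *args = data;
    const DMAMap *needle = args->needle;

    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr) {
        return FALSE;    /* no overlap in HVA, keep walking the tree */
    }

    args->result = map;
    return TRUE;         /* first match wins, stop the traversal */
}

const DMAMap *iova_tree_find_iova(const IOVATree *tree, const DMAMap *map)
{
    IOVATreeFindIOVAArgs args = { .needle = map };

    /* Linear walk; acceptable while the number of memslots stays small */
    g_tree_foreach(tree->tree, iova_tree_find_iova_iterator, &args);
    return args.result;
}

A caller translating a range that spans more than one mapping would have to
call this in a loop, advancing the needle by the size matched so far.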

>
> Apart from that, we're not using this struct as a tree anymore so it's
> better to use directly a list in that case.
>
> But even with the list there are still questions on how to handle
> overlappings. How to handle this deletion:
>
> * Allocate translated_addr 0, size 0x1000.
> * Allocate translated_addr 0, size 0x2000.
> * Delete translated_addr 0, size 0x1000.
>
> Should it delete only the first node? Both of them?

I'd suggest removing the taddr_iova_map.

>
> iova-tree has similar questions too with iova. Inserting (iova=0,
> size=0x1000) and deleting (.iova=0, size=0x800) will delete all the
> whole node, so we cannot search the translation of (.iova=0x900)
> anymore. Is this expected?

Not sure. When a vIOMMU is enabled, the guest does this at its own
risk. When a vIOMMU is not enabled, it would be a qemu bug to add and
remove GPA ranges with different sizes.

Thanks

>
> > But this may degrade the performance, but consider the memslots should
> > not be much at most of the time, it should be fine.
> >
> > Thanks
> >
> >
> > >
> > > > Thanks
> > > >
> > > >
> > > > > +    return IOVA_OK;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Remove existing mappings from iova tree
> > > > > + *
> > > > > + * @param  iova_tree  The vhost iova tree
> > > > > + * @param  map        The map to remove
> > > > > + */
> > > > > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map)
> > > > > +{
> > > > > +    const DMAMap *overlap;
> > > > > +
> > > > > +    iova_tree_remove(iova_tree->iova_taddr_map, map);
> > > > > +    while ((overlap = vhost_iova_tree_find_iova(iova_tree, map))) {
> > > > > +        g_tree_remove(iova_tree->taddr_iova_map, overlap);
> > > > > +    }
> > > > > +}
> > > > > diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> > > > > index 2dc87613bc..6047670804 100644
> > > > > --- a/hw/virtio/meson.build
> > > > > +++ b/hw/virtio/meson.build
> > > > > @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
> > > > >
> > > > >   virtio_ss = ss.source_set()
> > > > >   virtio_ss.add(files('virtio.c'))
> > > > > -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> > > > > +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
> > > > >   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
> > > > >   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
> > > > >   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
> > > >
> > >
> >
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 09/14] vhost: Add VhostIOVATree
@ 2022-03-07  3:41             ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-07  3:41 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Fri, Mar 4, 2022 at 4:02 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Fri, Mar 4, 2022 at 3:04 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Mar 4, 2022 at 12:33 AM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Feb 28, 2022 at 8:06 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > >
> > > > 在 2022/2/27 下午9:41, Eugenio Pérez 写道:
> > > > > This tree is able to look for a translated address from an IOVA address.
> > > > >
> > > > > At first glance it is similar to util/iova-tree. However, SVQ working on
> > > > > devices with limited IOVA space need more capabilities, like allocating
> > > > > IOVA chunks or performing reverse translations (qemu addresses to iova).
> > > > >
> > > > > The allocation capability, as "assign a free IOVA address to this chunk
> > > > > of memory in qemu's address space" allows shadow virtqueue to create a
> > > > > new address space that is not restricted by guest's addressable one, so
> > > > > we can allocate shadow vqs vrings outside of it.
> > > > >
> > > > > It duplicates the tree so it can search efficiently in both directions,
> > > > > and it will signal overlap if iova or the translated address is present
> > > > > in any tree.
> > > > >
> > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > ---
> > > > >   hw/virtio/vhost-iova-tree.h |  27 +++++++
> > > > >   hw/virtio/vhost-iova-tree.c | 155 ++++++++++++++++++++++++++++++++++++
> > > > >   hw/virtio/meson.build       |   2 +-
> > > > >   3 files changed, 183 insertions(+), 1 deletion(-)
> > > > >   create mode 100644 hw/virtio/vhost-iova-tree.h
> > > > >   create mode 100644 hw/virtio/vhost-iova-tree.c
> > > > >
> > > > > diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> > > > > new file mode 100644
> > > > > index 0000000000..6a4f24e0f9
> > > > > --- /dev/null
> > > > > +++ b/hw/virtio/vhost-iova-tree.h
> > > > > @@ -0,0 +1,27 @@
> > > > > +/*
> > > > > + * vhost software live migration iova tree
> > > > > + *
> > > > > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > > > > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > > > > + *
> > > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > > + */
> > > > > +
> > > > > +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> > > > > +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> > > > > +
> > > > > +#include "qemu/iova-tree.h"
> > > > > +#include "exec/memory.h"
> > > > > +
> > > > > +typedef struct VhostIOVATree VhostIOVATree;
> > > > > +
> > > > > +VhostIOVATree *vhost_iova_tree_new(uint64_t iova_first, uint64_t iova_last);
> > > > > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree);
> > > > > +G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_delete);
> > > > > +
> > > > > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> > > > > +                                        const DMAMap *map);
> > > > > +int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> > > > > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map);
> > > > > +
> > > > > +#endif
> > > > > diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> > > > > new file mode 100644
> > > > > index 0000000000..03496ac075
> > > > > --- /dev/null
> > > > > +++ b/hw/virtio/vhost-iova-tree.c
> > > > > @@ -0,0 +1,155 @@
> > > > > +/*
> > > > > + * vhost software live migration iova tree
> > > > > + *
> > > > > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > > > > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > > > > + *
> > > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > > + */
> > > > > +
> > > > > +#include "qemu/osdep.h"
> > > > > +#include "qemu/iova-tree.h"
> > > > > +#include "vhost-iova-tree.h"
> > > > > +
> > > > > +#define iova_min_addr qemu_real_host_page_size
> > > > > +
> > > > > +/**
> > > > > + * VhostIOVATree, able to:
> > > > > + * - Translate iova address
> > > > > + * - Reverse translate iova address (from translated to iova)
> > > > > + * - Allocate IOVA regions for translated range (linear operation)
> > > > > + */
> > > > > +struct VhostIOVATree {
> > > > > +    /* First addressable iova address in the device */
> > > > > +    uint64_t iova_first;
> > > > > +
> > > > > +    /* Last addressable iova address in the device */
> > > > > +    uint64_t iova_last;
> > > > > +
> > > > > +    /* IOVA address to qemu memory maps. */
> > > > > +    IOVATree *iova_taddr_map;
> > > > > +
> > > > > +    /* QEMU virtual memory address to iova maps */
> > > > > +    GTree *taddr_iova_map;
> > > > > +};
> > > > > +
> > > > > +static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b,
> > > > > +                                      gpointer data)
> > > > > +{
> > > > > +    const DMAMap *m1 = a, *m2 = b;
> > > > > +
> > > > > +    if (m1->translated_addr > m2->translated_addr + m2->size) {
> > > > > +        return 1;
> > > > > +    }
> > > > > +
> > > > > +    if (m1->translated_addr + m1->size < m2->translated_addr) {
> > > > > +        return -1;
> > > > > +    }
> > > > > +
> > > > > +    /* Overlapped */
> > > > > +    return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Create a new IOVA tree
> > > > > + *
> > > > > + * Returns the new IOVA tree
> > > > > + */
> > > > > +VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> > > > > +{
> > > > > +    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
> > > > > +
> > > > > +    /* Some devices do not like 0 addresses */
> > > > > +    tree->iova_first = MAX(iova_first, iova_min_addr);
> > > > > +    tree->iova_last = iova_last;
> > > > > +
> > > > > +    tree->iova_taddr_map = iova_tree_new();
> > > > > +    tree->taddr_iova_map = g_tree_new_full(vhost_iova_tree_cmp_taddr, NULL,
> > > > > +                                           NULL, g_free);
> > > > > +    return tree;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Delete an iova tree
> > > > > + */
> > > > > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> > > > > +{
> > > > > +    iova_tree_destroy(iova_tree->iova_taddr_map);
> > > > > +    g_tree_unref(iova_tree->taddr_iova_map);
> > > > > +    g_free(iova_tree);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Find the IOVA address stored from a memory address
> > > > > + *
> > > > > + * @tree     The iova tree
> > > > > + * @map      The map with the memory address
> > > > > + *
> > > > > + * Return the stored mapping, or NULL if not found.
> > > > > + */
> > > > > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
> > > > > +                                        const DMAMap *map)
> > > > > +{
> > > > > +    return g_tree_lookup(tree->taddr_iova_map, map);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Allocate a new mapping
> > > > > + *
> > > > > + * @tree  The iova tree
> > > > > + * @map   The iova map
> > > > > + *
> > > > > + * Returns:
> > > > > + * - IOVA_OK if the map fits in the container
> > > > > + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> > > > > + * - IOVA_ERR_OVERLAP if the tree already contains that map
> > > > > + * - IOVA_ERR_NOMEM if tree cannot allocate more space.
> > > > > + *
> > > > > + * It returns assignated iova in map->iova if return value is VHOST_DMA_MAP_OK.
> > > > > + */
> > > > > +int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> > > > > +{
> > > > > +    /* Some vhost devices do not like addr 0. Skip first page */
> > > > > +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size;
> > > > > +    DMAMap *new;
> > > > > +    int r;
> > > > > +
> > > > > +    if (map->translated_addr + map->size < map->translated_addr ||
> > > > > +        map->perm == IOMMU_NONE) {
> > > > > +        return IOVA_ERR_INVALID;
> > > > > +    }
> > > > > +
> > > > > +    /* Check for collisions in translated addresses */
> > > > > +    if (vhost_iova_tree_find_iova(tree, map)) {
> > > > > +        return IOVA_ERR_OVERLAP;
> > > > > +    }
> > > > > +
> > > > > +    /* Allocate a node in IOVA address */
> > > > > +    r = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> > > > > +                            tree->iova_last);
> > > > > +    if (r != IOVA_OK) {
> > > > > +        return r;
> > > > > +    }
> > > > > +
> > > > > +    /* Allocate node in qemu -> iova translations */
> > > > > +    new = g_malloc(sizeof(*new));
> > > > > +    memcpy(new, map, sizeof(*new));
> > > > > +    g_tree_insert(tree->taddr_iova_map, new, new);
> > > >
> > > >
> > > > Can the caller map two IOVA ranges to the same e.g GPA range?
> > > >
> > >
> > > It shouldn't matter, because we are totally ignoring GPA here. HVA
> > > could be more problematic.
> > >
> > > We call it from two places: the shadow vring addresses and the memory
> > > listener. The SVQ vring addresses should already be at translated
> > > addresses separate from each other and from the guest's HVA because of
> > > malloc semantics.
> >
> > Right, so SVQ addresses should be fine, the problem is the guest mappings.
> >
> > >
> > > Regarding the listener, it should already report flattened memory with
> > > no overlap between the HVA chunks.
> > > vhost_vdpa_listener_skipped_section should skip all problematic
> > > sections if I'm not wrong.
> > >
> > > But I may have missed some scenarios: vdpa devices only care about the
> > > IOVA -> HVA translation, so two IOVAs could translate to the same HVA
> > > in theory and we would not notice until we try with SVQ. Developing an
> > > algorithm to handle this seems complicated at this moment: should we
> > > keep the bigger one? The last one mapped? If the listener unmaps one of
> > > them, must we suddenly start translating from the one that was not
> > > unmapped? It seems some kind of stacking would be needed.
> > >
> > > Thanks!
> >
> > It looks to me that we should always try to allocate a new iova each
> > time, even if the HVA is the same. This means we need to remove the
> > reverse mapping tree.
> >
> > Currently we had:
> >
> >     /* Check for collisions in translated addresses */
> >     if (vhost_iova_tree_find_iova(tree, map)) {
> >         return IOVA_ERR_OVERLAP;
> >     }
> >
> > We probably need to remove that. And during the translation we need to
> > iterate the whole iova tree to get the reverse mapping instead,
> > returning the largest possible mapping there.
> >
>
> I'm not sure if that is possible. g_tree_insert() calls the comparison
> methods so it knows where to place the new element, so it's expected
> to do something if the node already exists. Looking at the sources it
> actually silently destroys the new node. If we call g_tree_replace, we
> achieve the opposite and destroy the old node. But the tree is
> expected to have non-overlapping keys.

So the problem is that the current IOVA tree design does not fit our
requirement:

static inline void iova_tree_insert_internal(GTree *gtree, DMAMap *range)
{
    /* Key and value are sharing the same range data */
    g_tree_insert(gtree, range, range);
}

It looks to me that we need to extend the current IOVA tree and split the
IOVA range out as the key; this allows us to build an IOVA allocator on
top. If we use IOVA as the key, we can have:

IOVA1->HVA
IOVA2->HVA

And then we can remove the current taddr_iova_map, which assumes a 1:1
mapping here. When doing HVA to IOVA translation, we need to iterate the
tree, return the first match, and continue the search until the requested
size is covered.
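
Roughly, the reverse direction could look like the sketch below; the local
DMAMap/ReverseLookup definitions, the helper names and the linear
g_tree_foreach() walk are only illustrative assumptions, not the existing
iova-tree API:

#include <glib.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Same fields as DMAMap in qemu/iova-tree.h; size is taken as inclusive */
typedef struct DMAMap {
    hwaddr iova;
    hwaddr translated_addr;
    hwaddr size;
} DMAMap;

typedef struct ReverseLookup {
    hwaddr hva;              /* host address to translate */
    const DMAMap *match;     /* first mapping that contains it */
} ReverseLookup;

/* GTraverseFunc callback: return TRUE to stop once a containing map is found */
static gboolean find_hva(gpointer key, gpointer value, gpointer data)
{
    const DMAMap *map = value;
    ReverseLookup *l = data;

    (void)key;
    if (l->hva >= map->translated_addr &&
        l->hva <= map->translated_addr + map->size) {
        l->match = map;
        return TRUE;
    }
    return FALSE;
}

/*
 * HVA -> IOVA lookup over an IOVA-keyed tree: since two IOVA ranges may point
 * at the same HVA, this returns the first match; a complete version would
 * keep walking until the requested size is covered.
 */
static const DMAMap *reverse_translate(GTree *iova_keyed_tree, hwaddr hva)
{
    ReverseLookup l = { .hva = hva };

    g_tree_foreach(iova_keyed_tree, find_hva, &l);
    return l.match;
}

This keeps duplicated HVAs allowed at the cost of a linear walk, which should
be acceptable given the small number of memslots mentioned below.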

>
> Apart from that, we're not using this struct as a tree anymore, so it's
> better to use a list directly in that case.
>
> But even with the list there are still questions about how to handle
> overlaps. How should we handle this deletion:
>
> * Allocate translated_addr 0, size 0x1000.
> * Allocate translated_addr 0, size 0x2000.
> * Delete translated_addr 0, size 0x1000.
>
> Should it delete only the first node? Both of them?

I'd suggest removing the taddr_iova_map.

>
> iova-tree has similar questions with iova too. Inserting (iova=0,
> size=0x1000) and deleting (.iova=0, size=0x800) will delete the whole
> node, so we cannot search the translation of (.iova=0x900) anymore. Is
> this expected?

Not sure. When vIOMMU is enabled, the guest takes that risk on itself if
it does this. When vIOMMU is not enabled, it would be a bug in qemu to
add and remove GPA ranges with different sizes.

Thanks

>
> > This may degrade the performance, but considering there should not be
> > many memslots most of the time, it should be fine.
> >
> > Thanks
> >
> >
> > >
> > > > Thanks
> > > >
> > > >
> > > > > +    return IOVA_OK;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Remove existing mappings from iova tree
> > > > > + *
> > > > > + * @param  iova_tree  The vhost iova tree
> > > > > + * @param  map        The map to remove
> > > > > + */
> > > > > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map)
> > > > > +{
> > > > > +    const DMAMap *overlap;
> > > > > +
> > > > > +    iova_tree_remove(iova_tree->iova_taddr_map, map);
> > > > > +    while ((overlap = vhost_iova_tree_find_iova(iova_tree, map))) {
> > > > > +        g_tree_remove(iova_tree->taddr_iova_map, overlap);
> > > > > +    }
> > > > > +}
> > > > > diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> > > > > index 2dc87613bc..6047670804 100644
> > > > > --- a/hw/virtio/meson.build
> > > > > +++ b/hw/virtio/meson.build
> > > > > @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
> > > > >
> > > > >   virtio_ss = ss.source_set()
> > > > >   virtio_ss.add(files('virtio.c'))
> > > > > -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> > > > > +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
> > > > >   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
> > > > >   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
> > > > >   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
> > > >
> > >
> >
>
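
For reference, driving the API quoted above would look roughly like the sketch
below; the buffer, the iova range limit and the assertion are made up for
illustration, and error handling is trimmed:

#include "qemu/osdep.h"
#include "hw/virtio/vhost-iova-tree.h"

/* Illustrative only: a buffer we want the device to reach through SVQ */
static char example_buf[4096];

static void vhost_iova_tree_example(void)
{
    /* The iova range limit is made up; the real one comes from the device */
    VhostIOVATree *tree = vhost_iova_tree_new(0, (1ULL << 36) - 1);
    DMAMap map = {
        .translated_addr = (hwaddr)example_buf,
        .size = sizeof(example_buf) - 1,   /* inclusive size, as in the patch */
        .perm = IOMMU_RW,
    };

    if (vhost_iova_tree_map_alloc(tree, &map) == IOVA_OK) {
        /* map.iova now holds the allocated device-visible address */
        const DMAMap *found = vhost_iova_tree_find_iova(tree, &map);
        g_assert(found && found->iova == map.iova);
        vhost_iova_tree_remove(tree, &map);
    }

    vhost_iova_tree_delete(tree);
}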



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ
  2022-03-03 11:35         ` Eugenio Perez Martin
@ 2022-03-07  4:24             ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-07  4:24 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, virtualization, Eli Cohen,
	Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Laurent Vivier,
	Eduardo Habkost, Richard Henderson, Gautam Dawar, Xiao W Wang,
	Stefan Hajnoczi, Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/3/3 at 7:35 PM, Eugenio Perez Martin wrote:
> On Thu, Mar 3, 2022 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2022/3/1 at 4:50 PM, Eugenio Perez Martin wrote:
>>> On Mon, Feb 28, 2022 at 8:37 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2022/2/27 at 9:41 PM, Eugenio Pérez wrote:
>>>>> Use translations added in VhostIOVATree in SVQ.
>>>>>
>>>>> Only introduce usage here, not allocation and deallocation. As with
>>>>> previous patches, we use the dead code paths of shadow_vqs_enabled to
>>>>> avoid commiting too many changes at once. These are impossible to take
>>>>> at the moment.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>     hw/virtio/vhost-shadow-virtqueue.h |   6 +-
>>>>>     include/hw/virtio/vhost-vdpa.h     |   3 +
>>>>>     hw/virtio/vhost-shadow-virtqueue.c |  76 ++++++++++++++++-
>>>>>     hw/virtio/vhost-vdpa.c             | 128 ++++++++++++++++++++++++-----
>>>>>     4 files changed, 187 insertions(+), 26 deletions(-)
>>>>>
>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>>>>> index 04c67685fd..b2f722d101 100644
>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>>>>> @@ -13,6 +13,7 @@
>>>>>     #include "qemu/event_notifier.h"
>>>>>     #include "hw/virtio/virtio.h"
>>>>>     #include "standard-headers/linux/vhost_types.h"
>>>>> +#include "hw/virtio/vhost-iova-tree.h"
>>>>>
>>>>>     /* Shadow virtqueue to relay notifications */
>>>>>     typedef struct VhostShadowVirtqueue {
>>>>> @@ -43,6 +44,9 @@ typedef struct VhostShadowVirtqueue {
>>>>>         /* Virtio device */
>>>>>         VirtIODevice *vdev;
>>>>>
>>>>> +    /* IOVA mapping */
>>>>> +    VhostIOVATree *iova_tree;
>>>>> +
>>>>>         /* Map for use the guest's descriptors */
>>>>>         VirtQueueElement **ring_id_maps;
>>>>>
>>>>> @@ -78,7 +82,7 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
>>>>>                          VirtQueue *vq);
>>>>>     void vhost_svq_stop(VhostShadowVirtqueue *svq);
>>>>>
>>>>> -VhostShadowVirtqueue *vhost_svq_new(void);
>>>>> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree);
>>>>>
>>>>>     void vhost_svq_free(gpointer vq);
>>>>>     G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
>>>>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
>>>>> index 009a9f3b6b..ee8e939ad0 100644
>>>>> --- a/include/hw/virtio/vhost-vdpa.h
>>>>> +++ b/include/hw/virtio/vhost-vdpa.h
>>>>> @@ -14,6 +14,7 @@
>>>>>
>>>>>     #include <gmodule.h>
>>>>>
>>>>> +#include "hw/virtio/vhost-iova-tree.h"
>>>>>     #include "hw/virtio/virtio.h"
>>>>>     #include "standard-headers/linux/vhost_types.h"
>>>>>
>>>>> @@ -30,6 +31,8 @@ typedef struct vhost_vdpa {
>>>>>         MemoryListener listener;
>>>>>         struct vhost_vdpa_iova_range iova_range;
>>>>>         bool shadow_vqs_enabled;
>>>>> +    /* IOVA mapping used by the Shadow Virtqueue */
>>>>> +    VhostIOVATree *iova_tree;
>>>>>         GPtrArray *shadow_vqs;
>>>>>         struct vhost_dev *dev;
>>>>>         VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>>>> index a38d313755..7e073773d1 100644
>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>>>> @@ -11,6 +11,7 @@
>>>>>     #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>>>
>>>>>     #include "qemu/error-report.h"
>>>>> +#include "qemu/log.h"
>>>>>     #include "qemu/main-loop.h"
>>>>>     #include "qemu/log.h"
>>>>>     #include "linux-headers/linux/vhost.h"
>>>>> @@ -84,7 +85,58 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
>>>>>         }
>>>>>     }
>>>>>
>>>>> +/**
>>>>> + * Translate addresses between the qemu's virtual address and the SVQ IOVA
>>>>> + *
>>>>> + * @svq    Shadow VirtQueue
>>>>> + * @vaddr  Translated IOVA addresses
>>>>> + * @iovec  Source qemu's VA addresses
>>>>> + * @num    Length of iovec and minimum length of vaddr
>>>>> + */
>>>>> +static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
>>>>> +                                     void **addrs, const struct iovec *iovec,
>>>>> +                                     size_t num)
>>>>> +{
>>>>> +    if (num == 0) {
>>>>> +        return true;
>>>>> +    }
>>>>> +
>>>>> +    for (size_t i = 0; i < num; ++i) {
>>>>> +        DMAMap needle = {
>>>>> +            .translated_addr = (hwaddr)iovec[i].iov_base,
>>>>> +            .size = iovec[i].iov_len,
>>>>> +        };
>>>>> +        size_t off;
>>>>> +
>>>>> +        const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree, &needle);
>>>>> +        /*
>>>>> +         * Map cannot be NULL since iova map contains all guest space and
>>>>> +         * qemu already has a physical address mapped
>>>>> +         */
>>>>> +        if (unlikely(!map)) {
>>>>> +            qemu_log_mask(LOG_GUEST_ERROR,
>>>>> +                          "Invalid address 0x%"HWADDR_PRIx" given by guest",
>>>>> +                          needle.translated_addr);
>>>>> +            return false;
>>>>> +        }
>>>>> +
>>>>> +        off = needle.translated_addr - map->translated_addr;
>>>>> +        addrs[i] = (void *)(map->iova + off);
>>>>> +
>>>>> +        if (unlikely(int128_gt(int128_add(needle.translated_addr,
>>>>> +                                          iovec[i].iov_len),
>>>>> +                               map->translated_addr + map->size))) {
>>>>> +            qemu_log_mask(LOG_GUEST_ERROR,
>>>>> +                          "Guest buffer expands over iova range");
>>>>> +            return false;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>>     static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>>>> +                                    void * const *vaddr_sg,
>>>> Nit: it looks to me we are not passing vaddr but iova here, so it might
>>>> be better to use "sg"?
>>>>
>>> Sure, this is a leftover of pre-IOVA translations. I agree it's better
>>> to write it as you say.
>>>
>>>>>                                         const struct iovec *iovec,
>>>>>                                         size_t num, bool more_descs, bool write)
>>>>>     {
>>>>> @@ -103,7 +155,7 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>>>>             } else {
>>>>>                 descs[i].flags = flags;
>>>>>             }
>>>>> -        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
>>>>> +        descs[i].addr = cpu_to_le64((hwaddr)vaddr_sg[n]);
>>>>>             descs[i].len = cpu_to_le32(iovec[n].iov_len);
>>>>>
>>>>>             last = i;
>>>>> @@ -119,6 +171,8 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
>>>>>     {
>>>>>         unsigned avail_idx;
>>>>>         vring_avail_t *avail = svq->vring.avail;
>>>>> +    bool ok;
>>>>> +    g_autofree void **sgs = g_new(void *, MAX(elem->out_num, elem->in_num));
>>>>>
>>>>>         *head = svq->free_head;
>>>>>
>>>>> @@ -129,9 +183,20 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
>>>>>             return false;
>>>>>         }
>>>>>
>>>>> -    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>>>>> +    ok = vhost_svq_translate_addr(svq, sgs, elem->out_sg, elem->out_num);
>>>>> +    if (unlikely(!ok)) {
>>>>> +        return false;
>>>>> +    }
>>>>> +    vhost_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
>>>>>                                 elem->in_num > 0, false);
>>>>> -    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>>>>> +
>>>>> +
>>>>> +    ok = vhost_svq_translate_addr(svq, sgs, elem->in_sg, elem->in_num);
>>>>> +    if (unlikely(!ok)) {
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    vhost_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, false, true);
>>>>>
>>>>>         /*
>>>>>          * Put the entry in the available array (but don't update avail->idx until
>>>>> @@ -514,11 +579,13 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
>>>>>      * Creates vhost shadow virtqueue, and instructs the vhost device to use the
>>>>>      * shadow methods and file descriptors.
>>>>>      *
>>>>> + * @iova_tree Tree to perform descriptors translations
>>>>> + *
>>>>>      * Returns the new virtqueue or NULL.
>>>>>      *
>>>>>      * In case of error, reason is reported through error_report.
>>>>>      */
>>>>> -VhostShadowVirtqueue *vhost_svq_new(void)
>>>>> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree)
>>>>>     {
>>>>>         g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
>>>>>         int r;
>>>>> @@ -539,6 +606,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
>>>>>
>>>>>         event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
>>>>>         event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
>>>>> +    svq->iova_tree = iova_tree;
>>>>>         return g_steal_pointer(&svq);
>>>>>
>>>>>     err_init_hdev_call:
>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>> index 435b9c2e9e..56f9f125cd 100644
>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>> @@ -209,6 +209,21 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>>                                              vaddr, section->readonly);
>>>>>
>>>>>         llsize = int128_sub(llend, int128_make64(iova));
>>>>> +    if (v->shadow_vqs_enabled) {
>>>>> +        DMAMap mem_region = {
>>>>> +            .translated_addr = (hwaddr)vaddr,
>>>>> +            .size = int128_get64(llsize) - 1,
>>>>> +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>>>>> +        };
>>>>> +
>>>>> +        int r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
>>>>> +        if (unlikely(r != IOVA_OK)) {
>>>>> +            error_report("Can't allocate a mapping (%d)", r);
>>>>> +            goto fail;
>>>>> +        }
>>>>> +
>>>>> +        iova = mem_region.iova;
>>>>> +    }
>>>>>
>>>>>         vhost_vdpa_iotlb_batch_begin_once(v);
>>>>>         ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
>>>>> @@ -261,6 +276,20 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>>>>>
>>>>>         llsize = int128_sub(llend, int128_make64(iova));
>>>>>
>>>>> +    if (v->shadow_vqs_enabled) {
>>>>> +        const DMAMap *result;
>>>>> +        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
>>>>> +            section->offset_within_region +
>>>>> +            (iova - section->offset_within_address_space);
>>>>> +        DMAMap mem_region = {
>>>>> +            .translated_addr = (hwaddr)vaddr,
>>>>> +            .size = int128_get64(llsize) - 1,
>>>>> +        };
>>>>> +
>>>>> +        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
>>>>> +        iova = result->iova;
>>>>> +        vhost_iova_tree_remove(v->iova_tree, &mem_region);
>>>>> +    }
>>>>>         vhost_vdpa_iotlb_batch_begin_once(v);
>>>>>         ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
>>>>>         if (ret) {
>>>>> @@ -383,7 +412,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
>>>>>
>>>>>         shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
>>>>>         for (unsigned n = 0; n < hdev->nvqs; ++n) {
>>>>> -        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
>>>>> +        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree);
>>>>>
>>>>>             if (unlikely(!svq)) {
>>>>>                 error_setg(errp, "Cannot create svq %u", n);
>>>>> @@ -834,37 +863,78 @@ static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
>>>>>     /**
>>>>>      * Unmap a SVQ area in the device
>>>>>      */
>>>>> -static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
>>>>> -                                      hwaddr size)
>>>>> +static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
>>>>> +                                      const DMAMap *needle)
>>>>>     {
>>>>> +    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, needle);
>>>>> +    hwaddr size;
>>>>>         int r;
>>>>>
>>>>> -    size = ROUND_UP(size, qemu_real_host_page_size);
>>>>> -    r = vhost_vdpa_dma_unmap(v, iova, size);
>>>>> +    if (unlikely(!result)) {
>>>>> +        error_report("Unable to find SVQ address to unmap");
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    size = ROUND_UP(result->size, qemu_real_host_page_size);
>>>>> +    r = vhost_vdpa_dma_unmap(v, result->iova, size);
>>>>>         return r == 0;
>>>>>     }
>>>>>
>>>>>     static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
>>>>>                                            const VhostShadowVirtqueue *svq)
>>>>>     {
>>>>> +    DMAMap needle;
>>>>>         struct vhost_vdpa *v = dev->opaque;
>>>>>         struct vhost_vring_addr svq_addr;
>>>>> -    size_t device_size = vhost_svq_device_area_size(svq);
>>>>> -    size_t driver_size = vhost_svq_driver_area_size(svq);
>>>>>         bool ok;
>>>>>
>>>>>         vhost_svq_get_vring_addr(svq, &svq_addr);
>>>>>
>>>>> -    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
>>>>> +    needle = (DMAMap) {
>>>>> +        .translated_addr = svq_addr.desc_user_addr,
>>>>> +    };
>>>> Let's simply initialize the member to zero at the start of this function,
>>>> then we can use needle->translated_addr = XXX here.
>>>>
>>> Sure
>>>
>>>>> +    ok = vhost_vdpa_svq_unmap_ring(v, &needle);
>>>>>         if (unlikely(!ok)) {
>>>>>             return false;
>>>>>         }
>>>>>
>>>>> -    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
>>>>> +    needle = (DMAMap) {
>>>>> +        .translated_addr = svq_addr.used_user_addr,
>>>>> +    };
>>>>> +    return vhost_vdpa_svq_unmap_ring(v, &needle);
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * Map the SVQ area in the device
>>>>> + *
>>>>> + * @v          Vhost-vdpa device
>>>>> + * @needle     The area to search iova
>>>>> + * @errorp     Error pointer
>>>>> + */
>>>>> +static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
>>>>> +                                    Error **errp)
>>>>> +{
>>>>> +    int r;
>>>>> +
>>>>> +    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
>>>>> +    if (unlikely(r != IOVA_OK)) {
>>>>> +        error_setg(errp, "Cannot allocate iova (%d)", r);
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
>>>>> +                           (void *)needle->translated_addr,
>>>>> +                           !(needle->perm & IOMMU_ACCESS_FLAG(0, 1)));
>>>> Let's simply use needle->perm == IOMMU_RO here?
>>>>
>>> The motivation for doing it this way is to be more resilient to future
>>> changes, for example if a new flag is added.
>>>
>>> But I'm totally ok with comparing against IOMMU_RO; I see that scenario
>>> as unlikely at the moment.
>>>
>>>>> +    if (unlikely(r != 0)) {
>>>>> +        error_setg_errno(errp, -r, "Cannot map region to device");
>>>>> +        vhost_iova_tree_remove(v->iova_tree, needle);
>>>>> +    }
>>>>> +
>>>>> +    return r == 0;
>>>>>     }
>>>>>
>>>>>     /**
>>>>> - * Map shadow virtqueue rings in device
>>>>> + * Map the shadow virtqueue rings in the device
>>>>>      *
>>>>>      * @dev   The vhost device
>>>>>      * @svq   The shadow virtqueue
>>>>> @@ -876,28 +946,44 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>>>>>                                          struct vhost_vring_addr *addr,
>>>>>                                          Error **errp)
>>>>>     {
>>>>> +    DMAMap device_region, driver_region;
>>>>> +    struct vhost_vring_addr svq_addr;
>>>>>         struct vhost_vdpa *v = dev->opaque;
>>>>>         size_t device_size = vhost_svq_device_area_size(svq);
>>>>>         size_t driver_size = vhost_svq_driver_area_size(svq);
>>>>> -    int r;
>>>>> +    size_t avail_offset;
>>>>> +    bool ok;
>>>>>
>>>>>         ERRP_GUARD();
>>>>> -    vhost_svq_get_vring_addr(svq, addr);
>>>>> +    vhost_svq_get_vring_addr(svq, &svq_addr);
>>>>>
>>>>> -    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
>>>>> -                           (void *)addr->desc_user_addr, true);
>>>>> -    if (unlikely(r != 0)) {
>>>>> -        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
>>>>> +    driver_region = (DMAMap) {
>>>>> +        .translated_addr = svq_addr.desc_user_addr,
>>>>> +        .size = driver_size - 1,
>>>> Any reason for the "-1" here? I see several places do things like that;
>>>> it's probably a hint of a wrong API somewhere.
>>>>
>>> The "problem" is the API mismatch between _end and _last, i.e. whether
>>> to include the last member in the size or not.
>>>
>>> The IOVA tree needs to use _end so we can allocate the last page in case
>>> the available range ends in (uint64_t)-1 [1]. But if we change
>>> vhost_svq_{device,driver}_area_size to make it inclusive,
>>
>> These functions look sane since they don't return a range. It's up to
>> the caller to decide how to use the size.
>>
> Ok I think I didn't get your comment the first time, so there is a bug
> here. But I'm not sure if we are on the same page regarding the iova
> tree.
>
> Regarding the alignment, it's up to the caller how to use the size.
> But if you introduce a mapping of (iova_1, translated_addr_1, size_1),
> the iova address iova_1+size_1 belongs to that mapping.


This seems to contradict the definition of size_1? E.g. if we get an iova
range starting from 0 and its size is 1, then 1 is not included in that mapping.


> If you want to
> introduce a new mapping (iova_2 = iova_1 + size_1, translated_addr_2,
> size_2) it will be rejected, since it overlaps with the first one.
> That part is not up to the caller.
>
> At this moment, vhost_svq_driver_area_size and
> vhost_svq_device_area_size return in the same terms as sizeof(x). In
> other words, the size is not inclusive, as memset() or VHOST_IOTLB_UPDATE
> expect, for example. We could move the -1 inside these functions,
> and then we would need to adapt the qemu_memalign calls in vhost_svq_start
> and the vhost_vdpa dma_map/unmap.
>
>>>    we need to
>>> use "+1" in calls like qemu_memalign and memset at vhost_svq_start.
>>> Probably in more places too
>>
>> I'm not sure I get it here. Maybe you can show which code may suffer if we
>> don't decrease it by one here.
>>
> Less than I expected I have to say:
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c
> b/hw/virtio/vhost-shadow-virtqueue.c
> index 497237dcbb..b42ba5a3c0 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -479,7 +479,7 @@ size_t vhost_svq_device_area_size(const
> VhostShadowVirtqueue *svq)
>   {
>       size_t used_size = offsetof(vring_used_t, ring) +
>                                       sizeof(vring_used_elem_t) * svq->vring.num;
> -    return ROUND_UP(used_size, qemu_real_host_page_size);
> +    return ROUND_UP(used_size, qemu_real_host_page_size) - 1;
>   }
>
>   /**
> @@ -532,8 +532,8 @@ void vhost_svq_start(VhostShadowVirtqueue *svq,
> VirtIODevice *vdev,
>       svq->vq = vq;
>
>       svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
> -    driver_size = vhost_svq_driver_area_size(svq);
> -    device_size = vhost_svq_device_area_size(svq);
> +    driver_size = vhost_svq_driver_area_size(svq) + 1;
> +    device_size = vhost_svq_device_area_size(svq) + 1;
>       svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
>       desc_size = sizeof(vring_desc_t) * svq->vring.num;
>       svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 5eefc5911a..2bf648de4a 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -955,7 +955,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>
>       driver_region = (DMAMap) {
>           .translated_addr = svq_addr.desc_user_addr,
> -        .size = driver_size - 1,
> +        .size = driver_size,
>           .perm = IOMMU_RO,
>       };
>       ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> @@ -969,7 +969,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>
>       device_region = (DMAMap) {
>           .translated_addr = svq_addr.used_user_addr,
> -        .size = device_size - 1,
> +        .size = device_size,
>           .perm = IOMMU_RW,
>       };
>       ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c
> b/hw/virtio/vhost-shadow-virtqueue.c
> index 497237dcbb..b42ba5a3c0 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -479,7 +479,7 @@ size_t vhost_svq_device_area_size(const
> VhostShadowVirtqueue *svq)
>   {
>       size_t used_size = offsetof(vring_used_t, ring) +
>                                       sizeof(vring_used_elem_t) * svq->vring.num;
> -    return ROUND_UP(used_size, qemu_real_host_page_size);
> +    return ROUND_UP(used_size, qemu_real_host_page_size) - 1;
>   }
>
>   /**
> @@ -532,8 +532,8 @@ void vhost_svq_start(VhostShadowVirtqueue *svq,
> VirtIODevice *vdev,
>       svq->vq = vq;
>
>       svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
> -    driver_size = vhost_svq_driver_area_size(svq);
> -    device_size = vhost_svq_device_area_size(svq);
> +    driver_size = vhost_svq_driver_area_size(svq) + 1;
> +    device_size = vhost_svq_device_area_size(svq) + 1;
>       svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
>       desc_size = sizeof(vring_desc_t) * svq->vring.num;
>       svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 5eefc5911a..2bf648de4a 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -918,7 +918,7 @@ static bool vhost_vdpa_svq_map_ring(struct
> vhost_vdpa *v, DMAMap *needle,
>           return false;
>       }
>
> -    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
> +    r = vhost_vdpa_dma_map(v, needle->iova, needle->size + 1,
>                              (void *)needle->translated_addr,
>                              needle->perm == IOMMU_RO);
>       if (unlikely(r != 0)) {
> @@ -955,7 +955,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>
>       driver_region = (DMAMap) {
>           .translated_addr = svq_addr.desc_user_addr,
> -        .size = driver_size - 1,
> +        .size = driver_size,
>           .perm = IOMMU_RO,
>       };
>       ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> @@ -969,7 +969,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>
>       device_region = (DMAMap) {
>           .translated_addr = svq_addr.used_user_addr,
> -        .size = device_size - 1,
> +        .size = device_size,
>           .perm = IOMMU_RW,
>       };
>       ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> ---


Sorry, I still don't get why the -1/+1 is required. Maybe you can show me
what happens if we don't do this.

Thanks
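
To make the question concrete, here is a minimal, self-contained sketch of the
convention the -1/+1 seems to be serving; map_one_page(), dma_map_len() and the
Map struct are made-up stand-ins for the real qemu helpers, not the actual API:

#include <stddef.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Stand-in for DMAMap: size is treated as inclusive, so the mapped range
 * is [iova, iova + size] and the last valid byte is iova + size. */
typedef struct Map {
    hwaddr iova;
    hwaddr translated_addr;
    hwaddr size;
} Map;

/* Stand-in for vhost_vdpa_dma_map(), which expects a plain byte length */
static int dma_map_len(hwaddr iova, size_t len)
{
    (void)iova;
    (void)len;
    return 0;
}

static int map_one_page(hwaddr hva, size_t page_size)
{
    Map m = {
        .translated_addr = hva,
        .size = page_size - 1,    /* inclusive last byte, hence the -1 */
    };

    /* ... m.iova would be allocated from the iova tree here ... */

    /*
     * The device map takes a length again, hence the +1; passing m.size
     * directly would map page_size - 1 bytes, one byte short, while storing
     * page_size in the tree would make the range spill into the next page.
     */
    return dma_map_len(m.iova, m.size + 1);
}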


>
>> But current code may endup to passing qemu_real_host_page_size - 1 to
>> vhost-VDPA which seems wrong?
>>
>> E.g vhost_svq_device_area_size() return qemu_real_host_page_size, but it
>> was decreased by 1 here for size, then we pass size to vhost_vdpa_dma_map().
>>
> That part needs fixing, but the right solution is not to skip the -1
> but increment to pass to the vhost_vdpa_dma_map. Doing otherwise would
> bring problems with how iova-tree works. It will be included in the
> next series.
>
> Thanks!
>
>> Thanks
>>
>>
>>> QEMU's emulated Intel iommu code solves it using the address mask as
>>> the size, something that does not fit 100% with vhost devices since
>>> they can allocate an arbitrary address of arbitrary size when using
>>> vIOMMU. It's not a problem for vhost-vdpa at this moment since we make
>>> sure we expose unaligned and whole pages with vrings, but I feel it
>>> would only be to move the problem somewhere else.
>>>
>>> Thanks!
>>>
>>> [1] There are alternatives: to use Int128_t, etc. But I think it's
>>> better not to change that in this patch series.
>>>
>>>> Thanks
>>>>
>>>>
>>>>> +        .perm = IOMMU_RO,
>>>>> +    };
>>>>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
>>>>> +    if (unlikely(!ok)) {
>>>>> +        error_prepend(errp, "Cannot create vq driver region: ");
>>>>>             return false;
>>>>>         }
>>>>> +    addr->desc_user_addr = driver_region.iova;
>>>>> +    avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
>>>>> +    addr->avail_user_addr = driver_region.iova + avail_offset;
>>>>>
>>>>> -    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
>>>>> -                           (void *)addr->used_user_addr, false);
>>>>> -    if (unlikely(r != 0)) {
>>>>> -        error_setg_errno(errp, -r, "Cannot create vq device region: ");
>>>>> +    device_region = (DMAMap) {
>>>>> +        .translated_addr = svq_addr.used_user_addr,
>>>>> +        .size = device_size - 1,
>>>>> +        .perm = IOMMU_RW,
>>>>> +    };
>>>>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
>>>>> +    if (unlikely(!ok)) {
>>>>> +        error_prepend(errp, "Cannot create vq device region: ");
>>>>> +        vhost_vdpa_svq_unmap_ring(v, &driver_region);
>>>>>         }
>>>>> +    addr->used_user_addr = device_region.iova;
>>>>>
>>>>> -    return r == 0;
>>>>> +    return ok;
>>>>>     }
>>>>>
>>>>>     static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ
@ 2022-03-07  4:24             ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2022-03-07  4:24 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan


On 2022/3/3 at 7:35 PM, Eugenio Perez Martin wrote:
> On Thu, Mar 3, 2022 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2022/3/1 at 4:50 PM, Eugenio Perez Martin wrote:
>>> On Mon, Feb 28, 2022 at 8:37 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2022/2/27 at 9:41 PM, Eugenio Pérez wrote:
>>>>> Use translations added in VhostIOVATree in SVQ.
>>>>>
>>>>> Only introduce usage here, not allocation and deallocation. As with
>>>>> previous patches, we use the dead code paths of shadow_vqs_enabled to
>>>>> avoid commiting too many changes at once. These are impossible to take
>>>>> at the moment.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>     hw/virtio/vhost-shadow-virtqueue.h |   6 +-
>>>>>     include/hw/virtio/vhost-vdpa.h     |   3 +
>>>>>     hw/virtio/vhost-shadow-virtqueue.c |  76 ++++++++++++++++-
>>>>>     hw/virtio/vhost-vdpa.c             | 128 ++++++++++++++++++++++++-----
>>>>>     4 files changed, 187 insertions(+), 26 deletions(-)
>>>>>
>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>>>>> index 04c67685fd..b2f722d101 100644
>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>>>>> @@ -13,6 +13,7 @@
>>>>>     #include "qemu/event_notifier.h"
>>>>>     #include "hw/virtio/virtio.h"
>>>>>     #include "standard-headers/linux/vhost_types.h"
>>>>> +#include "hw/virtio/vhost-iova-tree.h"
>>>>>
>>>>>     /* Shadow virtqueue to relay notifications */
>>>>>     typedef struct VhostShadowVirtqueue {
>>>>> @@ -43,6 +44,9 @@ typedef struct VhostShadowVirtqueue {
>>>>>         /* Virtio device */
>>>>>         VirtIODevice *vdev;
>>>>>
>>>>> +    /* IOVA mapping */
>>>>> +    VhostIOVATree *iova_tree;
>>>>> +
>>>>>         /* Map for use the guest's descriptors */
>>>>>         VirtQueueElement **ring_id_maps;
>>>>>
>>>>> @@ -78,7 +82,7 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
>>>>>                          VirtQueue *vq);
>>>>>     void vhost_svq_stop(VhostShadowVirtqueue *svq);
>>>>>
>>>>> -VhostShadowVirtqueue *vhost_svq_new(void);
>>>>> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree);
>>>>>
>>>>>     void vhost_svq_free(gpointer vq);
>>>>>     G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
>>>>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
>>>>> index 009a9f3b6b..ee8e939ad0 100644
>>>>> --- a/include/hw/virtio/vhost-vdpa.h
>>>>> +++ b/include/hw/virtio/vhost-vdpa.h
>>>>> @@ -14,6 +14,7 @@
>>>>>
>>>>>     #include <gmodule.h>
>>>>>
>>>>> +#include "hw/virtio/vhost-iova-tree.h"
>>>>>     #include "hw/virtio/virtio.h"
>>>>>     #include "standard-headers/linux/vhost_types.h"
>>>>>
>>>>> @@ -30,6 +31,8 @@ typedef struct vhost_vdpa {
>>>>>         MemoryListener listener;
>>>>>         struct vhost_vdpa_iova_range iova_range;
>>>>>         bool shadow_vqs_enabled;
>>>>> +    /* IOVA mapping used by the Shadow Virtqueue */
>>>>> +    VhostIOVATree *iova_tree;
>>>>>         GPtrArray *shadow_vqs;
>>>>>         struct vhost_dev *dev;
>>>>>         VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
>>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>>>> index a38d313755..7e073773d1 100644
>>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>>>> @@ -11,6 +11,7 @@
>>>>>     #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>>>
>>>>>     #include "qemu/error-report.h"
>>>>> +#include "qemu/log.h"
>>>>>     #include "qemu/main-loop.h"
>>>>>     #include "qemu/log.h"
>>>>>     #include "linux-headers/linux/vhost.h"
>>>>> @@ -84,7 +85,58 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
>>>>>         }
>>>>>     }
>>>>>
>>>>> +/**
>>>>> + * Translate addresses between the qemu's virtual address and the SVQ IOVA
>>>>> + *
>>>>> + * @svq    Shadow VirtQueue
>>>>> + * @vaddr  Translated IOVA addresses
>>>>> + * @iovec  Source qemu's VA addresses
>>>>> + * @num    Length of iovec and minimum length of vaddr
>>>>> + */
>>>>> +static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
>>>>> +                                     void **addrs, const struct iovec *iovec,
>>>>> +                                     size_t num)
>>>>> +{
>>>>> +    if (num == 0) {
>>>>> +        return true;
>>>>> +    }
>>>>> +
>>>>> +    for (size_t i = 0; i < num; ++i) {
>>>>> +        DMAMap needle = {
>>>>> +            .translated_addr = (hwaddr)iovec[i].iov_base,
>>>>> +            .size = iovec[i].iov_len,
>>>>> +        };
>>>>> +        size_t off;
>>>>> +
>>>>> +        const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree, &needle);
>>>>> +        /*
>>>>> +         * Map cannot be NULL since iova map contains all guest space and
>>>>> +         * qemu already has a physical address mapped
>>>>> +         */
>>>>> +        if (unlikely(!map)) {
>>>>> +            qemu_log_mask(LOG_GUEST_ERROR,
>>>>> +                          "Invalid address 0x%"HWADDR_PRIx" given by guest",
>>>>> +                          needle.translated_addr);
>>>>> +            return false;
>>>>> +        }
>>>>> +
>>>>> +        off = needle.translated_addr - map->translated_addr;
>>>>> +        addrs[i] = (void *)(map->iova + off);
>>>>> +
>>>>> +        if (unlikely(int128_gt(int128_add(needle.translated_addr,
>>>>> +                                          iovec[i].iov_len),
>>>>> +                               map->translated_addr + map->size))) {
>>>>> +            qemu_log_mask(LOG_GUEST_ERROR,
>>>>> +                          "Guest buffer expands over iova range");
>>>>> +            return false;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>>     static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>>>> +                                    void * const *vaddr_sg,
>>>> Nit: it looks to me we are not passing vaddr but iova here, so it might
>>>> be better to use "sg"?
>>>>
>>> Sure, this is a leftover of pre-IOVA translations. I agree it's better
>>> to write it as you say.
>>>
>>>>>                                         const struct iovec *iovec,
>>>>>                                         size_t num, bool more_descs, bool write)
>>>>>     {
>>>>> @@ -103,7 +155,7 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>>>>             } else {
>>>>>                 descs[i].flags = flags;
>>>>>             }
>>>>> -        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
>>>>> +        descs[i].addr = cpu_to_le64((hwaddr)vaddr_sg[n]);
>>>>>             descs[i].len = cpu_to_le32(iovec[n].iov_len);
>>>>>
>>>>>             last = i;
>>>>> @@ -119,6 +171,8 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
>>>>>     {
>>>>>         unsigned avail_idx;
>>>>>         vring_avail_t *avail = svq->vring.avail;
>>>>> +    bool ok;
>>>>> +    g_autofree void **sgs = g_new(void *, MAX(elem->out_num, elem->in_num));
>>>>>
>>>>>         *head = svq->free_head;
>>>>>
>>>>> @@ -129,9 +183,20 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
>>>>>             return false;
>>>>>         }
>>>>>
>>>>> -    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>>>>> +    ok = vhost_svq_translate_addr(svq, sgs, elem->out_sg, elem->out_num);
>>>>> +    if (unlikely(!ok)) {
>>>>> +        return false;
>>>>> +    }
>>>>> +    vhost_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
>>>>>                                 elem->in_num > 0, false);
>>>>> -    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>>>>> +
>>>>> +
>>>>> +    ok = vhost_svq_translate_addr(svq, sgs, elem->in_sg, elem->in_num);
>>>>> +    if (unlikely(!ok)) {
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    vhost_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, false, true);
>>>>>
>>>>>         /*
>>>>>          * Put the entry in the available array (but don't update avail->idx until
>>>>> @@ -514,11 +579,13 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
>>>>>      * Creates vhost shadow virtqueue, and instructs the vhost device to use the
>>>>>      * shadow methods and file descriptors.
>>>>>      *
>>>>> + * @iova_tree Tree to perform descriptors translations
>>>>> + *
>>>>>      * Returns the new virtqueue or NULL.
>>>>>      *
>>>>>      * In case of error, reason is reported through error_report.
>>>>>      */
>>>>> -VhostShadowVirtqueue *vhost_svq_new(void)
>>>>> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree)
>>>>>     {
>>>>>         g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
>>>>>         int r;
>>>>> @@ -539,6 +606,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
>>>>>
>>>>>         event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
>>>>>         event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
>>>>> +    svq->iova_tree = iova_tree;
>>>>>         return g_steal_pointer(&svq);
>>>>>
>>>>>     err_init_hdev_call:
>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>> index 435b9c2e9e..56f9f125cd 100644
>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>> @@ -209,6 +209,21 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>>                                              vaddr, section->readonly);
>>>>>
>>>>>         llsize = int128_sub(llend, int128_make64(iova));
>>>>> +    if (v->shadow_vqs_enabled) {
>>>>> +        DMAMap mem_region = {
>>>>> +            .translated_addr = (hwaddr)vaddr,
>>>>> +            .size = int128_get64(llsize) - 1,
>>>>> +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>>>>> +        };
>>>>> +
>>>>> +        int r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
>>>>> +        if (unlikely(r != IOVA_OK)) {
>>>>> +            error_report("Can't allocate a mapping (%d)", r);
>>>>> +            goto fail;
>>>>> +        }
>>>>> +
>>>>> +        iova = mem_region.iova;
>>>>> +    }
>>>>>
>>>>>         vhost_vdpa_iotlb_batch_begin_once(v);
>>>>>         ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
>>>>> @@ -261,6 +276,20 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>>>>>
>>>>>         llsize = int128_sub(llend, int128_make64(iova));
>>>>>
>>>>> +    if (v->shadow_vqs_enabled) {
>>>>> +        const DMAMap *result;
>>>>> +        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
>>>>> +            section->offset_within_region +
>>>>> +            (iova - section->offset_within_address_space);
>>>>> +        DMAMap mem_region = {
>>>>> +            .translated_addr = (hwaddr)vaddr,
>>>>> +            .size = int128_get64(llsize) - 1,
>>>>> +        };
>>>>> +
>>>>> +        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
>>>>> +        iova = result->iova;
>>>>> +        vhost_iova_tree_remove(v->iova_tree, &mem_region);
>>>>> +    }
>>>>>         vhost_vdpa_iotlb_batch_begin_once(v);
>>>>>         ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
>>>>>         if (ret) {
>>>>> @@ -383,7 +412,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
>>>>>
>>>>>         shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
>>>>>         for (unsigned n = 0; n < hdev->nvqs; ++n) {
>>>>> -        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
>>>>> +        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree);
>>>>>
>>>>>             if (unlikely(!svq)) {
>>>>>                 error_setg(errp, "Cannot create svq %u", n);
>>>>> @@ -834,37 +863,78 @@ static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
>>>>>     /**
>>>>>      * Unmap a SVQ area in the device
>>>>>      */
>>>>> -static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
>>>>> -                                      hwaddr size)
>>>>> +static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
>>>>> +                                      const DMAMap *needle)
>>>>>     {
>>>>> +    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, needle);
>>>>> +    hwaddr size;
>>>>>         int r;
>>>>>
>>>>> -    size = ROUND_UP(size, qemu_real_host_page_size);
>>>>> -    r = vhost_vdpa_dma_unmap(v, iova, size);
>>>>> +    if (unlikely(!result)) {
>>>>> +        error_report("Unable to find SVQ address to unmap");
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    size = ROUND_UP(result->size, qemu_real_host_page_size);
>>>>> +    r = vhost_vdpa_dma_unmap(v, result->iova, size);
>>>>>         return r == 0;
>>>>>     }
>>>>>
>>>>>     static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
>>>>>                                            const VhostShadowVirtqueue *svq)
>>>>>     {
>>>>> +    DMAMap needle;
>>>>>         struct vhost_vdpa *v = dev->opaque;
>>>>>         struct vhost_vring_addr svq_addr;
>>>>> -    size_t device_size = vhost_svq_device_area_size(svq);
>>>>> -    size_t driver_size = vhost_svq_driver_area_size(svq);
>>>>>         bool ok;
>>>>>
>>>>>         vhost_svq_get_vring_addr(svq, &svq_addr);
>>>>>
>>>>> -    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
>>>>> +    needle = (DMAMap) {
>>>>> +        .translated_addr = svq_addr.desc_user_addr,
>>>>> +    };
>>>> Let's simply initialize the member to zero at the start of this function,
>>>> then we can use needle->translated_addr = XXX here.
>>>>
>>> Sure
>>>
>>>>> +    ok = vhost_vdpa_svq_unmap_ring(v, &needle);
>>>>>         if (unlikely(!ok)) {
>>>>>             return false;
>>>>>         }
>>>>>
>>>>> -    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
>>>>> +    needle = (DMAMap) {
>>>>> +        .translated_addr = svq_addr.used_user_addr,
>>>>> +    };
>>>>> +    return vhost_vdpa_svq_unmap_ring(v, &needle);
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * Map the SVQ area in the device
>>>>> + *
>>>>> + * @v          Vhost-vdpa device
>>>>> + * @needle     The area to search iova
>>>>> + * @errorp     Error pointer
>>>>> + */
>>>>> +static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
>>>>> +                                    Error **errp)
>>>>> +{
>>>>> +    int r;
>>>>> +
>>>>> +    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
>>>>> +    if (unlikely(r != IOVA_OK)) {
>>>>> +        error_setg(errp, "Cannot allocate iova (%d)", r);
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
>>>>> +                           (void *)needle->translated_addr,
>>>>> +                           !(needle->perm & IOMMU_ACCESS_FLAG(0, 1)));
>>>> Let's simply use needle->perm == IOMMU_RO here?
>>>>
>>> The motivation for doing it this way is to be more resilient to future
>>> changes, for example if a new flag is added.
>>>
>>> But I'm totally ok with comparing against IOMMU_RO; I see that scenario
>>> as unlikely at the moment.
>>>
>>>>> +    if (unlikely(r != 0)) {
>>>>> +        error_setg_errno(errp, -r, "Cannot map region to device");
>>>>> +        vhost_iova_tree_remove(v->iova_tree, needle);
>>>>> +    }
>>>>> +
>>>>> +    return r == 0;
>>>>>     }
>>>>>
>>>>>     /**
>>>>> - * Map shadow virtqueue rings in device
>>>>> + * Map the shadow virtqueue rings in the device
>>>>>      *
>>>>>      * @dev   The vhost device
>>>>>      * @svq   The shadow virtqueue
>>>>> @@ -876,28 +946,44 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>>>>>                                          struct vhost_vring_addr *addr,
>>>>>                                          Error **errp)
>>>>>     {
>>>>> +    DMAMap device_region, driver_region;
>>>>> +    struct vhost_vring_addr svq_addr;
>>>>>         struct vhost_vdpa *v = dev->opaque;
>>>>>         size_t device_size = vhost_svq_device_area_size(svq);
>>>>>         size_t driver_size = vhost_svq_driver_area_size(svq);
>>>>> -    int r;
>>>>> +    size_t avail_offset;
>>>>> +    bool ok;
>>>>>
>>>>>         ERRP_GUARD();
>>>>> -    vhost_svq_get_vring_addr(svq, addr);
>>>>> +    vhost_svq_get_vring_addr(svq, &svq_addr);
>>>>>
>>>>> -    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
>>>>> -                           (void *)addr->desc_user_addr, true);
>>>>> -    if (unlikely(r != 0)) {
>>>>> -        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
>>>>> +    driver_region = (DMAMap) {
>>>>> +        .translated_addr = svq_addr.desc_user_addr,
>>>>> +        .size = driver_size - 1,
>>>> Any reason for the "-1" here? I see several places do things like that;
>>>> it's probably a hint of a wrong API somewhere.
>>>>
>>> The "problem" is the API mismatch between _end and _last, i.e. whether
>>> to include the last member in the size or not.
>>>
>>> The IOVA tree needs to use _end so we can allocate the last page in case
>>> the available range ends in (uint64_t)-1 [1]. But if we change
>>> vhost_svq_{device,driver}_area_size to make it inclusive,
>>
>> These functions look sane since they don't return a range. It's up to
>> the caller to decide how to use the size.
>>
> Ok I think I didn't get your comment the first time, so there is a bug
> here. But I'm not sure if we are on the same page regarding the iova
> tree.
>
> Regarding the alignment, it's up to the caller how to use the size.
> But if you introduce a mapping of (iova_1, translated_addr_1, size_1),
> the iova address iova_1+size_1 belongs to that mapping.


This seems to contradict the definition of size_1? E.g. if we get an iova 
range starting from 0 and its size is 1, then 1 is not included in that mapping.


> If you want to
> introduce a new mapping (iova_2 = iova_1 + size_1, translated_addr_2,
> size_2) it will be rejected, since it overlaps with the first one.
> That part is not up to the caller.
>
> At this moment, vhost_svq_driver_area_size and
> vhost_svq_device_area_size returns in the same terms as sizeof(x). In
> other words, size is not inclusive. As memset() or VHOST_IOTLB_UPDATE
> expects, for example. We could move the -1 inside of these functions,
> and then we need to adapt qemu_memalign calls on vhost_svq_start or
> vhost_vdpa dma_map/unmap.
>
>>>    we need to
>>> use "+1" in calls like qemu_memalign and memset at vhost_svq_start.
>>> Probably in more places too
>>
>> I'm not sure I get here. Maybe you can show which code may suffers if we
>> don't decrease it by one here.
>>
> Less than I expected I have to say:
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c
> b/hw/virtio/vhost-shadow-virtqueue.c
> index 497237dcbb..b42ba5a3c0 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -479,7 +479,7 @@ size_t vhost_svq_device_area_size(const
> VhostShadowVirtqueue *svq)
>   {
>       size_t used_size = offsetof(vring_used_t, ring) +
>                                       sizeof(vring_used_elem_t) * svq->vring.num;
> -    return ROUND_UP(used_size, qemu_real_host_page_size);
> +    return ROUND_UP(used_size, qemu_real_host_page_size) - 1;
>   }
>
>   /**
> @@ -532,8 +532,8 @@ void vhost_svq_start(VhostShadowVirtqueue *svq,
> VirtIODevice *vdev,
>       svq->vq = vq;
>
>       svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
> -    driver_size = vhost_svq_driver_area_size(svq);
> -    device_size = vhost_svq_device_area_size(svq);
> +    driver_size = vhost_svq_driver_area_size(svq) + 1;
> +    device_size = vhost_svq_device_area_size(svq) + 1;
>       svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
>       desc_size = sizeof(vring_desc_t) * svq->vring.num;
>       svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 5eefc5911a..2bf648de4a 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -918,7 +918,7 @@ static bool vhost_vdpa_svq_map_ring(struct
> vhost_vdpa *v, DMAMap *needle,
>           return false;
>       }
>
> -    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
> +    r = vhost_vdpa_dma_map(v, needle->iova, needle->size + 1,
>                              (void *)needle->translated_addr,
>                              needle->perm == IOMMU_RO);
>       if (unlikely(r != 0)) {
> @@ -955,7 +955,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>
>       driver_region = (DMAMap) {
>           .translated_addr = svq_addr.desc_user_addr,
> -        .size = driver_size - 1,
> +        .size = driver_size,
>           .perm = IOMMU_RO,
>       };
>       ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> @@ -969,7 +969,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>
>       device_region = (DMAMap) {
>           .translated_addr = svq_addr.used_user_addr,
> -        .size = device_size - 1,
> +        .size = device_size,
>           .perm = IOMMU_RW,
>       };
>       ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> ---


Sorry, I still don't get why the -1/+1 is required. Maybe you can show me 
what happens if we don't do these adjustments.

Thanks


>
>> But current code may endup to passing qemu_real_host_page_size - 1 to
>> vhost-VDPA which seems wrong?
>>
>> E.g vhost_svq_device_area_size() return qemu_real_host_page_size, but it
>> was decreased by 1 here for size, then we pass size to vhost_vdpa_dma_map().
>>
> That part needs fixing, but the right solution is not to skip the -1
> but increment to pass to the vhost_vdpa_dma_map. Doing otherwise would
> bring problems with how iova-tree works. It will be included in the
> next series.
>
> Thanks!
>
>> Thanks
>>
>>
>>> QEMU's emulated Intel iommu code solves it using the address mask as
>>> the size, something that does not fit 100% with vhost devices since
>>> they can allocate an arbitrary address of arbitrary size when using
>>> vIOMMU. It's not a problem for vhost-vdpa at this moment since we make
>>> sure we expose unaligned and whole pages with vrings, but I feel it
>>> would only be to move the problem somewhere else.
>>>
>>> Thanks!
>>>
>>> [1] There are alternatives: to use Int128_t, etc. But I think it's
>>> better not to change that in this patch series.
>>>
>>>> Thanks
>>>>
>>>>
>>>>> +        .perm = IOMMU_RO,
>>>>> +    };
>>>>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
>>>>> +    if (unlikely(!ok)) {
>>>>> +        error_prepend(errp, "Cannot create vq driver region: ");
>>>>>             return false;
>>>>>         }
>>>>> +    addr->desc_user_addr = driver_region.iova;
>>>>> +    avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
>>>>> +    addr->avail_user_addr = driver_region.iova + avail_offset;
>>>>>
>>>>> -    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
>>>>> -                           (void *)addr->used_user_addr, false);
>>>>> -    if (unlikely(r != 0)) {
>>>>> -        error_setg_errno(errp, -r, "Cannot create vq device region: ");
>>>>> +    device_region = (DMAMap) {
>>>>> +        .translated_addr = svq_addr.used_user_addr,
>>>>> +        .size = device_size - 1,
>>>>> +        .perm = IOMMU_RW,
>>>>> +    };
>>>>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
>>>>> +    if (unlikely(!ok)) {
>>>>> +        error_prepend(errp, "Cannot create vq device region: ");
>>>>> +        vhost_vdpa_svq_unmap_ring(v, &driver_region);
>>>>>         }
>>>>> +    addr->used_user_addr = device_region.iova;
>>>>>
>>>>> -    return r == 0;
>>>>> +    return ok;
>>>>>     }
>>>>>
>>>>>     static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ
  2022-03-07  4:24             ` Jason Wang
  (?)
@ 2022-03-07  7:44             ` Eugenio Perez Martin
  -1 siblings, 0 replies; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-07  7:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Mar 7, 2022 at 5:24 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2022/3/3 7:35 PM, Eugenio Perez Martin wrote:
> > On Thu, Mar 3, 2022 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2022/3/1 4:50 PM, Eugenio Perez Martin wrote:
> >>> On Mon, Feb 28, 2022 at 8:37 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> >>>>> Use translations added in VhostIOVATree in SVQ.
> >>>>>
> >>>>> Only introduce usage here, not allocation and deallocation. As with
> >>>>> previous patches, we use the dead code paths of shadow_vqs_enabled to
> >>>>> avoid commiting too many changes at once. These are impossible to take
> >>>>> at the moment.
> >>>>>
> >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>> ---
> >>>>>     hw/virtio/vhost-shadow-virtqueue.h |   6 +-
> >>>>>     include/hw/virtio/vhost-vdpa.h     |   3 +
> >>>>>     hw/virtio/vhost-shadow-virtqueue.c |  76 ++++++++++++++++-
> >>>>>     hw/virtio/vhost-vdpa.c             | 128 ++++++++++++++++++++++++-----
> >>>>>     4 files changed, 187 insertions(+), 26 deletions(-)
> >>>>>
> >>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> >>>>> index 04c67685fd..b2f722d101 100644
> >>>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
> >>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> >>>>> @@ -13,6 +13,7 @@
> >>>>>     #include "qemu/event_notifier.h"
> >>>>>     #include "hw/virtio/virtio.h"
> >>>>>     #include "standard-headers/linux/vhost_types.h"
> >>>>> +#include "hw/virtio/vhost-iova-tree.h"
> >>>>>
> >>>>>     /* Shadow virtqueue to relay notifications */
> >>>>>     typedef struct VhostShadowVirtqueue {
> >>>>> @@ -43,6 +44,9 @@ typedef struct VhostShadowVirtqueue {
> >>>>>         /* Virtio device */
> >>>>>         VirtIODevice *vdev;
> >>>>>
> >>>>> +    /* IOVA mapping */
> >>>>> +    VhostIOVATree *iova_tree;
> >>>>> +
> >>>>>         /* Map for use the guest's descriptors */
> >>>>>         VirtQueueElement **ring_id_maps;
> >>>>>
> >>>>> @@ -78,7 +82,7 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> >>>>>                          VirtQueue *vq);
> >>>>>     void vhost_svq_stop(VhostShadowVirtqueue *svq);
> >>>>>
> >>>>> -VhostShadowVirtqueue *vhost_svq_new(void);
> >>>>> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree);
> >>>>>
> >>>>>     void vhost_svq_free(gpointer vq);
> >>>>>     G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
> >>>>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> >>>>> index 009a9f3b6b..ee8e939ad0 100644
> >>>>> --- a/include/hw/virtio/vhost-vdpa.h
> >>>>> +++ b/include/hw/virtio/vhost-vdpa.h
> >>>>> @@ -14,6 +14,7 @@
> >>>>>
> >>>>>     #include <gmodule.h>
> >>>>>
> >>>>> +#include "hw/virtio/vhost-iova-tree.h"
> >>>>>     #include "hw/virtio/virtio.h"
> >>>>>     #include "standard-headers/linux/vhost_types.h"
> >>>>>
> >>>>> @@ -30,6 +31,8 @@ typedef struct vhost_vdpa {
> >>>>>         MemoryListener listener;
> >>>>>         struct vhost_vdpa_iova_range iova_range;
> >>>>>         bool shadow_vqs_enabled;
> >>>>> +    /* IOVA mapping used by the Shadow Virtqueue */
> >>>>> +    VhostIOVATree *iova_tree;
> >>>>>         GPtrArray *shadow_vqs;
> >>>>>         struct vhost_dev *dev;
> >>>>>         VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> >>>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>>>> index a38d313755..7e073773d1 100644
> >>>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>>>> @@ -11,6 +11,7 @@
> >>>>>     #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>>>
> >>>>>     #include "qemu/error-report.h"
> >>>>> +#include "qemu/log.h"
> >>>>>     #include "qemu/main-loop.h"
> >>>>>     #include "qemu/log.h"
> >>>>>     #include "linux-headers/linux/vhost.h"
> >>>>> @@ -84,7 +85,58 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>> +/**
> >>>>> + * Translate addresses between the qemu's virtual address and the SVQ IOVA
> >>>>> + *
> >>>>> + * @svq    Shadow VirtQueue
> >>>>> + * @vaddr  Translated IOVA addresses
> >>>>> + * @iovec  Source qemu's VA addresses
> >>>>> + * @num    Length of iovec and minimum length of vaddr
> >>>>> + */
> >>>>> +static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> >>>>> +                                     void **addrs, const struct iovec *iovec,
> >>>>> +                                     size_t num)
> >>>>> +{
> >>>>> +    if (num == 0) {
> >>>>> +        return true;
> >>>>> +    }
> >>>>> +
> >>>>> +    for (size_t i = 0; i < num; ++i) {
> >>>>> +        DMAMap needle = {
> >>>>> +            .translated_addr = (hwaddr)iovec[i].iov_base,
> >>>>> +            .size = iovec[i].iov_len,
> >>>>> +        };
> >>>>> +        size_t off;
> >>>>> +
> >>>>> +        const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree, &needle);
> >>>>> +        /*
> >>>>> +         * Map cannot be NULL since iova map contains all guest space and
> >>>>> +         * qemu already has a physical address mapped
> >>>>> +         */
> >>>>> +        if (unlikely(!map)) {
> >>>>> +            qemu_log_mask(LOG_GUEST_ERROR,
> >>>>> +                          "Invalid address 0x%"HWADDR_PRIx" given by guest",
> >>>>> +                          needle.translated_addr);
> >>>>> +            return false;
> >>>>> +        }
> >>>>> +
> >>>>> +        off = needle.translated_addr - map->translated_addr;
> >>>>> +        addrs[i] = (void *)(map->iova + off);
> >>>>> +
> >>>>> +        if (unlikely(int128_gt(int128_add(needle.translated_addr,
> >>>>> +                                          iovec[i].iov_len),
> >>>>> +                               map->translated_addr + map->size))) {
> >>>>> +            qemu_log_mask(LOG_GUEST_ERROR,
> >>>>> +                          "Guest buffer expands over iova range");
> >>>>> +            return false;
> >>>>> +        }
> >>>>> +    }
> >>>>> +
> >>>>> +    return true;
> >>>>> +}
> >>>>> +
> >>>>>     static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >>>>> +                                    void * const *vaddr_sg,
> >>>> Nit: it looks to me we are not passing vaddr but iova here, so it might
> >>>> be better to use "sg"?
> >>>>
> >>> Sure, this is a leftover of pre-IOVA translations. I see better to
> >>> write as you say.
> >>>
> >>>>>                                         const struct iovec *iovec,
> >>>>>                                         size_t num, bool more_descs, bool write)
> >>>>>     {
> >>>>> @@ -103,7 +155,7 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >>>>>             } else {
> >>>>>                 descs[i].flags = flags;
> >>>>>             }
> >>>>> -        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> >>>>> +        descs[i].addr = cpu_to_le64((hwaddr)vaddr_sg[n]);
> >>>>>             descs[i].len = cpu_to_le32(iovec[n].iov_len);
> >>>>>
> >>>>>             last = i;
> >>>>> @@ -119,6 +171,8 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >>>>>     {
> >>>>>         unsigned avail_idx;
> >>>>>         vring_avail_t *avail = svq->vring.avail;
> >>>>> +    bool ok;
> >>>>> +    g_autofree void **sgs = g_new(void *, MAX(elem->out_num, elem->in_num));
> >>>>>
> >>>>>         *head = svq->free_head;
> >>>>>
> >>>>> @@ -129,9 +183,20 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >>>>>             return false;
> >>>>>         }
> >>>>>
> >>>>> -    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> >>>>> +    ok = vhost_svq_translate_addr(svq, sgs, elem->out_sg, elem->out_num);
> >>>>> +    if (unlikely(!ok)) {
> >>>>> +        return false;
> >>>>> +    }
> >>>>> +    vhost_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
> >>>>>                                 elem->in_num > 0, false);
> >>>>> -    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> >>>>> +
> >>>>> +
> >>>>> +    ok = vhost_svq_translate_addr(svq, sgs, elem->in_sg, elem->in_num);
> >>>>> +    if (unlikely(!ok)) {
> >>>>> +        return false;
> >>>>> +    }
> >>>>> +
> >>>>> +    vhost_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, false, true);
> >>>>>
> >>>>>         /*
> >>>>>          * Put the entry in the available array (but don't update avail->idx until
> >>>>> @@ -514,11 +579,13 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
> >>>>>      * Creates vhost shadow virtqueue, and instructs the vhost device to use the
> >>>>>      * shadow methods and file descriptors.
> >>>>>      *
> >>>>> + * @iova_tree Tree to perform descriptors translations
> >>>>> + *
> >>>>>      * Returns the new virtqueue or NULL.
> >>>>>      *
> >>>>>      * In case of error, reason is reported through error_report.
> >>>>>      */
> >>>>> -VhostShadowVirtqueue *vhost_svq_new(void)
> >>>>> +VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree)
> >>>>>     {
> >>>>>         g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> >>>>>         int r;
> >>>>> @@ -539,6 +606,7 @@ VhostShadowVirtqueue *vhost_svq_new(void)
> >>>>>
> >>>>>         event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
> >>>>>         event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
> >>>>> +    svq->iova_tree = iova_tree;
> >>>>>         return g_steal_pointer(&svq);
> >>>>>
> >>>>>     err_init_hdev_call:
> >>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>>>> index 435b9c2e9e..56f9f125cd 100644
> >>>>> --- a/hw/virtio/vhost-vdpa.c
> >>>>> +++ b/hw/virtio/vhost-vdpa.c
> >>>>> @@ -209,6 +209,21 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>>>>                                              vaddr, section->readonly);
> >>>>>
> >>>>>         llsize = int128_sub(llend, int128_make64(iova));
> >>>>> +    if (v->shadow_vqs_enabled) {
> >>>>> +        DMAMap mem_region = {
> >>>>> +            .translated_addr = (hwaddr)vaddr,
> >>>>> +            .size = int128_get64(llsize) - 1,
> >>>>> +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >>>>> +        };
> >>>>> +
> >>>>> +        int r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
> >>>>> +        if (unlikely(r != IOVA_OK)) {
> >>>>> +            error_report("Can't allocate a mapping (%d)", r);
> >>>>> +            goto fail;
> >>>>> +        }
> >>>>> +
> >>>>> +        iova = mem_region.iova;
> >>>>> +    }
> >>>>>
> >>>>>         vhost_vdpa_iotlb_batch_begin_once(v);
> >>>>>         ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> >>>>> @@ -261,6 +276,20 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
> >>>>>
> >>>>>         llsize = int128_sub(llend, int128_make64(iova));
> >>>>>
> >>>>> +    if (v->shadow_vqs_enabled) {
> >>>>> +        const DMAMap *result;
> >>>>> +        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
> >>>>> +            section->offset_within_region +
> >>>>> +            (iova - section->offset_within_address_space);
> >>>>> +        DMAMap mem_region = {
> >>>>> +            .translated_addr = (hwaddr)vaddr,
> >>>>> +            .size = int128_get64(llsize) - 1,
> >>>>> +        };
> >>>>> +
> >>>>> +        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
> >>>>> +        iova = result->iova;
> >>>>> +        vhost_iova_tree_remove(v->iova_tree, &mem_region);
> >>>>> +    }
> >>>>>         vhost_vdpa_iotlb_batch_begin_once(v);
> >>>>>         ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
> >>>>>         if (ret) {
> >>>>> @@ -383,7 +412,7 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
> >>>>>
> >>>>>         shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
> >>>>>         for (unsigned n = 0; n < hdev->nvqs; ++n) {
> >>>>> -        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new();
> >>>>> +        g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree);
> >>>>>
> >>>>>             if (unlikely(!svq)) {
> >>>>>                 error_setg(errp, "Cannot create svq %u", n);
> >>>>> @@ -834,37 +863,78 @@ static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
> >>>>>     /**
> >>>>>      * Unmap a SVQ area in the device
> >>>>>      */
> >>>>> -static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr iova,
> >>>>> -                                      hwaddr size)
> >>>>> +static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
> >>>>> +                                      const DMAMap *needle)
> >>>>>     {
> >>>>> +    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, needle);
> >>>>> +    hwaddr size;
> >>>>>         int r;
> >>>>>
> >>>>> -    size = ROUND_UP(size, qemu_real_host_page_size);
> >>>>> -    r = vhost_vdpa_dma_unmap(v, iova, size);
> >>>>> +    if (unlikely(!result)) {
> >>>>> +        error_report("Unable to find SVQ address to unmap");
> >>>>> +        return false;
> >>>>> +    }
> >>>>> +
> >>>>> +    size = ROUND_UP(result->size, qemu_real_host_page_size);
> >>>>> +    r = vhost_vdpa_dma_unmap(v, result->iova, size);
> >>>>>         return r == 0;
> >>>>>     }
> >>>>>
> >>>>>     static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> >>>>>                                            const VhostShadowVirtqueue *svq)
> >>>>>     {
> >>>>> +    DMAMap needle;
> >>>>>         struct vhost_vdpa *v = dev->opaque;
> >>>>>         struct vhost_vring_addr svq_addr;
> >>>>> -    size_t device_size = vhost_svq_device_area_size(svq);
> >>>>> -    size_t driver_size = vhost_svq_driver_area_size(svq);
> >>>>>         bool ok;
> >>>>>
> >>>>>         vhost_svq_get_vring_addr(svq, &svq_addr);
> >>>>>
> >>>>> -    ok = vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr, driver_size);
> >>>>> +    needle = (DMAMap) {
> >>>>> +        .translated_addr = svq_addr.desc_user_addr,
> >>>>> +    };
> >>>> Let's simply initialize the member to zero during start of this function
> >>>> then we can use needle->transalted_addr = XXX here.
> >>>>
> >>> Sure
> >>>
> >>>>> +    ok = vhost_vdpa_svq_unmap_ring(v, &needle);
> >>>>>         if (unlikely(!ok)) {
> >>>>>             return false;
> >>>>>         }
> >>>>>
> >>>>> -    return vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr, device_size);
> >>>>> +    needle = (DMAMap) {
> >>>>> +        .translated_addr = svq_addr.used_user_addr,
> >>>>> +    };
> >>>>> +    return vhost_vdpa_svq_unmap_ring(v, &needle);
> >>>>> +}
> >>>>> +
> >>>>> +/**
> >>>>> + * Map the SVQ area in the device
> >>>>> + *
> >>>>> + * @v          Vhost-vdpa device
> >>>>> + * @needle     The area to search iova
> >>>>> + * @errorp     Error pointer
> >>>>> + */
> >>>>> +static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> >>>>> +                                    Error **errp)
> >>>>> +{
> >>>>> +    int r;
> >>>>> +
> >>>>> +    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
> >>>>> +    if (unlikely(r != IOVA_OK)) {
> >>>>> +        error_setg(errp, "Cannot allocate iova (%d)", r);
> >>>>> +        return false;
> >>>>> +    }
> >>>>> +
> >>>>> +    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
> >>>>> +                           (void *)needle->translated_addr,
> >>>>> +                           !(needle->perm & IOMMU_ACCESS_FLAG(0, 1)));
> >>>> Let's simply use needle->perm == IOMMU_RO here?
> >>>>
> >>> The motivation to use this way is to be more resilient to the future.
> >>> For example, if a new flag is added.
> >>>
> >>> But I'm totally ok with comparing with IOMMU_RO, I see that scenario
> >>> unlikely at the moment.
> >>>
> >>>>> +    if (unlikely(r != 0)) {
> >>>>> +        error_setg_errno(errp, -r, "Cannot map region to device");
> >>>>> +        vhost_iova_tree_remove(v->iova_tree, needle);
> >>>>> +    }
> >>>>> +
> >>>>> +    return r == 0;
> >>>>>     }
> >>>>>
> >>>>>     /**
> >>>>> - * Map shadow virtqueue rings in device
> >>>>> + * Map the shadow virtqueue rings in the device
> >>>>>      *
> >>>>>      * @dev   The vhost device
> >>>>>      * @svq   The shadow virtqueue
> >>>>> @@ -876,28 +946,44 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >>>>>                                          struct vhost_vring_addr *addr,
> >>>>>                                          Error **errp)
> >>>>>     {
> >>>>> +    DMAMap device_region, driver_region;
> >>>>> +    struct vhost_vring_addr svq_addr;
> >>>>>         struct vhost_vdpa *v = dev->opaque;
> >>>>>         size_t device_size = vhost_svq_device_area_size(svq);
> >>>>>         size_t driver_size = vhost_svq_driver_area_size(svq);
> >>>>> -    int r;
> >>>>> +    size_t avail_offset;
> >>>>> +    bool ok;
> >>>>>
> >>>>>         ERRP_GUARD();
> >>>>> -    vhost_svq_get_vring_addr(svq, addr);
> >>>>> +    vhost_svq_get_vring_addr(svq, &svq_addr);
> >>>>>
> >>>>> -    r = vhost_vdpa_dma_map(v, addr->desc_user_addr, driver_size,
> >>>>> -                           (void *)addr->desc_user_addr, true);
> >>>>> -    if (unlikely(r != 0)) {
> >>>>> -        error_setg_errno(errp, -r, "Cannot create vq driver region: ");
> >>>>> +    driver_region = (DMAMap) {
> >>>>> +        .translated_addr = svq_addr.desc_user_addr,
> >>>>> +        .size = driver_size - 1,
> >>>> Any reason for the "-1" here? I see several places do things like that,
> >>>> it's probably hint of wrong API somehwere.
> >>>>
> >>> The "problem" is the api mismatch between _end and _last, to include
> >>> the last member in the size or not.
> >>>
> >>> IOVA tree needs to use _end so we can allocate the last page in case
> >>> of available range ending in (uint64_t)-1 [1]. But If we change
> >>> vhost_svq_{device,driver}_area_size to make it inclusive,
> >>
> >> These functions looks sane since it doesn't return a range. It's up to
> >> the caller to decide how to use the size.
> >>
> > Ok I think I didn't get your comment the first time, so there is a bug
> > here. But I'm not sure if we are on the same page regarding the iova
> > tree.
> >
> > Regarding the alignment, it's up to the caller how to use the size.
> > But if you introduce a mapping of (iova_1, translated_addr_1, size_1),
> > the iova address iova_1+size_1 belongs to that mapping.
>
>
> This seems contradict to the definition of size_1? E.g if we get a iova
> range start from 0 and it's size is 1, 1 is not included in that mapping.
>

Yes it is included. I think it's better to trace the code here to explain:

The definition of DMAMap has a doc comment stating that size is /* Inclusive */:
typedef struct DMAMap {
    hwaddr iova;
    hwaddr translated_addr;
    hwaddr size;                /* Inclusive */
    IOMMUAccessFlags perm;
} QEMU_PACKED DMAMap;

And if we trace the code, assuming that we have an iova tree of only
one element .iova=0, .size=1, and we want to add another mapping of
.iova = 1 and .size = 1:

    int iova_tree_insert(IOVATree *tree, const DMAMap *map)
    {
        DMAMap *new;

        if (map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
            return IOVA_ERR_INVALID;
        }

map->iova + map->size does not overflow, and let's assume permissions
are valid because they are out of scope for this discussion.

        /* We don't allow to insert range that overlaps with existings */
        if (iova_tree_find(tree, map)) {
            return IOVA_ERR_OVERLAP;
        }

This will call iova_tree_compare internally. For the purpose of the
example I'm going to assume that the previous mapping of iova == 0 is
m2 and the new one is m1:

    static int iova_tree_compare(gconstpointer a, gconstpointer b,
gpointer data)
    {
        const DMAMap *m1 = a, *m2 = b;

        if (m1->iova > m2->iova + m2->size) {
            return 1;
        }

1 > 0 + 1 -> false

        if (m1->iova + m1->size < m2->iova) {
            return -1;
        }

2 < 0 -> false

        /* Overlapped */
        return 0;

There is no other conclusion: the comparator treats the two maps as
overlapping, so the insertion of the second one is rejected.
    }

And that's in qemu master; it's not caused by SVQ's allocation changes.

I'm starting to think that instead of relying on naming or comments,
we should rely on different types to tell the difference between
inclusive (_size?, _last) and non-inclusive (_size?, _end) sizes.
There should be no runtime cost in using the wrapped members anyway,
and we can either let the compiler do the conversions with _Generic or
force the right type:

struct InclusiveSize {
  hwaddr size;
};

struct RegularSize {
  hwaddr size;
};

So the iova tree functions would use InclusiveSize and
vhost_vdpa_dma_map would use RegularSize. Much like C++ chrono with
::seconds, ::milliseconds, ... or duration vs time_point. It would
have saved me quite some time.
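
Just to make the idea concrete, a rough sketch on top of the two
structs above (nothing below exists in qemu; the conversion helpers
and the _incl wrapper are names I just made up, while hwaddr,
struct vhost_vdpa and vhost_vdpa_dma_map are the existing ones):

static inline struct InclusiveSize size_to_inclusive(struct RegularSize s)
{
    return (struct InclusiveSize) { .size = s.size - 1 };
}

static inline struct RegularSize size_to_regular(struct InclusiveSize s)
{
    return (struct RegularSize) { .size = s.size + 1 };
}

/*
 * Mixing the two kinds of size becomes a compile error instead of a
 * silent off-by-one. A hypothetical wrapper that takes an inclusive
 * size and converts it right before calling vhost_vdpa_dma_map():
 */
static int vhost_vdpa_dma_map_incl(struct vhost_vdpa *v, hwaddr iova,
                                   struct InclusiveSize size, void *vaddr,
                                   bool readonly)
{
    return vhost_vdpa_dma_map(v, iova, size_to_regular(size).size,
                              vaddr, readonly);
}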

>
> > If you want to
> > introduce a new mapping (iova_2 = iova_1 + size_1, translated_addr_2,
> > size_2) it will be rejected, since it overlaps with the first one.
> > That part is not up to the caller.
> >
> > At this moment, vhost_svq_driver_area_size and
> > vhost_svq_device_area_size returns in the same terms as sizeof(x). In
> > other words, size is not inclusive. As memset() or VHOST_IOTLB_UPDATE
> > expects, for example. We could move the -1 inside of these functions,
> > and then we need to adapt qemu_memalign calls on vhost_svq_start or
> > vhost_vdpa dma_map/unmap.
> >
> >>>    we need to
> >>> use "+1" in calls like qemu_memalign and memset at vhost_svq_start.
> >>> Probably in more places too
> >>
> >> I'm not sure I get here. Maybe you can show which code may suffers if we
> >> don't decrease it by one here.
> >>
> > Less than I expected I have to say:
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c
> > b/hw/virtio/vhost-shadow-virtqueue.c
> > index 497237dcbb..b42ba5a3c0 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -479,7 +479,7 @@ size_t vhost_svq_device_area_size(const
> > VhostShadowVirtqueue *svq)
> >   {
> >       size_t used_size = offsetof(vring_used_t, ring) +
> >                                       sizeof(vring_used_elem_t) * svq->vring.num;
> > -    return ROUND_UP(used_size, qemu_real_host_page_size);
> > +    return ROUND_UP(used_size, qemu_real_host_page_size) - 1;
> >   }
> >
> >   /**
> > @@ -532,8 +532,8 @@ void vhost_svq_start(VhostShadowVirtqueue *svq,
> > VirtIODevice *vdev,
> >       svq->vq = vq;
> >
> >       svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
> > -    driver_size = vhost_svq_driver_area_size(svq);
> > -    device_size = vhost_svq_device_area_size(svq);
> > +    driver_size = vhost_svq_driver_area_size(svq) + 1;
> > +    device_size = vhost_svq_device_area_size(svq) + 1;
> >       svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
> >       desc_size = sizeof(vring_desc_t) * svq->vring.num;
> >       svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 5eefc5911a..2bf648de4a 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -918,7 +918,7 @@ static bool vhost_vdpa_svq_map_ring(struct
> > vhost_vdpa *v, DMAMap *needle,
> >           return false;
> >       }
> >
> > -    r = vhost_vdpa_dma_map(v, needle->iova, needle->size,
> > +    r = vhost_vdpa_dma_map(v, needle->iova, needle->size + 1,
> >                              (void *)needle->translated_addr,
> >                              needle->perm == IOMMU_RO);
> >       if (unlikely(r != 0)) {
> > @@ -955,7 +955,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >
> >       driver_region = (DMAMap) {
> >           .translated_addr = svq_addr.desc_user_addr,
> > -        .size = driver_size - 1,
> > +        .size = driver_size,
> >           .perm = IOMMU_RO,
> >       };
> >       ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> > @@ -969,7 +969,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >
> >       device_region = (DMAMap) {
> >           .translated_addr = svq_addr.used_user_addr,
> > -        .size = device_size - 1,
> > +        .size = device_size,
> >           .perm = IOMMU_RW,
> >       };
> >       ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> > ---
>
>
> Sorry, I still don't get why -1/+1 is required. Maybe you can show me
> what happens if we don't do these.
>

I think it's solved with the example above, but let me know if we
should continue here too.

Thanks!

> Thanks
>
>
> >
> >> But current code may endup to passing qemu_real_host_page_size - 1 to
> >> vhost-VDPA which seems wrong?
> >>
> >> E.g vhost_svq_device_area_size() return qemu_real_host_page_size, but it
> >> was decreased by 1 here for size, then we pass size to vhost_vdpa_dma_map().
> >>
> > That part needs fixing, but the right solution is not to skip the -1
> > but increment to pass to the vhost_vdpa_dma_map. Doing otherwise would
> > bring problems with how iova-tree works. It will be included in the
> > next series.
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>
> >>> QEMU's emulated Intel iommu code solves it using the address mask as
> >>> the size, something that does not fit 100% with vhost devices since
> >>> they can allocate an arbitrary address of arbitrary size when using
> >>> vIOMMU. It's not a problem for vhost-vdpa at this moment since we make
> >>> sure we expose unaligned and whole pages with vrings, but I feel it
> >>> would only be to move the problem somewhere else.
> >>>
> >>> Thanks!
> >>>
> >>> [1] There are alternatives: to use Int128_t, etc. But I think it's
> >>> better not to change that in this patch series.
> >>>
> >>>> Thanks
> >>>>
> >>>>
> >>>>> +        .perm = IOMMU_RO,
> >>>>> +    };
> >>>>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> >>>>> +    if (unlikely(!ok)) {
> >>>>> +        error_prepend(errp, "Cannot create vq driver region: ");
> >>>>>             return false;
> >>>>>         }
> >>>>> +    addr->desc_user_addr = driver_region.iova;
> >>>>> +    avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
> >>>>> +    addr->avail_user_addr = driver_region.iova + avail_offset;
> >>>>>
> >>>>> -    r = vhost_vdpa_dma_map(v, addr->used_user_addr, device_size,
> >>>>> -                           (void *)addr->used_user_addr, false);
> >>>>> -    if (unlikely(r != 0)) {
> >>>>> -        error_setg_errno(errp, -r, "Cannot create vq device region: ");
> >>>>> +    device_region = (DMAMap) {
> >>>>> +        .translated_addr = svq_addr.used_user_addr,
> >>>>> +        .size = device_size - 1,
> >>>>> +        .perm = IOMMU_RW,
> >>>>> +    };
> >>>>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> >>>>> +    if (unlikely(!ok)) {
> >>>>> +        error_prepend(errp, "Cannot create vq device region: ");
> >>>>> +        vhost_vdpa_svq_unmap_ring(v, &driver_region);
> >>>>>         }
> >>>>> +    addr->used_user_addr = device_region.iova;
> >>>>>
> >>>>> -    return r == 0;
> >>>>> +    return ok;
> >>>>>     }
> >>>>>
> >>>>>     static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 09/14] vhost: Add VhostIOVATree
  2022-03-07  3:41             ` Jason Wang
  (?)
@ 2022-03-07  8:56             ` Eugenio Perez Martin
  -1 siblings, 0 replies; 69+ messages in thread
From: Eugenio Perez Martin @ 2022-03-07  8:56 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Eli Cohen, Eric Blake, Parav Pandit, Cindy Lu, Fangyi (Eric),
	Markus Armbruster, yebiaoxiang, Liuxiangdong, Stefano Garzarella,
	Laurent Vivier, Eduardo Habkost, Richard Henderson, Gautam Dawar,
	Xiao W Wang, Stefan Hajnoczi, Juan Quintela,
	Harpreet Singh Anand, Paolo Bonzini, Lingshan

On Mon, Mar 7, 2022 at 4:42 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Mar 4, 2022 at 4:02 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Fri, Mar 4, 2022 at 3:04 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, Mar 4, 2022 at 12:33 AM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Mon, Feb 28, 2022 at 8:06 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > >
> > > > > > On 2022/2/27 9:41 PM, Eugenio Pérez wrote:
> > > > > > This tree is able to look for a translated address from an IOVA address.
> > > > > >
> > > > > > At first glance it is similar to util/iova-tree. However, SVQ working on
> > > > > > devices with limited IOVA space need more capabilities, like allocating
> > > > > > IOVA chunks or performing reverse translations (qemu addresses to iova).
> > > > > >
> > > > > > The allocation capability, as "assign a free IOVA address to this chunk
> > > > > > of memory in qemu's address space" allows shadow virtqueue to create a
> > > > > > new address space that is not restricted by guest's addressable one, so
> > > > > > we can allocate shadow vqs vrings outside of it.
> > > > > >
> > > > > > It duplicates the tree so it can search efficiently in both directions,
> > > > > > and it will signal overlap if iova or the translated address is present
> > > > > > in any tree.
> > > > > >
> > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > ---
> > > > > >   hw/virtio/vhost-iova-tree.h |  27 +++++++
> > > > > >   hw/virtio/vhost-iova-tree.c | 155 ++++++++++++++++++++++++++++++++++++
> > > > > >   hw/virtio/meson.build       |   2 +-
> > > > > >   3 files changed, 183 insertions(+), 1 deletion(-)
> > > > > >   create mode 100644 hw/virtio/vhost-iova-tree.h
> > > > > >   create mode 100644 hw/virtio/vhost-iova-tree.c
> > > > > >
> > > > > > diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> > > > > > new file mode 100644
> > > > > > index 0000000000..6a4f24e0f9
> > > > > > --- /dev/null
> > > > > > +++ b/hw/virtio/vhost-iova-tree.h
> > > > > > @@ -0,0 +1,27 @@
> > > > > > +/*
> > > > > > + * vhost software live migration iova tree
> > > > > > + *
> > > > > > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > > > > > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > > > > > + *
> > > > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> > > > > > +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> > > > > > +
> > > > > > +#include "qemu/iova-tree.h"
> > > > > > +#include "exec/memory.h"
> > > > > > +
> > > > > > +typedef struct VhostIOVATree VhostIOVATree;
> > > > > > +
> > > > > > +VhostIOVATree *vhost_iova_tree_new(uint64_t iova_first, uint64_t iova_last);
> > > > > > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree);
> > > > > > +G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_delete);
> > > > > > +
> > > > > > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> > > > > > +                                        const DMAMap *map);
> > > > > > +int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> > > > > > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map);
> > > > > > +
> > > > > > +#endif
> > > > > > diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> > > > > > new file mode 100644
> > > > > > index 0000000000..03496ac075
> > > > > > --- /dev/null
> > > > > > +++ b/hw/virtio/vhost-iova-tree.c
> > > > > > @@ -0,0 +1,155 @@
> > > > > > +/*
> > > > > > + * vhost software live migration iova tree
> > > > > > + *
> > > > > > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > > > > > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > > > > > + *
> > > > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > > > + */
> > > > > > +
> > > > > > +#include "qemu/osdep.h"
> > > > > > +#include "qemu/iova-tree.h"
> > > > > > +#include "vhost-iova-tree.h"
> > > > > > +
> > > > > > +#define iova_min_addr qemu_real_host_page_size
> > > > > > +
> > > > > > +/**
> > > > > > + * VhostIOVATree, able to:
> > > > > > + * - Translate iova address
> > > > > > + * - Reverse translate iova address (from translated to iova)
> > > > > > + * - Allocate IOVA regions for translated range (linear operation)
> > > > > > + */
> > > > > > +struct VhostIOVATree {
> > > > > > +    /* First addressable iova address in the device */
> > > > > > +    uint64_t iova_first;
> > > > > > +
> > > > > > +    /* Last addressable iova address in the device */
> > > > > > +    uint64_t iova_last;
> > > > > > +
> > > > > > +    /* IOVA address to qemu memory maps. */
> > > > > > +    IOVATree *iova_taddr_map;
> > > > > > +
> > > > > > +    /* QEMU virtual memory address to iova maps */
> > > > > > +    GTree *taddr_iova_map;
> > > > > > +};
> > > > > > +
> > > > > > +static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b,
> > > > > > +                                      gpointer data)
> > > > > > +{
> > > > > > +    const DMAMap *m1 = a, *m2 = b;
> > > > > > +
> > > > > > +    if (m1->translated_addr > m2->translated_addr + m2->size) {
> > > > > > +        return 1;
> > > > > > +    }
> > > > > > +
> > > > > > +    if (m1->translated_addr + m1->size < m2->translated_addr) {
> > > > > > +        return -1;
> > > > > > +    }
> > > > > > +
> > > > > > +    /* Overlapped */
> > > > > > +    return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Create a new IOVA tree
> > > > > > + *
> > > > > > + * Returns the new IOVA tree
> > > > > > + */
> > > > > > +VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> > > > > > +{
> > > > > > +    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
> > > > > > +
> > > > > > +    /* Some devices do not like 0 addresses */
> > > > > > +    tree->iova_first = MAX(iova_first, iova_min_addr);
> > > > > > +    tree->iova_last = iova_last;
> > > > > > +
> > > > > > +    tree->iova_taddr_map = iova_tree_new();
> > > > > > +    tree->taddr_iova_map = g_tree_new_full(vhost_iova_tree_cmp_taddr, NULL,
> > > > > > +                                           NULL, g_free);
> > > > > > +    return tree;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Delete an iova tree
> > > > > > + */
> > > > > > +void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> > > > > > +{
> > > > > > +    iova_tree_destroy(iova_tree->iova_taddr_map);
> > > > > > +    g_tree_unref(iova_tree->taddr_iova_map);
> > > > > > +    g_free(iova_tree);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Find the IOVA address stored from a memory address
> > > > > > + *
> > > > > > + * @tree     The iova tree
> > > > > > + * @map      The map with the memory address
> > > > > > + *
> > > > > > + * Return the stored mapping, or NULL if not found.
> > > > > > + */
> > > > > > +const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
> > > > > > +                                        const DMAMap *map)
> > > > > > +{
> > > > > > +    return g_tree_lookup(tree->taddr_iova_map, map);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Allocate a new mapping
> > > > > > + *
> > > > > > + * @tree  The iova tree
> > > > > > + * @map   The iova map
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * - IOVA_OK if the map fits in the container
> > > > > > + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> > > > > > + * - IOVA_ERR_OVERLAP if the tree already contains that map
> > > > > > + * - IOVA_ERR_NOMEM if tree cannot allocate more space.
> > > > > > + *
> > > > > > + * It returns assignated iova in map->iova if return value is VHOST_DMA_MAP_OK.
> > > > > > + */
> > > > > > +int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> > > > > > +{
> > > > > > +    /* Some vhost devices do not like addr 0. Skip first page */
> > > > > > +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size;
> > > > > > +    DMAMap *new;
> > > > > > +    int r;
> > > > > > +
> > > > > > +    if (map->translated_addr + map->size < map->translated_addr ||
> > > > > > +        map->perm == IOMMU_NONE) {
> > > > > > +        return IOVA_ERR_INVALID;
> > > > > > +    }
> > > > > > +
> > > > > > +    /* Check for collisions in translated addresses */
> > > > > > +    if (vhost_iova_tree_find_iova(tree, map)) {
> > > > > > +        return IOVA_ERR_OVERLAP;
> > > > > > +    }
> > > > > > +
> > > > > > +    /* Allocate a node in IOVA address */
> > > > > > +    r = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> > > > > > +                            tree->iova_last);
> > > > > > +    if (r != IOVA_OK) {
> > > > > > +        return r;
> > > > > > +    }
> > > > > > +
> > > > > > +    /* Allocate node in qemu -> iova translations */
> > > > > > +    new = g_malloc(sizeof(*new));
> > > > > > +    memcpy(new, map, sizeof(*new));
> > > > > > +    g_tree_insert(tree->taddr_iova_map, new, new);
> > > > >
> > > > >
> > > > > Can the caller map two IOVA ranges to the same e.g GPA range?
> > > > >
> > > >
> > > > It shouldn't matter, because we are totally ignoring GPA here. HVA
> > > > could be more problematic.
> > > >
> > > > We call it from two places: The shadow vring addresses and through the
> > > > memory listener. The SVQ vring addresses should already be on a
> > > > separated translated address from each one and guest's HVA because of
> > > > malloc semantics.
> > >
> > > Right, so SVQ addresses should be fine, the problem is the guest mappings.
> > >
> > > >
> > > > Regarding the listener, it should already report flattened memory with
> > > > no overlapping between the HVA chunks.
> > > > vhost_vdpa_listener_skipped_section should skip all problematic
> > > > sections if I'm not wrong.
> > > >
> > > > But I may have missed some scenarios: vdpa devices only care about
> > > > IOVA -> HVA translation, so two IOVA could translate to the same HVA
> > > > in theory and we would not notice until we try with SVQ. To develop an
> > > > algorithm to handle this seems complicated at this moment: Should we
> > > > keep the bigger one? The last mapped? What happens if the listener
> > > > unmaps one of them, we suddenly must start translating from the not
> > > > unmapping? Seems that some kind of stacking would be needed.
> > > >
> > > > Thanks!
> > >
> > > It looks to me that we should always try to allocate new iova each
> > > time, even if the HVA is the same. This means we need to remove the
> > > reverse mapping tree.
> > >
> > > Currently we had:
> > >
> > >     /* Check for collisions in translated addresses */
> > >     if (vhost_iova_tree_find_iova(tree, map)) {
> > >         return IOVA_ERR_OVERLAP;
> > >     }
> > >
> > > We probably need to remove that. And during the translation we need to
> > > iterate the whole iova tree to get the reverse mapping instead by
> > > returning the largest possible mapping there.
> > >
> >
> > I'm not sure if that is possible. g_tree_insert() calls the comparison
> > methods so it knows where to place the new element, so it's expected
> > to do something if the node already exists. Looking at the sources it
> > actually silently destroys the new node. If we call g_tree_replace, we
> > achieve the opposite and destroy the old node. But the tree is
> > expected to have non-overlapping keys.
>
> So the problem is that the current IOVA tree design is not fit for our
> requirement:
>
> static inline void iova_tree_insert_internal(GTree *gtree, DMAMap *range)
> {
>     /* Key and value are sharing the same range data */
>     g_tree_insert(gtree, range, range);
> }
>
> It looks to me we need to  extend the current IOVA tree, split IOVA
> range as key, this allows us to do an IOVA allocator on top. If we use
> IOVA as the key, we can do
>
> IOVA1->HVA
> IOVA2->HVA
>

I don't get 100% of this first part, but I think I get the idea in
general terms. What do you mean by "split IOVA range as key"?

> And then we can remove the current taddr_iova_map which assumes an 1:1
> mapping here. When doing HVA to IOVA translation, we need to iterate
> the tree and return the first match and continue the search until we
> meet the size.
>

Ok, this part is doable, and I actually have a PoC working.
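
In rough terms it does something like this (a simplified sketch, not
the actual PoC code: it only handles the case where a single mapping
covers the whole HVA range, and ReverseArgs, reverse_find and
vhost_iova_tree_find_taddr are invented names; GTree, DMAMap and
hwaddr are the existing types):

typedef struct {
    hwaddr taddr;          /* HVA we are reverse-translating */
    const DMAMap *found;   /* mapping that contains it, if any */
} ReverseArgs;

static gboolean reverse_find(gpointer key, gpointer value, gpointer data)
{
    const DMAMap *map = value;
    ReverseArgs *args = data;

    /* DMAMap.size is inclusive, so the last valid byte is taddr + size */
    if (args->taddr >= map->translated_addr &&
        args->taddr <= map->translated_addr + map->size) {
        args->found = map;
        return TRUE;   /* stop the traversal */
    }

    return FALSE;      /* keep iterating */
}

/* Iterate the iova-keyed GTree of DMAMap instead of a second HVA tree */
static const DMAMap *vhost_iova_tree_find_taddr(GTree *iova_taddr_map,
                                                hwaddr taddr)
{
    ReverseArgs args = { .taddr = taddr };

    g_tree_foreach(iova_taddr_map, reverse_find, &args);
    return args.found;
}

The real thing needs to keep iterating when the buffer spans more than
one mapping, as you describe, but the shape is the same.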

> >
> > Apart from that, we're not using this struct as a tree anymore so it's
> > better to use directly a list in that case.
> >
> > But even with the list there are still questions on how to handle
> > overlappings. How to handle this deletion:
> >
> > * Allocate translated_addr 0, size 0x1000.
> > * Allocate translated_addr 0, size 0x2000.
> > * Delete translated_addr 0, size 0x1000.
> >
> > Should it delete only the first node? Both of them?
>
> I'd suggest removing the taddr_iova_map.
>

We could also find it by iterating. I'm not sure we will ever hit that
case, though.

> >
> > iova-tree has similar questions too with iova. Inserting (iova=0,
> > size=0x1000) and deleting (.iova=0, size=0x800) will delete all the
> > whole node, so we cannot search the translation of (.iova=0x900)
> > anymore. Is this expected?
>
> Not sure. When vIOMMU is enabled, the guest risks itself to do this.
> When vIOMMU is not enabled, it should be a bug of qemu to add and
> remove GPA ranges with different size.
>

That's my impression too.

So while it's doable to iterate and allow overlapping maps that way,
my impression is that if we see this in the wild it means something
else is failing: either qemu is not filtering / flattening the
addresses right, or we are not filtering them right at
vhost_vdpa_listener_skipped_section.

And this is really a different problem from how we store the mappings:
the problem is not whether to store them in a list, a tree, or a
combination of both, but how to order them so we pick the same one as
the dma_memory_map function does. If we agree that "it should be a bug
of qemu to add and remove GPA ranges with different size", we may well
add a simple reference counter, something like the sketch below.
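
A minimal sketch of what I mean (hypothetical; RefDMAMap and the two
helpers don't exist anywhere, DMAMap is the existing struct):

/* Wrap each stored mapping with a counter of identical add requests */
typedef struct {
    DMAMap map;
    unsigned int refs;
} RefDMAMap;

/* Called when the listener adds a range we already track */
static void ref_map_get(RefDMAMap *entry)
{
    entry->refs++;
}

/* Returns true when the mapping must really be removed from the tree */
static bool ref_map_put(RefDMAMap *entry)
{
    assert(entry->refs > 0);
    return --entry->refs == 0;
}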

Note that I'm 100% on board with iterating over the IOVATree; this is
just an example to illustrate my point.

Thanks!

> Thanks
>
> >
> > > But this may degrade the performance, but consider the memslots should
> > > not be much at most of the time, it should be fine.
> > >
> > > Thanks
> > >
> > >
> > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > > +    return IOVA_OK;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Remove existing mappings from iova tree
> > > > > > + *
> > > > > > + * @param  iova_tree  The vhost iova tree
> > > > > > + * @param  map        The map to remove
> > > > > > + */
> > > > > > +void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map)
> > > > > > +{
> > > > > > +    const DMAMap *overlap;
> > > > > > +
> > > > > > +    iova_tree_remove(iova_tree->iova_taddr_map, map);
> > > > > > +    while ((overlap = vhost_iova_tree_find_iova(iova_tree, map))) {
> > > > > > +        g_tree_remove(iova_tree->taddr_iova_map, overlap);
> > > > > > +    }
> > > > > > +}
> > > > > > diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> > > > > > index 2dc87613bc..6047670804 100644
> > > > > > --- a/hw/virtio/meson.build
> > > > > > +++ b/hw/virtio/meson.build
> > > > > > @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
> > > > > >
> > > > > >   virtio_ss = ss.source_set()
> > > > > >   virtio_ss.add(files('virtio.c'))
> > > > > > -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> > > > > > +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
> > > > > >   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
> > > > > >   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
> > > > > >   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
> > > > >
> > > >
> > >
> >
>



^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2022-03-07  9:04 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-27 13:40 [PATCH v2 00/14] vDPA shadow virtqueue Eugenio Pérez
2022-02-27 13:40 ` [PATCH v2 01/14] vhost: Add VhostShadowVirtqueue Eugenio Pérez
2022-02-27 13:40 ` [PATCH v2 02/14] vhost: Add Shadow VirtQueue kick forwarding capabilities Eugenio Pérez
2022-02-28  2:57   ` Jason Wang
2022-02-28  2:57     ` Jason Wang
2022-03-01 18:49     ` Eugenio Perez Martin
2022-03-03  7:12       ` Jason Wang
2022-03-03  7:12         ` Jason Wang
2022-03-03  9:24         ` Eugenio Perez Martin
2022-03-04  1:39           ` Jason Wang
2022-03-04  1:39             ` Jason Wang
2022-02-27 13:41 ` [PATCH v2 03/14] vhost: Add Shadow VirtQueue call " Eugenio Pérez
2022-02-28  3:18   ` Jason Wang
2022-02-28  3:18     ` Jason Wang
2022-03-01 11:18     ` Eugenio Perez Martin
2022-02-27 13:41 ` [PATCH v2 04/14] vhost: Add vhost_svq_valid_features to shadow vq Eugenio Pérez
2022-02-28  3:25   ` Jason Wang
2022-02-28  3:25     ` Jason Wang
2022-03-01 19:18     ` Eugenio Perez Martin
2022-02-27 13:41 ` [PATCH v2 05/14] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
2022-02-27 13:41 ` [PATCH v2 06/14] vdpa: adapt vhost_ops callbacks to svq Eugenio Pérez
2022-02-28  3:59   ` Jason Wang
2022-02-28  3:59     ` Jason Wang
2022-03-01 19:31     ` Eugenio Perez Martin
2022-02-27 13:41 ` [PATCH v2 07/14] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
2022-02-28  5:39   ` Jason Wang
2022-02-28  5:39     ` Jason Wang
2022-03-02 18:23     ` Eugenio Perez Martin
2022-03-03  7:35       ` Jason Wang
2022-03-03  7:35         ` Jason Wang
2022-02-27 13:41 ` [PATCH v2 08/14] util: Add iova_tree_alloc Eugenio Pérez
2022-02-28  6:39   ` Jason Wang
2022-02-28  6:39     ` Jason Wang
2022-03-01 10:06     ` Eugenio Perez Martin
2022-03-03  7:16       ` Jason Wang
2022-03-03  7:16         ` Jason Wang
2022-02-27 13:41 ` [PATCH v2 09/14] vhost: Add VhostIOVATree Eugenio Pérez
2022-02-28  7:06   ` Jason Wang
2022-02-28  7:06     ` Jason Wang
2022-03-03 16:32     ` Eugenio Perez Martin
2022-03-04  2:04       ` Jason Wang
2022-03-04  2:04         ` Jason Wang
2022-03-04  8:01         ` Eugenio Perez Martin
2022-03-07  3:41           ` Jason Wang
2022-03-07  3:41             ` Jason Wang
2022-03-07  8:56             ` Eugenio Perez Martin
2022-02-27 13:41 ` [PATCH v2 10/14] vdpa: Add custom IOTLB translations to SVQ Eugenio Pérez
2022-02-28  7:36   ` Jason Wang
2022-02-28  7:36     ` Jason Wang
2022-03-01  8:50     ` Eugenio Perez Martin
2022-03-03  7:33       ` Jason Wang
2022-03-03  7:33         ` Jason Wang
2022-03-03 11:35         ` Eugenio Perez Martin
2022-03-07  4:24           ` Jason Wang
2022-03-07  4:24             ` Jason Wang
2022-03-07  7:44             ` Eugenio Perez Martin
2022-02-27 13:41 ` [PATCH v2 11/14] vdpa: Adapt vhost_vdpa_get_vring_base " Eugenio Pérez
2022-02-28  7:38   ` Jason Wang
2022-02-28  7:38     ` Jason Wang
2022-03-01  7:51     ` Eugenio Perez Martin
2022-02-27 13:41 ` [PATCH v2 12/14] vdpa: Never set log_base addr if SVQ is enabled Eugenio Pérez
2022-02-27 13:41 ` [PATCH v2 13/14] vdpa: Expose VHOST_F_LOG_ALL on SVQ Eugenio Pérez
2022-02-27 13:41 ` [PATCH v2 14/14] vdpa: Add x-svq to NetdevVhostVDPAOptions Eugenio Pérez
2022-02-28  2:32 ` [PATCH v2 00/14] vDPA shadow virtqueue Jason Wang
2022-02-28  2:32   ` Jason Wang
2022-03-01 11:36   ` Eugenio Perez Martin
2022-02-28  7:41 ` Jason Wang
2022-02-28  7:41   ` Jason Wang
2022-03-02 20:30   ` Eugenio Perez Martin
