* [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration
@ 2023-12-15 17:28 Eugenio Pérez
  2023-12-15 17:28 ` [PATCH for 9.0 01/12] vdpa: do not set virtio status bits if unneeded Eugenio Pérez
                   ` (13 more replies)
  0 siblings, 14 replies; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

Memory operations like pinning may take a long time at the destination.
Currently they are done after the source of the migration is stopped and
before the workload is resumed at the destination.  This is a period where
neither traffic can flow nor the VM workload can continue (downtime).

We can do better, as we know the memory layout of the guest RAM at the
destination from the moment the migration starts.  Moving that operation
earlier allows QEMU to communicate the maps to the kernel while the workload
is still running on the source, so Linux can start mapping them.
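
For reference, this is a minimal sketch (not part of this series; premap_region
and its parameters are made up for illustration, the vhost_msg_v2 layout is the
one from the Linux uapi headers) of the message QEMU writes to the vhost-vdpa
device fd for each guest RAM section, which is what the destination can now
start issuing while the source is still running:

    #include <linux/vhost.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Ask the vhost-vdpa kernel backend to pin and map one guest RAM region */
    static int premap_region(int device_fd, uint32_t asid,
                             uint64_t iova, uint64_t size, void *hva)
    {
        struct vhost_msg_v2 msg;

        memset(&msg, 0, sizeof(msg));
        msg.type = VHOST_IOTLB_MSG_V2;
        msg.asid = asid;
        msg.iotlb.iova = iova;                       /* device-visible address */
        msg.iotlb.size = size;
        msg.iotlb.uaddr = (uint64_t)(uintptr_t)hva;  /* QEMU virtual address */
        msg.iotlb.perm = VHOST_ACCESS_RW;
        msg.iotlb.type = VHOST_IOTLB_UPDATE;

        /* The kernel starts pinning/mapping as soon as it handles this write */
        return write(device_fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
    }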

Also, the migration of the guest memory may finish before the destination
QEMU maps all of it.  In this case, the rest of the memory will be mapped
at the same point as before applying this series, when the device is starting.
So we're only improving things with this series.

If the destination has the switchover_ack capability enabled, it holds the
migration switchover until all the memory is mapped.
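
As a usage note (nothing new introduced by this series): switchover_ack here
refers to QEMU's existing "switchover-ack" migration capability, so to get the
behavior above it would be enabled on both source and destination before
migrating, for example from the HMP monitor (it also needs the return-path
capability, if I recall correctly):

    (qemu) migrate_set_capability return-path on
    (qemu) migrate_set_capability switchover-ack on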

This needs to be applied on top of [1].  That series performs some code
reorganization that allows mapping the guest memory without knowing the queue
layout the guest configures on the device.

This series reduces the downtime in the stop-and-copy phase of the live
migration from 20s~30s to 5s with a 128 GiB guest and two mlx5_vdpa devices,
per [2].

Future directions on top of this series may include:
* Iterative migration of virtio-net devices, as it may reduce downtime per [3].
  vhost-vdpa net can apply the configuration through CVQ at the destination
  while the source is still migrating.
* Move more things ahead of migration time, like DRIVER_OK.
* Check that the devices of the destination are valid, and cancel the migration
  in case they are not.

v1 from RFC v2:
* Hold the migration with switchover_ack if memory has not been mapped in full.
* Revert the maps if the device is not started.

RFC v2:
* Delegate the maps to another thread so they do not block QMP.
* Fix not allocating iova_tree if x-svq=on at the destination.
* Rebased on latest master.
* More cleanups of the current code, which might be split from this series too.

[1] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg01986.html
[2] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg00909.html
[3] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/

Eugenio Pérez (12):
  vdpa: do not set virtio status bits if unneeded
  vdpa: make batch_begin_once early return
  vdpa: merge _begin_batch into _batch_begin_once
  vdpa: extract out _dma_end_batch from _listener_commit
  vdpa: factor out stop path of vhost_vdpa_dev_start
  vdpa: check for iova tree initialized at net_client_start
  vdpa: set backend capabilities at vhost_vdpa_init
  vdpa: add vhost_vdpa_load_setup
  vdpa: approve switchover after memory map in the migration destination
  vdpa: add vhost_vdpa_net_load_setup NetClient callback
  vdpa: add vhost_vdpa_net_switchover_ack_needed
  virtio_net: register incremental migration handlers

 include/hw/virtio/vhost-vdpa.h |  32 ++++
 include/net/net.h              |   8 +
 hw/net/virtio-net.c            |  48 ++++++
 hw/virtio/vhost-vdpa.c         | 274 +++++++++++++++++++++++++++------
 net/vhost-vdpa.c               |  43 +++++-
 5 files changed, 357 insertions(+), 48 deletions(-)

-- 
2.39.3





* [PATCH for 9.0 01/12] vdpa: do not set virtio status bits if unneeded
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-20  4:25   ` Jason Wang
  2023-12-15 17:28 ` [PATCH for 9.0 02/12] vdpa: make batch_begin_once early return Eugenio Pérez
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

Next commits will set DRIVER and ACKNOWLEDGE flags repeatedly in the
case of a migration destination.  Let's save ioctls with this.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 7500c2fc82..cc252fc2d8 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -510,6 +510,10 @@ static int vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
     if (ret < 0) {
         return ret;
     }
+    if ((s & status) == status) {
+        /* Don't set bits already set */
+        return 0;
+    }
 
     s |= status;
 
-- 
2.39.3




* [PATCH for 9.0 02/12] vdpa: make batch_begin_once early return
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
  2023-12-15 17:28 ` [PATCH for 9.0 01/12] vdpa: do not set virtio status bits if unneeded Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-20  4:27   ` Jason Wang
  2023-12-15 17:28 ` [PATCH for 9.0 03/12] vdpa: merge _begin_batch into _batch_begin_once Eugenio Pérez
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

Prefer early return so it is easier to merge
vhost_vdpa_listener_begin_batch here and make iotlb batches begin and end
symmetrical.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index cc252fc2d8..bf9771870a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -160,11 +160,12 @@ static void vhost_vdpa_listener_begin_batch(VhostVDPAShared *s)
 
 static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
 {
-    if (s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH) &&
-        !s->iotlb_batch_begin_sent) {
-        vhost_vdpa_listener_begin_batch(s);
+    if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH)) ||
+        s->iotlb_batch_begin_sent) {
+        return;
     }
 
+    vhost_vdpa_listener_begin_batch(s);
     s->iotlb_batch_begin_sent = true;
 }
 
-- 
2.39.3




* [PATCH for 9.0 03/12] vdpa: merge _begin_batch into _batch_begin_once
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
  2023-12-15 17:28 ` [PATCH for 9.0 01/12] vdpa: do not set virtio status bits if unneeded Eugenio Pérez
  2023-12-15 17:28 ` [PATCH for 9.0 02/12] vdpa: make batch_begin_once early return Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-20  4:30   ` Jason Wang
  2023-12-15 17:28 ` [PATCH for 9.0 04/12] vdpa: extract out _dma_end_batch from _listener_commit Eugenio Pérez
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

There was only one call.  This way we can make the begin and end of the
batch symmetrical.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index bf9771870a..a533fc5bc7 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -143,7 +143,7 @@ int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
     return ret;
 }
 
-static void vhost_vdpa_listener_begin_batch(VhostVDPAShared *s)
+static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
 {
     int fd = s->device_fd;
     struct vhost_msg_v2 msg = {
@@ -151,21 +151,16 @@ static void vhost_vdpa_listener_begin_batch(VhostVDPAShared *s)
         .iotlb.type = VHOST_IOTLB_BATCH_BEGIN,
     };
 
-    trace_vhost_vdpa_listener_begin_batch(s, fd, msg.type, msg.iotlb.type);
-    if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
-        error_report("failed to write, fd=%d, errno=%d (%s)",
-                     fd, errno, strerror(errno));
-    }
-}
-
-static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
-{
     if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH)) ||
         s->iotlb_batch_begin_sent) {
         return;
     }
 
-    vhost_vdpa_listener_begin_batch(s);
+    trace_vhost_vdpa_listener_begin_batch(s, fd, msg.type, msg.iotlb.type);
+    if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
+        error_report("failed to write, fd=%d, errno=%d (%s)",
+                     fd, errno, strerror(errno));
+    }
     s->iotlb_batch_begin_sent = true;
 }
 
-- 
2.39.3




* [PATCH for 9.0 04/12] vdpa: extract out _dma_end_batch from _listener_commit
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (2 preceding siblings ...)
  2023-12-15 17:28 ` [PATCH for 9.0 03/12] vdpa: merge _begin_batch into _batch_begin_once Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-20  4:31   ` Jason Wang
  2023-12-15 17:28 ` [PATCH for 9.0 05/12] vdpa: factor out stop path of vhost_vdpa_dev_start Eugenio Pérez
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

So we can call vhost_vdpa_dma_end_batch from outside the listener
callbacks.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index a533fc5bc7..57a8043cd4 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -164,9 +164,8 @@ static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
     s->iotlb_batch_begin_sent = true;
 }
 
-static void vhost_vdpa_listener_commit(MemoryListener *listener)
+static void vhost_vdpa_dma_end_batch(VhostVDPAShared *s)
 {
-    VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
     struct vhost_msg_v2 msg = {};
     int fd = s->device_fd;
 
@@ -190,6 +189,13 @@ static void vhost_vdpa_listener_commit(MemoryListener *listener)
     s->iotlb_batch_begin_sent = false;
 }
 
+static void vhost_vdpa_listener_commit(MemoryListener *listener)
+{
+    VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
+
+    vhost_vdpa_dma_end_batch(s);
+}
+
 static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     struct vdpa_iommu *iommu = container_of(n, struct vdpa_iommu, n);
-- 
2.39.3




* [PATCH for 9.0 05/12] vdpa: factor out stop path of vhost_vdpa_dev_start
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (3 preceding siblings ...)
  2023-12-15 17:28 ` [PATCH for 9.0 04/12] vdpa: extract out _dma_end_batch from _listener_commit Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-20  4:31   ` Jason Wang
  2023-12-15 17:28 ` [PATCH for 9.0 06/12] vdpa: check for iova tree initialized at net_client_start Eugenio Pérez
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

This makes it easier to build an error path in the next patches.  No
functional change.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 57a8043cd4..449c3794b2 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1302,7 +1302,7 @@ static void vhost_vdpa_suspend(struct vhost_dev *dev)
 static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
 {
     struct vhost_vdpa *v = dev->opaque;
-    bool ok;
+    bool ok = true;
     trace_vhost_vdpa_dev_start(dev, started);
 
     if (started) {
@@ -1313,8 +1313,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
         }
     } else {
         vhost_vdpa_suspend(dev);
-        vhost_vdpa_svqs_stop(dev);
-        vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
+        goto out_stop;
     }
 
     if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
@@ -1333,6 +1332,11 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
     }
 
     return 0;
+
+out_stop:
+    vhost_vdpa_svqs_stop(dev);
+    vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
+    return ok ? 0 : -1;
 }
 
 static void vhost_vdpa_reset_status(struct vhost_dev *dev)
-- 
2.39.3




* [PATCH for 9.0 06/12] vdpa: check for iova tree initialized at net_client_start
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (4 preceding siblings ...)
  2023-12-15 17:28 ` [PATCH for 9.0 05/12] vdpa: factor out stop path of vhost_vdpa_dev_start Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-15 17:28 ` [PATCH for 9.0 07/12] vdpa: set backend capabilities at vhost_vdpa_init Eugenio Pérez
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

To map the guest memory while it is migrating we need to create the
iova_tree, as long as the destination uses x-svq=on.  Check whether it is
already initialized so we do not override it.

The function vhost_vdpa_net_client_stop clears it if the device is
stopped.  If the guest starts the device again, the iova tree is
recreated by vhost_vdpa_net_data_start_first or vhost_vdpa_net_cvq_start
if needed, so the old behavior is kept.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 net/vhost-vdpa.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 3726ee5d67..e11b390466 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -341,7 +341,9 @@ static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
 
     migration_add_notifier(&s->migration_state,
                            vdpa_net_migration_state_notifier);
-    if (v->shadow_vqs_enabled) {
+
+    /* iova_tree may be initialized by vhost_vdpa_net_load_setup */
+    if (v->shadow_vqs_enabled && !v->shared->iova_tree) {
         v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
                                                    v->shared->iova_range.last);
     }
-- 
2.39.3




* [PATCH for 9.0 07/12] vdpa: set backend capabilities at vhost_vdpa_init
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (5 preceding siblings ...)
  2023-12-15 17:28 ` [PATCH for 9.0 06/12] vdpa: check for iova tree initialized at net_client_start Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-20  4:34   ` Jason Wang
  2023-12-15 17:28 ` [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup Eugenio Pérez
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

The backend does not reset the backend capabilities until the vdpa file
descriptor is closed, so there is no harm in setting them only once.

This allows the destination of a live migration to premap memory in
batches, using VHOST_BACKEND_F_IOTLB_BATCH.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 50 ++++++++++++++++--------------------------
 1 file changed, 19 insertions(+), 31 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 449c3794b2..43f7c382b1 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -587,11 +587,25 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
     struct vhost_vdpa *v = opaque;
     assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);
     trace_vhost_vdpa_init(dev, v->shared, opaque);
+    uint64_t backend_features;
+    uint64_t qemu_backend_features = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
+                                     0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
+                                     0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
+                                     0x1ULL << VHOST_BACKEND_F_SUSPEND;
     int ret;
 
     v->dev = dev;
     dev->opaque =  opaque ;
     v->shared->listener = vhost_vdpa_memory_listener;
+
+    if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &backend_features)) {
+        return -EFAULT;
+    }
+
+    backend_features &= qemu_backend_features;
+
+    dev->backend_cap = backend_features;
+    v->shared->backend_cap = backend_features;
     vhost_vdpa_init_svq(dev, v);
 
     error_propagate(&dev->migration_blocker, v->migration_blocker);
@@ -599,6 +613,11 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
         return 0;
     }
 
+    ret = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &backend_features);
+    if (ret) {
+        return -EFAULT;
+    }
+
     /*
      * If dev->shadow_vqs_enabled at initialization that means the device has
      * been started with x-svq=on, so don't block migration
@@ -829,36 +848,6 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
     return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
 }
 
-static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
-{
-    struct vhost_vdpa *v = dev->opaque;
-
-    uint64_t features;
-    uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
-        0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
-        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
-        0x1ULL << VHOST_BACKEND_F_SUSPEND;
-    int r;
-
-    if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
-        return -EFAULT;
-    }
-
-    features &= f;
-
-    if (vhost_vdpa_first_dev(dev)) {
-        r = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &features);
-        if (r) {
-            return -EFAULT;
-        }
-    }
-
-    dev->backend_cap = features;
-    v->shared->backend_cap = features;
-
-    return 0;
-}
-
 static int vhost_vdpa_get_device_id(struct vhost_dev *dev,
                                     uint32_t *device_id)
 {
@@ -1512,7 +1501,6 @@ const VhostOps vdpa_ops = {
         .vhost_set_vring_kick = vhost_vdpa_set_vring_kick,
         .vhost_set_vring_call = vhost_vdpa_set_vring_call,
         .vhost_get_features = vhost_vdpa_get_features,
-        .vhost_set_backend_cap = vhost_vdpa_set_backend_cap,
         .vhost_set_owner = vhost_vdpa_set_owner,
         .vhost_set_vring_endian = NULL,
         .vhost_backend_memslots_limit = vhost_vdpa_memslots_limit,
-- 
2.39.3




* [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (6 preceding siblings ...)
  2023-12-15 17:28 ` [PATCH for 9.0 07/12] vdpa: set backend capabilities at vhost_vdpa_init Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-20  5:21   ` Jason Wang
  2023-12-15 17:28 ` [PATCH for 9.0 09/12] vdpa: approve switchover after memory map in the migration destination Eugenio Pérez
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

Callers can use this function to set up the incoming migration thread.

This thread is able to map the guest memory while the migration is
ongoing, without blocking QMP or other important tasks.  While this
allows the destination QEMU not to block, it spreads the mapping time
over the migration instead of making it happen before the migration.

This thread is joined at vdpa backend device start, so it could happen
that the guest memory is so large that some of it still remains to be
mapped at that point.  This can be improved in later iterations, when the
destination device can inform QEMU that it is not ready to complete the
migration.

If the device is not started, the cleanup of the mapped memory is done at
.load_cleanup.  This is far from ideal, as the destination machine has
mapped all the guest RAM for nothing, and now it needs to unmap it.
However, we don't have information about the state of the device, so it's
the best we can do.  Once iterative migration is supported, this will be
improved as we will know the virtio state of the device.

If the VM migrates before all the maps are finished, the source will stop
but the destination is still not ready to continue, and it will wait
until all guest RAM is mapped.  It is still an improvement over doing
all the maps when the migration finishes, but the next patches use the
switchover_ack method to prevent the source from stopping until all the
memory is mapped at the destination.

The memory unmapping when the device is not started is weird too, as
ideally nothing would be mapped.  This can be fixed when we migrate the
device state iteratively, and we know for sure whether the device is
started or not.  At this moment we don't have such information, so there
is no better alternative.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

---
v1:
* Moved all the map thread members to a struct separate from
  VhostVDPAShared, so its memory and usage can be freed from the moment
  the thread joins.

RFC v2:
* Use a dedicated thread for the maps instead of doing everything in
  .load_setup, blocking QMP etc.

Other options considered:
* ThreadPool: It is hard to keep the order of the requests in this form.
  Sending the work to a thread pool is also nice, as the number of QEMU
  threads can be controlled that way, but then vhost_vdpa_dev_map and
  similar functions cannot know whether they are in the worker thread by
  checking with pthread_self.  We could split dma_map and similar into two
  functions, one that checks whether we can offload to the map thread and
  another that does not.
  Synchronization with qemu_thread_join is also lost.  A semaphore,
  QemuEvent, or plain Mutex + Cond would be needed to synchronize list
  access, complicating the solution.  Maps are heavy enough already.
  Using coroutines has the same result.
* QIOTask: This is very nice as it even has the task cleaner run in the
  main thread, and I have a model to follow in dns-resolver.c.  However,
  I still need to check whether I'm in the main thread or not.

To be able to get rid of qemu_thread_is_self, we could create another
set of .dma_map_sync, .dma_unmap_sync, etc., and call the (potentially)
async versions from the listener and the guaranteed synchronous versions
from the map thread.  This makes the code more complex in my opinion.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  28 ++++++
 hw/virtio/vhost-vdpa.c         | 166 ++++++++++++++++++++++++++++++++-
 2 files changed, 193 insertions(+), 1 deletion(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 8f54e5edd4..b49286b327 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -17,6 +17,7 @@
 #include "hw/virtio/vhost-iova-tree.h"
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/virtio.h"
+#include "qemu/thread.h"
 #include "standard-headers/linux/vhost_types.h"
 
 /*
@@ -30,6 +31,12 @@ typedef struct VhostVDPAHostNotifier {
     void *addr;
 } VhostVDPAHostNotifier;
 
+typedef struct VhostVDPAMapThread {
+    QemuThread thread;
+    GAsyncQueue *queue;
+    bool map_thread_enabled;
+} VhostVDPAMapThread;
+
 /* Info shared by all vhost_vdpa device models */
 typedef struct vhost_vdpa_shared {
     int device_fd;
@@ -43,8 +50,27 @@ typedef struct vhost_vdpa_shared {
     /* Copy of backend features */
     uint64_t backend_cap;
 
+    /*
+     * Thread to map memory in QEMU incoming migration.
+     *
+     * Incoming migration calls the devices' ->load_setup in the main thread,
+     * but map operations can take a long time with the BQL acquired.  This
+     * prevents QEMU from serving other requests like QMP.
+     *
+     * To solve it, offload the first listener operations until the first
+     * listener commit from the main thread.  Once these are served, join the
+     * map thread.
+     */
+    VhostVDPAMapThread *map_thread;
+
     bool iotlb_batch_begin_sent;
 
+    /*
+     * The memory listener has been registered, so DMA maps have been sent to
+     * the device.
+     */
+    bool listener_registered;
+
     /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
     bool shadow_data;
 } VhostVDPAShared;
@@ -73,6 +99,8 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                        hwaddr size, void *vaddr, bool readonly);
 int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                          hwaddr size);
+int vhost_vdpa_load_setup(VhostVDPAShared *s, AddressSpace *dma_as);
+int vhost_vdpa_load_cleanup(VhostVDPAShared *s, bool vhost_will_start);
 
 typedef struct vdpa_iommu {
     VhostVDPAShared *dev_shared;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 43f7c382b1..339e11c58a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -101,6 +101,16 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
     msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
     msg.iotlb.type = VHOST_IOTLB_UPDATE;
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+
+        return 0;
+    }
+
     trace_vhost_vdpa_dma_map(s, fd, msg.type, msg.asid, msg.iotlb.iova,
                              msg.iotlb.size, msg.iotlb.uaddr, msg.iotlb.perm,
                              msg.iotlb.type);
@@ -131,6 +141,16 @@ int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
     msg.iotlb.size = size;
     msg.iotlb.type = VHOST_IOTLB_INVALIDATE;
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+
+        return 0;
+    }
+
     trace_vhost_vdpa_dma_unmap(s, fd, msg.type, msg.asid, msg.iotlb.iova,
                                msg.iotlb.size, msg.iotlb.type);
 
@@ -156,6 +176,16 @@ static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
         return;
     }
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+
+        return;
+    }
+
     trace_vhost_vdpa_listener_begin_batch(s, fd, msg.type, msg.iotlb.type);
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
         error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -180,6 +210,17 @@ static void vhost_vdpa_dma_end_batch(VhostVDPAShared *s)
     msg.type = VHOST_IOTLB_MSG_V2;
     msg.iotlb.type = VHOST_IOTLB_BATCH_END;
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+        s->map_thread->map_thread_enabled = false;
+
+        return;
+    }
+
     trace_vhost_vdpa_listener_commit(s, fd, msg.type, msg.iotlb.type);
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
         error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -1288,6 +1329,88 @@ static void vhost_vdpa_suspend(struct vhost_dev *dev)
     vhost_vdpa_reset_device(dev);
 }
 
+static void *vhost_vdpa_load_map_worker(void *opaque)
+{
+    VhostVDPAShared *shared = opaque;
+    GPtrArray *ret = NULL;
+
+    while (true) {
+        g_autofree struct vhost_msg_v2 *msg = NULL;
+        int r = 0;
+
+        msg = g_async_queue_pop(shared->map_thread->queue);
+        switch (msg->iotlb.type) {
+        case VHOST_IOTLB_UPDATE:
+            r = vhost_vdpa_dma_map(shared, msg->asid, msg->iotlb.iova,
+                                   msg->iotlb.size,
+                                   (void *)(uintptr_t)msg->iotlb.uaddr,
+                                   msg->iotlb.perm == VHOST_ACCESS_RO);
+            break;
+        case VHOST_IOTLB_INVALIDATE:
+            r = vhost_vdpa_dma_unmap(shared, msg->asid, msg->iotlb.iova,
+                                     msg->iotlb.size);
+            break;
+        case VHOST_IOTLB_BATCH_BEGIN:
+            vhost_vdpa_iotlb_batch_begin_once(shared);
+            break;
+        case VHOST_IOTLB_BATCH_END:
+            vhost_vdpa_dma_end_batch(shared);
+            goto end;
+        default:
+            error_report("Invalid IOTLB msg type %d", msg->iotlb.type);
+            break;
+        };
+
+        if (unlikely(r != 0)) {
+            /* Add to failed iotlbs so we can remove it from iova_tree */
+            if (ret == NULL) {
+                ret = g_ptr_array_new_full(1, g_free);
+            }
+
+            g_ptr_array_add(ret, g_steal_pointer(&msg));
+        }
+    }
+
+end:
+    return ret;
+}
+
+static void vhost_vdpa_spawn_maps_thread(VhostVDPAShared *shared)
+{
+    shared->map_thread = g_new0(VhostVDPAMapThread, 1);
+    shared->map_thread->queue = g_async_queue_new();
+    qemu_thread_create(&shared->map_thread->thread, "vdpa map thread",
+                       vhost_vdpa_load_map_worker, shared,
+                       QEMU_THREAD_JOINABLE);
+    shared->map_thread->map_thread_enabled = true;
+}
+
+static bool vhost_vdpa_join_maps_thread(VhostVDPAShared *shared)
+{
+    g_autoptr(GPtrArray) failed_iova = NULL;
+
+    failed_iova = qemu_thread_join(&shared->map_thread->thread);
+    g_async_queue_unref(shared->map_thread->queue);
+    g_clear_pointer(&shared->map_thread, g_free);
+
+    if (likely(!failed_iova)) {
+        return true;
+    }
+
+    /* If it is a failed IOVA, abort starting */
+    for (size_t i = 0; i < failed_iova->len; ++i) {
+        struct vhost_msg_v2 *msg = g_ptr_array_index(failed_iova, i);
+        DMAMap mem_region = {
+            .iova = msg->iotlb.iova,
+            .size = msg->iotlb.size - 1, /* Inclusive */
+        };
+
+        vhost_iova_tree_remove(shared->iova_tree, mem_region);
+    }
+
+    return false;
+}
+
 static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
 {
     struct vhost_vdpa *v = dev->opaque;
@@ -1315,7 +1438,15 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
                          "IOMMU and try again");
             return -1;
         }
-        memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+        if (!v->shared->listener_registered) {
+            memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+            v->shared->listener_registered = true;
+        } else {
+            ok = vhost_vdpa_join_maps_thread(v->shared);
+            if (unlikely(!ok)) {
+                goto out_stop;
+            }
+        }
 
         return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
     }
@@ -1340,6 +1471,8 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
     vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
                                VIRTIO_CONFIG_S_DRIVER);
     memory_listener_unregister(&v->shared->listener);
+    v->shared->listener_registered = false;
+
 }
 
 static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
@@ -1522,3 +1655,34 @@ const VhostOps vdpa_ops = {
         .vhost_set_config_call = vhost_vdpa_set_config_call,
         .vhost_reset_status = vhost_vdpa_reset_status,
 };
+
+int vhost_vdpa_load_setup(VhostVDPAShared *shared, AddressSpace *dma_as)
+{
+    uint8_t s = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
+    int r = ioctl(shared->device_fd, VHOST_VDPA_SET_STATUS, &s);
+    if (unlikely(r < 0)) {
+        return r;
+    }
+
+    vhost_vdpa_spawn_maps_thread(shared);
+    memory_listener_register(&shared->listener, dma_as);
+    shared->listener_registered = true;
+    return 0;
+}
+
+int vhost_vdpa_load_cleanup(VhostVDPAShared *shared, bool vhost_will_start)
+{
+    if (shared->map_thread && !shared->map_thread->map_thread_enabled) {
+        return 0;
+    }
+
+    if (vhost_will_start) {
+        /*
+         * Delegate the join of the map thread to vhost_vdpa_dev_start, as it
+         * runs outside the main QEMU lock.
+         */
+        return 0;
+    }
+
+    return vhost_vdpa_join_maps_thread(shared) ? 0 : -1;
+}
-- 
2.39.3




* [PATCH for 9.0 09/12] vdpa: approve switchover after memory map in the migration destination
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (7 preceding siblings ...)
  2023-12-15 17:28 ` [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-15 17:28 ` [PATCH for 9.0 10/12] vdpa: add vhost_vdpa_net_load_setup NetClient callback Eugenio Pérez
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

If the VM migrates before all the maps are finished, the source will stop
but the destination is still not ready to continue, and it will wait
until all guest RAM is mapped.  The destination can use switchover_ack
to prevent the source from stopping until all the memory is mapped at the
destination.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  4 ++++
 hw/virtio/vhost-vdpa.c         | 18 ++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index b49286b327..1c7e3fbd24 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -34,6 +34,7 @@ typedef struct VhostVDPAHostNotifier {
 typedef struct VhostVDPAMapThread {
     QemuThread thread;
     GAsyncQueue *queue;
+    QEMUBH *bh;
     bool map_thread_enabled;
 } VhostVDPAMapThread;
 
@@ -60,6 +61,9 @@ typedef struct vhost_vdpa_shared {
      * To solve it, offload the first listener operations until the first
      * listener commit from the main thread.  Once these are served, join the
      * map thread.
+     *
+     * This map thread is joined by join_map_thread BH if
+     * migrate_switchover_ack is supported, or by vhost_vdpa_dev_start if not.
      */
     VhostVDPAMapThread *map_thread;
 
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 339e11c58a..7d31f4a30e 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -22,6 +22,8 @@
 #include "hw/virtio/vhost-vdpa.h"
 #include "exec/address-spaces.h"
 #include "migration/blocker.h"
+#include "migration/options.h"
+#include "migration/savevm.h"
 #include "qemu/cutils.h"
 #include "qemu/main-loop.h"
 #include "trace.h"
@@ -1372,13 +1374,26 @@ static void *vhost_vdpa_load_map_worker(void *opaque)
     }
 
 end:
+    if (shared->map_thread->bh) {
+        qemu_bh_schedule(shared->map_thread->bh);
+    }
+
     return ret;
 }
 
+static void vhost_vdpa_load_map_switchover_ack(void *opaque)
+{
+    qemu_loadvm_approve_switchover();
+}
+
 static void vhost_vdpa_spawn_maps_thread(VhostVDPAShared *shared)
 {
     shared->map_thread = g_new0(VhostVDPAMapThread, 1);
     shared->map_thread->queue = g_async_queue_new();
+    if (migrate_switchover_ack()) {
+        shared->map_thread->bh = qemu_bh_new(vhost_vdpa_load_map_switchover_ack,
+                                             NULL);
+    }
     qemu_thread_create(&shared->map_thread->thread, "vdpa map thread",
                        vhost_vdpa_load_map_worker, shared,
                        QEMU_THREAD_JOINABLE);
@@ -1390,6 +1405,9 @@ static bool vhost_vdpa_join_maps_thread(VhostVDPAShared *shared)
     g_autoptr(GPtrArray) failed_iova = NULL;
 
     failed_iova = qemu_thread_join(&shared->map_thread->thread);
+    if (shared->map_thread->bh) {
+        qemu_bh_delete(shared->map_thread->bh);
+    }
     g_async_queue_unref(shared->map_thread->queue);
     g_clear_pointer(&shared->map_thread, g_free);
 
-- 
2.39.3




* [PATCH for 9.0 10/12] vdpa: add vhost_vdpa_net_load_setup NetClient callback
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (8 preceding siblings ...)
  2023-12-15 17:28 ` [PATCH for 9.0 09/12] vdpa: approve switchover after memory map in the migration destination Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-15 17:28 ` [PATCH for 9.0 11/12] vdpa: add vhost_vdpa_net_switchover_ack_needed Eugenio Pérez
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

So the vDPA backend knows when an incoming migration starts.  The NICState
argument is needed so we can get the DMA address space.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
RFC v2:
* Solve git conflict with .set_steering_ebpf
* Fix the x-svq=on use case, which did not allocate the iova_tree.
---
 include/net/net.h |  6 ++++++
 net/vhost-vdpa.c  | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index ffbd2c8d56..68282dde31 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -42,6 +42,7 @@ typedef struct NICConf {
 
 /* Net clients */
 
+struct NICState;
 typedef void (NetPoll)(NetClientState *, bool enable);
 typedef bool (NetCanReceive)(NetClientState *);
 typedef int (NetStart)(NetClientState *);
@@ -69,6 +70,9 @@ typedef void (SocketReadStateFinalize)(SocketReadState *rs);
 typedef void (NetAnnounce)(NetClientState *);
 typedef bool (SetSteeringEBPF)(NetClientState *, int);
 typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **);
+/* This can be called before start & pair, so get also the peer */
+typedef int (NetMigrationLoadSetup)(NetClientState *, struct NICState *);
+typedef int (NetMigrationLoadCleanup)(NetClientState *, struct NICState *);
 
 typedef struct NetClientInfo {
     NetClientDriver type;
@@ -98,6 +102,8 @@ typedef struct NetClientInfo {
     NetAnnounce *announce;
     SetSteeringEBPF *set_steering_ebpf;
     NetCheckPeerType *check_peer_type;
+    NetMigrationLoadSetup *load_setup;
+    NetMigrationLoadCleanup *load_cleanup;
 } NetClientInfo;
 
 struct NetClientState {
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index e11b390466..7d4a99878e 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -406,6 +406,37 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
     }
 }
 
+static int vhost_vdpa_net_load_setup(NetClientState *nc, NICState *nic)
+{
+    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+    VirtIONet *n = qemu_get_nic_opaque(&nic->ncs[0]);
+    VhostVDPAShared *shared = s->vhost_vdpa.shared;
+    int r;
+
+    if (s->always_svq) {
+        /* iova tree is needed because of SVQ */
+        shared->iova_tree = vhost_iova_tree_new(shared->iova_range.first,
+                                                shared->iova_range.last);
+    }
+
+    r = vhost_vdpa_load_setup(shared, n->parent_obj.dma_as);
+    if (unlikely(r < 0)) {
+        g_clear_pointer(&s->vhost_vdpa.shared->iova_tree,
+                        vhost_iova_tree_delete);
+    }
+
+    return r;
+}
+
+static int vhost_vdpa_net_load_cleanup(NetClientState *nc, NICState *nic)
+{
+    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+    VirtIONet *n = qemu_get_nic_opaque(&nic->ncs[0]);
+
+    return vhost_vdpa_load_cleanup(s->vhost_vdpa.shared,
+                             n->parent_obj.status & VIRTIO_CONFIG_S_DRIVER_OK);
+}
+
 static NetClientInfo net_vhost_vdpa_info = {
         .type = NET_CLIENT_DRIVER_VHOST_VDPA,
         .size = sizeof(VhostVDPAState),
@@ -418,6 +449,8 @@ static NetClientInfo net_vhost_vdpa_info = {
         .has_ufo = vhost_vdpa_has_ufo,
         .check_peer_type = vhost_vdpa_check_peer_type,
         .set_steering_ebpf = vhost_vdpa_set_steering_ebpf,
+        .load_setup = vhost_vdpa_net_load_setup,
+        .load_cleanup = vhost_vdpa_net_load_cleanup,
 };
 
 static int64_t vhost_vdpa_get_vring_group(int device_fd, unsigned vq_index,
-- 
2.39.3




* [PATCH for 9.0 11/12] vdpa: add vhost_vdpa_net_switchover_ack_needed
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (9 preceding siblings ...)
  2023-12-15 17:28 ` [PATCH for 9.0 10/12] vdpa: add vhost_vdpa_net_load_setup NetClient callback Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-15 17:28 ` [PATCH for 9.0 12/12] virtio_net: register incremental migration handlers Eugenio Pérez
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

Use the migration switchover ack capability to make sure QEMU has mapped
all the guest memory to the device before the source stops the VM and
attempts to complete the migration.

All net vdpa devices support this early map of guest memory, so always
return true.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/net/net.h | 2 ++
 net/vhost-vdpa.c  | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 68282dde31..f8c65dcb86 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -73,6 +73,7 @@ typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **);
 /* This can be called before start & pair, so get also the peer */
 typedef int (NetMigrationLoadSetup)(NetClientState *, struct NICState *);
 typedef int (NetMigrationLoadCleanup)(NetClientState *, struct NICState *);
+typedef bool (NetMigrationSwitchoverAckNeeded)(const NetClientState *);
 
 typedef struct NetClientInfo {
     NetClientDriver type;
@@ -104,6 +105,7 @@ typedef struct NetClientInfo {
     NetCheckPeerType *check_peer_type;
     NetMigrationLoadSetup *load_setup;
     NetMigrationLoadCleanup *load_cleanup;
+    NetMigrationSwitchoverAckNeeded *load_switchover_ack_needed;
 } NetClientInfo;
 
 struct NetClientState {
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 7d4a99878e..9aa958c4f1 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -437,12 +437,18 @@ static int vhost_vdpa_net_load_cleanup(NetClientState *nc, NICState *nic)
                              n->parent_obj.status & VIRTIO_CONFIG_S_DRIVER_OK);
 }
 
+static bool vhost_vdpa_net_switchover_ack_needed(const NetClientState *nc)
+{
+    return true;
+}
+
 static NetClientInfo net_vhost_vdpa_info = {
         .type = NET_CLIENT_DRIVER_VHOST_VDPA,
         .size = sizeof(VhostVDPAState),
         .receive = vhost_vdpa_receive,
         .start = vhost_vdpa_net_data_start,
         .load = vhost_vdpa_net_data_load,
+        .load_switchover_ack_needed = vhost_vdpa_net_switchover_ack_needed,
         .stop = vhost_vdpa_net_client_stop,
         .cleanup = vhost_vdpa_cleanup,
         .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
-- 
2.39.3




* [PATCH for 9.0 12/12] virtio_net: register incremental migration handlers
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (10 preceding siblings ...)
  2023-12-15 17:28 ` [PATCH for 9.0 11/12] vdpa: add vhost_vdpa_net_switchover_ack_needed Eugenio Pérez
@ 2023-12-15 17:28 ` Eugenio Pérez
  2023-12-25  1:41 ` [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Lei Yang
  2023-12-25 16:30 ` Michael S. Tsirkin
  13 siblings, 0 replies; 33+ messages in thread
From: Eugenio Pérez @ 2023-12-15 17:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, si-wei.liu, Lei Yang, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

This way VirtIONet can detect when the incoming migration starts.

While registering in the backend (nc->peer) seems more logical, we need
the NIC's DMA address space, and we cannot get it from the backend.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
This could be done in the vhost_vdpa or VirtIODevice struct, but a future
series will add state restore through CVQ, so it's easier to start in
VirtIONet directly.  If we need to make this more generic, we can move it
to VirtIODevice and expose callbacks from the VirtIONet class.

Also, the pointer may not be the best id, but there are not a lot of
things initialized in n.
---
 hw/net/virtio-net.c | 48 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 80c56f0cfc..7dd08c98b6 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -38,6 +38,7 @@
 #include "qapi/qapi-events-migration.h"
 #include "hw/virtio/virtio-access.h"
 #include "migration/misc.h"
+#include "migration/register.h"
 #include "standard-headers/linux/ethtool.h"
 #include "sysemu/sysemu.h"
 #include "trace.h"
@@ -3810,9 +3811,52 @@ static void virtio_net_device_unrealize(DeviceState *dev)
     virtio_cleanup(vdev);
 }
 
+static int virtio_net_load_setup(QEMUFile *f, void *opaque)
+{
+    VirtIONet *n = opaque;
+    NetClientState *nc = qemu_get_queue(n->nic);
+
+    if (nc->peer && nc->peer->info->load_setup) {
+        return nc->peer->info->load_setup(nc->peer, n->nic);
+    }
+
+    return 0;
+}
+
+static int virtio_net_load_cleanup(void *opaque)
+{
+    VirtIONet *n = opaque;
+    NetClientState *nc = qemu_get_queue(n->nic);
+
+    if (nc->peer && nc->peer->info->load_cleanup) {
+        return nc->peer->info->load_cleanup(nc->peer, n->nic);
+    }
+
+    return 0;
+}
+
+static bool virtio_net_switchover_ack_needed(void *opaque)
+{
+    VirtIONet *n = opaque;
+    NetClientState *nc = qemu_get_queue(n->nic);
+
+    if (nc->peer && nc->peer->info->load_switchover_ack_needed) {
+        return nc->peer->info->load_switchover_ack_needed(nc->peer);
+    }
+
+    return false;
+}
+
+static const SaveVMHandlers savevm_virtio_net_handlers = {
+    .load_setup = virtio_net_load_setup,
+    .load_cleanup = virtio_net_load_cleanup,
+    .switchover_ack_needed = virtio_net_switchover_ack_needed,
+};
+
 static void virtio_net_instance_init(Object *obj)
 {
     VirtIONet *n = VIRTIO_NET(obj);
+    g_autoptr(GString) id = g_string_new(NULL);
 
     /*
      * The default config_size is sizeof(struct virtio_net_config).
@@ -3824,6 +3868,10 @@ static void virtio_net_instance_init(Object *obj)
                                   DEVICE(n));
 
     ebpf_rss_init(&n->ebpf_rss);
+
+    g_string_printf(id, "%p", n);
+    register_savevm_live(id->str, VMSTATE_INSTANCE_ID_ANY, 1,
+                         &savevm_virtio_net_handlers, n);
 }
 
 static int virtio_net_pre_save(void *opaque)
-- 
2.39.3




* Re: [PATCH for 9.0 01/12] vdpa: do not set virtio status bits if unneeded
  2023-12-15 17:28 ` [PATCH for 9.0 01/12] vdpa: do not set virtio status bits if unneeded Eugenio Pérez
@ 2023-12-20  4:25   ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2023-12-20  4:25 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Next commits will set DRIVER and ACKNOWLEDGE flags repeatedly in the
> case of a migration destination.  Let's save ioctls with this.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks

> ---
>  hw/virtio/vhost-vdpa.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 7500c2fc82..cc252fc2d8 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -510,6 +510,10 @@ static int vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
>      if (ret < 0) {
>          return ret;
>      }
> +    if ((s & status) == status) {
> +        /* Don't set bits already set */
> +        return 0;
> +    }
>
>      s |= status;
>
> --
> 2.39.3
>




* Re: [PATCH for 9.0 02/12] vdpa: make batch_begin_once early return
  2023-12-15 17:28 ` [PATCH for 9.0 02/12] vdpa: make batch_begin_once early return Eugenio Pérez
@ 2023-12-20  4:27   ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2023-12-20  4:27 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Prefer early return so it is easier to merge
> vhost_vdpa_listener_begin_batch here and make iotlb batches begin and end
> symmetrical.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks

> ---
>  hw/virtio/vhost-vdpa.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index cc252fc2d8..bf9771870a 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -160,11 +160,12 @@ static void vhost_vdpa_listener_begin_batch(VhostVDPAShared *s)
>
>  static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
>  {
> -    if (s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH) &&
> -        !s->iotlb_batch_begin_sent) {
> -        vhost_vdpa_listener_begin_batch(s);
> +    if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH)) ||
> +        s->iotlb_batch_begin_sent) {
> +        return;
>      }
>
> +    vhost_vdpa_listener_begin_batch(s);
>      s->iotlb_batch_begin_sent = true;
>  }
>
> --
> 2.39.3
>




* Re: [PATCH for 9.0 03/12] vdpa: merge _begin_batch into _batch_begin_once
  2023-12-15 17:28 ` [PATCH for 9.0 03/12] vdpa: merge _begin_batch into _batch_begin_once Eugenio Pérez
@ 2023-12-20  4:30   ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2023-12-20  4:30 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> There was only one call.  This way we can make the begin and end of the
> batch symmetrical.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks




* Re: [PATCH for 9.0 04/12] vdpa: extract out _dma_end_batch from _listener_commit
  2023-12-15 17:28 ` [PATCH for 9.0 04/12] vdpa: extract out _dma_end_batch from _listener_commit Eugenio Pérez
@ 2023-12-20  4:31   ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2023-12-20  4:31 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> So we can call vhost_vdpa_dma_end_batch from outside the listener
> callbacks.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks

> ---
>  hw/virtio/vhost-vdpa.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index a533fc5bc7..57a8043cd4 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -164,9 +164,8 @@ static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
>      s->iotlb_batch_begin_sent = true;
>  }
>
> -static void vhost_vdpa_listener_commit(MemoryListener *listener)
> +static void vhost_vdpa_dma_end_batch(VhostVDPAShared *s)
>  {
> -    VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
>      struct vhost_msg_v2 msg = {};
>      int fd = s->device_fd;
>
> @@ -190,6 +189,13 @@ static void vhost_vdpa_listener_commit(MemoryListener *listener)
>      s->iotlb_batch_begin_sent = false;
>  }
>
> +static void vhost_vdpa_listener_commit(MemoryListener *listener)
> +{
> +    VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
> +
> +    vhost_vdpa_dma_end_batch(s);
> +}
> +
>  static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>  {
>      struct vdpa_iommu *iommu = container_of(n, struct vdpa_iommu, n);
> --
> 2.39.3
>




* Re: [PATCH for 9.0 05/12] vdpa: factor out stop path of vhost_vdpa_dev_start
  2023-12-15 17:28 ` [PATCH for 9.0 05/12] vdpa: factor out stop path of vhost_vdpa_dev_start Eugenio Pérez
@ 2023-12-20  4:31   ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2023-12-20  4:31 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This makes it easier to build an error path in the next patches.  No
> functional change.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks

> ---
>  hw/virtio/vhost-vdpa.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 57a8043cd4..449c3794b2 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -1302,7 +1302,7 @@ static void vhost_vdpa_suspend(struct vhost_dev *dev)
>  static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>  {
>      struct vhost_vdpa *v = dev->opaque;
> -    bool ok;
> +    bool ok = true;
>      trace_vhost_vdpa_dev_start(dev, started);
>
>      if (started) {
> @@ -1313,8 +1313,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>          }
>      } else {
>          vhost_vdpa_suspend(dev);
> -        vhost_vdpa_svqs_stop(dev);
> -        vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
> +        goto out_stop;
>      }
>
>      if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
> @@ -1333,6 +1332,11 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>      }
>
>      return 0;
> +
> +out_stop:
> +    vhost_vdpa_svqs_stop(dev);
> +    vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
> +    return ok ? 0 : -1;
>  }
>
>  static void vhost_vdpa_reset_status(struct vhost_dev *dev)
> --
> 2.39.3
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 07/12] vdpa: set backend capabilities at vhost_vdpa_init
  2023-12-15 17:28 ` [PATCH for 9.0 07/12] vdpa: set backend capabilities at vhost_vdpa_init Eugenio Pérez
@ 2023-12-20  4:34   ` Jason Wang
  2023-12-20  7:07     ` Eugenio Perez Martin
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2023-12-20  4:34 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> The backend does not reset them until the vdpa file descriptor is closed
> so there is no harm in doing it only once.
>
> This allows the destination of a live migration to premap memory in
> batches, using VHOST_BACKEND_F_IOTLB_BATCH.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  hw/virtio/vhost-vdpa.c | 50 ++++++++++++++++--------------------------
>  1 file changed, 19 insertions(+), 31 deletions(-)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 449c3794b2..43f7c382b1 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -587,11 +587,25 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
>      struct vhost_vdpa *v = opaque;
>      assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);
>      trace_vhost_vdpa_init(dev, v->shared, opaque);
> +    uint64_t backend_features;
> +    uint64_t qemu_backend_features = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
> +                                     0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
> +                                     0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
> +                                     0x1ULL << VHOST_BACKEND_F_SUSPEND;
>      int ret;
>
>      v->dev = dev;
>      dev->opaque =  opaque ;
>      v->shared->listener = vhost_vdpa_memory_listener;
> +
> +    if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &backend_features)) {
> +        return -EFAULT;
> +    }
> +
> +    backend_features &= qemu_backend_features;
> +
> +    dev->backend_cap = backend_features;
> +    v->shared->backend_cap = backend_features;
>      vhost_vdpa_init_svq(dev, v);
>
>      error_propagate(&dev->migration_blocker, v->migration_blocker);
> @@ -599,6 +613,11 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
>          return 0;
>      }
>
> +    ret = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &backend_features);
> +    if (ret) {
> +        return -EFAULT;
> +    }
> +
>      /*
>       * If dev->shadow_vqs_enabled at initialization that means the device has
>       * been started with x-svq=on, so don't block migration
> @@ -829,36 +848,6 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
>      return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
>  }
>
> -static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)

How about keeping this function but just calling it in vhost_vdpa_init()?

Thanks



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup
  2023-12-15 17:28 ` [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup Eugenio Pérez
@ 2023-12-20  5:21   ` Jason Wang
  2023-12-20  7:06     ` Eugenio Perez Martin
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2023-12-20  5:21 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Callers can use this function to set up the incoming migration thread.
>
> This thread is able to map the guest memory while the migration is
> ongoing, without blocking QMP or other important tasks. While this
> allows the destination QEMU not to block, it expands the mapping time
> during migration instead of making it pre-migration.

If it's just QMP, can we simply use bh with a quota here?

Btw, have you measured the hotspot that causes such slowness? Is it
pinning or vendor specific mapping that slows down the progress? Or if
VFIO has a similar issue?

>
> This thread joins at vdpa backend device start, so it could happen that
> the guest memory is so large that we still have guest memory to map
> before this time.

So we would still hit the QMP stall in this case?

> This can be improved in later iterations, when the
> destination device can inform QEMU that it is not ready to complete the
> migration.
>
> If the device is not started, the cleanup of the mapped memory is done at
> .load_cleanup.  This is far from ideal, as the destination machine has
> mapped all the guest RAM for nothing, and now it needs to unmap it.
> However, we don't have information about the state of the device, so it's
> the best we can do.  Once iterative migration is supported, this will be
> improved as we will know the virtio state of the device.
>
> If the VM migrates before finishing all the maps, the source will stop
> but the destination is still not ready to continue, and it will wait
> until all guest RAM is mapped.  It is still an improvement over doing
> all the mapping when the migration finishes, but later patches use the
> switchover_ack method to prevent the source from stopping until all the
> memory is mapped at the destination.
>
> The memory unmapping if the device is not started is weird
> too, as ideally nothing would be mapped.  This can be fixed when we
> migrate the device state iteratively, and we know for sure if the device
> is started or not.  At this moment we don't have such information so
> there is no better alternative.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
> ---

Thanks



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup
  2023-12-20  5:21   ` Jason Wang
@ 2023-12-20  7:06     ` Eugenio Perez Martin
  2023-12-21  2:17       ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Eugenio Perez Martin @ 2023-12-20  7:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Wed, Dec 20, 2023 at 6:22 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > Callers can use this function to setup the incoming migration thread.
> >
> > This thread is able to map the guest memory while the migration is
> > ongoing, without blocking QMP or other important tasks. While this
> > allows the destination QEMU not to block, it expands the mapping time
> > during migration instead of making it pre-migration.
>
> If it's just QMP, can we simply use bh with a quota here?
>

Because QEMU cannot guarantee the quota at write(fd,
VHOST_IOTLB_UPDATE, ...). Also, synchronization with
vhost_vdpa_dev_start would get more complicated, as it would need to be
re-scheduled too.

As a half-baked idea, we could split the mapping into chunks of a
manageable size, but I don't like that idea a lot.
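
A minimal sketch of what that split could look like, just to illustrate
the idea (the helper name is made up, the chunk size is arbitrary and
would need tuning, and it assumes vhost_vdpa_dma_map() keeps its current
arguments):

    static int vhost_vdpa_dma_map_chunked(VhostVDPAShared *s, uint32_t asid,
                                          hwaddr iova, hwaddr size,
                                          void *vaddr, bool readonly)
    {
        const hwaddr chunk = 1 * GiB;

        while (size > 0) {
            hwaddr len = MIN(size, chunk);
            int r = vhost_vdpa_dma_map(s, asid, iova, len, vaddr, readonly);

            if (unlikely(r != 0)) {
                return r;
            }
            iova += len;
            vaddr = (char *)vaddr + len;
            size -= len;
            /* A bh could yield / reschedule here to keep each step bounded. */
        }

        return 0;
    }

Each chunk still blocks for the duration of one write(), so the quota
would only be respected at chunk granularity.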

> Btw, have you measured the hotspot that causes such slowness? Is it
> pinning or vendor specific mapping that slows down the progress? Or if
> VFIO has a similar issue?
>

Si-Wei did the actual profiling as he is the one with the 128G guests,
but most of the time was spent in the memory pinning. Si-Wei, please
correct me if I'm wrong.

I didn't check VFIO, but I think it just maps at the realize phase with
vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
previous testing, doing it that way delayed the VM initialization by a
lot, as we're moving that 20s of blocking to every VM start.

Investigating a way to do it only in the case of being the destination
of a live migration, I think the right place is the .load_setup migration
handler. But I'm ok to move it for sure.
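
Just as a reference of what hooking in there looks like, a rough sketch
of the registration (handler names here are only illustrative, not the
exact ones this series adds):

    static const SaveVMHandlers savevm_vdpa_net_handlers = {
        .load_setup = vhost_vdpa_net_load_setup_hdlr,
        .load_cleanup = vhost_vdpa_net_load_cleanup_hdlr,
        .switchover_ack_needed = vhost_vdpa_net_ack_needed_hdlr,
    };

    /* Called once per device, e.g. from the virtio-net realize path. */
    static void vdpa_net_register_migration(VirtIONet *n)
    {
        register_savevm_live("virtio-net-vdpa", VMSTATE_INSTANCE_ID_ANY, 1,
                             &savevm_vdpa_net_handlers, n);
    }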

> >
> > This thread joins at vdpa backend device start, so it could happen that
> > the guest memory is so large that we still have guest memory to map
> > before this time.
>
> So we would still hit the QMP stall in this case?
>

This paragraph is kind of outdated, sorry. I can only cause this if I
don't enable the switchover_ack migration capability and if I
artificially slow down memory pinning in the kernel. But I didn't
check QMP to be honest, so I can try to test it, yes.

If QMP is not responsive, that means QMP is already not responsive in
QEMU master during that period. So we're only improving anyway.

Thanks!

> > This can be improved in later iterations, when the
> > destination device can inform QEMU that it is not ready to complete the
> > migration.
> >
> > If the device is not started, the clean of the mapped memory is done at
> > .load_cleanup.  This is far from ideal, as the destination machine has
> > mapped all the guest ram for nothing, and now it needs to unmap it.
> > However, we don't have information about the state of the device so its
> > the best we can do.  Once iterative migration is supported, this will be
> > improved as we know the virtio state of the device.
> >
> > If the VM migrates before finishing all the maps, the source will stop
> > but the destination is still not ready to continue, and it will wait
> > until all guest RAM is mapped.  It is still an improvement over doing
> > all the map when the migration finish, but next patches use the
> > switchover_ack method to prevent source to stop until all the memory is
> > mapped at the destination.
> >
> > The memory unmapping if the device is not started is weird
> > too, as ideally nothing would be mapped.  This can be fixed when we
> > migrate the device state iteratively, and we know for sure if the device
> > is started or not.  At this moment we don't have such information so
> > there is no better alternative.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >
> > ---
>
> Thanks
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 07/12] vdpa: set backend capabilities at vhost_vdpa_init
  2023-12-20  4:34   ` Jason Wang
@ 2023-12-20  7:07     ` Eugenio Perez Martin
  2023-12-21  3:39       ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Eugenio Perez Martin @ 2023-12-20  7:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Wed, Dec 20, 2023 at 5:34 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > The backend does not reset them until the vdpa file descriptor is closed
> > so there is no harm in doing it only once.
> >
> > This allows the destination of a live migration to premap memory in
> > batches, using VHOST_BACKEND_F_IOTLB_BATCH.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  hw/virtio/vhost-vdpa.c | 50 ++++++++++++++++--------------------------
> >  1 file changed, 19 insertions(+), 31 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 449c3794b2..43f7c382b1 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -587,11 +587,25 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> >      struct vhost_vdpa *v = opaque;
> >      assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);
> >      trace_vhost_vdpa_init(dev, v->shared, opaque);
> > +    uint64_t backend_features;
> > +    uint64_t qemu_backend_features = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
> > +                                     0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
> > +                                     0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
> > +                                     0x1ULL << VHOST_BACKEND_F_SUSPEND;
> >      int ret;
> >
> >      v->dev = dev;
> >      dev->opaque =  opaque ;
> >      v->shared->listener = vhost_vdpa_memory_listener;
> > +
> > +    if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &backend_features)) {
> > +        return -EFAULT;
> > +    }
> > +
> > +    backend_features &= qemu_backend_features;
> > +
> > +    dev->backend_cap = backend_features;
> > +    v->shared->backend_cap = backend_features;
> >      vhost_vdpa_init_svq(dev, v);
> >
> >      error_propagate(&dev->migration_blocker, v->migration_blocker);
> > @@ -599,6 +613,11 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> >          return 0;
> >      }
> >
> > +    ret = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &backend_features);
> > +    if (ret) {
> > +        return -EFAULT;
> > +    }
> > +
> >      /*
> >       * If dev->shadow_vqs_enabled at initialization that means the device has
> >       * been started with x-svq=on, so don't block migration
> > @@ -829,36 +848,6 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
> >      return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
> >  }
> >
> > -static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
>
> How about keeping this function but just calling it in vhost_vdpa_init()?
>

Sure, that is possible. I need to remove the VhostOps assignment
vhost_set_backend_cap = vhost_vdpa_set_backend_cap anyway; is that ok
for you?

Thanks!



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup
  2023-12-20  7:06     ` Eugenio Perez Martin
@ 2023-12-21  2:17       ` Jason Wang
  2023-12-21  8:20         ` Eugenio Perez Martin
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2023-12-21  2:17 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier, Peter Xu

On Wed, Dec 20, 2023 at 3:07 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Dec 20, 2023 at 6:22 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > Callers can use this function to setup the incoming migration thread.
> > >
> > > This thread is able to map the guest memory while the migration is
> > > ongoing, without blocking QMP or other important tasks. While this
> > > allows the destination QEMU not to block, it expands the mapping time
> > > during migration instead of making it pre-migration.
> >
> > If it's just QMP, can we simply use bh with a quota here?
> >
>
> Because QEMU cannot guarantee the quota at write(fd,
> VHOST_IOTLB_UPDATE, ...).

So you mean the delay may be caused by a single syscall?

> Also, synchronization with
> vhost_vdpa_dev_start would complicate as it would need to be
> re-scheduled too.

Just a flush of the bh, or not?

But another question: how do we synchronize with the memory API in this
case? Currently the updating (without vIOMMU) is done under the
listener callback.

Usually after the commit, QEMU may think the memory topology has been
updated. If it is done asynchronously, would we have any problem?

>
> As a half-baked idea, we can split the mapping chunks in manageable
> sizes, but I don't like that idea a lot.
>
> > Btw, have you measured the hotspot that causes such slowness? Is it
> > pinning or vendor specific mapping that slows down the progress? Or if
> > VFIO has a similar issue?
> >
>
> Si-Wei did the actual profiling as he is the one with the 128G guests,
> but most of the time was spent in the memory pinning. Si-Wei, please
> correct me if I'm wrong.
>
> I didn't check VFIO, but I think it just maps at realize phase with
> vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> previous testings, this delayed the VM initialization by a lot, as
> we're moving that 20s of blocking to every VM start.
>
> Investigating a way to do it only in the case of being the destination
> of a live migration, I think the right place is .load_setup migration
> handler. But I'm ok to move it for sure.

Adding Peter for more ideas.

>
> > >
> > > This thread joins at vdpa backend device start, so it could happen that
> > > the guest memory is so large that we still have guest memory to map
> > > before this time.
> >
> > So we would still hit the QMP stall in this case?
> >
>
> This paragraph is kind of outdated, sorry. I can only cause this if I
> don't enable switchover_ack migration capability and if I artificially
> make memory pinning in the kernel artificially slow. But I didn't
> check QMP to be honest, so I can try to test it, yes.
>
> If QMP is not responsive, that means QMP is not responsive in QEMU
> master in that period actually. So we're only improving anyway.
>
> Thanks!
>

Thanks



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 07/12] vdpa: set backend capabilities at vhost_vdpa_init
  2023-12-20  7:07     ` Eugenio Perez Martin
@ 2023-12-21  3:39       ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2023-12-21  3:39 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Wed, Dec 20, 2023 at 3:08 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Dec 20, 2023 at 5:34 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > The backend does not reset them until the vdpa file descriptor is closed
> > > so there is no harm in doing it only once.
> > >
> > > This allows the destination of a live migration to premap memory in
> > > batches, using VHOST_BACKEND_F_IOTLB_BATCH.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >  hw/virtio/vhost-vdpa.c | 50 ++++++++++++++++--------------------------
> > >  1 file changed, 19 insertions(+), 31 deletions(-)
> > >
> > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > index 449c3794b2..43f7c382b1 100644
> > > --- a/hw/virtio/vhost-vdpa.c
> > > +++ b/hw/virtio/vhost-vdpa.c
> > > @@ -587,11 +587,25 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> > >      struct vhost_vdpa *v = opaque;
> > >      assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);
> > >      trace_vhost_vdpa_init(dev, v->shared, opaque);
> > > +    uint64_t backend_features;
> > > +    uint64_t qemu_backend_features = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
> > > +                                     0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
> > > +                                     0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
> > > +                                     0x1ULL << VHOST_BACKEND_F_SUSPEND;
> > >      int ret;
> > >
> > >      v->dev = dev;
> > >      dev->opaque =  opaque ;
> > >      v->shared->listener = vhost_vdpa_memory_listener;
> > > +
> > > +    if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &backend_features)) {
> > > +        return -EFAULT;
> > > +    }
> > > +
> > > +    backend_features &= qemu_backend_features;
> > > +
> > > +    dev->backend_cap = backend_features;
> > > +    v->shared->backend_cap = backend_features;
> > >      vhost_vdpa_init_svq(dev, v);
> > >
> > >      error_propagate(&dev->migration_blocker, v->migration_blocker);
> > > @@ -599,6 +613,11 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> > >          return 0;
> > >      }
> > >
> > > +    ret = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &backend_features);
> > > +    if (ret) {
> > > +        return -EFAULT;
> > > +    }
> > > +
> > >      /*
> > >       * If dev->shadow_vqs_enabled at initialization that means the device has
> > >       * been started with x-svq=on, so don't block migration
> > > @@ -829,36 +848,6 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
> > >      return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
> > >  }
> > >
> > > -static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
> >
> > How about keeping this function but just calling it in vhost_vdpa_init()?
> >
>
> Sure, that is possible. I need to remove the VhostOps
> vhost_set_backend_cap = vhost_vdpa_set_backend_cap, anyway, is that ok
> for you?

Fine with me.

Thanks

>
> Thanks!
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup
  2023-12-21  2:17       ` Jason Wang
@ 2023-12-21  8:20         ` Eugenio Perez Martin
  2024-01-02  5:33           ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Eugenio Perez Martin @ 2023-12-21  8:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier, Peter Xu

On Thu, Dec 21, 2023 at 3:17 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Dec 20, 2023 at 3:07 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Wed, Dec 20, 2023 at 6:22 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > Callers can use this function to setup the incoming migration thread.
> > > >
> > > > This thread is able to map the guest memory while the migration is
> > > > ongoing, without blocking QMP or other important tasks. While this
> > > > allows the destination QEMU not to block, it expands the mapping time
> > > > during migration instead of making it pre-migration.
> > >
> > > If it's just QMP, can we simply use bh with a quota here?
> > >
> >
> > Because QEMU cannot guarantee the quota at write(fd,
> > VHOST_IOTLB_UPDATE, ...).
>
> So you mean the delay may be caused by a single syscall?
>

Mostly yes, the iotlb write() that maps all of the guest memory.
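
For context, each of those maps boils down to one message written to
the vdpa fd, roughly like this (simplified from what the dma_map path
does today, error handling elided):

    struct vhost_msg_v2 msg = {
        .type = VHOST_IOTLB_MSG_V2,
        .asid = asid,
        .iotlb.iova = iova,
        .iotlb.size = size,
        .iotlb.uaddr = (uint64_t)(uintptr_t)vaddr,
        .iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW,
        .iotlb.type = VHOST_IOTLB_UPDATE,
    };

    if (write(s->device_fd, &msg, sizeof(msg)) != sizeof(msg)) {
        /* The kernel could not map (and pin) the region. */
    }

The kernel pins every page backing [uaddr, uaddr + size) inside that
single write(), so QEMU cannot bound how long the call takes.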

> > Also, synchronization with
> > vhost_vdpa_dev_start would complicate as it would need to be
> > re-scheduled too.
>
> Just a flush of the bh, or not?
>

Let me put it differently: to map the guest memory, vhost_vdpa_dma_map
is called because the guest starts the device by a PCI write to the
device status:
#0  vhost_vdpa_dma_map (s=0x5555570e0e60, asid=0, iova=0, size=786432,
vaddr=0x7fff40000000, readonly=false)
    at ../hw/virtio/vhost-vdpa.c:93
#1  0x0000555555979451 in vhost_vdpa_listener_region_add
(listener=0x5555570e0e68, section=0x7fffee5bc0d0) at
../hw/virtio/vhost-vdpa.c:415
#2  0x0000555555b3c543 in listener_add_address_space
(listener=0x5555570e0e68, as=0x555556db72e0 <address_space_memory>)
    at ../system/memory.c:3011
#3  0x0000555555b3c996 in memory_listener_register
(listener=0x5555570e0e68, as=0x555556db72e0 <address_space_memory>)
    at ../system/memory.c:3081
#4  0x000055555597be03 in vhost_vdpa_dev_start (dev=0x5555570e1310,
started=true) at ../hw/virtio/vhost-vdpa.c:1460
#5  0x00005555559734c2 in vhost_dev_start (hdev=0x5555570e1310,
vdev=0x5555584b2c80, vrings=false) at ../hw/virtio/vhost.c:2058
#6  0x0000555555854ec8 in vhost_net_start_one (net=0x5555570e1310,
dev=0x5555584b2c80) at ../hw/net/vhost_net.c:274
#7  0x00005555558554ca in vhost_net_start (dev=0x5555584b2c80,
ncs=0x5555584c8278, data_queue_pairs=1, cvq=1) at
../hw/net/vhost_net.c:415
#8  0x0000555555ace7a5 in virtio_net_vhost_status (n=0x5555584b2c80,
status=15 '\017') at ../hw/net/virtio-net.c:310
#9  0x0000555555acea50 in virtio_net_set_status (vdev=0x5555584b2c80,
status=15 '\017') at ../hw/net/virtio-net.c:391
#10 0x0000555555b06fee in virtio_set_status (vdev=0x5555584b2c80,
val=15 '\017') at ../hw/virtio/virtio.c:2048
#11 0x000055555595d667 in virtio_pci_common_write
(opaque=0x5555584aa8b0, addr=20, val=15, size=1) at
../hw/virtio/virtio-pci.c:1580
#12 0x0000555555b351c1 in memory_region_write_accessor
(mr=0x5555584ab3f0, addr=20, value=0x7fffee5bc4c8, size=1, shift=0,
mask=255,
    attrs=...) at ../system/memory.c:497
#13 0x0000555555b354c5 in access_with_adjusted_size (addr=20,
value=0x7fffee5bc4c8, size=1, access_size_min=1, access_size_max=4,
    access_fn=0x555555b350cf <memory_region_write_accessor>,
mr=0x5555584ab3f0, attrs=...) at ../system/memory.c:573
#14 0x0000555555b3856f in memory_region_dispatch_write
(mr=0x5555584ab3f0, addr=20, data=15, op=MO_8, attrs=...) at
../system/memory.c:1521
#15 0x0000555555b45885 in flatview_write_continue (fv=0x7fffd8122b80,
addr=4227858452, attrs=..., ptr=0x7ffff7ff0028, len=1, addr1=20,
    l=1, mr=0x5555584ab3f0) at ../system/physmem.c:2714
#16 0x0000555555b459e8 in flatview_write (fv=0x7fffd8122b80,
addr=4227858452, attrs=..., buf=0x7ffff7ff0028, len=1)
    at ../system/physmem.c:2756
#17 0x0000555555b45d9a in address_space_write (as=0x555556db72e0
<address_space_memory>, addr=4227858452, attrs=...,
buf=0x7ffff7ff0028,
    len=1) at ../system/physmem.c:2863
#18 0x0000555555b45e07 in address_space_rw (as=0x555556db72e0
<address_space_memory>, addr=4227858452, attrs=...,
buf=0x7ffff7ff0028,
    len=1, is_write=true) at ../system/physmem.c:2873
#19 0x0000555555b5eb30 in kvm_cpu_exec (cpu=0x5555571258f0) at
../accel/kvm/kvm-all.c:2915
#20 0x0000555555b61798 in kvm_vcpu_thread_fn (arg=0x5555571258f0) at
../accel/kvm/kvm-accel-ops.c:51
#21 0x0000555555d384b7 in qemu_thread_start (args=0x55555712c390) at
../util/qemu-thread-posix.c:541
#22 0x00007ffff580814a in start_thread () from /lib64/libpthread.so.0
#23 0x00007ffff54fcf23 in clone () from /lib64/libc.so.6

Can we reschedule that map to a bh without returning control to the vCPU?

> But another question. How to synchronize with the memory API in this
> case. Currently the updating (without vIOMMU) is done under the
> listener callback.
>
> Usually after the commit, Qemu may think the memory topology has been
> updated. If it is done asynchronously, would we have any problem?
>

The function vhost_vdpa_process_iotlb_msg in the kernel has its own
lock, so two QEMU threads can map memory independently and they get
serialized.

For the write() caller, it just looks like the call takes more time,
but there are no deadlocks or similar issues.

> >
> > As a half-baked idea, we can split the mapping chunks in manageable
> > sizes, but I don't like that idea a lot.
> >
> > > Btw, have you measured the hotspot that causes such slowness? Is it
> > > pinning or vendor specific mapping that slows down the progress? Or if
> > > VFIO has a similar issue?
> > >
> >
> > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > but most of the time was spent in the memory pinning. Si-Wei, please
> > correct me if I'm wrong.
> >
> > I didn't check VFIO, but I think it just maps at realize phase with
> > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > previous testings, this delayed the VM initialization by a lot, as
> > we're moving that 20s of blocking to every VM start.
> >
> > Investigating a way to do it only in the case of being the destination
> > of a live migration, I think the right place is .load_setup migration
> > handler. But I'm ok to move it for sure.
>
> Adding Peter for more ideas.
>

Appreciated :).

Thanks!

> >
> > > >
> > > > This thread joins at vdpa backend device start, so it could happen that
> > > > the guest memory is so large that we still have guest memory to map
> > > > before this time.
> > >
> > > So we would still hit the QMP stall in this case?
> > >
> >
> > This paragraph is kind of outdated, sorry. I can only cause this if I
> > don't enable switchover_ack migration capability and if I artificially
> > make memory pinning in the kernel artificially slow. But I didn't
> > check QMP to be honest, so I can try to test it, yes.
> >
> > If QMP is not responsive, that means QMP is not responsive in QEMU
> > master in that period actually. So we're only improving anyway.
> >
> > Thanks!
> >
>
> Thanks
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (11 preceding siblings ...)
  2023-12-15 17:28 ` [PATCH for 9.0 12/12] virtio_net: register incremental migration handlers Eugenio Pérez
@ 2023-12-25  1:41 ` Lei Yang
  2023-12-25 16:30 ` Michael S. Tsirkin
  13 siblings, 0 replies; 33+ messages in thread
From: Lei Yang @ 2023-12-25  1:41 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, Michael S. Tsirkin, si-wei.liu, Jason Wang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

QE tested this series with regression tests; there are no new regression issues.

Tested-by: Lei Yang <leiyang@redhat.com>



On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration is
> stopped, and before the workload is resumed at the destination.  This is a
> period where neigher traffic can flow, nor the VM workload can continue
> (downtime).
>
> We can do better as we know the memory layout of the guest RAM at the
> destination from the moment the migration starts.  Moving that operation allows
> QEMU to communicate the kernel the maps while the workload is still running in
> the source, so Linux can start mapping them.
>
> Also, the destination of the guest memory may finish before the destination
> QEMU maps all the memory.  In this case, the rest of the memory will be mapped
> at the same time as before applying this series, when the device is starting.
> So we're only improving with this series.
>
> If the destination has the switchover_ack capability enabled, the destination
> hold the migration until all the memory is mapped.
>
> This needs to be applied on top of [1]. That series performs some code
> reorganization that allows to map the guest memory without knowing the queue
> layout the guest configure on the device.
>
> This series reduced the downtime in the stop-and-copy phase of the live
> migration from 20s~30s to 5s, with a 128G mem guest and two mlx5_vdpa devices,
> per [2].
>
> Future directions on top of this series may include:
> * Iterative migration of virtio-net devices, as it may reduce downtime per [3].
>   vhost-vdpa net can apply the configuration through CVQ in the destination
>   while the source is still migrating.
> * Move more things ahead of migration time, like DRIVER_OK.
> * Check that the devices of the destination are valid, and cancel the migration
>   in case it is not.
>
> v1 from RFC v2:
> * Hold on migration if memory has not been mapped in full with switchover_ack.
> * Revert map if the device is not started.
>
> RFC v2:
> * Delegate map to another thread so it does no block QMP.
> * Fix not allocating iova_tree if x-svq=on at the destination.
> * Rebased on latest master.
> * More cleanups of current code, that might be split from this series too.
>
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg01986.html
> [2] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg00909.html
> [3] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>
> Eugenio Pérez (12):
>   vdpa: do not set virtio status bits if unneeded
>   vdpa: make batch_begin_once early return
>   vdpa: merge _begin_batch into _batch_begin_once
>   vdpa: extract out _dma_end_batch from _listener_commit
>   vdpa: factor out stop path of vhost_vdpa_dev_start
>   vdpa: check for iova tree initialized at net_client_start
>   vdpa: set backend capabilities at vhost_vdpa_init
>   vdpa: add vhost_vdpa_load_setup
>   vdpa: approve switchover after memory map in the migration destination
>   vdpa: add vhost_vdpa_net_load_setup NetClient callback
>   vdpa: add vhost_vdpa_net_switchover_ack_needed
>   virtio_net: register incremental migration handlers
>
>  include/hw/virtio/vhost-vdpa.h |  32 ++++
>  include/net/net.h              |   8 +
>  hw/net/virtio-net.c            |  48 ++++++
>  hw/virtio/vhost-vdpa.c         | 274 +++++++++++++++++++++++++++------
>  net/vhost-vdpa.c               |  43 +++++-
>  5 files changed, 357 insertions(+), 48 deletions(-)
>
> --
> 2.39.3
>
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration
  2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (12 preceding siblings ...)
  2023-12-25  1:41 ` [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Lei Yang
@ 2023-12-25 16:30 ` Michael S. Tsirkin
  2024-01-02 14:38   ` Eugenio Perez Martin
  13 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2023-12-25 16:30 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, si-wei.liu, Lei Yang, Jason Wang, Dragos Tatulea,
	Zhu Lingshan, Parav Pandit, Stefano Garzarella, Laurent Vivier

On Fri, Dec 15, 2023 at 06:28:18PM +0100, Eugenio Pérez wrote:
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration is
> stopped, and before the workload is resumed at the destination.  This is a
> period where neigher traffic can flow, nor the VM workload can continue
> (downtime).
> 
> We can do better as we know the memory layout of the guest RAM at the
> destination from the moment the migration starts.  Moving that operation allows
> QEMU to communicate the kernel the maps while the workload is still running in
> the source, so Linux can start mapping them.
> 
> Also, the destination of the guest memory may finish before the destination
> QEMU maps all the memory.  In this case, the rest of the memory will be mapped
> at the same time as before applying this series, when the device is starting.
> So we're only improving with this series.
> 
> If the destination has the switchover_ack capability enabled, the destination
> hold the migration until all the memory is mapped.
> 
> This needs to be applied on top of [1]. That series performs some code
> reorganization that allows to map the guest memory without knowing the queue
> layout the guest configure on the device.
> 
> This series reduced the downtime in the stop-and-copy phase of the live
> migration from 20s~30s to 5s, with a 128G mem guest and two mlx5_vdpa devices,
> per [2].

I think this is reasonable and could be applied - batching is good.
Could you rebase on master and repost please?

> Future directions on top of this series may include:
> * Iterative migration of virtio-net devices, as it may reduce downtime per [3].
>   vhost-vdpa net can apply the configuration through CVQ in the destination
>   while the source is still migrating.
> * Move more things ahead of migration time, like DRIVER_OK.
> * Check that the devices of the destination are valid, and cancel the migration
>   in case it is not.
> 
> v1 from RFC v2:
> * Hold on migration if memory has not been mapped in full with switchover_ack.
> * Revert map if the device is not started.
> 
> RFC v2:
> * Delegate map to another thread so it does no block QMP.
> * Fix not allocating iova_tree if x-svq=on at the destination.
> * Rebased on latest master.
> * More cleanups of current code, that might be split from this series too.
> 
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg01986.html
> [2] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg00909.html
> [3] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> 
> Eugenio Pérez (12):
>   vdpa: do not set virtio status bits if unneeded
>   vdpa: make batch_begin_once early return
>   vdpa: merge _begin_batch into _batch_begin_once
>   vdpa: extract out _dma_end_batch from _listener_commit
>   vdpa: factor out stop path of vhost_vdpa_dev_start
>   vdpa: check for iova tree initialized at net_client_start
>   vdpa: set backend capabilities at vhost_vdpa_init
>   vdpa: add vhost_vdpa_load_setup
>   vdpa: approve switchover after memory map in the migration destination
>   vdpa: add vhost_vdpa_net_load_setup NetClient callback
>   vdpa: add vhost_vdpa_net_switchover_ack_needed
>   virtio_net: register incremental migration handlers
> 
>  include/hw/virtio/vhost-vdpa.h |  32 ++++
>  include/net/net.h              |   8 +
>  hw/net/virtio-net.c            |  48 ++++++
>  hw/virtio/vhost-vdpa.c         | 274 +++++++++++++++++++++++++++------
>  net/vhost-vdpa.c               |  43 +++++-
>  5 files changed, 357 insertions(+), 48 deletions(-)
> 
> -- 
> 2.39.3
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup
  2023-12-21  8:20         ` Eugenio Perez Martin
@ 2024-01-02  5:33           ` Peter Xu
  2024-01-02 11:28             ` Eugenio Perez Martin
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2024-01-02  5:33 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

Jason, Eugenio,

Apologies for a late reply; just back from the long holiday.

On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> Si-Wei did the actual profiling as he is the one with the 128G guests,
> but most of the time was spent in the memory pinning. Si-Wei, please
> correct me if I'm wrong.

IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
take a lot of time if it's similar to what VFIO does.

>
> I didn't check VFIO, but I think it just maps at realize phase with
> vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> previous testings, this delayed the VM initialization by a lot, as
> we're moving that 20s of blocking to every VM start.
>
> Investigating a way to do it only in the case of being the destination
> of a live migration, I think the right place is .load_setup migration
> handler. But I'm ok to move it for sure.

If it's destined to map the 128G, it does sound sensible to me to do it
when VM starts, rather than anytime afterwards.

Could anyone help to explain what's the problem if vDPA maps 128G at VM
init just like what VFIO does?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup
  2024-01-02  5:33           ` Peter Xu
@ 2024-01-02 11:28             ` Eugenio Perez Martin
  2024-01-03  6:16               ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Eugenio Perez Martin @ 2024-01-02 11:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier

On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
>
> Jason, Eugenio,
>
> Apologies for a late reply; just back from the long holiday.
>
> On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > but most of the time was spent in the memory pinning. Si-Wei, please
> > correct me if I'm wrong.
>
> IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> take a lot of time if it's similar to what VFIO does.
>
> >
> > I didn't check VFIO, but I think it just maps at realize phase with
> > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > previous testings, this delayed the VM initialization by a lot, as
> > we're moving that 20s of blocking to every VM start.
> >
> > Investigating a way to do it only in the case of being the destination
> > of a live migration, I think the right place is .load_setup migration
> > handler. But I'm ok to move it for sure.
>
> If it's destined to map the 128G, it does sound sensible to me to do it
> when VM starts, rather than anytime afterwards.
>

Just for completeness, it is not 100% certain that the driver will
start the device, but it is very likely.

> Could anyone help to explain what's the problem if vDPA maps 128G at VM
> init just like what VFIO does?
>

The main problem was the delay of VM start. In the master branch, the
pinning is done when the driver starts the device. Although that path
takes the BQL, the rest of the vCPUs can move work forward while the
host is pinning, so the impact is not so evident.

Moving it to initialization time made it very noticeable. To make
things worse, QEMU did not respond to QMP commands and similar. That's
why it was done only if the VM was the destination of a LM.

However, we've added the memory map thread in this version, so this
might not be a problem anymore. We could move the spawn of the thread
to initialization time.

But how to undo this pinning in the case the guest does not start the
device? In this series, this is done at the destination with
vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
long as QEMU has the vDPA device?

Thanks!



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration
  2023-12-25 16:30 ` Michael S. Tsirkin
@ 2024-01-02 14:38   ` Eugenio Perez Martin
  0 siblings, 0 replies; 33+ messages in thread
From: Eugenio Perez Martin @ 2024-01-02 14:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, si-wei.liu, Lei Yang, Jason Wang, Dragos Tatulea,
	Zhu Lingshan, Parav Pandit, Stefano Garzarella, Laurent Vivier

On Mon, Dec 25, 2023 at 5:31 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Dec 15, 2023 at 06:28:18PM +0100, Eugenio Pérez wrote:
> > Current memory operations like pinning may take a lot of time at the
> > destination.  Currently they are done after the source of the migration is
> > stopped, and before the workload is resumed at the destination.  This is a
> > period where neigher traffic can flow, nor the VM workload can continue
> > (downtime).
> >
> > We can do better as we know the memory layout of the guest RAM at the
> > destination from the moment the migration starts.  Moving that operation allows
> > QEMU to communicate the kernel the maps while the workload is still running in
> > the source, so Linux can start mapping them.
> >
> > Also, the destination of the guest memory may finish before the destination
> > QEMU maps all the memory.  In this case, the rest of the memory will be mapped
> > at the same time as before applying this series, when the device is starting.
> > So we're only improving with this series.
> >
> > If the destination has the switchover_ack capability enabled, the destination
> > hold the migration until all the memory is mapped.
> >
> > This needs to be applied on top of [1]. That series performs some code
> > reorganization that allows to map the guest memory without knowing the queue
> > layout the guest configure on the device.
> >
> > This series reduced the downtime in the stop-and-copy phase of the live
> > migration from 20s~30s to 5s, with a 128G mem guest and two mlx5_vdpa devices,
> > per [2].
>
> I think this is reasonable and could be applied - batching is good.
> Could you rebase on master and repost please?
>

New comments appeared in the meantime [1], but I'll rebase with the
needed changes after they converge.

Thanks!

[1] https://patchwork.kernel.org/comment/25653487/


> > Future directions on top of this series may include:
> > * Iterative migration of virtio-net devices, as it may reduce downtime per [3].
> >   vhost-vdpa net can apply the configuration through CVQ in the destination
> >   while the source is still migrating.
> > * Move more things ahead of migration time, like DRIVER_OK.
> > * Check that the devices of the destination are valid, and cancel the migration
> >   in case it is not.
> >
> > v1 from RFC v2:
> > * Hold on migration if memory has not been mapped in full with switchover_ack.
> > * Revert map if the device is not started.
> >
> > RFC v2:
> > * Delegate map to another thread so it does no block QMP.
> > * Fix not allocating iova_tree if x-svq=on at the destination.
> > * Rebased on latest master.
> > * More cleanups of current code, that might be split from this series too.
> >
> > [1] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg01986.html
> > [2] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg00909.html
> > [3] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >
> > Eugenio Pérez (12):
> >   vdpa: do not set virtio status bits if unneeded
> >   vdpa: make batch_begin_once early return
> >   vdpa: merge _begin_batch into _batch_begin_once
> >   vdpa: extract out _dma_end_batch from _listener_commit
> >   vdpa: factor out stop path of vhost_vdpa_dev_start
> >   vdpa: check for iova tree initialized at net_client_start
> >   vdpa: set backend capabilities at vhost_vdpa_init
> >   vdpa: add vhost_vdpa_load_setup
> >   vdpa: approve switchover after memory map in the migration destination
> >   vdpa: add vhost_vdpa_net_load_setup NetClient callback
> >   vdpa: add vhost_vdpa_net_switchover_ack_needed
> >   virtio_net: register incremental migration handlers
> >
> >  include/hw/virtio/vhost-vdpa.h |  32 ++++
> >  include/net/net.h              |   8 +
> >  hw/net/virtio-net.c            |  48 ++++++
> >  hw/virtio/vhost-vdpa.c         | 274 +++++++++++++++++++++++++++------
> >  net/vhost-vdpa.c               |  43 +++++-
> >  5 files changed, 357 insertions(+), 48 deletions(-)
> >
> > --
> > 2.39.3
> >
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup
  2024-01-02 11:28             ` Eugenio Perez Martin
@ 2024-01-03  6:16               ` Peter Xu
  2024-01-03 11:11                 ` Eugenio Perez Martin
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2024-01-03  6:16 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier, Alex Williamson

On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Jason, Eugenio,
> >
> > Apologies for a late reply; just back from the long holiday.
> >
> > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > > but most of the time was spent in the memory pinning. Si-Wei, please
> > > correct me if I'm wrong.
> >
> > IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> > take a lot of time if it's similar to what VFIO does.
> >
> > >
> > > I didn't check VFIO, but I think it just maps at realize phase with
> > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > previous testings, this delayed the VM initialization by a lot, as
> > > we're moving that 20s of blocking to every VM start.
> > >
> > > Investigating a way to do it only in the case of being the destination
> > > of a live migration, I think the right place is .load_setup migration
> > > handler. But I'm ok to move it for sure.
> >
> > If it's destined to map the 128G, it does sound sensible to me to do it
> > when VM starts, rather than anytime afterwards.
> >
> 
> Just for completion, it is not 100% sure the driver will start the
> device. But it is likely for sure.

My understanding is that vDPA is still a quite special device, assuming
only targeting advanced users, and should not appear in a default config
for anyone.  It means the user should hopefully remove the device if the
guest is not using it, instead of worrying on a slow boot.

> 
> > Could anyone help to explain what's the problem if vDPA maps 128G at VM
> > init just like what VFIO does?
> >
> 
> The main problem was the delay of VM start. In the master branch, the
> pinning is done when the driver starts the device. While it takes the
> BQL, the rest of the vCPUs can move work forward while the host is
> pinning. So the impact of it is not so evident.
> 
> To move it to initialization time made it very noticeable. To make
> things worse, QEMU did not respond to QMP commands and similar. That's
> why it was done only if the VM was the destination of a LM.

Is that a major issue for us?  IIUC then VFIO shares the same condition.
If it's a real problem, do we want to have a solution that works for both
(or, is it possible)?

> 
> However, we've added the memory map thread in this version, so this
> might not be a problem anymore. We could move the spawn of the thread
> to initialization time.
> 
> But how to undo this pinning in the case the guest does not start the
> device? In this series, this is done at the destination with
> vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> long as QEMU has the vDPA device?

I think even if vDPA decides to use a thread, we should keep the same
behavior before/after the migration.  Having asymmetric behavior over DMA
from the assigned HWs might have unpredictable implications.

What I worry is we may over-optimize / over-engineer the case where the
user will specify the vDPA device but not use it, as I mentioned above.

For the long term, maybe there's a chance to optimize DMA pinning for both
vdpa/vfio use cases, then we can always pin them during VM start? Assuming
that issue only exists for large VMs, while they should normally be good
candidates for huge pages already.  Then, it means maybe one folio/page can
cover a large range (e.g. 1G on x86_64) in one pin, and physical contiguity
also provides the possibility of IOMMU large page mappings.  I didn't check at
which stage we are for VFIO on this, Alex may know better. I'm copying Alex
anyway since the problem seems to be a common one already, so maybe he has
some thoughts.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup
  2024-01-03  6:16               ` Peter Xu
@ 2024-01-03 11:11                 ` Eugenio Perez Martin
  2024-01-04  6:46                   ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Eugenio Perez Martin @ 2024-01-03 11:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier, Alex Williamson

On Wed, Jan 3, 2024 at 7:16 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> > On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Jason, Eugenio,
> > >
> > > Apologies for a late reply; just back from the long holiday.
> > >
> > > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > > > but most of the time was spent in the memory pinning. Si-Wei, please
> > > > correct me if I'm wrong.
> > >
> > > IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> > > take a lot of time if it's similar to what VFIO does.
> > >
> > > >
> > > > I didn't check VFIO, but I think it just maps at realize phase with
> > > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > > previous testings, this delayed the VM initialization by a lot, as
> > > > we're moving that 20s of blocking to every VM start.
> > > >
> > > > Investigating a way to do it only in the case of being the destination
> > > > of a live migration, I think the right place is .load_setup migration
> > > > handler. But I'm ok to move it for sure.
> > >
> > > If it's destined to map the 128G, it does sound sensible to me to do it
> > > when VM starts, rather than anytime afterwards.
> > >
> >
> > Just for completion, it is not 100% sure the driver will start the
> > device. But it is likely for sure.
>
> My understanding is that vDPA is still a quite special device, assuming
> only targeting advanced users, and should not appear in a default config
> for anyone.  It means the user should hopefully remove the device if the
> guest is not using it, instead of worrying on a slow boot.
>
> >
> > > Could anyone help to explain what's the problem if vDPA maps 128G at VM
> > > init just like what VFIO does?
> > >
> >
> > The main problem was the delay of VM start. In the master branch, the
> > pinning is done when the driver starts the device. While it takes the
> > BQL, the rest of the vCPUs can move work forward while the host is
> > pinning. So the impact of it is not so evident.
> >
> > To move it to initialization time made it very noticeable. To make
> > things worse, QEMU did not respond to QMP commands and similar. That's
> > why it was done only if the VM was the destination of a LM.
>
> Is that a major issue for us?

To me it is a regression but I'm ok with it for sure.

>  IIUC then VFIO shares the same condition.
> If it's a real problem, do we want to have a solution that works for both
> (or, is it possible)?
>

I would not consider it a regression for VFIO, since I think it has
behaved that way from the beginning. But yes, I'm all in to find a
common solution.

> >
> > However, we've added the memory map thread in this version, so this
> > might not be a problem anymore. We could move the spawn of the thread
> > to initialization time.
> >
> > But how to undo this pinning in the case the guest does not start the
> > device? In this series, this is done at the destination with
> > vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> > long as QEMU has the vDPA device?
>
> I think even if vDPA decides to use a thread, we should keep the same
> behavior before/after the migration.  Having assymetric behavior over DMA
> from the assigned HWs might have unpredictable implications.
>
> What I worry is we may over-optimize / over-engineer the case where the
> user will specify the vDPA device but not use it, as I mentioned above.
>

I agree with all of the above. If it is ok to keep the memory mapped
while the guest has not started the device, I think we can move the
spawn of the thread, or even just the map write itself, to the vdpa init.
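
If we go that way, the spawn itself is cheap; a rough sketch (the field
and worker names are placeholders, the worker would be the same one
.load_setup uses in this series):

    static void vhost_vdpa_spawn_map_thread(VhostVDPAShared *s)
    {
        /* The worker walks the memory sections and issues the IOTLB maps. */
        qemu_thread_create(&s->map_thread, "vdpa-map",
                           vhost_vdpa_load_map_worker, s,
                           QEMU_THREAD_JOINABLE);
    }

Then vhost_vdpa_dev_start() (or the cleanup path, if the guest never
starts the device) would qemu_thread_join() it before relying on the
maps.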

> For the long term, maybe there's chance to optimize DMA pinning for both
> vdpa/vfio use cases, then we can always pin them during VM starts? Assuming
> that issue only exists for large VMs, while they should normally be good
> candidates for huge pages already.  Then, it means maybe one folio/page can
> cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity
> also provides possibility of IOMMU large page mappings.  I didn't check at
> which stage we are for VFIO on this, Alex may know better.

Sounds interesting, and I think it should be implemented. Thanks for
the pointer!

> I'm copying Alex
> anyway since the problem seems to be a common one already, so maybe he has
> some thoughts.
>

Appreciated :).

Thanks!



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup
  2024-01-03 11:11                 ` Eugenio Perez Martin
@ 2024-01-04  6:46                   ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2024-01-04  6:46 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, qemu-devel, Michael S. Tsirkin, si-wei.liu, Lei Yang,
	Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
	Laurent Vivier, Alex Williamson

On Wed, Jan 03, 2024 at 12:11:19PM +0100, Eugenio Perez Martin wrote:
> On Wed, Jan 3, 2024 at 7:16 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> > > On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Jason, Eugenio,
> > > >
> > > > Apologies for a late reply; just back from the long holiday.
> > > >
> > > > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > > > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > > > > but most of the time was spent in the memory pinning. Si-Wei, please
> > > > > correct me if I'm wrong.
> > > >
> > > > IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> > > > take a lot of time if it's similar to what VFIO does.
> > > >
> > > > >
> > > > > I didn't check VFIO, but I think it just maps at realize phase with
> > > > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > > > previous testings, this delayed the VM initialization by a lot, as
> > > > > we're moving that 20s of blocking to every VM start.
> > > > >
> > > > > Investigating a way to do it only in the case of being the destination
> > > > > of a live migration, I think the right place is .load_setup migration
> > > > > handler. But I'm ok to move it for sure.
> > > >
> > > > If it's destined to map the 128G, it does sound sensible to me to do it
> > > > when VM starts, rather than anytime afterwards.
> > > >
> > >
> > > Just for completeness, it is not 100% sure the driver will start the
> > > device, but it is very likely.
> >
> > My understanding is that vDPA is still quite a special device, assumed to
> > target only advanced users, and should not appear in a default config
> > for anyone.  It means the user should hopefully remove the device if the
> > guest is not using it, instead of worrying about a slow boot.
> >
> > >
> > > > Could anyone help explain what the problem is if vDPA maps 128G at VM
> > > > init, just like VFIO does?
> > > >
> > >
> > > The main problem was the delay of the VM start. In the master branch, the
> > > pinning is done when the driver starts the device. Although it takes the
> > > BQL, the rest of the vCPUs can still make progress while the host is
> > > pinning, so the impact is not so evident.
> > >
> > > Moving it to initialization time made it very noticeable. To make
> > > things worse, QEMU did not respond to QMP commands and the like. That's
> > > why it was done only if the VM was the destination of a LM.
> >
> > Is that a major issue for us?
> 
> To me it is a regression but I'm ok with it for sure.
> 
> >  IIUC then VFIO shares the same condition.
> > If it's a real problem, do we want to have a solution that works for both
> > (or, is it possible)?
> >
> 
> I would not consider it a regression for VFIO, since I think it has
> behaved that way from the beginning. But yes, I'm all for finding a
> common solution.
> 
> > >
> > > However, we've added the memory map thread in this version, so this
> > > might not be a problem anymore. We could move the spawn of the thread
> > > to initialization time.
> > >
> > > But how do we undo this pinning in case the guest does not start the
> > > device? In this series, this is done at the destination with
> > > vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> > > long as QEMU has the vDPA device?
> >
> > I think even if vDPA decides to use a thread, we should keep the same
> > behavior before/after the migration.  Having asymmetric behavior for DMA
> > from the assigned HWs might have unpredictable implications.
> >
> > What I worry about is that we may over-optimize / over-engineer the case
> > where the user specifies the vDPA device but does not use it, as I
> > mentioned above.
> >
> 
> I agree with all of the above. If it is ok to keep memory mapped while
> the guest has not started I think we can move the spawn of the thread,
> or even just the map write itself, to the vdpa init.
> 
> > For the long term, maybe there's chance to optimize DMA pinning for both
> > vdpa/vfio use cases, then we can always pin them during VM starts? Assuming
> > that issue only exists for large VMs, while they should normally be good
> > candidates for huge pages already.  Then, it means maybe one folio/page can
> > cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity
> > also provides possibility of IOMMU large page mappings.  I didn't check at
> > which stage we are for VFIO on this, Alex may know better.
> 
> Sounds interesting, and I think it should be implemented. Thanks for
> the pointer!

I didn't have an exact pointer previously, but I think it can be
something like this:

  physr discussion - Jason Gunthorpe
  https://www.youtube.com/watch?v=QftOTtks-pI&list=PLbzoR-pLrL6rlmdpJ3-oMgU_zxc1wAhjS&index=36

Since I have zero knowledge of the vDPA side, I can only provide the
example from VFIO, and even that may not be fully accurate.  Basically,
afaiu vfio is already smart enough to recognize contiguous ranges (via
vfio_pin_pages_remote()):

		/* Pin a contiguous chunk of memory */
		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
					      size >> PAGE_SHIFT, &pfn, limit,
					      &batch);

But the pinning can still be slow because, e.g., we need a struct page *
for each PAGE_SIZE range.
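
To put rough numbers on that cost (a back-of-the-envelope illustration
only, not a measurement from this thread):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long guest_ram = 128ULL << 30; /* 128 GiB guest  */
        unsigned long long base_page = 4ULL << 10;   /* 4 KiB pages    */
        unsigned long long huge_page = 1ULL << 30;   /* 1 GiB pages    */

        /* Per-page references the pin path has to take:
         * ~33.5 million with 4 KiB pages vs. 128 with 1 GiB pages. */
        printf("4K pages: %llu pins\n", guest_ram / base_page);
        printf("1G pages: %llu pins\n", guest_ram / huge_page);
        return 0;
    }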

I think something like physr can improve it.  That should correspond to
the first slide of the video, the "performance" entry on
pin_user_pages().

Thanks,

> 
> > I'm copying Alex
> > anyway since the problem seems to be a common one already, so maybe he has
> > some thoughts.
> >
> 
> Appreciated :).
> 
> Thanks!
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2024-01-04  6:47 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-15 17:28 [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
2023-12-15 17:28 ` [PATCH for 9.0 01/12] vdpa: do not set virtio status bits if unneeded Eugenio Pérez
2023-12-20  4:25   ` Jason Wang
2023-12-15 17:28 ` [PATCH for 9.0 02/12] vdpa: make batch_begin_once early return Eugenio Pérez
2023-12-20  4:27   ` Jason Wang
2023-12-15 17:28 ` [PATCH for 9.0 03/12] vdpa: merge _begin_batch into _batch_begin_once Eugenio Pérez
2023-12-20  4:30   ` Jason Wang
2023-12-15 17:28 ` [PATCH for 9.0 04/12] vdpa: extract out _dma_end_batch from _listener_commit Eugenio Pérez
2023-12-20  4:31   ` Jason Wang
2023-12-15 17:28 ` [PATCH for 9.0 05/12] vdpa: factor out stop path of vhost_vdpa_dev_start Eugenio Pérez
2023-12-20  4:31   ` Jason Wang
2023-12-15 17:28 ` [PATCH for 9.0 06/12] vdpa: check for iova tree initialized at net_client_start Eugenio Pérez
2023-12-15 17:28 ` [PATCH for 9.0 07/12] vdpa: set backend capabilities at vhost_vdpa_init Eugenio Pérez
2023-12-20  4:34   ` Jason Wang
2023-12-20  7:07     ` Eugenio Perez Martin
2023-12-21  3:39       ` Jason Wang
2023-12-15 17:28 ` [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup Eugenio Pérez
2023-12-20  5:21   ` Jason Wang
2023-12-20  7:06     ` Eugenio Perez Martin
2023-12-21  2:17       ` Jason Wang
2023-12-21  8:20         ` Eugenio Perez Martin
2024-01-02  5:33           ` Peter Xu
2024-01-02 11:28             ` Eugenio Perez Martin
2024-01-03  6:16               ` Peter Xu
2024-01-03 11:11                 ` Eugenio Perez Martin
2024-01-04  6:46                   ` Peter Xu
2023-12-15 17:28 ` [PATCH for 9.0 09/12] vdpa: approve switchover after memory map in the migration destination Eugenio Pérez
2023-12-15 17:28 ` [PATCH for 9.0 10/12] vdpa: add vhost_vdpa_net_load_setup NetClient callback Eugenio Pérez
2023-12-15 17:28 ` [PATCH for 9.0 11/12] vdpa: add vhost_vdpa_net_switchover_ack_needed Eugenio Pérez
2023-12-15 17:28 ` [PATCH for 9.0 12/12] virtio_net: register incremental migration handlers Eugenio Pérez
2023-12-25  1:41 ` [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Lei Yang
2023-12-25 16:30 ` Michael S. Tsirkin
2024-01-02 14:38   ` Eugenio Perez Martin
