[RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration
@ 2023-01-12 17:24 Eugenio Pérez
  2023-01-12 17:24 ` [RFC v2 01/13] vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check Eugenio Pérez
                   ` (13 more replies)
  0 siblings, 14 replies; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

It's possible to migrate vdpa net devices if they are shadowed from the
start.  But to always shadow the dataplane is effectively break its host
passthrough, so its not convenient in vDPA scenarios.

This series enables dynamically switching to shadow mode only at
migration time.  This allow full data virtqueues passthrough all the
time qemu is not migrating.

Successfully tested with vdpa_sim_net (but it needs some patches, I
will send them soon) and qemu emulated device with vp_vdpa with
some restrictions:
* No CVQ.
* VIRTIO_RING_F_STATE patches.
* Expose _F_SUSPEND, but ignore it and suspend on ring state fetch like
  DPDK.

Comments are welcome, especially in the patcheswith RFC in the message.

v2:
- Use a migration listener instead of a memory listener to know when
  the migration starts.
- Add stuff not picked with ASID patches, like enable rings after
  driver_ok
- Add rewinding on the migration src, not in dst
- v1 at https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01664.html

Eugenio Pérez (13):
  vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
  vdpa net: move iova tree creation from init to start
  vdpa: copy cvq shadow_data from data vqs, not from x-svq
  vdpa: rewind at get_base, not set_base
  vdpa net: add migration blocker if cannot migrate cvq
  vhost: delay set_vring_ready after DRIVER_OK
  vdpa: delay set_vring_ready after DRIVER_OK
  vdpa: Negotiate _F_SUSPEND feature
  vdpa: add feature_log parameter to vhost_vdpa
  vdpa net: allow VHOST_F_LOG_ALL
  vdpa: add vdpa net migration state notifier
  vdpa: preemptive kick at enable
  vdpa: Conditionally expose _F_LOG in vhost_net devices

 include/hw/virtio/vhost-backend.h |   4 +
 include/hw/virtio/vhost-vdpa.h    |   1 +
 hw/net/vhost_net.c                |  25 ++-
 hw/virtio/vhost-vdpa.c            |  64 +++++---
 hw/virtio/vhost.c                 |   3 +
 net/vhost-vdpa.c                  | 247 +++++++++++++++++++++++++-----
 6 files changed, 278 insertions(+), 66 deletions(-)

-- 
2.31.1




^ permalink raw reply	[flat|nested] 76+ messages in thread

* [RFC v2 01/13] vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-13  3:12   ` Jason Wang
  2023-01-12 17:24 ` [RFC v2 02/13] vdpa net: move iova tree creation from init to start Eugenio Pérez
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

VHOST_BACKEND_F_IOTLB_ASID is the feature bit, not the bitmask. Since
the device under test also provided VHOST_BACKEND_F_IOTLB_MSG_V2 and
VHOST_BACKEND_F_IOTLB_BATCH, this went unnoticed.

Fixes: c1a1008685 ("vdpa: always start CVQ in SVQ mode if possible")
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 net/vhost-vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 1a13a34d35..de5ed8ff22 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -384,7 +384,7 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
             g_strerror(errno), errno);
         return -1;
     }
-    if (!(backend_features & VHOST_BACKEND_F_IOTLB_ASID) ||
+    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
         !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
         return 0;
     }
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 02/13] vdpa net: move iova tree creation from init to start
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
  2023-01-12 17:24 ` [RFC v2 01/13] vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-13  3:53   ` Jason Wang
  2023-01-12 17:24 ` [RFC v2 03/13] vdpa: copy cvq shadow_data from data vqs, not from x-svq Eugenio Pérez
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

Only create iova_tree if and when it is needed.

The cleanup keeps being responsability of last VQ but this change allows
to merge both cleanup functions.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 net/vhost-vdpa.c | 101 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 71 insertions(+), 30 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index de5ed8ff22..75cca497c8 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -178,13 +178,9 @@ err_init:
 static void vhost_vdpa_cleanup(NetClientState *nc)
 {
     VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
-    struct vhost_dev *dev = &s->vhost_net->dev;
 
     qemu_vfree(s->cvq_cmd_out_buffer);
     qemu_vfree(s->status);
-    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
-        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
-    }
     if (s->vhost_net) {
         vhost_net_cleanup(s->vhost_net);
         g_free(s->vhost_net);
@@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
     return size;
 }
 
+/** From any vdpa net client, get the netclient of first queue pair */
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+    NICState *nic = qemu_get_nic(s->nc.peer);
+    NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+
+    return DO_UPCAST(VhostVDPAState, nc, nc0);
+}
+
+static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
+{
+    struct vhost_vdpa *v = &s->vhost_vdpa;
+
+    if (v->shadow_vqs_enabled) {
+        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+                                           v->iova_range.last);
+    }
+}
+
+static int vhost_vdpa_net_data_start(NetClientState *nc)
+{
+    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+    struct vhost_vdpa *v = &s->vhost_vdpa;
+
+    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+    if (v->index == 0) {
+        vhost_vdpa_net_data_start_first(s);
+        return 0;
+    }
+
+    if (v->shadow_vqs_enabled) {
+        VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
+        v->iova_tree = s0->vhost_vdpa.iova_tree;
+    }
+
+    return 0;
+}
+
+static void vhost_vdpa_net_client_stop(NetClientState *nc)
+{
+    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+    struct vhost_dev *dev;
+
+    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+    dev = s->vhost_vdpa.dev;
+    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
+        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
+    }
+}
+
 static NetClientInfo net_vhost_vdpa_info = {
         .type = NET_CLIENT_DRIVER_VHOST_VDPA,
         .size = sizeof(VhostVDPAState),
         .receive = vhost_vdpa_receive,
+        .start = vhost_vdpa_net_data_start,
+        .stop = vhost_vdpa_net_client_stop,
         .cleanup = vhost_vdpa_cleanup,
         .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
         .has_ufo = vhost_vdpa_has_ufo,
@@ -351,7 +401,7 @@ dma_map_err:
 
 static int vhost_vdpa_net_cvq_start(NetClientState *nc)
 {
-    VhostVDPAState *s;
+    VhostVDPAState *s, *s0;
     struct vhost_vdpa *v;
     uint64_t backend_features;
     int64_t cvq_group;
@@ -415,8 +465,6 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
         return r;
     }
 
-    v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
-                                       v->iova_range.last);
     v->shadow_vqs_enabled = true;
     s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
 
@@ -425,6 +473,15 @@ out:
         return 0;
     }
 
+    s0 = vhost_vdpa_net_first_nc_vdpa(s);
+    if (s0->vhost_vdpa.iova_tree) {
+        /* SVQ is already configured for all virtqueues */
+        v->iova_tree = s0->vhost_vdpa.iova_tree;
+    } else {
+        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+                                           v->iova_range.last);
+    }
+
     r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
                                vhost_vdpa_net_cvq_cmd_page_len(), false);
     if (unlikely(r < 0)) {
@@ -449,15 +506,9 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
     if (s->vhost_vdpa.shadow_vqs_enabled) {
         vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
         vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
-        if (!s->always_svq) {
-            /*
-             * If only the CVQ is shadowed we can delete this safely.
-             * If all the VQs are shadows this will be needed by the time the
-             * device is started again to register SVQ vrings and similar.
-             */
-            g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
-        }
     }
+
+    vhost_vdpa_net_client_stop(nc);
 }
 
 static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
@@ -667,8 +718,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
                                        int nvqs,
                                        bool is_datapath,
                                        bool svq,
-                                       struct vhost_vdpa_iova_range iova_range,
-                                       VhostIOVATree *iova_tree)
+                                       struct vhost_vdpa_iova_range iova_range)
 {
     NetClientState *nc = NULL;
     VhostVDPAState *s;
@@ -690,7 +740,6 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
     s->vhost_vdpa.shadow_vqs_enabled = svq;
     s->vhost_vdpa.iova_range = iova_range;
     s->vhost_vdpa.shadow_data = svq;
-    s->vhost_vdpa.iova_tree = iova_tree;
     if (!is_datapath) {
         s->cvq_cmd_out_buffer = qemu_memalign(qemu_real_host_page_size(),
                                             vhost_vdpa_net_cvq_cmd_page_len());
@@ -760,7 +809,6 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
     uint64_t features;
     int vdpa_device_fd;
     g_autofree NetClientState **ncs = NULL;
-    g_autoptr(VhostIOVATree) iova_tree = NULL;
     struct vhost_vdpa_iova_range iova_range;
     NetClientState *nc;
     int queue_pairs, r, i = 0, has_cvq = 0;
@@ -812,12 +860,8 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
         goto err;
     }
 
-    if (opts->x_svq) {
-        if (!vhost_vdpa_net_valid_svq_features(features, errp)) {
-            goto err_svq;
-        }
-
-        iova_tree = vhost_iova_tree_new(iova_range.first, iova_range.last);
+    if (opts->x_svq && !vhost_vdpa_net_valid_svq_features(features, errp)) {
+        goto err;
     }
 
     ncs = g_malloc0(sizeof(*ncs) * queue_pairs);
@@ -825,7 +869,7 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
     for (i = 0; i < queue_pairs; i++) {
         ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
                                      vdpa_device_fd, i, 2, true, opts->x_svq,
-                                     iova_range, iova_tree);
+                                     iova_range);
         if (!ncs[i])
             goto err;
     }
@@ -833,13 +877,11 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
     if (has_cvq) {
         nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
                                  vdpa_device_fd, i, 1, false,
-                                 opts->x_svq, iova_range, iova_tree);
+                                 opts->x_svq, iova_range);
         if (!nc)
             goto err;
     }
 
-    /* iova_tree ownership belongs to last NetClientState */
-    g_steal_pointer(&iova_tree);
     return 0;
 
 err:
@@ -849,7 +891,6 @@ err:
         }
     }
 
-err_svq:
     qemu_close(vdpa_device_fd);
 
     return -1;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 03/13] vdpa: copy cvq shadow_data from data vqs, not from x-svq
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
  2023-01-12 17:24 ` [RFC v2 01/13] vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check Eugenio Pérez
  2023-01-12 17:24 ` [RFC v2 02/13] vdpa net: move iova tree creation from init to start Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-12 17:24 ` [RFC v2 04/13] vdpa: rewind at get_base, not set_base Eugenio Pérez
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

This allows to passthrough or shadow the data depending on migration
state in next patches.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 net/vhost-vdpa.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 75cca497c8..631424d9c4 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -412,7 +412,6 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
     s = DO_UPCAST(VhostVDPAState, nc, nc);
     v = &s->vhost_vdpa;
 
-    v->shadow_data = s->always_svq;
     v->shadow_vqs_enabled = s->always_svq;
     s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
 
@@ -482,6 +481,12 @@ out:
                                            v->iova_range.last);
     }
 
+    /*
+     * Memory listener is registered against CVQ vhost device, but different
+     * ASID may enable individually SVQ. Let's copy data vqs value here.
+     */
+    v->shadow_data = s0->vhost_vdpa.shadow_data;
+
     r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
                                vhost_vdpa_net_cvq_cmd_page_len(), false);
     if (unlikely(r < 0)) {
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 04/13] vdpa: rewind at get_base, not set_base
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (2 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 03/13] vdpa: copy cvq shadow_data from data vqs, not from x-svq Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-13  4:09   ` Jason Wang
  2023-01-12 17:24 ` [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq Eugenio Pérez
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

At this moment it is only possible to migrate to a vdpa device running
with x-svq=on. As a protective measure, the rewind of the inflight
descriptors was done at the destination. That way if the source sent a
virtqueue with inuse descriptors they are always discarded.

Since this series allows to migrate also to passthrough devices with no
SVQ, the right thing to do is to rewind at the source so base of vrings
are correct.

Support for inflight descriptors may be added in the future.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-backend.h |  4 +++
 hw/virtio/vhost-vdpa.c            | 46 +++++++++++++++++++------------
 hw/virtio/vhost.c                 |  3 ++
 3 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index c5ab49051e..ec3fbae58d 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -130,6 +130,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
 
 typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
                                        int fd);
+
+typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -177,6 +180,7 @@ typedef struct VhostOps {
     vhost_get_device_id_op vhost_get_device_id;
     vhost_force_iommu_op vhost_force_iommu;
     vhost_set_config_call_op vhost_set_config_call;
+    vhost_reset_status_op vhost_reset_status;
 } VhostOps;
 
 int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 542e003101..28a52ddc78 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1132,14 +1132,23 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
     if (started) {
         memory_listener_register(&v->listener, &address_space_memory);
         return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
-    } else {
-        vhost_vdpa_reset_device(dev);
-        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
-                                   VIRTIO_CONFIG_S_DRIVER);
-        memory_listener_unregister(&v->listener);
+    }
 
-        return 0;
+    return 0;
+}
+
+static void vhost_vdpa_reset_status(struct vhost_dev *dev)
+{
+    struct vhost_vdpa *v = dev->opaque;
+
+    if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
+        return;
     }
+
+    vhost_vdpa_reset_device(dev);
+    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
+                                VIRTIO_CONFIG_S_DRIVER);
+    memory_listener_unregister(&v->listener);
 }
 
 static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
@@ -1182,18 +1191,7 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
                                        struct vhost_vring_state *ring)
 {
     struct vhost_vdpa *v = dev->opaque;
-    VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
 
-    /*
-     * vhost-vdpa devices does not support in-flight requests. Set all of them
-     * as available.
-     *
-     * TODO: This is ok for networking, but other kinds of devices might
-     * have problems with these retransmissions.
-     */
-    while (virtqueue_rewind(vq, 1)) {
-        continue;
-    }
     if (v->shadow_vqs_enabled) {
         /*
          * Device vring base was set at device start. SVQ base is handled by
@@ -1212,6 +1210,19 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
     int ret;
 
     if (v->shadow_vqs_enabled) {
+        VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
+
+        /*
+         * vhost-vdpa devices does not support in-flight requests. Set all of
+         * them as available.
+         *
+         * TODO: This is ok for networking, but other kinds of devices might
+         * have problems with these retransmissions.
+         */
+        while (virtqueue_rewind(vq, 1)) {
+            continue;
+        }
+
         ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
         return 0;
     }
@@ -1326,4 +1337,5 @@ const VhostOps vdpa_ops = {
         .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
         .vhost_force_iommu = vhost_vdpa_force_iommu,
         .vhost_set_config_call = vhost_vdpa_set_config_call,
+        .vhost_reset_status = vhost_vdpa_reset_status,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index eb8c4c378c..a266396576 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -2049,6 +2049,9 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
                              hdev->vqs + i,
                              hdev->vq_index + i);
     }
+    if (hdev->vhost_ops->vhost_reset_status) {
+        hdev->vhost_ops->vhost_reset_status(hdev);
+    }
 
     if (vhost_dev_has_iommu(hdev)) {
         if (hdev->vhost_ops->vhost_set_iotlb_callback) {
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (3 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 04/13] vdpa: rewind at get_base, not set_base Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-13  4:24   ` Jason Wang
  2023-01-12 17:24 ` [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK Eugenio Pérez
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

A vdpa net device must initialize with SVQ in order to be migratable,
and initialization code verifies conditions.  If the device is not
initialized with the x-svq parameter, it will not expose _F_LOG so vhost
sybsystem will block VM migration from its initialization.

Next patches change this. Net data VQs will be shadowed only at
migration time and vdpa net devices need to expose _F_LOG as long as it
can go to SVQ.

Since we don't know that at initialization time but at start, add an
independent blocker at CVQ.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 net/vhost-vdpa.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 631424d9c4..2ca93e850a 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -26,12 +26,14 @@
 #include <err.h>
 #include "standard-headers/linux/virtio_net.h"
 #include "monitor/monitor.h"
+#include "migration/blocker.h"
 #include "hw/virtio/vhost.h"
 
 /* Todo:need to add the multiqueue support here */
 typedef struct VhostVDPAState {
     NetClientState nc;
     struct vhost_vdpa vhost_vdpa;
+    Error *migration_blocker;
     VHostNetState *vhost_net;
 
     /* Control commands shadow buffers */
@@ -433,9 +435,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
             g_strerror(errno), errno);
         return -1;
     }
-    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
-        !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
-        return 0;
+    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) {
+        error_setg(&s->migration_blocker,
+                   "vdpa device %s does not support ASID",
+                   nc->name);
+        goto out;
+    }
+    if (!vhost_vdpa_net_valid_svq_features(v->dev->features,
+                                           &s->migration_blocker)) {
+        goto out;
     }
 
     /*
@@ -455,7 +463,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
         }
 
         if (group == cvq_group) {
-            return 0;
+            error_setg(&s->migration_blocker,
+                "vdpa %s vq %d group %"PRId64" is the same as cvq group "
+                "%"PRId64, nc->name, i, group, cvq_group);
+            goto out;
         }
     }
 
@@ -468,8 +479,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
     s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
 
 out:
-    if (!s->vhost_vdpa.shadow_vqs_enabled) {
-        return 0;
+    if (s->migration_blocker) {
+        Error *errp = NULL;
+        r = migrate_add_blocker(s->migration_blocker, &errp);
+        if (unlikely(r != 0)) {
+            g_clear_pointer(&s->migration_blocker, error_free);
+            error_report_err(errp);
+        }
+
+        return r;
     }
 
     s0 = vhost_vdpa_net_first_nc_vdpa(s);
@@ -513,6 +531,11 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
         vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
     }
 
+    if (s->migration_blocker) {
+        migrate_del_blocker(s->migration_blocker);
+        g_clear_pointer(&s->migration_blocker, error_free);
+    }
+
     vhost_vdpa_net_client_stop(nc);
 }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (4 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-13  4:36   ` Jason Wang
  2023-01-12 17:24 ` [RFC v2 07/13] vdpa: " Eugenio Pérez
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

To restore the device at the destination of a live migration we send the
commands through control virtqueue. For a device to read CVQ it must
have received the DRIVER_OK status bit.

However this opens a window where the device could start receiving
packets in rx queue 0 before it receives the RSS configuration. To avoid
that, we will not send vring_enable until all configuration is used by
the device.

As a first step, run vhost_set_vring_ready for all vhost_net backend after
all of them are started (with DRIVER_OK). This code should not affect
vdpa.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/net/vhost_net.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index c4eecc6f36..3900599465 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -399,6 +399,18 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
         } else {
             peer = qemu_get_peer(ncs, n->max_queue_pairs);
         }
+        r = vhost_net_start_one(get_vhost_net(peer), dev);
+        if (r < 0) {
+            goto err_start;
+        }
+    }
+
+    for (int j = 0; j < nvhosts; j++) {
+        if (j < data_queue_pairs) {
+            peer = qemu_get_peer(ncs, j);
+        } else {
+            peer = qemu_get_peer(ncs, n->max_queue_pairs);
+        }
 
         if (peer->vring_enable) {
             /* restore vring enable state */
@@ -408,11 +420,6 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
                 goto err_start;
             }
         }
-
-        r = vhost_net_start_one(get_vhost_net(peer), dev);
-        if (r < 0) {
-            goto err_start;
-        }
     }
 
     return 0;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 07/13] vdpa: delay set_vring_ready after DRIVER_OK
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (5 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-12 17:24 ` [RFC v2 08/13] vdpa: Negotiate _F_SUSPEND feature Eugenio Pérez
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

To restore the device at the destination of a live migration we send the
commands through control virtqueue. For a device to read CVQ it must
have received the DRIVER_OK status bit.

However this opens a window where the device could start receiving
packets in rx queue 0 before it receives the RSS configuration. To avoid
that, we will not send vring_enable until all configuration is used by
the device.

Delegating the sending of VHOST_VDPA_SET_VRING_ENABLE
to the vhost_set_vring_ready VhostOp.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/net/vhost_net.c     | 8 ++++++--
 hw/virtio/vhost-vdpa.c | 8 ++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 3900599465..87938b4449 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -406,15 +406,19 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
     }
 
     for (int j = 0; j < nvhosts; j++) {
+        int enable;
+
         if (j < data_queue_pairs) {
             peer = qemu_get_peer(ncs, j);
         } else {
             peer = qemu_get_peer(ncs, n->max_queue_pairs);
         }
 
-        if (peer->vring_enable) {
+        enable = net->nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA ||
+                 peer->vring_enable;
+        if (enable) {
             /* restore vring enable state */
-            r = vhost_set_vring_enable(peer, peer->vring_enable);
+            r = vhost_set_vring_enable(peer, enable);
 
             if (r < 0) {
                 goto err_start;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 28a52ddc78..4296427a69 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -722,9 +722,13 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
     return idx;
 }
 
-static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev)
+static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
 {
     int i;
+
+    if (unlikely(!ready)) {
+        return -ENOTSUP;
+    }
     trace_vhost_vdpa_set_vring_ready(dev);
     for (i = 0; i < dev->nvqs; ++i) {
         struct vhost_vring_state state = {
@@ -1119,7 +1123,6 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
         if (unlikely(!ok)) {
             return -1;
         }
-        vhost_vdpa_set_vring_ready(dev);
     } else {
         vhost_vdpa_svqs_stop(dev);
         vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
@@ -1324,6 +1327,7 @@ const VhostOps vdpa_ops = {
         .vhost_set_features = vhost_vdpa_set_features,
         .vhost_reset_device = vhost_vdpa_reset_device,
         .vhost_get_vq_index = vhost_vdpa_get_vq_index,
+        .vhost_set_vring_enable = vhost_vdpa_set_vring_ready,
         .vhost_get_config  = vhost_vdpa_get_config,
         .vhost_set_config = vhost_vdpa_set_config,
         .vhost_requires_shm_log = NULL,
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 08/13] vdpa: Negotiate _F_SUSPEND feature
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (6 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 07/13] vdpa: " Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-13  4:39   ` Jason Wang
  2023-01-12 17:24 ` [RFC v2 09/13] vdpa: add feature_log parameter to vhost_vdpa Eugenio Pérez
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

This is needed for qemu to know it can suspend the device to retrieve
its status and enable SVQ with it, so all the process is transparent to
the guest.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 4296427a69..a61a6b2a74 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -659,7 +659,8 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
     uint64_t features;
     uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
         0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
-        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID;
+        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
+        0x1ULL << VHOST_BACKEND_F_SUSPEND;
     int r;
 
     if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 09/13] vdpa: add feature_log parameter to vhost_vdpa
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (7 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 08/13] vdpa: Negotiate _F_SUSPEND feature Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-12 17:24 ` [RFC v2 10/13] vdpa net: allow VHOST_F_LOG_ALL Eugenio Pérez
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

This way the device's vhost_vdpa can make the choice about exposing or
not the _F_LOG feature.

At the moment it is always false.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h | 1 +
 hw/virtio/vhost-vdpa.c         | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 7997f09a8d..7bcbcdb1dd 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -39,6 +39,7 @@ typedef struct vhost_vdpa {
     MemoryListener listener;
     struct vhost_vdpa_iova_range iova_range;
     uint64_t acked_features;
+    bool feature_log;
     bool shadow_vqs_enabled;
     /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
     bool shadow_data;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index a61a6b2a74..40b7e8706a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1273,7 +1273,7 @@ static int vhost_vdpa_get_features(struct vhost_dev *dev,
     struct vhost_vdpa *v = dev->opaque;
     int ret = vhost_vdpa_get_dev_features(dev, features);
 
-    if (ret == 0 && v->shadow_vqs_enabled) {
+    if (ret == 0 && (v->shadow_vqs_enabled || v->feature_log)) {
         /* Add SVQ logging capabilities */
         *features |= BIT_ULL(VHOST_F_LOG_ALL);
     }
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 10/13] vdpa net: allow VHOST_F_LOG_ALL
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (8 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 09/13] vdpa: add feature_log parameter to vhost_vdpa Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-13  4:42   ` Jason Wang
  2023-01-12 17:24 ` [RFC v2 11/13] vdpa: add vdpa net migration state notifier Eugenio Pérez
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

Since some actions move to the start function instead of init, the
device features may not be the parent vdpa device's, but the one
returned by vhost backend.  If transition to SVQ is supported, the vhost
backend will return _F_LOG_ALL to signal the device is migratable.

Add VHOST_F_LOG_ALL.  HW dirty page tracking can be added on top of this
change if the device supports it in the future.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 net/vhost-vdpa.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 2ca93e850a..5d7ad6e4d7 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -100,6 +100,8 @@ static const uint64_t vdpa_svq_device_features =
     BIT_ULL(VIRTIO_NET_F_MQ) |
     BIT_ULL(VIRTIO_F_ANY_LAYOUT) |
     BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |
+    /* VHOST_F_LOG_ALL is exposed by SVQ */
+    BIT_ULL(VHOST_F_LOG_ALL) |
     BIT_ULL(VIRTIO_NET_F_RSC_EXT) |
     BIT_ULL(VIRTIO_NET_F_STANDBY);
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (9 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 10/13] vdpa net: allow VHOST_F_LOG_ALL Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-13  4:54   ` Jason Wang
  2023-02-02  1:52   ` Si-Wei Liu
  2023-01-12 17:24 ` [RFC v2 12/13] vdpa: preemptive kick at enable Eugenio Pérez
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

This allows net to restart the device backend to configure SVQ on it.

Ideally, these changes should not be net specific. However, the vdpa net
backend is the one with enough knowledge to configure everything because
of some reasons:
* Queues might need to be shadowed or not depending on its kind (control
  vs data).
* Queues need to share the same map translations (iova tree).

Because of that it is cleaner to restart the whole net backend and
configure again as expected, similar to how vhost-kernel moves between
userspace and passthrough.

If more kinds of devices need dynamic switching to SVQ we can create a
callback struct like VhostOps and move most of the code there.
VhostOps cannot be reused since all vdpa backend share them, and to
personalize just for networking would be too heavy.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 5d7ad6e4d7..f38532b1df 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -26,6 +26,8 @@
 #include <err.h>
 #include "standard-headers/linux/virtio_net.h"
 #include "monitor/monitor.h"
+#include "migration/migration.h"
+#include "migration/misc.h"
 #include "migration/blocker.h"
 #include "hw/virtio/vhost.h"
 
@@ -33,6 +35,7 @@
 typedef struct VhostVDPAState {
     NetClientState nc;
     struct vhost_vdpa vhost_vdpa;
+    Notifier migration_state;
     Error *migration_blocker;
     VHostNetState *vhost_net;
 
@@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
     return DO_UPCAST(VhostVDPAState, nc, nc0);
 }
 
+static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
+{
+    struct vhost_vdpa *v = &s->vhost_vdpa;
+    VirtIONet *n;
+    VirtIODevice *vdev;
+    int data_queue_pairs, cvq, r;
+    NetClientState *peer;
+
+    /* We are only called on the first data vqs and only if x-svq is not set */
+    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
+        return;
+    }
+
+    vdev = v->dev->vdev;
+    n = VIRTIO_NET(vdev);
+    if (!n->vhost_started) {
+        return;
+    }
+
+    if (enable) {
+        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
+    }
+    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
+    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
+                                  n->max_ncs - n->max_queue_pairs : 0;
+    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
+
+    peer = s->nc.peer;
+    for (int i = 0; i < data_queue_pairs + cvq; i++) {
+        VhostVDPAState *vdpa_state;
+        NetClientState *nc;
+
+        if (i < data_queue_pairs) {
+            nc = qemu_get_peer(peer, i);
+        } else {
+            nc = qemu_get_peer(peer, n->max_queue_pairs);
+        }
+
+        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
+        vdpa_state->vhost_vdpa.shadow_data = enable;
+
+        if (i < data_queue_pairs) {
+            /* Do not override CVQ shadow_vqs_enabled */
+            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
+        }
+    }
+
+    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
+    if (unlikely(r < 0)) {
+        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
+    }
+}
+
+static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *migration = data;
+    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
+                                     migration_state);
+
+    switch (migration->state) {
+    case MIGRATION_STATUS_SETUP:
+        vhost_vdpa_net_log_global_enable(s, true);
+        return;
+
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_FAILED:
+        vhost_vdpa_net_log_global_enable(s, false);
+        return;
+    };
+}
+
 static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
 {
     struct vhost_vdpa *v = &s->vhost_vdpa;
 
+    if (v->feature_log) {
+        add_migration_state_change_notifier(&s->migration_state);
+    }
+
     if (v->shadow_vqs_enabled) {
         v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
                                            v->iova_range.last);
@@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
 
     assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
 
+    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
+        remove_migration_state_change_notifier(&s->migration_state);
+    }
+
     dev = s->vhost_vdpa.dev;
     if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
         g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
@@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
     s->vhost_vdpa.device_fd = vdpa_device_fd;
     s->vhost_vdpa.index = queue_pair_index;
     s->always_svq = svq;
+    s->migration_state.notify = vdpa_net_migration_state_notifier;
     s->vhost_vdpa.shadow_vqs_enabled = svq;
     s->vhost_vdpa.iova_range = iova_range;
     s->vhost_vdpa.shadow_data = svq;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (10 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 11/13] vdpa: add vdpa net migration state notifier Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-01-13  2:31   ` Jason Wang
  2023-01-12 17:24 ` [RFC v2 13/13] vdpa: Conditionally expose _F_LOG in vhost_net devices Eugenio Pérez
  2023-02-02  1:00 ` [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Si-Wei Liu
  13 siblings, 1 reply; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

Spuriously kick the destination device's queue so it knows in case there
are new descriptors.

RFC: This is somehow a gray area. The guest may have placed descriptors
in a virtqueue but not kicked it, so it might be surprised if the device
starts processing it.

However, that information is not in the migration stream and it should
be an edge case anyhow, being resilient to parallel notifications from
the guest.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 40b7e8706a..dff94355dd 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
     }
     trace_vhost_vdpa_set_vring_ready(dev);
     for (i = 0; i < dev->nvqs; ++i) {
+        VirtQueue *vq;
         struct vhost_vring_state state = {
             .index = dev->vq_index + i,
             .num = 1,
         };
         vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
+
+        /* Preemptive kick */
+        vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
+        event_notifier_set(virtio_queue_get_host_notifier(vq));
     }
     return 0;
 }
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC v2 13/13] vdpa: Conditionally expose _F_LOG in vhost_net devices
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (11 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 12/13] vdpa: preemptive kick at enable Eugenio Pérez
@ 2023-01-12 17:24 ` Eugenio Pérez
  2023-02-02  1:00 ` [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Si-Wei Liu
  13 siblings, 0 replies; 76+ messages in thread
From: Eugenio Pérez @ 2023-01-12 17:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

Vhost-vdpa networking devices need to meet a few conditions to be
migratable. If SVQ is not enabled from the beginning, to suspend the
device to retrieve the vq state is the first requirement.

Expose _F_LOG only in that case.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 net/vhost-vdpa.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index f38532b1df..9cf931010b 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -831,6 +831,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
                                        int nvqs,
                                        bool is_datapath,
                                        bool svq,
+                                       bool feature_log,
                                        struct vhost_vdpa_iova_range iova_range)
 {
     NetClientState *nc = NULL;
@@ -854,6 +855,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
     s->vhost_vdpa.shadow_vqs_enabled = svq;
     s->vhost_vdpa.iova_range = iova_range;
     s->vhost_vdpa.shadow_data = svq;
+    s->vhost_vdpa.feature_log = feature_log;
     if (!is_datapath) {
         s->cvq_cmd_out_buffer = qemu_memalign(qemu_real_host_page_size(),
                                             vhost_vdpa_net_cvq_cmd_page_len());
@@ -920,12 +922,13 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
                         NetClientState *peer, Error **errp)
 {
     const NetdevVhostVDPAOptions *opts;
-    uint64_t features;
+    uint64_t features, backend_features;
     int vdpa_device_fd;
     g_autofree NetClientState **ncs = NULL;
     struct vhost_vdpa_iova_range iova_range;
     NetClientState *nc;
     int queue_pairs, r, i = 0, has_cvq = 0;
+    bool feature_log;
 
     assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA);
     opts = &netdev->u.vhost_vdpa;
@@ -955,6 +958,12 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
         }
     }
 
+    r = ioctl(vdpa_device_fd, VHOST_GET_BACKEND_FEATURES, &backend_features);
+    if (unlikely(r < 0)) {
+        error_setg_errno(errp, errno, "Cannot get vdpa backend_features");
+        goto err;
+    }
+
     r = vhost_vdpa_get_features(vdpa_device_fd, &features, errp);
     if (unlikely(r < 0)) {
         goto err;
@@ -980,10 +989,17 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
 
     ncs = g_malloc0(sizeof(*ncs) * queue_pairs);
 
+    /*
+     * Offer VHOST_F_LOG_ALL as long as the device met basic requisites, and
+     * let more complicated ones to vhost_vdpa_net_{cvq,data}_start.
+     */
+    feature_log = opts->x_svq ||
+                  ((backend_features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) &&
+                   vhost_vdpa_net_valid_svq_features(features, NULL));
     for (i = 0; i < queue_pairs; i++) {
         ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
                                      vdpa_device_fd, i, 2, true, opts->x_svq,
-                                     iova_range);
+                                     feature_log, iova_range);
         if (!ncs[i])
             goto err;
     }
@@ -991,7 +1007,7 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
     if (has_cvq) {
         nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
                                  vdpa_device_fd, i, 1, false,
-                                 opts->x_svq, iova_range);
+                                 opts->x_svq, feature_log, iova_range);
         if (!nc)
             goto err;
     }
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-01-12 17:24 ` [RFC v2 12/13] vdpa: preemptive kick at enable Eugenio Pérez
@ 2023-01-13  2:31   ` Jason Wang
  2023-01-13  3:25     ` Zhu, Lingshan
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-13  2:31 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Spuriously kick the destination device's queue so it knows in case there
> are new descriptors.
>
> RFC: This is somehow a gray area. The guest may have placed descriptors
> in a virtqueue but not kicked it, so it might be surprised if the device
> starts processing it.

So I think this is kind of the work of the vDPA parent. For the parent
that needs this trick, we should do it in the parent driver.

Thanks

>
> However, that information is not in the migration stream and it should
> be an edge case anyhow, being resilient to parallel notifications from
> the guest.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  hw/virtio/vhost-vdpa.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 40b7e8706a..dff94355dd 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
>      }
>      trace_vhost_vdpa_set_vring_ready(dev);
>      for (i = 0; i < dev->nvqs; ++i) {
> +        VirtQueue *vq;
>          struct vhost_vring_state state = {
>              .index = dev->vq_index + i,
>              .num = 1,
>          };
>          vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
> +
> +        /* Preemptive kick */
> +        vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
> +        event_notifier_set(virtio_queue_get_host_notifier(vq));
>      }
>      return 0;
>  }
> --
> 2.31.1
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 01/13] vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
  2023-01-12 17:24 ` [RFC v2 01/13] vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check Eugenio Pérez
@ 2023-01-13  3:12   ` Jason Wang
  2023-01-13  6:42     ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-13  3:12 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> VHOST_BACKEND_F_IOTLB_ASID is the feature bit, not the bitmask. Since
> the device under test also provided VHOST_BACKEND_F_IOTLB_MSG_V2 and
> VHOST_BACKEND_F_IOTLB_BATCH, this went unnoticed.
>
> Fixes: c1a1008685 ("vdpa: always start CVQ in SVQ mode if possible")
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

Acked-by: Jason Wang <jasowang@redhat.com>

Do we need this for -stable?

Thanks

> ---
>  net/vhost-vdpa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index 1a13a34d35..de5ed8ff22 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -384,7 +384,7 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>              g_strerror(errno), errno);
>          return -1;
>      }
> -    if (!(backend_features & VHOST_BACKEND_F_IOTLB_ASID) ||
> +    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
>          !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
>          return 0;
>      }
> --
> 2.31.1
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-01-13  2:31   ` Jason Wang
@ 2023-01-13  3:25     ` Zhu, Lingshan
  2023-01-13  3:39       ` Jason Wang
  0 siblings, 1 reply; 76+ messages in thread
From: Zhu, Lingshan @ 2023-01-13  3:25 UTC (permalink / raw)
  To: Jason Wang, Eugenio Pérez
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit



On 1/13/2023 10:31 AM, Jason Wang wrote:
> On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>> Spuriously kick the destination device's queue so it knows in case there
>> are new descriptors.
>>
>> RFC: This is somehow a gray area. The guest may have placed descriptors
>> in a virtqueue but not kicked it, so it might be surprised if the device
>> starts processing it.
> So I think this is kind of the work of the vDPA parent. For the parent
> that needs this trick, we should do it in the parent driver.
Agree, it looks easier implementing this in parent driver,
I can implement it in ifcvf set_vq_ready right now

Thanks
Zhu Lingshan
>
> Thanks
>
>> However, that information is not in the migration stream and it should
>> be an edge case anyhow, being resilient to parallel notifications from
>> the guest.
>>
>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> ---
>>   hw/virtio/vhost-vdpa.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>> index 40b7e8706a..dff94355dd 100644
>> --- a/hw/virtio/vhost-vdpa.c
>> +++ b/hw/virtio/vhost-vdpa.c
>> @@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
>>       }
>>       trace_vhost_vdpa_set_vring_ready(dev);
>>       for (i = 0; i < dev->nvqs; ++i) {
>> +        VirtQueue *vq;
>>           struct vhost_vring_state state = {
>>               .index = dev->vq_index + i,
>>               .num = 1,
>>           };
>>           vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
>> +
>> +        /* Preemptive kick */
>> +        vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
>> +        event_notifier_set(virtio_queue_get_host_notifier(vq));
>>       }
>>       return 0;
>>   }
>> --
>> 2.31.1
>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-01-13  3:25     ` Zhu, Lingshan
@ 2023-01-13  3:39       ` Jason Wang
  2023-01-13  9:06         ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-13  3:39 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Eugenio Pérez, qemu-devel, si-wei.liu, Liuxiangdong,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
>
>
>
> On 1/13/2023 10:31 AM, Jason Wang wrote:
> > On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >> Spuriously kick the destination device's queue so it knows in case there
> >> are new descriptors.
> >>
> >> RFC: This is somehow a gray area. The guest may have placed descriptors
> >> in a virtqueue but not kicked it, so it might be surprised if the device
> >> starts processing it.
> > So I think this is kind of the work of the vDPA parent. For the parent
> > that needs this trick, we should do it in the parent driver.
> Agree, it looks easier implementing this in parent driver,
> I can implement it in ifcvf set_vq_ready right now

Great, but please check whether or not it is really needed.

Some device implementation could check the available descriptions
after DRIVER_OK without waiting for a kick.

Thanks

>
> Thanks
> Zhu Lingshan
> >
> > Thanks
> >
> >> However, that information is not in the migration stream and it should
> >> be an edge case anyhow, being resilient to parallel notifications from
> >> the guest.
> >>
> >> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >> ---
> >>   hw/virtio/vhost-vdpa.c | 5 +++++
> >>   1 file changed, 5 insertions(+)
> >>
> >> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >> index 40b7e8706a..dff94355dd 100644
> >> --- a/hw/virtio/vhost-vdpa.c
> >> +++ b/hw/virtio/vhost-vdpa.c
> >> @@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
> >>       }
> >>       trace_vhost_vdpa_set_vring_ready(dev);
> >>       for (i = 0; i < dev->nvqs; ++i) {
> >> +        VirtQueue *vq;
> >>           struct vhost_vring_state state = {
> >>               .index = dev->vq_index + i,
> >>               .num = 1,
> >>           };
> >>           vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
> >> +
> >> +        /* Preemptive kick */
> >> +        vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
> >> +        event_notifier_set(virtio_queue_get_host_notifier(vq));
> >>       }
> >>       return 0;
> >>   }
> >> --
> >> 2.31.1
> >>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 02/13] vdpa net: move iova tree creation from init to start
  2023-01-12 17:24 ` [RFC v2 02/13] vdpa net: move iova tree creation from init to start Eugenio Pérez
@ 2023-01-13  3:53   ` Jason Wang
  2023-01-13  7:28     ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-13  3:53 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Only create iova_tree if and when it is needed.
>
> The cleanup keeps being responsability of last VQ but this change allows
> to merge both cleanup functions.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  net/vhost-vdpa.c | 101 +++++++++++++++++++++++++++++++++--------------
>  1 file changed, 71 insertions(+), 30 deletions(-)
>
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index de5ed8ff22..75cca497c8 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -178,13 +178,9 @@ err_init:
>  static void vhost_vdpa_cleanup(NetClientState *nc)
>  {
>      VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
> -    struct vhost_dev *dev = &s->vhost_net->dev;
>
>      qemu_vfree(s->cvq_cmd_out_buffer);
>      qemu_vfree(s->status);
> -    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> -        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> -    }
>      if (s->vhost_net) {
>          vhost_net_cleanup(s->vhost_net);
>          g_free(s->vhost_net);
> @@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
>      return size;
>  }
>
> +/** From any vdpa net client, get the netclient of first queue pair */
> +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> +{
> +    NICState *nic = qemu_get_nic(s->nc.peer);
> +    NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
> +
> +    return DO_UPCAST(VhostVDPAState, nc, nc0);
> +}
> +
> +static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> +{
> +    struct vhost_vdpa *v = &s->vhost_vdpa;
> +
> +    if (v->shadow_vqs_enabled) {
> +        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> +                                           v->iova_range.last);
> +    }
> +}
> +
> +static int vhost_vdpa_net_data_start(NetClientState *nc)
> +{
> +    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
> +    struct vhost_vdpa *v = &s->vhost_vdpa;
> +
> +    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> +
> +    if (v->index == 0) {
> +        vhost_vdpa_net_data_start_first(s);
> +        return 0;
> +    }
> +
> +    if (v->shadow_vqs_enabled) {
> +        VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
> +        v->iova_tree = s0->vhost_vdpa.iova_tree;
> +    }

It looks to me the logic here is somehow the same as
vhost_vdpa_net_cvq_start(), can we unify the them?

> +
> +    return 0;
> +}
> +
> +static void vhost_vdpa_net_client_stop(NetClientState *nc)
> +{
> +    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
> +    struct vhost_dev *dev;
> +
> +    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> +
> +    dev = s->vhost_vdpa.dev;
> +    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> +        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> +    }
> +}
> +
>  static NetClientInfo net_vhost_vdpa_info = {
>          .type = NET_CLIENT_DRIVER_VHOST_VDPA,
>          .size = sizeof(VhostVDPAState),
>          .receive = vhost_vdpa_receive,
> +        .start = vhost_vdpa_net_data_start,
> +        .stop = vhost_vdpa_net_client_stop,
>          .cleanup = vhost_vdpa_cleanup,
>          .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
>          .has_ufo = vhost_vdpa_has_ufo,
> @@ -351,7 +401,7 @@ dma_map_err:
>
>  static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>  {
> -    VhostVDPAState *s;
> +    VhostVDPAState *s, *s0;
>      struct vhost_vdpa *v;
>      uint64_t backend_features;
>      int64_t cvq_group;
> @@ -415,8 +465,6 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>          return r;
>      }
>
> -    v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> -                                       v->iova_range.last);
>      v->shadow_vqs_enabled = true;
>      s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
>
> @@ -425,6 +473,15 @@ out:
>          return 0;
>      }
>
> +    s0 = vhost_vdpa_net_first_nc_vdpa(s);
> +    if (s0->vhost_vdpa.iova_tree) {
> +        /* SVQ is already configured for all virtqueues */
> +        v->iova_tree = s0->vhost_vdpa.iova_tree;
> +    } else {
> +        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> +                                           v->iova_range.last);
> +    }
> +
>      r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
>                                 vhost_vdpa_net_cvq_cmd_page_len(), false);
>      if (unlikely(r < 0)) {
> @@ -449,15 +506,9 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
>      if (s->vhost_vdpa.shadow_vqs_enabled) {
>          vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
>          vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
> -        if (!s->always_svq) {
> -            /*
> -             * If only the CVQ is shadowed we can delete this safely.
> -             * If all the VQs are shadows this will be needed by the time the
> -             * device is started again to register SVQ vrings and similar.
> -             */
> -            g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> -        }
>      }
> +
> +    vhost_vdpa_net_client_stop(nc);
>  }
>
>  static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
> @@ -667,8 +718,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>                                         int nvqs,
>                                         bool is_datapath,
>                                         bool svq,
> -                                       struct vhost_vdpa_iova_range iova_range,
> -                                       VhostIOVATree *iova_tree)
> +                                       struct vhost_vdpa_iova_range iova_range)
>  {
>      NetClientState *nc = NULL;
>      VhostVDPAState *s;
> @@ -690,7 +740,6 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>      s->vhost_vdpa.shadow_vqs_enabled = svq;
>      s->vhost_vdpa.iova_range = iova_range;
>      s->vhost_vdpa.shadow_data = svq;
> -    s->vhost_vdpa.iova_tree = iova_tree;
>      if (!is_datapath) {
>          s->cvq_cmd_out_buffer = qemu_memalign(qemu_real_host_page_size(),
>                                              vhost_vdpa_net_cvq_cmd_page_len());
> @@ -760,7 +809,6 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>      uint64_t features;
>      int vdpa_device_fd;
>      g_autofree NetClientState **ncs = NULL;
> -    g_autoptr(VhostIOVATree) iova_tree = NULL;
>      struct vhost_vdpa_iova_range iova_range;
>      NetClientState *nc;
>      int queue_pairs, r, i = 0, has_cvq = 0;
> @@ -812,12 +860,8 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>          goto err;
>      }
>
> -    if (opts->x_svq) {
> -        if (!vhost_vdpa_net_valid_svq_features(features, errp)) {
> -            goto err_svq;
> -        }
> -
> -        iova_tree = vhost_iova_tree_new(iova_range.first, iova_range.last);
> +    if (opts->x_svq && !vhost_vdpa_net_valid_svq_features(features, errp)) {
> +        goto err;
>      }
>
>      ncs = g_malloc0(sizeof(*ncs) * queue_pairs);
> @@ -825,7 +869,7 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>      for (i = 0; i < queue_pairs; i++) {
>          ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
>                                       vdpa_device_fd, i, 2, true, opts->x_svq,
> -                                     iova_range, iova_tree);
> +                                     iova_range);
>          if (!ncs[i])
>              goto err;
>      }
> @@ -833,13 +877,11 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>      if (has_cvq) {
>          nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
>                                   vdpa_device_fd, i, 1, false,
> -                                 opts->x_svq, iova_range, iova_tree);
> +                                 opts->x_svq, iova_range);
>          if (!nc)
>              goto err;
>      }
>
> -    /* iova_tree ownership belongs to last NetClientState */
> -    g_steal_pointer(&iova_tree);
>      return 0;
>
>  err:
> @@ -849,7 +891,6 @@ err:
>          }
>      }
>
> -err_svq:
>      qemu_close(vdpa_device_fd);
>
>      return -1;
> --
> 2.31.1
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 04/13] vdpa: rewind at get_base, not set_base
  2023-01-12 17:24 ` [RFC v2 04/13] vdpa: rewind at get_base, not set_base Eugenio Pérez
@ 2023-01-13  4:09   ` Jason Wang
  2023-01-13  7:40     ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-13  4:09 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> At this moment it is only possible to migrate to a vdpa device running
> with x-svq=on. As a protective measure, the rewind of the inflight
> descriptors was done at the destination. That way if the source sent a
> virtqueue with inuse descriptors they are always discarded.
>
> Since this series allows to migrate also to passthrough devices with no
> SVQ, the right thing to do is to rewind at the source so base of vrings
> are correct.
>
> Support for inflight descriptors may be added in the future.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |  4 +++
>  hw/virtio/vhost-vdpa.c            | 46 +++++++++++++++++++------------
>  hw/virtio/vhost.c                 |  3 ++
>  3 files changed, 36 insertions(+), 17 deletions(-)
>
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index c5ab49051e..ec3fbae58d 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -130,6 +130,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>
>  typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
>                                         int fd);
> +
> +typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
> +
>  typedef struct VhostOps {
>      VhostBackendType backend_type;
>      vhost_backend_init vhost_backend_init;
> @@ -177,6 +180,7 @@ typedef struct VhostOps {
>      vhost_get_device_id_op vhost_get_device_id;
>      vhost_force_iommu_op vhost_force_iommu;
>      vhost_set_config_call_op vhost_set_config_call;
> +    vhost_reset_status_op vhost_reset_status;
>  } VhostOps;
>
>  int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 542e003101..28a52ddc78 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -1132,14 +1132,23 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>      if (started) {
>          memory_listener_register(&v->listener, &address_space_memory);
>          return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> -    } else {
> -        vhost_vdpa_reset_device(dev);
> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> -                                   VIRTIO_CONFIG_S_DRIVER);
> -        memory_listener_unregister(&v->listener);
> +    }
>
> -        return 0;
> +    return 0;
> +}
> +
> +static void vhost_vdpa_reset_status(struct vhost_dev *dev)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +
> +    if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
> +        return;
>      }
> +
> +    vhost_vdpa_reset_device(dev);
> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> +                                VIRTIO_CONFIG_S_DRIVER);
> +    memory_listener_unregister(&v->listener);
>  }
>
>  static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
> @@ -1182,18 +1191,7 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
>                                         struct vhost_vring_state *ring)
>  {
>      struct vhost_vdpa *v = dev->opaque;
> -    VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
>
> -    /*
> -     * vhost-vdpa devices does not support in-flight requests. Set all of them
> -     * as available.
> -     *
> -     * TODO: This is ok for networking, but other kinds of devices might
> -     * have problems with these retransmissions.
> -     */
> -    while (virtqueue_rewind(vq, 1)) {
> -        continue;
> -    }
>      if (v->shadow_vqs_enabled) {
>          /*
>           * Device vring base was set at device start. SVQ base is handled by
> @@ -1212,6 +1210,19 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
>      int ret;
>
>      if (v->shadow_vqs_enabled) {
> +        VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
> +
> +        /*
> +         * vhost-vdpa devices does not support in-flight requests. Set all of
> +         * them as available.
> +         *
> +         * TODO: This is ok for networking, but other kinds of devices might
> +         * have problems with these retransmissions.
> +         */
> +        while (virtqueue_rewind(vq, 1)) {
> +            continue;
> +        }
> +
>          ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
>          return 0;
>      }
> @@ -1326,4 +1337,5 @@ const VhostOps vdpa_ops = {
>          .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
>          .vhost_force_iommu = vhost_vdpa_force_iommu,
>          .vhost_set_config_call = vhost_vdpa_set_config_call,
> +        .vhost_reset_status = vhost_vdpa_reset_status,

Can we simply use the NetClient stop method here?

Thanks

>  };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index eb8c4c378c..a266396576 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2049,6 +2049,9 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
>                               hdev->vqs + i,
>                               hdev->vq_index + i);
>      }
> +    if (hdev->vhost_ops->vhost_reset_status) {
> +        hdev->vhost_ops->vhost_reset_status(hdev);
> +    }
>
>      if (vhost_dev_has_iommu(hdev)) {
>          if (hdev->vhost_ops->vhost_set_iotlb_callback) {
> --
> 2.31.1
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq
  2023-01-12 17:24 ` [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq Eugenio Pérez
@ 2023-01-13  4:24   ` Jason Wang
  2023-01-13  7:46     ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-13  4:24 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: si-wei.liu, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/13 01:24, Eugenio Pérez 写道:
> A vdpa net device must initialize with SVQ in order to be migratable,
> and initialization code verifies conditions.  If the device is not
> initialized with the x-svq parameter, it will not expose _F_LOG so vhost
> sybsystem will block VM migration from its initialization.
>
> Next patches change this. Net data VQs will be shadowed only at
> migration time and vdpa net devices need to expose _F_LOG as long as it
> can go to SVQ.
>
> Since we don't know that at initialization time but at start, add an
> independent blocker at CVQ.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   net/vhost-vdpa.c | 35 +++++++++++++++++++++++++++++------
>   1 file changed, 29 insertions(+), 6 deletions(-)
>
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index 631424d9c4..2ca93e850a 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -26,12 +26,14 @@
>   #include <err.h>
>   #include "standard-headers/linux/virtio_net.h"
>   #include "monitor/monitor.h"
> +#include "migration/blocker.h"
>   #include "hw/virtio/vhost.h"
>   
>   /* Todo:need to add the multiqueue support here */
>   typedef struct VhostVDPAState {
>       NetClientState nc;
>       struct vhost_vdpa vhost_vdpa;
> +    Error *migration_blocker;


Any reason we can't use the mivration_blocker in vhost_dev structure?

I believe we don't need to wait until start to know we can't migrate.

Thanks


>       VHostNetState *vhost_net;
>   
>       /* Control commands shadow buffers */
> @@ -433,9 +435,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>               g_strerror(errno), errno);
>           return -1;
>       }
> -    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
> -        !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
> -        return 0;
> +    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) {
> +        error_setg(&s->migration_blocker,
> +                   "vdpa device %s does not support ASID",
> +                   nc->name);
> +        goto out;
> +    }
> +    if (!vhost_vdpa_net_valid_svq_features(v->dev->features,
> +                                           &s->migration_blocker)) {
> +        goto out;
>       }
>   
>       /*
> @@ -455,7 +463,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>           }
>   
>           if (group == cvq_group) {
> -            return 0;
> +            error_setg(&s->migration_blocker,
> +                "vdpa %s vq %d group %"PRId64" is the same as cvq group "
> +                "%"PRId64, nc->name, i, group, cvq_group);
> +            goto out;
>           }
>       }
>   
> @@ -468,8 +479,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>       s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
>   
>   out:
> -    if (!s->vhost_vdpa.shadow_vqs_enabled) {
> -        return 0;
> +    if (s->migration_blocker) {
> +        Error *errp = NULL;
> +        r = migrate_add_blocker(s->migration_blocker, &errp);
> +        if (unlikely(r != 0)) {
> +            g_clear_pointer(&s->migration_blocker, error_free);
> +            error_report_err(errp);
> +        }
> +
> +        return r;
>       }
>   
>       s0 = vhost_vdpa_net_first_nc_vdpa(s);
> @@ -513,6 +531,11 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
>           vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
>       }
>   
> +    if (s->migration_blocker) {
> +        migrate_del_blocker(s->migration_blocker);
> +        g_clear_pointer(&s->migration_blocker, error_free);
> +    }
> +
>       vhost_vdpa_net_client_stop(nc);
>   }
>   



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK
  2023-01-12 17:24 ` [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK Eugenio Pérez
@ 2023-01-13  4:36   ` Jason Wang
  2023-01-13  8:19     ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-13  4:36 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> To restore the device at the destination of a live migration we send the
> commands through control virtqueue. For a device to read CVQ it must
> have received the DRIVER_OK status bit.

This probably requires the support from the parent driver and requires
some changes or fixes in the parent driver.

Some drivers did:

parent_set_status():
if (DRIVER_OK)
    if (queue_enable)
        write queue_enable to the device

Examples are IFCVF or even vp_vdpa at least. MLX5 seems to be fine.

>
> However this opens a window where the device could start receiving
> packets in rx queue 0 before it receives the RSS configuration. To avoid
> that, we will not send vring_enable until all configuration is used by
> the device.
>
> As a first step, run vhost_set_vring_ready for all vhost_net backend after
> all of them are started (with DRIVER_OK). This code should not affect
> vdpa.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  hw/net/vhost_net.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> index c4eecc6f36..3900599465 100644
> --- a/hw/net/vhost_net.c
> +++ b/hw/net/vhost_net.c
> @@ -399,6 +399,18 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>          } else {
>              peer = qemu_get_peer(ncs, n->max_queue_pairs);
>          }
> +        r = vhost_net_start_one(get_vhost_net(peer), dev);
> +        if (r < 0) {
> +            goto err_start;
> +        }
> +    }
> +
> +    for (int j = 0; j < nvhosts; j++) {
> +        if (j < data_queue_pairs) {
> +            peer = qemu_get_peer(ncs, j);
> +        } else {
> +            peer = qemu_get_peer(ncs, n->max_queue_pairs);
> +        }

I fail to understand why we need to change the vhost_net layer? This
is vhost-vDPA specific, so I wonder if we can limit the changes to e.g
vhost_vdpa_dev_start()?

Thanks

>
>          if (peer->vring_enable) {
>              /* restore vring enable state */
> @@ -408,11 +420,6 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>                  goto err_start;
>              }
>          }
> -
> -        r = vhost_net_start_one(get_vhost_net(peer), dev);
> -        if (r < 0) {
> -            goto err_start;
> -        }
>      }
>
>      return 0;
> --
> 2.31.1
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 08/13] vdpa: Negotiate _F_SUSPEND feature
  2023-01-12 17:24 ` [RFC v2 08/13] vdpa: Negotiate _F_SUSPEND feature Eugenio Pérez
@ 2023-01-13  4:39   ` Jason Wang
  2023-01-13  8:45     ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-13  4:39 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This is needed for qemu to know it can suspend the device to retrieve
> its status and enable SVQ with it, so all the process is transparent to
> the guest.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

Acked-by: Jason Wang <jasowang@redhat.com>

We probably need to add the resume in the future to have a quick
recovery from migration failures.

Thanks

> ---
>  hw/virtio/vhost-vdpa.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 4296427a69..a61a6b2a74 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -659,7 +659,8 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
>      uint64_t features;
>      uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
>          0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
> -        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID;
> +        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
> +        0x1ULL << VHOST_BACKEND_F_SUSPEND;
>      int r;
>
>      if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
> --
> 2.31.1
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 10/13] vdpa net: allow VHOST_F_LOG_ALL
  2023-01-12 17:24 ` [RFC v2 10/13] vdpa net: allow VHOST_F_LOG_ALL Eugenio Pérez
@ 2023-01-13  4:42   ` Jason Wang
  0 siblings, 0 replies; 76+ messages in thread
From: Jason Wang @ 2023-01-13  4:42 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Since some actions move to the start function instead of init, the
> device features may not be the parent vdpa device's, but the one
> returned by vhost backend.  If transition to SVQ is supported, the vhost
> backend will return _F_LOG_ALL to signal the device is migratable.
>
> Add VHOST_F_LOG_ALL.  HW dirty page tracking can be added on top of this
> change if the device supports it in the future.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks

> ---
>  net/vhost-vdpa.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index 2ca93e850a..5d7ad6e4d7 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -100,6 +100,8 @@ static const uint64_t vdpa_svq_device_features =
>      BIT_ULL(VIRTIO_NET_F_MQ) |
>      BIT_ULL(VIRTIO_F_ANY_LAYOUT) |
>      BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |
> +    /* VHOST_F_LOG_ALL is exposed by SVQ */
> +    BIT_ULL(VHOST_F_LOG_ALL) |
>      BIT_ULL(VIRTIO_NET_F_RSC_EXT) |
>      BIT_ULL(VIRTIO_NET_F_STANDBY);
>
> --
> 2.31.1
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-01-12 17:24 ` [RFC v2 11/13] vdpa: add vdpa net migration state notifier Eugenio Pérez
@ 2023-01-13  4:54   ` Jason Wang
  2023-01-13  9:00     ` Eugenio Perez Martin
  2023-02-02  1:52   ` Si-Wei Liu
  1 sibling, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-13  4:54 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This allows net to restart the device backend to configure SVQ on it.
>
> Ideally, these changes should not be net specific. However, the vdpa net
> backend is the one with enough knowledge to configure everything because
> of some reasons:
> * Queues might need to be shadowed or not depending on its kind (control
>   vs data).
> * Queues need to share the same map translations (iova tree).
>
> Because of that it is cleaner to restart the whole net backend and
> configure again as expected, similar to how vhost-kernel moves between
> userspace and passthrough.
>
> If more kinds of devices need dynamic switching to SVQ we can create a
> callback struct like VhostOps and move most of the code there.
> VhostOps cannot be reused since all vdpa backend share them, and to
> personalize just for networking would be too heavy.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 84 insertions(+)
>
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index 5d7ad6e4d7..f38532b1df 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -26,6 +26,8 @@
>  #include <err.h>
>  #include "standard-headers/linux/virtio_net.h"
>  #include "monitor/monitor.h"
> +#include "migration/migration.h"
> +#include "migration/misc.h"
>  #include "migration/blocker.h"
>  #include "hw/virtio/vhost.h"
>
> @@ -33,6 +35,7 @@
>  typedef struct VhostVDPAState {
>      NetClientState nc;
>      struct vhost_vdpa vhost_vdpa;
> +    Notifier migration_state;
>      Error *migration_blocker;
>      VHostNetState *vhost_net;
>
> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
>      return DO_UPCAST(VhostVDPAState, nc, nc0);
>  }
>
> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> +{
> +    struct vhost_vdpa *v = &s->vhost_vdpa;
> +    VirtIONet *n;
> +    VirtIODevice *vdev;
> +    int data_queue_pairs, cvq, r;
> +    NetClientState *peer;
> +
> +    /* We are only called on the first data vqs and only if x-svq is not set */
> +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> +        return;
> +    }
> +
> +    vdev = v->dev->vdev;
> +    n = VIRTIO_NET(vdev);
> +    if (!n->vhost_started) {
> +        return;
> +    }
> +
> +    if (enable) {
> +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);

Do we need to check if the device is started or not here?

> +    }

I'm not sure I understand the reason for vhost_net_stop() after a
VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated.

> +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> +                                  n->max_ncs - n->max_queue_pairs : 0;
> +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> +
> +    peer = s->nc.peer;
> +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> +        VhostVDPAState *vdpa_state;
> +        NetClientState *nc;
> +
> +        if (i < data_queue_pairs) {
> +            nc = qemu_get_peer(peer, i);
> +        } else {
> +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> +        }
> +
> +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> +        vdpa_state->vhost_vdpa.shadow_data = enable;
> +
> +        if (i < data_queue_pairs) {
> +            /* Do not override CVQ shadow_vqs_enabled */
> +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> +        }
> +    }
> +
> +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
> +    if (unlikely(r < 0)) {
> +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> +    }
> +}
> +
> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *migration = data;
> +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> +                                     migration_state);
> +
> +    switch (migration->state) {
> +    case MIGRATION_STATUS_SETUP:
> +        vhost_vdpa_net_log_global_enable(s, true);
> +        return;
> +
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_FAILED:
> +        vhost_vdpa_net_log_global_enable(s, false);

Do we need to recover here?

Thanks

> +        return;
> +    };
> +}
> +
>  static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
>  {
>      struct vhost_vdpa *v = &s->vhost_vdpa;
>
> +    if (v->feature_log) {
> +        add_migration_state_change_notifier(&s->migration_state);
> +    }
> +
>      if (v->shadow_vqs_enabled) {
>          v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>                                             v->iova_range.last);
> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
>
>      assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
>
> +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> +        remove_migration_state_change_notifier(&s->migration_state);
> +    }
> +
>      dev = s->vhost_vdpa.dev;
>      if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
>          g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>      s->vhost_vdpa.device_fd = vdpa_device_fd;
>      s->vhost_vdpa.index = queue_pair_index;
>      s->always_svq = svq;
> +    s->migration_state.notify = vdpa_net_migration_state_notifier;
>      s->vhost_vdpa.shadow_vqs_enabled = svq;
>      s->vhost_vdpa.iova_range = iova_range;
>      s->vhost_vdpa.shadow_data = svq;
> --
> 2.31.1
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 01/13] vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
  2023-01-13  3:12   ` Jason Wang
@ 2023-01-13  6:42     ` Eugenio Perez Martin
  2023-01-16  3:01       ` Jason Wang
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-13  6:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 4:12 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > VHOST_BACKEND_F_IOTLB_ASID is the feature bit, not the bitmask. Since
> > the device under test also provided VHOST_BACKEND_F_IOTLB_MSG_V2 and
> > VHOST_BACKEND_F_IOTLB_BATCH, this went unnoticed.
> >
> > Fixes: c1a1008685 ("vdpa: always start CVQ in SVQ mode if possible")
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
> Acked-by: Jason Wang <jasowang@redhat.com>
>
> Do we need this for -stable?
>

The commit c1a1008685 was introduced in this development window so
there is no stable version of qemu with that patch. But I'm ok to CC
stable just in case for sure.

Thanks!

> Thanks
>
> > ---
> >  net/vhost-vdpa.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > index 1a13a34d35..de5ed8ff22 100644
> > --- a/net/vhost-vdpa.c
> > +++ b/net/vhost-vdpa.c
> > @@ -384,7 +384,7 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> >              g_strerror(errno), errno);
> >          return -1;
> >      }
> > -    if (!(backend_features & VHOST_BACKEND_F_IOTLB_ASID) ||
> > +    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
> >          !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
> >          return 0;
> >      }
> > --
> > 2.31.1
> >
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 02/13] vdpa net: move iova tree creation from init to start
  2023-01-13  3:53   ` Jason Wang
@ 2023-01-13  7:28     ` Eugenio Perez Martin
  2023-01-16  3:05       ` Jason Wang
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-13  7:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 4:53 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > Only create iova_tree if and when it is needed.
> >
> > The cleanup keeps being responsability of last VQ but this change allows
> > to merge both cleanup functions.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  net/vhost-vdpa.c | 101 +++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 71 insertions(+), 30 deletions(-)
> >
> > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > index de5ed8ff22..75cca497c8 100644
> > --- a/net/vhost-vdpa.c
> > +++ b/net/vhost-vdpa.c
> > @@ -178,13 +178,9 @@ err_init:
> >  static void vhost_vdpa_cleanup(NetClientState *nc)
> >  {
> >      VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
> > -    struct vhost_dev *dev = &s->vhost_net->dev;
> >
> >      qemu_vfree(s->cvq_cmd_out_buffer);
> >      qemu_vfree(s->status);
> > -    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> > -        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> > -    }
> >      if (s->vhost_net) {
> >          vhost_net_cleanup(s->vhost_net);
> >          g_free(s->vhost_net);
> > @@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
> >      return size;
> >  }
> >
> > +/** From any vdpa net client, get the netclient of first queue pair */
> > +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> > +{
> > +    NICState *nic = qemu_get_nic(s->nc.peer);
> > +    NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
> > +
> > +    return DO_UPCAST(VhostVDPAState, nc, nc0);
> > +}
> > +
> > +static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> > +{
> > +    struct vhost_vdpa *v = &s->vhost_vdpa;
> > +
> > +    if (v->shadow_vqs_enabled) {
> > +        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> > +                                           v->iova_range.last);
> > +    }
> > +}
> > +
> > +static int vhost_vdpa_net_data_start(NetClientState *nc)
> > +{
> > +    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
> > +    struct vhost_vdpa *v = &s->vhost_vdpa;
> > +
> > +    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> > +
> > +    if (v->index == 0) {
> > +        vhost_vdpa_net_data_start_first(s);
> > +        return 0;
> > +    }
> > +
> > +    if (v->shadow_vqs_enabled) {
> > +        VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
> > +        v->iova_tree = s0->vhost_vdpa.iova_tree;
> > +    }
>
> It looks to me the logic here is somehow the same as
> vhost_vdpa_net_cvq_start(), can we unify the them?
>

It depends on what you mean by unify :). But we can explore it for sure.

We can call vhost_vdpa_net_data_start, but the steps to do if
s0->vhost_vdpa.iova_tree == NULL are different. Data queues must do
nothing, but CVQ must create a new iova tree.

So one possibility is to convert this part of vhost_vdpa_net_cvq_start:
    s0 = vhost_vdpa_net_first_nc_vdpa(s);
    if (s0->vhost_vdpa.iova_tree) {
        /* SVQ is already configured for all virtqueues */
        v->iova_tree = s0->vhost_vdpa.iova_tree;
    } else {
        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
                                           v->iova_range.last);
    }

into:
    vhost_vdpa_net_data_start(nc);
    if (!v->iova_tree) {
        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
                                           v->iova_range.last);
    }

I'm ok with the change but it's less clear in my opinion: it's not
obvious that net_data_start is in charge of setting v->iova_tree to
me.

Another possibility is to abstract something like
first_nc_iova_tree(), but we need to check more fields of s0 later
(shadow_data) so I'm not sure about the benefit.

Is that what you have in mind?

Thanks!

> > +
> > +    return 0;
> > +}
> > +
> > +static void vhost_vdpa_net_client_stop(NetClientState *nc)
> > +{
> > +    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
> > +    struct vhost_dev *dev;
> > +
> > +    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> > +
> > +    dev = s->vhost_vdpa.dev;
> > +    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> > +        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> > +    }
> > +}
> > +
> >  static NetClientInfo net_vhost_vdpa_info = {
> >          .type = NET_CLIENT_DRIVER_VHOST_VDPA,
> >          .size = sizeof(VhostVDPAState),
> >          .receive = vhost_vdpa_receive,
> > +        .start = vhost_vdpa_net_data_start,
> > +        .stop = vhost_vdpa_net_client_stop,
> >          .cleanup = vhost_vdpa_cleanup,
> >          .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
> >          .has_ufo = vhost_vdpa_has_ufo,
> > @@ -351,7 +401,7 @@ dma_map_err:
> >
> >  static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> >  {
> > -    VhostVDPAState *s;
> > +    VhostVDPAState *s, *s0;
> >      struct vhost_vdpa *v;
> >      uint64_t backend_features;
> >      int64_t cvq_group;
> > @@ -415,8 +465,6 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> >          return r;
> >      }
> >
> > -    v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> > -                                       v->iova_range.last);
> >      v->shadow_vqs_enabled = true;
> >      s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
> >
> > @@ -425,6 +473,15 @@ out:
> >          return 0;
> >      }
> >
> > +    s0 = vhost_vdpa_net_first_nc_vdpa(s);
> > +    if (s0->vhost_vdpa.iova_tree) {
> > +        /* SVQ is already configured for all virtqueues */
> > +        v->iova_tree = s0->vhost_vdpa.iova_tree;
> > +    } else {
> > +        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> > +                                           v->iova_range.last);
> > +    }
> > +
> >      r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
> >                                 vhost_vdpa_net_cvq_cmd_page_len(), false);
> >      if (unlikely(r < 0)) {
> > @@ -449,15 +506,9 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
> >      if (s->vhost_vdpa.shadow_vqs_enabled) {
> >          vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
> >          vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
> > -        if (!s->always_svq) {
> > -            /*
> > -             * If only the CVQ is shadowed we can delete this safely.
> > -             * If all the VQs are shadows this will be needed by the time the
> > -             * device is started again to register SVQ vrings and similar.
> > -             */
> > -            g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> > -        }
> >      }
> > +
> > +    vhost_vdpa_net_client_stop(nc);
> >  }
> >
> >  static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
> > @@ -667,8 +718,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> >                                         int nvqs,
> >                                         bool is_datapath,
> >                                         bool svq,
> > -                                       struct vhost_vdpa_iova_range iova_range,
> > -                                       VhostIOVATree *iova_tree)
> > +                                       struct vhost_vdpa_iova_range iova_range)
> >  {
> >      NetClientState *nc = NULL;
> >      VhostVDPAState *s;
> > @@ -690,7 +740,6 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> >      s->vhost_vdpa.shadow_vqs_enabled = svq;
> >      s->vhost_vdpa.iova_range = iova_range;
> >      s->vhost_vdpa.shadow_data = svq;
> > -    s->vhost_vdpa.iova_tree = iova_tree;
> >      if (!is_datapath) {
> >          s->cvq_cmd_out_buffer = qemu_memalign(qemu_real_host_page_size(),
> >                                              vhost_vdpa_net_cvq_cmd_page_len());
> > @@ -760,7 +809,6 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
> >      uint64_t features;
> >      int vdpa_device_fd;
> >      g_autofree NetClientState **ncs = NULL;
> > -    g_autoptr(VhostIOVATree) iova_tree = NULL;
> >      struct vhost_vdpa_iova_range iova_range;
> >      NetClientState *nc;
> >      int queue_pairs, r, i = 0, has_cvq = 0;
> > @@ -812,12 +860,8 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
> >          goto err;
> >      }
> >
> > -    if (opts->x_svq) {
> > -        if (!vhost_vdpa_net_valid_svq_features(features, errp)) {
> > -            goto err_svq;
> > -        }
> > -
> > -        iova_tree = vhost_iova_tree_new(iova_range.first, iova_range.last);
> > +    if (opts->x_svq && !vhost_vdpa_net_valid_svq_features(features, errp)) {
> > +        goto err;
> >      }
> >
> >      ncs = g_malloc0(sizeof(*ncs) * queue_pairs);
> > @@ -825,7 +869,7 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
> >      for (i = 0; i < queue_pairs; i++) {
> >          ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
> >                                       vdpa_device_fd, i, 2, true, opts->x_svq,
> > -                                     iova_range, iova_tree);
> > +                                     iova_range);
> >          if (!ncs[i])
> >              goto err;
> >      }
> > @@ -833,13 +877,11 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
> >      if (has_cvq) {
> >          nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
> >                                   vdpa_device_fd, i, 1, false,
> > -                                 opts->x_svq, iova_range, iova_tree);
> > +                                 opts->x_svq, iova_range);
> >          if (!nc)
> >              goto err;
> >      }
> >
> > -    /* iova_tree ownership belongs to last NetClientState */
> > -    g_steal_pointer(&iova_tree);
> >      return 0;
> >
> >  err:
> > @@ -849,7 +891,6 @@ err:
> >          }
> >      }
> >
> > -err_svq:
> >      qemu_close(vdpa_device_fd);
> >
> >      return -1;
> > --
> > 2.31.1
> >
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 04/13] vdpa: rewind at get_base, not set_base
  2023-01-13  4:09   ` Jason Wang
@ 2023-01-13  7:40     ` Eugenio Perez Martin
  2023-01-16  3:32       ` Jason Wang
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-13  7:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 5:10 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > At this moment it is only possible to migrate to a vdpa device running
> > with x-svq=on. As a protective measure, the rewind of the inflight
> > descriptors was done at the destination. That way if the source sent a
> > virtqueue with inuse descriptors they are always discarded.
> >
> > Since this series allows to migrate also to passthrough devices with no
> > SVQ, the right thing to do is to rewind at the source so base of vrings
> > are correct.
> >
> > Support for inflight descriptors may be added in the future.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  include/hw/virtio/vhost-backend.h |  4 +++
> >  hw/virtio/vhost-vdpa.c            | 46 +++++++++++++++++++------------
> >  hw/virtio/vhost.c                 |  3 ++
> >  3 files changed, 36 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > index c5ab49051e..ec3fbae58d 100644
> > --- a/include/hw/virtio/vhost-backend.h
> > +++ b/include/hw/virtio/vhost-backend.h
> > @@ -130,6 +130,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
> >
> >  typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
> >                                         int fd);
> > +
> > +typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
> > +
> >  typedef struct VhostOps {
> >      VhostBackendType backend_type;
> >      vhost_backend_init vhost_backend_init;
> > @@ -177,6 +180,7 @@ typedef struct VhostOps {
> >      vhost_get_device_id_op vhost_get_device_id;
> >      vhost_force_iommu_op vhost_force_iommu;
> >      vhost_set_config_call_op vhost_set_config_call;
> > +    vhost_reset_status_op vhost_reset_status;
> >  } VhostOps;
> >
> >  int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 542e003101..28a52ddc78 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -1132,14 +1132,23 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >      if (started) {
> >          memory_listener_register(&v->listener, &address_space_memory);
> >          return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> > -    } else {
> > -        vhost_vdpa_reset_device(dev);
> > -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> > -                                   VIRTIO_CONFIG_S_DRIVER);
> > -        memory_listener_unregister(&v->listener);
> > +    }
> >
> > -        return 0;
> > +    return 0;
> > +}
> > +
> > +static void vhost_vdpa_reset_status(struct vhost_dev *dev)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +
> > +    if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
> > +        return;
> >      }
> > +
> > +    vhost_vdpa_reset_device(dev);
> > +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> > +                                VIRTIO_CONFIG_S_DRIVER);
> > +    memory_listener_unregister(&v->listener);
> >  }
> >
> >  static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
> > @@ -1182,18 +1191,7 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
> >                                         struct vhost_vring_state *ring)
> >  {
> >      struct vhost_vdpa *v = dev->opaque;
> > -    VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
> >
> > -    /*
> > -     * vhost-vdpa devices does not support in-flight requests. Set all of them
> > -     * as available.
> > -     *
> > -     * TODO: This is ok for networking, but other kinds of devices might
> > -     * have problems with these retransmissions.
> > -     */
> > -    while (virtqueue_rewind(vq, 1)) {
> > -        continue;
> > -    }
> >      if (v->shadow_vqs_enabled) {
> >          /*
> >           * Device vring base was set at device start. SVQ base is handled by
> > @@ -1212,6 +1210,19 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
> >      int ret;
> >
> >      if (v->shadow_vqs_enabled) {
> > +        VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
> > +
> > +        /*
> > +         * vhost-vdpa devices does not support in-flight requests. Set all of
> > +         * them as available.
> > +         *
> > +         * TODO: This is ok for networking, but other kinds of devices might
> > +         * have problems with these retransmissions.
> > +         */
> > +        while (virtqueue_rewind(vq, 1)) {
> > +            continue;
> > +        }
> > +
> >          ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
> >          return 0;
> >      }
> > @@ -1326,4 +1337,5 @@ const VhostOps vdpa_ops = {
> >          .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
> >          .vhost_force_iommu = vhost_vdpa_force_iommu,
> >          .vhost_set_config_call = vhost_vdpa_set_config_call,
> > +        .vhost_reset_status = vhost_vdpa_reset_status,
>
> Can we simply use the NetClient stop method here?
>

Ouch, I squashed two patches by mistake here.

All the vhost_reset_status part should be independent of this patch,
and I was especially interested in its feedback. It had this message:

    vdpa: move vhost reset after get vring base

    The function vhost.c:vhost_dev_stop calls vhost operation
    vhost_dev_start(false). In the case of vdpa it totally reset and wipes
    the device, making the fetching of the vring base (virtqueue state) totally
    useless.

    The kernel backend does not use vhost_dev_start vhost op callback, but
    vhost-user do. A patch to make vhost_user_dev_start more similar to vdpa
    is desirable, but it can be added on top.

I can resend the series splitting it again but conversation may
scatter between versions. Would you prefer me to send a new version?

Regarding the use of NetClient, it feels weird to call net specific
functions in VhostOps, doesn't it? At the moment vhost ops is
specialized in vhost-kernel, vhost-user and vhost-vdpa. If we want to
make it specific to the kind of device, that makes vhost-vdpa-net too.

Thanks!


> Thanks
>
> >  };
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index eb8c4c378c..a266396576 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -2049,6 +2049,9 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> >                               hdev->vqs + i,
> >                               hdev->vq_index + i);
> >      }
> > +    if (hdev->vhost_ops->vhost_reset_status) {
> > +        hdev->vhost_ops->vhost_reset_status(hdev);
> > +    }
> >
> >      if (vhost_dev_has_iommu(hdev)) {
> >          if (hdev->vhost_ops->vhost_set_iotlb_callback) {
> > --
> > 2.31.1
> >
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq
  2023-01-13  4:24   ` Jason Wang
@ 2023-01-13  7:46     ` Eugenio Perez Martin
  2023-01-16  3:34       ` Jason Wang
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-13  7:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 5:25 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2023/1/13 01:24, Eugenio Pérez 写道:
> > A vdpa net device must initialize with SVQ in order to be migratable,
> > and initialization code verifies conditions.  If the device is not
> > initialized with the x-svq parameter, it will not expose _F_LOG so vhost
> > sybsystem will block VM migration from its initialization.
> >
> > Next patches change this. Net data VQs will be shadowed only at
> > migration time and vdpa net devices need to expose _F_LOG as long as it
> > can go to SVQ.
> >
> > Since we don't know that at initialization time but at start, add an
> > independent blocker at CVQ.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   net/vhost-vdpa.c | 35 +++++++++++++++++++++++++++++------
> >   1 file changed, 29 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > index 631424d9c4..2ca93e850a 100644
> > --- a/net/vhost-vdpa.c
> > +++ b/net/vhost-vdpa.c
> > @@ -26,12 +26,14 @@
> >   #include <err.h>
> >   #include "standard-headers/linux/virtio_net.h"
> >   #include "monitor/monitor.h"
> > +#include "migration/blocker.h"
> >   #include "hw/virtio/vhost.h"
> >
> >   /* Todo:need to add the multiqueue support here */
> >   typedef struct VhostVDPAState {
> >       NetClientState nc;
> >       struct vhost_vdpa vhost_vdpa;
> > +    Error *migration_blocker;
>
>
> Any reason we can't use the mivration_blocker in vhost_dev structure?
>
> I believe we don't need to wait until start to know we can't migrate.
>

Device migratability also depends on features that the guest acks.

For example, if the device does not support ASID it can be migrated as
long as _F_CVQ is not acked.

Thanks!

>
> Thanks
>
>
> >       VHostNetState *vhost_net;
> >
> >       /* Control commands shadow buffers */
> > @@ -433,9 +435,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> >               g_strerror(errno), errno);
> >           return -1;
> >       }
> > -    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
> > -        !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
> > -        return 0;
> > +    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) {
> > +        error_setg(&s->migration_blocker,
> > +                   "vdpa device %s does not support ASID",
> > +                   nc->name);
> > +        goto out;
> > +    }
> > +    if (!vhost_vdpa_net_valid_svq_features(v->dev->features,
> > +                                           &s->migration_blocker)) {
> > +        goto out;
> >       }
> >
> >       /*
> > @@ -455,7 +463,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> >           }
> >
> >           if (group == cvq_group) {
> > -            return 0;
> > +            error_setg(&s->migration_blocker,
> > +                "vdpa %s vq %d group %"PRId64" is the same as cvq group "
> > +                "%"PRId64, nc->name, i, group, cvq_group);
> > +            goto out;
> >           }
> >       }
> >
> > @@ -468,8 +479,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> >       s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
> >
> >   out:
> > -    if (!s->vhost_vdpa.shadow_vqs_enabled) {
> > -        return 0;
> > +    if (s->migration_blocker) {
> > +        Error *errp = NULL;
> > +        r = migrate_add_blocker(s->migration_blocker, &errp);
> > +        if (unlikely(r != 0)) {
> > +            g_clear_pointer(&s->migration_blocker, error_free);
> > +            error_report_err(errp);
> > +        }
> > +
> > +        return r;
> >       }
> >
> >       s0 = vhost_vdpa_net_first_nc_vdpa(s);
> > @@ -513,6 +531,11 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
> >           vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
> >       }
> >
> > +    if (s->migration_blocker) {
> > +        migrate_del_blocker(s->migration_blocker);
> > +        g_clear_pointer(&s->migration_blocker, error_free);
> > +    }
> > +
> >       vhost_vdpa_net_client_stop(nc);
> >   }
> >
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK
  2023-01-13  4:36   ` Jason Wang
@ 2023-01-13  8:19     ` Eugenio Perez Martin
  2023-01-13  9:51       ` Stefano Garzarella
  2023-01-16  6:36       ` Jason Wang
  0 siblings, 2 replies; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-13  8:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 5:36 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > To restore the device at the destination of a live migration we send the
> > commands through control virtqueue. For a device to read CVQ it must
> > have received the DRIVER_OK status bit.
>
> This probably requires the support from the parent driver and requires
> some changes or fixes in the parent driver.
>
> Some drivers did:
>
> parent_set_status():
> if (DRIVER_OK)
>     if (queue_enable)
>         write queue_enable to the device
>
> Examples are IFCVF or even vp_vdpa at least. MLX5 seems to be fine.
>

I don't get your point here. No device should start reading CVQ (or
any other VQ) without having received DRIVER_OK.

Some parent drivers do not support sending the queue enable command
after DRIVER_OK, usually because they clean part of the state like the
set by set_vring_base. Even vdpa_net_sim needs fixes here.

But my understanding is that it should be supported so I consider it a
bug. Especially after queue_reset patches. Is that what you mean?

> >
> > However this opens a window where the device could start receiving
> > packets in rx queue 0 before it receives the RSS configuration. To avoid
> > that, we will not send vring_enable until all configuration is used by
> > the device.
> >
> > As a first step, run vhost_set_vring_ready for all vhost_net backend after
> > all of them are started (with DRIVER_OK). This code should not affect
> > vdpa.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  hw/net/vhost_net.c | 17 ++++++++++++-----
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > index c4eecc6f36..3900599465 100644
> > --- a/hw/net/vhost_net.c
> > +++ b/hw/net/vhost_net.c
> > @@ -399,6 +399,18 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
> >          } else {
> >              peer = qemu_get_peer(ncs, n->max_queue_pairs);
> >          }
> > +        r = vhost_net_start_one(get_vhost_net(peer), dev);
> > +        if (r < 0) {
> > +            goto err_start;
> > +        }
> > +    }
> > +
> > +    for (int j = 0; j < nvhosts; j++) {
> > +        if (j < data_queue_pairs) {
> > +            peer = qemu_get_peer(ncs, j);
> > +        } else {
> > +            peer = qemu_get_peer(ncs, n->max_queue_pairs);
> > +        }
>
> I fail to understand why we need to change the vhost_net layer? This
> is vhost-vDPA specific, so I wonder if we can limit the changes to e.g
> vhost_vdpa_dev_start()?
>

The vhost-net layer explicitly calls vhost_set_vring_enable before
vhost_dev_start, and this is exactly the behavior we want to avoid.
Even if we make changes to vhost_dev, this change is still needed.

And we want to explicitly enable CVQ first, which "only" vhost_net
knows which is. To perform that in vhost_vdpa_dev_start would require
quirks, involving one or more of:
* Ignore vq enable calls if the device is not the CVQ one. How to
signal what is the CVQ? Can we trust it will be the last one for all
kind of devices?
* Enable queues that do not belong to the last vhost_dev from the enable call.
* Enable the rest of the queues from the last enable in reverse order.
* Intercalate the "net load" callback between enabling the last
vhost_vdpa device and enabling the rest of devices.
* Add an "enable priority" order?

Thanks!

> Thanks
>
> >
> >          if (peer->vring_enable) {
> >              /* restore vring enable state */
> > @@ -408,11 +420,6 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
> >                  goto err_start;
> >              }
> >          }
> > -
> > -        r = vhost_net_start_one(get_vhost_net(peer), dev);
> > -        if (r < 0) {
> > -            goto err_start;
> > -        }
> >      }
> >
> >      return 0;
> > --
> > 2.31.1
> >
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 08/13] vdpa: Negotiate _F_SUSPEND feature
  2023-01-13  4:39   ` Jason Wang
@ 2023-01-13  8:45     ` Eugenio Perez Martin
  2023-01-16  6:48       ` Jason Wang
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-13  8:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 5:39 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > This is needed for qemu to know it can suspend the device to retrieve
> > its status and enable SVQ with it, so all the process is transparent to
> > the guest.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
> Acked-by: Jason Wang <jasowang@redhat.com>
>
> We probably need to add the resume in the future to have a quick
> recovery from migration failures.
>

The capability of a resume can be useful here but only in a small
window. During the most time of the migration SVQ is enabled, so in
the event of a migration failure we may need to reset the whole device
to enable passthrough again.

But maybe is it worth giving a quick review and adding some TODOs
where it can be useful in this series?

Thanks!

> Thanks
>
> > ---
> >  hw/virtio/vhost-vdpa.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 4296427a69..a61a6b2a74 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -659,7 +659,8 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
> >      uint64_t features;
> >      uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
> >          0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
> > -        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID;
> > +        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
> > +        0x1ULL << VHOST_BACKEND_F_SUSPEND;
> >      int r;
> >
> >      if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
> > --
> > 2.31.1
> >
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-01-13  4:54   ` Jason Wang
@ 2023-01-13  9:00     ` Eugenio Perez Martin
  2023-01-16  6:51       ` Jason Wang
  2023-01-17  9:58       ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-13  9:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit, Juan Quintela,
	David Gilbert, Maxime Coquelin

On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > This allows net to restart the device backend to configure SVQ on it.
> >
> > Ideally, these changes should not be net specific. However, the vdpa net
> > backend is the one with enough knowledge to configure everything because
> > of some reasons:
> > * Queues might need to be shadowed or not depending on its kind (control
> >   vs data).
> > * Queues need to share the same map translations (iova tree).
> >
> > Because of that it is cleaner to restart the whole net backend and
> > configure again as expected, similar to how vhost-kernel moves between
> > userspace and passthrough.
> >
> > If more kinds of devices need dynamic switching to SVQ we can create a
> > callback struct like VhostOps and move most of the code there.
> > VhostOps cannot be reused since all vdpa backend share them, and to
> > personalize just for networking would be too heavy.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 84 insertions(+)
> >
> > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > index 5d7ad6e4d7..f38532b1df 100644
> > --- a/net/vhost-vdpa.c
> > +++ b/net/vhost-vdpa.c
> > @@ -26,6 +26,8 @@
> >  #include <err.h>
> >  #include "standard-headers/linux/virtio_net.h"
> >  #include "monitor/monitor.h"
> > +#include "migration/migration.h"
> > +#include "migration/misc.h"
> >  #include "migration/blocker.h"
> >  #include "hw/virtio/vhost.h"
> >
> > @@ -33,6 +35,7 @@
> >  typedef struct VhostVDPAState {
> >      NetClientState nc;
> >      struct vhost_vdpa vhost_vdpa;
> > +    Notifier migration_state;
> >      Error *migration_blocker;
> >      VHostNetState *vhost_net;
> >
> > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> >      return DO_UPCAST(VhostVDPAState, nc, nc0);
> >  }
> >
> > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> > +{
> > +    struct vhost_vdpa *v = &s->vhost_vdpa;
> > +    VirtIONet *n;
> > +    VirtIODevice *vdev;
> > +    int data_queue_pairs, cvq, r;
> > +    NetClientState *peer;
> > +
> > +    /* We are only called on the first data vqs and only if x-svq is not set */
> > +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> > +        return;
> > +    }
> > +
> > +    vdev = v->dev->vdev;
> > +    n = VIRTIO_NET(vdev);
> > +    if (!n->vhost_started) {
> > +        return;
> > +    }
> > +
> > +    if (enable) {
> > +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
>
> Do we need to check if the device is started or not here?
>

v->vhost_started is checked right above, right?

> > +    }
>
> I'm not sure I understand the reason for vhost_net_stop() after a
> VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated.
>

I think this is really worth exploring, and it would have been clearer
if I didn't squash the vhost_reset_status commit by mistake :).

Looking at qemu master vhost.c:vhost_dev_stop:
    if (hdev->vhost_ops->vhost_dev_start) {
        hdev->vhost_ops->vhost_dev_start(hdev, false);
    }
    if (vrings) {
        vhost_dev_set_vring_enable(hdev, false);
    }
    for (i = 0; i < hdev->nvqs; ++i) {
        vhost_virtqueue_stop(hdev,
                             vdev,
                             hdev->vqs + i,
                             hdev->vq_index + i);
    }

Both vhost-used and vhost-vdpa set_status(0) at
->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so
they are not recoverable at vhost_virtqueue_stop->get_vring_base, and
I think it is too late for vdpa devices to change it. I guess
vhost-user devices do not lose the state there, but I did not test.

I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more
similar to vhost_user_dev_start. We can make
vhost_vdpa_dev_start(false) to suspend the device instead. But then we
need to reset it after getting the indexes. That's why I added
vhost_vdpa_reset_status, but I admit it is neither the cleanest
approach nor the best name to it.

Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much.

> > +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> > +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> > +                                  n->max_ncs - n->max_queue_pairs : 0;
> > +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> > +
> > +    peer = s->nc.peer;
> > +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> > +        VhostVDPAState *vdpa_state;
> > +        NetClientState *nc;
> > +
> > +        if (i < data_queue_pairs) {
> > +            nc = qemu_get_peer(peer, i);
> > +        } else {
> > +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> > +        }
> > +
> > +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> > +        vdpa_state->vhost_vdpa.shadow_data = enable;
> > +
> > +        if (i < data_queue_pairs) {
> > +            /* Do not override CVQ shadow_vqs_enabled */
> > +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> > +        }
> > +    }
> > +
> > +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
> > +    if (unlikely(r < 0)) {
> > +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> > +    }
> > +}
> > +
> > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
> > +{
> > +    MigrationState *migration = data;
> > +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> > +                                     migration_state);
> > +
> > +    switch (migration->state) {
> > +    case MIGRATION_STATUS_SETUP:
> > +        vhost_vdpa_net_log_global_enable(s, true);
> > +        return;
> > +
> > +    case MIGRATION_STATUS_CANCELLING:
> > +    case MIGRATION_STATUS_CANCELLED:
> > +    case MIGRATION_STATUS_FAILED:
> > +        vhost_vdpa_net_log_global_enable(s, false);
>
> Do we need to recover here?
>

I may be missing something, but the device is fully reset and restored
in these cases.

CCing Juan and D. Gilbert, a review would be appreciated to check if
this covers all the cases.

Thanks!


> Thanks
>
> > +        return;
> > +    };
> > +}
> > +
> >  static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> >  {
> >      struct vhost_vdpa *v = &s->vhost_vdpa;
> >
> > +    if (v->feature_log) {
> > +        add_migration_state_change_notifier(&s->migration_state);
> > +    }
> > +
> >      if (v->shadow_vqs_enabled) {
> >          v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >                                             v->iova_range.last);
> > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
> >
> >      assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> >
> > +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> > +        remove_migration_state_change_notifier(&s->migration_state);
> > +    }
> > +
> >      dev = s->vhost_vdpa.dev;
> >      if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> >          g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> >      s->vhost_vdpa.device_fd = vdpa_device_fd;
> >      s->vhost_vdpa.index = queue_pair_index;
> >      s->always_svq = svq;
> > +    s->migration_state.notify = vdpa_net_migration_state_notifier;
> >      s->vhost_vdpa.shadow_vqs_enabled = svq;
> >      s->vhost_vdpa.iova_range = iova_range;
> >      s->vhost_vdpa.shadow_data = svq;
> > --
> > 2.31.1
> >
>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-01-13  3:39       ` Jason Wang
@ 2023-01-13  9:06         ` Eugenio Perez Martin
  2023-01-16  7:02           ` Jason Wang
  2023-02-02  0:56           ` Si-Wei Liu
  0 siblings, 2 replies; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-13  9:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, qemu-devel, si-wei.liu, Liuxiangdong,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Fri, Jan 13, 2023 at 4:39 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
> >
> >
> >
> > On 1/13/2023 10:31 AM, Jason Wang wrote:
> > > On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >> Spuriously kick the destination device's queue so it knows in case there
> > >> are new descriptors.
> > >>
> > >> RFC: This is somehow a gray area. The guest may have placed descriptors
> > >> in a virtqueue but not kicked it, so it might be surprised if the device
> > >> starts processing it.
> > > So I think this is kind of the work of the vDPA parent. For the parent
> > > that needs this trick, we should do it in the parent driver.
> > Agree, it looks easier implementing this in parent driver,
> > I can implement it in ifcvf set_vq_ready right now
>
> Great, but please check whether or not it is really needed.
>
> Some device implementation could check the available descriptions
> after DRIVER_OK without waiting for a kick.
>

So IIUC we can entirely drop this from the series (and I hope we can).
But then, what with the devices that does *not* check for them?

If we drop it it seems to me we must mandate devices to check for
descriptors at queue_enable. The queue could stall if not, isn't it?

Thanks!

> Thanks
>
> >
> > Thanks
> > Zhu Lingshan
> > >
> > > Thanks
> > >
> > >> However, that information is not in the migration stream and it should
> > >> be an edge case anyhow, being resilient to parallel notifications from
> > >> the guest.
> > >>
> > >> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > >> ---
> > >>   hw/virtio/vhost-vdpa.c | 5 +++++
> > >>   1 file changed, 5 insertions(+)
> > >>
> > >> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > >> index 40b7e8706a..dff94355dd 100644
> > >> --- a/hw/virtio/vhost-vdpa.c
> > >> +++ b/hw/virtio/vhost-vdpa.c
> > >> @@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
> > >>       }
> > >>       trace_vhost_vdpa_set_vring_ready(dev);
> > >>       for (i = 0; i < dev->nvqs; ++i) {
> > >> +        VirtQueue *vq;
> > >>           struct vhost_vring_state state = {
> > >>               .index = dev->vq_index + i,
> > >>               .num = 1,
> > >>           };
> > >>           vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
> > >> +
> > >> +        /* Preemptive kick */
> > >> +        vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
> > >> +        event_notifier_set(virtio_queue_get_host_notifier(vq));
> > >>       }
> > >>       return 0;
> > >>   }
> > >> --
> > >> 2.31.1
> > >>
> >
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK
  2023-01-13  8:19     ` Eugenio Perez Martin
@ 2023-01-13  9:51       ` Stefano Garzarella
  2023-01-13 10:03         ` Eugenio Perez Martin
  2023-01-16  6:36       ` Jason Wang
  1 sibling, 1 reply; 76+ messages in thread
From: Stefano Garzarella @ 2023-01-13  9:51 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Cornelia Huck, Cindy Lu,
	Eli Cohen, Paolo Bonzini, Michael S. Tsirkin, Stefan Hajnoczi,
	Parav Pandit

On Fri, Jan 13, 2023 at 09:19:00AM +0100, Eugenio Perez Martin wrote:
>On Fri, Jan 13, 2023 at 5:36 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>> >
>> > To restore the device at the destination of a live migration we send the
>> > commands through control virtqueue. For a device to read CVQ it must
>> > have received the DRIVER_OK status bit.
>>
>> This probably requires the support from the parent driver and requires
>> some changes or fixes in the parent driver.
>>
>> Some drivers did:
>>
>> parent_set_status():
>> if (DRIVER_OK)
>>     if (queue_enable)
>>         write queue_enable to the device
>>
>> Examples are IFCVF or even vp_vdpa at least. MLX5 seems to be fine.
>>
>
>I don't get your point here. No device should start reading CVQ (or
>any other VQ) without having received DRIVER_OK.
>
>Some parent drivers do not support sending the queue enable command
>after DRIVER_OK, usually because they clean part of the state like the
>set by set_vring_base. Even vdpa_net_sim needs fixes here.
>
>But my understanding is that it should be supported so I consider it a
>bug. Especially after queue_reset patches. Is that what you mean?
>
>> >
>> > However this opens a window where the device could start receiving
>> > packets in rx queue 0 before it receives the RSS configuration. To avoid
>> > that, we will not send vring_enable until all configuration is used by
>> > the device.
>> >
>> > As a first step, run vhost_set_vring_ready for all vhost_net backend after
>> > all of them are started (with DRIVER_OK). This code should not affect
>> > vdpa.
>> >
>> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> > ---
>> >  hw/net/vhost_net.c | 17 ++++++++++++-----
>> >  1 file changed, 12 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>> > index c4eecc6f36..3900599465 100644
>> > --- a/hw/net/vhost_net.c
>> > +++ b/hw/net/vhost_net.c
>> > @@ -399,6 +399,18 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>> >          } else {
>> >              peer = qemu_get_peer(ncs, n->max_queue_pairs);
>> >          }
>> > +        r = vhost_net_start_one(get_vhost_net(peer), dev);
>> > +        if (r < 0) {
>> > +            goto err_start;
>> > +        }
>> > +    }
>> > +
>> > +    for (int j = 0; j < nvhosts; j++) {
>> > +        if (j < data_queue_pairs) {
>> > +            peer = qemu_get_peer(ncs, j);
>> > +        } else {
>> > +            peer = qemu_get_peer(ncs, n->max_queue_pairs);
>> > +        }
>>
>> I fail to understand why we need to change the vhost_net layer? This
>> is vhost-vDPA specific, so I wonder if we can limit the changes to e.g
>> vhost_vdpa_dev_start()?
>>
>
>The vhost-net layer explicitly calls vhost_set_vring_enable before
>vhost_dev_start, and this is exactly the behavior we want to avoid.
>Even if we make changes to vhost_dev, this change is still needed.

I'm working on something similar since I'd like to re-work the following 
commit we merged just before 7.2 release:
     4daa5054c5 vhost: enable vrings in vhost_dev_start() for vhost-user
     devices

vhost-net wasn't the only one who enabled vrings independently, but it 
was easy enough for others devices to avoid it and enable them in 
vhost_dev_start().

Do you think can we avoid in some way this special behaviour of 
vhost-net and enable the vrings in vhost_dev_start?

Thanks,
Stefano

>
>And we want to explicitly enable CVQ first, which "only" vhost_net
>knows which is. To perform that in vhost_vdpa_dev_start would require
>quirks, involving one or more of:
>* Ignore vq enable calls if the device is not the CVQ one. How to
>signal what is the CVQ? Can we trust it will be the last one for all
>kind of devices?
>* Enable queues that do not belong to the last vhost_dev from the enable call.
>* Enable the rest of the queues from the last enable in reverse order.
>* Intercalate the "net load" callback between enabling the last
>vhost_vdpa device and enabling the rest of devices.
>* Add an "enable priority" order?
>
>Thanks!
>
>> Thanks
>>
>> >
>> >          if (peer->vring_enable) {
>> >              /* restore vring enable state */
>> > @@ -408,11 +420,6 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>> >                  goto err_start;
>> >              }
>> >          }
>> > -
>> > -        r = vhost_net_start_one(get_vhost_net(peer), dev);
>> > -        if (r < 0) {
>> > -            goto err_start;
>> > -        }
>> >      }
>> >
>> >      return 0;
>> > --
>> > 2.31.1
>> >
>>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK
  2023-01-13  9:51       ` Stefano Garzarella
@ 2023-01-13 10:03         ` Eugenio Perez Martin
  2023-01-13 10:37           ` Stefano Garzarella
  2023-01-17 15:15           ` Maxime Coquelin
  0 siblings, 2 replies; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-13 10:03 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Jason Wang, qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Cornelia Huck, Cindy Lu,
	Eli Cohen, Paolo Bonzini, Michael S. Tsirkin, Stefan Hajnoczi,
	Parav Pandit, Maxime Coquelin

On Fri, Jan 13, 2023 at 10:51 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Fri, Jan 13, 2023 at 09:19:00AM +0100, Eugenio Perez Martin wrote:
> >On Fri, Jan 13, 2023 at 5:36 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >> >
> >> > To restore the device at the destination of a live migration we send the
> >> > commands through control virtqueue. For a device to read CVQ it must
> >> > have received the DRIVER_OK status bit.
> >>
> >> This probably requires the support from the parent driver and requires
> >> some changes or fixes in the parent driver.
> >>
> >> Some drivers did:
> >>
> >> parent_set_status():
> >> if (DRIVER_OK)
> >>     if (queue_enable)
> >>         write queue_enable to the device
> >>
> >> Examples are IFCVF or even vp_vdpa at least. MLX5 seems to be fine.
> >>
> >
> >I don't get your point here. No device should start reading CVQ (or
> >any other VQ) without having received DRIVER_OK.
> >
> >Some parent drivers do not support sending the queue enable command
> >after DRIVER_OK, usually because they clean part of the state like the
> >set by set_vring_base. Even vdpa_net_sim needs fixes here.
> >
> >But my understanding is that it should be supported so I consider it a
> >bug. Especially after queue_reset patches. Is that what you mean?
> >
> >> >
> >> > However this opens a window where the device could start receiving
> >> > packets in rx queue 0 before it receives the RSS configuration. To avoid
> >> > that, we will not send vring_enable until all configuration is used by
> >> > the device.
> >> >
> >> > As a first step, run vhost_set_vring_ready for all vhost_net backend after
> >> > all of them are started (with DRIVER_OK). This code should not affect
> >> > vdpa.
> >> >
> >> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >> > ---
> >> >  hw/net/vhost_net.c | 17 ++++++++++++-----
> >> >  1 file changed, 12 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> >> > index c4eecc6f36..3900599465 100644
> >> > --- a/hw/net/vhost_net.c
> >> > +++ b/hw/net/vhost_net.c
> >> > @@ -399,6 +399,18 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
> >> >          } else {
> >> >              peer = qemu_get_peer(ncs, n->max_queue_pairs);
> >> >          }
> >> > +        r = vhost_net_start_one(get_vhost_net(peer), dev);
> >> > +        if (r < 0) {
> >> > +            goto err_start;
> >> > +        }
> >> > +    }
> >> > +
> >> > +    for (int j = 0; j < nvhosts; j++) {
> >> > +        if (j < data_queue_pairs) {
> >> > +            peer = qemu_get_peer(ncs, j);
> >> > +        } else {
> >> > +            peer = qemu_get_peer(ncs, n->max_queue_pairs);
> >> > +        }
> >>
> >> I fail to understand why we need to change the vhost_net layer? This
> >> is vhost-vDPA specific, so I wonder if we can limit the changes to e.g
> >> vhost_vdpa_dev_start()?
> >>
> >
> >The vhost-net layer explicitly calls vhost_set_vring_enable before
> >vhost_dev_start, and this is exactly the behavior we want to avoid.
> >Even if we make changes to vhost_dev, this change is still needed.
>
> I'm working on something similar since I'd like to re-work the following
> commit we merged just before 7.2 release:
>      4daa5054c5 vhost: enable vrings in vhost_dev_start() for vhost-user
>      devices
>
> vhost-net wasn't the only one who enabled vrings independently, but it
> was easy enough for others devices to avoid it and enable them in
> vhost_dev_start().
>
> Do you think can we avoid in some way this special behaviour of
> vhost-net and enable the vrings in vhost_dev_start?
>

Actually looking forward to it :). If that gets merged before this
series, I think we could drop this patch.

If I'm not wrong the enable/disable dance is used just by vhost-user
at the moment.

Maxime, could you give us some hints about the tests to use to check
that changes do not introduce regressions in vhost-user?

Thanks!

> Thanks,
> Stefano
>
> >
> >And we want to explicitly enable CVQ first, which "only" vhost_net
> >knows which is. To perform that in vhost_vdpa_dev_start would require
> >quirks, involving one or more of:
> >* Ignore vq enable calls if the device is not the CVQ one. How to
> >signal what is the CVQ? Can we trust it will be the last one for all
> >kind of devices?
> >* Enable queues that do not belong to the last vhost_dev from the enable call.
> >* Enable the rest of the queues from the last enable in reverse order.
> >* Intercalate the "net load" callback between enabling the last
> >vhost_vdpa device and enabling the rest of devices.
> >* Add an "enable priority" order?
> >
> >Thanks!
> >
> >> Thanks
> >>
> >> >
> >> >          if (peer->vring_enable) {
> >> >              /* restore vring enable state */
> >> > @@ -408,11 +420,6 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
> >> >                  goto err_start;
> >> >              }
> >> >          }
> >> > -
> >> > -        r = vhost_net_start_one(get_vhost_net(peer), dev);
> >> > -        if (r < 0) {
> >> > -            goto err_start;
> >> > -        }
> >> >      }
> >> >
> >> >      return 0;
> >> > --
> >> > 2.31.1
> >> >
> >>
> >
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK
  2023-01-13 10:03         ` Eugenio Perez Martin
@ 2023-01-13 10:37           ` Stefano Garzarella
  2023-01-17 15:15           ` Maxime Coquelin
  1 sibling, 0 replies; 76+ messages in thread
From: Stefano Garzarella @ 2023-01-13 10:37 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Cornelia Huck, Cindy Lu,
	Eli Cohen, Paolo Bonzini, Michael S. Tsirkin, Stefan Hajnoczi,
	Parav Pandit, Maxime Coquelin

On Fri, Jan 13, 2023 at 11:03:17AM +0100, Eugenio Perez Martin wrote:
>On Fri, Jan 13, 2023 at 10:51 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>>
>> On Fri, Jan 13, 2023 at 09:19:00AM +0100, Eugenio Perez Martin wrote:
>> >On Fri, Jan 13, 2023 at 5:36 AM Jason Wang <jasowang@redhat.com> wrote:
>> >>
>> >> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>> >> >
>> >> > To restore the device at the destination of a live migration we send the
>> >> > commands through control virtqueue. For a device to read CVQ it must
>> >> > have received the DRIVER_OK status bit.
>> >>
>> >> This probably requires the support from the parent driver and requires
>> >> some changes or fixes in the parent driver.
>> >>
>> >> Some drivers did:
>> >>
>> >> parent_set_status():
>> >> if (DRIVER_OK)
>> >>     if (queue_enable)
>> >>         write queue_enable to the device
>> >>
>> >> Examples are IFCVF or even vp_vdpa at least. MLX5 seems to be fine.
>> >>
>> >
>> >I don't get your point here. No device should start reading CVQ (or
>> >any other VQ) without having received DRIVER_OK.
>> >
>> >Some parent drivers do not support sending the queue enable command
>> >after DRIVER_OK, usually because they clean part of the state like the
>> >set by set_vring_base. Even vdpa_net_sim needs fixes here.
>> >
>> >But my understanding is that it should be supported so I consider it a
>> >bug. Especially after queue_reset patches. Is that what you mean?
>> >
>> >> >
>> >> > However this opens a window where the device could start receiving
>> >> > packets in rx queue 0 before it receives the RSS configuration. To avoid
>> >> > that, we will not send vring_enable until all configuration is used by
>> >> > the device.
>> >> >
>> >> > As a first step, run vhost_set_vring_ready for all vhost_net backend after
>> >> > all of them are started (with DRIVER_OK). This code should not affect
>> >> > vdpa.
>> >> >
>> >> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> >> > ---
>> >> >  hw/net/vhost_net.c | 17 ++++++++++++-----
>> >> >  1 file changed, 12 insertions(+), 5 deletions(-)
>> >> >
>> >> > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>> >> > index c4eecc6f36..3900599465 100644
>> >> > --- a/hw/net/vhost_net.c
>> >> > +++ b/hw/net/vhost_net.c
>> >> > @@ -399,6 +399,18 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>> >> >          } else {
>> >> >              peer = qemu_get_peer(ncs, n->max_queue_pairs);
>> >> >          }
>> >> > +        r = vhost_net_start_one(get_vhost_net(peer), dev);
>> >> > +        if (r < 0) {
>> >> > +            goto err_start;
>> >> > +        }
>> >> > +    }
>> >> > +
>> >> > +    for (int j = 0; j < nvhosts; j++) {
>> >> > +        if (j < data_queue_pairs) {
>> >> > +            peer = qemu_get_peer(ncs, j);
>> >> > +        } else {
>> >> > +            peer = qemu_get_peer(ncs, n->max_queue_pairs);
>> >> > +        }
>> >>
>> >> I fail to understand why we need to change the vhost_net layer? This
>> >> is vhost-vDPA specific, so I wonder if we can limit the changes to e.g
>> >> vhost_vdpa_dev_start()?
>> >>
>> >
>> >The vhost-net layer explicitly calls vhost_set_vring_enable before
>> >vhost_dev_start, and this is exactly the behavior we want to avoid.
>> >Even if we make changes to vhost_dev, this change is still needed.
>>
>> I'm working on something similar since I'd like to re-work the following
>> commit we merged just before 7.2 release:
>>      4daa5054c5 vhost: enable vrings in vhost_dev_start() for vhost-user
>>      devices
>>
>> vhost-net wasn't the only one who enabled vrings independently, but it
>> was easy enough for others devices to avoid it and enable them in
>> vhost_dev_start().
>>
>> Do you think can we avoid in some way this special behaviour of
>> vhost-net and enable the vrings in vhost_dev_start?
>>
>
>Actually looking forward to it :). If that gets merged before this
>series, I think we could drop this patch.

I hope to send a RFC net week :-) let's see...

>
>If I'm not wrong the enable/disable dance is used just by vhost-user
>at the moment.

Yep, I got the same.

My doubts are that for vhost-user-net we enable only the first 
VirtIONet->curr_queue_pairs queue IIUC. While for other devices (and 
maybe also for vDPA devices) we enable all of them.

I need to figure out if it's safe to do this for vhost-user-net as well, 
otherwise I need to find a way to leave this behavior.

>
>Maxime, could you give us some hints about the tests to use to check
>that changes do not introduce regressions in vhost-user?

Yep, any help on how to test is very appreciated.

Thanks,
Stefano



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 01/13] vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
  2023-01-13  6:42     ` Eugenio Perez Martin
@ 2023-01-16  3:01       ` Jason Wang
  0 siblings, 0 replies; 76+ messages in thread
From: Jason Wang @ 2023-01-16  3:01 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/13 14:42, Eugenio Perez Martin 写道:
> On Fri, Jan 13, 2023 at 4:12 AM Jason Wang <jasowang@redhat.com> wrote:
>> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>> VHOST_BACKEND_F_IOTLB_ASID is the feature bit, not the bitmask. Since
>>> the device under test also provided VHOST_BACKEND_F_IOTLB_MSG_V2 and
>>> VHOST_BACKEND_F_IOTLB_BATCH, this went unnoticed.
>>>
>>> Fixes: c1a1008685 ("vdpa: always start CVQ in SVQ mode if possible")
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> Acked-by: Jason Wang <jasowang@redhat.com>
>>
>> Do we need this for -stable?
>>
> The commit c1a1008685 was introduced in this development window so
> there is no stable version of qemu with that patch. But I'm ok to CC
> stable just in case for sure.


Right, just have a check and it doesn't there for 7.2 so there should be 
no need for that.

Thanks


>
> Thanks!
>
>> Thanks
>>
>>> ---
>>>   net/vhost-vdpa.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>> index 1a13a34d35..de5ed8ff22 100644
>>> --- a/net/vhost-vdpa.c
>>> +++ b/net/vhost-vdpa.c
>>> @@ -384,7 +384,7 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>               g_strerror(errno), errno);
>>>           return -1;
>>>       }
>>> -    if (!(backend_features & VHOST_BACKEND_F_IOTLB_ASID) ||
>>> +    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
>>>           !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
>>>           return 0;
>>>       }
>>> --
>>> 2.31.1
>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 02/13] vdpa net: move iova tree creation from init to start
  2023-01-13  7:28     ` Eugenio Perez Martin
@ 2023-01-16  3:05       ` Jason Wang
  2023-01-16  9:14         ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-16  3:05 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/13 15:28, Eugenio Perez Martin 写道:
> On Fri, Jan 13, 2023 at 4:53 AM Jason Wang <jasowang@redhat.com> wrote:
>> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>> Only create iova_tree if and when it is needed.
>>>
>>> The cleanup keeps being responsability of last VQ but this change allows
>>> to merge both cleanup functions.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>   net/vhost-vdpa.c | 101 +++++++++++++++++++++++++++++++++--------------
>>>   1 file changed, 71 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>> index de5ed8ff22..75cca497c8 100644
>>> --- a/net/vhost-vdpa.c
>>> +++ b/net/vhost-vdpa.c
>>> @@ -178,13 +178,9 @@ err_init:
>>>   static void vhost_vdpa_cleanup(NetClientState *nc)
>>>   {
>>>       VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
>>> -    struct vhost_dev *dev = &s->vhost_net->dev;
>>>
>>>       qemu_vfree(s->cvq_cmd_out_buffer);
>>>       qemu_vfree(s->status);
>>> -    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
>>> -        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
>>> -    }
>>>       if (s->vhost_net) {
>>>           vhost_net_cleanup(s->vhost_net);
>>>           g_free(s->vhost_net);
>>> @@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
>>>       return size;
>>>   }
>>>
>>> +/** From any vdpa net client, get the netclient of first queue pair */
>>> +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
>>> +{
>>> +    NICState *nic = qemu_get_nic(s->nc.peer);
>>> +    NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
>>> +
>>> +    return DO_UPCAST(VhostVDPAState, nc, nc0);
>>> +}
>>> +
>>> +static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
>>> +{
>>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
>>> +
>>> +    if (v->shadow_vqs_enabled) {
>>> +        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>> +                                           v->iova_range.last);
>>> +    }
>>> +}
>>> +
>>> +static int vhost_vdpa_net_data_start(NetClientState *nc)
>>> +{
>>> +    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
>>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
>>> +
>>> +    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
>>> +
>>> +    if (v->index == 0) {
>>> +        vhost_vdpa_net_data_start_first(s);
>>> +        return 0;
>>> +    }
>>> +
>>> +    if (v->shadow_vqs_enabled) {
>>> +        VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
>>> +        v->iova_tree = s0->vhost_vdpa.iova_tree;
>>> +    }
>> It looks to me the logic here is somehow the same as
>> vhost_vdpa_net_cvq_start(), can we unify the them?
>>
> It depends on what you mean by unify :). But we can explore it for sure.
>
> We can call vhost_vdpa_net_data_start, but the steps to do if
> s0->vhost_vdpa.iova_tree == NULL are different. Data queues must do
> nothing, but CVQ must create a new iova tree.
>
> So one possibility is to convert this part of vhost_vdpa_net_cvq_start:
>      s0 = vhost_vdpa_net_first_nc_vdpa(s);
>      if (s0->vhost_vdpa.iova_tree) {
>          /* SVQ is already configured for all virtqueues */
>          v->iova_tree = s0->vhost_vdpa.iova_tree;
>      } else {
>          v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>                                             v->iova_range.last);
>      }
>
> into:
>      vhost_vdpa_net_data_start(nc);
>      if (!v->iova_tree) {
>          v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>                                             v->iova_range.last);
>      }
>
> I'm ok with the change but it's less clear in my opinion: it's not
> obvious that net_data_start is in charge of setting v->iova_tree to
> me.


Ok.


>
> Another possibility is to abstract something like
> first_nc_iova_tree(), but we need to check more fields of s0 later
> (shadow_data) so I'm not sure about the benefit.
>
> Is that what you have in mind?


Kind of, but I think we can leave the code as is.

In the future, as discussed, we need to introduce something like a 
parent or opaque structure for NetClientState structure, it can simply a 
lot of things: we can have one same common parent for all queues, then 
there's no need for the trick like first_nc_iova_tree() and other 
similar tricks.

Thanks

>
> Thanks!
>
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void vhost_vdpa_net_client_stop(NetClientState *nc)
>>> +{
>>> +    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
>>> +    struct vhost_dev *dev;
>>> +
>>> +    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
>>> +
>>> +    dev = s->vhost_vdpa.dev;
>>> +    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
>>> +        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
>>> +    }
>>> +}
>>> +
>>>   static NetClientInfo net_vhost_vdpa_info = {
>>>           .type = NET_CLIENT_DRIVER_VHOST_VDPA,
>>>           .size = sizeof(VhostVDPAState),
>>>           .receive = vhost_vdpa_receive,
>>> +        .start = vhost_vdpa_net_data_start,
>>> +        .stop = vhost_vdpa_net_client_stop,
>>>           .cleanup = vhost_vdpa_cleanup,
>>>           .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
>>>           .has_ufo = vhost_vdpa_has_ufo,
>>> @@ -351,7 +401,7 @@ dma_map_err:
>>>
>>>   static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>   {
>>> -    VhostVDPAState *s;
>>> +    VhostVDPAState *s, *s0;
>>>       struct vhost_vdpa *v;
>>>       uint64_t backend_features;
>>>       int64_t cvq_group;
>>> @@ -415,8 +465,6 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>           return r;
>>>       }
>>>
>>> -    v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>> -                                       v->iova_range.last);
>>>       v->shadow_vqs_enabled = true;
>>>       s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
>>>
>>> @@ -425,6 +473,15 @@ out:
>>>           return 0;
>>>       }
>>>
>>> +    s0 = vhost_vdpa_net_first_nc_vdpa(s);
>>> +    if (s0->vhost_vdpa.iova_tree) {
>>> +        /* SVQ is already configured for all virtqueues */
>>> +        v->iova_tree = s0->vhost_vdpa.iova_tree;
>>> +    } else {
>>> +        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>> +                                           v->iova_range.last);
>>> +    }
>>> +
>>>       r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
>>>                                  vhost_vdpa_net_cvq_cmd_page_len(), false);
>>>       if (unlikely(r < 0)) {
>>> @@ -449,15 +506,9 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
>>>       if (s->vhost_vdpa.shadow_vqs_enabled) {
>>>           vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
>>>           vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
>>> -        if (!s->always_svq) {
>>> -            /*
>>> -             * If only the CVQ is shadowed we can delete this safely.
>>> -             * If all the VQs are shadows this will be needed by the time the
>>> -             * device is started again to register SVQ vrings and similar.
>>> -             */
>>> -            g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
>>> -        }
>>>       }
>>> +
>>> +    vhost_vdpa_net_client_stop(nc);
>>>   }
>>>
>>>   static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
>>> @@ -667,8 +718,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>>>                                          int nvqs,
>>>                                          bool is_datapath,
>>>                                          bool svq,
>>> -                                       struct vhost_vdpa_iova_range iova_range,
>>> -                                       VhostIOVATree *iova_tree)
>>> +                                       struct vhost_vdpa_iova_range iova_range)
>>>   {
>>>       NetClientState *nc = NULL;
>>>       VhostVDPAState *s;
>>> @@ -690,7 +740,6 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>>>       s->vhost_vdpa.shadow_vqs_enabled = svq;
>>>       s->vhost_vdpa.iova_range = iova_range;
>>>       s->vhost_vdpa.shadow_data = svq;
>>> -    s->vhost_vdpa.iova_tree = iova_tree;
>>>       if (!is_datapath) {
>>>           s->cvq_cmd_out_buffer = qemu_memalign(qemu_real_host_page_size(),
>>>                                               vhost_vdpa_net_cvq_cmd_page_len());
>>> @@ -760,7 +809,6 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>>>       uint64_t features;
>>>       int vdpa_device_fd;
>>>       g_autofree NetClientState **ncs = NULL;
>>> -    g_autoptr(VhostIOVATree) iova_tree = NULL;
>>>       struct vhost_vdpa_iova_range iova_range;
>>>       NetClientState *nc;
>>>       int queue_pairs, r, i = 0, has_cvq = 0;
>>> @@ -812,12 +860,8 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>>>           goto err;
>>>       }
>>>
>>> -    if (opts->x_svq) {
>>> -        if (!vhost_vdpa_net_valid_svq_features(features, errp)) {
>>> -            goto err_svq;
>>> -        }
>>> -
>>> -        iova_tree = vhost_iova_tree_new(iova_range.first, iova_range.last);
>>> +    if (opts->x_svq && !vhost_vdpa_net_valid_svq_features(features, errp)) {
>>> +        goto err;
>>>       }
>>>
>>>       ncs = g_malloc0(sizeof(*ncs) * queue_pairs);
>>> @@ -825,7 +869,7 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>>>       for (i = 0; i < queue_pairs; i++) {
>>>           ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
>>>                                        vdpa_device_fd, i, 2, true, opts->x_svq,
>>> -                                     iova_range, iova_tree);
>>> +                                     iova_range);
>>>           if (!ncs[i])
>>>               goto err;
>>>       }
>>> @@ -833,13 +877,11 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>>>       if (has_cvq) {
>>>           nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
>>>                                    vdpa_device_fd, i, 1, false,
>>> -                                 opts->x_svq, iova_range, iova_tree);
>>> +                                 opts->x_svq, iova_range);
>>>           if (!nc)
>>>               goto err;
>>>       }
>>>
>>> -    /* iova_tree ownership belongs to last NetClientState */
>>> -    g_steal_pointer(&iova_tree);
>>>       return 0;
>>>
>>>   err:
>>> @@ -849,7 +891,6 @@ err:
>>>           }
>>>       }
>>>
>>> -err_svq:
>>>       qemu_close(vdpa_device_fd);
>>>
>>>       return -1;
>>> --
>>> 2.31.1
>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 04/13] vdpa: rewind at get_base, not set_base
  2023-01-13  7:40     ` Eugenio Perez Martin
@ 2023-01-16  3:32       ` Jason Wang
  2023-01-16  9:53         ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-16  3:32 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/13 15:40, Eugenio Perez Martin 写道:
> On Fri, Jan 13, 2023 at 5:10 AM Jason Wang <jasowang@redhat.com> wrote:
>> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>> At this moment it is only possible to migrate to a vdpa device running
>>> with x-svq=on. As a protective measure, the rewind of the inflight
>>> descriptors was done at the destination. That way if the source sent a
>>> virtqueue with inuse descriptors they are always discarded.
>>>
>>> Since this series allows to migrate also to passthrough devices with no
>>> SVQ, the right thing to do is to rewind at the source so base of vrings
>>> are correct.
>>>
>>> Support for inflight descriptors may be added in the future.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>   include/hw/virtio/vhost-backend.h |  4 +++
>>>   hw/virtio/vhost-vdpa.c            | 46 +++++++++++++++++++------------
>>>   hw/virtio/vhost.c                 |  3 ++
>>>   3 files changed, 36 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>> index c5ab49051e..ec3fbae58d 100644
>>> --- a/include/hw/virtio/vhost-backend.h
>>> +++ b/include/hw/virtio/vhost-backend.h
>>> @@ -130,6 +130,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>>>
>>>   typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
>>>                                          int fd);
>>> +
>>> +typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
>>> +
>>>   typedef struct VhostOps {
>>>       VhostBackendType backend_type;
>>>       vhost_backend_init vhost_backend_init;
>>> @@ -177,6 +180,7 @@ typedef struct VhostOps {
>>>       vhost_get_device_id_op vhost_get_device_id;
>>>       vhost_force_iommu_op vhost_force_iommu;
>>>       vhost_set_config_call_op vhost_set_config_call;
>>> +    vhost_reset_status_op vhost_reset_status;
>>>   } VhostOps;
>>>
>>>   int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>> index 542e003101..28a52ddc78 100644
>>> --- a/hw/virtio/vhost-vdpa.c
>>> +++ b/hw/virtio/vhost-vdpa.c
>>> @@ -1132,14 +1132,23 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>>       if (started) {
>>>           memory_listener_register(&v->listener, &address_space_memory);
>>>           return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
>>> -    } else {
>>> -        vhost_vdpa_reset_device(dev);
>>> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>>> -                                   VIRTIO_CONFIG_S_DRIVER);
>>> -        memory_listener_unregister(&v->listener);
>>> +    }
>>>
>>> -        return 0;
>>> +    return 0;
>>> +}
>>> +
>>> +static void vhost_vdpa_reset_status(struct vhost_dev *dev)
>>> +{
>>> +    struct vhost_vdpa *v = dev->opaque;
>>> +
>>> +    if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
>>> +        return;
>>>       }
>>> +
>>> +    vhost_vdpa_reset_device(dev);
>>> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>>> +                                VIRTIO_CONFIG_S_DRIVER);
>>> +    memory_listener_unregister(&v->listener);
>>>   }
>>>
>>>   static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
>>> @@ -1182,18 +1191,7 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
>>>                                          struct vhost_vring_state *ring)
>>>   {
>>>       struct vhost_vdpa *v = dev->opaque;
>>> -    VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
>>>
>>> -    /*
>>> -     * vhost-vdpa devices does not support in-flight requests. Set all of them
>>> -     * as available.
>>> -     *
>>> -     * TODO: This is ok for networking, but other kinds of devices might
>>> -     * have problems with these retransmissions.
>>> -     */
>>> -    while (virtqueue_rewind(vq, 1)) {
>>> -        continue;
>>> -    }
>>>       if (v->shadow_vqs_enabled) {
>>>           /*
>>>            * Device vring base was set at device start. SVQ base is handled by
>>> @@ -1212,6 +1210,19 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
>>>       int ret;
>>>
>>>       if (v->shadow_vqs_enabled) {
>>> +        VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
>>> +
>>> +        /*
>>> +         * vhost-vdpa devices does not support in-flight requests. Set all of
>>> +         * them as available.
>>> +         *
>>> +         * TODO: This is ok for networking, but other kinds of devices might
>>> +         * have problems with these retransmissions.
>>> +         */
>>> +        while (virtqueue_rewind(vq, 1)) {
>>> +            continue;
>>> +        }
>>> +
>>>           ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
>>>           return 0;
>>>       }
>>> @@ -1326,4 +1337,5 @@ const VhostOps vdpa_ops = {
>>>           .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
>>>           .vhost_force_iommu = vhost_vdpa_force_iommu,
>>>           .vhost_set_config_call = vhost_vdpa_set_config_call,
>>> +        .vhost_reset_status = vhost_vdpa_reset_status,
>> Can we simply use the NetClient stop method here?
>>
> Ouch, I squashed two patches by mistake here.
>
> All the vhost_reset_status part should be independent of this patch,
> and I was especially interested in its feedback. It had this message:
>
>      vdpa: move vhost reset after get vring base
>
>      The function vhost.c:vhost_dev_stop calls vhost operation
>      vhost_dev_start(false). In the case of vdpa it totally reset and wipes
>      the device, making the fetching of the vring base (virtqueue state) totally
>      useless.
>
>      The kernel backend does not use vhost_dev_start vhost op callback, but
>      vhost-user do. A patch to make vhost_user_dev_start more similar to vdpa
>      is desirable, but it can be added on top.
>
> I can resend the series splitting it again but conversation may
> scatter between versions. Would you prefer me to send a new version?


I think it can be done in next version (after we finalize the discussion 
for this version).


>
> Regarding the use of NetClient, it feels weird to call net specific
> functions in VhostOps, doesn't it?


Basically, I meant, the patch call vhost_reset_status() in 
vhost_dev_stop(). But we've already had vhost_dev_start ops where we 
implement per backend start/stop logic.

I think it's better to do things in vhost_dev_start():

For device that can do suspend, we can do suspend. For other we need to 
do reset as a workaround.

And if necessary, we can call nc client ops for net specific operations 
(if it has any).

Thanks


> At the moment vhost ops is
> specialized in vhost-kernel, vhost-user and vhost-vdpa. If we want to
> make it specific to the kind of device, that makes vhost-vdpa-net too.
>
> Thanks!
>
>
>> Thanks
>>
>>>   };
>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>> index eb8c4c378c..a266396576 100644
>>> --- a/hw/virtio/vhost.c
>>> +++ b/hw/virtio/vhost.c
>>> @@ -2049,6 +2049,9 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
>>>                                hdev->vqs + i,
>>>                                hdev->vq_index + i);
>>>       }
>>> +    if (hdev->vhost_ops->vhost_reset_status) {
>>> +        hdev->vhost_ops->vhost_reset_status(hdev);
>>> +    }
>>>
>>>       if (vhost_dev_has_iommu(hdev)) {
>>>           if (hdev->vhost_ops->vhost_set_iotlb_callback) {
>>> --
>>> 2.31.1
>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq
  2023-01-13  7:46     ` Eugenio Perez Martin
@ 2023-01-16  3:34       ` Jason Wang
  2023-01-16  5:23         ` Michael S. Tsirkin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-16  3:34 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/13 15:46, Eugenio Perez Martin 写道:
> On Fri, Jan 13, 2023 at 5:25 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2023/1/13 01:24, Eugenio Pérez 写道:
>>> A vdpa net device must initialize with SVQ in order to be migratable,
>>> and initialization code verifies conditions.  If the device is not
>>> initialized with the x-svq parameter, it will not expose _F_LOG so vhost
>>> sybsystem will block VM migration from its initialization.
>>>
>>> Next patches change this. Net data VQs will be shadowed only at
>>> migration time and vdpa net devices need to expose _F_LOG as long as it
>>> can go to SVQ.
>>>
>>> Since we don't know that at initialization time but at start, add an
>>> independent blocker at CVQ.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    net/vhost-vdpa.c | 35 +++++++++++++++++++++++++++++------
>>>    1 file changed, 29 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>> index 631424d9c4..2ca93e850a 100644
>>> --- a/net/vhost-vdpa.c
>>> +++ b/net/vhost-vdpa.c
>>> @@ -26,12 +26,14 @@
>>>    #include <err.h>
>>>    #include "standard-headers/linux/virtio_net.h"
>>>    #include "monitor/monitor.h"
>>> +#include "migration/blocker.h"
>>>    #include "hw/virtio/vhost.h"
>>>
>>>    /* Todo:need to add the multiqueue support here */
>>>    typedef struct VhostVDPAState {
>>>        NetClientState nc;
>>>        struct vhost_vdpa vhost_vdpa;
>>> +    Error *migration_blocker;
>>
>> Any reason we can't use the mivration_blocker in vhost_dev structure?
>>
>> I believe we don't need to wait until start to know we can't migrate.
>>
> Device migratability also depends on features that the guest acks.


This sounds a little bit tricky, more below:


>
> For example, if the device does not support ASID it can be migrated as
> long as _F_CVQ is not acked.


The management may notice a non-consistent behavior in this case. I 
wonder if we can simply check the host features.

Thanks


>
> Thanks!
>
>> Thanks
>>
>>
>>>        VHostNetState *vhost_net;
>>>
>>>        /* Control commands shadow buffers */
>>> @@ -433,9 +435,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>                g_strerror(errno), errno);
>>>            return -1;
>>>        }
>>> -    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
>>> -        !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
>>> -        return 0;
>>> +    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) {
>>> +        error_setg(&s->migration_blocker,
>>> +                   "vdpa device %s does not support ASID",
>>> +                   nc->name);
>>> +        goto out;
>>> +    }
>>> +    if (!vhost_vdpa_net_valid_svq_features(v->dev->features,
>>> +                                           &s->migration_blocker)) {
>>> +        goto out;
>>>        }
>>>
>>>        /*
>>> @@ -455,7 +463,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>            }
>>>
>>>            if (group == cvq_group) {
>>> -            return 0;
>>> +            error_setg(&s->migration_blocker,
>>> +                "vdpa %s vq %d group %"PRId64" is the same as cvq group "
>>> +                "%"PRId64, nc->name, i, group, cvq_group);
>>> +            goto out;
>>>            }
>>>        }
>>>
>>> @@ -468,8 +479,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>        s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
>>>
>>>    out:
>>> -    if (!s->vhost_vdpa.shadow_vqs_enabled) {
>>> -        return 0;
>>> +    if (s->migration_blocker) {
>>> +        Error *errp = NULL;
>>> +        r = migrate_add_blocker(s->migration_blocker, &errp);
>>> +        if (unlikely(r != 0)) {
>>> +            g_clear_pointer(&s->migration_blocker, error_free);
>>> +            error_report_err(errp);
>>> +        }
>>> +
>>> +        return r;
>>>        }
>>>
>>>        s0 = vhost_vdpa_net_first_nc_vdpa(s);
>>> @@ -513,6 +531,11 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
>>>            vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
>>>        }
>>>
>>> +    if (s->migration_blocker) {
>>> +        migrate_del_blocker(s->migration_blocker);
>>> +        g_clear_pointer(&s->migration_blocker, error_free);
>>> +    }
>>> +
>>>        vhost_vdpa_net_client_stop(nc);
>>>    }
>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq
  2023-01-16  3:34       ` Jason Wang
@ 2023-01-16  5:23         ` Michael S. Tsirkin
  2023-01-16  9:33           ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Michael S. Tsirkin @ 2023-01-16  5:23 UTC (permalink / raw)
  To: Jason Wang
  Cc: Eugenio Perez Martin, qemu-devel, si-wei.liu, Liuxiangdong,
	Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Stefan Hajnoczi, Parav Pandit

On Mon, Jan 16, 2023 at 11:34:20AM +0800, Jason Wang wrote:
> 
> 在 2023/1/13 15:46, Eugenio Perez Martin 写道:
> > On Fri, Jan 13, 2023 at 5:25 AM Jason Wang <jasowang@redhat.com> wrote:
> > > 
> > > 在 2023/1/13 01:24, Eugenio Pérez 写道:
> > > > A vdpa net device must initialize with SVQ in order to be migratable,
> > > > and initialization code verifies conditions.  If the device is not
> > > > initialized with the x-svq parameter, it will not expose _F_LOG so vhost
> > > > sybsystem will block VM migration from its initialization.
> > > > 
> > > > Next patches change this. Net data VQs will be shadowed only at
> > > > migration time and vdpa net devices need to expose _F_LOG as long as it
> > > > can go to SVQ.
> > > > 
> > > > Since we don't know that at initialization time but at start, add an
> > > > independent blocker at CVQ.
> > > > 
> > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > ---
> > > >    net/vhost-vdpa.c | 35 +++++++++++++++++++++++++++++------
> > > >    1 file changed, 29 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > > > index 631424d9c4..2ca93e850a 100644
> > > > --- a/net/vhost-vdpa.c
> > > > +++ b/net/vhost-vdpa.c
> > > > @@ -26,12 +26,14 @@
> > > >    #include <err.h>
> > > >    #include "standard-headers/linux/virtio_net.h"
> > > >    #include "monitor/monitor.h"
> > > > +#include "migration/blocker.h"
> > > >    #include "hw/virtio/vhost.h"
> > > > 
> > > >    /* Todo:need to add the multiqueue support here */
> > > >    typedef struct VhostVDPAState {
> > > >        NetClientState nc;
> > > >        struct vhost_vdpa vhost_vdpa;
> > > > +    Error *migration_blocker;
> > > 
> > > Any reason we can't use the mivration_blocker in vhost_dev structure?
> > > 
> > > I believe we don't need to wait until start to know we can't migrate.
> > > 
> > Device migratability also depends on features that the guest acks.
> 
> 
> This sounds a little bit tricky, more below:
> 
> 
> > 
> > For example, if the device does not support ASID it can be migrated as
> > long as _F_CVQ is not acked.
> 
> 
> The management may notice a non-consistent behavior in this case. I wonder
> if we can simply check the host features.
> 
> Thanks


Yes the issue is that ack can happen after migration started.
I don't think this kind of blocker appearing during migration
is currently expected/supported well. Is it?

> 
> > 
> > Thanks!
> > 
> > > Thanks
> > > 
> > > 
> > > >        VHostNetState *vhost_net;
> > > > 
> > > >        /* Control commands shadow buffers */
> > > > @@ -433,9 +435,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> > > >                g_strerror(errno), errno);
> > > >            return -1;
> > > >        }
> > > > -    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
> > > > -        !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
> > > > -        return 0;
> > > > +    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) {
> > > > +        error_setg(&s->migration_blocker,
> > > > +                   "vdpa device %s does not support ASID",
> > > > +                   nc->name);
> > > > +        goto out;
> > > > +    }
> > > > +    if (!vhost_vdpa_net_valid_svq_features(v->dev->features,
> > > > +                                           &s->migration_blocker)) {
> > > > +        goto out;
> > > >        }
> > > > 
> > > >        /*
> > > > @@ -455,7 +463,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> > > >            }
> > > > 
> > > >            if (group == cvq_group) {
> > > > -            return 0;
> > > > +            error_setg(&s->migration_blocker,
> > > > +                "vdpa %s vq %d group %"PRId64" is the same as cvq group "
> > > > +                "%"PRId64, nc->name, i, group, cvq_group);
> > > > +            goto out;
> > > >            }
> > > >        }
> > > > 
> > > > @@ -468,8 +479,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> > > >        s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
> > > > 
> > > >    out:
> > > > -    if (!s->vhost_vdpa.shadow_vqs_enabled) {
> > > > -        return 0;
> > > > +    if (s->migration_blocker) {
> > > > +        Error *errp = NULL;
> > > > +        r = migrate_add_blocker(s->migration_blocker, &errp);
> > > > +        if (unlikely(r != 0)) {
> > > > +            g_clear_pointer(&s->migration_blocker, error_free);
> > > > +            error_report_err(errp);
> > > > +        }
> > > > +
> > > > +        return r;
> > > >        }
> > > > 
> > > >        s0 = vhost_vdpa_net_first_nc_vdpa(s);
> > > > @@ -513,6 +531,11 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
> > > >            vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
> > > >        }
> > > > 
> > > > +    if (s->migration_blocker) {
> > > > +        migrate_del_blocker(s->migration_blocker);
> > > > +        g_clear_pointer(&s->migration_blocker, error_free);
> > > > +    }
> > > > +
> > > >        vhost_vdpa_net_client_stop(nc);
> > > >    }
> > > > 



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK
  2023-01-13  8:19     ` Eugenio Perez Martin
  2023-01-13  9:51       ` Stefano Garzarella
@ 2023-01-16  6:36       ` Jason Wang
  2023-01-16 16:16         ` Eugenio Perez Martin
  1 sibling, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-16  6:36 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/13 16:19, Eugenio Perez Martin 写道:
> On Fri, Jan 13, 2023 at 5:36 AM Jason Wang <jasowang@redhat.com> wrote:
>> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>> To restore the device at the destination of a live migration we send the
>>> commands through control virtqueue. For a device to read CVQ it must
>>> have received the DRIVER_OK status bit.
>> This probably requires the support from the parent driver and requires
>> some changes or fixes in the parent driver.
>>
>> Some drivers did:
>>
>> parent_set_status():
>> if (DRIVER_OK)
>>      if (queue_enable)
>>          write queue_enable to the device
>>
>> Examples are IFCVF or even vp_vdpa at least. MLX5 seems to be fine.
>>
> I don't get your point here. No device should start reading CVQ (or
> any other VQ) without having received DRIVER_OK.


If I understand the code correctly:

For CVQ, we do SET_VRING_ENABLE before DRIVER_OK, that's fine.

For datapath_vq, we do SET_VRING_ENABLE after DRIVER_OK, this requires 
parent driver support (explained above)


>
> Some parent drivers do not support sending the queue enable command
> after DRIVER_OK, usually because they clean part of the state like the
> set by set_vring_base. Even vdpa_net_sim needs fixes here.


Yes, so the question is:

Do we need another backend feature for this? (otherwise thing may break 
silently)


>
> But my understanding is that it should be supported so I consider it a
> bug.


Probably, we need fine some proof in the spec, e.g in 3.1.1:

"""

7.Perform device-specific setup, including discovery of virtqueues for 
the device, optional per-bus setup, reading and possibly writing the 
device’s virtio configuration space, and population of virtqueues.
8.Set the DRIVER_OK status bit. At this point the device is “live”.

"""

So if my understanding is correct, "discovery of virtqueues for the 
device" implies queue_enable here which is expected to be done before 
DRIVER_OK. But it doesn't say anything regrading to the behaviour of 
setting queue ready after DRIVER_OK.

I'm not sure it's a real bug or not, may Michael and comment on this.


>   Especially after queue_reset patches. Is that what you mean?


We haven't supported queue_reset yet in Qemu. But it allows to write 1 
to queue_enable after DRIVER_OK for sure.


>
>>> However this opens a window where the device could start receiving
>>> packets in rx queue 0 before it receives the RSS configuration. To avoid
>>> that, we will not send vring_enable until all configuration is used by
>>> the device.
>>>
>>> As a first step, run vhost_set_vring_ready for all vhost_net backend after
>>> all of them are started (with DRIVER_OK). This code should not affect
>>> vdpa.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>   hw/net/vhost_net.c | 17 ++++++++++++-----
>>>   1 file changed, 12 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>>> index c4eecc6f36..3900599465 100644
>>> --- a/hw/net/vhost_net.c
>>> +++ b/hw/net/vhost_net.c
>>> @@ -399,6 +399,18 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>>           } else {
>>>               peer = qemu_get_peer(ncs, n->max_queue_pairs);
>>>           }
>>> +        r = vhost_net_start_one(get_vhost_net(peer), dev);
>>> +        if (r < 0) {
>>> +            goto err_start;
>>> +        }
>>> +    }
>>> +
>>> +    for (int j = 0; j < nvhosts; j++) {
>>> +        if (j < data_queue_pairs) {
>>> +            peer = qemu_get_peer(ncs, j);
>>> +        } else {
>>> +            peer = qemu_get_peer(ncs, n->max_queue_pairs);
>>> +        }
>> I fail to understand why we need to change the vhost_net layer? This
>> is vhost-vDPA specific, so I wonder if we can limit the changes to e.g
>> vhost_vdpa_dev_start()?
>>
> The vhost-net layer explicitly calls vhost_set_vring_enable before
> vhost_dev_start, and this is exactly the behavior we want to avoid.
> Even if we make changes to vhost_dev, this change is still needed.


Note that the only user of vhost_set_vring_enable() is vhost-user where 
the semantic is different:

It uses that to changes the number of active queues:

static int peer_attach(VirtIONet *n, int index)

         if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_USER) {
=>      vhost_set_vring_enable(nc->peer, 1);
     }

This is not the semantic of vhost-vDPA that tries to be complaint with 
virtio-spec. So I'm not sure how it can help here.


>
> And we want to explicitly enable CVQ first, which "only" vhost_net
> knows which is.


This should be known by net/vhost-vdpa.c.


> To perform that in vhost_vdpa_dev_start would require
> quirks, involving one or more of:
> * Ignore vq enable calls if the device is not the CVQ one. How to
> signal what is the CVQ? Can we trust it will be the last one for all
> kind of devices?
> * Enable queues that do not belong to the last vhost_dev from the enable call.
> * Enable the rest of the queues from the last enable in reverse order.
> * Intercalate the "net load" callback between enabling the last
> vhost_vdpa device and enabling the rest of devices.
> * Add an "enable priority" order?


Haven't had time in thinking through, but it would be better if we can 
limit the changes in vhost-vdpa layer. E.g currently the 
VHOST_VDPA_SET_VRING_ENABLE is done at vhost_dev_start().

Thanks


>
> Thanks!
>
>> Thanks
>>
>>>           if (peer->vring_enable) {
>>>               /* restore vring enable state */
>>> @@ -408,11 +420,6 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>>                   goto err_start;
>>>               }
>>>           }
>>> -
>>> -        r = vhost_net_start_one(get_vhost_net(peer), dev);
>>> -        if (r < 0) {
>>> -            goto err_start;
>>> -        }
>>>       }
>>>
>>>       return 0;
>>> --
>>> 2.31.1
>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 08/13] vdpa: Negotiate _F_SUSPEND feature
  2023-01-13  8:45     ` Eugenio Perez Martin
@ 2023-01-16  6:48       ` Jason Wang
  2023-01-16 16:17         ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-16  6:48 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/13 16:45, Eugenio Perez Martin 写道:
> On Fri, Jan 13, 2023 at 5:39 AM Jason Wang <jasowang@redhat.com> wrote:
>> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>> This is needed for qemu to know it can suspend the device to retrieve
>>> its status and enable SVQ with it, so all the process is transparent to
>>> the guest.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> Acked-by: Jason Wang <jasowang@redhat.com>
>>
>> We probably need to add the resume in the future to have a quick
>> recovery from migration failures.
>>
> The capability of a resume can be useful here but only in a small
> window. During the most time of the migration SVQ is enabled, so in
> the event of a migration failure we may need to reset the whole device
> to enable passthrough again.


Yes.


>
> But maybe is it worth giving a quick review and adding some TODOs
> where it can be useful in this series?


We can start by having a TODO in this series, and leave resume in for 
the future.

Thanks


>
> Thanks!
>
>> Thanks
>>
>>> ---
>>>   hw/virtio/vhost-vdpa.c | 3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>> index 4296427a69..a61a6b2a74 100644
>>> --- a/hw/virtio/vhost-vdpa.c
>>> +++ b/hw/virtio/vhost-vdpa.c
>>> @@ -659,7 +659,8 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
>>>       uint64_t features;
>>>       uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
>>>           0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
>>> -        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID;
>>> +        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
>>> +        0x1ULL << VHOST_BACKEND_F_SUSPEND;
>>>       int r;
>>>
>>>       if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
>>> --
>>> 2.31.1
>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-01-13  9:00     ` Eugenio Perez Martin
@ 2023-01-16  6:51       ` Jason Wang
  2023-01-16 15:21         ` Eugenio Perez Martin
  2023-01-17  9:58       ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-16  6:51 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit, Juan Quintela,
	David Gilbert, Maxime Coquelin


在 2023/1/13 17:00, Eugenio Perez Martin 写道:
> On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote:
>> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>> This allows net to restart the device backend to configure SVQ on it.
>>>
>>> Ideally, these changes should not be net specific. However, the vdpa net
>>> backend is the one with enough knowledge to configure everything because
>>> of some reasons:
>>> * Queues might need to be shadowed or not depending on its kind (control
>>>    vs data).
>>> * Queues need to share the same map translations (iova tree).
>>>
>>> Because of that it is cleaner to restart the whole net backend and
>>> configure again as expected, similar to how vhost-kernel moves between
>>> userspace and passthrough.
>>>
>>> If more kinds of devices need dynamic switching to SVQ we can create a
>>> callback struct like VhostOps and move most of the code there.
>>> VhostOps cannot be reused since all vdpa backend share them, and to
>>> personalize just for networking would be too heavy.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>   net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 84 insertions(+)
>>>
>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>> index 5d7ad6e4d7..f38532b1df 100644
>>> --- a/net/vhost-vdpa.c
>>> +++ b/net/vhost-vdpa.c
>>> @@ -26,6 +26,8 @@
>>>   #include <err.h>
>>>   #include "standard-headers/linux/virtio_net.h"
>>>   #include "monitor/monitor.h"
>>> +#include "migration/migration.h"
>>> +#include "migration/misc.h"
>>>   #include "migration/blocker.h"
>>>   #include "hw/virtio/vhost.h"
>>>
>>> @@ -33,6 +35,7 @@
>>>   typedef struct VhostVDPAState {
>>>       NetClientState nc;
>>>       struct vhost_vdpa vhost_vdpa;
>>> +    Notifier migration_state;
>>>       Error *migration_blocker;
>>>       VHostNetState *vhost_net;
>>>
>>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
>>>       return DO_UPCAST(VhostVDPAState, nc, nc0);
>>>   }
>>>
>>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
>>> +{
>>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
>>> +    VirtIONet *n;
>>> +    VirtIODevice *vdev;
>>> +    int data_queue_pairs, cvq, r;
>>> +    NetClientState *peer;
>>> +
>>> +    /* We are only called on the first data vqs and only if x-svq is not set */
>>> +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
>>> +        return;
>>> +    }
>>> +
>>> +    vdev = v->dev->vdev;
>>> +    n = VIRTIO_NET(vdev);
>>> +    if (!n->vhost_started) {
>>> +        return;
>>> +    }
>>> +
>>> +    if (enable) {
>>> +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
>> Do we need to check if the device is started or not here?
>>
> v->vhost_started is checked right above, right?


Right, I miss that.


>
>>> +    }
>> I'm not sure I understand the reason for vhost_net_stop() after a
>> VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated.
>>
> I think this is really worth exploring, and it would have been clearer
> if I didn't squash the vhost_reset_status commit by mistake :).
>
> Looking at qemu master vhost.c:vhost_dev_stop:
>      if (hdev->vhost_ops->vhost_dev_start) {
>          hdev->vhost_ops->vhost_dev_start(hdev, false);
>      }
>      if (vrings) {
>          vhost_dev_set_vring_enable(hdev, false);
>      }
>      for (i = 0; i < hdev->nvqs; ++i) {
>          vhost_virtqueue_stop(hdev,
>                               vdev,
>                               hdev->vqs + i,
>                               hdev->vq_index + i);
>      }
>
> Both vhost-used and vhost-vdpa set_status(0) at
> ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so
> they are not recoverable at vhost_virtqueue_stop->get_vring_base, and
> I think it is too late for vdpa devices to change it. I guess
> vhost-user devices do not lose the state there, but I did not test.
>
> I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more
> similar to vhost_user_dev_start. We can make
> vhost_vdpa_dev_start(false) to suspend the device instead. But then we
> need to reset it after getting the indexes. That's why I added
> vhost_vdpa_reset_status, but I admit it is neither the cleanest
> approach nor the best name to it.


I wonder if we can simply suspend in vhost_net_stop() if we know the 
parent can stop?

Thanks


>
> Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much.
>
>>> +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
>>> +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
>>> +                                  n->max_ncs - n->max_queue_pairs : 0;
>>> +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
>>> +
>>> +    peer = s->nc.peer;
>>> +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
>>> +        VhostVDPAState *vdpa_state;
>>> +        NetClientState *nc;
>>> +
>>> +        if (i < data_queue_pairs) {
>>> +            nc = qemu_get_peer(peer, i);
>>> +        } else {
>>> +            nc = qemu_get_peer(peer, n->max_queue_pairs);
>>> +        }
>>> +
>>> +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
>>> +        vdpa_state->vhost_vdpa.shadow_data = enable;
>>> +
>>> +        if (i < data_queue_pairs) {
>>> +            /* Do not override CVQ shadow_vqs_enabled */
>>> +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
>>> +        }
>>> +    }
>>> +
>>> +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
>>> +    if (unlikely(r < 0)) {
>>> +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
>>> +    }
>>> +}
>>> +
>>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
>>> +{
>>> +    MigrationState *migration = data;
>>> +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
>>> +                                     migration_state);
>>> +
>>> +    switch (migration->state) {
>>> +    case MIGRATION_STATUS_SETUP:
>>> +        vhost_vdpa_net_log_global_enable(s, true);
>>> +        return;
>>> +
>>> +    case MIGRATION_STATUS_CANCELLING:
>>> +    case MIGRATION_STATUS_CANCELLED:
>>> +    case MIGRATION_STATUS_FAILED:
>>> +        vhost_vdpa_net_log_global_enable(s, false);
>> Do we need to recover here?
>>
> I may be missing something, but the device is fully reset and restored
> in these cases.
>
> CCing Juan and D. Gilbert, a review would be appreciated to check if
> this covers all the cases.
>
> Thanks!
>
>
>> Thanks
>>
>>> +        return;
>>> +    };
>>> +}
>>> +
>>>   static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
>>>   {
>>>       struct vhost_vdpa *v = &s->vhost_vdpa;
>>>
>>> +    if (v->feature_log) {
>>> +        add_migration_state_change_notifier(&s->migration_state);
>>> +    }
>>> +
>>>       if (v->shadow_vqs_enabled) {
>>>           v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>>                                              v->iova_range.last);
>>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
>>>
>>>       assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
>>>
>>> +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
>>> +        remove_migration_state_change_notifier(&s->migration_state);
>>> +    }
>>> +
>>>       dev = s->vhost_vdpa.dev;
>>>       if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
>>>           g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
>>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>>>       s->vhost_vdpa.device_fd = vdpa_device_fd;
>>>       s->vhost_vdpa.index = queue_pair_index;
>>>       s->always_svq = svq;
>>> +    s->migration_state.notify = vdpa_net_migration_state_notifier;
>>>       s->vhost_vdpa.shadow_vqs_enabled = svq;
>>>       s->vhost_vdpa.iova_range = iova_range;
>>>       s->vhost_vdpa.shadow_data = svq;
>>> --
>>> 2.31.1
>>>
>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-01-13  9:06         ` Eugenio Perez Martin
@ 2023-01-16  7:02           ` Jason Wang
  2023-02-02 16:55             ` Eugenio Perez Martin
  2023-02-02  0:56           ` Si-Wei Liu
  1 sibling, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-16  7:02 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Zhu, Lingshan, qemu-devel, si-wei.liu, Liuxiangdong,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/13 17:06, Eugenio Perez Martin 写道:
> On Fri, Jan 13, 2023 at 4:39 AM Jason Wang <jasowang@redhat.com> wrote:
>> On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
>>>
>>>
>>> On 1/13/2023 10:31 AM, Jason Wang wrote:
>>>> On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>>>> Spuriously kick the destination device's queue so it knows in case there
>>>>> are new descriptors.
>>>>>
>>>>> RFC: This is somehow a gray area. The guest may have placed descriptors
>>>>> in a virtqueue but not kicked it, so it might be surprised if the device
>>>>> starts processing it.
>>>> So I think this is kind of the work of the vDPA parent. For the parent
>>>> that needs this trick, we should do it in the parent driver.
>>> Agree, it looks easier implementing this in parent driver,
>>> I can implement it in ifcvf set_vq_ready right now
>> Great, but please check whether or not it is really needed.
>>
>> Some device implementation could check the available descriptions
>> after DRIVER_OK without waiting for a kick.
>>
> So IIUC we can entirely drop this from the series (and I hope we can).
> But then, what with the devices that does *not* check for them?


It needs mediation in the vDPA parent driver.


>
> If we drop it it seems to me we must mandate devices to check for
> descriptors at queue_enable. The queue could stall if not, isn't it?


I'm not sure, did you see real issue with this? (Note that we don't do 
this for vhost-user-(vDPA))

Btw, the code can result of kick before DRIVER_OK, which seems racy.

Thanks


>
> Thanks!
>
>> Thanks
>>
>>> Thanks
>>> Zhu Lingshan
>>>> Thanks
>>>>
>>>>> However, that information is not in the migration stream and it should
>>>>> be an edge case anyhow, being resilient to parallel notifications from
>>>>> the guest.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>    hw/virtio/vhost-vdpa.c | 5 +++++
>>>>>    1 file changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>> index 40b7e8706a..dff94355dd 100644
>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>> @@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
>>>>>        }
>>>>>        trace_vhost_vdpa_set_vring_ready(dev);
>>>>>        for (i = 0; i < dev->nvqs; ++i) {
>>>>> +        VirtQueue *vq;
>>>>>            struct vhost_vring_state state = {
>>>>>                .index = dev->vq_index + i,
>>>>>                .num = 1,
>>>>>            };
>>>>>            vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
>>>>> +
>>>>> +        /* Preemptive kick */
>>>>> +        vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
>>>>> +        event_notifier_set(virtio_queue_get_host_notifier(vq));
>>>>>        }
>>>>>        return 0;
>>>>>    }
>>>>> --
>>>>> 2.31.1
>>>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 02/13] vdpa net: move iova tree creation from init to start
  2023-01-16  3:05       ` Jason Wang
@ 2023-01-16  9:14         ` Eugenio Perez Martin
  2023-01-17  4:30           ` Jason Wang
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-16  9:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Mon, Jan 16, 2023 at 4:05 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2023/1/13 15:28, Eugenio Perez Martin 写道:
> > On Fri, Jan 13, 2023 at 4:53 AM Jason Wang <jasowang@redhat.com> wrote:
> >> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >>> Only create iova_tree if and when it is needed.
> >>>
> >>> The cleanup keeps being responsability of last VQ but this change allows
> >>> to merge both cleanup functions.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>   net/vhost-vdpa.c | 101 +++++++++++++++++++++++++++++++++--------------
> >>>   1 file changed, 71 insertions(+), 30 deletions(-)
> >>>
> >>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> >>> index de5ed8ff22..75cca497c8 100644
> >>> --- a/net/vhost-vdpa.c
> >>> +++ b/net/vhost-vdpa.c
> >>> @@ -178,13 +178,9 @@ err_init:
> >>>   static void vhost_vdpa_cleanup(NetClientState *nc)
> >>>   {
> >>>       VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
> >>> -    struct vhost_dev *dev = &s->vhost_net->dev;
> >>>
> >>>       qemu_vfree(s->cvq_cmd_out_buffer);
> >>>       qemu_vfree(s->status);
> >>> -    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> >>> -        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> >>> -    }
> >>>       if (s->vhost_net) {
> >>>           vhost_net_cleanup(s->vhost_net);
> >>>           g_free(s->vhost_net);
> >>> @@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
> >>>       return size;
> >>>   }
> >>>
> >>> +/** From any vdpa net client, get the netclient of first queue pair */
> >>> +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> >>> +{
> >>> +    NICState *nic = qemu_get_nic(s->nc.peer);
> >>> +    NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
> >>> +
> >>> +    return DO_UPCAST(VhostVDPAState, nc, nc0);
> >>> +}
> >>> +
> >>> +static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> >>> +{
> >>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
> >>> +
> >>> +    if (v->shadow_vqs_enabled) {
> >>> +        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >>> +                                           v->iova_range.last);
> >>> +    }
> >>> +}
> >>> +
> >>> +static int vhost_vdpa_net_data_start(NetClientState *nc)
> >>> +{
> >>> +    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
> >>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
> >>> +
> >>> +    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> >>> +
> >>> +    if (v->index == 0) {
> >>> +        vhost_vdpa_net_data_start_first(s);
> >>> +        return 0;
> >>> +    }
> >>> +
> >>> +    if (v->shadow_vqs_enabled) {
> >>> +        VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
> >>> +        v->iova_tree = s0->vhost_vdpa.iova_tree;
> >>> +    }
> >> It looks to me the logic here is somehow the same as
> >> vhost_vdpa_net_cvq_start(), can we unify the them?
> >>
> > It depends on what you mean by unify :). But we can explore it for sure.
> >
> > We can call vhost_vdpa_net_data_start, but the steps to do if
> > s0->vhost_vdpa.iova_tree == NULL are different. Data queues must do
> > nothing, but CVQ must create a new iova tree.
> >
> > So one possibility is to convert this part of vhost_vdpa_net_cvq_start:
> >      s0 = vhost_vdpa_net_first_nc_vdpa(s);
> >      if (s0->vhost_vdpa.iova_tree) {
> >          /* SVQ is already configured for all virtqueues */
> >          v->iova_tree = s0->vhost_vdpa.iova_tree;
> >      } else {
> >          v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >                                             v->iova_range.last);
> >      }
> >
> > into:
> >      vhost_vdpa_net_data_start(nc);
> >      if (!v->iova_tree) {
> >          v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >                                             v->iova_range.last);
> >      }
> >
> > I'm ok with the change but it's less clear in my opinion: it's not
> > obvious that net_data_start is in charge of setting v->iova_tree to
> > me.
>
>
> Ok.
>
>
> >
> > Another possibility is to abstract something like
> > first_nc_iova_tree(), but we need to check more fields of s0 later
> > (shadow_data) so I'm not sure about the benefit.
> >
> > Is that what you have in mind?
>
>
> Kind of, but I think we can leave the code as is.
>
> In the future, as discussed, we need to introduce something like a
> parent or opaque structure for NetClientState structure, it can simply a
> lot of things: we can have one same common parent for all queues, then
> there's no need for the trick like first_nc_iova_tree() and other
> similar tricks.
>

So we can ack this patch or you prefer to explore the change for the
next series?

Thanks!

> Thanks
>
> >
> > Thanks!
> >
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static void vhost_vdpa_net_client_stop(NetClientState *nc)
> >>> +{
> >>> +    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
> >>> +    struct vhost_dev *dev;
> >>> +
> >>> +    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> >>> +
> >>> +    dev = s->vhost_vdpa.dev;
> >>> +    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> >>> +        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> >>> +    }
> >>> +}
> >>> +
> >>>   static NetClientInfo net_vhost_vdpa_info = {
> >>>           .type = NET_CLIENT_DRIVER_VHOST_VDPA,
> >>>           .size = sizeof(VhostVDPAState),
> >>>           .receive = vhost_vdpa_receive,
> >>> +        .start = vhost_vdpa_net_data_start,
> >>> +        .stop = vhost_vdpa_net_client_stop,
> >>>           .cleanup = vhost_vdpa_cleanup,
> >>>           .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
> >>>           .has_ufo = vhost_vdpa_has_ufo,
> >>> @@ -351,7 +401,7 @@ dma_map_err:
> >>>
> >>>   static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> >>>   {
> >>> -    VhostVDPAState *s;
> >>> +    VhostVDPAState *s, *s0;
> >>>       struct vhost_vdpa *v;
> >>>       uint64_t backend_features;
> >>>       int64_t cvq_group;
> >>> @@ -415,8 +465,6 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> >>>           return r;
> >>>       }
> >>>
> >>> -    v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >>> -                                       v->iova_range.last);
> >>>       v->shadow_vqs_enabled = true;
> >>>       s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
> >>>
> >>> @@ -425,6 +473,15 @@ out:
> >>>           return 0;
> >>>       }
> >>>
> >>> +    s0 = vhost_vdpa_net_first_nc_vdpa(s);
> >>> +    if (s0->vhost_vdpa.iova_tree) {
> >>> +        /* SVQ is already configured for all virtqueues */
> >>> +        v->iova_tree = s0->vhost_vdpa.iova_tree;
> >>> +    } else {
> >>> +        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >>> +                                           v->iova_range.last);
> >>> +    }
> >>> +
> >>>       r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
> >>>                                  vhost_vdpa_net_cvq_cmd_page_len(), false);
> >>>       if (unlikely(r < 0)) {
> >>> @@ -449,15 +506,9 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
> >>>       if (s->vhost_vdpa.shadow_vqs_enabled) {
> >>>           vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
> >>>           vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
> >>> -        if (!s->always_svq) {
> >>> -            /*
> >>> -             * If only the CVQ is shadowed we can delete this safely.
> >>> -             * If all the VQs are shadows this will be needed by the time the
> >>> -             * device is started again to register SVQ vrings and similar.
> >>> -             */
> >>> -            g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> >>> -        }
> >>>       }
> >>> +
> >>> +    vhost_vdpa_net_client_stop(nc);
> >>>   }
> >>>
> >>>   static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
> >>> @@ -667,8 +718,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> >>>                                          int nvqs,
> >>>                                          bool is_datapath,
> >>>                                          bool svq,
> >>> -                                       struct vhost_vdpa_iova_range iova_range,
> >>> -                                       VhostIOVATree *iova_tree)
> >>> +                                       struct vhost_vdpa_iova_range iova_range)
> >>>   {
> >>>       NetClientState *nc = NULL;
> >>>       VhostVDPAState *s;
> >>> @@ -690,7 +740,6 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> >>>       s->vhost_vdpa.shadow_vqs_enabled = svq;
> >>>       s->vhost_vdpa.iova_range = iova_range;
> >>>       s->vhost_vdpa.shadow_data = svq;
> >>> -    s->vhost_vdpa.iova_tree = iova_tree;
> >>>       if (!is_datapath) {
> >>>           s->cvq_cmd_out_buffer = qemu_memalign(qemu_real_host_page_size(),
> >>>                                               vhost_vdpa_net_cvq_cmd_page_len());
> >>> @@ -760,7 +809,6 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
> >>>       uint64_t features;
> >>>       int vdpa_device_fd;
> >>>       g_autofree NetClientState **ncs = NULL;
> >>> -    g_autoptr(VhostIOVATree) iova_tree = NULL;
> >>>       struct vhost_vdpa_iova_range iova_range;
> >>>       NetClientState *nc;
> >>>       int queue_pairs, r, i = 0, has_cvq = 0;
> >>> @@ -812,12 +860,8 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
> >>>           goto err;
> >>>       }
> >>>
> >>> -    if (opts->x_svq) {
> >>> -        if (!vhost_vdpa_net_valid_svq_features(features, errp)) {
> >>> -            goto err_svq;
> >>> -        }
> >>> -
> >>> -        iova_tree = vhost_iova_tree_new(iova_range.first, iova_range.last);
> >>> +    if (opts->x_svq && !vhost_vdpa_net_valid_svq_features(features, errp)) {
> >>> +        goto err;
> >>>       }
> >>>
> >>>       ncs = g_malloc0(sizeof(*ncs) * queue_pairs);
> >>> @@ -825,7 +869,7 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
> >>>       for (i = 0; i < queue_pairs; i++) {
> >>>           ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
> >>>                                        vdpa_device_fd, i, 2, true, opts->x_svq,
> >>> -                                     iova_range, iova_tree);
> >>> +                                     iova_range);
> >>>           if (!ncs[i])
> >>>               goto err;
> >>>       }
> >>> @@ -833,13 +877,11 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
> >>>       if (has_cvq) {
> >>>           nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
> >>>                                    vdpa_device_fd, i, 1, false,
> >>> -                                 opts->x_svq, iova_range, iova_tree);
> >>> +                                 opts->x_svq, iova_range);
> >>>           if (!nc)
> >>>               goto err;
> >>>       }
> >>>
> >>> -    /* iova_tree ownership belongs to last NetClientState */
> >>> -    g_steal_pointer(&iova_tree);
> >>>       return 0;
> >>>
> >>>   err:
> >>> @@ -849,7 +891,6 @@ err:
> >>>           }
> >>>       }
> >>>
> >>> -err_svq:
> >>>       qemu_close(vdpa_device_fd);
> >>>
> >>>       return -1;
> >>> --
> >>> 2.31.1
> >>>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq
  2023-01-16  5:23         ` Michael S. Tsirkin
@ 2023-01-16  9:33           ` Eugenio Perez Martin
  2023-01-17  5:42             ` Jason Wang
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-16  9:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Stefan Hajnoczi, Parav Pandit

On Mon, Jan 16, 2023 at 6:24 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Jan 16, 2023 at 11:34:20AM +0800, Jason Wang wrote:
> >
> > 在 2023/1/13 15:46, Eugenio Perez Martin 写道:
> > > On Fri, Jan 13, 2023 at 5:25 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > 在 2023/1/13 01:24, Eugenio Pérez 写道:
> > > > > A vdpa net device must initialize with SVQ in order to be migratable,
> > > > > and initialization code verifies conditions.  If the device is not
> > > > > initialized with the x-svq parameter, it will not expose _F_LOG so vhost
> > > > > sybsystem will block VM migration from its initialization.
> > > > >
> > > > > Next patches change this. Net data VQs will be shadowed only at
> > > > > migration time and vdpa net devices need to expose _F_LOG as long as it
> > > > > can go to SVQ.
> > > > >
> > > > > Since we don't know that at initialization time but at start, add an
> > > > > independent blocker at CVQ.
> > > > >
> > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > ---
> > > > >    net/vhost-vdpa.c | 35 +++++++++++++++++++++++++++++------
> > > > >    1 file changed, 29 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > > > > index 631424d9c4..2ca93e850a 100644
> > > > > --- a/net/vhost-vdpa.c
> > > > > +++ b/net/vhost-vdpa.c
> > > > > @@ -26,12 +26,14 @@
> > > > >    #include <err.h>
> > > > >    #include "standard-headers/linux/virtio_net.h"
> > > > >    #include "monitor/monitor.h"
> > > > > +#include "migration/blocker.h"
> > > > >    #include "hw/virtio/vhost.h"
> > > > >
> > > > >    /* Todo:need to add the multiqueue support here */
> > > > >    typedef struct VhostVDPAState {
> > > > >        NetClientState nc;
> > > > >        struct vhost_vdpa vhost_vdpa;
> > > > > +    Error *migration_blocker;
> > > >
> > > > Any reason we can't use the mivration_blocker in vhost_dev structure?
> > > >
> > > > I believe we don't need to wait until start to know we can't migrate.
> > > >
> > > Device migratability also depends on features that the guest acks.
> >
> >
> > This sounds a little bit tricky, more below:
> >
> >
> > >
> > > For example, if the device does not support ASID it can be migrated as
> > > long as _F_CVQ is not acked.
> >
> >
> > The management may notice a non-consistent behavior in this case. I wonder
> > if we can simply check the host features.
> >

That's right, and I can see how that can be an issue.

However, the check for the ASID is based on queue indexes at the
moment. If we want to register the blocker at the initialization
moment the only option I see is to do two features ack & reset cycle:
one with MQ and another one without MQ.

Would it be more correct to assume the device will assign the right
ASID only probing one configuration? I don't think so but I'm ok to
leave the code that way if we agree it is more viable.

> > Thanks
>
>
> Yes the issue is that ack can happen after migration started.
> I don't think this kind of blocker appearing during migration
> is currently expected/supported well. Is it?
>

In that case the guest cannot DRIVER_OK the device, because the call
to migrate_add_blocker fails and the error propagates from
vhost_net_start up to the virtio device.

But I can also see how this is inconvenient and to add a migration
blocker at initialization can simplify things here. As long as we
agree on the right way to probe I can send a new version that way for
sure.

Thanks!

> >
> > >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > > >
> > > > >        VHostNetState *vhost_net;
> > > > >
> > > > >        /* Control commands shadow buffers */
> > > > > @@ -433,9 +435,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> > > > >                g_strerror(errno), errno);
> > > > >            return -1;
> > > > >        }
> > > > > -    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
> > > > > -        !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
> > > > > -        return 0;
> > > > > +    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) {
> > > > > +        error_setg(&s->migration_blocker,
> > > > > +                   "vdpa device %s does not support ASID",
> > > > > +                   nc->name);
> > > > > +        goto out;
> > > > > +    }
> > > > > +    if (!vhost_vdpa_net_valid_svq_features(v->dev->features,
> > > > > +                                           &s->migration_blocker)) {
> > > > > +        goto out;
> > > > >        }
> > > > >
> > > > >        /*
> > > > > @@ -455,7 +463,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> > > > >            }
> > > > >
> > > > >            if (group == cvq_group) {
> > > > > -            return 0;
> > > > > +            error_setg(&s->migration_blocker,
> > > > > +                "vdpa %s vq %d group %"PRId64" is the same as cvq group "
> > > > > +                "%"PRId64, nc->name, i, group, cvq_group);
> > > > > +            goto out;
> > > > >            }
> > > > >        }
> > > > >
> > > > > @@ -468,8 +479,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> > > > >        s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
> > > > >
> > > > >    out:
> > > > > -    if (!s->vhost_vdpa.shadow_vqs_enabled) {
> > > > > -        return 0;
> > > > > +    if (s->migration_blocker) {
> > > > > +        Error *errp = NULL;
> > > > > +        r = migrate_add_blocker(s->migration_blocker, &errp);
> > > > > +        if (unlikely(r != 0)) {
> > > > > +            g_clear_pointer(&s->migration_blocker, error_free);
> > > > > +            error_report_err(errp);
> > > > > +        }
> > > > > +
> > > > > +        return r;
> > > > >        }
> > > > >
> > > > >        s0 = vhost_vdpa_net_first_nc_vdpa(s);
> > > > > @@ -513,6 +531,11 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
> > > > >            vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
> > > > >        }
> > > > >
> > > > > +    if (s->migration_blocker) {
> > > > > +        migrate_del_blocker(s->migration_blocker);
> > > > > +        g_clear_pointer(&s->migration_blocker, error_free);
> > > > > +    }
> > > > > +
> > > > >        vhost_vdpa_net_client_stop(nc);
> > > > >    }
> > > > >
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 04/13] vdpa: rewind at get_base, not set_base
  2023-01-16  3:32       ` Jason Wang
@ 2023-01-16  9:53         ` Eugenio Perez Martin
  2023-01-17  4:38           ` Jason Wang
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-16  9:53 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Mon, Jan 16, 2023 at 4:32 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2023/1/13 15:40, Eugenio Perez Martin 写道:
> > On Fri, Jan 13, 2023 at 5:10 AM Jason Wang <jasowang@redhat.com> wrote:
> >> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >>> At this moment it is only possible to migrate to a vdpa device running
> >>> with x-svq=on. As a protective measure, the rewind of the inflight
> >>> descriptors was done at the destination. That way if the source sent a
> >>> virtqueue with inuse descriptors they are always discarded.
> >>>
> >>> Since this series allows to migrate also to passthrough devices with no
> >>> SVQ, the right thing to do is to rewind at the source so base of vrings
> >>> are correct.
> >>>
> >>> Support for inflight descriptors may be added in the future.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>   include/hw/virtio/vhost-backend.h |  4 +++
> >>>   hw/virtio/vhost-vdpa.c            | 46 +++++++++++++++++++------------
> >>>   hw/virtio/vhost.c                 |  3 ++
> >>>   3 files changed, 36 insertions(+), 17 deletions(-)
> >>>
> >>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>> index c5ab49051e..ec3fbae58d 100644
> >>> --- a/include/hw/virtio/vhost-backend.h
> >>> +++ b/include/hw/virtio/vhost-backend.h
> >>> @@ -130,6 +130,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
> >>>
> >>>   typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
> >>>                                          int fd);
> >>> +
> >>> +typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
> >>> +
> >>>   typedef struct VhostOps {
> >>>       VhostBackendType backend_type;
> >>>       vhost_backend_init vhost_backend_init;
> >>> @@ -177,6 +180,7 @@ typedef struct VhostOps {
> >>>       vhost_get_device_id_op vhost_get_device_id;
> >>>       vhost_force_iommu_op vhost_force_iommu;
> >>>       vhost_set_config_call_op vhost_set_config_call;
> >>> +    vhost_reset_status_op vhost_reset_status;
> >>>   } VhostOps;
> >>>
> >>>   int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> >>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>> index 542e003101..28a52ddc78 100644
> >>> --- a/hw/virtio/vhost-vdpa.c
> >>> +++ b/hw/virtio/vhost-vdpa.c
> >>> @@ -1132,14 +1132,23 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >>>       if (started) {
> >>>           memory_listener_register(&v->listener, &address_space_memory);
> >>>           return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> >>> -    } else {
> >>> -        vhost_vdpa_reset_device(dev);
> >>> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> >>> -                                   VIRTIO_CONFIG_S_DRIVER);
> >>> -        memory_listener_unregister(&v->listener);
> >>> +    }
> >>>
> >>> -        return 0;
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static void vhost_vdpa_reset_status(struct vhost_dev *dev)
> >>> +{
> >>> +    struct vhost_vdpa *v = dev->opaque;
> >>> +
> >>> +    if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
> >>> +        return;
> >>>       }
> >>> +
> >>> +    vhost_vdpa_reset_device(dev);
> >>> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> >>> +                                VIRTIO_CONFIG_S_DRIVER);
> >>> +    memory_listener_unregister(&v->listener);
> >>>   }
> >>>
> >>>   static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
> >>> @@ -1182,18 +1191,7 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
> >>>                                          struct vhost_vring_state *ring)
> >>>   {
> >>>       struct vhost_vdpa *v = dev->opaque;
> >>> -    VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
> >>>
> >>> -    /*
> >>> -     * vhost-vdpa devices does not support in-flight requests. Set all of them
> >>> -     * as available.
> >>> -     *
> >>> -     * TODO: This is ok for networking, but other kinds of devices might
> >>> -     * have problems with these retransmissions.
> >>> -     */
> >>> -    while (virtqueue_rewind(vq, 1)) {
> >>> -        continue;
> >>> -    }
> >>>       if (v->shadow_vqs_enabled) {
> >>>           /*
> >>>            * Device vring base was set at device start. SVQ base is handled by
> >>> @@ -1212,6 +1210,19 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
> >>>       int ret;
> >>>
> >>>       if (v->shadow_vqs_enabled) {
> >>> +        VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
> >>> +
> >>> +        /*
> >>> +         * vhost-vdpa devices does not support in-flight requests. Set all of
> >>> +         * them as available.
> >>> +         *
> >>> +         * TODO: This is ok for networking, but other kinds of devices might
> >>> +         * have problems with these retransmissions.
> >>> +         */
> >>> +        while (virtqueue_rewind(vq, 1)) {
> >>> +            continue;
> >>> +        }
> >>> +
> >>>           ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
> >>>           return 0;
> >>>       }
> >>> @@ -1326,4 +1337,5 @@ const VhostOps vdpa_ops = {
> >>>           .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
> >>>           .vhost_force_iommu = vhost_vdpa_force_iommu,
> >>>           .vhost_set_config_call = vhost_vdpa_set_config_call,
> >>> +        .vhost_reset_status = vhost_vdpa_reset_status,
> >> Can we simply use the NetClient stop method here?
> >>
> > Ouch, I squashed two patches by mistake here.
> >
> > All the vhost_reset_status part should be independent of this patch,
> > and I was especially interested in its feedback. It had this message:
> >
> >      vdpa: move vhost reset after get vring base
> >
> >      The function vhost.c:vhost_dev_stop calls vhost operation
> >      vhost_dev_start(false). In the case of vdpa it totally reset and wipes
> >      the device, making the fetching of the vring base (virtqueue state) totally
> >      useless.
> >
> >      The kernel backend does not use vhost_dev_start vhost op callback, but
> >      vhost-user do. A patch to make vhost_user_dev_start more similar to vdpa
> >      is desirable, but it can be added on top.
> >
> > I can resend the series splitting it again but conversation may
> > scatter between versions. Would you prefer me to send a new version?
>
>
> I think it can be done in next version (after we finalize the discussion
> for this version).
>
>
> >
> > Regarding the use of NetClient, it feels weird to call net specific
> > functions in VhostOps, doesn't it?
>
>
> Basically, I meant, the patch call vhost_reset_status() in
> vhost_dev_stop(). But we've already had vhost_dev_start ops where we
> implement per backend start/stop logic.
>
> I think it's better to do things in vhost_dev_start():
>
> For device that can do suspend, we can do suspend. For other we need to
> do reset as a workaround.
>

If the device implements _F_SUSPEND we can call suspend in
vhost_dev_start(false) and fetch the vq base after it. But we cannot
call vhost_dev_reset until we get the vq base. If we do it, we will
always get zero there.

If we don't reset the device at vhost_vdpa_dev_start(false) we need to
call a proper reset after getting the base, at least in vdpa. So to
create a new vhost_op should be the right thing to do, isn't it?

Hopefully with a better name than vhost_vdpa_reset_status, that's for sure :).

I'm not sure how vhost-user works with this or when it does reset the
indexes. My bet is that it never does at the device reinitialization
and it trusts VMM calls to vhost_user_set_base but I may be wrong.

Thanks!

> And if necessary, we can call nc client ops for net specific operations
> (if it has any).
>
> Thanks
>
>
> > At the moment vhost ops is
> > specialized in vhost-kernel, vhost-user and vhost-vdpa. If we want to
> > make it specific to the kind of device, that makes vhost-vdpa-net too.
> >
> > Thanks!
> >
> >
> >> Thanks
> >>
> >>>   };
> >>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >>> index eb8c4c378c..a266396576 100644
> >>> --- a/hw/virtio/vhost.c
> >>> +++ b/hw/virtio/vhost.c
> >>> @@ -2049,6 +2049,9 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> >>>                                hdev->vqs + i,
> >>>                                hdev->vq_index + i);
> >>>       }
> >>> +    if (hdev->vhost_ops->vhost_reset_status) {
> >>> +        hdev->vhost_ops->vhost_reset_status(hdev);
> >>> +    }
> >>>
> >>>       if (vhost_dev_has_iommu(hdev)) {
> >>>           if (hdev->vhost_ops->vhost_set_iotlb_callback) {
> >>> --
> >>> 2.31.1
> >>>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-01-16  6:51       ` Jason Wang
@ 2023-01-16 15:21         ` Eugenio Perez Martin
  0 siblings, 0 replies; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-16 15:21 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit, Juan Quintela,
	David Gilbert, Maxime Coquelin

On Mon, Jan 16, 2023 at 7:51 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2023/1/13 17:00, Eugenio Perez Martin 写道:
> > On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote:
> >> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >>> This allows net to restart the device backend to configure SVQ on it.
> >>>
> >>> Ideally, these changes should not be net specific. However, the vdpa net
> >>> backend is the one with enough knowledge to configure everything because
> >>> of some reasons:
> >>> * Queues might need to be shadowed or not depending on its kind (control
> >>>    vs data).
> >>> * Queues need to share the same map translations (iova tree).
> >>>
> >>> Because of that it is cleaner to restart the whole net backend and
> >>> configure again as expected, similar to how vhost-kernel moves between
> >>> userspace and passthrough.
> >>>
> >>> If more kinds of devices need dynamic switching to SVQ we can create a
> >>> callback struct like VhostOps and move most of the code there.
> >>> VhostOps cannot be reused since all vdpa backend share them, and to
> >>> personalize just for networking would be too heavy.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>   net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>>   1 file changed, 84 insertions(+)
> >>>
> >>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> >>> index 5d7ad6e4d7..f38532b1df 100644
> >>> --- a/net/vhost-vdpa.c
> >>> +++ b/net/vhost-vdpa.c
> >>> @@ -26,6 +26,8 @@
> >>>   #include <err.h>
> >>>   #include "standard-headers/linux/virtio_net.h"
> >>>   #include "monitor/monitor.h"
> >>> +#include "migration/migration.h"
> >>> +#include "migration/misc.h"
> >>>   #include "migration/blocker.h"
> >>>   #include "hw/virtio/vhost.h"
> >>>
> >>> @@ -33,6 +35,7 @@
> >>>   typedef struct VhostVDPAState {
> >>>       NetClientState nc;
> >>>       struct vhost_vdpa vhost_vdpa;
> >>> +    Notifier migration_state;
> >>>       Error *migration_blocker;
> >>>       VHostNetState *vhost_net;
> >>>
> >>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> >>>       return DO_UPCAST(VhostVDPAState, nc, nc0);
> >>>   }
> >>>
> >>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> >>> +{
> >>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
> >>> +    VirtIONet *n;
> >>> +    VirtIODevice *vdev;
> >>> +    int data_queue_pairs, cvq, r;
> >>> +    NetClientState *peer;
> >>> +
> >>> +    /* We are only called on the first data vqs and only if x-svq is not set */
> >>> +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    vdev = v->dev->vdev;
> >>> +    n = VIRTIO_NET(vdev);
> >>> +    if (!n->vhost_started) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    if (enable) {
> >>> +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
> >> Do we need to check if the device is started or not here?
> >>
> > v->vhost_started is checked right above, right?
>
>
> Right, I miss that.
>
>
> >
> >>> +    }
> >> I'm not sure I understand the reason for vhost_net_stop() after a
> >> VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated.
> >>
> > I think this is really worth exploring, and it would have been clearer
> > if I didn't squash the vhost_reset_status commit by mistake :).
> >
> > Looking at qemu master vhost.c:vhost_dev_stop:
> >      if (hdev->vhost_ops->vhost_dev_start) {
> >          hdev->vhost_ops->vhost_dev_start(hdev, false);
> >      }
> >      if (vrings) {
> >          vhost_dev_set_vring_enable(hdev, false);
> >      }
> >      for (i = 0; i < hdev->nvqs; ++i) {
> >          vhost_virtqueue_stop(hdev,
> >                               vdev,
> >                               hdev->vqs + i,
> >                               hdev->vq_index + i);
> >      }
> >
> > Both vhost-used and vhost-vdpa set_status(0) at
> > ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so
> > they are not recoverable at vhost_virtqueue_stop->get_vring_base, and
> > I think it is too late for vdpa devices to change it. I guess
> > vhost-user devices do not lose the state there, but I did not test.
> >
> > I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more
> > similar to vhost_user_dev_start. We can make
> > vhost_vdpa_dev_start(false) to suspend the device instead. But then we
> > need to reset it after getting the indexes. That's why I added
> > vhost_vdpa_reset_status, but I admit it is neither the cleanest
> > approach nor the best name to it.
>
>
> I wonder if we can simply suspend in vhost_net_stop() if we know the
> parent can stop?
>

Sure, that's possible, I'll move that code to vhost_net_stop.

Thanks!

> Thanks
>
>
> >
> > Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much.
> >
> >>> +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> >>> +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> >>> +                                  n->max_ncs - n->max_queue_pairs : 0;
> >>> +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> >>> +
> >>> +    peer = s->nc.peer;
> >>> +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> >>> +        VhostVDPAState *vdpa_state;
> >>> +        NetClientState *nc;
> >>> +
> >>> +        if (i < data_queue_pairs) {
> >>> +            nc = qemu_get_peer(peer, i);
> >>> +        } else {
> >>> +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> >>> +        }
> >>> +
> >>> +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> >>> +        vdpa_state->vhost_vdpa.shadow_data = enable;
> >>> +
> >>> +        if (i < data_queue_pairs) {
> >>> +            /* Do not override CVQ shadow_vqs_enabled */
> >>> +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> >>> +        }
> >>> +    }
> >>> +
> >>> +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
> >>> +    if (unlikely(r < 0)) {
> >>> +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> >>> +    }
> >>> +}
> >>> +
> >>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
> >>> +{
> >>> +    MigrationState *migration = data;
> >>> +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> >>> +                                     migration_state);
> >>> +
> >>> +    switch (migration->state) {
> >>> +    case MIGRATION_STATUS_SETUP:
> >>> +        vhost_vdpa_net_log_global_enable(s, true);
> >>> +        return;
> >>> +
> >>> +    case MIGRATION_STATUS_CANCELLING:
> >>> +    case MIGRATION_STATUS_CANCELLED:
> >>> +    case MIGRATION_STATUS_FAILED:
> >>> +        vhost_vdpa_net_log_global_enable(s, false);
> >> Do we need to recover here?
> >>
> > I may be missing something, but the device is fully reset and restored
> > in these cases.
> >
> > CCing Juan and D. Gilbert, a review would be appreciated to check if
> > this covers all the cases.
> >
> > Thanks!
> >
> >
> >> Thanks
> >>
> >>> +        return;
> >>> +    };
> >>> +}
> >>> +
> >>>   static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> >>>   {
> >>>       struct vhost_vdpa *v = &s->vhost_vdpa;
> >>>
> >>> +    if (v->feature_log) {
> >>> +        add_migration_state_change_notifier(&s->migration_state);
> >>> +    }
> >>> +
> >>>       if (v->shadow_vqs_enabled) {
> >>>           v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >>>                                              v->iova_range.last);
> >>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
> >>>
> >>>       assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> >>>
> >>> +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> >>> +        remove_migration_state_change_notifier(&s->migration_state);
> >>> +    }
> >>> +
> >>>       dev = s->vhost_vdpa.dev;
> >>>       if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> >>>           g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> >>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> >>>       s->vhost_vdpa.device_fd = vdpa_device_fd;
> >>>       s->vhost_vdpa.index = queue_pair_index;
> >>>       s->always_svq = svq;
> >>> +    s->migration_state.notify = vdpa_net_migration_state_notifier;
> >>>       s->vhost_vdpa.shadow_vqs_enabled = svq;
> >>>       s->vhost_vdpa.iova_range = iova_range;
> >>>       s->vhost_vdpa.shadow_data = svq;
> >>> --
> >>> 2.31.1
> >>>
> >>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK
  2023-01-16  6:36       ` Jason Wang
@ 2023-01-16 16:16         ` Eugenio Perez Martin
  2023-01-17  5:36           ` Jason Wang
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-16 16:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Mon, Jan 16, 2023 at 7:37 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2023/1/13 16:19, Eugenio Perez Martin 写道:
> > On Fri, Jan 13, 2023 at 5:36 AM Jason Wang <jasowang@redhat.com> wrote:
> >> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >>> To restore the device at the destination of a live migration we send the
> >>> commands through control virtqueue. For a device to read CVQ it must
> >>> have received the DRIVER_OK status bit.
> >> This probably requires the support from the parent driver and requires
> >> some changes or fixes in the parent driver.
> >>
> >> Some drivers did:
> >>
> >> parent_set_status():
> >> if (DRIVER_OK)
> >>      if (queue_enable)
> >>          write queue_enable to the device
> >>
> >> Examples are IFCVF or even vp_vdpa at least. MLX5 seems to be fine.
> >>
> > I don't get your point here. No device should start reading CVQ (or
> > any other VQ) without having received DRIVER_OK.
>
>
> If I understand the code correctly:
>
> For CVQ, we do SET_VRING_ENABLE before DRIVER_OK, that's fine.
>
> For datapath_vq, we do SET_VRING_ENABLE after DRIVER_OK, this requires
> parent driver support (explained above)
>
>
> >
> > Some parent drivers do not support sending the queue enable command
> > after DRIVER_OK, usually because they clean part of the state like the
> > set by set_vring_base. Even vdpa_net_sim needs fixes here.
>
>
> Yes, so the question is:
>
> Do we need another backend feature for this? (otherwise thing may break
> silently)
>
>
> >
> > But my understanding is that it should be supported so I consider it a
> > bug.
>
>
> Probably, we need fine some proof in the spec, e.g in 3.1.1:
>
> """
>
> 7.Perform device-specific setup, including discovery of virtqueues for
> the device, optional per-bus setup, reading and possibly writing the
> device’s virtio configuration space, and population of virtqueues.
> 8.Set the DRIVER_OK status bit. At this point the device is “live”.
>
> """
>
> So if my understanding is correct, "discovery of virtqueues for the
> device" implies queue_enable here which is expected to be done before
> DRIVER_OK. But it doesn't say anything regrading to the behaviour of
> setting queue ready after DRIVER_OK.
>
> I'm not sure it's a real bug or not, may Michael and comment on this.
>

Right, input on this topic would be really appreciated.

>
> >   Especially after queue_reset patches. Is that what you mean?
>
>
> We haven't supported queue_reset yet in Qemu. But it allows to write 1
> to queue_enable after DRIVER_OK for sure.
>

I was not clear, I meant in the emulated device. I'm testing this
series with the proposal of _F_STATE.

>
> >
> >>> However this opens a window where the device could start receiving
> >>> packets in rx queue 0 before it receives the RSS configuration. To avoid
> >>> that, we will not send vring_enable until all configuration is used by
> >>> the device.
> >>>
> >>> As a first step, run vhost_set_vring_ready for all vhost_net backend after
> >>> all of them are started (with DRIVER_OK). This code should not affect
> >>> vdpa.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>   hw/net/vhost_net.c | 17 ++++++++++++-----
> >>>   1 file changed, 12 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> >>> index c4eecc6f36..3900599465 100644
> >>> --- a/hw/net/vhost_net.c
> >>> +++ b/hw/net/vhost_net.c
> >>> @@ -399,6 +399,18 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
> >>>           } else {
> >>>               peer = qemu_get_peer(ncs, n->max_queue_pairs);
> >>>           }
> >>> +        r = vhost_net_start_one(get_vhost_net(peer), dev);
> >>> +        if (r < 0) {
> >>> +            goto err_start;
> >>> +        }
> >>> +    }
> >>> +
> >>> +    for (int j = 0; j < nvhosts; j++) {
> >>> +        if (j < data_queue_pairs) {
> >>> +            peer = qemu_get_peer(ncs, j);
> >>> +        } else {
> >>> +            peer = qemu_get_peer(ncs, n->max_queue_pairs);
> >>> +        }
> >> I fail to understand why we need to change the vhost_net layer? This
> >> is vhost-vDPA specific, so I wonder if we can limit the changes to e.g
> >> vhost_vdpa_dev_start()?
> >>
> > The vhost-net layer explicitly calls vhost_set_vring_enable before
> > vhost_dev_start, and this is exactly the behavior we want to avoid.
> > Even if we make changes to vhost_dev, this change is still needed.
>
>
> Note that the only user of vhost_set_vring_enable() is vhost-user where
> the semantic is different:
>
> It uses that to changes the number of active queues:
>
> static int peer_attach(VirtIONet *n, int index)
>
>          if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_USER) {
> =>      vhost_set_vring_enable(nc->peer, 1);
>      }
>
> This is not the semantic of vhost-vDPA that tries to be complaint with
> virtio-spec. So I'm not sure how it can help here.
>

Right, but previous changes use enable callback to delay the enable of
the datapath virtqueues. I'll try to fit the changes in
virtio/vhost-vdpa though.

Thanks!

>
> >
> > And we want to explicitly enable CVQ first, which "only" vhost_net
> > knows which is.
>
>
> This should be known by net/vhost-vdpa.c.
>
>
> > To perform that in vhost_vdpa_dev_start would require
> > quirks, involving one or more of:
> > * Ignore vq enable calls if the device is not the CVQ one. How to
> > signal what is the CVQ? Can we trust it will be the last one for all
> > kind of devices?
> > * Enable queues that do not belong to the last vhost_dev from the enable call.
> > * Enable the rest of the queues from the last enable in reverse order.
> > * Intercalate the "net load" callback between enabling the last
> > vhost_vdpa device and enabling the rest of devices.
> > * Add an "enable priority" order?
>
>
> Haven't had time in thinking through, but it would be better if we can
> limit the changes in vhost-vdpa layer. E.g currently the
> VHOST_VDPA_SET_VRING_ENABLE is done at vhost_dev_start().
>
> Thanks
>
>
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>>           if (peer->vring_enable) {
> >>>               /* restore vring enable state */
> >>> @@ -408,11 +420,6 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
> >>>                   goto err_start;
> >>>               }
> >>>           }
> >>> -
> >>> -        r = vhost_net_start_one(get_vhost_net(peer), dev);
> >>> -        if (r < 0) {
> >>> -            goto err_start;
> >>> -        }
> >>>       }
> >>>
> >>>       return 0;
> >>> --
> >>> 2.31.1
> >>>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 08/13] vdpa: Negotiate _F_SUSPEND feature
  2023-01-16  6:48       ` Jason Wang
@ 2023-01-16 16:17         ` Eugenio Perez Martin
  0 siblings, 0 replies; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-16 16:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Mon, Jan 16, 2023 at 7:48 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2023/1/13 16:45, Eugenio Perez Martin 写道:
> > On Fri, Jan 13, 2023 at 5:39 AM Jason Wang <jasowang@redhat.com> wrote:
> >> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >>> This is needed for qemu to know it can suspend the device to retrieve
> >>> its status and enable SVQ with it, so all the process is transparent to
> >>> the guest.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >> Acked-by: Jason Wang <jasowang@redhat.com>
> >>
> >> We probably need to add the resume in the future to have a quick
> >> recovery from migration failures.
> >>
> > The capability of a resume can be useful here but only in a small
> > window. During the most time of the migration SVQ is enabled, so in
> > the event of a migration failure we may need to reset the whole device
> > to enable passthrough again.
>
>
> Yes.
>
>
> >
> > But maybe is it worth giving a quick review and adding some TODOs
> > where it can be useful in this series?
>
>
> We can start by having a TODO in this series, and leave resume in for
> the future.
>

Got it, I'll add in the next series.

Thanks!

> Thanks
>
>
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>> ---
> >>>   hw/virtio/vhost-vdpa.c | 3 ++-
> >>>   1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>> index 4296427a69..a61a6b2a74 100644
> >>> --- a/hw/virtio/vhost-vdpa.c
> >>> +++ b/hw/virtio/vhost-vdpa.c
> >>> @@ -659,7 +659,8 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
> >>>       uint64_t features;
> >>>       uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
> >>>           0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
> >>> -        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID;
> >>> +        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
> >>> +        0x1ULL << VHOST_BACKEND_F_SUSPEND;
> >>>       int r;
> >>>
> >>>       if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
> >>> --
> >>> 2.31.1
> >>>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 02/13] vdpa net: move iova tree creation from init to start
  2023-01-16  9:14         ` Eugenio Perez Martin
@ 2023-01-17  4:30           ` Jason Wang
  0 siblings, 0 replies; 76+ messages in thread
From: Jason Wang @ 2023-01-17  4:30 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/16 17:14, Eugenio Perez Martin 写道:
> On Mon, Jan 16, 2023 at 4:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2023/1/13 15:28, Eugenio Perez Martin 写道:
>>> On Fri, Jan 13, 2023 at 4:53 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>>>> Only create iova_tree if and when it is needed.
>>>>>
>>>>> The cleanup keeps being responsability of last VQ but this change allows
>>>>> to merge both cleanup functions.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>    net/vhost-vdpa.c | 101 +++++++++++++++++++++++++++++++++--------------
>>>>>    1 file changed, 71 insertions(+), 30 deletions(-)
>>>>>
>>>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>>>> index de5ed8ff22..75cca497c8 100644
>>>>> --- a/net/vhost-vdpa.c
>>>>> +++ b/net/vhost-vdpa.c
>>>>> @@ -178,13 +178,9 @@ err_init:
>>>>>    static void vhost_vdpa_cleanup(NetClientState *nc)
>>>>>    {
>>>>>        VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
>>>>> -    struct vhost_dev *dev = &s->vhost_net->dev;
>>>>>
>>>>>        qemu_vfree(s->cvq_cmd_out_buffer);
>>>>>        qemu_vfree(s->status);
>>>>> -    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
>>>>> -        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
>>>>> -    }
>>>>>        if (s->vhost_net) {
>>>>>            vhost_net_cleanup(s->vhost_net);
>>>>>            g_free(s->vhost_net);
>>>>> @@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
>>>>>        return size;
>>>>>    }
>>>>>
>>>>> +/** From any vdpa net client, get the netclient of first queue pair */
>>>>> +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
>>>>> +{
>>>>> +    NICState *nic = qemu_get_nic(s->nc.peer);
>>>>> +    NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
>>>>> +
>>>>> +    return DO_UPCAST(VhostVDPAState, nc, nc0);
>>>>> +}
>>>>> +
>>>>> +static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
>>>>> +{
>>>>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
>>>>> +
>>>>> +    if (v->shadow_vqs_enabled) {
>>>>> +        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>>>> +                                           v->iova_range.last);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static int vhost_vdpa_net_data_start(NetClientState *nc)
>>>>> +{
>>>>> +    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
>>>>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
>>>>> +
>>>>> +    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
>>>>> +
>>>>> +    if (v->index == 0) {
>>>>> +        vhost_vdpa_net_data_start_first(s);
>>>>> +        return 0;
>>>>> +    }
>>>>> +
>>>>> +    if (v->shadow_vqs_enabled) {
>>>>> +        VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
>>>>> +        v->iova_tree = s0->vhost_vdpa.iova_tree;
>>>>> +    }
>>>> It looks to me the logic here is somehow the same as
>>>> vhost_vdpa_net_cvq_start(), can we unify the them?
>>>>
>>> It depends on what you mean by unify :). But we can explore it for sure.
>>>
>>> We can call vhost_vdpa_net_data_start, but the steps to do if
>>> s0->vhost_vdpa.iova_tree == NULL are different. Data queues must do
>>> nothing, but CVQ must create a new iova tree.
>>>
>>> So one possibility is to convert this part of vhost_vdpa_net_cvq_start:
>>>       s0 = vhost_vdpa_net_first_nc_vdpa(s);
>>>       if (s0->vhost_vdpa.iova_tree) {
>>>           /* SVQ is already configured for all virtqueues */
>>>           v->iova_tree = s0->vhost_vdpa.iova_tree;
>>>       } else {
>>>           v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>>                                              v->iova_range.last);
>>>       }
>>>
>>> into:
>>>       vhost_vdpa_net_data_start(nc);
>>>       if (!v->iova_tree) {
>>>           v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>>                                              v->iova_range.last);
>>>       }
>>>
>>> I'm ok with the change but it's less clear in my opinion: it's not
>>> obvious that net_data_start is in charge of setting v->iova_tree to
>>> me.
>>
>> Ok.
>>
>>
>>> Another possibility is to abstract something like
>>> first_nc_iova_tree(), but we need to check more fields of s0 later
>>> (shadow_data) so I'm not sure about the benefit.
>>>
>>> Is that what you have in mind?
>>
>> Kind of, but I think we can leave the code as is.
>>
>> In the future, as discussed, we need to introduce something like a
>> parent or opaque structure for NetClientState structure, it can simply a
>> lot of things: we can have one same common parent for all queues, then
>> there's no need for the trick like first_nc_iova_tree() and other
>> similar tricks.
>>
> So we can ack this patch or you prefer to explore the change for the
> next series?


Let's let it for future.

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks


>
> Thanks!
>
>> Thanks
>>
>>> Thanks!
>>>
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static void vhost_vdpa_net_client_stop(NetClientState *nc)
>>>>> +{
>>>>> +    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
>>>>> +    struct vhost_dev *dev;
>>>>> +
>>>>> +    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
>>>>> +
>>>>> +    dev = s->vhost_vdpa.dev;
>>>>> +    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
>>>>> +        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>    static NetClientInfo net_vhost_vdpa_info = {
>>>>>            .type = NET_CLIENT_DRIVER_VHOST_VDPA,
>>>>>            .size = sizeof(VhostVDPAState),
>>>>>            .receive = vhost_vdpa_receive,
>>>>> +        .start = vhost_vdpa_net_data_start,
>>>>> +        .stop = vhost_vdpa_net_client_stop,
>>>>>            .cleanup = vhost_vdpa_cleanup,
>>>>>            .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
>>>>>            .has_ufo = vhost_vdpa_has_ufo,
>>>>> @@ -351,7 +401,7 @@ dma_map_err:
>>>>>
>>>>>    static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>>>    {
>>>>> -    VhostVDPAState *s;
>>>>> +    VhostVDPAState *s, *s0;
>>>>>        struct vhost_vdpa *v;
>>>>>        uint64_t backend_features;
>>>>>        int64_t cvq_group;
>>>>> @@ -415,8 +465,6 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>>>            return r;
>>>>>        }
>>>>>
>>>>> -    v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>>>> -                                       v->iova_range.last);
>>>>>        v->shadow_vqs_enabled = true;
>>>>>        s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
>>>>>
>>>>> @@ -425,6 +473,15 @@ out:
>>>>>            return 0;
>>>>>        }
>>>>>
>>>>> +    s0 = vhost_vdpa_net_first_nc_vdpa(s);
>>>>> +    if (s0->vhost_vdpa.iova_tree) {
>>>>> +        /* SVQ is already configured for all virtqueues */
>>>>> +        v->iova_tree = s0->vhost_vdpa.iova_tree;
>>>>> +    } else {
>>>>> +        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>>>> +                                           v->iova_range.last);
>>>>> +    }
>>>>> +
>>>>>        r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
>>>>>                                   vhost_vdpa_net_cvq_cmd_page_len(), false);
>>>>>        if (unlikely(r < 0)) {
>>>>> @@ -449,15 +506,9 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
>>>>>        if (s->vhost_vdpa.shadow_vqs_enabled) {
>>>>>            vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
>>>>>            vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
>>>>> -        if (!s->always_svq) {
>>>>> -            /*
>>>>> -             * If only the CVQ is shadowed we can delete this safely.
>>>>> -             * If all the VQs are shadows this will be needed by the time the
>>>>> -             * device is started again to register SVQ vrings and similar.
>>>>> -             */
>>>>> -            g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
>>>>> -        }
>>>>>        }
>>>>> +
>>>>> +    vhost_vdpa_net_client_stop(nc);
>>>>>    }
>>>>>
>>>>>    static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
>>>>> @@ -667,8 +718,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>>>>>                                           int nvqs,
>>>>>                                           bool is_datapath,
>>>>>                                           bool svq,
>>>>> -                                       struct vhost_vdpa_iova_range iova_range,
>>>>> -                                       VhostIOVATree *iova_tree)
>>>>> +                                       struct vhost_vdpa_iova_range iova_range)
>>>>>    {
>>>>>        NetClientState *nc = NULL;
>>>>>        VhostVDPAState *s;
>>>>> @@ -690,7 +740,6 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>>>>>        s->vhost_vdpa.shadow_vqs_enabled = svq;
>>>>>        s->vhost_vdpa.iova_range = iova_range;
>>>>>        s->vhost_vdpa.shadow_data = svq;
>>>>> -    s->vhost_vdpa.iova_tree = iova_tree;
>>>>>        if (!is_datapath) {
>>>>>            s->cvq_cmd_out_buffer = qemu_memalign(qemu_real_host_page_size(),
>>>>>                                                vhost_vdpa_net_cvq_cmd_page_len());
>>>>> @@ -760,7 +809,6 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>>>>>        uint64_t features;
>>>>>        int vdpa_device_fd;
>>>>>        g_autofree NetClientState **ncs = NULL;
>>>>> -    g_autoptr(VhostIOVATree) iova_tree = NULL;
>>>>>        struct vhost_vdpa_iova_range iova_range;
>>>>>        NetClientState *nc;
>>>>>        int queue_pairs, r, i = 0, has_cvq = 0;
>>>>> @@ -812,12 +860,8 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>>>>>            goto err;
>>>>>        }
>>>>>
>>>>> -    if (opts->x_svq) {
>>>>> -        if (!vhost_vdpa_net_valid_svq_features(features, errp)) {
>>>>> -            goto err_svq;
>>>>> -        }
>>>>> -
>>>>> -        iova_tree = vhost_iova_tree_new(iova_range.first, iova_range.last);
>>>>> +    if (opts->x_svq && !vhost_vdpa_net_valid_svq_features(features, errp)) {
>>>>> +        goto err;
>>>>>        }
>>>>>
>>>>>        ncs = g_malloc0(sizeof(*ncs) * queue_pairs);
>>>>> @@ -825,7 +869,7 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>>>>>        for (i = 0; i < queue_pairs; i++) {
>>>>>            ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
>>>>>                                         vdpa_device_fd, i, 2, true, opts->x_svq,
>>>>> -                                     iova_range, iova_tree);
>>>>> +                                     iova_range);
>>>>>            if (!ncs[i])
>>>>>                goto err;
>>>>>        }
>>>>> @@ -833,13 +877,11 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
>>>>>        if (has_cvq) {
>>>>>            nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
>>>>>                                     vdpa_device_fd, i, 1, false,
>>>>> -                                 opts->x_svq, iova_range, iova_tree);
>>>>> +                                 opts->x_svq, iova_range);
>>>>>            if (!nc)
>>>>>                goto err;
>>>>>        }
>>>>>
>>>>> -    /* iova_tree ownership belongs to last NetClientState */
>>>>> -    g_steal_pointer(&iova_tree);
>>>>>        return 0;
>>>>>
>>>>>    err:
>>>>> @@ -849,7 +891,6 @@ err:
>>>>>            }
>>>>>        }
>>>>>
>>>>> -err_svq:
>>>>>        qemu_close(vdpa_device_fd);
>>>>>
>>>>>        return -1;
>>>>> --
>>>>> 2.31.1
>>>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 04/13] vdpa: rewind at get_base, not set_base
  2023-01-16  9:53         ` Eugenio Perez Martin
@ 2023-01-17  4:38           ` Jason Wang
  2023-01-17  6:57             ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Jason Wang @ 2023-01-17  4:38 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/16 17:53, Eugenio Perez Martin 写道:
> On Mon, Jan 16, 2023 at 4:32 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2023/1/13 15:40, Eugenio Perez Martin 写道:
>>> On Fri, Jan 13, 2023 at 5:10 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>>>> At this moment it is only possible to migrate to a vdpa device running
>>>>> with x-svq=on. As a protective measure, the rewind of the inflight
>>>>> descriptors was done at the destination. That way if the source sent a
>>>>> virtqueue with inuse descriptors they are always discarded.
>>>>>
>>>>> Since this series allows to migrate also to passthrough devices with no
>>>>> SVQ, the right thing to do is to rewind at the source so base of vrings
>>>>> are correct.
>>>>>
>>>>> Support for inflight descriptors may be added in the future.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>    include/hw/virtio/vhost-backend.h |  4 +++
>>>>>    hw/virtio/vhost-vdpa.c            | 46 +++++++++++++++++++------------
>>>>>    hw/virtio/vhost.c                 |  3 ++
>>>>>    3 files changed, 36 insertions(+), 17 deletions(-)
>>>>>
>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>> index c5ab49051e..ec3fbae58d 100644
>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>> @@ -130,6 +130,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>>>>>
>>>>>    typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
>>>>>                                           int fd);
>>>>> +
>>>>> +typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
>>>>> +
>>>>>    typedef struct VhostOps {
>>>>>        VhostBackendType backend_type;
>>>>>        vhost_backend_init vhost_backend_init;
>>>>> @@ -177,6 +180,7 @@ typedef struct VhostOps {
>>>>>        vhost_get_device_id_op vhost_get_device_id;
>>>>>        vhost_force_iommu_op vhost_force_iommu;
>>>>>        vhost_set_config_call_op vhost_set_config_call;
>>>>> +    vhost_reset_status_op vhost_reset_status;
>>>>>    } VhostOps;
>>>>>
>>>>>    int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>> index 542e003101..28a52ddc78 100644
>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>> @@ -1132,14 +1132,23 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>>>>        if (started) {
>>>>>            memory_listener_register(&v->listener, &address_space_memory);
>>>>>            return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
>>>>> -    } else {
>>>>> -        vhost_vdpa_reset_device(dev);
>>>>> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>>>>> -                                   VIRTIO_CONFIG_S_DRIVER);
>>>>> -        memory_listener_unregister(&v->listener);
>>>>> +    }
>>>>>
>>>>> -        return 0;
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static void vhost_vdpa_reset_status(struct vhost_dev *dev)
>>>>> +{
>>>>> +    struct vhost_vdpa *v = dev->opaque;
>>>>> +
>>>>> +    if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
>>>>> +        return;
>>>>>        }
>>>>> +
>>>>> +    vhost_vdpa_reset_device(dev);
>>>>> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>>>>> +                                VIRTIO_CONFIG_S_DRIVER);
>>>>> +    memory_listener_unregister(&v->listener);
>>>>>    }
>>>>>
>>>>>    static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
>>>>> @@ -1182,18 +1191,7 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
>>>>>                                           struct vhost_vring_state *ring)
>>>>>    {
>>>>>        struct vhost_vdpa *v = dev->opaque;
>>>>> -    VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
>>>>>
>>>>> -    /*
>>>>> -     * vhost-vdpa devices does not support in-flight requests. Set all of them
>>>>> -     * as available.
>>>>> -     *
>>>>> -     * TODO: This is ok for networking, but other kinds of devices might
>>>>> -     * have problems with these retransmissions.
>>>>> -     */
>>>>> -    while (virtqueue_rewind(vq, 1)) {
>>>>> -        continue;
>>>>> -    }
>>>>>        if (v->shadow_vqs_enabled) {
>>>>>            /*
>>>>>             * Device vring base was set at device start. SVQ base is handled by
>>>>> @@ -1212,6 +1210,19 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
>>>>>        int ret;
>>>>>
>>>>>        if (v->shadow_vqs_enabled) {
>>>>> +        VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
>>>>> +
>>>>> +        /*
>>>>> +         * vhost-vdpa devices does not support in-flight requests. Set all of
>>>>> +         * them as available.
>>>>> +         *
>>>>> +         * TODO: This is ok for networking, but other kinds of devices might
>>>>> +         * have problems with these retransmissions.
>>>>> +         */
>>>>> +        while (virtqueue_rewind(vq, 1)) {
>>>>> +            continue;
>>>>> +        }
>>>>> +
>>>>>            ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
>>>>>            return 0;
>>>>>        }
>>>>> @@ -1326,4 +1337,5 @@ const VhostOps vdpa_ops = {
>>>>>            .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
>>>>>            .vhost_force_iommu = vhost_vdpa_force_iommu,
>>>>>            .vhost_set_config_call = vhost_vdpa_set_config_call,
>>>>> +        .vhost_reset_status = vhost_vdpa_reset_status,
>>>> Can we simply use the NetClient stop method here?
>>>>
>>> Ouch, I squashed two patches by mistake here.
>>>
>>> All the vhost_reset_status part should be independent of this patch,
>>> and I was especially interested in its feedback. It had this message:
>>>
>>>       vdpa: move vhost reset after get vring base
>>>
>>>       The function vhost.c:vhost_dev_stop calls vhost operation
>>>       vhost_dev_start(false). In the case of vdpa it totally reset and wipes
>>>       the device, making the fetching of the vring base (virtqueue state) totally
>>>       useless.
>>>
>>>       The kernel backend does not use vhost_dev_start vhost op callback, but
>>>       vhost-user do. A patch to make vhost_user_dev_start more similar to vdpa
>>>       is desirable, but it can be added on top.
>>>
>>> I can resend the series splitting it again but conversation may
>>> scatter between versions. Would you prefer me to send a new version?
>>
>> I think it can be done in next version (after we finalize the discussion
>> for this version).
>>
>>
>>> Regarding the use of NetClient, it feels weird to call net specific
>>> functions in VhostOps, doesn't it?
>>
>> Basically, I meant, the patch call vhost_reset_status() in
>> vhost_dev_stop(). But we've already had vhost_dev_start ops where we
>> implement per backend start/stop logic.
>>
>> I think it's better to do things in vhost_dev_start():
>>
>> For device that can do suspend, we can do suspend. For other we need to
>> do reset as a workaround.
>>
> If the device implements _F_SUSPEND we can call suspend in
> vhost_dev_start(false) and fetch the vq base after it. But we cannot
> call vhost_dev_reset until we get the vq base. If we do it, we will
> always get zero there.


I'm not sure I understand here, that is kind of expected. For the device 
that doesn't support suspend, we can't get base anyhow since we need to 
emulate the stop with reset then we lose all the states.


>
> If we don't reset the device at vhost_vdpa_dev_start(false) we need to
> call a proper reset after getting the base, at least in vdpa.


This looks racy if we do get base before reset? Device can move the 
last_avail_idx.


> So to
> create a new vhost_op should be the right thing to do, isn't it?


So we did:

vhost_dev_stop()
     hdev->vhost_ops->vhost_dev_start(hdev, false);
     vhost_virtqueue_stop()
         vhost_get_vring_base()

I don't see any issue if we do suspend in vhost_dev_stop() in this case?

For the device that doesn't support suspend, we do reset in the stop and 
fail the get_vring_base() then we can use software fallback 
virtio_queue_restore_last_avail_idx()

?


>
> Hopefully with a better name than vhost_vdpa_reset_status, that's for sure :).
>
> I'm not sure how vhost-user works with this or when it does reset the
> indexes. My bet is that it never does at the device reinitialization
> and it trusts VMM calls to vhost_user_set_base but I may be wrong.


I think it's more safe to not touch the code path for vhost-user, it may 
connect to various kind of backends some of which might be fragile.

Thanks


>
> Thanks!
>
>> And if necessary, we can call nc client ops for net specific operations
>> (if it has any).
>>
>> Thanks
>>
>>
>>> At the moment vhost ops is
>>> specialized in vhost-kernel, vhost-user and vhost-vdpa. If we want to
>>> make it specific to the kind of device, that makes vhost-vdpa-net too.
>>>
>>> Thanks!
>>>
>>>
>>>> Thanks
>>>>
>>>>>    };
>>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>>> index eb8c4c378c..a266396576 100644
>>>>> --- a/hw/virtio/vhost.c
>>>>> +++ b/hw/virtio/vhost.c
>>>>> @@ -2049,6 +2049,9 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
>>>>>                                 hdev->vqs + i,
>>>>>                                 hdev->vq_index + i);
>>>>>        }
>>>>> +    if (hdev->vhost_ops->vhost_reset_status) {
>>>>> +        hdev->vhost_ops->vhost_reset_status(hdev);
>>>>> +    }
>>>>>
>>>>>        if (vhost_dev_has_iommu(hdev)) {
>>>>>            if (hdev->vhost_ops->vhost_set_iotlb_callback) {
>>>>> --
>>>>> 2.31.1
>>>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK
  2023-01-16 16:16         ` Eugenio Perez Martin
@ 2023-01-17  5:36           ` Jason Wang
  0 siblings, 0 replies; 76+ messages in thread
From: Jason Wang @ 2023-01-17  5:36 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit


在 2023/1/17 00:16, Eugenio Perez Martin 写道:
> On Mon, Jan 16, 2023 at 7:37 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2023/1/13 16:19, Eugenio Perez Martin 写道:
>>> On Fri, Jan 13, 2023 at 5:36 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>>>> To restore the device at the destination of a live migration we send the
>>>>> commands through control virtqueue. For a device to read CVQ it must
>>>>> have received the DRIVER_OK status bit.
>>>> This probably requires the support from the parent driver and requires
>>>> some changes or fixes in the parent driver.
>>>>
>>>> Some drivers did:
>>>>
>>>> parent_set_status():
>>>> if (DRIVER_OK)
>>>>       if (queue_enable)
>>>>           write queue_enable to the device
>>>>
>>>> Examples are IFCVF or even vp_vdpa at least. MLX5 seems to be fine.
>>>>
>>> I don't get your point here. No device should start reading CVQ (or
>>> any other VQ) without having received DRIVER_OK.
>>
>> If I understand the code correctly:
>>
>> For CVQ, we do SET_VRING_ENABLE before DRIVER_OK, that's fine.
>>
>> For datapath_vq, we do SET_VRING_ENABLE after DRIVER_OK, this requires
>> parent driver support (explained above)
>>
>>
>>> Some parent drivers do not support sending the queue enable command
>>> after DRIVER_OK, usually because they clean part of the state like the
>>> set by set_vring_base. Even vdpa_net_sim needs fixes here.
>>
>> Yes, so the question is:
>>
>> Do we need another backend feature for this? (otherwise thing may break
>> silently)
>>
>>
>>> But my understanding is that it should be supported so I consider it a
>>> bug.
>>
>> Probably, we need fine some proof in the spec, e.g in 3.1.1:
>>
>> """
>>
>> 7.Perform device-specific setup, including discovery of virtqueues for
>> the device, optional per-bus setup, reading and possibly writing the
>> device’s virtio configuration space, and population of virtqueues.
>> 8.Set the DRIVER_OK status bit. At this point the device is “live”.
>>
>> """
>>
>> So if my understanding is correct, "discovery of virtqueues for the
>> device" implies queue_enable here which is expected to be done before
>> DRIVER_OK. But it doesn't say anything regrading to the behaviour of
>> setting queue ready after DRIVER_OK.
>>
>> I'm not sure it's a real bug or not, may Michael and comment on this.
>>
> Right, input on this topic would be really appreciated.
>
>>>    Especially after queue_reset patches. Is that what you mean?
>>
>> We haven't supported queue_reset yet in Qemu. But it allows to write 1
>> to queue_enable after DRIVER_OK for sure.
>>
> I was not clear, I meant in the emulated device. I'm testing this
> series with the proposal of _F_STATE.
>
>>>>> However this opens a window where the device could start receiving
>>>>> packets in rx queue 0 before it receives the RSS configuration. To avoid
>>>>> that, we will not send vring_enable until all configuration is used by
>>>>> the device.
>>>>>
>>>>> As a first step, run vhost_set_vring_ready for all vhost_net backend after
>>>>> all of them are started (with DRIVER_OK). This code should not affect
>>>>> vdpa.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>    hw/net/vhost_net.c | 17 ++++++++++++-----
>>>>>    1 file changed, 12 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>>>>> index c4eecc6f36..3900599465 100644
>>>>> --- a/hw/net/vhost_net.c
>>>>> +++ b/hw/net/vhost_net.c
>>>>> @@ -399,6 +399,18 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>>>>            } else {
>>>>>                peer = qemu_get_peer(ncs, n->max_queue_pairs);
>>>>>            }
>>>>> +        r = vhost_net_start_one(get_vhost_net(peer), dev);
>>>>> +        if (r < 0) {
>>>>> +            goto err_start;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    for (int j = 0; j < nvhosts; j++) {
>>>>> +        if (j < data_queue_pairs) {
>>>>> +            peer = qemu_get_peer(ncs, j);
>>>>> +        } else {
>>>>> +            peer = qemu_get_peer(ncs, n->max_queue_pairs);
>>>>> +        }
>>>> I fail to understand why we need to change the vhost_net layer? This
>>>> is vhost-vDPA specific, so I wonder if we can limit the changes to e.g
>>>> vhost_vdpa_dev_start()?
>>>>
>>> The vhost-net layer explicitly calls vhost_set_vring_enable before
>>> vhost_dev_start, and this is exactly the behavior we want to avoid.
>>> Even if we make changes to vhost_dev, this change is still needed.
>>
>> Note that the only user of vhost_set_vring_enable() is vhost-user where
>> the semantic is different:
>>
>> It uses that to changes the number of active queues:
>>
>> static int peer_attach(VirtIONet *n, int index)
>>
>>           if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_USER) {
>> =>      vhost_set_vring_enable(nc->peer, 1);
>>       }
>>
>> This is not the semantic of vhost-vDPA that tries to be complaint with
>> virtio-spec. So I'm not sure how it can help here.
>>
> Right, but previous changes use enable callback to delay the enable of
> the datapath virtqueues. I'll try to fit the changes in
> virtio/vhost-vdpa though.


This would make things more complicated. As mentioned above, 
vhost-user's usage of vhost_set_vring_enable() is not spec compliant 
while the vhost-vDPA VHOST_VDPA_SET_VRING_ENALBE tries to be compliant 
with the spec.

If we tries to mix use that it may result confusion for the readers.

Thanks


>
> Thanks!
>
>>> And we want to explicitly enable CVQ first, which "only" vhost_net
>>> knows which is.
>>
>> This should be known by net/vhost-vdpa.c.
>>
>>
>>> To perform that in vhost_vdpa_dev_start would require
>>> quirks, involving one or more of:
>>> * Ignore vq enable calls if the device is not the CVQ one. How to
>>> signal what is the CVQ? Can we trust it will be the last one for all
>>> kind of devices?
>>> * Enable queues that do not belong to the last vhost_dev from the enable call.
>>> * Enable the rest of the queues from the last enable in reverse order.
>>> * Intercalate the "net load" callback between enabling the last
>>> vhost_vdpa device and enabling the rest of devices.
>>> * Add an "enable priority" order?
>>
>> Haven't had time in thinking through, but it would be better if we can
>> limit the changes in vhost-vdpa layer. E.g currently the
>> VHOST_VDPA_SET_VRING_ENABLE is done at vhost_dev_start().
>>
>> Thanks
>>
>>
>>> Thanks!
>>>
>>>> Thanks
>>>>
>>>>>            if (peer->vring_enable) {
>>>>>                /* restore vring enable state */
>>>>> @@ -408,11 +420,6 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>>>>                    goto err_start;
>>>>>                }
>>>>>            }
>>>>> -
>>>>> -        r = vhost_net_start_one(get_vhost_net(peer), dev);
>>>>> -        if (r < 0) {
>>>>> -            goto err_start;
>>>>> -        }
>>>>>        }
>>>>>
>>>>>        return 0;
>>>>> --
>>>>> 2.31.1
>>>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq
  2023-01-16  9:33           ` Eugenio Perez Martin
@ 2023-01-17  5:42             ` Jason Wang
  0 siblings, 0 replies; 76+ messages in thread
From: Jason Wang @ 2023-01-17  5:42 UTC (permalink / raw)
  To: Eugenio Perez Martin, Michael S. Tsirkin
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Stefan Hajnoczi, Parav Pandit


在 2023/1/16 17:33, Eugenio Perez Martin 写道:
> On Mon, Jan 16, 2023 at 6:24 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Mon, Jan 16, 2023 at 11:34:20AM +0800, Jason Wang wrote:
>>> 在 2023/1/13 15:46, Eugenio Perez Martin 写道:
>>>> On Fri, Jan 13, 2023 at 5:25 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>> 在 2023/1/13 01:24, Eugenio Pérez 写道:
>>>>>> A vdpa net device must initialize with SVQ in order to be migratable,
>>>>>> and initialization code verifies conditions.  If the device is not
>>>>>> initialized with the x-svq parameter, it will not expose _F_LOG so vhost
>>>>>> sybsystem will block VM migration from its initialization.
>>>>>>
>>>>>> Next patches change this. Net data VQs will be shadowed only at
>>>>>> migration time and vdpa net devices need to expose _F_LOG as long as it
>>>>>> can go to SVQ.
>>>>>>
>>>>>> Since we don't know that at initialization time but at start, add an
>>>>>> independent blocker at CVQ.
>>>>>>
>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>> ---
>>>>>>     net/vhost-vdpa.c | 35 +++++++++++++++++++++++++++++------
>>>>>>     1 file changed, 29 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>>>>> index 631424d9c4..2ca93e850a 100644
>>>>>> --- a/net/vhost-vdpa.c
>>>>>> +++ b/net/vhost-vdpa.c
>>>>>> @@ -26,12 +26,14 @@
>>>>>>     #include <err.h>
>>>>>>     #include "standard-headers/linux/virtio_net.h"
>>>>>>     #include "monitor/monitor.h"
>>>>>> +#include "migration/blocker.h"
>>>>>>     #include "hw/virtio/vhost.h"
>>>>>>
>>>>>>     /* Todo:need to add the multiqueue support here */
>>>>>>     typedef struct VhostVDPAState {
>>>>>>         NetClientState nc;
>>>>>>         struct vhost_vdpa vhost_vdpa;
>>>>>> +    Error *migration_blocker;
>>>>> Any reason we can't use the mivration_blocker in vhost_dev structure?
>>>>>
>>>>> I believe we don't need to wait until start to know we can't migrate.
>>>>>
>>>> Device migratability also depends on features that the guest acks.
>>>
>>> This sounds a little bit tricky, more below:
>>>
>>>
>>>> For example, if the device does not support ASID it can be migrated as
>>>> long as _F_CVQ is not acked.
>>>
>>> The management may notice a non-consistent behavior in this case. I wonder
>>> if we can simply check the host features.
>>>
> That's right, and I can see how that can be an issue.
>
> However, the check for the ASID is based on queue indexes at the
> moment. If we want to register the blocker at the initialization
> moment the only option I see is to do two features ack & reset cycle:
> one with MQ and another one without MQ.


That's should be fine, or any issue you saw?

Thanks


>
> Would it be more correct to assume the device will assign the right
> ASID only probing one configuration? I don't think so but I'm ok to
> leave the code that way if we agree it is more viable.
>
>>> Thanks
>>
>> Yes the issue is that ack can happen after migration started.
>> I don't think this kind of blocker appearing during migration
>> is currently expected/supported well. Is it?
>>
> In that case the guest cannot DRIVER_OK the device, because the call
> to migrate_add_blocker fails and the error propagates from
> vhost_net_start up to the virtio device.
>
> But I can also see how this is inconvenient and to add a migration
> blocker at initialization can simplify things here. As long as we
> agree on the right way to probe I can send a new version that way for
> sure.
>
> Thanks!
>
>>>> Thanks!
>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>>         VHostNetState *vhost_net;
>>>>>>
>>>>>>         /* Control commands shadow buffers */
>>>>>> @@ -433,9 +435,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>>>>                 g_strerror(errno), errno);
>>>>>>             return -1;
>>>>>>         }
>>>>>> -    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) ||
>>>>>> -        !vhost_vdpa_net_valid_svq_features(v->dev->features, NULL)) {
>>>>>> -        return 0;
>>>>>> +    if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) {
>>>>>> +        error_setg(&s->migration_blocker,
>>>>>> +                   "vdpa device %s does not support ASID",
>>>>>> +                   nc->name);
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +    if (!vhost_vdpa_net_valid_svq_features(v->dev->features,
>>>>>> +                                           &s->migration_blocker)) {
>>>>>> +        goto out;
>>>>>>         }
>>>>>>
>>>>>>         /*
>>>>>> @@ -455,7 +463,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>>>>             }
>>>>>>
>>>>>>             if (group == cvq_group) {
>>>>>> -            return 0;
>>>>>> +            error_setg(&s->migration_blocker,
>>>>>> +                "vdpa %s vq %d group %"PRId64" is the same as cvq group "
>>>>>> +                "%"PRId64, nc->name, i, group, cvq_group);
>>>>>> +            goto out;
>>>>>>             }
>>>>>>         }
>>>>>>
>>>>>> @@ -468,8 +479,15 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>>>>>>         s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
>>>>>>
>>>>>>     out:
>>>>>> -    if (!s->vhost_vdpa.shadow_vqs_enabled) {
>>>>>> -        return 0;
>>>>>> +    if (s->migration_blocker) {
>>>>>> +        Error *errp = NULL;
>>>>>> +        r = migrate_add_blocker(s->migration_blocker, &errp);
>>>>>> +        if (unlikely(r != 0)) {
>>>>>> +            g_clear_pointer(&s->migration_blocker, error_free);
>>>>>> +            error_report_err(errp);
>>>>>> +        }
>>>>>> +
>>>>>> +        return r;
>>>>>>         }
>>>>>>
>>>>>>         s0 = vhost_vdpa_net_first_nc_vdpa(s);
>>>>>> @@ -513,6 +531,11 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
>>>>>>             vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
>>>>>>         }
>>>>>>
>>>>>> +    if (s->migration_blocker) {
>>>>>> +        migrate_del_blocker(s->migration_blocker);
>>>>>> +        g_clear_pointer(&s->migration_blocker, error_free);
>>>>>> +    }
>>>>>> +
>>>>>>         vhost_vdpa_net_client_stop(nc);
>>>>>>     }
>>>>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 04/13] vdpa: rewind at get_base, not set_base
  2023-01-17  4:38           ` Jason Wang
@ 2023-01-17  6:57             ` Eugenio Perez Martin
  0 siblings, 0 replies; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-17  6:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Tue, Jan 17, 2023 at 5:38 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2023/1/16 17:53, Eugenio Perez Martin 写道:
> > On Mon, Jan 16, 2023 at 4:32 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2023/1/13 15:40, Eugenio Perez Martin 写道:
> >>> On Fri, Jan 13, 2023 at 5:10 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On Fri, Jan 13, 2023 at 1:24 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >>>>> At this moment it is only possible to migrate to a vdpa device running
> >>>>> with x-svq=on. As a protective measure, the rewind of the inflight
> >>>>> descriptors was done at the destination. That way if the source sent a
> >>>>> virtqueue with inuse descriptors they are always discarded.
> >>>>>
> >>>>> Since this series allows to migrate also to passthrough devices with no
> >>>>> SVQ, the right thing to do is to rewind at the source so base of vrings
> >>>>> are correct.
> >>>>>
> >>>>> Support for inflight descriptors may be added in the future.
> >>>>>
> >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>> ---
> >>>>>    include/hw/virtio/vhost-backend.h |  4 +++
> >>>>>    hw/virtio/vhost-vdpa.c            | 46 +++++++++++++++++++------------
> >>>>>    hw/virtio/vhost.c                 |  3 ++
> >>>>>    3 files changed, 36 insertions(+), 17 deletions(-)
> >>>>>
> >>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>>> index c5ab49051e..ec3fbae58d 100644
> >>>>> --- a/include/hw/virtio/vhost-backend.h
> >>>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>>> @@ -130,6 +130,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
> >>>>>
> >>>>>    typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
> >>>>>                                           int fd);
> >>>>> +
> >>>>> +typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
> >>>>> +
> >>>>>    typedef struct VhostOps {
> >>>>>        VhostBackendType backend_type;
> >>>>>        vhost_backend_init vhost_backend_init;
> >>>>> @@ -177,6 +180,7 @@ typedef struct VhostOps {
> >>>>>        vhost_get_device_id_op vhost_get_device_id;
> >>>>>        vhost_force_iommu_op vhost_force_iommu;
> >>>>>        vhost_set_config_call_op vhost_set_config_call;
> >>>>> +    vhost_reset_status_op vhost_reset_status;
> >>>>>    } VhostOps;
> >>>>>
> >>>>>    int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> >>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>>>> index 542e003101..28a52ddc78 100644
> >>>>> --- a/hw/virtio/vhost-vdpa.c
> >>>>> +++ b/hw/virtio/vhost-vdpa.c
> >>>>> @@ -1132,14 +1132,23 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >>>>>        if (started) {
> >>>>>            memory_listener_register(&v->listener, &address_space_memory);
> >>>>>            return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> >>>>> -    } else {
> >>>>> -        vhost_vdpa_reset_device(dev);
> >>>>> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> >>>>> -                                   VIRTIO_CONFIG_S_DRIVER);
> >>>>> -        memory_listener_unregister(&v->listener);
> >>>>> +    }
> >>>>>
> >>>>> -        return 0;
> >>>>> +    return 0;
> >>>>> +}
> >>>>> +
> >>>>> +static void vhost_vdpa_reset_status(struct vhost_dev *dev)
> >>>>> +{
> >>>>> +    struct vhost_vdpa *v = dev->opaque;
> >>>>> +
> >>>>> +    if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
> >>>>> +        return;
> >>>>>        }
> >>>>> +
> >>>>> +    vhost_vdpa_reset_device(dev);
> >>>>> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> >>>>> +                                VIRTIO_CONFIG_S_DRIVER);
> >>>>> +    memory_listener_unregister(&v->listener);
> >>>>>    }
> >>>>>
> >>>>>    static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
> >>>>> @@ -1182,18 +1191,7 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev *dev,
> >>>>>                                           struct vhost_vring_state *ring)
> >>>>>    {
> >>>>>        struct vhost_vdpa *v = dev->opaque;
> >>>>> -    VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
> >>>>>
> >>>>> -    /*
> >>>>> -     * vhost-vdpa devices does not support in-flight requests. Set all of them
> >>>>> -     * as available.
> >>>>> -     *
> >>>>> -     * TODO: This is ok for networking, but other kinds of devices might
> >>>>> -     * have problems with these retransmissions.
> >>>>> -     */
> >>>>> -    while (virtqueue_rewind(vq, 1)) {
> >>>>> -        continue;
> >>>>> -    }
> >>>>>        if (v->shadow_vqs_enabled) {
> >>>>>            /*
> >>>>>             * Device vring base was set at device start. SVQ base is handled by
> >>>>> @@ -1212,6 +1210,19 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev,
> >>>>>        int ret;
> >>>>>
> >>>>>        if (v->shadow_vqs_enabled) {
> >>>>> +        VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
> >>>>> +
> >>>>> +        /*
> >>>>> +         * vhost-vdpa devices does not support in-flight requests. Set all of
> >>>>> +         * them as available.
> >>>>> +         *
> >>>>> +         * TODO: This is ok for networking, but other kinds of devices might
> >>>>> +         * have problems with these retransmissions.
> >>>>> +         */
> >>>>> +        while (virtqueue_rewind(vq, 1)) {
> >>>>> +            continue;
> >>>>> +        }
> >>>>> +
> >>>>>            ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
> >>>>>            return 0;
> >>>>>        }
> >>>>> @@ -1326,4 +1337,5 @@ const VhostOps vdpa_ops = {
> >>>>>            .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
> >>>>>            .vhost_force_iommu = vhost_vdpa_force_iommu,
> >>>>>            .vhost_set_config_call = vhost_vdpa_set_config_call,
> >>>>> +        .vhost_reset_status = vhost_vdpa_reset_status,
> >>>> Can we simply use the NetClient stop method here?
> >>>>
> >>> Ouch, I squashed two patches by mistake here.
> >>>
> >>> All the vhost_reset_status part should be independent of this patch,
> >>> and I was especially interested in its feedback. It had this message:
> >>>
> >>>       vdpa: move vhost reset after get vring base
> >>>
> >>>       The function vhost.c:vhost_dev_stop calls vhost operation
> >>>       vhost_dev_start(false). In the case of vdpa it totally reset and wipes
> >>>       the device, making the fetching of the vring base (virtqueue state) totally
> >>>       useless.
> >>>
> >>>       The kernel backend does not use vhost_dev_start vhost op callback, but
> >>>       vhost-user do. A patch to make vhost_user_dev_start more similar to vdpa
> >>>       is desirable, but it can be added on top.
> >>>
> >>> I can resend the series splitting it again but conversation may
> >>> scatter between versions. Would you prefer me to send a new version?
> >>
> >> I think it can be done in next version (after we finalize the discussion
> >> for this version).
> >>
> >>
> >>> Regarding the use of NetClient, it feels weird to call net specific
> >>> functions in VhostOps, doesn't it?
> >>
> >> Basically, I meant, the patch call vhost_reset_status() in
> >> vhost_dev_stop(). But we've already had vhost_dev_start ops where we
> >> implement per backend start/stop logic.
> >>
> >> I think it's better to do things in vhost_dev_start():
> >>
> >> For device that can do suspend, we can do suspend. For other we need to
> >> do reset as a workaround.
> >>
> > If the device implements _F_SUSPEND we can call suspend in
> > vhost_dev_start(false) and fetch the vq base after it. But we cannot
> > call vhost_dev_reset until we get the vq base. If we do it, we will
> > always get zero there.
>
>
> I'm not sure I understand here, that is kind of expected. For the device
> that doesn't support suspend, we can't get base anyhow since we need to
> emulate the stop with reset then we lose all the states.
>

That is totally right.

Just for completion / suggestion, we *could* return 0 if the device
does not support suspend and then return failure (<0) at
veing_get_base, and vhost.c code already tries to emulate it by
fetching information from guest memory if split. But it is not
included in this series and I'm not sure it's a good idea in general.

>
> >
> > If we don't reset the device at vhost_vdpa_dev_start(false) we need to
> > call a proper reset after getting the base, at least in vdpa.
>
>
> This looks racy if we do get base before reset? Device can move the
> last_avail_idx.
>

After the reset the last_avail_idx will always be 0 before another
set_base or driver_ok, no matter what. We must get the base between
suspend and reset.

>
> > So to
> > create a new vhost_op should be the right thing to do, isn't it?
>
>
> So we did:
>
> vhost_dev_stop()
>      hdev->vhost_ops->vhost_dev_start(hdev, false);
>      vhost_virtqueue_stop()
>          vhost_get_vring_base()
>
> I don't see any issue if we do suspend in vhost_dev_stop() in this case?
>
> For the device that doesn't support suspend, we do reset in the stop and
> fail the get_vring_base() then we can use software fallback
> virtio_queue_restore_last_avail_idx()
>
> ?
>

There is no issue there.

The question is: Do we need to reset after getting the base? I think
yes, because the device may think it can use other resources and, in
general, other code at start assumes the device is clean. If we want
to do so, we need to introduce a new callback, different from
vhost_dev_start(hdev, false) since it must run after get_base, not
before.

>
> >
> > Hopefully with a better name than vhost_vdpa_reset_status, that's for sure :).
> >
> > I'm not sure how vhost-user works with this or when it does reset the
> > indexes. My bet is that it never does at the device reinitialization
> > and it trusts VMM calls to vhost_user_set_base but I may be wrong.
>
>
> I think it's more safe to not touch the code path for vhost-user, it may
> connect to various kind of backends some of which might be fragile.
>

I agree.

Thanks!

> Thanks
>
>
> >
> > Thanks!
> >
> >> And if necessary, we can call nc client ops for net specific operations
> >> (if it has any).
> >>
> >> Thanks
> >>
> >>
> >>> At the moment vhost ops is
> >>> specialized in vhost-kernel, vhost-user and vhost-vdpa. If we want to
> >>> make it specific to the kind of device, that makes vhost-vdpa-net too.
> >>>
> >>> Thanks!
> >>>
> >>>
> >>>> Thanks
> >>>>
> >>>>>    };
> >>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >>>>> index eb8c4c378c..a266396576 100644
> >>>>> --- a/hw/virtio/vhost.c
> >>>>> +++ b/hw/virtio/vhost.c
> >>>>> @@ -2049,6 +2049,9 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> >>>>>                                 hdev->vqs + i,
> >>>>>                                 hdev->vq_index + i);
> >>>>>        }
> >>>>> +    if (hdev->vhost_ops->vhost_reset_status) {
> >>>>> +        hdev->vhost_ops->vhost_reset_status(hdev);
> >>>>> +    }
> >>>>>
> >>>>>        if (vhost_dev_has_iommu(hdev)) {
> >>>>>            if (hdev->vhost_ops->vhost_set_iotlb_callback) {
> >>>>> --
> >>>>> 2.31.1
> >>>>>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-01-13  9:00     ` Eugenio Perez Martin
  2023-01-16  6:51       ` Jason Wang
@ 2023-01-17  9:58       ` Dr. David Alan Gilbert
  2023-01-17 10:23         ` Eugenio Perez Martin
  1 sibling, 1 reply; 76+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-17  9:58 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit, Juan Quintela,
	Maxime Coquelin

* Eugenio Perez Martin (eperezma@redhat.com) wrote:
> On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > This allows net to restart the device backend to configure SVQ on it.
> > >
> > > Ideally, these changes should not be net specific. However, the vdpa net
> > > backend is the one with enough knowledge to configure everything because
> > > of some reasons:
> > > * Queues might need to be shadowed or not depending on its kind (control
> > >   vs data).
> > > * Queues need to share the same map translations (iova tree).
> > >
> > > Because of that it is cleaner to restart the whole net backend and
> > > configure again as expected, similar to how vhost-kernel moves between
> > > userspace and passthrough.
> > >
> > > If more kinds of devices need dynamic switching to SVQ we can create a
> > > callback struct like VhostOps and move most of the code there.
> > > VhostOps cannot be reused since all vdpa backend share them, and to
> > > personalize just for networking would be too heavy.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >  net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 84 insertions(+)
> > >
> > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > > index 5d7ad6e4d7..f38532b1df 100644
> > > --- a/net/vhost-vdpa.c
> > > +++ b/net/vhost-vdpa.c
> > > @@ -26,6 +26,8 @@
> > >  #include <err.h>
> > >  #include "standard-headers/linux/virtio_net.h"
> > >  #include "monitor/monitor.h"
> > > +#include "migration/migration.h"
> > > +#include "migration/misc.h"
> > >  #include "migration/blocker.h"
> > >  #include "hw/virtio/vhost.h"
> > >
> > > @@ -33,6 +35,7 @@
> > >  typedef struct VhostVDPAState {
> > >      NetClientState nc;
> > >      struct vhost_vdpa vhost_vdpa;
> > > +    Notifier migration_state;
> > >      Error *migration_blocker;
> > >      VHostNetState *vhost_net;
> > >
> > > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> > >      return DO_UPCAST(VhostVDPAState, nc, nc0);
> > >  }
> > >
> > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> > > +{
> > > +    struct vhost_vdpa *v = &s->vhost_vdpa;
> > > +    VirtIONet *n;
> > > +    VirtIODevice *vdev;
> > > +    int data_queue_pairs, cvq, r;
> > > +    NetClientState *peer;
> > > +
> > > +    /* We are only called on the first data vqs and only if x-svq is not set */
> > > +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> > > +        return;
> > > +    }
> > > +
> > > +    vdev = v->dev->vdev;
> > > +    n = VIRTIO_NET(vdev);
> > > +    if (!n->vhost_started) {
> > > +        return;
> > > +    }
> > > +
> > > +    if (enable) {
> > > +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
> >
> > Do we need to check if the device is started or not here?
> >
> 
> v->vhost_started is checked right above, right?
> 
> > > +    }
> >
> > I'm not sure I understand the reason for vhost_net_stop() after a
> > VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated.
> >
> 
> I think this is really worth exploring, and it would have been clearer
> if I didn't squash the vhost_reset_status commit by mistake :).
> 
> Looking at qemu master vhost.c:vhost_dev_stop:
>     if (hdev->vhost_ops->vhost_dev_start) {
>         hdev->vhost_ops->vhost_dev_start(hdev, false);
>     }
>     if (vrings) {
>         vhost_dev_set_vring_enable(hdev, false);
>     }
>     for (i = 0; i < hdev->nvqs; ++i) {
>         vhost_virtqueue_stop(hdev,
>                              vdev,
>                              hdev->vqs + i,
>                              hdev->vq_index + i);
>     }
> 
> Both vhost-used and vhost-vdpa set_status(0) at
> ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so
> they are not recoverable at vhost_virtqueue_stop->get_vring_base, and
> I think it is too late for vdpa devices to change it. I guess
> vhost-user devices do not lose the state there, but I did not test.
> 
> I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more
> similar to vhost_user_dev_start. We can make
> vhost_vdpa_dev_start(false) to suspend the device instead. But then we
> need to reset it after getting the indexes. That's why I added
> vhost_vdpa_reset_status, but I admit it is neither the cleanest
> approach nor the best name to it.
> 
> Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much.
> 
> > > +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> > > +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> > > +                                  n->max_ncs - n->max_queue_pairs : 0;
> > > +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> > > +
> > > +    peer = s->nc.peer;
> > > +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> > > +        VhostVDPAState *vdpa_state;
> > > +        NetClientState *nc;
> > > +
> > > +        if (i < data_queue_pairs) {
> > > +            nc = qemu_get_peer(peer, i);
> > > +        } else {
> > > +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> > > +        }
> > > +
> > > +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> > > +        vdpa_state->vhost_vdpa.shadow_data = enable;
> > > +
> > > +        if (i < data_queue_pairs) {
> > > +            /* Do not override CVQ shadow_vqs_enabled */
> > > +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> > > +        }
> > > +    }
> > > +
> > > +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
> > > +    if (unlikely(r < 0)) {
> > > +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> > > +    }
> > > +}
> > > +
> > > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
> > > +{
> > > +    MigrationState *migration = data;
> > > +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> > > +                                     migration_state);
> > > +
> > > +    switch (migration->state) {
> > > +    case MIGRATION_STATUS_SETUP:
> > > +        vhost_vdpa_net_log_global_enable(s, true);
> > > +        return;
> > > +
> > > +    case MIGRATION_STATUS_CANCELLING:
> > > +    case MIGRATION_STATUS_CANCELLED:
> > > +    case MIGRATION_STATUS_FAILED:
> > > +        vhost_vdpa_net_log_global_enable(s, false);
> >
> > Do we need to recover here?
> >
> 
> I may be missing something, but the device is fully reset and restored
> in these cases.
> 
> CCing Juan and D. Gilbert, a review would be appreciated to check if
> this covers all the cases.

I'm surprised I'm not seeing an entry for MIGRATION_STATUS_COMPLETED
there.

You might consider:
   if (migration_in_setup(s)) {
     vhost_vdpa_net_log_global_enable(s, true);
   } else if (migration_has_finished(s) || migration_has_failed(s)) {
     vhost_vdpa_net_log_global_enable(s, false);
   }

I'm not too sure what will happen in your world with postcopy;  it's
worth testing, just remember on the source you don't want to be changing
guest memory when you're in the postcopy phase.

Dave

> Thanks!
> 
> 
> > Thanks
> >
> > > +        return;
> > > +    };
> > > +}
> > > +
> > >  static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> > >  {
> > >      struct vhost_vdpa *v = &s->vhost_vdpa;
> > >
> > > +    if (v->feature_log) {
> > > +        add_migration_state_change_notifier(&s->migration_state);
> > > +    }
> > > +
> > >      if (v->shadow_vqs_enabled) {
> > >          v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> > >                                             v->iova_range.last);
> > > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
> > >
> > >      assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> > >
> > > +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> > > +        remove_migration_state_change_notifier(&s->migration_state);
> > > +    }
> > > +
> > >      dev = s->vhost_vdpa.dev;
> > >      if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> > >          g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> > > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> > >      s->vhost_vdpa.device_fd = vdpa_device_fd;
> > >      s->vhost_vdpa.index = queue_pair_index;
> > >      s->always_svq = svq;
> > > +    s->migration_state.notify = vdpa_net_migration_state_notifier;
> > >      s->vhost_vdpa.shadow_vqs_enabled = svq;
> > >      s->vhost_vdpa.iova_range = iova_range;
> > >      s->vhost_vdpa.shadow_data = svq;
> > > --
> > > 2.31.1
> > >
> >
> >
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-01-17  9:58       ` Dr. David Alan Gilbert
@ 2023-01-17 10:23         ` Eugenio Perez Martin
  2023-01-17 12:54           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-01-17 10:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Jason Wang, qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit, Juan Quintela,
	Maxime Coquelin

On Tue, Jan 17, 2023 at 10:58 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Eugenio Perez Martin (eperezma@redhat.com) wrote:
> > On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > This allows net to restart the device backend to configure SVQ on it.
> > > >
> > > > Ideally, these changes should not be net specific. However, the vdpa net
> > > > backend is the one with enough knowledge to configure everything because
> > > > of some reasons:
> > > > * Queues might need to be shadowed or not depending on its kind (control
> > > >   vs data).
> > > > * Queues need to share the same map translations (iova tree).
> > > >
> > > > Because of that it is cleaner to restart the whole net backend and
> > > > configure again as expected, similar to how vhost-kernel moves between
> > > > userspace and passthrough.
> > > >
> > > > If more kinds of devices need dynamic switching to SVQ we can create a
> > > > callback struct like VhostOps and move most of the code there.
> > > > VhostOps cannot be reused since all vdpa backend share them, and to
> > > > personalize just for networking would be too heavy.
> > > >
> > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > ---
> > > >  net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 84 insertions(+)
> > > >
> > > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > > > index 5d7ad6e4d7..f38532b1df 100644
> > > > --- a/net/vhost-vdpa.c
> > > > +++ b/net/vhost-vdpa.c
> > > > @@ -26,6 +26,8 @@
> > > >  #include <err.h>
> > > >  #include "standard-headers/linux/virtio_net.h"
> > > >  #include "monitor/monitor.h"
> > > > +#include "migration/migration.h"
> > > > +#include "migration/misc.h"
> > > >  #include "migration/blocker.h"
> > > >  #include "hw/virtio/vhost.h"
> > > >
> > > > @@ -33,6 +35,7 @@
> > > >  typedef struct VhostVDPAState {
> > > >      NetClientState nc;
> > > >      struct vhost_vdpa vhost_vdpa;
> > > > +    Notifier migration_state;
> > > >      Error *migration_blocker;
> > > >      VHostNetState *vhost_net;
> > > >
> > > > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> > > >      return DO_UPCAST(VhostVDPAState, nc, nc0);
> > > >  }
> > > >
> > > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> > > > +{
> > > > +    struct vhost_vdpa *v = &s->vhost_vdpa;
> > > > +    VirtIONet *n;
> > > > +    VirtIODevice *vdev;
> > > > +    int data_queue_pairs, cvq, r;
> > > > +    NetClientState *peer;
> > > > +
> > > > +    /* We are only called on the first data vqs and only if x-svq is not set */
> > > > +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    vdev = v->dev->vdev;
> > > > +    n = VIRTIO_NET(vdev);
> > > > +    if (!n->vhost_started) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    if (enable) {
> > > > +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
> > >
> > > Do we need to check if the device is started or not here?
> > >
> >
> > v->vhost_started is checked right above, right?
> >
> > > > +    }
> > >
> > > I'm not sure I understand the reason for vhost_net_stop() after a
> > > VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated.
> > >
> >
> > I think this is really worth exploring, and it would have been clearer
> > if I didn't squash the vhost_reset_status commit by mistake :).
> >
> > Looking at qemu master vhost.c:vhost_dev_stop:
> >     if (hdev->vhost_ops->vhost_dev_start) {
> >         hdev->vhost_ops->vhost_dev_start(hdev, false);
> >     }
> >     if (vrings) {
> >         vhost_dev_set_vring_enable(hdev, false);
> >     }
> >     for (i = 0; i < hdev->nvqs; ++i) {
> >         vhost_virtqueue_stop(hdev,
> >                              vdev,
> >                              hdev->vqs + i,
> >                              hdev->vq_index + i);
> >     }
> >
> > Both vhost-used and vhost-vdpa set_status(0) at
> > ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so
> > they are not recoverable at vhost_virtqueue_stop->get_vring_base, and
> > I think it is too late for vdpa devices to change it. I guess
> > vhost-user devices do not lose the state there, but I did not test.
> >
> > I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more
> > similar to vhost_user_dev_start. We can make
> > vhost_vdpa_dev_start(false) to suspend the device instead. But then we
> > need to reset it after getting the indexes. That's why I added
> > vhost_vdpa_reset_status, but I admit it is neither the cleanest
> > approach nor the best name to it.
> >
> > Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much.
> >
> > > > +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> > > > +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> > > > +                                  n->max_ncs - n->max_queue_pairs : 0;
> > > > +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> > > > +
> > > > +    peer = s->nc.peer;
> > > > +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> > > > +        VhostVDPAState *vdpa_state;
> > > > +        NetClientState *nc;
> > > > +
> > > > +        if (i < data_queue_pairs) {
> > > > +            nc = qemu_get_peer(peer, i);
> > > > +        } else {
> > > > +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> > > > +        }
> > > > +
> > > > +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> > > > +        vdpa_state->vhost_vdpa.shadow_data = enable;
> > > > +
> > > > +        if (i < data_queue_pairs) {
> > > > +            /* Do not override CVQ shadow_vqs_enabled */
> > > > +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
> > > > +    if (unlikely(r < 0)) {
> > > > +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> > > > +    }
> > > > +}
> > > > +
> > > > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
> > > > +{
> > > > +    MigrationState *migration = data;
> > > > +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> > > > +                                     migration_state);
> > > > +
> > > > +    switch (migration->state) {
> > > > +    case MIGRATION_STATUS_SETUP:
> > > > +        vhost_vdpa_net_log_global_enable(s, true);
> > > > +        return;
> > > > +
> > > > +    case MIGRATION_STATUS_CANCELLING:
> > > > +    case MIGRATION_STATUS_CANCELLED:
> > > > +    case MIGRATION_STATUS_FAILED:
> > > > +        vhost_vdpa_net_log_global_enable(s, false);
> > >
> > > Do we need to recover here?
> > >
> >
> > I may be missing something, but the device is fully reset and restored
> > in these cases.
> >
> > CCing Juan and D. Gilbert, a review would be appreciated to check if
> > this covers all the cases.
>
> I'm surprised I'm not seeing an entry for MIGRATION_STATUS_COMPLETED
> there.
>
> You might consider:
>    if (migration_in_setup(s)) {
>      vhost_vdpa_net_log_global_enable(s, true);
>    } else if (migration_has_finished(s) || migration_has_failed(s)) {
>      vhost_vdpa_net_log_global_enable(s, false);
>    }
>

Thank you very much for the input, I see this is definitely cleaner
than my proposal.

Just for completion here I need to handle differently has_finished vs
has_failed because of recovery. This is easily achievable from your
snippet so thank you very much.

> I'm not too sure what will happen in your world with postcopy;  it's
> worth testing, just remember on the source you don't want to be changing
> guest memory when you're in the postcopy phase.
>

If I'm not wrong postcopy is forbidden as long as there exists a vdpa
device but I can check it for sure.

Thanks!


> Dave
>
> > Thanks!
> >
> >
> > > Thanks
> > >
> > > > +        return;
> > > > +    };
> > > > +}
> > > > +
> > > >  static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> > > >  {
> > > >      struct vhost_vdpa *v = &s->vhost_vdpa;
> > > >
> > > > +    if (v->feature_log) {
> > > > +        add_migration_state_change_notifier(&s->migration_state);
> > > > +    }
> > > > +
> > > >      if (v->shadow_vqs_enabled) {
> > > >          v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> > > >                                             v->iova_range.last);
> > > > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
> > > >
> > > >      assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> > > >
> > > > +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> > > > +        remove_migration_state_change_notifier(&s->migration_state);
> > > > +    }
> > > > +
> > > >      dev = s->vhost_vdpa.dev;
> > > >      if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> > > >          g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> > > > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> > > >      s->vhost_vdpa.device_fd = vdpa_device_fd;
> > > >      s->vhost_vdpa.index = queue_pair_index;
> > > >      s->always_svq = svq;
> > > > +    s->migration_state.notify = vdpa_net_migration_state_notifier;
> > > >      s->vhost_vdpa.shadow_vqs_enabled = svq;
> > > >      s->vhost_vdpa.iova_range = iova_range;
> > > >      s->vhost_vdpa.shadow_data = svq;
> > > > --
> > > > 2.31.1
> > > >
> > >
> > >
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-01-17 10:23         ` Eugenio Perez Martin
@ 2023-01-17 12:54           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 76+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-17 12:54 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit, Juan Quintela,
	Maxime Coquelin

* Eugenio Perez Martin (eperezma@redhat.com) wrote:
> On Tue, Jan 17, 2023 at 10:58 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Eugenio Perez Martin (eperezma@redhat.com) wrote:
> > > On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > >
> > > > > This allows net to restart the device backend to configure SVQ on it.
> > > > >
> > > > > Ideally, these changes should not be net specific. However, the vdpa net
> > > > > backend is the one with enough knowledge to configure everything because
> > > > > of some reasons:
> > > > > * Queues might need to be shadowed or not depending on its kind (control
> > > > >   vs data).
> > > > > * Queues need to share the same map translations (iova tree).
> > > > >
> > > > > Because of that it is cleaner to restart the whole net backend and
> > > > > configure again as expected, similar to how vhost-kernel moves between
> > > > > userspace and passthrough.
> > > > >
> > > > > If more kinds of devices need dynamic switching to SVQ we can create a
> > > > > callback struct like VhostOps and move most of the code there.
> > > > > VhostOps cannot be reused since all vdpa backend share them, and to
> > > > > personalize just for networking would be too heavy.
> > > > >
> > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > ---
> > > > >  net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 84 insertions(+)
> > > > >
> > > > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > > > > index 5d7ad6e4d7..f38532b1df 100644
> > > > > --- a/net/vhost-vdpa.c
> > > > > +++ b/net/vhost-vdpa.c
> > > > > @@ -26,6 +26,8 @@
> > > > >  #include <err.h>
> > > > >  #include "standard-headers/linux/virtio_net.h"
> > > > >  #include "monitor/monitor.h"
> > > > > +#include "migration/migration.h"
> > > > > +#include "migration/misc.h"
> > > > >  #include "migration/blocker.h"
> > > > >  #include "hw/virtio/vhost.h"
> > > > >
> > > > > @@ -33,6 +35,7 @@
> > > > >  typedef struct VhostVDPAState {
> > > > >      NetClientState nc;
> > > > >      struct vhost_vdpa vhost_vdpa;
> > > > > +    Notifier migration_state;
> > > > >      Error *migration_blocker;
> > > > >      VHostNetState *vhost_net;
> > > > >
> > > > > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> > > > >      return DO_UPCAST(VhostVDPAState, nc, nc0);
> > > > >  }
> > > > >
> > > > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> > > > > +{
> > > > > +    struct vhost_vdpa *v = &s->vhost_vdpa;
> > > > > +    VirtIONet *n;
> > > > > +    VirtIODevice *vdev;
> > > > > +    int data_queue_pairs, cvq, r;
> > > > > +    NetClientState *peer;
> > > > > +
> > > > > +    /* We are only called on the first data vqs and only if x-svq is not set */
> > > > > +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> > > > > +        return;
> > > > > +    }
> > > > > +
> > > > > +    vdev = v->dev->vdev;
> > > > > +    n = VIRTIO_NET(vdev);
> > > > > +    if (!n->vhost_started) {
> > > > > +        return;
> > > > > +    }
> > > > > +
> > > > > +    if (enable) {
> > > > > +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
> > > >
> > > > Do we need to check if the device is started or not here?
> > > >
> > >
> > > v->vhost_started is checked right above, right?
> > >
> > > > > +    }
> > > >
> > > > I'm not sure I understand the reason for vhost_net_stop() after a
> > > > VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated.
> > > >
> > >
> > > I think this is really worth exploring, and it would have been clearer
> > > if I didn't squash the vhost_reset_status commit by mistake :).
> > >
> > > Looking at qemu master vhost.c:vhost_dev_stop:
> > >     if (hdev->vhost_ops->vhost_dev_start) {
> > >         hdev->vhost_ops->vhost_dev_start(hdev, false);
> > >     }
> > >     if (vrings) {
> > >         vhost_dev_set_vring_enable(hdev, false);
> > >     }
> > >     for (i = 0; i < hdev->nvqs; ++i) {
> > >         vhost_virtqueue_stop(hdev,
> > >                              vdev,
> > >                              hdev->vqs + i,
> > >                              hdev->vq_index + i);
> > >     }
> > >
> > > Both vhost-used and vhost-vdpa set_status(0) at
> > > ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so
> > > they are not recoverable at vhost_virtqueue_stop->get_vring_base, and
> > > I think it is too late for vdpa devices to change it. I guess
> > > vhost-user devices do not lose the state there, but I did not test.
> > >
> > > I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more
> > > similar to vhost_user_dev_start. We can make
> > > vhost_vdpa_dev_start(false) to suspend the device instead. But then we
> > > need to reset it after getting the indexes. That's why I added
> > > vhost_vdpa_reset_status, but I admit it is neither the cleanest
> > > approach nor the best name to it.
> > >
> > > Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much.
> > >
> > > > > +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> > > > > +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> > > > > +                                  n->max_ncs - n->max_queue_pairs : 0;
> > > > > +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> > > > > +
> > > > > +    peer = s->nc.peer;
> > > > > +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> > > > > +        VhostVDPAState *vdpa_state;
> > > > > +        NetClientState *nc;
> > > > > +
> > > > > +        if (i < data_queue_pairs) {
> > > > > +            nc = qemu_get_peer(peer, i);
> > > > > +        } else {
> > > > > +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> > > > > +        }
> > > > > +
> > > > > +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> > > > > +        vdpa_state->vhost_vdpa.shadow_data = enable;
> > > > > +
> > > > > +        if (i < data_queue_pairs) {
> > > > > +            /* Do not override CVQ shadow_vqs_enabled */
> > > > > +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > > +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
> > > > > +    if (unlikely(r < 0)) {
> > > > > +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> > > > > +    }
> > > > > +}
> > > > > +
> > > > > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
> > > > > +{
> > > > > +    MigrationState *migration = data;
> > > > > +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> > > > > +                                     migration_state);
> > > > > +
> > > > > +    switch (migration->state) {
> > > > > +    case MIGRATION_STATUS_SETUP:
> > > > > +        vhost_vdpa_net_log_global_enable(s, true);
> > > > > +        return;
> > > > > +
> > > > > +    case MIGRATION_STATUS_CANCELLING:
> > > > > +    case MIGRATION_STATUS_CANCELLED:
> > > > > +    case MIGRATION_STATUS_FAILED:
> > > > > +        vhost_vdpa_net_log_global_enable(s, false);
> > > >
> > > > Do we need to recover here?
> > > >
> > >
> > > I may be missing something, but the device is fully reset and restored
> > > in these cases.
> > >
> > > CCing Juan and D. Gilbert, a review would be appreciated to check if
> > > this covers all the cases.
> >
> > I'm surprised I'm not seeing an entry for MIGRATION_STATUS_COMPLETED
> > there.
> >
> > You might consider:
> >    if (migration_in_setup(s)) {
> >      vhost_vdpa_net_log_global_enable(s, true);
> >    } else if (migration_has_finished(s) || migration_has_failed(s)) {
> >      vhost_vdpa_net_log_global_enable(s, false);
> >    }
> >
> 
> Thank you very much for the input, I see this is definitely cleaner
> than my proposal.
> 
> Just for completion here I need to handle differently has_finished vs
> has_failed because of recovery. This is easily achievable from your
> snippet so thank you very much.
> 
> > I'm not too sure what will happen in your world with postcopy;  it's
> > worth testing, just remember on the source you don't want to be changing
> > guest memory when you're in the postcopy phase.
> >
> 
> If I'm not wrong postcopy is forbidden as long as there exists a vdpa
> device but I can check it for sure.

Ah yes, we don't want the vdpa writing into the destination RAM during
the postcopy phase; I can imagine with shadow-queues you might be able
to come up with a solution to that - but that's a complication for
another time.

Dave
> Thanks!
> 
> 
> > Dave
> >
> > > Thanks!
> > >
> > >
> > > > Thanks
> > > >
> > > > > +        return;
> > > > > +    };
> > > > > +}
> > > > > +
> > > > >  static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> > > > >  {
> > > > >      struct vhost_vdpa *v = &s->vhost_vdpa;
> > > > >
> > > > > +    if (v->feature_log) {
> > > > > +        add_migration_state_change_notifier(&s->migration_state);
> > > > > +    }
> > > > > +
> > > > >      if (v->shadow_vqs_enabled) {
> > > > >          v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> > > > >                                             v->iova_range.last);
> > > > > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
> > > > >
> > > > >      assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> > > > >
> > > > > +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> > > > > +        remove_migration_state_change_notifier(&s->migration_state);
> > > > > +    }
> > > > > +
> > > > >      dev = s->vhost_vdpa.dev;
> > > > >      if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> > > > >          g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> > > > > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> > > > >      s->vhost_vdpa.device_fd = vdpa_device_fd;
> > > > >      s->vhost_vdpa.index = queue_pair_index;
> > > > >      s->always_svq = svq;
> > > > > +    s->migration_state.notify = vdpa_net_migration_state_notifier;
> > > > >      s->vhost_vdpa.shadow_vqs_enabled = svq;
> > > > >      s->vhost_vdpa.iova_range = iova_range;
> > > > >      s->vhost_vdpa.shadow_data = svq;
> > > > > --
> > > > > 2.31.1
> > > > >
> > > >
> > > >
> > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK
  2023-01-13 10:03         ` Eugenio Perez Martin
  2023-01-13 10:37           ` Stefano Garzarella
@ 2023-01-17 15:15           ` Maxime Coquelin
  1 sibling, 0 replies; 76+ messages in thread
From: Maxime Coquelin @ 2023-01-17 15:15 UTC (permalink / raw)
  To: Eugenio Perez Martin, Stefano Garzarella
  Cc: Jason Wang, qemu-devel, si-wei.liu, Liuxiangdong, Zhu Lingshan,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Cornelia Huck, Cindy Lu,
	Eli Cohen, Paolo Bonzini, Michael S. Tsirkin, Stefan Hajnoczi,
	Parav Pandit

Hi Eugenio,

On 1/13/23 11:03, Eugenio Perez Martin wrote:
> On Fri, Jan 13, 2023 at 10:51 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>>
>> On Fri, Jan 13, 2023 at 09:19:00AM +0100, Eugenio Perez Martin wrote:
>>> On Fri, Jan 13, 2023 at 5:36 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>
>>>> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>>>>
>>>>> To restore the device at the destination of a live migration we send the
>>>>> commands through control virtqueue. For a device to read CVQ it must
>>>>> have received the DRIVER_OK status bit.
>>>>
>>>> This probably requires the support from the parent driver and requires
>>>> some changes or fixes in the parent driver.
>>>>
>>>> Some drivers did:
>>>>
>>>> parent_set_status():
>>>> if (DRIVER_OK)
>>>>      if (queue_enable)
>>>>          write queue_enable to the device
>>>>
>>>> Examples are IFCVF or even vp_vdpa at least. MLX5 seems to be fine.
>>>>
>>>
>>> I don't get your point here. No device should start reading CVQ (or
>>> any other VQ) without having received DRIVER_OK.
>>>
>>> Some parent drivers do not support sending the queue enable command
>>> after DRIVER_OK, usually because they clean part of the state like the
>>> set by set_vring_base. Even vdpa_net_sim needs fixes here.
>>>
>>> But my understanding is that it should be supported so I consider it a
>>> bug. Especially after queue_reset patches. Is that what you mean?
>>>
>>>>>
>>>>> However this opens a window where the device could start receiving
>>>>> packets in rx queue 0 before it receives the RSS configuration. To avoid
>>>>> that, we will not send vring_enable until all configuration is used by
>>>>> the device.
>>>>>
>>>>> As a first step, run vhost_set_vring_ready for all vhost_net backend after
>>>>> all of them are started (with DRIVER_OK). This code should not affect
>>>>> vdpa.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>   hw/net/vhost_net.c | 17 ++++++++++++-----
>>>>>   1 file changed, 12 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>>>>> index c4eecc6f36..3900599465 100644
>>>>> --- a/hw/net/vhost_net.c
>>>>> +++ b/hw/net/vhost_net.c
>>>>> @@ -399,6 +399,18 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>>>>           } else {
>>>>>               peer = qemu_get_peer(ncs, n->max_queue_pairs);
>>>>>           }
>>>>> +        r = vhost_net_start_one(get_vhost_net(peer), dev);
>>>>> +        if (r < 0) {
>>>>> +            goto err_start;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    for (int j = 0; j < nvhosts; j++) {
>>>>> +        if (j < data_queue_pairs) {
>>>>> +            peer = qemu_get_peer(ncs, j);
>>>>> +        } else {
>>>>> +            peer = qemu_get_peer(ncs, n->max_queue_pairs);
>>>>> +        }
>>>>
>>>> I fail to understand why we need to change the vhost_net layer? This
>>>> is vhost-vDPA specific, so I wonder if we can limit the changes to e.g
>>>> vhost_vdpa_dev_start()?
>>>>
>>>
>>> The vhost-net layer explicitly calls vhost_set_vring_enable before
>>> vhost_dev_start, and this is exactly the behavior we want to avoid.
>>> Even if we make changes to vhost_dev, this change is still needed.
>>
>> I'm working on something similar since I'd like to re-work the following
>> commit we merged just before 7.2 release:
>>       4daa5054c5 vhost: enable vrings in vhost_dev_start() for vhost-user
>>       devices
>>
>> vhost-net wasn't the only one who enabled vrings independently, but it
>> was easy enough for others devices to avoid it and enable them in
>> vhost_dev_start().
>>
>> Do you think can we avoid in some way this special behaviour of
>> vhost-net and enable the vrings in vhost_dev_start?
>>
> 
> Actually looking forward to it :). If that gets merged before this
> series, I think we could drop this patch.
> 
> If I'm not wrong the enable/disable dance is used just by vhost-user
> at the moment.
> 
> Maxime, could you give us some hints about the tests to use to check
> that changes do not introduce regressions in vhost-user?

You can use DPDK's testpmd [0] tool with Vhost PMD, e.g.:

#single queue pair
# dpdk-testpmd -l <CORE IDs> --no-pci 
--vdev=net_vhost0,iface=/tmp/vhost-user1 -- -i

#multiqueue
# dpdk-testpmd -l <CORE IDs> --no-pci 
--vdev=net_vhost0,iface=/tmp/vhost-user1,queues=4 -- -i --rxq=4 --txq=4

[0]: https://doc.dpdk.org/guides/testpmd_app_ug/index.html

Maxime

> Thanks!
> 
>> Thanks,
>> Stefano
>>
>>>
>>> And we want to explicitly enable CVQ first, which "only" vhost_net
>>> knows which is. To perform that in vhost_vdpa_dev_start would require
>>> quirks, involving one or more of:
>>> * Ignore vq enable calls if the device is not the CVQ one. How to
>>> signal what is the CVQ? Can we trust it will be the last one for all
>>> kind of devices?
>>> * Enable queues that do not belong to the last vhost_dev from the enable call.
>>> * Enable the rest of the queues from the last enable in reverse order.
>>> * Intercalate the "net load" callback between enabling the last
>>> vhost_vdpa device and enabling the rest of devices.
>>> * Add an "enable priority" order?
>>>
>>> Thanks!
>>>
>>>> Thanks
>>>>
>>>>>
>>>>>           if (peer->vring_enable) {
>>>>>               /* restore vring enable state */
>>>>> @@ -408,11 +420,6 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>>>>                   goto err_start;
>>>>>               }
>>>>>           }
>>>>> -
>>>>> -        r = vhost_net_start_one(get_vhost_net(peer), dev);
>>>>> -        if (r < 0) {
>>>>> -            goto err_start;
>>>>> -        }
>>>>>       }
>>>>>
>>>>>       return 0;
>>>>> --
>>>>> 2.31.1
>>>>>
>>>>
>>>
>>
> 



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-01-13  9:06         ` Eugenio Perez Martin
  2023-01-16  7:02           ` Jason Wang
@ 2023-02-02  0:56           ` Si-Wei Liu
  2023-02-02 16:53             ` Eugenio Perez Martin
  1 sibling, 1 reply; 76+ messages in thread
From: Si-Wei Liu @ 2023-02-02  0:56 UTC (permalink / raw)
  To: Eugenio Perez Martin, Jason Wang
  Cc: Zhu, Lingshan, qemu-devel, Liuxiangdong, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit



On 1/13/2023 1:06 AM, Eugenio Perez Martin wrote:
> On Fri, Jan 13, 2023 at 4:39 AM Jason Wang <jasowang@redhat.com> wrote:
>> On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
>>>
>>>
>>> On 1/13/2023 10:31 AM, Jason Wang wrote:
>>>> On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>>>> Spuriously kick the destination device's queue so it knows in case there
>>>>> are new descriptors.
>>>>>
>>>>> RFC: This is somehow a gray area. The guest may have placed descriptors
>>>>> in a virtqueue but not kicked it, so it might be surprised if the device
>>>>> starts processing it.
>>>> So I think this is kind of the work of the vDPA parent. For the parent
>>>> that needs this trick, we should do it in the parent driver.
>>> Agree, it looks easier implementing this in parent driver,
>>> I can implement it in ifcvf set_vq_ready right now
>> Great, but please check whether or not it is really needed.
>>
>> Some device implementation could check the available descriptions
>> after DRIVER_OK without waiting for a kick.
>>
> So IIUC we can entirely drop this from the series (and I hope we can).
> But then, what with the devices that does *not* check for them?
I wonder how the kick can be missed from the first place. Supposedly the 
moment when vhost_dev_stop() calls .suspend() into vdpa driver, the 
vcpus already stopped running (vm_running = false) and all pending kicks 
are delivered through vhost-vdpa's host notifiers or mapped doorbell 
already then device won't get new ones. If the device intends to 
purposely ignore (note: this could be a device bug) pending kicks during 
.suspend(), then consequently it should check available descriptors 
after reaching driver_ok to process outstanding descriptors, making up 
for the missing kick. If the vdpa driver doesn't support .suspend(), 
then it should flush the work before .reset() - vhost-scsi does it this 
way.  Or otherwise I think it's the norm (right thing to do) device 
should process pending kicks before guest memory is to be unmapped at 
the late game of vhost_dev_stop(). Is there any case kicks may be missing?

-Siwei


>
> If we drop it it seems to me we must mandate devices to check for
> descriptors at queue_enable. The queue could stall if not, isn't it?
>
> Thanks!
>
>> Thanks
>>
>>> Thanks
>>> Zhu Lingshan
>>>> Thanks
>>>>
>>>>> However, that information is not in the migration stream and it should
>>>>> be an edge case anyhow, being resilient to parallel notifications from
>>>>> the guest.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>    hw/virtio/vhost-vdpa.c | 5 +++++
>>>>>    1 file changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>> index 40b7e8706a..dff94355dd 100644
>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>> @@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
>>>>>        }
>>>>>        trace_vhost_vdpa_set_vring_ready(dev);
>>>>>        for (i = 0; i < dev->nvqs; ++i) {
>>>>> +        VirtQueue *vq;
>>>>>            struct vhost_vring_state state = {
>>>>>                .index = dev->vq_index + i,
>>>>>                .num = 1,
>>>>>            };
>>>>>            vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
>>>>> +
>>>>> +        /* Preemptive kick */
>>>>> +        vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
>>>>> +        event_notifier_set(virtio_queue_get_host_notifier(vq));
>>>>>        }
>>>>>        return 0;
>>>>>    }
>>>>> --
>>>>> 2.31.1
>>>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration
  2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
                   ` (12 preceding siblings ...)
  2023-01-12 17:24 ` [RFC v2 13/13] vdpa: Conditionally expose _F_LOG in vhost_net devices Eugenio Pérez
@ 2023-02-02  1:00 ` Si-Wei Liu
  2023-02-02 11:27   ` Eugenio Perez Martin
  13 siblings, 1 reply; 76+ messages in thread
From: Si-Wei Liu @ 2023-02-02  1:00 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit



On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
> It's possible to migrate vdpa net devices if they are shadowed from the
>
> start.  But to always shadow the dataplane is effectively break its host
>
> passthrough, so its not convenient in vDPA scenarios.
>
>
>
> This series enables dynamically switching to shadow mode only at
>
> migration time.  This allow full data virtqueues passthrough all the
>
> time qemu is not migrating.
>
>
>
> Successfully tested with vdpa_sim_net (but it needs some patches, I
>
> will send them soon) and qemu emulated device with vp_vdpa with
>
> some restrictions:
>
> * No CVQ.
>
> * VIRTIO_RING_F_STATE patches.
What are these patches (I'm not sure I follow VIRTIO_RING_F_STATE, is it 
a new feature that other vdpa driver would need for live migration)?

-Siwei

>
> * Expose _F_SUSPEND, but ignore it and suspend on ring state fetch like
>
>    DPDK.
>
>
>
> Comments are welcome, especially in the patcheswith RFC in the message.
>
>
>
> v2:
>
> - Use a migration listener instead of a memory listener to know when
>
>    the migration starts.
>
> - Add stuff not picked with ASID patches, like enable rings after
>
>    driver_ok
>
> - Add rewinding on the migration src, not in dst
>
> - v1 at https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01664.html
>
>
>
> Eugenio Pérez (13):
>
>    vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
>
>    vdpa net: move iova tree creation from init to start
>
>    vdpa: copy cvq shadow_data from data vqs, not from x-svq
>
>    vdpa: rewind at get_base, not set_base
>
>    vdpa net: add migration blocker if cannot migrate cvq
>
>    vhost: delay set_vring_ready after DRIVER_OK
>
>    vdpa: delay set_vring_ready after DRIVER_OK
>
>    vdpa: Negotiate _F_SUSPEND feature
>
>    vdpa: add feature_log parameter to vhost_vdpa
>
>    vdpa net: allow VHOST_F_LOG_ALL
>
>    vdpa: add vdpa net migration state notifier
>
>    vdpa: preemptive kick at enable
>
>    vdpa: Conditionally expose _F_LOG in vhost_net devices
>
>
>
>   include/hw/virtio/vhost-backend.h |   4 +
>
>   include/hw/virtio/vhost-vdpa.h    |   1 +
>
>   hw/net/vhost_net.c                |  25 ++-
>
>   hw/virtio/vhost-vdpa.c            |  64 +++++---
>
>   hw/virtio/vhost.c                 |   3 +
>
>   net/vhost-vdpa.c                  | 247 +++++++++++++++++++++++++-----
>
>   6 files changed, 278 insertions(+), 66 deletions(-)
>
>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-01-12 17:24 ` [RFC v2 11/13] vdpa: add vdpa net migration state notifier Eugenio Pérez
  2023-01-13  4:54   ` Jason Wang
@ 2023-02-02  1:52   ` Si-Wei Liu
  2023-02-02 15:28     ` Eugenio Perez Martin
  2023-02-12 14:31     ` Eli Cohen
  1 sibling, 2 replies; 76+ messages in thread
From: Si-Wei Liu @ 2023-02-02  1:52 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit



On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
> This allows net to restart the device backend to configure SVQ on it.
>
> Ideally, these changes should not be net specific. However, the vdpa net
> backend is the one with enough knowledge to configure everything because
> of some reasons:
> * Queues might need to be shadowed or not depending on its kind (control
>    vs data).
> * Queues need to share the same map translations (iova tree).
>
> Because of that it is cleaner to restart the whole net backend and
> configure again as expected, similar to how vhost-kernel moves between
> userspace and passthrough.
>
> If more kinds of devices need dynamic switching to SVQ we can create a
> callback struct like VhostOps and move most of the code there.
> VhostOps cannot be reused since all vdpa backend share them, and to
> personalize just for networking would be too heavy.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 84 insertions(+)
>
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index 5d7ad6e4d7..f38532b1df 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -26,6 +26,8 @@
>   #include <err.h>
>   #include "standard-headers/linux/virtio_net.h"
>   #include "monitor/monitor.h"
> +#include "migration/migration.h"
> +#include "migration/misc.h"
>   #include "migration/blocker.h"
>   #include "hw/virtio/vhost.h"
>   
> @@ -33,6 +35,7 @@
>   typedef struct VhostVDPAState {
>       NetClientState nc;
>       struct vhost_vdpa vhost_vdpa;
> +    Notifier migration_state;
>       Error *migration_blocker;
>       VHostNetState *vhost_net;
>   
> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
>       return DO_UPCAST(VhostVDPAState, nc, nc0);
>   }
>   
> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> +{
> +    struct vhost_vdpa *v = &s->vhost_vdpa;
> +    VirtIONet *n;
> +    VirtIODevice *vdev;
> +    int data_queue_pairs, cvq, r;
> +    NetClientState *peer;
> +
> +    /* We are only called on the first data vqs and only if x-svq is not set */
> +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> +        return;
> +    }
> +
> +    vdev = v->dev->vdev;
> +    n = VIRTIO_NET(vdev);
> +    if (!n->vhost_started) {
> +        return;
> +    }
> +
> +    if (enable) {
> +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
> +    }
> +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> +                                  n->max_ncs - n->max_queue_pairs : 0;
> +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> +
> +    peer = s->nc.peer;
> +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> +        VhostVDPAState *vdpa_state;
> +        NetClientState *nc;
> +
> +        if (i < data_queue_pairs) {
> +            nc = qemu_get_peer(peer, i);
> +        } else {
> +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> +        }
> +
> +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> +        vdpa_state->vhost_vdpa.shadow_data = enable;
> +
> +        if (i < data_queue_pairs) {
> +            /* Do not override CVQ shadow_vqs_enabled */
> +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> +        }
> +    }
> +
> +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
As the first revision, this method (vhost_net_stop followed by 
vhost_net_start) should be fine for software vhost-vdpa backend for e.g. 
vp_vdpa and vdpa_sim_net. However, I would like to get your attention 
that this method implies substantial blackout time for mode switching on 
real hardware - get a full cycle of device reset of getting memory 
mappings torn down, unpin & repin same set of pages, and set up new 
mapping would take very significant amount of time, especially for a 
large VM. Maybe we can do:

1) replace reset with the RESUME feature that was just added to the 
vhost-vdpa ioctls in kernel
2) add new vdpa ioctls to allow iova range rebound to new virtual 
address for QEMU's shadow vq or back to device's vq
3) use a light-weighted sequence of suspend+rebind+resume to switch mode 
on the fly instead of getting through the whole reset+restart cycle

I suspect the same idea could even be used to address high live 
migration downtime seen on hardware vdpa device. What do you think?

Thanks,
-Siwei

> +    if (unlikely(r < 0)) {
> +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> +    }
> +}
> +
> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *migration = data;
> +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> +                                     migration_state);
> +
> +    switch (migration->state) {
> +    case MIGRATION_STATUS_SETUP:
> +        vhost_vdpa_net_log_global_enable(s, true);
> +        return;
> +
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_FAILED:
> +        vhost_vdpa_net_log_global_enable(s, false);
> +        return;
> +    };
> +}
> +
>   static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
>   {
>       struct vhost_vdpa *v = &s->vhost_vdpa;
>   
> +    if (v->feature_log) {
> +        add_migration_state_change_notifier(&s->migration_state);
> +    }
> +
>       if (v->shadow_vqs_enabled) {
>           v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>                                              v->iova_range.last);
> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
>   
>       assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
>   
> +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> +        remove_migration_state_change_notifier(&s->migration_state);
> +    }
> +
>       dev = s->vhost_vdpa.dev;
>       if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
>           g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>       s->vhost_vdpa.device_fd = vdpa_device_fd;
>       s->vhost_vdpa.index = queue_pair_index;
>       s->always_svq = svq;
> +    s->migration_state.notify = vdpa_net_migration_state_notifier;
>       s->vhost_vdpa.shadow_vqs_enabled = svq;
>       s->vhost_vdpa.iova_range = iova_range;
>       s->vhost_vdpa.shadow_data = svq;



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration
  2023-02-02  1:00 ` [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Si-Wei Liu
@ 2023-02-02 11:27   ` Eugenio Perez Martin
  2023-02-03  5:08     ` Si-Wei Liu
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-02-02 11:27 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

On Thu, Feb 2, 2023 at 2:00 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
> > It's possible to migrate vdpa net devices if they are shadowed from the
> >
> > start.  But to always shadow the dataplane is effectively break its host
> >
> > passthrough, so its not convenient in vDPA scenarios.
> >
> >
> >
> > This series enables dynamically switching to shadow mode only at
> >
> > migration time.  This allow full data virtqueues passthrough all the
> >
> > time qemu is not migrating.
> >
> >
> >
> > Successfully tested with vdpa_sim_net (but it needs some patches, I
> >
> > will send them soon) and qemu emulated device with vp_vdpa with
> >
> > some restrictions:
> >
> > * No CVQ.
> >
> > * VIRTIO_RING_F_STATE patches.
> What are these patches (I'm not sure I follow VIRTIO_RING_F_STATE, is it
> a new feature that other vdpa driver would need for live migration)?
>

Not really,

Since vp_vdpa wraps a virtio-net-pci driver to give it vdpa
capabilities it needs a virtio in-band method to set and fetch the
virtqueue state. Jason sent a proposal some time ago [1], and I
implemented it in qemu's virtio emulated device.

I can send them as a RFC but I didn't worry about making it pretty,
nor I think they should be merged at the moment. vdpa parent drivers
should follow vdpa_sim changes.

Thanks!

[1] https://lists.oasis-open.org/archives/virtio-comment/202103/msg00036.html

> -Siwei
>
> >
> > * Expose _F_SUSPEND, but ignore it and suspend on ring state fetch like
> >
> >    DPDK.
> >
> >
> >
> > Comments are welcome, especially in the patcheswith RFC in the message.
> >
> >
> >
> > v2:
> >
> > - Use a migration listener instead of a memory listener to know when
> >
> >    the migration starts.
> >
> > - Add stuff not picked with ASID patches, like enable rings after
> >
> >    driver_ok
> >
> > - Add rewinding on the migration src, not in dst
> >
> > - v1 at https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01664.html
> >
> >
> >
> > Eugenio Pérez (13):
> >
> >    vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
> >
> >    vdpa net: move iova tree creation from init to start
> >
> >    vdpa: copy cvq shadow_data from data vqs, not from x-svq
> >
> >    vdpa: rewind at get_base, not set_base
> >
> >    vdpa net: add migration blocker if cannot migrate cvq
> >
> >    vhost: delay set_vring_ready after DRIVER_OK
> >
> >    vdpa: delay set_vring_ready after DRIVER_OK
> >
> >    vdpa: Negotiate _F_SUSPEND feature
> >
> >    vdpa: add feature_log parameter to vhost_vdpa
> >
> >    vdpa net: allow VHOST_F_LOG_ALL
> >
> >    vdpa: add vdpa net migration state notifier
> >
> >    vdpa: preemptive kick at enable
> >
> >    vdpa: Conditionally expose _F_LOG in vhost_net devices
> >
> >
> >
> >   include/hw/virtio/vhost-backend.h |   4 +
> >
> >   include/hw/virtio/vhost-vdpa.h    |   1 +
> >
> >   hw/net/vhost_net.c                |  25 ++-
> >
> >   hw/virtio/vhost-vdpa.c            |  64 +++++---
> >
> >   hw/virtio/vhost.c                 |   3 +
> >
> >   net/vhost-vdpa.c                  | 247 +++++++++++++++++++++++++-----
> >
> >   6 files changed, 278 insertions(+), 66 deletions(-)
> >
> >
> >
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-02-02  1:52   ` Si-Wei Liu
@ 2023-02-02 15:28     ` Eugenio Perez Martin
  2023-02-04  2:03       ` Si-Wei Liu
  2023-02-12 14:31     ` Eli Cohen
  1 sibling, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-02-02 15:28 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit

On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
> > This allows net to restart the device backend to configure SVQ on it.
> >
> > Ideally, these changes should not be net specific. However, the vdpa net
> > backend is the one with enough knowledge to configure everything because
> > of some reasons:
> > * Queues might need to be shadowed or not depending on its kind (control
> >    vs data).
> > * Queues need to share the same map translations (iova tree).
> >
> > Because of that it is cleaner to restart the whole net backend and
> > configure again as expected, similar to how vhost-kernel moves between
> > userspace and passthrough.
> >
> > If more kinds of devices need dynamic switching to SVQ we can create a
> > callback struct like VhostOps and move most of the code there.
> > VhostOps cannot be reused since all vdpa backend share them, and to
> > personalize just for networking would be too heavy.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 84 insertions(+)
> >
> > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > index 5d7ad6e4d7..f38532b1df 100644
> > --- a/net/vhost-vdpa.c
> > +++ b/net/vhost-vdpa.c
> > @@ -26,6 +26,8 @@
> >   #include <err.h>
> >   #include "standard-headers/linux/virtio_net.h"
> >   #include "monitor/monitor.h"
> > +#include "migration/migration.h"
> > +#include "migration/misc.h"
> >   #include "migration/blocker.h"
> >   #include "hw/virtio/vhost.h"
> >
> > @@ -33,6 +35,7 @@
> >   typedef struct VhostVDPAState {
> >       NetClientState nc;
> >       struct vhost_vdpa vhost_vdpa;
> > +    Notifier migration_state;
> >       Error *migration_blocker;
> >       VHostNetState *vhost_net;
> >
> > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> >       return DO_UPCAST(VhostVDPAState, nc, nc0);
> >   }
> >
> > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> > +{
> > +    struct vhost_vdpa *v = &s->vhost_vdpa;
> > +    VirtIONet *n;
> > +    VirtIODevice *vdev;
> > +    int data_queue_pairs, cvq, r;
> > +    NetClientState *peer;
> > +
> > +    /* We are only called on the first data vqs and only if x-svq is not set */
> > +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> > +        return;
> > +    }
> > +
> > +    vdev = v->dev->vdev;
> > +    n = VIRTIO_NET(vdev);
> > +    if (!n->vhost_started) {
> > +        return;
> > +    }
> > +
> > +    if (enable) {
> > +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
> > +    }
> > +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> > +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> > +                                  n->max_ncs - n->max_queue_pairs : 0;
> > +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> > +
> > +    peer = s->nc.peer;
> > +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> > +        VhostVDPAState *vdpa_state;
> > +        NetClientState *nc;
> > +
> > +        if (i < data_queue_pairs) {
> > +            nc = qemu_get_peer(peer, i);
> > +        } else {
> > +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> > +        }
> > +
> > +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> > +        vdpa_state->vhost_vdpa.shadow_data = enable;
> > +
> > +        if (i < data_queue_pairs) {
> > +            /* Do not override CVQ shadow_vqs_enabled */
> > +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> > +        }
> > +    }
> > +
> > +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
> As the first revision, this method (vhost_net_stop followed by
> vhost_net_start) should be fine for software vhost-vdpa backend for e.g.
> vp_vdpa and vdpa_sim_net. However, I would like to get your attention
> that this method implies substantial blackout time for mode switching on
> real hardware - get a full cycle of device reset of getting memory
> mappings torn down, unpin & repin same set of pages, and set up new
> mapping would take very significant amount of time, especially for a
> large VM. Maybe we can do:
>

Right, I think this is something that deserves optimization in the future.

Note that we must replace the mappings anyway, with all passthrough
queues stopped. This is because SVQ vrings are not in the guest space.
The pin can be skipped though, I think that's a low hand fruit here.

If any, we can track guest's IOVA and add SVQ vrings in a hole. If
guest's IOVA layout changes, we can translate it then to a new
location. That way we only need one map operation in the worst case.
I'm omitting the lookup time here, but I still should be worth it.

But as you mention I think it is not worth complicating this series
and we can think about it on top. We can start building it on top of
your suggestions for sure.

> 1) replace reset with the RESUME feature that was just added to the
> vhost-vdpa ioctls in kernel

We cannot change vring addresses just with a SUSPEND / RESUME.

We could do it with the VIRTIO_F_RING_RESET feature though. Would it
be advantageous to the device?

> 2) add new vdpa ioctls to allow iova range rebound to new virtual
> address for QEMU's shadow vq or back to device's vq

Actually, if the device supports ASID we can allocate ASID 1 to that
purpose. At this moment only CVQ vrings and control buffers are there
when the device is passthrough.

But this doesn't solve the problem if we need to send all SVQ
translation to the device on-chip IOMMU, doesn't it? We must clear all
of it and send the new one to the device anyway.

> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode
> on the fly instead of getting through the whole reset+restart cycle
>

I think this is the same as 1, isn't it?

> I suspect the same idea could even be used to address high live
> migration downtime seen on hardware vdpa device. What do you think?
>

I think this is a great start for sure! Some questions:
a) Is the time on reprogramming on-chip IOMMU comparable to program
regular IOMMU? If it is the case it should be easier to find vdpa
devices with support for _F_RESET soon.
b) Not to merge on master, but it is possible to add an artificial
delay on vdpa_sim that simulates the properties of the delay of IOMMU?
In that line, have you observed if it is linear with the size of the
memory, with the number of maps, other factors..?

Thanks!

> Thanks,
> -Siwei
>
> > +    if (unlikely(r < 0)) {
> > +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> > +    }
> > +}
> > +
> > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
> > +{
> > +    MigrationState *migration = data;
> > +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> > +                                     migration_state);
> > +
> > +    switch (migration->state) {
> > +    case MIGRATION_STATUS_SETUP:
> > +        vhost_vdpa_net_log_global_enable(s, true);
> > +        return;
> > +
> > +    case MIGRATION_STATUS_CANCELLING:
> > +    case MIGRATION_STATUS_CANCELLED:
> > +    case MIGRATION_STATUS_FAILED:
> > +        vhost_vdpa_net_log_global_enable(s, false);
> > +        return;
> > +    };
> > +}
> > +
> >   static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> >   {
> >       struct vhost_vdpa *v = &s->vhost_vdpa;
> >
> > +    if (v->feature_log) {
> > +        add_migration_state_change_notifier(&s->migration_state);
> > +    }
> > +
> >       if (v->shadow_vqs_enabled) {
> >           v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >                                              v->iova_range.last);
> > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
> >
> >       assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> >
> > +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> > +        remove_migration_state_change_notifier(&s->migration_state);
> > +    }
> > +
> >       dev = s->vhost_vdpa.dev;
> >       if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> >           g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> >       s->vhost_vdpa.device_fd = vdpa_device_fd;
> >       s->vhost_vdpa.index = queue_pair_index;
> >       s->always_svq = svq;
> > +    s->migration_state.notify = vdpa_net_migration_state_notifier;
> >       s->vhost_vdpa.shadow_vqs_enabled = svq;
> >       s->vhost_vdpa.iova_range = iova_range;
> >       s->vhost_vdpa.shadow_data = svq;
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-02-02  0:56           ` Si-Wei Liu
@ 2023-02-02 16:53             ` Eugenio Perez Martin
  2023-02-04 11:04               ` Si-Wei Liu
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-02-02 16:53 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jason Wang, Zhu, Lingshan, qemu-devel, Liuxiangdong,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Thu, Feb 2, 2023 at 1:57 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 1/13/2023 1:06 AM, Eugenio Perez Martin wrote:
> > On Fri, Jan 13, 2023 at 4:39 AM Jason Wang <jasowang@redhat.com> wrote:
> >> On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
> >>>
> >>>
> >>> On 1/13/2023 10:31 AM, Jason Wang wrote:
> >>>> On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >>>>> Spuriously kick the destination device's queue so it knows in case there
> >>>>> are new descriptors.
> >>>>>
> >>>>> RFC: This is somehow a gray area. The guest may have placed descriptors
> >>>>> in a virtqueue but not kicked it, so it might be surprised if the device
> >>>>> starts processing it.
> >>>> So I think this is kind of the work of the vDPA parent. For the parent
> >>>> that needs this trick, we should do it in the parent driver.
> >>> Agree, it looks easier implementing this in parent driver,
> >>> I can implement it in ifcvf set_vq_ready right now
> >> Great, but please check whether or not it is really needed.
> >>
> >> Some device implementation could check the available descriptions
> >> after DRIVER_OK without waiting for a kick.
> >>
> > So IIUC we can entirely drop this from the series (and I hope we can).
> > But then, what with the devices that does *not* check for them?
> I wonder how the kick can be missed from the first place. Supposedly the
> moment when vhost_dev_stop() calls .suspend() into vdpa driver, the
> vcpus already stopped running (vm_running = false) and all pending kicks
> are delivered through vhost-vdpa's host notifiers or mapped doorbell
> already then device won't get new ones.

I'm thinking now in cases like the net rx queue.

When the guest starts it fills it and kicks the device. Let's say
avail_idx is 255.

Following the qemu emulated virtio net,
hw/virtio/virtio.c:virtqueue_split_pop will stash shadow_avail_idx =
255, and it will not check it again until it is out of rx descriptors.

Now the NIC fills N < 255 receive buffers, and VMM migrates. Will the
destination device check rx avail idx even if it has not received any
kick? (here could be at startup or when it needs to receive a packet).
- If the answer is yes, and it will be a bug not to check it, then we
can drop this patch. We're covered even if there is a possibility of
losing a kick in the source.
- If the answer is that it is not mandatory, we need to solve it
somehow. To me, the best way is to spuriously kick as we don't need
changes in the device, all we need is here. A new feature flag
_F_CHECK_AVAIL_ON_STARTUP or equivalent would work the same, but I
think it complicates everything more.

For tx the device should suspend "immediately", so it may receive a
kick, fetch avail_idx with M pending descriptors, transmit P < M and
then receive the suspend. If we don't want to wait indefinitely, the
device should stop processing so there are still pending requests in
the queue for the destination to send. So the case now is the same as
rx, even if the source device actually receives the kick.

Having said that, I didn't check if any code drains the vhost host
notifier. Or, as mentioned in the meeting, check that HW cannot
reorder kick and suspend call.

> If the device intends to
> purposely ignore (note: this could be a device bug) pending kicks during
> .suspend(), then consequently it should check available descriptors
> after reaching driver_ok to process outstanding descriptors, making up
> for the missing kick. If the vdpa driver doesn't support .suspend(),
> then it should flush the work before .reset() - vhost-scsi does it this
> way.  Or otherwise I think it's the norm (right thing to do) device
> should process pending kicks before guest memory is to be unmapped at
> the late game of vhost_dev_stop(). Is there any case kicks may be missing?
>

So process pending kicks means to drain all tx and rx descriptors?
That would be a solution, but then we don't need virtqueue_state at
all as we might simply recover it from guest's vring avail_idx.

Thanks!

> -Siwei
>
>
> >
> > If we drop it it seems to me we must mandate devices to check for
> > descriptors at queue_enable. The queue could stall if not, isn't it?
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>> Thanks
> >>> Zhu Lingshan
> >>>> Thanks
> >>>>
> >>>>> However, that information is not in the migration stream and it should
> >>>>> be an edge case anyhow, being resilient to parallel notifications from
> >>>>> the guest.
> >>>>>
> >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>> ---
> >>>>>    hw/virtio/vhost-vdpa.c | 5 +++++
> >>>>>    1 file changed, 5 insertions(+)
> >>>>>
> >>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>>>> index 40b7e8706a..dff94355dd 100644
> >>>>> --- a/hw/virtio/vhost-vdpa.c
> >>>>> +++ b/hw/virtio/vhost-vdpa.c
> >>>>> @@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
> >>>>>        }
> >>>>>        trace_vhost_vdpa_set_vring_ready(dev);
> >>>>>        for (i = 0; i < dev->nvqs; ++i) {
> >>>>> +        VirtQueue *vq;
> >>>>>            struct vhost_vring_state state = {
> >>>>>                .index = dev->vq_index + i,
> >>>>>                .num = 1,
> >>>>>            };
> >>>>>            vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
> >>>>> +
> >>>>> +        /* Preemptive kick */
> >>>>> +        vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
> >>>>> +        event_notifier_set(virtio_queue_get_host_notifier(vq));
> >>>>>        }
> >>>>>        return 0;
> >>>>>    }
> >>>>> --
> >>>>> 2.31.1
> >>>>>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-01-16  7:02           ` Jason Wang
@ 2023-02-02 16:55             ` Eugenio Perez Martin
  0 siblings, 0 replies; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-02-02 16:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, qemu-devel, si-wei.liu, Liuxiangdong,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit

On Mon, Jan 16, 2023 at 8:02 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2023/1/13 17:06, Eugenio Perez Martin 写道:
> > On Fri, Jan 13, 2023 at 4:39 AM Jason Wang <jasowang@redhat.com> wrote:
> >> On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
> >>>
> >>>
> >>> On 1/13/2023 10:31 AM, Jason Wang wrote:
> >>>> On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >>>>> Spuriously kick the destination device's queue so it knows in case there
> >>>>> are new descriptors.
> >>>>>
> >>>>> RFC: This is somehow a gray area. The guest may have placed descriptors
> >>>>> in a virtqueue but not kicked it, so it might be surprised if the device
> >>>>> starts processing it.
> >>>> So I think this is kind of the work of the vDPA parent. For the parent
> >>>> that needs this trick, we should do it in the parent driver.
> >>> Agree, it looks easier implementing this in parent driver,
> >>> I can implement it in ifcvf set_vq_ready right now
> >> Great, but please check whether or not it is really needed.
> >>
> >> Some device implementation could check the available descriptions
> >> after DRIVER_OK without waiting for a kick.
> >>
> > So IIUC we can entirely drop this from the series (and I hope we can).
> > But then, what with the devices that does *not* check for them?
>
>
> It needs mediation in the vDPA parent driver.
>
>
> >
> > If we drop it it seems to me we must mandate devices to check for
> > descriptors at queue_enable. The queue could stall if not, isn't it?
>
>
> I'm not sure, did you see real issue with this? (Note that we don't do
> this for vhost-user-(vDPA))
>

Still unchecked, sorry. But not needing it for vhost-user-vDPA is a
very good signal indeed, thanks for pointing that.

> Btw, the code can result of kick before DRIVER_OK, which seems racy.
>

Good catch :). I'll fix it in the next revision if we see we need it.
I really hope to be able to drop it though.

Thanks!

> Thanks
>
>
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>> Thanks
> >>> Zhu Lingshan
> >>>> Thanks
> >>>>
> >>>>> However, that information is not in the migration stream and it should
> >>>>> be an edge case anyhow, being resilient to parallel notifications from
> >>>>> the guest.
> >>>>>
> >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>> ---
> >>>>>    hw/virtio/vhost-vdpa.c | 5 +++++
> >>>>>    1 file changed, 5 insertions(+)
> >>>>>
> >>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>>>> index 40b7e8706a..dff94355dd 100644
> >>>>> --- a/hw/virtio/vhost-vdpa.c
> >>>>> +++ b/hw/virtio/vhost-vdpa.c
> >>>>> @@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
> >>>>>        }
> >>>>>        trace_vhost_vdpa_set_vring_ready(dev);
> >>>>>        for (i = 0; i < dev->nvqs; ++i) {
> >>>>> +        VirtQueue *vq;
> >>>>>            struct vhost_vring_state state = {
> >>>>>                .index = dev->vq_index + i,
> >>>>>                .num = 1,
> >>>>>            };
> >>>>>            vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
> >>>>> +
> >>>>> +        /* Preemptive kick */
> >>>>> +        vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
> >>>>> +        event_notifier_set(virtio_queue_get_host_notifier(vq));
> >>>>>        }
> >>>>>        return 0;
> >>>>>    }
> >>>>> --
> >>>>> 2.31.1
> >>>>>
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration
  2023-02-02 11:27   ` Eugenio Perez Martin
@ 2023-02-03  5:08     ` Si-Wei Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Si-Wei Liu @ 2023-02-03  5:08 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit



On 2/2/2023 3:27 AM, Eugenio Perez Martin wrote:
> On Thu, Feb 2, 2023 at 2:00 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
>>> It's possible to migrate vdpa net devices if they are shadowed from the
>>>
>>> start.  But to always shadow the dataplane is effectively break its host
>>>
>>> passthrough, so its not convenient in vDPA scenarios.
>>>
>>>
>>>
>>> This series enables dynamically switching to shadow mode only at
>>>
>>> migration time.  This allow full data virtqueues passthrough all the
>>>
>>> time qemu is not migrating.
>>>
>>>
>>>
>>> Successfully tested with vdpa_sim_net (but it needs some patches, I
>>>
>>> will send them soon) and qemu emulated device with vp_vdpa with
>>>
>>> some restrictions:
>>>
>>> * No CVQ.
>>>
>>> * VIRTIO_RING_F_STATE patches.
>> What are these patches (I'm not sure I follow VIRTIO_RING_F_STATE, is it
>> a new feature that other vdpa driver would need for live migration)?
>>
> Not really,
>
> Since vp_vdpa wraps a virtio-net-pci driver to give it vdpa
> capabilities it needs a virtio in-band method to set and fetch the
> virtqueue state. Jason sent a proposal some time ago [1], and I
> implemented it in qemu's virtio emulated device.
>
> I can send them as a RFC but I didn't worry about making it pretty,
> nor I think they should be merged at the moment. vdpa parent drivers
> should follow vdpa_sim changes.
Got it. No bother sending RFC for now, I think it's limited to virtio 
backed vdpa providers only. Thanks for the clarifications.

-Siwei

>
> Thanks!
>
> [1] https://lists.oasis-open.org/archives/virtio-comment/202103/msg00036.html
>
>> -Siwei
>>
>>> * Expose _F_SUSPEND, but ignore it and suspend on ring state fetch like
>>>
>>>     DPDK.
>>>
>>>
>>>
>>> Comments are welcome, especially in the patcheswith RFC in the message.
>>>
>>>
>>>
>>> v2:
>>>
>>> - Use a migration listener instead of a memory listener to know when
>>>
>>>     the migration starts.
>>>
>>> - Add stuff not picked with ASID patches, like enable rings after
>>>
>>>     driver_ok
>>>
>>> - Add rewinding on the migration src, not in dst
>>>
>>> - v1 at https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01664.html
>>>
>>>
>>>
>>> Eugenio Pérez (13):
>>>
>>>     vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
>>>
>>>     vdpa net: move iova tree creation from init to start
>>>
>>>     vdpa: copy cvq shadow_data from data vqs, not from x-svq
>>>
>>>     vdpa: rewind at get_base, not set_base
>>>
>>>     vdpa net: add migration blocker if cannot migrate cvq
>>>
>>>     vhost: delay set_vring_ready after DRIVER_OK
>>>
>>>     vdpa: delay set_vring_ready after DRIVER_OK
>>>
>>>     vdpa: Negotiate _F_SUSPEND feature
>>>
>>>     vdpa: add feature_log parameter to vhost_vdpa
>>>
>>>     vdpa net: allow VHOST_F_LOG_ALL
>>>
>>>     vdpa: add vdpa net migration state notifier
>>>
>>>     vdpa: preemptive kick at enable
>>>
>>>     vdpa: Conditionally expose _F_LOG in vhost_net devices
>>>
>>>
>>>
>>>    include/hw/virtio/vhost-backend.h |   4 +
>>>
>>>    include/hw/virtio/vhost-vdpa.h    |   1 +
>>>
>>>    hw/net/vhost_net.c                |  25 ++-
>>>
>>>    hw/virtio/vhost-vdpa.c            |  64 +++++---
>>>
>>>    hw/virtio/vhost.c                 |   3 +
>>>
>>>    net/vhost-vdpa.c                  | 247 +++++++++++++++++++++++++-----
>>>
>>>    6 files changed, 278 insertions(+), 66 deletions(-)
>>>
>>>
>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-02-02 15:28     ` Eugenio Perez Martin
@ 2023-02-04  2:03       ` Si-Wei Liu
  2023-02-13  9:47         ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Si-Wei Liu @ 2023-02-04  2:03 UTC (permalink / raw)
  To: Eugenio Perez Martin, Eli Cohen
  Cc: qemu-devel, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Stefan Hajnoczi, Parav Pandit



On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote:
> On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
>>> This allows net to restart the device backend to configure SVQ on it.
>>>
>>> Ideally, these changes should not be net specific. However, the vdpa net
>>> backend is the one with enough knowledge to configure everything because
>>> of some reasons:
>>> * Queues might need to be shadowed or not depending on its kind (control
>>>     vs data).
>>> * Queues need to share the same map translations (iova tree).
>>>
>>> Because of that it is cleaner to restart the whole net backend and
>>> configure again as expected, similar to how vhost-kernel moves between
>>> userspace and passthrough.
>>>
>>> If more kinds of devices need dynamic switching to SVQ we can create a
>>> callback struct like VhostOps and move most of the code there.
>>> VhostOps cannot be reused since all vdpa backend share them, and to
>>> personalize just for networking would be too heavy.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 84 insertions(+)
>>>
>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>> index 5d7ad6e4d7..f38532b1df 100644
>>> --- a/net/vhost-vdpa.c
>>> +++ b/net/vhost-vdpa.c
>>> @@ -26,6 +26,8 @@
>>>    #include <err.h>
>>>    #include "standard-headers/linux/virtio_net.h"
>>>    #include "monitor/monitor.h"
>>> +#include "migration/migration.h"
>>> +#include "migration/misc.h"
>>>    #include "migration/blocker.h"
>>>    #include "hw/virtio/vhost.h"
>>>
>>> @@ -33,6 +35,7 @@
>>>    typedef struct VhostVDPAState {
>>>        NetClientState nc;
>>>        struct vhost_vdpa vhost_vdpa;
>>> +    Notifier migration_state;
>>>        Error *migration_blocker;
>>>        VHostNetState *vhost_net;
>>>
>>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
>>>        return DO_UPCAST(VhostVDPAState, nc, nc0);
>>>    }
>>>
>>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
>>> +{
>>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
>>> +    VirtIONet *n;
>>> +    VirtIODevice *vdev;
>>> +    int data_queue_pairs, cvq, r;
>>> +    NetClientState *peer;
>>> +
>>> +    /* We are only called on the first data vqs and only if x-svq is not set */
>>> +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
>>> +        return;
>>> +    }
>>> +
>>> +    vdev = v->dev->vdev;
>>> +    n = VIRTIO_NET(vdev);
>>> +    if (!n->vhost_started) {
>>> +        return;
>>> +    }
>>> +
>>> +    if (enable) {
>>> +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
>>> +    }
>>> +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
>>> +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
>>> +                                  n->max_ncs - n->max_queue_pairs : 0;
>>> +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
>>> +
>>> +    peer = s->nc.peer;
>>> +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
>>> +        VhostVDPAState *vdpa_state;
>>> +        NetClientState *nc;
>>> +
>>> +        if (i < data_queue_pairs) {
>>> +            nc = qemu_get_peer(peer, i);
>>> +        } else {
>>> +            nc = qemu_get_peer(peer, n->max_queue_pairs);
>>> +        }
>>> +
>>> +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
>>> +        vdpa_state->vhost_vdpa.shadow_data = enable;
>>> +
>>> +        if (i < data_queue_pairs) {
>>> +            /* Do not override CVQ shadow_vqs_enabled */
>>> +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
>>> +        }
>>> +    }
>>> +
>>> +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
>> As the first revision, this method (vhost_net_stop followed by
>> vhost_net_start) should be fine for software vhost-vdpa backend for e.g.
>> vp_vdpa and vdpa_sim_net. However, I would like to get your attention
>> that this method implies substantial blackout time for mode switching on
>> real hardware - get a full cycle of device reset of getting memory
>> mappings torn down, unpin & repin same set of pages, and set up new
>> mapping would take very significant amount of time, especially for a
>> large VM. Maybe we can do:
>>
> Right, I think this is something that deserves optimization in the future.
>
> Note that we must replace the mappings anyway, with all passthrough
> queues stopped.
Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq 
keep mapping to the same GPA where passthrough data virtqueues were 
associated with across switch (so that the mode switch is transparent to 
the guest)? For platform IOMMU the mapping and remapping cost is 
inevitable, though I wonder for the on-chip IOMMU case could it take 
some fast path to just replace IOVA in place without destroying and 
setting up all mapping entries, if the same GPA is going to be used for 
the data rings (copy Eli for his input).

>   This is because SVQ vrings are not in the guest space.
> The pin can be skipped though, I think that's a low hand fruit here.
Yes, that's right. For a large VM pining overhead usually overweighs the 
mapping cost. It would be a great amount of time saving if pin can be 
skipped.

>
> If any, we can track guest's IOVA and add SVQ vrings in a hole. If
> guest's IOVA layout changes, we can translate it then to a new
> location. That way we only need one map operation in the worst case.
> I'm omitting the lookup time here, but I still should be worth it.
>
> But as you mention I think it is not worth complicating this series
> and we can think about it on top.
Yes, agreed. I'll just let you aware of the need of this optimization 
for real hardware device.

>   We can start building it on top of
> your suggestions for sure.
>
>> 1) replace reset with the RESUME feature that was just added to the
>> vhost-vdpa ioctls in kernel
> We cannot change vring addresses just with a SUSPEND / RESUME.
I wonder if we can make SUSPEND (via some flag or new backend feature is 
fine) accept updating internal state like the vring addresses, while 
defer applying it to the device until RESUME? That way we don't lose a 
lot of other states that otherwise would need to re-instantiate at large 
with _F_RING_RESET or device reset.

>
> We could do it with the VIRTIO_F_RING_RESET feature though. Would it
> be advantageous to the device?
>
>> 2) add new vdpa ioctls to allow iova range rebound to new virtual
>> address for QEMU's shadow vq or back to device's vq
> Actually, if the device supports ASID we can allocate ASID 1 to that
> purpose. At this moment only CVQ vrings and control buffers are there
> when the device is passthrough.
Yep, we can get SVQ mapping pre-cooked in another ASID before dismantle 
the mapping for the passthrough VQ. This will help the general IOMMU case.

>
> But this doesn't solve the problem if we need to send all SVQ
> translation to the device on-chip IOMMU, doesn't it? We must clear all
> of it and send the new one to the device anyway.
>
>> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode
>> on the fly instead of getting through the whole reset+restart cycle
>>
> I think this is the same as 1, isn't it?
I mean do all three together: 1,2 in kernel and 3 in QEMU.

>
>> I suspect the same idea could even be used to address high live
>> migration downtime seen on hardware vdpa device. What do you think?
>>
> I think this is a great start for sure! Some questions:
> a) Is the time on reprogramming on-chip IOMMU comparable to program
> regular IOMMU?
I would think this largely depends on the hardware implementation of 
on-chip IOMMU, the performance characteristics of which is very device 
specific. Some times driver software implementation and API for on-chip 
MMU also matters. Which would require vendor specific work to optimize 
based on the specific use case.

>   If it is the case it should be easier to find vdpa
> devices with support for _F_RESET soon.
> b) Not to merge on master, but it is possible to add an artificial
> delay on vdpa_sim that simulates the properties of the delay of IOMMU?
> In that line, have you observed if it is linear with the size of the
> memory, with the number of maps, other factors..?
As I said this is very device specific and hard to quantify, but I agree 
it's a good idea to simulate the delay and measure the effect. For the 
on-chip MMU device I'm looking, large proportion of the time was spent 
on software side in allocating a range of memory for hosting mapping 
entries (don't know how to quantify this part but the allocation time is 
not a constant nor linear to the size of memory), walking all iotlb 
entries passed down from vdpa layer and building corresponding memory 
key objects for a range of pages. For each iotlb entry the time to build 
memory mapping looks grow linearly with the size of memory. Not sure if 
there's room to improve, I'll let the owner to clarify.

Thanks,
-Siwei





>
> Thanks!
>
>> Thanks,
>> -Siwei
>>
>>> +    if (unlikely(r < 0)) {
>>> +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
>>> +    }
>>> +}
>>> +
>>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
>>> +{
>>> +    MigrationState *migration = data;
>>> +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
>>> +                                     migration_state);
>>> +
>>> +    switch (migration->state) {
>>> +    case MIGRATION_STATUS_SETUP:
>>> +        vhost_vdpa_net_log_global_enable(s, true);
>>> +        return;
>>> +
>>> +    case MIGRATION_STATUS_CANCELLING:
>>> +    case MIGRATION_STATUS_CANCELLED:
>>> +    case MIGRATION_STATUS_FAILED:
>>> +        vhost_vdpa_net_log_global_enable(s, false);
>>> +        return;
>>> +    };
>>> +}
>>> +
>>>    static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
>>>    {
>>>        struct vhost_vdpa *v = &s->vhost_vdpa;
>>>
>>> +    if (v->feature_log) {
>>> +        add_migration_state_change_notifier(&s->migration_state);
>>> +    }
>>> +
>>>        if (v->shadow_vqs_enabled) {
>>>            v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>>                                               v->iova_range.last);
>>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
>>>
>>>        assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
>>>
>>> +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
>>> +        remove_migration_state_change_notifier(&s->migration_state);
>>> +    }
>>> +
>>>        dev = s->vhost_vdpa.dev;
>>>        if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
>>>            g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
>>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>>>        s->vhost_vdpa.device_fd = vdpa_device_fd;
>>>        s->vhost_vdpa.index = queue_pair_index;
>>>        s->always_svq = svq;
>>> +    s->migration_state.notify = vdpa_net_migration_state_notifier;
>>>        s->vhost_vdpa.shadow_vqs_enabled = svq;
>>>        s->vhost_vdpa.iova_range = iova_range;
>>>        s->vhost_vdpa.shadow_data = svq;



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-02-02 16:53             ` Eugenio Perez Martin
@ 2023-02-04 11:04               ` Si-Wei Liu
  2023-02-05 10:00                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 76+ messages in thread
From: Si-Wei Liu @ 2023-02-04 11:04 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, Zhu, Lingshan, qemu-devel, Liuxiangdong,
	Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit



On 2/2/2023 8:53 AM, Eugenio Perez Martin wrote:
> On Thu, Feb 2, 2023 at 1:57 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 1/13/2023 1:06 AM, Eugenio Perez Martin wrote:
>>> On Fri, Jan 13, 2023 at 4:39 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
>>>>>
>>>>> On 1/13/2023 10:31 AM, Jason Wang wrote:
>>>>>> On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>>>>>>> Spuriously kick the destination device's queue so it knows in case there
>>>>>>> are new descriptors.
>>>>>>>
>>>>>>> RFC: This is somehow a gray area. The guest may have placed descriptors
>>>>>>> in a virtqueue but not kicked it, so it might be surprised if the device
>>>>>>> starts processing it.
>>>>>> So I think this is kind of the work of the vDPA parent. For the parent
>>>>>> that needs this trick, we should do it in the parent driver.
>>>>> Agree, it looks easier implementing this in parent driver,
>>>>> I can implement it in ifcvf set_vq_ready right now
>>>> Great, but please check whether or not it is really needed.
>>>>
>>>> Some device implementation could check the available descriptions
>>>> after DRIVER_OK without waiting for a kick.
>>>>
>>> So IIUC we can entirely drop this from the series (and I hope we can).
>>> But then, what with the devices that does *not* check for them?
>> I wonder how the kick can be missed from the first place. Supposedly the
>> moment when vhost_dev_stop() calls .suspend() into vdpa driver, the
>> vcpus already stopped running (vm_running = false) and all pending kicks
>> are delivered through vhost-vdpa's host notifiers or mapped doorbell
>> already then device won't get new ones.
> I'm thinking now in cases like the net rx queue.
>
> When the guest starts it fills it and kicks the device. Let's say
> avail_idx is 255.
>
> Following the qemu emulated virtio net,
> hw/virtio/virtio.c:virtqueue_split_pop will stash shadow_avail_idx =
> 255, and it will not check it again until it is out of rx descriptors.
>
> Now the NIC fills N < 255 receive buffers, and VMM migrates. Will the
> destination device check rx avail idx even if it has not received any
> kick? (here could be at startup or when it needs to receive a packet).
> - If the answer is yes, and it will be a bug not to check it, then we
> can drop this patch. We're covered even if there is a possibility of
> losing a kick in the source.
So this is not an issue of missing delivery of kicks, but more of how 
device is expected to handle pending kicks during suspend? For network 
device, it's not required to process up to avail_idx during suspend, but 
this doesn't mean it should ignore the kick for new descriptors, or 
instead I would say the device shouldn't specifically rely on kick, 
either at suspend or at startup. If at suspend, the device doesn't 
process up to avail_idx, correspondingly the implementation of it should 
sync the avail_idx in memory at startup. Even if the device 
implementation has to process up to avail_idx at suspend, for 
interoperability (i.e. source device didn't sync at suspend) point of 
view it still needs to check avail_idx at startup (resume) time and go 
on to process any pending request, right? So in any case, it seems to me 
the "implicit" kick at startup is needed for any device implementation 
anyway. I wouldn't say mandatory but that's the way how its supposed to 
work I feel.

> - If the answer is that it is not mandatory, we need to solve it
> somehow. To me, the best way is to spuriously kick as we don't need
> changes in the device, all we need is here. A new feature flag
> _F_CHECK_AVAIL_ON_STARTUP or equivalent would work the same, but I
> think it complicates everything more.
>
> For tx the device should suspend "immediately", so it may receive a
> kick, fetch avail_idx with M pending descriptors, transmit P < M and
> then receive the suspend. If we don't want to wait indefinitely, the
> device should stop processing so there are still pending requests in
> the queue for the destination to send. So the case now is the same as
> rx, even if the source device actually receives the kick.
>
> Having said that, I didn't check if any code drains the vhost host
> notifier. Or, as mentioned in the meeting, check that HW cannot
> reorder kick and suspend call.
Not sure how order matters here, though I thought device suspend/resume 
doesn't tie in with kick order?

>
>> If the device intends to
>> purposely ignore (note: this could be a device bug) pending kicks during
>> .suspend(), then consequently it should check available descriptors
>> after reaching driver_ok to process outstanding descriptors, making up
>> for the missing kick. If the vdpa driver doesn't support .suspend(),
>> then it should flush the work before .reset() - vhost-scsi does it this
>> way.  Or otherwise I think it's the norm (right thing to do) device
>> should process pending kicks before guest memory is to be unmapped at
>> the late game of vhost_dev_stop(). Is there any case kicks may be missing?
>>
> So process pending kicks means to drain all tx and rx descriptors?
No it doesn't have to. What I said is it shouldn't ignore pending kicks 
as if there's no more buffer posted to the device. But really, a fair 
expectation for suspending or starting device is device should do a 
spontaneous sync for getting the latest avail_idx in the memory, which 
does not have to depend on kicks. For network hardware device, I thought 
suspend just needs to wait until the completion of ongoing Tx/Rx DMA 
transaction already in the flight, rather than to drain all the upcoming 
packets until avail_idx.

Regards,
-Siwei
> That would be a solution, but then we don't need virtqueue_state at
> all as we might simply recover it from guest's vring avail_idx.
>
> Thanks!
>
>> -Siwei
>>
>>
>>> If we drop it it seems to me we must mandate devices to check for
>>> descriptors at queue_enable. The queue could stall if not, isn't it?
>>>
>>> Thanks!
>>>
>>>> Thanks
>>>>
>>>>> Thanks
>>>>> Zhu Lingshan
>>>>>> Thanks
>>>>>>
>>>>>>> However, that information is not in the migration stream and it should
>>>>>>> be an edge case anyhow, being resilient to parallel notifications from
>>>>>>> the guest.
>>>>>>>
>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>> ---
>>>>>>>     hw/virtio/vhost-vdpa.c | 5 +++++
>>>>>>>     1 file changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>>>> index 40b7e8706a..dff94355dd 100644
>>>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>>>> @@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready)
>>>>>>>         }
>>>>>>>         trace_vhost_vdpa_set_vring_ready(dev);
>>>>>>>         for (i = 0; i < dev->nvqs; ++i) {
>>>>>>> +        VirtQueue *vq;
>>>>>>>             struct vhost_vring_state state = {
>>>>>>>                 .index = dev->vq_index + i,
>>>>>>>                 .num = 1,
>>>>>>>             };
>>>>>>>             vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
>>>>>>> +
>>>>>>> +        /* Preemptive kick */
>>>>>>> +        vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
>>>>>>> +        event_notifier_set(virtio_queue_get_host_notifier(vq));
>>>>>>>         }
>>>>>>>         return 0;
>>>>>>>     }
>>>>>>> --
>>>>>>> 2.31.1
>>>>>>>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-02-04 11:04               ` Si-Wei Liu
@ 2023-02-05 10:00                 ` Michael S. Tsirkin
  2023-02-06  5:08                   ` Si-Wei Liu
  0 siblings, 1 reply; 76+ messages in thread
From: Michael S. Tsirkin @ 2023-02-05 10:00 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, Jason Wang, Zhu, Lingshan, qemu-devel,
	Liuxiangdong, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Stefan Hajnoczi, Parav Pandit

On Sat, Feb 04, 2023 at 03:04:02AM -0800, Si-Wei Liu wrote:
> For network hardware device, I thought suspend
> just needs to wait until the completion of ongoing Tx/Rx DMA transaction
> already in the flight, rather than to drain all the upcoming packets until
> avail_idx.

It depends I guess but if device expects to recover all state from just
ring state in memory then at least it has to drain until some index
value.



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 12/13] vdpa: preemptive kick at enable
  2023-02-05 10:00                 ` Michael S. Tsirkin
@ 2023-02-06  5:08                   ` Si-Wei Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Si-Wei Liu @ 2023-02-06  5:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eugenio Perez Martin, Jason Wang, Zhu, Lingshan, qemu-devel,
	Liuxiangdong, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Eli Cohen, Paolo Bonzini,
	Stefan Hajnoczi, Parav Pandit



On 2/5/2023 2:00 AM, Michael S. Tsirkin wrote:
> On Sat, Feb 04, 2023 at 03:04:02AM -0800, Si-Wei Liu wrote:
>> For network hardware device, I thought suspend
>> just needs to wait until the completion of ongoing Tx/Rx DMA transaction
>> already in the flight, rather than to drain all the upcoming packets until
>> avail_idx.
> It depends I guess but if device expects to recover all state from just
> ring state in memory then at least it has to drain until some index
> value.
Yes, that's the general requirement for other devices than networking 
device. For e.g., if a storage device had posted request before 
suspending and there's no way to replay those requests from destination, 
it needs to drain until all posted requests are completed. For network 
device, this requirement can be lifted up somehow, as network (Ethernet) 
usually is tolerant to packet drops. Jason and I once had a long 
discussion about the expectation for {get,set}_vq_state() driver API and 
we came to conclusion that this is something networking device can stand 
up to:

https://lore.kernel.org/lkml/b2d18964-8cd6-6bb1-1995-5b966207046d@redhat.com/

-Siwei


^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-02-02  1:52   ` Si-Wei Liu
  2023-02-02 15:28     ` Eugenio Perez Martin
@ 2023-02-12 14:31     ` Eli Cohen
  1 sibling, 0 replies; 76+ messages in thread
From: Eli Cohen @ 2023-02-12 14:31 UTC (permalink / raw)
  To: Si-Wei Liu, Eugenio Pérez, qemu-devel
  Cc: Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Stefan Hajnoczi, Parav Pandit



> -----Original Message-----
> From: Si-Wei Liu <si-wei.liu@oracle.com>
> Sent: Thursday, 2 February 2023 3:53
> To: Eugenio Pérez <eperezma@redhat.com>; qemu-devel@nongnu.org
> Cc: Liuxiangdong <liuxiangdong5@huawei.com>; Zhu Lingshan
> <lingshan.zhu@intel.com>; Gonglei (Arei) <arei.gonglei@huawei.com>;
> alvaro.karsz@solid-run.com; Shannon Nelson <snelson@pensando.io>;
> Laurent Vivier <lvivier@redhat.com>; Harpreet Singh Anand
> <hanand@xilinx.com>; Gautam Dawar <gdawar@xilinx.com>; Stefano
> Garzarella <sgarzare@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> Cindy Lu <lulu@redhat.com>; Eli Cohen <eli@mellanox.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Michael S. Tsirkin <mst@redhat.com>; Jason Wang
> <jasowang@redhat.com>; Stefan Hajnoczi <stefanha@redhat.com>; Parav
> Pandit <parav@mellanox.com>
> Subject: Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
> 
> 
> 
> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
> > This allows net to restart the device backend to configure SVQ on it.
> >
> > Ideally, these changes should not be net specific. However, the vdpa net
> > backend is the one with enough knowledge to configure everything because
> > of some reasons:
> > * Queues might need to be shadowed or not depending on its kind (control
> >    vs data).
> > * Queues need to share the same map translations (iova tree).
> >
> > Because of that it is cleaner to restart the whole net backend and
> > configure again as expected, similar to how vhost-kernel moves between
> > userspace and passthrough.
> >
> > If more kinds of devices need dynamic switching to SVQ we can create a
> > callback struct like VhostOps and move most of the code there.
> > VhostOps cannot be reused since all vdpa backend share them, and to
> > personalize just for networking would be too heavy.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   net/vhost-vdpa.c | 84
> ++++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 84 insertions(+)
> >
> > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > index 5d7ad6e4d7..f38532b1df 100644
> > --- a/net/vhost-vdpa.c
> > +++ b/net/vhost-vdpa.c
> > @@ -26,6 +26,8 @@
> >   #include <err.h>
> >   #include "standard-headers/linux/virtio_net.h"
> >   #include "monitor/monitor.h"
> > +#include "migration/migration.h"
> > +#include "migration/misc.h"
> >   #include "migration/blocker.h"
> >   #include "hw/virtio/vhost.h"
> >
> > @@ -33,6 +35,7 @@
> >   typedef struct VhostVDPAState {
> >       NetClientState nc;
> >       struct vhost_vdpa vhost_vdpa;
> > +    Notifier migration_state;
> >       Error *migration_blocker;
> >       VHostNetState *vhost_net;
> >
> > @@ -243,10 +246,86 @@ static VhostVDPAState
> *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> >       return DO_UPCAST(VhostVDPAState, nc, nc0);
> >   }
> >
> > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool
> enable)
> > +{
> > +    struct vhost_vdpa *v = &s->vhost_vdpa;
> > +    VirtIONet *n;
> > +    VirtIODevice *vdev;
> > +    int data_queue_pairs, cvq, r;
> > +    NetClientState *peer;
> > +
> > +    /* We are only called on the first data vqs and only if x-svq is not set */
> > +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> > +        return;
> > +    }
> > +
> > +    vdev = v->dev->vdev;
> > +    n = VIRTIO_NET(vdev);
> > +    if (!n->vhost_started) {
> > +        return;
> > +    }
> > +
> > +    if (enable) {
> > +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
> > +    }
> > +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> > +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> > +                                  n->max_ncs - n->max_queue_pairs : 0;
> > +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> > +
> > +    peer = s->nc.peer;
> > +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> > +        VhostVDPAState *vdpa_state;
> > +        NetClientState *nc;
> > +
> > +        if (i < data_queue_pairs) {
> > +            nc = qemu_get_peer(peer, i);
> > +        } else {
> > +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> > +        }
> > +
> > +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> > +        vdpa_state->vhost_vdpa.shadow_data = enable;
> > +
> > +        if (i < data_queue_pairs) {
> > +            /* Do not override CVQ shadow_vqs_enabled */
> > +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> > +        }
> > +    }
> > +
> > +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
> As the first revision, this method (vhost_net_stop followed by
> vhost_net_start) should be fine for software vhost-vdpa backend for e.g.
> vp_vdpa and vdpa_sim_net. However, I would like to get your attention
> that this method implies substantial blackout time for mode switching on
> real hardware - get a full cycle of device reset of getting memory
> mappings torn down, unpin & repin same set of pages, and set up new
> mapping would take very significant amount of time, especially for a
> large VM. Maybe we can do:
> 
> 1) replace reset with the RESUME feature that was just added to the
> vhost-vdpa ioctls in kernel
> 2) add new vdpa ioctls to allow iova range rebound to new virtual
> address for QEMU's shadow vq or back to device's vq

Every time you change the iova range, mlx5_vdpa needs to destroy the old MR and build a new one, based on the new data provided by the new iova. That implies destroying the VQs and creating them again with reference to the new MR. If the new iova provided is narrower, it will cause the memory registration operation to take less time. In any case, I don't see how adding a new call makes a difference relative to using the current set_map() call.

If all we need is extend the current iova range to include the shadow virtqueue, we could introduce a call that extends the current iova.

In this case, I could provide a much faster modification on the MR.

> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode
> on the fly instead of getting through the whole reset+restart cycle
> 
> I suspect the same idea could even be used to address high live
> migration downtime seen on hardware vdpa device. What do you think?
> 
> Thanks,
> -Siwei
> 
> > +    if (unlikely(r < 0)) {
> > +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> > +    }
> > +}
> > +
> > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void
> *data)
> > +{
> > +    MigrationState *migration = data;
> > +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> > +                                     migration_state);
> > +
> > +    switch (migration->state) {
> > +    case MIGRATION_STATUS_SETUP:
> > +        vhost_vdpa_net_log_global_enable(s, true);
> > +        return;
> > +
> > +    case MIGRATION_STATUS_CANCELLING:
> > +    case MIGRATION_STATUS_CANCELLED:
> > +    case MIGRATION_STATUS_FAILED:
> > +        vhost_vdpa_net_log_global_enable(s, false);
> > +        return;
> > +    };
> > +}
> > +
> >   static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> >   {
> >       struct vhost_vdpa *v = &s->vhost_vdpa;
> >
> > +    if (v->feature_log) {
> > +        add_migration_state_change_notifier(&s->migration_state);
> > +    }
> > +
> >       if (v->shadow_vqs_enabled) {
> >           v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >                                              v->iova_range.last);
> > @@ -280,6 +359,10 @@ static void
> vhost_vdpa_net_client_stop(NetClientState *nc)
> >
> >       assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> >
> > +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> > +        remove_migration_state_change_notifier(&s->migration_state);
> > +    }
> > +
> >       dev = s->vhost_vdpa.dev;
> >       if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> >           g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> > @@ -767,6 +850,7 @@ static NetClientState
> *net_vhost_vdpa_init(NetClientState *peer,
> >       s->vhost_vdpa.device_fd = vdpa_device_fd;
> >       s->vhost_vdpa.index = queue_pair_index;
> >       s->always_svq = svq;
> > +    s->migration_state.notify = vdpa_net_migration_state_notifier;
> >       s->vhost_vdpa.shadow_vqs_enabled = svq;
> >       s->vhost_vdpa.iova_range = iova_range;
> >       s->vhost_vdpa.shadow_data = svq;


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-02-04  2:03       ` Si-Wei Liu
@ 2023-02-13  9:47         ` Eugenio Perez Martin
  2023-02-13 22:36           ` Si-Wei Liu
  0 siblings, 1 reply; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-02-13  9:47 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eli Cohen, qemu-devel, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Stefan Hajnoczi, Parav Pandit

On Sat, Feb 4, 2023 at 3:04 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote:
> > On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
> >>> This allows net to restart the device backend to configure SVQ on it.
> >>>
> >>> Ideally, these changes should not be net specific. However, the vdpa net
> >>> backend is the one with enough knowledge to configure everything because
> >>> of some reasons:
> >>> * Queues might need to be shadowed or not depending on its kind (control
> >>>     vs data).
> >>> * Queues need to share the same map translations (iova tree).
> >>>
> >>> Because of that it is cleaner to restart the whole net backend and
> >>> configure again as expected, similar to how vhost-kernel moves between
> >>> userspace and passthrough.
> >>>
> >>> If more kinds of devices need dynamic switching to SVQ we can create a
> >>> callback struct like VhostOps and move most of the code there.
> >>> VhostOps cannot be reused since all vdpa backend share them, and to
> >>> personalize just for networking would be too heavy.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>>    1 file changed, 84 insertions(+)
> >>>
> >>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> >>> index 5d7ad6e4d7..f38532b1df 100644
> >>> --- a/net/vhost-vdpa.c
> >>> +++ b/net/vhost-vdpa.c
> >>> @@ -26,6 +26,8 @@
> >>>    #include <err.h>
> >>>    #include "standard-headers/linux/virtio_net.h"
> >>>    #include "monitor/monitor.h"
> >>> +#include "migration/migration.h"
> >>> +#include "migration/misc.h"
> >>>    #include "migration/blocker.h"
> >>>    #include "hw/virtio/vhost.h"
> >>>
> >>> @@ -33,6 +35,7 @@
> >>>    typedef struct VhostVDPAState {
> >>>        NetClientState nc;
> >>>        struct vhost_vdpa vhost_vdpa;
> >>> +    Notifier migration_state;
> >>>        Error *migration_blocker;
> >>>        VHostNetState *vhost_net;
> >>>
> >>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> >>>        return DO_UPCAST(VhostVDPAState, nc, nc0);
> >>>    }
> >>>
> >>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> >>> +{
> >>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
> >>> +    VirtIONet *n;
> >>> +    VirtIODevice *vdev;
> >>> +    int data_queue_pairs, cvq, r;
> >>> +    NetClientState *peer;
> >>> +
> >>> +    /* We are only called on the first data vqs and only if x-svq is not set */
> >>> +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    vdev = v->dev->vdev;
> >>> +    n = VIRTIO_NET(vdev);
> >>> +    if (!n->vhost_started) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    if (enable) {
> >>> +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
> >>> +    }
> >>> +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> >>> +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> >>> +                                  n->max_ncs - n->max_queue_pairs : 0;
> >>> +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> >>> +
> >>> +    peer = s->nc.peer;
> >>> +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> >>> +        VhostVDPAState *vdpa_state;
> >>> +        NetClientState *nc;
> >>> +
> >>> +        if (i < data_queue_pairs) {
> >>> +            nc = qemu_get_peer(peer, i);
> >>> +        } else {
> >>> +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> >>> +        }
> >>> +
> >>> +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> >>> +        vdpa_state->vhost_vdpa.shadow_data = enable;
> >>> +
> >>> +        if (i < data_queue_pairs) {
> >>> +            /* Do not override CVQ shadow_vqs_enabled */
> >>> +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> >>> +        }
> >>> +    }
> >>> +
> >>> +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
> >> As the first revision, this method (vhost_net_stop followed by
> >> vhost_net_start) should be fine for software vhost-vdpa backend for e.g.
> >> vp_vdpa and vdpa_sim_net. However, I would like to get your attention
> >> that this method implies substantial blackout time for mode switching on
> >> real hardware - get a full cycle of device reset of getting memory
> >> mappings torn down, unpin & repin same set of pages, and set up new
> >> mapping would take very significant amount of time, especially for a
> >> large VM. Maybe we can do:
> >>
> > Right, I think this is something that deserves optimization in the future.
> >
> > Note that we must replace the mappings anyway, with all passthrough
> > queues stopped.
> Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq
> keep mapping to the same GPA where passthrough data virtqueues were
> associated with across switch (so that the mode switch is transparent to
> the guest)?

I don't get this question, SVQ switching is already transparent to the guest.

> For platform IOMMU the mapping and remapping cost is
> inevitable, though I wonder for the on-chip IOMMU case could it take
> some fast path to just replace IOVA in place without destroying and
> setting up all mapping entries, if the same GPA is going to be used for
> the data rings (copy Eli for his input).
>
> >   This is because SVQ vrings are not in the guest space.
> > The pin can be skipped though, I think that's a low hand fruit here.
> Yes, that's right. For a large VM pining overhead usually overweighs the
> mapping cost. It would be a great amount of time saving if pin can be
> skipped.
>

That is doable using dma_map/unmap apis instead of set_map (or
comparing in set_map) and allocation GPA translations in advance.

> >
> > If any, we can track guest's IOVA and add SVQ vrings in a hole. If
> > guest's IOVA layout changes, we can translate it then to a new
> > location. That way we only need one map operation in the worst case.
> > I'm omitting the lookup time here, but I still should be worth it.
> >
> > But as you mention I think it is not worth complicating this series
> > and we can think about it on top.
> Yes, agreed. I'll just let you aware of the need of this optimization
> for real hardware device.
>
> >   We can start building it on top of
> > your suggestions for sure.
> >
> >> 1) replace reset with the RESUME feature that was just added to the
> >> vhost-vdpa ioctls in kernel
> > We cannot change vring addresses just with a SUSPEND / RESUME.
> I wonder if we can make SUSPEND (via some flag or new backend feature is
> fine) accept updating internal state like the vring addresses, while
> defer applying it to the device until RESUME? That way we don't lose a
> lot of other states that otherwise would need to re-instantiate at large
> with _F_RING_RESET or device reset.
>

If that helps, that can be done for sure.

As another idea, we could do the reverse and allow _F_RING_RESET to
not to forget the parameters unless they're explicitly overriden. I
think I prefer your idea in  SUSPEND / RESUME cycle, but just wanted
to put that possibility on the table if that makes more sense.

> >
> > We could do it with the VIRTIO_F_RING_RESET feature though. Would it
> > be advantageous to the device?
> >
> >> 2) add new vdpa ioctls to allow iova range rebound to new virtual
> >> address for QEMU's shadow vq or back to device's vq
> > Actually, if the device supports ASID we can allocate ASID 1 to that
> > purpose. At this moment only CVQ vrings and control buffers are there
> > when the device is passthrough.
> Yep, we can get SVQ mapping pre-cooked in another ASID before dismantle
> the mapping for the passthrough VQ. This will help the general IOMMU case.
>
> >
> > But this doesn't solve the problem if we need to send all SVQ
> > translation to the device on-chip IOMMU, doesn't it? We must clear all
> > of it and send the new one to the device anyway.
> >
> >> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode
> >> on the fly instead of getting through the whole reset+restart cycle
> >>
> > I think this is the same as 1, isn't it?
> I mean do all three together: 1,2 in kernel and 3 in QEMU.
>

Ok I missed that in my first read, thanks!

But I feel 2 should be easier to do in qemu.

I don't really know how this helps in the general IOMMU case, I'm
assuming the IOMMU does not support PASID or similar tricks. Is that
because of the vhost_iotlb population or is there anything else I'm
missing?

> >
> >> I suspect the same idea could even be used to address high live
> >> migration downtime seen on hardware vdpa device. What do you think?
> >>
> > I think this is a great start for sure! Some questions:
> > a) Is the time on reprogramming on-chip IOMMU comparable to program
> > regular IOMMU?
> I would think this largely depends on the hardware implementation of
> on-chip IOMMU, the performance characteristics of which is very device
> specific. Some times driver software implementation and API for on-chip
> MMU also matters. Which would require vendor specific work to optimize
> based on the specific use case.
>

Got it.

> >   If it is the case it should be easier to find vdpa
> > devices with support for _F_RESET soon.
> > b) Not to merge on master, but it is possible to add an artificial
> > delay on vdpa_sim that simulates the properties of the delay of IOMMU?
> > In that line, have you observed if it is linear with the size of the
> > memory, with the number of maps, other factors..?
> As I said this is very device specific and hard to quantify, but I agree
> it's a good idea to simulate the delay and measure the effect. For the
> on-chip MMU device I'm looking, large proportion of the time was spent
> on software side in allocating a range of memory for hosting mapping
> entries (don't know how to quantify this part but the allocation time is
> not a constant nor linear to the size of memory), walking all iotlb
> entries passed down from vdpa layer and building corresponding memory
> key objects for a range of pages. For each iotlb entry the time to build
> memory mapping looks grow linearly with the size of memory. Not sure if
> there's room to improve, I'll let the owner to clarify.
>

So I think all of these are great ideas.

If we state the pin & unpin huts latency in the switching I think the
easiest way to start is:
* To start with qemu and send all the map / unmap in a batch
* Avoid the pin / unpin in the kernel using a smarter algorithm for
that, not unpinning regions that it is going to pin again.

What do you think?

Thanks!

> Thanks,
> -Siwei
>
>
>
>
>
> >
> > Thanks!
> >
> >> Thanks,
> >> -Siwei
> >>
> >>> +    if (unlikely(r < 0)) {
> >>> +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> >>> +    }
> >>> +}
> >>> +
> >>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
> >>> +{
> >>> +    MigrationState *migration = data;
> >>> +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> >>> +                                     migration_state);
> >>> +
> >>> +    switch (migration->state) {
> >>> +    case MIGRATION_STATUS_SETUP:
> >>> +        vhost_vdpa_net_log_global_enable(s, true);
> >>> +        return;
> >>> +
> >>> +    case MIGRATION_STATUS_CANCELLING:
> >>> +    case MIGRATION_STATUS_CANCELLED:
> >>> +    case MIGRATION_STATUS_FAILED:
> >>> +        vhost_vdpa_net_log_global_enable(s, false);
> >>> +        return;
> >>> +    };
> >>> +}
> >>> +
> >>>    static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> >>>    {
> >>>        struct vhost_vdpa *v = &s->vhost_vdpa;
> >>>
> >>> +    if (v->feature_log) {
> >>> +        add_migration_state_change_notifier(&s->migration_state);
> >>> +    }
> >>> +
> >>>        if (v->shadow_vqs_enabled) {
> >>>            v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >>>                                               v->iova_range.last);
> >>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
> >>>
> >>>        assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> >>>
> >>> +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> >>> +        remove_migration_state_change_notifier(&s->migration_state);
> >>> +    }
> >>> +
> >>>        dev = s->vhost_vdpa.dev;
> >>>        if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> >>>            g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> >>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> >>>        s->vhost_vdpa.device_fd = vdpa_device_fd;
> >>>        s->vhost_vdpa.index = queue_pair_index;
> >>>        s->always_svq = svq;
> >>> +    s->migration_state.notify = vdpa_net_migration_state_notifier;
> >>>        s->vhost_vdpa.shadow_vqs_enabled = svq;
> >>>        s->vhost_vdpa.iova_range = iova_range;
> >>>        s->vhost_vdpa.shadow_data = svq;
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-02-13  9:47         ` Eugenio Perez Martin
@ 2023-02-13 22:36           ` Si-Wei Liu
  2023-02-14 18:51             ` Eugenio Perez Martin
  0 siblings, 1 reply; 76+ messages in thread
From: Si-Wei Liu @ 2023-02-13 22:36 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Eli Cohen, qemu-devel, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Stefan Hajnoczi, Parav Pandit



On 2/13/2023 1:47 AM, Eugenio Perez Martin wrote:
> On Sat, Feb 4, 2023 at 3:04 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote:
>>> On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
>>>>> This allows net to restart the device backend to configure SVQ on it.
>>>>>
>>>>> Ideally, these changes should not be net specific. However, the vdpa net
>>>>> backend is the one with enough knowledge to configure everything because
>>>>> of some reasons:
>>>>> * Queues might need to be shadowed or not depending on its kind (control
>>>>>      vs data).
>>>>> * Queues need to share the same map translations (iova tree).
>>>>>
>>>>> Because of that it is cleaner to restart the whole net backend and
>>>>> configure again as expected, similar to how vhost-kernel moves between
>>>>> userspace and passthrough.
>>>>>
>>>>> If more kinds of devices need dynamic switching to SVQ we can create a
>>>>> callback struct like VhostOps and move most of the code there.
>>>>> VhostOps cannot be reused since all vdpa backend share them, and to
>>>>> personalize just for networking would be too heavy.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>     net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>     1 file changed, 84 insertions(+)
>>>>>
>>>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>>>> index 5d7ad6e4d7..f38532b1df 100644
>>>>> --- a/net/vhost-vdpa.c
>>>>> +++ b/net/vhost-vdpa.c
>>>>> @@ -26,6 +26,8 @@
>>>>>     #include <err.h>
>>>>>     #include "standard-headers/linux/virtio_net.h"
>>>>>     #include "monitor/monitor.h"
>>>>> +#include "migration/migration.h"
>>>>> +#include "migration/misc.h"
>>>>>     #include "migration/blocker.h"
>>>>>     #include "hw/virtio/vhost.h"
>>>>>
>>>>> @@ -33,6 +35,7 @@
>>>>>     typedef struct VhostVDPAState {
>>>>>         NetClientState nc;
>>>>>         struct vhost_vdpa vhost_vdpa;
>>>>> +    Notifier migration_state;
>>>>>         Error *migration_blocker;
>>>>>         VHostNetState *vhost_net;
>>>>>
>>>>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
>>>>>         return DO_UPCAST(VhostVDPAState, nc, nc0);
>>>>>     }
>>>>>
>>>>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
>>>>> +{
>>>>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
>>>>> +    VirtIONet *n;
>>>>> +    VirtIODevice *vdev;
>>>>> +    int data_queue_pairs, cvq, r;
>>>>> +    NetClientState *peer;
>>>>> +
>>>>> +    /* We are only called on the first data vqs and only if x-svq is not set */
>>>>> +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    vdev = v->dev->vdev;
>>>>> +    n = VIRTIO_NET(vdev);
>>>>> +    if (!n->vhost_started) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    if (enable) {
>>>>> +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
>>>>> +    }
>>>>> +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
>>>>> +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
>>>>> +                                  n->max_ncs - n->max_queue_pairs : 0;
>>>>> +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
>>>>> +
>>>>> +    peer = s->nc.peer;
>>>>> +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
>>>>> +        VhostVDPAState *vdpa_state;
>>>>> +        NetClientState *nc;
>>>>> +
>>>>> +        if (i < data_queue_pairs) {
>>>>> +            nc = qemu_get_peer(peer, i);
>>>>> +        } else {
>>>>> +            nc = qemu_get_peer(peer, n->max_queue_pairs);
>>>>> +        }
>>>>> +
>>>>> +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
>>>>> +        vdpa_state->vhost_vdpa.shadow_data = enable;
>>>>> +
>>>>> +        if (i < data_queue_pairs) {
>>>>> +            /* Do not override CVQ shadow_vqs_enabled */
>>>>> +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
>>>> As the first revision, this method (vhost_net_stop followed by
>>>> vhost_net_start) should be fine for software vhost-vdpa backend for e.g.
>>>> vp_vdpa and vdpa_sim_net. However, I would like to get your attention
>>>> that this method implies substantial blackout time for mode switching on
>>>> real hardware - get a full cycle of device reset of getting memory
>>>> mappings torn down, unpin & repin same set of pages, and set up new
>>>> mapping would take very significant amount of time, especially for a
>>>> large VM. Maybe we can do:
>>>>
>>> Right, I think this is something that deserves optimization in the future.
>>>
>>> Note that we must replace the mappings anyway, with all passthrough
>>> queues stopped.
>> Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq
>> keep mapping to the same GPA where passthrough data virtqueues were
>> associated with across switch (so that the mode switch is transparent to
>> the guest)?
> I don't get this question, SVQ switching is already transparent to the guest.
Never mind, you seem to have answered the question in the reply here and 
below. I was thinking of possibility to do incremental in-place update 
for a given IOVA range with one single call (for the on-chip IOMMU 
case), instead of separate unmap() and map() calls. Things like 
.set_map_replace(vdpa, asid, iova_start, size, iotlb_new_maps) as I ever 
mentioned.

>
>> For platform IOMMU the mapping and remapping cost is
>> inevitable, though I wonder for the on-chip IOMMU case could it take
>> some fast path to just replace IOVA in place without destroying and
>> setting up all mapping entries, if the same GPA is going to be used for
>> the data rings (copy Eli for his input).
>>
>>>    This is because SVQ vrings are not in the guest space.
>>> The pin can be skipped though, I think that's a low hand fruit here.
>> Yes, that's right. For a large VM pining overhead usually overweighs the
>> mapping cost. It would be a great amount of time saving if pin can be
>> skipped.
>>
> That is doable using dma_map/unmap apis instead of set_map (or
> comparing in set_map) and allocation GPA translations in advance.
Is there a way for a driver to use both dma_map()/unmap() and set_map() 
APIs at the same time? Seems not possible for the moment. And batching 
is currently unsupported on dma_map()/unmap().

Not sure how mapping could be decoupled from pinning as the current uAPI 
(VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE) have both, i.e. it's not 
easy to tear them apart. If we agree pinning is not needed, perhaps we 
could add a new uAPI to just remap the IOVA ranges for data VQ 
addresses, and get around any code path involving page pinning. Under 
the hood at the driver API level, in case of general platform IOMMU, 
iommu_unmap() and iommu_map() can be used; in case of on-chip IOMMU, 
vdpa kernel would just call the new driver API .set_map_replace() to 
update the relevant IOVA mappings in place, without having to rebuild 
the entire iova tree.

>
>>> If any, we can track guest's IOVA and add SVQ vrings in a hole. If
>>> guest's IOVA layout changes, we can translate it then to a new
>>> location. That way we only need one map operation in the worst case.
>>> I'm omitting the lookup time here, but I still should be worth it.
>>>
>>> But as you mention I think it is not worth complicating this series
>>> and we can think about it on top.
>> Yes, agreed. I'll just let you aware of the need of this optimization
>> for real hardware device.
>>
>>>    We can start building it on top of
>>> your suggestions for sure.
>>>
>>>> 1) replace reset with the RESUME feature that was just added to the
>>>> vhost-vdpa ioctls in kernel
>>> We cannot change vring addresses just with a SUSPEND / RESUME.
>> I wonder if we can make SUSPEND (via some flag or new backend feature is
>> fine) accept updating internal state like the vring addresses, while
>> defer applying it to the device until RESUME? That way we don't lose a
>> lot of other states that otherwise would need to re-instantiate at large
>> with _F_RING_RESET or device reset.
>>
> If that helps, that can be done for sure.
>
> As another idea, we could do the reverse and allow _F_RING_RESET to
> not to forget the parameters unless they're explicitly overriden.
Hmmm, this might need spec extension as that's not the current 
expectation for _F_RING_RESET so far as I understand. Once ring is 
reset, all parameters associated with the ring are forgotten.

> I think I prefer your idea in  SUSPEND / RESUME cycle, but just wanted
> to put that possibility on the table if that makes more sense.
Yea may be via a new per-vq suspend feature: _F_RING_STOP.

>
>>> We could do it with the VIRTIO_F_RING_RESET feature though. Would it
>>> be advantageous to the device?
>>>
>>>> 2) add new vdpa ioctls to allow iova range rebound to new virtual
>>>> address for QEMU's shadow vq or back to device's vq
>>> Actually, if the device supports ASID we can allocate ASID 1 to that
>>> purpose. At this moment only CVQ vrings and control buffers are there
>>> when the device is passthrough.
>> Yep, we can get SVQ mapping pre-cooked in another ASID before dismantle
>> the mapping for the passthrough VQ. This will help the general IOMMU case.
>>
>>> But this doesn't solve the problem if we need to send all SVQ
>>> translation to the device on-chip IOMMU, doesn't it? We must clear all
>>> of it and send the new one to the device anyway.
>>>
>>>> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode
>>>> on the fly instead of getting through the whole reset+restart cycle
>>>>
>>> I think this is the same as 1, isn't it?
>> I mean do all three together: 1,2 in kernel and 3 in QEMU.
>>
> Ok I missed that in my first read, thanks!
>
> But I feel 2 should be easier to do in qemu.
>
> I don't really know how this helps in the general IOMMU case, I'm
> assuming the IOMMU does not support PASID or similar tricks. Is that
> because of the vhost_iotlb population or is there anything else I'm
> missing?
A new uAPI (more precisely, iotlb message) is needed to get around of 
page pinning at least. Or if not specifically tied to onchip IOMMU, we 
can make it two separate uAPIs for UNMAP and MAP, respectively.

>
>>>> I suspect the same idea could even be used to address high live
>>>> migration downtime seen on hardware vdpa device. What do you think?
>>>>
>>> I think this is a great start for sure! Some questions:
>>> a) Is the time on reprogramming on-chip IOMMU comparable to program
>>> regular IOMMU?
>> I would think this largely depends on the hardware implementation of
>> on-chip IOMMU, the performance characteristics of which is very device
>> specific. Some times driver software implementation and API for on-chip
>> MMU also matters. Which would require vendor specific work to optimize
>> based on the specific use case.
>>
> Got it.
>
>>>    If it is the case it should be easier to find vdpa
>>> devices with support for _F_RESET soon.
>>> b) Not to merge on master, but it is possible to add an artificial
>>> delay on vdpa_sim that simulates the properties of the delay of IOMMU?
>>> In that line, have you observed if it is linear with the size of the
>>> memory, with the number of maps, other factors..?
>> As I said this is very device specific and hard to quantify, but I agree
>> it's a good idea to simulate the delay and measure the effect. For the
>> on-chip MMU device I'm looking, large proportion of the time was spent
>> on software side in allocating a range of memory for hosting mapping
>> entries (don't know how to quantify this part but the allocation time is
>> not a constant nor linear to the size of memory), walking all iotlb
>> entries passed down from vdpa layer and building corresponding memory
>> key objects for a range of pages. For each iotlb entry the time to build
>> memory mapping looks grow linearly with the size of memory. Not sure if
>> there's room to improve, I'll let the owner to clarify.
>>
> So I think all of these are great ideas.
>
> If we state the pin & unpin huts latency in the switching I think the
> easiest way to start is:
> * To start with qemu and send all the map / unmap in a batch
By map / unmap, you are referring to the uAPIs (VHOST_IOTLB_UPDATE and 
VHOST_IOTLB_INVALIDATE), not the driver level .dma_map/unmap() kernel 
APIs, right? yes it's always good to commit all map / unmap transactions 
at once in a batch.

> * Avoid the pin / unpin in the kernel using a smarter algorithm for
> that, not unpinning regions that it is going to pin again.
This seems to change the uAPI behavior underneath. Maybe cleaner to get 
it done with new uAPI.

Regards,
-Siwei

>
> What do you think?
>
> Thanks!
>
>> Thanks,
>> -Siwei
>>
>>
>>
>>
>>
>>> Thanks!
>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>>> +    if (unlikely(r < 0)) {
>>>>> +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
>>>>> +{
>>>>> +    MigrationState *migration = data;
>>>>> +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
>>>>> +                                     migration_state);
>>>>> +
>>>>> +    switch (migration->state) {
>>>>> +    case MIGRATION_STATUS_SETUP:
>>>>> +        vhost_vdpa_net_log_global_enable(s, true);
>>>>> +        return;
>>>>> +
>>>>> +    case MIGRATION_STATUS_CANCELLING:
>>>>> +    case MIGRATION_STATUS_CANCELLED:
>>>>> +    case MIGRATION_STATUS_FAILED:
>>>>> +        vhost_vdpa_net_log_global_enable(s, false);
>>>>> +        return;
>>>>> +    };
>>>>> +}
>>>>> +
>>>>>     static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
>>>>>     {
>>>>>         struct vhost_vdpa *v = &s->vhost_vdpa;
>>>>>
>>>>> +    if (v->feature_log) {
>>>>> +        add_migration_state_change_notifier(&s->migration_state);
>>>>> +    }
>>>>> +
>>>>>         if (v->shadow_vqs_enabled) {
>>>>>             v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>>>>                                                v->iova_range.last);
>>>>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
>>>>>
>>>>>         assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
>>>>>
>>>>> +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
>>>>> +        remove_migration_state_change_notifier(&s->migration_state);
>>>>> +    }
>>>>> +
>>>>>         dev = s->vhost_vdpa.dev;
>>>>>         if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
>>>>>             g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
>>>>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>>>>>         s->vhost_vdpa.device_fd = vdpa_device_fd;
>>>>>         s->vhost_vdpa.index = queue_pair_index;
>>>>>         s->always_svq = svq;
>>>>> +    s->migration_state.notify = vdpa_net_migration_state_notifier;
>>>>>         s->vhost_vdpa.shadow_vqs_enabled = svq;
>>>>>         s->vhost_vdpa.iova_range = iova_range;
>>>>>         s->vhost_vdpa.shadow_data = svq;



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
  2023-02-13 22:36           ` Si-Wei Liu
@ 2023-02-14 18:51             ` Eugenio Perez Martin
  0 siblings, 0 replies; 76+ messages in thread
From: Eugenio Perez Martin @ 2023-02-14 18:51 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eli Cohen, qemu-devel, Liuxiangdong, Zhu Lingshan, Gonglei (Arei),
	alvaro.karsz, Shannon Nelson, Laurent Vivier,
	Harpreet Singh Anand, Gautam Dawar, Stefano Garzarella,
	Cornelia Huck, Cindy Lu, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Stefan Hajnoczi, Parav Pandit

On Mon, Feb 13, 2023 at 11:37 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 2/13/2023 1:47 AM, Eugenio Perez Martin wrote:
> > On Sat, Feb 4, 2023 at 3:04 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
> >>>>> This allows net to restart the device backend to configure SVQ on it.
> >>>>>
> >>>>> Ideally, these changes should not be net specific. However, the vdpa net
> >>>>> backend is the one with enough knowledge to configure everything because
> >>>>> of some reasons:
> >>>>> * Queues might need to be shadowed or not depending on its kind (control
> >>>>>      vs data).
> >>>>> * Queues need to share the same map translations (iova tree).
> >>>>>
> >>>>> Because of that it is cleaner to restart the whole net backend and
> >>>>> configure again as expected, similar to how vhost-kernel moves between
> >>>>> userspace and passthrough.
> >>>>>
> >>>>> If more kinds of devices need dynamic switching to SVQ we can create a
> >>>>> callback struct like VhostOps and move most of the code there.
> >>>>> VhostOps cannot be reused since all vdpa backend share them, and to
> >>>>> personalize just for networking would be too heavy.
> >>>>>
> >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>> ---
> >>>>>     net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>     1 file changed, 84 insertions(+)
> >>>>>
> >>>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> >>>>> index 5d7ad6e4d7..f38532b1df 100644
> >>>>> --- a/net/vhost-vdpa.c
> >>>>> +++ b/net/vhost-vdpa.c
> >>>>> @@ -26,6 +26,8 @@
> >>>>>     #include <err.h>
> >>>>>     #include "standard-headers/linux/virtio_net.h"
> >>>>>     #include "monitor/monitor.h"
> >>>>> +#include "migration/migration.h"
> >>>>> +#include "migration/misc.h"
> >>>>>     #include "migration/blocker.h"
> >>>>>     #include "hw/virtio/vhost.h"
> >>>>>
> >>>>> @@ -33,6 +35,7 @@
> >>>>>     typedef struct VhostVDPAState {
> >>>>>         NetClientState nc;
> >>>>>         struct vhost_vdpa vhost_vdpa;
> >>>>> +    Notifier migration_state;
> >>>>>         Error *migration_blocker;
> >>>>>         VHostNetState *vhost_net;
> >>>>>
> >>>>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> >>>>>         return DO_UPCAST(VhostVDPAState, nc, nc0);
> >>>>>     }
> >>>>>
> >>>>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> >>>>> +{
> >>>>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
> >>>>> +    VirtIONet *n;
> >>>>> +    VirtIODevice *vdev;
> >>>>> +    int data_queue_pairs, cvq, r;
> >>>>> +    NetClientState *peer;
> >>>>> +
> >>>>> +    /* We are only called on the first data vqs and only if x-svq is not set */
> >>>>> +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
> >>>>> +        return;
> >>>>> +    }
> >>>>> +
> >>>>> +    vdev = v->dev->vdev;
> >>>>> +    n = VIRTIO_NET(vdev);
> >>>>> +    if (!n->vhost_started) {
> >>>>> +        return;
> >>>>> +    }
> >>>>> +
> >>>>> +    if (enable) {
> >>>>> +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
> >>>>> +    }
> >>>>> +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
> >>>>> +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
> >>>>> +                                  n->max_ncs - n->max_queue_pairs : 0;
> >>>>> +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
> >>>>> +
> >>>>> +    peer = s->nc.peer;
> >>>>> +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
> >>>>> +        VhostVDPAState *vdpa_state;
> >>>>> +        NetClientState *nc;
> >>>>> +
> >>>>> +        if (i < data_queue_pairs) {
> >>>>> +            nc = qemu_get_peer(peer, i);
> >>>>> +        } else {
> >>>>> +            nc = qemu_get_peer(peer, n->max_queue_pairs);
> >>>>> +        }
> >>>>> +
> >>>>> +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
> >>>>> +        vdpa_state->vhost_vdpa.shadow_data = enable;
> >>>>> +
> >>>>> +        if (i < data_queue_pairs) {
> >>>>> +            /* Do not override CVQ shadow_vqs_enabled */
> >>>>> +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
> >>>>> +        }
> >>>>> +    }
> >>>>> +
> >>>>> +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
> >>>> As the first revision, this method (vhost_net_stop followed by
> >>>> vhost_net_start) should be fine for software vhost-vdpa backend for e.g.
> >>>> vp_vdpa and vdpa_sim_net. However, I would like to get your attention
> >>>> that this method implies substantial blackout time for mode switching on
> >>>> real hardware - get a full cycle of device reset of getting memory
> >>>> mappings torn down, unpin & repin same set of pages, and set up new
> >>>> mapping would take very significant amount of time, especially for a
> >>>> large VM. Maybe we can do:
> >>>>
> >>> Right, I think this is something that deserves optimization in the future.
> >>>
> >>> Note that we must replace the mappings anyway, with all passthrough
> >>> queues stopped.
> >> Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq
> >> keep mapping to the same GPA where passthrough data virtqueues were
> >> associated with across switch (so that the mode switch is transparent to
> >> the guest)?
> > I don't get this question, SVQ switching is already transparent to the guest.
> Never mind, you seem to have answered the question in the reply here and
> below. I was thinking of possibility to do incremental in-place update
> for a given IOVA range with one single call (for the on-chip IOMMU
> case), instead of separate unmap() and map() calls. Things like
> .set_map_replace(vdpa, asid, iova_start, size, iotlb_new_maps) as I ever
> mentioned.
>
> >
> >> For platform IOMMU the mapping and remapping cost is
> >> inevitable, though I wonder for the on-chip IOMMU case could it take
> >> some fast path to just replace IOVA in place without destroying and
> >> setting up all mapping entries, if the same GPA is going to be used for
> >> the data rings (copy Eli for his input).
> >>
> >>>    This is because SVQ vrings are not in the guest space.
> >>> The pin can be skipped though, I think that's a low hand fruit here.
> >> Yes, that's right. For a large VM pining overhead usually overweighs the
> >> mapping cost. It would be a great amount of time saving if pin can be
> >> skipped.
> >>
> > That is doable using dma_map/unmap apis instead of set_map (or
> > comparing in set_map) and allocation GPA translations in advance.
> Is there a way for a driver to use both dma_map()/unmap() and set_map()
> APIs at the same time? Seems not possible for the moment. And batching
> is currently unsupported on dma_map()/unmap().
>

I meant not ignoring the batch calls, yes.

> Not sure how mapping could be decoupled from pinning as the current uAPI
> (VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE) have both, i.e. it's not
> easy to tear them apart.

If we add a reverse tree, I'd say it should be possible to transverse
the new and the old IOVA -> iotlb tree and only map / unmap the
differences. All the guest memory will stay pinned this way, only SVQ
will be pinned and unpinned.

I'm not sure if this is cheap or comparable to the pin / unpin
operations but maybe we can even build that tree at set_map? Does the
pin operation get cheaper when using hugepages and similar?

> If we agree pinning is not needed, perhaps we
> could add a new uAPI to just remap the IOVA ranges for data VQ
> addresses, and get around any code path involving page pinning. Under
> the hood at the driver API level, in case of general platform IOMMU,
> iommu_unmap() and iommu_map() can be used; in case of on-chip IOMMU,
> vdpa kernel would just call the new driver API .set_map_replace() to
> update the relevant IOVA mappings in place, without having to rebuild
> the entire iova tree.
>

That's a more efficient way to do it for sure, although it requires
additions to uAPI.

> >
> >>> If any, we can track guest's IOVA and add SVQ vrings in a hole. If
> >>> guest's IOVA layout changes, we can translate it then to a new
> >>> location. That way we only need one map operation in the worst case.
> >>> I'm omitting the lookup time here, but I still should be worth it.
> >>>
> >>> But as you mention I think it is not worth complicating this series
> >>> and we can think about it on top.
> >> Yes, agreed. I'll just let you aware of the need of this optimization
> >> for real hardware device.
> >>
> >>>    We can start building it on top of
> >>> your suggestions for sure.
> >>>
> >>>> 1) replace reset with the RESUME feature that was just added to the
> >>>> vhost-vdpa ioctls in kernel
> >>> We cannot change vring addresses just with a SUSPEND / RESUME.
> >> I wonder if we can make SUSPEND (via some flag or new backend feature is
> >> fine) accept updating internal state like the vring addresses, while
> >> defer applying it to the device until RESUME? That way we don't lose a
> >> lot of other states that otherwise would need to re-instantiate at large
> >> with _F_RING_RESET or device reset.
> >>
> > If that helps, that can be done for sure.
> >
> > As another idea, we could do the reverse and allow _F_RING_RESET to
> > not to forget the parameters unless they're explicitly overriden.
> Hmmm, this might need spec extension as that's not the current
> expectation for _F_RING_RESET so far as I understand. Once ring is
> reset, all parameters associated with the ring are forgotten.
>
> > I think I prefer your idea in  SUSPEND / RESUME cycle, but just wanted
> > to put that possibility on the table if that makes more sense.
> Yea may be via a new per-vq suspend feature: _F_RING_STOP.
>
> >
> >>> We could do it with the VIRTIO_F_RING_RESET feature though. Would it
> >>> be advantageous to the device?
> >>>
> >>>> 2) add new vdpa ioctls to allow iova range rebound to new virtual
> >>>> address for QEMU's shadow vq or back to device's vq
> >>> Actually, if the device supports ASID we can allocate ASID 1 to that
> >>> purpose. At this moment only CVQ vrings and control buffers are there
> >>> when the device is passthrough.
> >> Yep, we can get SVQ mapping pre-cooked in another ASID before dismantle
> >> the mapping for the passthrough VQ. This will help the general IOMMU case.
> >>
> >>> But this doesn't solve the problem if we need to send all SVQ
> >>> translation to the device on-chip IOMMU, doesn't it? We must clear all
> >>> of it and send the new one to the device anyway.
> >>>
> >>>> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode
> >>>> on the fly instead of getting through the whole reset+restart cycle
> >>>>
> >>> I think this is the same as 1, isn't it?
> >> I mean do all three together: 1,2 in kernel and 3 in QEMU.
> >>
> > Ok I missed that in my first read, thanks!
> >
> > But I feel 2 should be easier to do in qemu.
> >
> > I don't really know how this helps in the general IOMMU case, I'm
> > assuming the IOMMU does not support PASID or similar tricks. Is that
> > because of the vhost_iotlb population or is there anything else I'm
> > missing?
> A new uAPI (more precisely, iotlb message) is needed to get around of
> page pinning at least. Or if not specifically tied to onchip IOMMU, we
> can make it two separate uAPIs for UNMAP and MAP, respectively.
>

I'd say the right call is just a "replace", or we will just replicate
map / unmap, isn't it?

> >
> >>>> I suspect the same idea could even be used to address high live
> >>>> migration downtime seen on hardware vdpa device. What do you think?
> >>>>
> >>> I think this is a great start for sure! Some questions:
> >>> a) Is the time on reprogramming on-chip IOMMU comparable to program
> >>> regular IOMMU?
> >> I would think this largely depends on the hardware implementation of
> >> on-chip IOMMU, the performance characteristics of which is very device
> >> specific. Some times driver software implementation and API for on-chip
> >> MMU also matters. Which would require vendor specific work to optimize
> >> based on the specific use case.
> >>
> > Got it.
> >
> >>>    If it is the case it should be easier to find vdpa
> >>> devices with support for _F_RESET soon.
> >>> b) Not to merge on master, but it is possible to add an artificial
> >>> delay on vdpa_sim that simulates the properties of the delay of IOMMU?
> >>> In that line, have you observed if it is linear with the size of the
> >>> memory, with the number of maps, other factors..?
> >> As I said this is very device specific and hard to quantify, but I agree
> >> it's a good idea to simulate the delay and measure the effect. For the
> >> on-chip MMU device I'm looking, large proportion of the time was spent
> >> on software side in allocating a range of memory for hosting mapping
> >> entries (don't know how to quantify this part but the allocation time is
> >> not a constant nor linear to the size of memory), walking all iotlb
> >> entries passed down from vdpa layer and building corresponding memory
> >> key objects for a range of pages. For each iotlb entry the time to build
> >> memory mapping looks grow linearly with the size of memory. Not sure if
> >> there's room to improve, I'll let the owner to clarify.
> >>
> > So I think all of these are great ideas.
> >
> > If we state the pin & unpin huts latency in the switching I think the
> > easiest way to start is:
> > * To start with qemu and send all the map / unmap in a batch
> By map / unmap, you are referring to the uAPIs (VHOST_IOTLB_UPDATE and
> VHOST_IOTLB_INVALIDATE), not the driver level .dma_map/unmap() kernel
> APIs, right? yes it's always good to commit all map / unmap transactions
> at once in a batch.
>

Right, sorry for not being specific enough.

> > * Avoid the pin / unpin in the kernel using a smarter algorithm for
> > that, not unpinning regions that it is going to pin again.
> This seems to change the uAPI behavior underneath. Maybe cleaner to get
> it done with new uAPI.
>

I think there is no visible change from userspace, or do we expect an
effective unpin + pin for some reason?

Thanks!

> Regards,
> -Siwei
>
> >
> > What do you think?
> >
> > Thanks!
> >
> >> Thanks,
> >> -Siwei
> >>
> >>
> >>
> >>
> >>
> >>> Thanks!
> >>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>> +    if (unlikely(r < 0)) {
> >>>>> +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
> >>>>> +    }
> >>>>> +}
> >>>>> +
> >>>>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
> >>>>> +{
> >>>>> +    MigrationState *migration = data;
> >>>>> +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
> >>>>> +                                     migration_state);
> >>>>> +
> >>>>> +    switch (migration->state) {
> >>>>> +    case MIGRATION_STATUS_SETUP:
> >>>>> +        vhost_vdpa_net_log_global_enable(s, true);
> >>>>> +        return;
> >>>>> +
> >>>>> +    case MIGRATION_STATUS_CANCELLING:
> >>>>> +    case MIGRATION_STATUS_CANCELLED:
> >>>>> +    case MIGRATION_STATUS_FAILED:
> >>>>> +        vhost_vdpa_net_log_global_enable(s, false);
> >>>>> +        return;
> >>>>> +    };
> >>>>> +}
> >>>>> +
> >>>>>     static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> >>>>>     {
> >>>>>         struct vhost_vdpa *v = &s->vhost_vdpa;
> >>>>>
> >>>>> +    if (v->feature_log) {
> >>>>> +        add_migration_state_change_notifier(&s->migration_state);
> >>>>> +    }
> >>>>> +
> >>>>>         if (v->shadow_vqs_enabled) {
> >>>>>             v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> >>>>>                                                v->iova_range.last);
> >>>>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
> >>>>>
> >>>>>         assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
> >>>>>
> >>>>> +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
> >>>>> +        remove_migration_state_change_notifier(&s->migration_state);
> >>>>> +    }
> >>>>> +
> >>>>>         dev = s->vhost_vdpa.dev;
> >>>>>         if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> >>>>>             g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> >>>>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> >>>>>         s->vhost_vdpa.device_fd = vdpa_device_fd;
> >>>>>         s->vhost_vdpa.index = queue_pair_index;
> >>>>>         s->always_svq = svq;
> >>>>> +    s->migration_state.notify = vdpa_net_migration_state_notifier;
> >>>>>         s->vhost_vdpa.shadow_vqs_enabled = svq;
> >>>>>         s->vhost_vdpa.iova_range = iova_range;
> >>>>>         s->vhost_vdpa.shadow_data = svq;
>



^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2023-02-14 18:52 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
2023-01-12 17:24 ` [RFC v2 01/13] vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check Eugenio Pérez
2023-01-13  3:12   ` Jason Wang
2023-01-13  6:42     ` Eugenio Perez Martin
2023-01-16  3:01       ` Jason Wang
2023-01-12 17:24 ` [RFC v2 02/13] vdpa net: move iova tree creation from init to start Eugenio Pérez
2023-01-13  3:53   ` Jason Wang
2023-01-13  7:28     ` Eugenio Perez Martin
2023-01-16  3:05       ` Jason Wang
2023-01-16  9:14         ` Eugenio Perez Martin
2023-01-17  4:30           ` Jason Wang
2023-01-12 17:24 ` [RFC v2 03/13] vdpa: copy cvq shadow_data from data vqs, not from x-svq Eugenio Pérez
2023-01-12 17:24 ` [RFC v2 04/13] vdpa: rewind at get_base, not set_base Eugenio Pérez
2023-01-13  4:09   ` Jason Wang
2023-01-13  7:40     ` Eugenio Perez Martin
2023-01-16  3:32       ` Jason Wang
2023-01-16  9:53         ` Eugenio Perez Martin
2023-01-17  4:38           ` Jason Wang
2023-01-17  6:57             ` Eugenio Perez Martin
2023-01-12 17:24 ` [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq Eugenio Pérez
2023-01-13  4:24   ` Jason Wang
2023-01-13  7:46     ` Eugenio Perez Martin
2023-01-16  3:34       ` Jason Wang
2023-01-16  5:23         ` Michael S. Tsirkin
2023-01-16  9:33           ` Eugenio Perez Martin
2023-01-17  5:42             ` Jason Wang
2023-01-12 17:24 ` [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK Eugenio Pérez
2023-01-13  4:36   ` Jason Wang
2023-01-13  8:19     ` Eugenio Perez Martin
2023-01-13  9:51       ` Stefano Garzarella
2023-01-13 10:03         ` Eugenio Perez Martin
2023-01-13 10:37           ` Stefano Garzarella
2023-01-17 15:15           ` Maxime Coquelin
2023-01-16  6:36       ` Jason Wang
2023-01-16 16:16         ` Eugenio Perez Martin
2023-01-17  5:36           ` Jason Wang
2023-01-12 17:24 ` [RFC v2 07/13] vdpa: " Eugenio Pérez
2023-01-12 17:24 ` [RFC v2 08/13] vdpa: Negotiate _F_SUSPEND feature Eugenio Pérez
2023-01-13  4:39   ` Jason Wang
2023-01-13  8:45     ` Eugenio Perez Martin
2023-01-16  6:48       ` Jason Wang
2023-01-16 16:17         ` Eugenio Perez Martin
2023-01-12 17:24 ` [RFC v2 09/13] vdpa: add feature_log parameter to vhost_vdpa Eugenio Pérez
2023-01-12 17:24 ` [RFC v2 10/13] vdpa net: allow VHOST_F_LOG_ALL Eugenio Pérez
2023-01-13  4:42   ` Jason Wang
2023-01-12 17:24 ` [RFC v2 11/13] vdpa: add vdpa net migration state notifier Eugenio Pérez
2023-01-13  4:54   ` Jason Wang
2023-01-13  9:00     ` Eugenio Perez Martin
2023-01-16  6:51       ` Jason Wang
2023-01-16 15:21         ` Eugenio Perez Martin
2023-01-17  9:58       ` Dr. David Alan Gilbert
2023-01-17 10:23         ` Eugenio Perez Martin
2023-01-17 12:54           ` Dr. David Alan Gilbert
2023-02-02  1:52   ` Si-Wei Liu
2023-02-02 15:28     ` Eugenio Perez Martin
2023-02-04  2:03       ` Si-Wei Liu
2023-02-13  9:47         ` Eugenio Perez Martin
2023-02-13 22:36           ` Si-Wei Liu
2023-02-14 18:51             ` Eugenio Perez Martin
2023-02-12 14:31     ` Eli Cohen
2023-01-12 17:24 ` [RFC v2 12/13] vdpa: preemptive kick at enable Eugenio Pérez
2023-01-13  2:31   ` Jason Wang
2023-01-13  3:25     ` Zhu, Lingshan
2023-01-13  3:39       ` Jason Wang
2023-01-13  9:06         ` Eugenio Perez Martin
2023-01-16  7:02           ` Jason Wang
2023-02-02 16:55             ` Eugenio Perez Martin
2023-02-02  0:56           ` Si-Wei Liu
2023-02-02 16:53             ` Eugenio Perez Martin
2023-02-04 11:04               ` Si-Wei Liu
2023-02-05 10:00                 ` Michael S. Tsirkin
2023-02-06  5:08                   ` Si-Wei Liu
2023-01-12 17:24 ` [RFC v2 13/13] vdpa: Conditionally expose _F_LOG in vhost_net devices Eugenio Pérez
2023-02-02  1:00 ` [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Si-Wei Liu
2023-02-02 11:27   ` Eugenio Perez Martin
2023-02-03  5:08     ` Si-Wei Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.